Hello all parallel coders! I need your insights...

I have a big loop that I have parallelized and I need to see if the way I have the data is the optimal:

integer, parameter :: num_elem=40, num_domains=10000 integer :: IPVT(num_elem,num_domains), info, i, counter, chunk double precision :: A(num_elem,num_elem,num_domains), Bx(num_elem,num_domains), & By(num_elem,num_domains), Tx(num_elem,num_domains), Ty(num_elem,num_domains), & TB(4,num_domains), Cx(num_elem), Cy(num_elem) double precision :: f(num_elem,num_domains),x(num_elem,num_domains) logical :: converged call initialize(Cy,Cx,counter,chunk,x,f,converged) do while(.not. converged) !$omp parallel do default(none) & !$omp private(i, info) & !$omp shared(num_domains,A,IPVT,Ty,Tx,Bx,By,Cy,Cx,TB)& !$omp reduction(+:counter) & !$omp schedule(dynamic,chunk) do i=1,num_domains call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx) call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i)) counter=counter+1 enddo call check_convergence(x,f,converged) enddo

-In this loop many domains are calculated by each thread.

-Inside calculate_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(out) while Cy,Cx and x are intent(in).

-Inside solve_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(in) while x is intent(inout).

-The schedule is dynamic because the work inside each subdomain is not the same.

I get some speedup but not the expected. This part of the code sums for 80% of the total program. I would expect a maximum theoretic speedup of 5x. I get 2.5-3x (I vary the number of threads from 2 to 48, on a 48-core machine. The max speedup is at 20-22 cores).

Thoughts I have:

1) Use thread private data. This will mean multiplying the memory used by the num_threads.

2) Use allocatable arrays to be sure each domain is sized as needed and not at the size of the biggest domain.

3) Use static schedule by manually dividing the domains

4) Put the data of each domain in a type structure and then have an array of structures. That is:

type domain_data double precission :: A(num_elem,num_elem), Bx(num_elem) ... end type domain_data type(domain_data), dimension(num_domains) :: all_domains_data

5) Any combination of the above.

Can you give me your thoughts? Someone more experienced maybe?

Thanks in advanced,

Petros