I have been puzzled by the behavior of the threading. In my example below, if the parameters and data array sizes are small, I usually get CPU usage consistent with the number of threads I specified in nThread. When I stress tested with a much larger data structure size (10-20 GB of memory, which is what I eventually need), the CPU usage dropped dramatically to about 15-16% (on an 8-core computer). In this example, I have to allocate/deallocate arrays inside the loop because of the massive size of the arrays. Does this allocate/deallocate cause the problem? If so, why was it not obvious in the small case, but a problem in the larger case?
Any suggestion would be much appreciated.

    !$OMP PARALLEL PRIVATE(iLooper) Firstprivate(pSize) NUM_THREADS(nThread)
    !$OMP DO SCHEDULE(Dynamic)
    DO iLooper = 1, UniqCT1-1
       ALLOCATE(Ejd(iLooper)%unit(Noofarcs))
       ALLOCATE(Pred(iLooper)%unit(noofarcs))
       ALLOCATE(pathtmp(iLooper)%unit(maxnu_pa))
       CALL RETRIEVE_VEH_PATH(Arg_OriginSet(iLooper), &
                              Arg_DestSet(iLooper),   &
                              Arg_TimeSet(iLooper),   &
                              iLooper,1,pSize)
       DEALLOCATE(Ejd(iLooper)%unit)
       DEALLOCATE(Pred(iLooper)%unit)
    ENDDO
    !$OMP END DO
    !$OMP END PARALLEL

CalmagC
CPU not fully utilized under different memory usage situations