I am threading a large code that repeatedly calls a routine (in the present case, hundreds of times). Within that routine I use an !$omp parallel do construct to thread a double do loop; the directives surround the outer loop. I use a schedule(dynamic) clause to balance the work load, and I specify 4 threads. I could use some help interpreting VTune performance data. The time required by "_kmp_barrier" is by far the largest block of time. For example, VTune reports:
- _kmp_barrier 17.53
- _kmp_x86_pause 4.52
- my routine 4.05
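For reference, the loop structure I described is essentially the following (a minimal sketch, not the actual code; the routine name, array, and loop bounds are placeholders):

```fortran
subroutine my_routine(a, n, m)
  implicit none
  integer, intent(in) :: n, m
  real, intent(inout) :: a(n, m)
  integer :: i, j

  ! Threading directives surround the outer loop only;
  ! dynamic scheduling hands out outer iterations one at a
  ! time to whichever thread is free.
  !$omp parallel do schedule(dynamic) num_threads(4) private(i)
  do j = 1, m
     do i = 1, n
        a(i, j) = a(i, j) * 2.0   ! placeholder for the real work
     end do
  end do
  !$omp end parallel do
end subroutine my_routine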
VTune also reports a histogram of wall time versus the number of cores running simultaneously. As I might expect, the largest time is for 1 core running (the master thread). But the wall times for 2, 3, and 4 cores running simultaneously are not what I expected: '2' is 5x larger than '3', and '4' accounts for virtually no time. I would have thought the wall times for 2 and 3 cores running simultaneously would be small, and the wall time for 4 cores would be second largest, after '1'.
Does this mean the work load is wildly imbalanced?
Why is so much time taken up by the OpenMP barrier?