We're trying to move a stochastic Fortran program to OpenMP using XE 2013 on Windows 7 with Visual Studio. Basically, we want to run many copies of the program after the initial read-in of tables, while sharing a couple of the large (essentially read-only) tables between the threads. In the simplest configuration, two large do loops, with their subroutines and modules, are completely enclosed in a single parallel region; everything is firstprivate except for a couple of shared arrays. In this version, once the threads enter the parallel region, they never leave it. We've tried more complicated uses of OpenMP but, in all cases, we get only modest improvements, i.e., a number of threads running but very little improvement in total work accomplished per unit of wall time compared to only one thread.
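For concreteness, the simplest configuration looks roughly like the sketch below. This is not our actual code: the names (big_table, local_state), the table size, the loop bounds and the loop body are all hypothetical stand-ins, and the real program calls into subroutines and modules from inside the loops.

```fortran
program parallel_region_sketch
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_table = 100000   ! hypothetical table size
  integer, parameter :: n_outer = 100      ! hypothetical loop bounds
  integer, parameter :: n_inner = 100000
  real(dp), allocatable :: big_table(:)    ! large table, read-only after setup
  real(dp) :: local_state                  ! stand-in for per-thread working data
  real(dp) :: t_start, t_end
  integer :: i, j

  ! Serial read-in of tables (here just filled with values).
  allocate(big_table(n_table))
  do i = 1, n_table
     big_table(i) = real(i, dp)
  end do
  local_state = 0.0_dp

  t_start = omp_get_wtime()

  ! One parallel region around everything: each thread gets its own
  ! initialized copy of local_state and only reads the shared table.
  !$omp parallel shared(big_table) firstprivate(local_state) private(i, j)
  do i = 1, n_outer
     do j = 1, n_inner
        local_state = local_state + big_table(mod(i + j, n_table) + 1)
     end do
  end do
  print '(a,i0,a,es12.5)', 'thread ', omp_get_thread_num(), ' result ', local_state
  !$omp end parallel

  t_end = omp_get_wtime()
  print '(a,f0.3,a)', 'wall time: ', t_end - t_start, ' s'
end program parallel_region_sketch
```

Every thread in this sketch does a full copy of the work, so with ideal scaling the wall time would stay roughly constant as OMP_NUM_THREADS goes from 1 to 4. That is the comparison we mean when we talk about work accomplished per unit of wall time.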
Finally, we're getting substantial improvement (e.g., four threads' worth of work in only twice the wall time of one thread). HOWEVER, it only occurs if the code is run under the VTune Locks and Waits analysis (x64, in either debug or release mode). If, immediately after running in this mode, we run the same program (Ctrl+F5) without VTune, it reverts to very little gain (i.e., the same amount of work in the same amount of wall time, no matter how many threads are running).
Here's the question: What is VTune Locks and Waits doing that is speeding things up?
We judge success by the quantity and quality of the output, so nothing is being missed in the execution. Also, Locks and Waits is not completely consistent; sometimes it, too, runs slowly. Whichever mode (fast or slow) a run starts in, it continues in that mode indefinitely. This has been observed on both i7 and E5 machines. There may be an effect from the order of compiling and of running with and without Locks and Waits, but we have not been able to pin down any consistent behavior in that regard.
We're hoping that we can use whatever Locks and Waits has discovered to achieve the speedups we need before moving this to a MIC.