I posted this thread on the VTune forum without much success. This is driving us nuts, so I'm hoping for some help in this forum:
"We're trying to move a stochastic Fortran program to OpenMP with XE 2013 on Windows 7 using Visual Studio. Basically, we want to run many copies of the program after the initial read-in of tables, while sharing a couple of the large (basically read-only) tables between the threads. In the simplest configuration, two large do loops, with their subroutines and modules, are completely enclosed in a single parallel region in which everything is firstprivate except for a couple of shared arrays. In this version, once the threads enter the parallel region, they never leave it. We've tried more complicated uses of OpenMP but, in all cases, we get only modest improvements, i.e., a number of threads running but very little more work accomplished in a given wall time than with only one thread.
Finally, we're getting substantial improvement (e.g., four threads' worth of work in only twice the wall time of one thread). HOWEVER, it only occurs if the code is run under VTune Locks and Waits (x64, in either debug or release mode). If, immediately after running in this mode, we run the same program (Ctrl-F5) without VTune, it reverts to very little gain (i.e., the same amount of work in the same amount of wall time, no matter how many threads are running).
Here's the question: What is VTune Locks and Waits doing that is speeding things up?
We judge the success by the quantity and quality of the output, so nothing is being missed in the execution. Also, Locks and Waits is not completely consistent; sometimes it, too, runs slowly. Whichever mode (fast or slow) a run starts in, it stays in that mode indefinitely. This has been observed on both an i7 and an E5 machine. The order of compiling and of running with and without Locks and Waits may matter, but we have not been able to pin down any consistent behavior in that regard.
We're hoping we can use whatever Locks and Waits has discovered to achieve the speedups we need to move this to a MIC."
There was some chitchat about environment variables that didn't seem to resolve anything. These are my most recent results:
"I upgraded from the initial 2013 release to the second (119). The code now works correctly on a 3rd-gen i7 but, when I move the executable (either debug or release) without recompiling to an E5 or an older Xeon, it runs even slower than before. Recompiling on the E5 with the latest compiler and VTune gives the same results as before: very poor performance except, sporadically, when run with Locks and Waits. I don't see how we're going to have any confidence in moving this to some MICs until we can straighten this out. What ideas do you have? We're pretty much running out of ideas here."
I'm hoping someone in this forum can shed some light on this. The results on the i7 seem very consistent with what we expected when we started this conversion (eight threads producing eight times the data in only three times the wall time). However, we see almost no gain on the E5-2690 or on a slightly older Xeon. If we write short sample codes, everything behaves as expected, with very large gains in productivity with respect to wall time.
Here is a sample set of compiler options...we've tried a bunch of variations:
/nologo /assume:buffered_io /Qopenmp /Qopenmp-report1 /Qpar-report2 /Qvec-report2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /check:bounds /libs:dll /threads /Qmkl:parallel /c
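
The simplest configuration described above could be sketched roughly as follows. This is only an illustrative sketch, not the actual program: every name in it (big_table, work, seed_base, the loop bounds, the arithmetic in the loop body) is a hypothetical placeholder. It shows the shape only: a serial read-in, then a single parallel region in which each thread runs its own full copy of the two nested do loops, with the large table shared and everything else firstprivate or private.

```fortran
! Hypothetical sketch of the "simplest configuration"; all names and sizes
! are placeholders and do not come from the real program.
program replicate_sketch
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_table = 1000000, n_outer = 200, n_inner = 100000
  real(dp), allocatable :: big_table(:)   ! large, read-only after setup: shared
  real(dp) :: work(64)                    ! per-copy scratch state: firstprivate
  real(dp) :: total, t0, t1
  integer :: i, j, k, tid, seed_base

  ! Serial phase: stand-in for the initial read-in of tables.
  allocate(big_table(n_table))
  do i = 1, n_table
     big_table(i) = real(i, dp) / real(n_table, dp)
  end do
  work = 0.0_dp
  seed_base = 12345

  t0 = omp_get_wtime()

  ! One parallel region; the threads never leave it until the work is done.
  ! Each thread executes the complete loop nest (one "copy" of the program).
  !$omp parallel default(none) shared(big_table) &
  !$omp   firstprivate(work, seed_base) private(i, j, k, tid, total)
  tid = omp_get_thread_num()
  total = 0.0_dp
  do i = 1, n_outer
     do j = 1, n_inner
        ! Read from the shared table, write only to private data.
        k = mod(seed_base + tid + i * 7 + j, n_table) + 1
        total = total + big_table(k)
        work(mod(j, 64) + 1) = total
     end do
  end do
  ! Serialize only the output so lines from different threads do not interleave.
  !$omp critical
  print '(a,i0,a,es12.5)', 'thread ', tid, ' total = ', total
  !$omp end critical
  !$omp end parallel

  t1 = omp_get_wtime()
  print '(a,f8.3,a)', 'wall time: ', t1 - t0, ' s'
  deallocate(big_table)
end program replicate_sketch
```

Built with /Qopenmp as in the options above (or -fopenmp with gfortran), and run with different OMP_NUM_THREADS settings, a sketch like this should show the wall time staying roughly flat as threads are added while the total amount of work grows, which is the behavior we are looking for in the full code.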