I'm working with the latest Intel Fortran compiler and a legacy Fortran program written as a single-threaded application, and I have a trial copy of VTune. Not surprisingly, VTune showed that only one or two of the 8 logical CPUs (on a 4-core i7 machine) are being used. After I turned on the compiler's parallelization feature and dropped the parallelization threshold to 25, VTune showed much better utilization of the cores (the average went from 2.35 to 6.39).
However, the analysis also shows a great deal of time spent in kmp_fork_call and NtDelayExecution, neither of which is called explicitly by the program. I haven't been able to find out much about what these are, why they're being called, or what's calling them. What I do know is that execution time has increased by about 50%. Setting the parallelization threshold to anything other than 100 results in a performance hit, and setting it to 100 gives the same results as turning parallelization off.
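A small test case might help isolate the effect outside the full program. The following is a minimal sketch only: it is a made-up example, not code from the actual program, and the names (par_overhead, n_outer, n_inner) and loop sizes are arbitrary assumptions. It repeatedly enters a short inner loop with very little work per entry. Whether the compiler actually threads this loop at a given threshold is not guaranteed and would need to be confirmed from the compiler's own diagnostics.

```fortran
! Hypothetical test case, not taken from the real program.
! Idea: build it once without auto-parallelization and once with it
! (plus a lowered threshold), then compare the elapsed times.
program par_overhead
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_outer = 200000   ! how many times the short loop is entered (arbitrary)
  integer, parameter :: n_inner = 64       ! small trip count, so little work per entry (arbitrary)
  real(dp) :: a(n_inner), b(n_inner), total
  integer :: i, j
  integer(8) :: t0, t1, rate

  a = 0.0_dp
  b = 1.0_dp
  total = 0.0_dp

  call system_clock(t0, rate)
  do j = 1, n_outer
     ! The iterations of this inner loop are independent of each other,
     ! so it is a candidate for threading. The outer loop is not,
     ! because each pass updates the same elements of a.
     do i = 1, n_inner
        a(i) = a(i) + b(i) * real(j, dp)
     end do
     total = total + a(n_inner)
  end do
  call system_clock(t1)

  ! Printing a checksum keeps the optimizer from discarding the loops.
  print *, 'checksum    =', total
  print *, 'elapsed (s) =', real(t1 - t0, dp) / real(rate, dp)
end program par_overhead
```

On Windows, I believe the relevant Intel Fortran options are /Qparallel and /Qpar-threshold:25, so the comparison would be one build with those options and one without.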
Can I assume that this means there's no way to take advantage of the multiple processors except by reorganizing the program code for multithreaded operation, which isn't practical? It's evident that the compiler's attempt at identifying and implementing multiple threads is doing more harm than good.
Please let me know if this would be more appropriately posted in the VTune, Fortran compiler, or some other sub-forum.