January 23, 2011 9:30 AM PST
The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good work-sharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, as is otherwise needed when programming with OpenMP directives. Both OpenMP and auto-parallelized applications gain performance from shared memory on multiprocessor IA-32 and Intel 64 systems.

The following options control auto-parallelization:

/Qparallel: Enables the auto-parallelizer.

/Qpar-threshold[:n]: Sets the threshold for auto-parallelization of loops. This option is useful for loops whose computation work volume cannot be determined at compile time. The threshold is usually relevant when the loop trip count is unknown at compile time. The compiler applies a heuristic that tries to balance the overhead of creating multiple threads against the amount of work available to be shared among the threads.

The value n is an integer from 0 through 100 that sets the threshold:

n = 0: Loops are always auto-parallelized, regardless of computation work volume.

n = 100: Loops are auto-parallelized only when the compiler analysis data predicts performance gains, that is, only if profitable parallel execution is almost certain.

n = 1 to 99: The value represents the percentage probability of a profitable speed-up. For example, n=50 directs the compiler to parallelize only if there is a 50% probability of the code speeding up when executed in parallel.

To be "100%" sure that a loop will benefit from parallelization, the compiler needs to know the iteration count at compile time. For a threshold of "99%" or lower, knowing the iteration count at compile time is not a requirement. This leads to a big difference in the number of loops parallelized at 99% compared to 100%. For many applications, 99% is the better setting, but applications with a lot of short loops may slow down at 99%.

The example int_sin.c does not auto-parallelize when compiled with /Qpar-threshold:100 using the command line below:

C:\>icl -c /Qparallel /Qpar-report3 /Qpar-threshold:100 int_sin.c
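The difference between a known and an unknown trip count can be tried with a small test file. The following is only a sketch: the file name par_threshold_sketch.c, the array size N and the function names are assumptions made for illustration, and it is not the int_sin.c referred to above. The first loop has a trip count the compiler can see; the second receives its trip count at run time, which is the case the threshold is mainly meant for. What the compiler actually reports for each loop depends on the compiler version and the other options used.

```c
/* par_threshold_sketch.c: hypothetical test file, not the article's int_sin.c */
#include <assert.h>

#define N 100000                 /* trip count visible at compile time */

static double a[N], b[N], c[N];

/* Fill the inputs with values that make the results easy to check. */
void init_arrays(void)
{
    int i;
    for (i = 0; i < N; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * (double)i;
    }
}

/* Loop 1: the trip count is the compile-time constant N. */
void add_fixed(void)
{
    int i;
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* Loop 2: the trip count n is only known at run time. */
void sub_runtime(int n)
{
    int i;
    assert(n >= 0 && n <= N);    /* stay inside the static arrays */
    for (i = 0; i < n; i++)
        c[i] = a[i] - b[i];
}
```

Compiling this file twice with the command line shown above, once with /Qpar-threshold:100 and once with /Qpar-threshold:99, and comparing the /Qpar-report3 output for the two loops is one way to observe the effect of the threshold.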
This article applies to: Intel® C++ Compiler for Linux* Knowledge Base, Intel® C++ Compiler for Mac OS X* Knowledge Base, Intel® C++ Compiler for Windows* Knowledge Base, Intel® Parallel Composer Knowledge Base
For more complete information about compiler optimizations, see our Optimization Notice.
Comments (5) 
March 31, 2009 8:21 PM PDT
Dny

Very nice article, which illustrates the effect of -par-threshold with -parallel. Can we get more articles on other icc flags?
March 31, 2009 9:26 PM PDT
Om Sachan (Intel)

Thanks for the feedback. We write these articles to clarify customer questions about Intel tools. Please let us know if you have a specific question about any Intel compiler option.
April 1, 2009 8:56 PM PDT
Dny

Hello Sir,

I tried an experiment using -parallel with the -par-threshold:[n] flag, but I observed behavior very different from what I expected. I used simple square matrix addition code for this:

    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            c[i][j] = a[i][j] + b[i][j];
        }
    }

As per my understanding, icc vectorizes the inner loop and parallelizes the outer loop. When I use a matrix size fixed at compile time (size=5000) and the command

    icc -parallel <cpp file name>

the compiler remarks say:

    MatAdd2.cpp(46): (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
    MatAdd2.cpp(47): (col. 3) remark: LOOP WAS VECTORIZED.
    MatAdd2.cpp(47): (col. 3) remark: LOOP WAS VECTORIZED.

This means the 'i' loop is parallelized and the 'j' loop is vectorized, which is the expected behavior. It gives me some improvement in runtime. But when I pass the matrix size on the command line and use the command

    icc -parallel -par-threshold:50 <cpp file name>

the compiler remarks say:

    MatAdd.cpp(47): (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.
    MatAdd.cpp(47): (col. 3) remark: LOOP WAS VECTORIZED.
    MatAdd.cpp(47): (col. 3) remark: LOOP WAS VECTORIZED.

This means it both vectorized and parallelized the same 'j' loop only. Can you please comment on this difference in behavior?

Thanking you,
Digambar
April 4, 2009 11:16 PM PDT
srimks

How can ICC v11.0 perform both auto-parallelization and auto-vectorization for the same line of code? The outer loop is probably a good candidate for auto-parallelization when "par-threshold:50" is set. It seems you might have compiled your code at the ICC optimization level "-O2", which by default tries to perform auto-vectorization; it may have found the inner loop not to be a good candidate for vectorization and vectorized only the outer loop instead. Somehow, this behaviour of the Intel C++ Compiler seems weird.

You can specify "#pragma novector" for the outer loop, forcing the compiler not to vectorize the OUTER LOOP, and use "pragma distribute point", "pragma vector always" or "pragma unroll" for the INNER LOOP. Could you let me know the integer value used for size?

~BR
Mukkaysh Srivastav
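The pragma placement suggested in the comment above could look roughly like the sketch below, applied to the matrix addition from the earlier comment. Several things here are assumptions made for illustration only: the helper names mat_alloc, mat_free and mat_add, the storage of each matrix as an array of row pointers, and the choice of "vector always" from the pragmas listed. The two pragmas are Intel compiler pragmas; other compilers typically ignore them, possibly with a warning. Whether they change what the auto-parallelizer and vectorizer report for this loop nest has not been verified here.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a zero-filled size x size matrix as an array of row pointers.
   Returns NULL on allocation failure. Hypothetical helper. */
double **mat_alloc(int size)
{
    int i;
    double **m = (double **)malloc((size_t)size * sizeof *m);
    if (m == NULL)
        return NULL;
    for (i = 0; i < size; i++) {
        m[i] = (double *)calloc((size_t)size, sizeof **m);
        if (m[i] == NULL) {          /* undo the rows allocated so far */
            while (i > 0)
                free(m[--i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Release a matrix obtained from mat_alloc. Hypothetical helper. */
void mat_free(double **m, int size)
{
    int i;
    if (m == NULL)
        return;
    for (i = 0; i < size; i++)
        free(m[i]);
    free(m);
}

/* c = a + b, with size only known at run time. */
void mat_add(double **c, double **a, double **b, int size)
{
    int i, j;
    assert(size >= 0);
#pragma novector                     /* ask the compiler not to vectorize the outer loop */
    for (i = 0; i < size; i++) {
#pragma vector always                /* ask the compiler to vectorize the inner loop */
        for (j = 0; j < size; j++)
            c[i][j] = a[i][j] + b[i][j];
    }
}
```

Compiling a file containing mat_add with the same options as in the experiment (icc -parallel -par-threshold:50) and comparing the remarks with and without the pragmas would show whether the reported loop assignment changes.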



Raghava Darisi (Intel)