Here are the questions and answers from the Future-Proof Your Application's Performance With Vectorization technical presentation held February 15, 2012.
Questions and Answers:
Q: Can you recommend books about the theory of vectorization?
A: Yes, please take a look at The Software Vectorization Handbook from Intel Press.
Q: Would you please give us a brief explanation of the difference between the "-O2" and "-O3" options?
A: "-O2" is the default level and enables many performance optimizations, including vectorization. "-O3" applies the same optimizations plus some additional, more aggressive ones that may be appropriate for applications doing floating-point operations or working with large datasets. For more information, see the Intel Compiler Quick-Reference Guide.
Q: Can I use vectorization if I am already using multi-threading and/or MPI in my application?
A: Yes. Vectorization is one type of parallelism that occurs within a single CPU core. It can be combined with other types of parallelism such as multi-threading or cluster-level (MPI) parallelism. An application can take advantage of 1, 2, or all 3 of these types.
Q: How does vectorization increase performance when the arrays have different looping indices and different dimensionalities? Your simple example used common dimensions and a one-to-one, onto mapping of the math operation.
A: Accessing an array using a pointer or complex subscript can inhibit vectorization. Accessing arrays with different dimensionalities may as well, depending on the memory access pattern and whether it is aligned and contiguous. The Intel Compiler vec-report and gap-report can help identify whether changes can be made to a loop to improve vectorization. For more information on using these reports, see the 6-step process. For more information on factors that influence vectorization, see A Guide to Vectorization With Intel® C++ Compilers, sections 4 and 5.
Q: Will vectorization provide further performance gains on multi-core/CPU hardware, or is it just meant for a single CPU?
A: Vectorization improves the performance of a loop executing on a single CPU; however, it can be used in multi-threaded code, where it improves the performance of each CPU where it can be applied.
Q: Can I get the slides?
A: We are not planning to post the slides for this presentation. The slides were designed to be presented rather than only read. However, the webinar can be viewed at the link above, and the resources and examples presented in the webinar are available in the Vectorization Toolkit.
Q: Good morning: Do the Intel compilers fully support MPICH2? Thank you for your answer.
A: You can use the Intel compilers to compile MPICH2, but we don't provide instructions on how to do this. There is much peer-to-peer support in our forums if you need help.
Q: Instead of R[1:N]=L[1:N]+M[1:N], we can use R=L+M. Does it have the vectorization aspect inside?
A: For Fortran you can specify an array assignment either way and the compiler can attempt vectorization.
Q: Any future plans to enhance Itanium to enable vectorization?
A: Unfortunately, we cannot comment on future processor plans.
Q: Intel MIC will have many cores (around 50). Will each core have such vectorization extensions/hardware, or will the vectorization be done across different cores?
A: Your code that uses the high-level extensions we covered today on Xeon will scale forward to the registers/instructions of the Intel® Many Integrated Core (Intel® MIC) architecture.
Q: What happened to the forall statement?
A: It is part of the Fortran standard and supported by our compiler. It does not have the same restrictions as DO CONCURRENT, so it may be more difficult for the compiler to vectorize due to possible loop dependencies.
Q: This way of writing is good! But you are not controlling the size of the arrays! So, after all, you can forget to check the length of both arrays, and you will only find out later from the compiler, not at the time you write the code:
r[0:n] = src1[0:n] + src2[0:n];
A: The checks can be done at runtime by the compiler.
Q: Does "Intel® Fortran Composer XE 2011 for Linux" already contain the Vectorization Toolkit?
A: No, the Toolkit was created as a dynamic, living document to track vectorization resources and is posted only on the web.
Q: Is "!DIR$ VECTOR" equivalent to "!DIR$ SIMD"?
A: "!DIR$ VECTOR" is a hint to the compiler to vectorize (the compiler can ignore this hint due to efficiency or dependency concerns), while "!DIR$ SIMD" tells the compiler vectorization is mandatory if it can be done.
Q: is there any potential conflict for MPI code to use vectorization?
Q: I have some questions regarding data parallelism. Which compiler options must be used for the native Fortran parallel operators? For example, we have the OpenMP directive:
!$OMP DO
DO I = 1, N
   A(I) = B(I) + C(I)
END DO
On the other hand, the same expression, A = B + C, can be presented as SIMD in Fortran 90/95. How can it be parallelized on SMP or multicore systems? Which options do we need?
A: The compiler will try to vectorize both occurrences if it can. If you want to override heuristic/dependency considerations by the compiler, you can use "!DIR$ SIMD".
Q: Is OS X Lion supported?
A: Yes, for Intel® Composer XE.
Q: Can the code path decision be put at the start of the program?
A: Yes. When you use the -ax or /Qax switch, the processor is determined at the start of your program, but the code paths branch only at the points where it would make a difference in the optimized code.
Q: When compiling an executable that is to be run on a wide variety of platforms, is it necessary to compile multiple executables -- with and without AVX, for example -- or is the support determined automatically at runtime?
A: No; a single compilation with the appropriate switches can target multiple platforms, with the right code path selected automatically at runtime. More details on CPU targeting here: /en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations
Q: Will the compiler vectorize by inlining functions, e.g., if I have a Fortran PURE function acting on scalars that are part of an array in a DO loop?
A: The compiler may inline functions. We would have to try the exact test code to see whether it inlines.
Q: Is OpenCL code well vectorized by Intel? Does OpenCL have any benefits over the Intel C++/Fortran compilers?
A: You can see the video posted here for more detail on vectorization of OpenCL by Intel: /en-us/articles/auto-vectorization-of-opencl-code-with-the-intel-opencl-sdk
Q: Is there support for sparsely populated vectors/matrices?
A: Not for auto-vectorization, but Cilk Plus array notation does have a gather/scatter syntax that can be used for this. The syntax is:
a[0:n] = b[array_of_indices[0:n]]
Where array_of_indices contains each index that should be gathered from.
Q: Is there support for multidimensional loops?
Q: Is there support for long double?
Q: Is there a roadmap to support also Java?
A: No, but we will take your feedback into consideration for future planning.
Q: Are vectorized loops operating on integers faster for 32-bit apps than for 64-bit ones? (packs more integers into a 128-bit register)
A: No, because ints are a standard size (32 bits) regardless of the target architecture in this case. Vectorized loops on ints would be faster than vectorized loops on __int64s, though, as a contrasting example.
Q: How does the selection of a floating-point model (/fp:fast, precise, or strict) impact the ability of the compiler to vectorize Array Notations and Elemental Functions?
A: Vectorization will be limited by the use of precise or strict floating point models as these models require a reproducible order of operations that vectorization cannot always provide.