This chapter covers vectorization, a form of data-parallel programming in which the processor performs the same operation simultaneously on N data elements of a vector (a one-dimensional array of scalar data objects such as integers or single and double precision floating-point values).
Skill and knowledge of vectorization is essential to gain performance on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), such as the Intel® Xeon Phi™ Coprocessor. In the ideal case, vectorization can speed an application up by as much as 8x (double precision) or 16x (single precision) on the Intel® Xeon Phi™ Coprocessor. Your application may not reach these potential speedups, but what should be clear is that code without vectorization will not run efficiently on the Intel® MIC Architecture.
As a first step, it is essential to understand vectorization: what it is, and how to use the compiler's vectorization report (-vec-report, or -qopt-report with Intel compiler version 15.0 and later) to determine where the compiler is able to vectorize your application. It is also important to know which sections of code the compiler cannot vectorize and why.
Since it is not realistic to expect the compiler to do all the vectorization work, you need to know the compiler pragmas and directives that assist the compiler with vectorization. One key technique for efficient vectorization is data alignment. This chapter describes how to control data alignment and how to help the compiler recognize aligned data.
C/C++ users should consider using the array notation provided in Intel® Cilk™ Plus to help the compiler recognize vectorizable expressions (this is similar to Fortran's array syntax). This chapter also discusses pointer aliasing and its effects on optimization and vectorization, as well as ways to tell the compiler that pointer arguments are not aliased.
Finally, this chapter presents novel approaches to outer loop vectorization.
Note that OpenMP 4.0 includes new directives to enable vectorization. The Intel Compiler supports the following OpenMP 4.0 features in the latest releases:
- OpenMP 4.0 Features in Intel Fortran Composer XE 2013
- OpenMP 4.0 Features in Intel C++ Composer XE 2013
- Updated Support for OpenMP* 4.0 Features Added in Composer XE 2013 SP1 (C++ and Fortran)
- OpenMP* 4.0 Features in Intel Compiler 15.0
The following subchapters provide more information on vectorization topics.
The following topics should be considered required reading:
The following topics present some optional techniques to take vectorization to the next level:
- Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations (for C/C++ Users)
- Tradeoffs between array-notation long-vector and short-vector coding (for C/C++ Users)
- Other common vectorization techniques
This chapter presented various vectorization methods and optimizations. You will not get good performance on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), such as the Intel® Xeon Phi™ coprocessor, without good vectorization. It is essential that you understand the following:
- The compiler option -qopt-report (-vec-report is deprecated in version 15.0) should be used to determine which portions of your application are vectorizing, which are not, and why.
- How critical data alignment is for vectorization, and how to force alignment of your data.
- Pointer aliasing, and how to tell the compiler that pointers are not aliasing the same data.
- Elemental functions and how they assist vectorization, and how Intel® Cilk™ Plus array notation can assist vectorization.
This chapter also covered optional techniques: outer loop optimizations, vectorized random number generation, and tradeoffs between long-vector and short-vector coding. Refer to the article Intel Vectorization Tools for additional resources.
It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ architecture. The paths provided in this guide reflect the steps necessary to get the best possible application performance.
The next chapter, Advanced MIC Optimizations, presents advanced optimizations to get the most performance out of the Intel® Many Integrated Core Architecture (Intel® MIC Architecture).