One of my performance focus areas for this year is vectorization. I am excited to start creating more content and spreading the message about this technology, as it has been a little bit underappreciated in the past. So to kick things off, I am going to launch a blog series and a 1-hour overview webinar.
First, information about the webinar.
Update: The Feb 15th session has been closed due to strong demand. We are now hosting a second session on Feb 22nd at 9AM. Register here: https://www1.gotomeeting.com/register/662285889
I will be hosting this with my colleague Wendy Doerner on Feb 22nd at 9AM PST. We will cover how to get started with vectorization, including examples and resources. To register or view the abstract, use the link above. If you attend the event live, you will also have the opportunity to request a follow-up from one of our vectorization experts!
For the blog series I will answer 3 questions:
What is Vectorization?
Who Can Use It?
What Are the Benefits?
Today I will start with the first question: What is Vectorization?
Vectorization is a method for achieving parallelism inside a single processor core. Vectorizing is done by using special instructions called SIMD (Single Instruction, Multiple Data) operations. SIMD instructions, and the hardware that goes along with them, have been present in Intel processors for over a decade. (Remember those commercials in the mid-90s with people dancing in bunny suits promoting MMX™ Technology? MMX was a set of SIMD instructions.) As the name suggests, SIMD instructions operate on several pieces of data in parallel with a single instruction.
In the typical (non-vectorized) case, when you add together 2 variables, they will each be stored in their own CPU register. If you perform an operation on them, such as addition, the 2 register quantities are added and the result stored back into a register. Using a SIMD instruction, you can fill a register with multiple variables to be added, which is called "packing" the register. With the most recent SIMD instruction set, Intel® Advanced Vector Extensions (Intel® AVX), available on Intel® Microarchitecture Codename Sandy Bridge processors, you can pack up to 8 data elements into one register. The number of elements allowed depends on the size of the element: Intel® AVX registers are 256 bits wide, so each can hold 8 32-bit single-precision floats or 4 64-bit double-precision floats. These data elements can all be combined with another packed register full of elements, allowing you to perform multiple operations on multiple pieces of data at once. For instance, adding 2 packed SIMD registers would produce multiple results, which would be stored into a packed register as well. Being able to do these operations at once rather than one right after the other can result in significant performance gains for the right type of code.
And addition is not the only operation possible on a packed register! Each set of SIMD instructions includes many different operations, with more being added in upcoming processor generations.
But that takes us to the next topic, Who Can Use It?, which we'll cover in the next blog. Feel free to ask questions in the comments of this blog series, too; I might turn the questions into future entries. Thanks for reading!
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804