Optimizing Big Data processing with Haswell 256-bit Integer SIMD instructions

Big Data requires processing huge amounts of data. Intel Advanced Vector Extensions 2 (AVX2) promotes most of the Intel AVX 128-bit integer SIMD instructions to 256 bits. Intel AVX brought 256-bit floating-point SIMD instructions, but it didn't include 256-bit integer SIMD instructions. Intel AVX2 allows you to operate on the 256-bit wide YMM registers with integer data types. In this post, I'll explain how developers can speed up Big Data processing with the new 256-bit integer SIMD instructions.

When you work on Big Data projects, you have to perform operations on huge amounts of data, and hence the promotion of the Intel AVX 128-bit integer SIMD instructions to 256 bits allows you to work with twice the amount of packed data per instruction. Whenever I talk about taking advantage of SIMD instructions, I hear developers complain that they don't want to complicate their lives with new intrinsics and with additional requirements to pack and unpack data. I'll provide a simple example that takes advantage of Cilk Plus array notation and the Intel C++ Compiler options to generate code that uses Intel AVX2 SIMD instructions, and I'll show the differences in the generated assembly code.

In this case, I'll use serial code and I won't take advantage of parallelism. However, notice that you can combine parallelism with vectorization to boost performance even further. I haven't added parallelism so that I can focus on the assembly code generated with the different compiler options and keep the example simple.
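As a rough illustration of what combining parallelism with vectorization could look like, here is a hedged sketch that splits the same element-wise abs operation across threads. It uses standard C++ threads rather than Cilk Plus keywords (my substitution, purely for illustration); each thread works on a contiguous chunk, which a vectorizing compiler can still turn into SIMD code.

```cpp
#include <cstdlib>
#include <thread>
#include <vector>

// Apply abs to a contiguous chunk; the compiler can vectorize this loop.
static void abs_range(const short* src, short* dst, int begin, int end)
{
    for (int i = begin; i < end; i++) {
        dst[i] = static_cast<short>(std::abs(src[i]));
    }
}

// Hypothetical sketch: split the array into one chunk per thread so that
// parallelism (threads) and vectorization (SIMD inside each chunk) combine.
void parallel_abs(const short* src, short* dst, int count, int threads)
{
    std::vector<std::thread> workers;
    int chunk = count / threads;
    for (int t = 0; t < threads; t++) {
        int begin = t * chunk;
        int end = (t == threads - 1) ? count : begin + chunk;
        workers.emplace_back(abs_range, src, dst, begin, end);
    }
    for (auto& w : workers) {
        w.join();
    }
}
```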

The following steps create a project that compiles with the Intel® C++ Compiler in Visual Studio 2013, using compiler options that don't perform optimizations and that generate a listing file with the assembler output. I'll take advantage of the integration of Intel Parallel Studio XE 2013 with Visual Studio 2013. I will also provide the command-line arguments for the Intel C++ Compiler in each build.

1. Use the Launch Intel® Parallel Studio XE 2013 with VS 2013 shortcut to launch the IDE.

2. Create a Windows console application.

3. Select Project | Intel® Composer XE 2013 SP1 | Use Intel® C++ Compiler.

4. Now, right click on the project name in Solution Explorer and select Properties.

5. Go to Optimization | Optimization and select Disabled (/Od).

6. Go to Optimization | Enable Intrinsic Functions and select No.

7. Go to Code Generation [Intel C++] | Intel Processor-Specific Optimization and select None.

8. Go to Output Files | Assembler Output and select Assembly, Machine Code and Source (/FAcs).

9. Go to Diagnostics [Intel C++] | Vectorizer Diagnostic Level and select Loops Successfully and Unsuccessfully Vectorized (2) (/Qvec-report2).

10. Go to Diagnostics [Intel C++] | Optimization Diagnostic Level and select Minimum (/Qopt-report:1).

11. Go to Command Line | Additional Options, enter /S and click OK. Make sure you use uppercase S.

12. Select Release in the active solution configuration dropdown.
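
The steps above add up to the following Intel C++ Compiler command-line options (mirroring how I'll list them for the later builds; the exact option spellings may vary between compiler versions):

```
/Od /Oi- /FAcs /S /Qvec-report2 /Qopt-report:1
```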

The following lines show the C++ code for a simple Windows console application that generates an array of 51,200 signed short int numbers (signednumbers) and then uses Cilk Plus array notation to fill another short int array (unsignednumbers) with the absolute value of each number in the signednumbers array.

#include <math.h>
#include <stdlib.h>
#include <iostream>

const int numbers = 51200;

using namespace std;

int main()
{
 short int* signednumbers = new short int[numbers];
 short int* unsignednumbers = new short int[numbers];

 for (auto i = 0; i < numbers; i++) {
  signednumbers[i] = -(i % 200);
 }

 unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

 cout << "Unsigned numbers:\n";
 cout << "First: " << unsignednumbers[0] << "\n";
 cout << "Last: " << unsignednumbers[numbers - 1] << "\n";

 cin.ignore();

 delete[] signednumbers;
 delete[] unsignednumbers;

 return 0;
}

Because I’m using Intel C++ Compiler, I can work with Cilk Plus array notation in my C++ code within Visual Studio 2013. With the appropriate compiler settings, the following line will use SIMD instructions to calculate the abs value for each number in the array and assign it to the destination array.

unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

You might use the following lines to achieve similar results. However, I want to demonstrate that Cilk Plus array notation reduces boilerplate code, is easy to read, and, with a few additional compiler settings, lets you take full advantage of vectorization on Haswell CPUs.

#pragma simd
for (auto j = 0; j < numbers; j++) {
    unsignednumbers[j] = abs(signednumbers[j]);
}
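For readers who don't have the Intel compiler at hand, the same element-wise operation can be expressed in standard C++ as well. The following is a sketch using std::transform (my substitution, not part of the original example); a vectorizing compiler can also turn this pattern into SIMD code, though without Cilk Plus array notation.

```cpp
#include <algorithm>
#include <cstdlib>

// Standard C++ equivalent of unsignednumbers[0:numbers] =
// abs(signednumbers[0:numbers]): apply abs element-wise over the array.
void abs_array(const short* src, short* dst, int count)
{
    std::transform(src, src + count, dst,
                   [](short v) { return static_cast<short>(std::abs(v)); });
}
```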

Build the solution and open the Release folder for the solution in Windows Explorer. You will find a file with the .cod extension. The compiler generated this code listing file with assembly, machine code and source because I’ve specified the /FAcs and /S options. Open this file and search for the following line:

;;;          unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

Below that line, until the next ;;; prefix, you will see the assembly generated from the Cilk Plus array notation that assigns the absolute values of signednumbers to unsignednumbers.

;;;  unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

        mov       eax, DWORD PTR [?numbers@@4HB]
$LN46:
        mov       DWORD PTR [-52+ebp], eax      
$LN47:
        mov       DWORD PTR [-48+ebp], 0        
$LN48:
        mov       eax, DWORD PTR [-48+ebp]      
$LN49:
        mov       edx, DWORD PTR [-52+ebp]      
$LN50:
        cmp       eax, edx                      
$LN51:
        jge       .B1.9         ; Prob 50%      
$LN52:
                                ; LOE ebx ebp esi edi esp
.B1.7:                          ; Preds .B1.8 .B1.6
$LN53:
        push      eax                           
$LN54:
        mov       eax, DWORD PTR [-48+ebp]      
$LN55:
        imul      eax, eax, 2                   
$LN56:
        add       eax, DWORD PTR [-68+ebp]      
$LN57:
        movsx     eax, WORD PTR [eax]           
$LN58:
        movsx     eax, ax                       
$LN59:
        mov       DWORD PTR [esp], eax          
$LN60:
        call      _abs                          
$LN61:
                                ; LOE eax ebx ebp esi edi esp
.B1.23:                         ; Preds .B1.7
$LN62:
        mov       DWORD PTR [-44+ebp], eax      
$LN63:
        add       esp, 4                        
$LN64:
                                ; LOE ebx ebp esi edi esp
.B1.8:                          ; Preds .B1.23
$LN65:
        mov       eax, DWORD PTR [-44+ebp]      
$LN66:
        mov       edx, DWORD PTR [-48+ebp]      
$LN67:
        imul      edx, edx, 2                   
$LN68:
        add       edx, DWORD PTR [-60+ebp]      
$LN69:
        mov       WORD PTR [edx], ax            
$LN70:
        mov       eax, 1                        
$LN71:
        add       eax, DWORD PTR [-48+ebp]      
$LN72:
        mov       DWORD PTR [-48+ebp], eax      
$LN73:
        mov       eax, DWORD PTR [-48+ebp]      
$LN74:
        mov       edx, DWORD PTR [-52+ebp]      
$LN75:
        cmp       eax, edx                      
$LN76:
        jl        .B1.7         ; Prob 50%      
$LN77:
                                ; LOE ebx ebp esi edi esp
.B1.9:                          ; Preds .B1.8 .B1.6
$LN78:

Don't worry: you don't need to understand these assembly lines, because you will just compare them with the lines generated in the next builds. In this build, I used the Cilk Plus array notation but I selected options that don't produce optimized code; therefore, the generated loop requires many instructions per number and doesn't use either AVX or AVX2 SIMD instructions to calculate the absolute values of many short ints at the same time.

Now, follow these steps to change the optimization options and take advantage of AVX:

1. Right click on the project name in Solution Explorer and select Properties.

2. Go to Optimization | Optimization and select Maximize Speed (/O2).

3. Go to Optimization | Enable Intrinsic Functions and select Yes (/Oi).

4. Go to Code Generation [Intel C++] | Intel Processor-Specific Optimization and select Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions (Intel(R) AVX) support (/QxAVX). Then, click OK.

The previous steps added the following command line options: /O2 /Oi /QxAVX.

Rebuild the solution. The build output will display a message indicating that the loop generated with the Cilk Plus array notation was vectorized. The line will be similar to the following line:

1>C:\Users\gaston\Documents\Visual Studio 2013\Projects\IntelAVX2\IntelAVX2\main.cpp(17): message : LOOP WAS VECTORIZED

Open the Release folder for the solution in Windows Explorer and open the code listing file (.cod extension). Search for the following line as you did with the previous build:

;;;          unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

Below that line, until the next ;;; prefix, you will see the assembly generated from the Cilk Plus array notation that assigns the absolute values of signednumbers to unsignednumbers, this time taking advantage of AVX SIMD instructions.

;;;  unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

        vpabsw    xmm0, XMMWORD PTR [esi+eax*2]
$LN53:
        vpabsw    xmm1, XMMWORD PTR [16+esi+eax*2]    
$LN54:
        vpabsw    xmm2, XMMWORD PTR [32+esi+eax*2]    
$LN55:
        vpabsw    xmm3, XMMWORD PTR [48+esi+eax*2]    
$LN56:
        vmovdqu   XMMWORD PTR [edi+eax*2], xmm0
$LN57:
        vmovdqu   XMMWORD PTR [16+edi+eax*2], xmm1    
$LN58:
        vmovdqu   XMMWORD PTR [32+edi+eax*2], xmm2    
$LN59:
        vmovdqu   XMMWORD PTR [48+edi+eax*2], xmm3    
$LN60:
        add       eax, 32                      
$LN61:
        cmp       eax, 51200                   
$LN62:
        jb        .B1.6         ; Prob 90%     
$LN63:
                                ; LOE eax esi edi
.B1.7:                          ; Preds .B1.6
$LN64:

In the previous lines, you will notice the usage of two AVX instructions: vpabsw and vmovdqu. The vpabsw instruction computes the absolute values of the packed signed integer elements of a vector. The vmovdqu instruction moves values between an integer vector register and an unaligned memory location. Here, these instructions use the 128-bit xmm0, xmm1, xmm2, and xmm3 registers, so each vpabsw computes the absolute values of 8 packed 16-bit integers (8 × 16 = 128 bits).
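To picture what a single 128-bit vpabsw does, here is a hedged scalar model of its semantics (an illustration, not how the hardware is implemented): 8 lanes of 16-bit signed integers, each replaced by its absolute value.

```cpp
#include <array>
#include <cstdint>

// Scalar model of one 128-bit vpabsw: 128 / 16 = 8 lanes of 16-bit
// signed integers, each lane replaced by its absolute value.
std::array<int16_t, 8> vpabsw_128(const std::array<int16_t, 8>& v)
{
    std::array<int16_t, 8> out{};
    for (int lane = 0; lane < 8; lane++) {
        out[lane] = static_cast<int16_t>(v[lane] < 0 ? -v[lane] : v[lane]);
    }
    return out;
}
```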

As you might guess, with a simple change in the compiler options, I can enhance the resulting vectorized loop to use the Haswell AVX2 instructions that work with 256-bit registers for integer operations. You just need to follow these steps:

1. Right click on the project name in Solution Explorer and select Properties.

2. Go to Code Generation [Intel C++] | Intel Processor-Specific Optimization and select Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) (/QxCORE-AVX2). Then, click OK.

The previous steps replaced the command line option /QxAVX with /QxCORE-AVX2.

Rebuild the solution. As in the previous build, the new build output will display a message indicating that the loop generated with the Cilk Plus array notation was vectorized.

Open the Release folder for the solution in Windows Explorer and open the code listing file (.cod extension). Search for the following line as you did with the previous builds:

;;;          unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

Below that line, until the next ;;; prefix, you will see the assembly generated from the Cilk Plus array notation that assigns the absolute values of signednumbers to unsignednumbers, this time taking advantage of AVX2 SIMD instructions.

;;;  unsignednumbers[0:numbers] = abs(signednumbers[0:numbers]);

        vpabsw    ymm0, YMMWORD PTR [esi+eax*2]     
$LN145:
        vpabsw    ymm1, YMMWORD PTR [32+esi+eax*2]  
$LN146:
        vpabsw    ymm2, YMMWORD PTR [64+esi+eax*2]  
$LN147:
        vpabsw    ymm3, YMMWORD PTR [96+esi+eax*2]  
$LN148:
        vmovdqu   YMMWORD PTR [edi+eax*2], ymm0     
$LN149:
        vmovdqu   YMMWORD PTR [32+edi+eax*2], ymm1  
$LN150:
        vmovdqu   YMMWORD PTR [64+edi+eax*2], ymm2  
$LN151:
        vmovdqu   YMMWORD PTR [96+edi+eax*2], ymm3  
$LN152:
        add       eax, 64                           
$LN153:
        cmp       eax, 51200                        
$LN154:
        jb        .B1.16        ; Prob 90%          
$LN155:
                                ; LOE eax esi edi
.B1.17:                         ; Preds .B1.16
$LN156:

In the previous lines, you will notice the same instructions that appeared in the AVX-optimized assembler output: vpabsw and vmovdqu. However, in this case, both vpabsw and vmovdqu use the 256-bit ymm0, ymm1, ymm2, and ymm3 registers. The code has been optimized to use the AVX2 instructions promoted to work with 256 bits instead of 128 bits, so each vpabsw computes the absolute values of 16 packed 16-bit integers (16 × 16 = 256 bits). Each vpabsw or vmovdqu works with twice as much data as in the AVX version. If you compare the code that uses the 128-bit AVX instructions with the code that uses the 256-bit AVX2 instructions, you will easily see the difference in the amount of data processed in each loop cycle. You don't need to be an assembler expert to understand it: with the appropriate Intel C++ Compiler options, and targeting Haswell CPUs, you can process twice the amount of data per SIMD instruction.
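The `add eax, 64` in the AVX2 listing can be checked with a bit of arithmetic: the compiler unrolled the loop four times, and each 256-bit vpabsw covers 16 packed shorts, so the index advances by 64 shorts per iteration.

```cpp
// Per-iteration accounting for the unrolled AVX2 loop:
// each 256-bit vpabsw covers 256 / 16 = 16 packed shorts, and the
// compiler unrolled the loop 4 times, so the index advances by
// 64 shorts per iteration -- matching the "add eax, 64" instruction.
constexpr int register_bits = 256;
constexpr int element_bits = 16;
constexpr int lanes_per_op = register_bits / element_bits;          // 16
constexpr int unroll_factor = 4;
constexpr int shorts_per_iteration = lanes_per_op * unroll_factor;  // 64
```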

I've specified different compiler options to generate the assembler code and make it easy to understand what's going on under the hood. There are dozens of additional ways of optimizing code that has to process huge amounts of data and of taking advantage of the new instructions included in Haswell CPUs. I've used this simple example to demonstrate how a single line of Cilk Plus code can generate a highly optimized serial loop. You can read more about Intel Parallel Studio XE and all its components here.

For more complete information about compiler optimizations, see our Optimization Notice.