Write your first program with Haswell new instructions

It has been almost a year since Intel announced the Haswell new instructions. Even though silicon is not generally available yet, the tools are now ready. Time to discuss how to write your first program for Haswell.

Getting ready for Haswell on Windows

Microsoft has just released Visual Studio 2012 Release Candidate, which supports the new instructions that the Haswell architecture will provide. The Release Candidate is available free of charge on their website (http://www.microsoft.com/visualstudio/11/en-us). Once you have downloaded and installed it on your system, you are ready to go. In this tutorial, we will create a console application, so select

FILE / New / Project…

and enter a project name like “extract_bits”.




Getting ready for Haswell on Linux

On Linux, you will need at least GCC 4.7.0. If there is no installation package available for your Linux distribution, it is not too hard to build the compiler yourself. The detailed installation description is available at the GCC site, but essentially this is what you typically need to do:



    1. Download the archive from one of the GNU mirrors.

    2. Extract the archive with “tar xjvf gcc-4.7.0.tar.bz2” or “tar xzvf gcc-4.7.0.tar.gz”

    3. mkdir gcc-4.7.0-obj

    4. cd gcc-4.7.0-obj

    5. ../gcc-4.7.0/configure --program-suffix=-4.7.0

    6. make BOOT_CFLAGS='-O' bootstrap

    7. make install



However, the compiler will not be sufficient to create programs with the new instructions for Haswell. You also need an assembler that supports the new instructions. Otherwise, you will get errors like this:


# gcc-4.7.0 -mbmi2 extract_bits.cpp -o extract_bits
/tmp/cclqsyPG.s: Assembler messages:
/tmp/cclqsyPG.s:271: Error: no such instruction: `shlx %eax,%edx,%eax'
/tmp/cclqsyPG.s:304: Error: no such instruction: `pext %edx,%eax,%eax'

The assembler is part of the binutils package, so download and install binutils version 2.22 or later from http://ftp.gnu.org/gnu/binutils/. Again, installation should be straightforward:


    1. tar xjvf binutils-2.22.tar.bz2

    2. cd binutils-2.22/

    3. ./configure

    4. make

    5. make install



Bit Manipulation


For this short tutorial, we will use one of the new bit manipulation instructions. The routine that we are going to implement will extract every fourth bit from each element in an array of unsigned integers. We are using 0xFEFEFEFE as input. In binary form, this is



1111 1110 1111 1110 1111 1110 1111 1110

The rightmost digit of each group of four is the one that will be extracted. The result is therefore 10101010 or 0xAA, which fits in one byte and is therefore stored in an “unsigned char”. A standard implementation would read an integer, and then iteratively mask out the lowest bit and shift the value by four bits to the right:


unsigned int value = input[pos];
unsigned char result = 0;
// the lowest bit of nibble 0..7 goes to result bit 0..7
for (size_t bitPos=0; bitPos<8; bitPos++)
{
    result |= (value & 1) << bitPos;
    value >>= 4;
}
output1_char[pos] = result;

I don’t bother unrolling the loop, since this is something the compiler can do automatically. Instead, let’s focus on a new instruction, “pext”, that can replace this loop. pext extracts the bits of its first parameter at the positions where the mask given as the second parameter has bits set, and packs them contiguously into the low bits of the result. (See Intel® Advanced Vector Extensions Programming Reference page 7.20) Since the compiler is not able to perform this optimization automatically, we have to request the instruction explicitly. The good news is that you do not have to write assembly to use the new instructions.
Compilers provide so-called “intrinsics” that look and behave like normal function calls, but map directly to specific instructions. The advantage of intrinsics is that you do not have to worry about parameter passing or register assignment; this is all handled by the compiler. Nevertheless, you get the full advantage of the new instructions.
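To make the semantics of pext concrete, here is a portable reference model in plain C++. It is only a sketch for illustration: the function name pext_reference is made up for this example and is not part of any compiler header, and the loop says nothing about how the hardware actually executes the instruction.

```cpp
#include <cassert>
#include <cstdint>

// Reference model of pext (illustration only): walk the mask from its
// lowest set bit upwards and pack each selected bit of 'value' into the
// next free low bit of the result.
std::uint32_t pext_reference(std::uint32_t value, std::uint32_t mask)
{
    std::uint32_t result = 0;
    std::uint32_t outBit = 1;                       // next free result position
    while (mask != 0)
    {
        std::uint32_t lowest = mask & (~mask + 1);  // isolate lowest set mask bit
        if (value & lowest)
            result |= outBit;
        outBit <<= 1;
        mask &= mask - 1;                           // clear that mask bit
    }
    return result;
}

// Usage example with the input value and mask used in this tutorial.
void check_pext_reference()
{
    assert(pext_reference(0xFEFEFEFEu, 0x11111111u) == 0xAAu);
    assert(pext_reference(0x12345678u, 0x0000FF00u) == 0x56u);
}
```

The mask 0x11111111 selects bit 0 of every group of four bits, which is exactly the pattern extracted in this tutorial.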


In our programming example, the call to the intrinsic would look like this:


unsigned int value = input[pos];
unsigned char result = (unsigned char) _pext_u32(value, 0x11111111);
output2_char[pos] = result;

To use the intrinsics, you need to include the header file “immintrin.h”. The whole program, which initializes the input array, runs both versions, and verifies the result, is below:


#include "immintrin.h"
#include <cstddef>
#include <iostream>

int main()
{
    size_t const length=65536;
    unsigned int input[length], output1[length], output2[length];

    // initialization
    for (size_t i = 0; i < length; i++)
        input[i] = 0xFEFEFEFE;

    // standard implementation
    char * output1_char = (char*) output1;
    for (size_t pos=0; pos<length; ++pos)
    {
        unsigned int value = input[pos];
        unsigned char result = 0;
        // the lowest bit of nibble 0..7 goes to result bit 0..7
        for (size_t bitPos=0; bitPos<8; bitPos++)
        {
            result |= (value & 1) << bitPos;
            value >>= 4;
        }
        output1_char[pos] = result;
    }

    // implementation using new bit-manipulation instructions
    char * output2_char = (char*) output2;
    for (size_t pos=0; pos<length; ++pos)
    {
        unsigned int value = input[pos];
        unsigned char result = (unsigned char) _pext_u32(value, 0x11111111);
        output2_char[pos] = result;
    }

    // verify result: the length result bytes fill the first length/4 integers
    for (size_t pos=0; pos<length/4; ++pos)
    {
        if (output1[pos]!=0xAAAAAAAA)
        {
            std::cout << "output1[" << std::dec << pos << "]=0x"
                      << std::hex << output1[pos] << std::endl;
            return -1;
        }
        if (output2[pos]!=0xAAAAAAAA)
        {
            std::cout << "output2[" << std::dec << pos << "]=0x"
                      << std::hex << output2[pos] << std::endl;
            return -1;
        }
    }

    std::cout << "Success!" << std::endl;
    return 0;
}


On Windows, you can simply compile your program. On Linux, the option “-mbmi2” tells the compiler that it should generate the bit-manipulation instructions:

g++-4.7.0 -mbmi2 extract_bits.cpp -o extract_bits

If you are using the vector instructions from AVX2, the option -mavx2 is required instead.
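If the same source file has to build both with and without -mbmi2, one possible approach is to select the implementation at compile time. The sketch below assumes that the compiler defines the macro __BMI2__ when BMI2 code generation is enabled, as GCC does with -mbmi2; other compilers may not define it, in which case the fallback loop is used. The helper name extract_every_fourth_bit is hypothetical.

```cpp
#include <cassert>

#ifdef __BMI2__
#include <immintrin.h>   // _pext_u32, only needed when BMI2 is enabled
#endif

// Extracts the lowest bit of each group of four bits into one byte.
unsigned char extract_every_fourth_bit(unsigned int value)
{
#ifdef __BMI2__
    // Built with BMI2 enabled (e.g. -mbmi2): use the pext intrinsic.
    return static_cast<unsigned char>(_pext_u32(value, 0x11111111u));
#else
    // Built without BMI2: fall back to the shift-and-mask loop.
    unsigned char result = 0;
    for (unsigned int bitPos = 0; bitPos < 8; ++bitPos)
    {
        result |= static_cast<unsigned char>((value & 1u) << bitPos);
        value >>= 4;
    }
    return result;
#endif
}

// Usage example with the input value used in this tutorial.
void check_extract_every_fourth_bit()
{
    assert(extract_every_fourth_bit(0xFEFEFEFEu) == 0xAA);
}
```

Both branches return the same value, so callers do not need to know which one was compiled in. A binary built with the option enabled still requires a processor (or emulator) that supports the instruction.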



Running your program without Haswell


If you execute the program on a processor that does not support the new instructions, you will get an illegal instruction exception, either as a Windows pop-up or on the command line in Linux:



./extract_bits

Illegal instruction

This is no surprise when you do not yet have a processor that supports these instructions. To test our application, we will therefore use an emulator. For obvious reasons, you will not get the performance benefit of the new instructions, but at least we can test whether our program works correctly. Download the Intel® Software Development Emulator (http://software.intel.com/en-us/articles/intel-software-development-emulator/).


On Windows, after unpacking the archive, execute sde-win.bat. It will start a command line where the Haswell instructions are supported. Navigate to your program and start it.


On Linux, extract the archive and run sde, providing your program as a parameter:



<path-to-sde>/sde -- ./extract_bits

Success!
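On real hardware, the illegal instruction exception can be avoided by asking the processor at run time whether it supports BMI2 before calling code that uses pext. Here is a minimal sketch, assuming GCC or a compatible compiler that ships the <cpuid.h> header on x86. The names cpu_has_bmi2, ebx_reports_bmi2, and HAVE_GNU_CPUID are made up for this example. CPUID leaf 7 (sub-leaf 0) reports BMI2 in bit 8 of EBX.

```cpp
#include <cassert>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>          // GCC/Clang helpers for the CPUID instruction
#define HAVE_GNU_CPUID 1    // hypothetical macro, local to this example
#endif

// Decodes the BMI2 feature flag from the EBX value of CPUID leaf 7.
bool ebx_reports_bmi2(unsigned int ebx)
{
    return (ebx & (1u << 8)) != 0;
}

// Returns true when the running processor advertises BMI2.
// On compilers or architectures not covered by the #if above,
// this sketch conservatively reports false.
bool cpu_has_bmi2()
{
#ifdef HAVE_GNU_CPUID
    if (__get_cpuid_max(0, 0) < 7)
        return false;       // leaf 7 is not available on this processor
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ebx_reports_bmi2(ebx);
#else
    return false;
#endif
}

// Usage example for the decoding helper.
void check_ebx_decoding()
{
    assert(ebx_reports_bmi2(0x100u));    // bit 8 set
    assert(!ebx_reports_bmi2(0x020u));   // only bit 5 set
}
```

A program could call cpu_has_bmi2() once at startup and choose between the pext version and the standard loop accordingly.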



This sample source code is released under the Intel Sample Source Code License Agreement.
For details on compiler optimization, see our Optimization Notice.