How to Use Intrinsics

by Joseph D. Wieber, Jr and Gary M. Zoppetti


Introduction

In a previous article, we introduced the intrinsics provided by the Intel® C++ Compiler.  Intrinsics allow a programmer to utilize single instruction, multiple data (SIMD) instructions to perform multiple operations in parallel. In this article, we present five implementations of a dot product algorithm, four of which illustrate the use of intrinsic functions. Following each implementation, we discuss the intrinsic functions that were used, as well as the benefit derived from each.

We begin our discussion by looking at the standard representation of the dot product algorithm. Next, we describe the parallelized algorithm that was implemented using intrinsic functions. Then, we present the standard implementation, followed by the intrinsic implementations. In the appendix, we present the full source code used to generate the performance measurements given in this article.

Before we begin, we note that in an attempt to preserve clarity we have taken the following two measures. First, the intrinsic implementations assume that the size of the input arrays is a multiple of four (SIZE % 4 == 0). Second, the functions assume the arrays contain useful data.  Thus, we do not provide error checking or recovery within the functions.


How to Use Intrinsics

Using the intrinsics is no different than using any other C/C++ library. The programmer includes the correct header file for the type of intrinsic to be used, and then calls the desired intrinsic function. As with other C/C++ functions, programmers must observe the parameter and return types of the intrinsic function. 

There are five header files used with the different types of intrinsics; each one gives the programmer access to a specific technology. They are listed in the following chart.

 Technology  Header file
 MMX™ technology  mmintrin.h
 Streaming SIMD Extensions (SSE)  xmmintrin.h
 Streaming SIMD Extensions 2 (SSE2)  emmintrin.h
 Itanium® Processor (native)  ia64intrin.h
 getReg() and setReg()  ia64regs.h

 

For a comprehensive list of Intrinsics functions see the Intel® C++ Compiler User's Guide.


The Dot Product Algorithm

We now present the dot product algorithm. Given two arrays of size n, the dot product is obtained by multiplying the elements of equal indices together, then summing the resulting products.


 
For the intrinsics versions, we vary the algorithm slightly to make use of SIMD instructions. In the representation above, we calculate a running sum of products, as we traverse both n-element arrays one element at a time. In the intrinsics representation, we process the arrays in chunks of four elements (we view the arrays as a sequence of 4-vectors). Thus, we multiply and sum vectors of four elements. To obtain the scalar result, we sum the components of the final 4-vector contained in part_sum.


Implementations

Standard version:

void

dotmul_intrins1 (short A[], short B[], short &C, const int SIZE)

{

register int k;

short sarr[4];


register __int64 a, b;

register __m64 catch_mul, part_sum;

catch_mul = 0;

part_sum = 0;

for (k = 0; k < SIZE; k += 4)

{

a = __ld8_acq ((void*) &A[k]);

b = __ld8_acq ((void*) &B[k]);

catch_mul = _m64_pmpyshr2 (_m_from_int64 (a),

_m_from_int64 (b), 0);

part_sum = _m_paddw (part_sum, catch_mul);

}


__st8_rel (&sarr[0], _m_to_int64 (part_sum));

C = sarr[0] + sarr[1] + sarr[2] + sarr[3];

}

 

For our first implementation, we used load and store intrinsics to fetch and store 4-vectors in memory.  The __ldx_acq (where x is 1, 2, 4, or 8) is the only load intrinsic provided for integer types. It forces a strict memory access ordering with acquire semantics, meaning that any subsequent instructi on that accesses memory must wait for the load to complete before it can begin. In conjunction with the store (__stx_rel, where x is 1, 2, 4, or 8), these instructions were too costly because they unnecessarily constrained the compiler’s code scheduler.  Because of the acquire semantics of the load, the compiler could not pipeline the loop.

On lines 13 and 14, we use the __ld8_acq intrinsic to load 4-vectors (consisting of 16-bit signed values) into a and b, respectively. On line 15, we use the _m64_pmpyshr2 intrinsic to multiply the corresponding 16-bit elements of a and b. It performs the multiplication in parallel. For the first two parameters, we call a conversion intrinsic, _m_from_int64, to cast the __int64 elements to the __m64 data type needed for the parallel multiply intrinsic. The conversions are only necessary for type checking; they do not translate into assembly instructions. The third parameter is a shift count. The _m64_pmpyshr2 intrinsic performs a right shift on each of the four 16-bit products in the __m64 structure.  Since we do not want the results shifted, we pass a zero as the third parameter. On line 17, we use the _m_paddw intrinsic to perform a parallel addition on the corresponding elements of part_sum and catch_mul. The four running sums are kept in part_sum. To obtain the final result, we need to sum the four elements of part_sum. Hence, on line 20, we take the packed data contained in part_sum and store it in sarr[4] (we do this because there is not an intrinsic that sums the four packed 16-bit elements of a single __m64 or __int64 variable). On line 21, we walk sarr[4] and sum the elements to obtain the final result. While this implementation does yield a speedup of approximately 1.66, it has inefficiencies (like the unnecessary memory ordering constraint) that we remove in the following implementations.

void

dotmul_intrins2 (short A[], short B[], short &C, const int SIZE)

{

register int k;

short sarr[4];

register __m64 *a_ptr, *b_ptr, catch_mul, part_sum;

catch_mul = 0;

part_sum = 0;

for (k = 0; k < SIZE; k += 4)

{

a_ptr = (__m64*) &A[k];

b_ptr = (__m64*) &B[k];

catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0);

part_sum = _m_paddw (part_sum, catch_mul);

}


__st8_rel (&sarr [0], _m_to_int64 (part_sum));

C = sarr[0] + sarr[1] + sarr[2] + sarr[3];

}

 

 

void

dotmul_intrins3 (short A[], short B[], short &C, const int SIZE)

{

register int k;

register __int64 sum;

register __m64 *a_ptr, *b_ptr, catch_mul (0), part_sum (0);


for (k = 0; k < SIZE; k += 4)

{

a_ptr = (__m64*) &A[k];

b_ptr = (__m64*) &B[k];

catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0);

part_sum = _m_paddw (part_sum, catch_mul);

}


sum = _m64_extr (_m_to_int64 (part_sum), 0, 16)

+ _m64_extr (_m_to_int64 (part_sum), 16, 16)

+ _m64_extr (_m_to_int64 (part_sum), 32, 16)

+ _m64_extr (_m_to_int64 (part_sum), 48, 16);


C = (short) _m_to_int (_m_from_int64 (sum));

}

 

 

void

dotmul_intrins4 (short A[], short B[], short &C, const int SIZE)

{


union

{

__m64 part_sum;

short part[4];

} s;


register int k;


register __m64 *a_ptr, *b_ptr, catch_mul (0);


for (k = 0; k < SIZE; k += 4)

{

a_ptr = (__m64*) &A[k];

b_ptr = (__m64*) &B[k];

catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0);

s.part_sum = _m_paddw (s.part_sum, catch_mul);

}


C = s.part[0] + s.part[1] + s.part[2] + s.part[3];

}

 


Conclusion

Although the compiler generates efficient code, it does not generate SIMD instructions. However, the intrinsics provide an interface to SIMD assembly instructions that tell the compiler precisely what parallel operations to generate. Thus, we are able to achieve speedups of more than 4 on a dot product computation by using intrinsics. For performance critical loops that process independent streams of data, we have demonstrated that intrinsics are a valuable tool to achieve significant speedups.


About the Authors

Joseph D. Wieber, Jr. is a dual major in his senior year at Millersville University. His primary major is Computer Science; his secondary major is Business Administration with an option in Management. His main interests are high performance computing, computer networks, computer architecture, and programming languages.

Gary M. Zoppetti earned the Ph.D. (2001) and M.S. (1997) degrees in Computer and Information Sciences from the University of Delaware, and the B.S. (1992) degree in Mathematics and Computer Science from California University of Pennsylvania. As an undergraduate, he participated in the National Student Exchange program to further his studies in computer science, physics, and mathematics at the University of Utah. Prior to joining the faculty at Millersville University in 2002, he held an assistant professorship at the University of Delaware. Dr. Zoppetti's research interests include high performance computing; programming languages, compilers, and runtime systems; and computer architecture.


Appendix

 

/********************************************************************************

*Joseph D. Wieber, Jr. *

*Gary M. Zoppetti *

* *
*Source used to test performance of some intrinsics functions *

********************************************************************************/


#include "iostream"

#include "iomanip"

#include "cstdlib"

#include "ctime"


#include "ia64intrin.h" // includes mmintrin.h

#include "xmmintrin.h" // for SSE


using namespace std;


/********************************************************************************

* DECLARATIONS *

********************************************************************************/


void dotmul_naive (short A [], short B [], short &C, const int SIZE);

void dotmul_intrins1 (short A [], short B [], short &C, const int SIZE);

void dotmul_intrins2 (short A [], short B [], short &C, const int SIZE);

void dotmul_intrins3 (short A [], short B [], short &C, const int SIZE);

void dotmul_intrins4 (short A [], short B [], short &C, const int SIZE);

void print_matrix (short A[], const int SIZE);

void fill_matrices (short A[], short B[], const int SIZE);


/********************************************************************************

* GLOBALS *

********************************************************************************/


const int MAX = 1e+8;

short A[MAX], B[MAX];


/********************************************************************************

* MAIN *

*******************************************************************************/


int

main (int argc, char *argv[])

{

int i;

int SIZE;

short C (0), D (0);


if (argc != 2)

{

cout << "Usage: " << argv[0] << " <size>" << endl;

exit (-1);

}


SIZE = atoi (argv[1]);

if

(SIZE % 4 != 0)

{

cout << "SIZE must be a multiple of 4." << endl;

exit (-2);

}


//one fill many multiplies

fill_matrices (A, B, SIZE);


time_t start = clock ();

dotmul_naive (A, B, D, SIZE);

time_t end = clock ();


double elap = double (end - start) / CLOCKS_PER_SEC;

cout << "For a vector of " << SIZE << " shorts, dotmul_naive Time = "

<< setprecision (4) << elap << "s." << endl;

cout << "*****************************************************************

";


////////////////////////////////////////////////////////////////////////

start = clock ();

dotmul_intrins1 (A, B, C, SIZE);

end = clock ();


double elap1 = double (end - start) / CLOCKS_PER_SEC;

cout << "For a vector of " << SIZE << " shorts, dotmul_intrins1 time = "

<< setprecision (4) << elap1 << "s." << endl;

cout << "

Speedup ==> " << elap / elap1 << endl;

if (C != D)

{

cout << "intrins1 yields a different result

naive result: " << D

<< "

intrins1 result: " << C << endl;

}

else

cout << "Results for intrins1 verified

";

cout << "*****************************************************************

";


////////////////////////////////////////////////////////////////////////

start = clock ();

dotmul_intrins2 (A, B, C, SIZE);

end = clock ();


double elap2 = double (end - start) / CLOCKS_PER_SEC;

cout << "For a vector of " << SIZE << " shorts, dotmul_intrins2 time = "

<< setprecision (4) << elap2 << "s." << endl;

cout << "

Speedup ==> " << elap / elap2 << endl;

if (C != D)

{

cout << "intrins2 yields a different result

naive result: " << D

<< "

intrins2 result: " << C << endl;

}

else

cout << "Results for intrins2 verified

";

cout << "*****************************************************************

";


////////////////////////////////////////////////////////////////////////

start = clock ();

dotmul_intrins3 (A, B, C, SIZE);

end = clock ();


double elap3 = double (end - start) / CLOCKS_PER_SEC;

cout << "For a vector of " << SIZE << " shorts, dotmul_intrins3 time = "

<< setprecision (4) << elap3 << "s." << endl;

cout << "

Speedup ==> " << elap / elap3 << endl;

if (C != D)

{

cout << "intrins3 yields a different result

 naive result: " << D

<< "

intrins3 result: " << C << endl;

}

else

cout << "Results for intrins3 verified

";

cout << "*****************************************************************

";


////////////////////////////////////////////////////////////////////////

start = clock ();

dotmul_intrins4 (A, B, C, SIZE);

end = clock ();


double elap4 = double (end - start) / CLOCKS_PER_SEC;

cout << "For a vector of " << SIZE << " shorts, dotmul_intrins4 time = "

<< setprecision

 (4) << elap4 << "s." << endl;

cout << "

Speedup ==> " << elap / elap4 << endl;

if (C != D)

{

cout << "intrins4 yields a different result

 naive result: " << D

<< "

intrins4 result: " << C << endl;

}

else

cout << "Results for intrins4 verified

";

cout << "*****************************************************************

";


return (0);

}



/********************************************************************************

* DEFINITIONS *

********************************************************************************/


void

dotmul_naive (short A [], short B [], short &C, const int SIZE)

{

register int k;

C = 0;

for (k = 0; k < SIZE; k++)

C += A[k] * B[k];

}


/********************************************************************************/

void

dotmul_intrins1 (short A [], short B [], short &C, const int SIZE)

{

register int k;

short sarr [4];


register __int64 a, b;

register __m64 catch_mul, part_sum;

catch_mul = 0;

part_sum = 0;

for (k = 0; k < SIZE; k += 4)

{

a = __ld8_acq ((void*) &A[k]);

b = __ld8_acq ((void*) &B[k]);

catch_mul = _m64_pmpyshr2 (_m_from_int64 (a), _m_from_int64 (b), 0);

part_sum = _m_paddw (part_sum, catch_mul);

}


__st8_rel (&sarr [0], _m_to_int64 (part_sum));

C = sarr [0] + sarr [1] + sarr [2] + sarr [3];

}


/********************************************************************************/

void

dotmul_intrins2 (short A [], short B [], short &C, const int SIZE)

{

register int k;

short sarr [4];

register __m64 *a_ptr, *b_ptr, catch_mul, part_sum;

catch_mul = 0;

part_sum = 0;

for (k = 0; k < SIZE; k += 4)

{

a_ptr = (__m64*) &A[k];

b_ptr = (__m64*) &B[k];

catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0);

part_sum = _m_paddw (part_sum, catch_mul);

}


__st8_rel (&sarr [0], _m_to_int64 (part_sum));

C = sarr [0] + sarr [1] + sarr [2] + sarr [3];

}




/********************************************************************************/

void

dotmul_intrins3 (short A [], short B [], short &C, const int SIZE)

{

register int k;

register __int64 sum;

register __m64 *a_ptr, *b_ptr, catch_mul (0), part_sum (0);


for (k = 0; k < SIZE; k += 4)

{

a_ptr = (__m64*) &A[k];

b_ptr = (__m64*) &B[k];

catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0);

part_sum = _m_paddw (part_sum, catch_mul);

}


sum = _m64_extr (_m_to_int64 (part_sum), 0, 16)

+ _m64_extr (_m_to_int64 (part_sum), 16, 16)

+ _m64_extr (_m_to_int64 (part_sum), 32, 16)

+ _m64_extr (_m_to_int64 (part_sum), 48, 16);


C = (short) _m_to_int (_m_from_int64 (sum));

}


/********************************************************************************/

void

dotmul_intrins4(short A [], short B [], short &C, const int SIZE)

{


union

{

__m64 part_sum;

short part[4];

} s;


register int k;

register __m64 *a_ptr, *b_ptr, catch_mul (0);


for (k = 0; k < SIZE; k += 4)

{

a_ptr = (__m64*) &A[k];

b_ptr = (__m64*) &B[k];

catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0);

s.part_sum = _m_paddw (s.part_sum, catch_mul);

}


C = s.part[0] + s.part[1] + s.part[2] + s.part[3];


}


/********************************************************************************/

void

fill_matrices (short A[], short B[], const int SIZE)

{

int i;

int r1, r2;


srand (0); // keep it consistent for testing

for (i = 0; i < SIZE; i++)

{

A[i] = rand () % 9 + 1;

B[i] = rand () % 9 + 1;

}

}


/********************************************************************************/


void

print_matrix (short A[], const int SIZE)

{

int i;


for (i = 0; i < SIZE; i++)

{

cout << setw (10) << A[i];

}


cout << endl;

cout << endl;

}

/********************************************************************************/

 


For more complete information about compiler optimizations, see our Optimization Notice.