**by Joseph D. Wieber, Jr and Gary M. Zoppetti**

### Introduction

In a previous article, we introduced the intrinsics provided by the Intel® C++ Compiler. Intrinsics allow a programmer to utilize single instruction, multiple data (SIMD) instructions to perform multiple operations in parallel. In this article, we present five implementations of a dot product algorithm, four of which illustrate the use of intrinsic functions. Following each implementation, we discuss the intrinsic functions that were used, as well as the benefit derived from each.

We begin our discussion by looking at the standard representation of the dot product algorithm. Next, we describe the parallelized algorithm that was implemented using intrinsic functions. Then, we present the standard implementation, followed by the intrinsic implementations. In the appendix, we present the full source code used to generate the performance measurements given in this article.

Before we begin, we note that in an attempt to preserve clarity we have taken the following two measures. First, the intrinsic implementations assume that the size of the input arrays is a multiple of four (SIZE % 4 == 0). Second, the functions assume the arrays contain useful data. Thus, we do not provide error checking or recovery within the functions.

### How to Use Intrinsics

Using the intrinsics is no different than using any other C/C++ library. The programmer includes the correct header file for the type of intrinsic to be used, and then calls the desired intrinsic function. As with other C/C++ functions, programmers must observe the parameter and return types of the intrinsic function.

There are five header files used with the different types of intrinsics; each one gives the programmer access to a specific technology. They are listed in the following chart.

Technology |
Header file |

MMX™ technology | mmintrin.h |

Streaming SIMD Extensions (SSE) | xmmintrin.h |

Streaming SIMD Extensions 2 (SSE2) | emmintrin.h |

Itanium® Processor (native) | ia64intrin.h |

getReg() and setReg() | ia64regs.h |

For a comprehensive list of Intrinsics functions see the Intel® C++ Compiler User's Guide.

### The Dot Product Algorithm

We now present the dot product algorithm. Given two arrays of size n, the dot product is obtained by multiplying the elements of equal indices together, then summing the resulting products.

For the intrinsics versions, we vary the algorithm slightly to make use of SIMD instructions. In the representation above, we calculate a running sum of products, as we traverse both n-element arrays one element at a time. In the intrinsics representation, we process the arrays in chunks of four elements (we view the arrays as a sequence of 4-vectors). Thus, we multiply and sum vectors of four elements. To obtain the scalar result, we sum the components of the final 4-vector contained in *part_sum*.

### Implementations

**Standard version:**

void dotmul_intrins1 (short A[], short B[], short &C, const int SIZE) { register int k; short sarr[4]; register __int64 a, b; register __m64 catch_mul, part_sum; catch_mul = 0; part_sum = 0; for (k = 0; k < SIZE; k += 4) { a = __ld8_acq ((void*) &A[k]); b = __ld8_acq ((void*) &B[k]); catch_mul = _m64_pmpyshr2 (_m_from_int64 (a), _m_from_int64 (b), 0); part_sum = _m_paddw (part_sum, catch_mul); } __st8_rel (&sarr[0], _m_to_int64 (part_sum)); C = sarr[0] + sarr[1] + sarr[2] + sarr[3]; } |

For our first implementation, we used load and store intrinsics to fetch and store 4-vectors in memory. The __ldx_acq (where x is 1, 2, 4, or 8) is the only load intrinsic provided for integer types. It forces a strict memory access ordering with acquire semantics, meaning that any subsequent instructi on that accesses memory must wait for the load to complete before it can begin. In conjunction with the store (__stx_rel, where x is 1, 2, 4, or 8), these instructions were too costly because they unnecessarily constrained the compiler’s code scheduler. Because of the acquire semantics of the load, the compiler could not pipeline the loop.

On lines 13 and 14, we use the __ld8_acq intrinsic to load 4-vectors (consisting of 16-bit signed values) into *a* and *b*, respectively. On line 15, we use the _m64_pmpyshr2 intrinsic to multiply the corresponding 16-bit elements of *a* and *b*. It performs the multiplication in parallel. For the first two parameters, we call a conversion intrinsic, _m_from_int64, to cast the __int64 elements to the __m64 data type needed for the parallel multiply intrinsic. The conversions are only necessary for type checking; they do not translate into assembly instructions. The third parameter is a shift count. The _m64_pmpyshr2 intrinsic performs a right shift on each of the four 16-bit products in the __m64 structure. Since we do not want the results shifted, we pass a zero as the third parameter. On line 17, we use the _m_paddw intrinsic to perform a parallel addition on the corresponding elements of *part_sum* and *catch_mul*. The four running sums are kept in *part_sum*. To obtain the final result, we need to sum the four elements of part_sum. Hence, on line 20, we take the packed data contained in part_sum and store it in sarr[4] (we do this because there is not an intrinsic that sums the four packed 16-bit elements of a single __m64 or __int64 variable). On line 21, we walk sarr[4] and sum the elements to obtain the final result. While this implementation does yield a speedup of approximately 1.66, it has inefficiencies (like the unnecessary memory ordering constraint) that we remove in the following implementations.

void dotmul_intrins2 (short A[], short B[], short &C, const int SIZE) { register int k; short sarr[4]; register __m64 *a_ptr, *b_ptr, catch_mul, part_sum; catch_mul = 0; part_sum = 0; for (k = 0; k < SIZE; k += 4) { a_ptr = (__m64*) &A[k]; b_ptr = (__m64*) &B[k]; catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0); part_sum = _m_paddw (part_sum, catch_mul); } __st8_rel (&sarr [0], _m_to_int64 (part_sum)); C = sarr[0] + sarr[1] + sarr[2] + sarr[3]; } |

void dotmul_intrins3 (short A[], short B[], short &C, const int SIZE) { register int k; register __int64 sum; register __m64 *a_ptr, *b_ptr, catch_mul (0), part_sum (0); for (k = 0; k < SIZE; k += 4) { a_ptr = (__m64*) &A[k]; b_ptr = (__m64*) &B[k]; catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0); part_sum = _m_paddw (part_sum, catch_mul); } sum = _m64_extr (_m_to_int64 (part_sum), 0, 16) + _m64_extr (_m_to_int64 (part_sum), 16, 16) + _m64_extr (_m_to_int64 (part_sum), 32, 16) + _m64_extr (_m_to_int64 (part_sum), 48, 16); C = (short) _m_to_int (_m_from_int64 (sum)); } |

void dotmul_intrins4 (short A[], short B[], short &C, const int SIZE) { union { __m64 part_sum; short part[4]; } s; register int k; register __m64 *a_ptr, *b_ptr, catch_mul (0); for (k = 0; k < SIZE; k += 4) { a_ptr = (__m64*) &A[k]; b_ptr = (__m64*) &B[k]; catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0); s.part_sum = _m_paddw (s.part_sum, catch_mul); } C = s.part[0] + s.part[1] + s.part[2] + s.part[3]; } |

### Conclusion

Although the compiler generates efficient code, it does not generate SIMD instructions. However, the intrinsics provide an interface to SIMD assembly instructions that tell the compiler precisely what parallel operations to generate. Thus, we are able to achieve speedups of more than 4 on a dot product computation by using intrinsics. For performance critical loops that process independent streams of data, we have demonstrated that intrinsics are a valuable tool to achieve significant speedups.

### About the Authors

**Joseph D. Wieber, Jr.** is a dual major in his senior year at Millersville University. His primary major is Computer Science; his secondary major is Business Administration with an option in Management. His main interests are high performance computing, computer networks, computer architecture, and programming languages.

**Gary M. Zoppetti** earned the Ph.D. (2001) and M.S. (1997) degrees in Computer and Information Sciences from the University of Delaware, and the B.S. (1992) degree in Mathematics and Computer Science from California University of Pennsylvania. As an undergraduate, he participated in the National Student Exchange program to further his studies in computer science, physics, and mathematics at the University of Utah. Prior to joining the faculty at Millersville University in 2002, he held an assistant professorship at the University of Delaware. Dr. Zoppetti's research interests include high performance computing; programming languages, compilers, and runtime systems; and computer architecture.

### Appendix

/******************************************************************************** *Joseph D. Wieber, Jr. * *Gary M. Zoppetti * * * *Source used to test performance of some intrinsics functions * ********************************************************************************/ #include "iostream" #include "iomanip" #include "cstdlib" #include "ctime" #include "ia64intrin.h" // includes mmintrin.h #include "xmmintrin.h" // for SSE using namespace std; /******************************************************************************** * DECLARATIONS * ********************************************************************************/ void dotmul_naive (short A [], short B [], short &C, const int SIZE); void dotmul_intrins1 (short A [], short B [], short &C, const int SIZE); void dotmul_intrins2 (short A [], short B [], short &C, const int SIZE); void dotmul_intrins3 (short A [], short B [], short &C, const int SIZE); void dotmul_intrins4 (short A [], short B [], short &C, const int SIZE); void print_matrix (short A[], const int SIZE); void fill_matrices (short A[], short B[], const int SIZE); /******************************************************************************** * GLOBALS * ********************************************************************************/ const int MAX = 1e+8; short A[MAX], B[MAX]; /******************************************************************************** * MAIN * *******************************************************************************/ int main (int argc, char *argv[]) { int i; int SIZE; short C (0), D (0); if (argc != 2) { cout << "Usage: " << argv[0] << " <size>" << endl; exit (-1); } SIZE = atoi (argv[1]); if (SIZE % 4 != 0) { cout << "SIZE must be a multiple of 4." << endl; exit (-2); } //one fill many multiplies fill_matrices (A, B, SIZE); time_t start = clock (); dotmul_naive (A, B, D, SIZE); time_t end = clock (); double elap = double (end - start) / CLOCKS_PER_SEC; cout << "For a vector of " << SIZE << " shorts, dotmul_naive Time = " << setprecision (4) << elap << "s." << endl; cout << "***************************************************************** "; //////////////////////////////////////////////////////////////////////// start = clock (); dotmul_intrins1 (A, B, C, SIZE); end = clock (); double elap1 = double (end - start) / CLOCKS_PER_SEC; cout << "For a vector of " << SIZE << " shorts, dotmul_intrins1 time = " << setprecision (4) << elap1 << "s." << endl; cout << " Speedup ==> " << elap / elap1 << endl; if (C != D) { cout << "intrins1 yields a different result naive result: " << D << " intrins1 result: " << C << endl; } else cout << "Results for intrins1 verified "; cout << "***************************************************************** "; //////////////////////////////////////////////////////////////////////// start = clock (); dotmul_intrins2 (A, B, C, SIZE); end = clock (); double elap2 = double (end - start) / CLOCKS_PER_SEC; cout << "For a vector of " << SIZE << " shorts, dotmul_intrins2 time = " << setprecision (4) << elap2 << "s." << endl; cout << " Speedup ==> " << elap / elap2 << endl; if (C != D) { cout << "intrins2 yields a different result naive result: " << D << " intrins2 result: " << C << endl; } else cout << "Results for intrins2 verified "; cout << "***************************************************************** "; //////////////////////////////////////////////////////////////////////// start = clock (); dotmul_intrins3 (A, B, C, SIZE); end = clock (); double elap3 = double (end - start) / CLOCKS_PER_SEC; cout << "For a vector of " << SIZE << " shorts, dotmul_intrins3 time = " << setprecision (4) << elap3 << "s." << endl; cout << " Speedup ==> " << elap / elap3 << endl; if (C != D) { cout << "intrins3 yields a different result naive result: " << D << " intrins3 result: " << C << endl; } else cout << "Results for intrins3 verified "; cout << "***************************************************************** "; //////////////////////////////////////////////////////////////////////// start = clock (); dotmul_intrins4 (A, B, C, SIZE); end = clock (); double elap4 = double (end - start) / CLOCKS_PER_SEC; cout << "For a vector of " << SIZE << " shorts, dotmul_intrins4 time = " << setprecision (4) << elap4 << "s." << endl; cout << " Speedup ==> " << elap / elap4 << endl; if (C != D) { cout << "intrins4 yields a different result naive result: " << D << " intrins4 result: " << C << endl; } else cout << "Results for intrins4 verified "; cout << "***************************************************************** "; return (0); } /******************************************************************************** * DEFINITIONS * ********************************************************************************/ void dotmul_naive (short A [], short B [], short &C, const int SIZE) { register int k; C = 0; for (k = 0; k < SIZE; k++) C += A[k] * B[k]; } /********************************************************************************/ void dotmul_intrins1 (short A [], short B [], short &C, const int SIZE) { register int k; short sarr [4]; register __int64 a, b; register __m64 catch_mul, part_sum; catch_mul = 0; part_sum = 0; for (k = 0; k < SIZE; k += 4) { a = __ld8_acq ((void*) &A[k]); b = __ld8_acq ((void*) &B[k]); catch_mul = _m64_pmpyshr2 (_m_from_int64 (a), _m_from_int64 (b), 0); part_sum = _m_paddw (part_sum, catch_mul); } __st8_rel (&sarr [0], _m_to_int64 (part_sum)); C = sarr [0] + sarr [1] + sarr [2] + sarr [3]; } /********************************************************************************/ void dotmul_intrins2 (short A [], short B [], short &C, const int SIZE) { register int k; short sarr [4]; register __m64 *a_ptr, *b_ptr, catch_mul, part_sum; catch_mul = 0; part_sum = 0; for (k = 0; k < SIZE; k += 4) { a_ptr = (__m64*) &A[k]; b_ptr = (__m64*) &B[k]; catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0); part_sum = _m_paddw (part_sum, catch_mul); } __st8_rel (&sarr [0], _m_to_int64 (part_sum)); C = sarr [0] + sarr [1] + sarr [2] + sarr [3]; } /********************************************************************************/ void dotmul_intrins3 (short A [], short B [], short &C, const int SIZE) { register int k; register __int64 sum; register __m64 *a_ptr, *b_ptr, catch_mul (0), part_sum (0); for (k = 0; k < SIZE; k += 4) { a_ptr = (__m64*) &A[k]; b_ptr = (__m64*) &B[k]; catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0); part_sum = _m_paddw (part_sum, catch_mul); } sum = _m64_extr (_m_to_int64 (part_sum), 0, 16) + _m64_extr (_m_to_int64 (part_sum), 16, 16) + _m64_extr (_m_to_int64 (part_sum), 32, 16) + _m64_extr (_m_to_int64 (part_sum), 48, 16); C = (short) _m_to_int (_m_from_int64 (sum)); } /********************************************************************************/ void dotmul_intrins4(short A [], short B [], short &C, const int SIZE) { union { __m64 part_sum; short part[4]; } s; register int k; register __m64 *a_ptr, *b_ptr, catch_mul (0); for (k = 0; k < SIZE; k += 4) { a_ptr = (__m64*) &A[k]; b_ptr = (__m64*) &B[k]; catch_mul = _m64_pmpyshr2 (*a_ptr, *b_ptr, 0); s.part_sum = _m_paddw (s.part_sum, catch_mul); } C = s.part[0] + s.part[1] + s.part[2] + s.part[3]; } /********************************************************************************/ void fill_matrices (short A[], short B[], const int SIZE) { int i; int r1, r2; srand (0); // keep it consistent for testing for (i = 0; i < SIZE; i++) { A[i] = rand () % 9 + 1; B[i] = rand () % 9 + 1; } } /********************************************************************************/ void print_matrix (short A[], const int SIZE) { int i; for (i = 0; i < SIZE; i++) { cout << setw (10) << A[i]; } cout << endl; cout << endl; } /********************************************************************************/ |

## Comments (2)

TopThomas Willhalm... said on

The include files for more recent processors are:

To learn more about current SIMD instruction, I recommend the AVX2 portal or the Intel® AVX and CPU Instructions forum.

Anonymous said on

There is a discussion of the first implementation, followed by a conclusion on how the latter versions work much better. What happened to the section describing the latter implementations? It appears it was just lost or not copied...

## Add a Comment

Top(For technical discussions visit our developer forums. For site or software product issues contact support.)

Please sign in to add a comment. Not a member? Join today