By Christopher Owens (Intel), Added
by Kipp Owens, Applications Engineer &
Rajiv Parikh, Sr. Applications Engineer
Software Solutions Group, Intel Corporation.
Abstract
This paper shows how to speed up a commonly used pseudo-random number generation algorithm easily by taking advantage of the Streaming SIMD Extensions 2 (SSE2)instruction set on the Intel® Pentium® 4 processor. The paper includes code that utilizes SSE2 intrinsics (Intrinsics) for generating pseudo-random integers. Some quantitative analysis is also presented for comparison with the original algorithm.
Introduction
Random number generators are used in many applications today. They are useful in applications ranging from a simplistic application for buying a lottery ticket to complex applications such as cryptography. The most commonly used algorithm is called Linear Congruential Generator (LCG). It is also the algorithm used in most C library rand() functions. This paper will provide a brief overview of the original LCG algorithm and implementation. It assumes the reader is familiar with general C and assembly programming. Next, it will show the SSE based solution to 'vectorize' the algorithm. The paper will also compare the original algorithm with the new algorithm, showing some tradeoffs, and applications. All performance data presented was collected on an Intel® Pentium® 4 Processor 3.06 GHz with HT Technology enabled. The Intel® C compiler 7.0 under Microsoft* Visual Studio* 6.0 was used for all compilations. All SSE2 coding was implemented using Intrinsics (avoiding pure assembly use), as supported by the Intel C compiler.
LCG Algorithm
The classic LCG algorithm is widely used in many applications. The most general form is:
Where X_{n} depends on X_{n-1}, scalars a and c, and the final value is modulo m. Usually m is a power of two, so a simple mask can be used. This returns a random value of n = log _{2} m bits. The selection of values a and c for any given n is a subject of many large studies, which will not be discussed here. A list of values can be found in several of the listed references. We chose to use the same a and c values for fast_rand() as the standard math library routine rand() uses (discussed in Cycle Length Analysis section below).
The algorithm for the rand() function in the C library is, however, still a slight modification to the above. The equation is the same but the value it returns to the calling function is shifted right by 16 bits to reduce low order bits correlation and masked by 7FFFh, eliminating the sign bit. The fast_rand() function shown below implements the same scalar LCG function that rand() uses. It uses unsigned 32 bit integer, but to be compatible with rand() The range is reduced by shifting and masking out the upper bit. Listing 1 shows the code for the fast_rand().
Listing 1: Fast_rand() function code in C
static unsigned int g_seed; //Used to seed the generator. inline void fast_srand( int seed ) { g_seed = seed; } //fastrand routine returns one integer, similar output value range as C lib. inline int fastrand() { g_seed = (214013*g_seed+2531011); return (g_seed>>16)&0x7FFF; } |
All the code in this paper is compiled using the Intel® C/C++ compiler version 7.0, with no optimization flags. The rand_sse() function implements a vectorized version of this fast_rand() function, where the integer math operations are done in fours, using the SIMD architecture. For this, however, the function requires a pointer to the array as a parameter because C doesn't allow returning an array of values. This usage model is different from the above routines. Listing 2 shows the code for the rand_sse() in C (using Intrinsics) along with its assembly comments. Notice also the compatibility flag for applications that require similar numeric range and reduced low order bits correlation as the rand() function, with the shift right and mask operation.
Listing 2: Listing of SSE2 implementation of RNG
///////////////////////////////////////////////////////////////////////////// // The Software is provided "AS IS" and possibly with faults. // Intel disclaims any and all warranties and guarantees, express, implied or // otherwise, arising, with respect to the software delivered hereunder, // including but not limited to the warranty of merchantability, the warranty // of fitness for a particular purpose, and any warranty of non-infringement // of the intellectual property rights of any third party. // Intel neither assumes nor authorizes any person to assume for it any other // liability. Customer will use the software at its own risk. Intel will not // be liable to customer for any direct or indirect damages incurred in using // the software. In no event will Intel be liable for loss of profits, loss of // use, loss of data, business interruption, nor for punitive, incidental, // consequential, or special damages of any kind, even if advised of // the possibility of such damages. // // Copyright (c) 2003 Intel Corporation // // Third-party brands and names are the property of their respective owners // /////////////////////////////////////////////////////////////////////////// // Random Number Generation for SSE / SSE2 // Source File // Version 0.1 // Author Kipp Owens, Rajiv Parikh //////////////////////////////////////////////////////////////////////// #ifndef RAND_SSE_H #define RAND_SSE_H #include "emmintrin.h" #define COMPATABILITY //define this if you wish to return values similar to the standard rand(); void srand_sse( unsigned int seed ); void rand_sse( unsigned int* ); __declspec( align(16) ) static __m128i cur_seed; void srand_sse( unsigned int seed ) { cur_seed = _mm_set_epi32( seed, seed+1, seed, seed+1 ); } inline void rand_sse( unsigned int* result ) { __declspec( align(16) ) __m128i cur_seed_split; __declspec( align(16) ) __m128i multiplier; __declspec( align(16) ) __m128i adder; __declspec( align(16) ) __m128i mod_mask; __declspec( align(16) ) __m128i sra_mask; __declspec( align(16) ) __m128i sseresult; __declspec( align(16) ) static const unsigned int mult[4] = { 214013, 17405, 214013, 69069 }; __declspec( align(16) ) static const unsigned int gadd[4] = { 2531011, 10395331, 13737667, 1 }; __declspec( align(16) ) static const unsigned int mask[4] = { 0xFFFFFFFF, 0, 0xFFFFFFFF, 0 }; __declspec( align(16) ) static const unsigned int masklo[4] = { 0x00007FFF, 0x00007FFF, 0x00007FFF, 0x00007FFF }; adder = _mm_load_si128( (__m128i*) gadd); multiplier = _mm_load_si128( (__m128i*) mult); mod_mask = _mm_load_si128( (__m128i*) mask); sra_mask = _mm_load_si128( (__m128i*) masklo); cur_seed_split = _mm_shuffle_epi32( cur_seed, _MM_SHUFFLE( 2, 3, 0, 1 ) ); cur_seed = _mm_mul_epu32( cur_seed, multiplier ); multiplier = _mm_shuffle_epi32( multiplier, _MM_SHUFFLE( 2, 3, 0, 1 ) ); cur_seed_split = _mm_mul_epu32( cur_seed_split, multiplier ); cur_seed = _mm_and_si128( cur_seed, mod_mask); cur_seed_split = _mm_and_si128( cur_seed_split, mod_mask ); cur_seed_split = _mm_shuffle_epi32( cur_seed_split, _MM_SHUFFLE( 2, 3, 0, 1 ) ); cur_seed = _mm_or_si128( cur_seed, cur_seed_split ); cur_seed = _mm_add_epi32( cur_seed, adder); #ifdef COMPATABILITY // Add the lines below if you wish to reduce your results to 16-bit vals... sseresult = _mm_srai_epi32( cur_seed, 16); sseresult = _mm_and_si128( sseresult, sra_mask ); _mm_storeu_si128( (__m128i*) result, sseresult ); return; #endif _mm_storeu_si128( (__m128i*) result, cur_seed); return; } |
Characterization Of Each LCG RNG
In this section, some of the high level characteristics of both implementations will be examined. First is the cycle length. This is the number of values returned before the entire sequence starts repeating. Second is 'uniformity' of the distribution of the numbers. This roughly measures whether the numbers are skewed towards any particular value or set of values. It is measured with respect to the output value range. Third, the speed of generating and filling a large array will be measured and compared to evaluate the throughput. It is this last characteristic that should improve with the rand_sse() implementation using SSE2.
Cycle Length Analysis
The cycle length of any LCG random number generator is completely dependent upon the constants one chooses for equation 1 noted above. The cycle will never exceed the modulus (m). For our application (using 32-bit numbers) we chose m to be 2^32-1 (the largest 32-bit unsigned integer possible). This also allowed us to let the natural overflow of the 32-bit unsigned integer to act as the modulus. With a modulus selected, choosing the constants a and c becomes a bit easier. To reach the theoretical maximum of m numbers per cycle, one must follow very precise rules when selecting a and c.
The cycle of an LCG random number generator of the form X_{n} = (a X_{n-1} + c) mod m will only be of length m if and only if the following three conditions are met:
- c is relatively prime to m
- a-1 is a multiple of p, where p is every prime number that divides m
- a-1 is a multiple of 4 when m is a multiple of 4
fast_rand() is implemented using the first set of constants in Table 1 below. The result is a random number generator that produces a sequence of random numbers that only repeats every 2^32 numbers generated (m+1 in this case, because the number zero is included). In the case of rand_sse() we could use the same formula and constants for each of the four simultaneous calculations completed using SIMD, but that would result in the same 2^32 length cycle. On the other hand, if four sets of constants were used, all of which satisfied the rules above, one could extend the length of the generator's cycle to the sum of the four cycles contained within it. Fortunately, there is more than one set of constants to choose from that fit within our constraints. We selected the four in the table below.
Table 1: Constants and resulting cycles
a | c | m | cycle |
214013 | 2531011 | 2^32-1 | 2^32 |
17405 | 10395331 | 2^32-1 | 2^32 |
214013 | 13737667 | 2^32-1 | 2^32 |
69069 | 1 | 2^32-1 | 2^32 |
With four different LCG functions running–each with a cycle of 2^32–the cycle length for 32-bit random numbers is extended to 4*2^32 or 2^34 for rand_sse(). Or, if one desired, one could use rand_sse() to generate 64-bit or 128-bit random numbers (with cycles of 2^33 and 2^32 respectively).
Uniformity Analysis
So the cycle length of rand_sse() is four times larger than that of fast_rand(), but is it "as good?" There are numerous tests that have been invented to check the value of random number generators. While these tests do warrant a closer look, they are beyond the scope of this paper. That said, in the attempt to show the quality of rand_sse(), we have included two of the most basic of such tests – 1-D and 2-D distribution.
1-D Distribution
The distribution of numbers produced by an ideal random number generator should be uniform from 0 to its maximum number produced– in our case this is m. Using statistics one would expect, therefore, a mean m/2 and a standard deviation of m/(2*3).
To test how close both the fast_rand() and rand_sse() functions come to producing this distribution, we produced 1 million random numbers with each generator (fast_rand() and rand_sse()) and then sorted them into one hundred equal bins. The resulting histogram of each function can be seen in Figure 1 and Figure 2 below and the measured mean and standard deviation along with the percent error of each function can be found in Table 2.
Figure 1: 1-D Distribution of fast_rand()
Figure 2: 1-D Distribution of rand_sse()
Table 2: Statistics for fast_rand() and rand_sse()
mean | standard deviation | |
ideal uniform distribution | (2^32)/2 = 2147483648 | (2^32)/(2*3) = 1239850262 |
rand_sse | 2146969795 | 1239930971 |
rand_sse %error | 0.024% | 0.007% |
fast_rand | 2146454860 | 1239659448 |
fast_rand %error | 0.048% | 0.015% |
2-D Distribution
2-D distribution is really just an extension of 1-D. Two random numbers are generated to represent an x and a y coordinate pair. Ideally, the random numbers should be spread evenly throughout the 2-D space ranging from 0 to m in both the x and y directions. Shown in histogram form, the ideal 2-D distribution would appear as a perfect cube. Below both the fast_rand() and rand_sse() 2-D histograms can be seen to approach this state.
Figure 3: 2-D Distribution of fast_rand()
Figure 4: 2-D Distribution of rand_sse()
Throughput Analysis
Lastly, and for many, most importantly, is the speed of the random number generator. Speed was tested by timing how long each random number generator took to produce one billion random numbers. The functions tested were the standard math library function rand(), fast_rand(), and rand_sse(). The results of each test including acceleration relative to rand() are listed in Table 3 below.
Table 3: Time to compute one billion random numbers
Function | Time (sec) | Speedup |
rand() | 10.03 | 1.00 |
fast_rand() | 4.99 | 2.01 |
rand_sse() | 1.83 | 5.48 |
Conclusion
Based on a raw speed comparison, rand_sse() is the clear winner, computing one billion random numbers 2.73 times faster than fast_rand() and 5.48 times faster than the standard rand() function. This performance is achieved without compromising on cycle length or uniformity. In fact, in both cases, rand_sse() shows a relative improvement over the scalar implementations.
In addition to its significantly longer cycle and significantly quicker production than fast_rand() and rand(), rand_sse() is also more flexible. One could use rand_sse() to produce 32-bit, 64-bit, 96-bit, or 128-bit random values.
With its significantly faster performance and flexibility over the scalar implementations, rand_sse() could prove useful in a wide variety of applications today.
Related Resources
Reference Materials
- Intel® C++ Compiler for Windows* Product Information
- Intel® 64 and IA-32 Architectures Software Developer's Manuals
- Numerical Recipes in C, Second Edition, Press, et al., ISBN 0-521-43108-5
- Computation Physics/Carleton University/ Random Numbers*
- A collection of selected pseudorandom number generators with linear structures, Karl Entacher; Linear Congruential Generator*
- Parallel Monte Carlo Methods for Derivative Security Pricing* (PDF), Giorgio Pauletto
- NIST/SEMATECH e-Handbook of Statistical Methods*
Developer Centers
- Pentium® 4 Processor Performance Optimization
- Intel® Xeon® Processor Performance Optimization
About the Authors
Rajiv D. Parikh, Sr. Applications Engineer, Intel Corporation, Folsom, CA. Mr. Parikh has held various technical positions at Intel since joining in 1993. His focus has been multi-processor systems and software engineering. He has a Masters and Bachelors of Science in Computer and Electrical Engineering from Virginia Tech.
Kipp Owens, Applications Engineer, Intel Corporation, Folsom, CA. Mr. Owens has worked predominantly in the area of software optimization. He has a Bachelors of Science in Electrical Engineering from Purdue University.
Adicionar comentário
Topo(Para ver discussões técnicas, visite os nossos fóruns de desenvolvedores. Para falar sobre problemas do site ou de produtos de software entre em contato com o suporte técnico.)
Faça login para adicionar um comentário. Não é membro? Inscreva-se hoje mesmo!