New counter-based Random Number Generators in Intel® Math Kernel Library

       Intel® MKL 11.3 provides ARS-5 and Philox4x32-10, new counter-based basic random number generators (BRNGs), introduced in [1]. ARS-5 is ARS with 5 rounds, and Philox4x32-10 is Philox with 10 rounds over 4 32-bit inputs (see more details in [1]). While providing good statistical properties and good performance, they have small state size and have fantastically rich ability for multi-streaming (by using different 128- or 64-bit keys for different independent streams) and sub-streaming (by using for example contiguous blocks of 128-bit counters), efficiently scaled on many compute threads, that makes them ideally suited for future exa-scale era of computing.

In a high level, counter based BRNGs can be described as follows. Let’s define any keyed family of BRNGs as a structure (K, ZJ; S, f, U, g), where

  • K – key space;
  • ZJ = {0, 1, ..., J-1}, where J is integer output multiplicity, i.e. size of state in 32-bit integer values;
  • S -- state space;
  • U -- output space;
  • f: S -> S -- state transition function, si = f(si-1);
  • g: K x ZJ x S -> U -- output function, outki = gki mod J(s[i/J]).

       For conventional BRNGs, f is usually complicated, g is simple. For counter based BRNGs, g is complicated, f is simple counter f(s) = (s+1) mod 2^p, where p is state size in bits. Such simple f means absence of computational dependence between consecutive states. In case of ARS-5 and Philox4x32-10, size of state and output is 128 bits (or 4 32-bit integers). Size of key is 128 bits for ARS-5 and 64 bits for Philox4x32-10. ARS-5 is available only on microarchitectures with AES-NI instructions, and in case if the application would be launched on microarchitecture which doesn’t support AES-NI instructions, then VSL RNG routines will return VSL_RNG_ERROR_ARS5_NOT_SUPPORTED status code.

Here is typical usage example of ARS-5 in C language:

Here is typical usage example of Philox4x32-10 in C language

Below we show performance of ARS-5 and Philox4x32-10 in Intel MKL 11.3, and for comparison we show also performance of some BRNGs with best statistical properties that VSL RNG provides, i.e.:

  • MRG32K3A, one of best from LCG/MCG/MRG-like families (see [2] on their description and [3] on their availability in MKL VSL RNG), and having quick skip-ahead method for multi-streaming,
  • MT2203, one from Mersenne Twister family (see [3]) that is easily multi-streamed in VSL RNG for 6024 independent streams.

        The benchmark generates 2^25 double precision random numbers in total. It is run on different numbers of used threads, to see how it scales. In the benchmark, we avoid memory footprint as much as possible by using blocking technique, so that each Open MP thread generates its own part of random numbers block-by-block, next block of results overrides previous one. Block size was 8*1024 elements on Intel® Xeon Phi® and 1024 elements on Intel® Xeon®. Blocks are allocated on stack in each Open MP thread. Performance is measured in number of random values or samples generated per one second, 1 Gsamples/s means 10^9 samples per second.

REFERENCES

[1] John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw. 2011. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM, New York, NY, USA, , Article 16 , 12 pages. DOI=10.1145/2063384.2063405 http://doi.acm.org/10.1145/2063384.2063405

[2] P. L’Ecuyer, “Random Number Generation”, chapter 3 of the Handbook of Computational Statistics, J. E. Gentle, W. Haerdle, and Y. Mori, eds., Second Edition, Springer-Verlag, 2012, 35-71

[3] Notes for Intel® MKL Vector Statistical Library, available at https://software.intel.com/en-us/mkl-vsnotes

For more complete information about compiler optimizations, see our Optimization Notice.