# Performance Benefits of Half Precision Floats

By Patrick Christian Konsor, published on August 15, 2012

Half precision floats are 16-bit floating-point numbers, which are half the size of traditional 32-bit single precision floats, and have lower precision and smaller range. When high precision is not required, half-floats can be a useful format for storing floating-point numbers because they require half the storage space and half the memory bandwidth. Using the new half-float conversion instructions introduced with the 3rd generation Intel® Core™ processor family, it is possible to gain better performance in applications that load or store floating-point values in certain scenarios where 16-bit precision is sufficient. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache.

## 1. Introduction to Half Precision Floats

Half precision floats are 16-bit floating-point numbers, half the size of traditional 32-bit single precision floats; half precision floats are also known as binary16 in the IEEE standard, float16, FP16, or simply half-floats. Because they are half the size, half precision floats have a smaller range and lower precision than single or double precision floats, and for this reason half-floats are not considered ideal for computation. However, half-floats also require half the storage and bandwidth of 32-bit floats, which can make them ideal for storing floating-point values in many situations where precision isn’t critical. Additionally, unlike 8-bit or 16-bit integer formats, half-floats have a dynamic range, meaning they have relatively high precision for values near zero, but lower precision for values far from zero.

Half precision floats can express values in the range ±65,504. The precision of the values ranges from as fine-grained as 0.0000000596046 for values nearest zero, up to 32 for values in the range 32,768 – 65,504 (meaning a value in this range will be rounded to a multiple of 32). If you are dealing with values larger than ±65,504, it is recommended that you do not use half-floats. For more information on the specifics of the half-float format, see the IEEE 754-2008 Standard.

## 2. Using Half-Floats

Because the half precision floating-point format is a storage format, the only operation performed on half-floats is conversion to and from 32-bit floats. The 3rd generation Intel® Core™ processor family introduced two half-float conversion instructions: `vcvtps2ph` for converting from 32-bit float to half-float, and `vcvtph2ps` for converting from half-float to 32-bit float.

You can utilize these instructions without writing assembly by using the corresponding intrinsics: `_mm256_cvtps_ph` for converting from 32-bit float to half-float, and `_mm256_cvtph_ps` for converting from half-float to 32-bit float (`_mm_cvtps_ph` and `_mm_cvtph_ps` for 128-bit vectors). If you use the Intel® Compiler, these intrinsics are supported even when compiling for processors that don’t support the half-float conversion instructions; in this case the conversions are done in software via optimized library calls. Although this is slower than the native hardware instructions, it offers a consistent source code implementation across processor products. To direct the Intel® Compiler to produce the conversion instructions for execution on the 3rd generation Intel® Core™ family (or later), you can either compile the entire file with the `-xCORE-AVX-I` flag (`/QxCORE-AVX-I` on Windows*), or use the Intel® specific optimization pragma with `target_arch=CORE-AVX-I` for the individual function(s) (see Figure 1).

Figure 1. Example of using the Intel® specific optimization pragma to direct the Intel® compiler to utilize the half-float conversion instructions. The resulting assembly is also shown.

```c
#pragma intel optimization_parameter target_arch=CORE-AVX-I
__declspec(noinline) void float2half(float* floats, short* halfs) {
    // load 8 single-precision floats, convert to 8 half-floats, and store
    __m256 float_vector = _mm256_load_ps(floats);
    __m128i half_vector = _mm256_cvtps_ph(float_vector, 0);
    _mm_store_si128((__m128i*)halfs, half_vector);
}
```
```
vmovups   (%rax), %ymm0
vcvtps2ph $0x0, %ymm0, (%rbx)
```

The `vcvtps2ph` instruction and the `_mm[256]_cvtps_ph` intrinsics take an immediate byte argument for rounding control, which is encoded as follows:

Table 1. Immediate byte encoding for half-float conversion instructions.

| Bits | Value (binary) | Description |
| --- | --- | --- |
| imm[1:0] | 00 | Round to nearest even |
| imm[1:0] | 01 | Round down |
| imm[1:0] | 10 | Round up |
| imm[1:0] | 11 | Truncate |
| imm[2] | 0 | Use imm[1:0] for rounding |
| imm[2] | 1 | Use MXCSR.RC for rounding |

## 3. Performance Benefits

### 3.1. Background

Half precision floats have several inherent advantages over 32-bit floats when accessing memory:

1. They are half the size and thus may fit into a lower level of cache with a lower latency.
2. They take up half the cache space, which frees up cache space for other data in your program.
3. They require half the memory bandwidth, which frees up that bandwidth for other operations in your program.

Half precision floats also have advantages when stored to disk because they require half the storage space and disk IO. The disadvantage of half precision floats is that they must be converted to/from 32-bit floats before they’re operated on. However, because the new half-float conversion instructions are very fast, there are several situations in which using half-floats for storing floating-point values can produce better performance than using 32-bit floats.

Consider the common generic operation where floating-point data is repeatedly loaded from an array, operated on, and stored back to the array. This generic operation is shown in Figure 2 using 32-bit floats. With Intel® AVX, 256 bits (8 single-precision floats) can be loaded or stored in one operation. Note that if the size were not divisible by 8, a prologue and epilogue would be needed to handle the remaining elements.

Figure 2. Generic operation on single-precision data.

```c
float* array; size_t size;
for (size_t i = 0; i < size; i += 8) {
    __m256 vector = _mm256_load_ps(array + i);
    // computation(s) on vector
    _mm256_store_ps(array + i, vector);
}
```

This same operation on the same number of array elements can be implemented with half-floats by using 128-bit loads/stores (8 half-floats) and adding instructions for converting to/from 32-bit float, as shown in Figure 3.

Figure 3. Generic operation on half-precision data, using 128-bit loads and stores.

```c
uint16_t* array; size_t size;
for (size_t i = 0; i < size; i += 8) {
    __m256 vector = _mm256_cvtph_ps(_mm_load_si128((__m128i*)(array + i)));
    // computation(s) on vector
    _mm_store_si128((__m128i*)(array + i), _mm256_cvtps_ph(vector, 0 /*rounding*/));
}
```

Using half-floats to perform this generic operation can provide a performance benefit over 32-bit floats in certain situations, depending on the size of the array.

### 3.2. Results

Using half-floats provides a performance benefit over 32-bit floats when the 32-bit float data does not fit into the L1 cache. Specifically, half-floats provide an average speedup of 1.05x when the 32-bit data would fit in the L2 cache, an average speedup of 1.3x when it would fit in the L3 cache, and an average speedup of 1.6x when it only fits in main memory. Additionally, while half-floats may not provide a direct performance benefit when the 32-bit data would fit into the L1 cache, you may still experience an auxiliary benefit because half-floats use half as much space, which allows significantly more of your program’s data to reside in L1.

The graph below shows speedups for array sizes from 8K array elements up to 8M elements, where each 32-bit float element is 4 bytes, and each half-float element is 2 bytes.

These performance statistics are based on tests that were run on a 1.8 GHz 3rd generation Intel® Core™ processor with a 32 KB L1 data cache, 256 KB L2 cache, 4 MB L3 cache, and 4 GB of memory. The code was compiled with Intel® Compiler 12.1 with these flags: `-O3 -xCORE-AVX-I`.

Figure 4. Graph of the speedups of half-floats over 32-bit floats.

## 4. Recommendations

Assuming that the precision and range of half precision floats are acceptable for your application, we recommend using half precision floats when the 32-bit data would not fit into the L1 cache and you access data in a manner similar to our generic linear load-operate-store usage pattern. We also recommend that you always align your arrays to the proper alignment, which is 16 bytes for 128-bit data and 32 bytes for 256-bit data.

## 5. Summary

Half precision floats are 16-bit floating-point numbers, which are half the size of traditional 32-bit single precision floats, and have lower precision and smaller range. When high precision is not required, half-floats can be a useful format for storing floating-point numbers because they require half the storage space and half the memory bandwidth. Using the new half-float conversion instructions introduced with the 3rd generation Intel® Core™ processor family, it is possible to gain better performance in applications that store floating-point values in certain usage scenarios.

Patrick Konsor is an Application Engineer on the Apple Enabling Team at Intel in Santa Clara, and specializes in software optimization. Patrick received his BS in computer science from the University of Wisconsin-Eau Claire. He enjoys cycling in his free time (go Schlecks!).


#### Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804