Performance Benefits of Half Precision Floats

Half precision floats are 16-bit floating-point numbers, which are half the size of traditional 32-bit single precision floats, and have lower precision and smaller range. When high precision is not required, half-floats can be a useful format for storing floating-point numbers because they require half the storage space and half the memory bandwidth. Using the new half-float conversion instructions introduced with the 3rd generation Intel® Core™ processor family, it is possible to gain better performance in applications that load or store floating-point values in certain scenarios where 16-bit precision is sufficient. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache.

1. Introduction to Half Precision Floats

Half precision floats are 16-bit floating-point numbers, half the size of traditional 32-bit single precision floats; half precision floats are also known as binary16 in the IEEE standard, float16, FP16, or simply half-floats. Because they are half the size, half precision floats have a smaller range and lower precision than single or double precision floats, and for this reason half-floats are not considered ideal for computation. However, half-floats also require half the storage and bandwidth of 32-bit floats, which can make them ideal for storing floating-point values in many situations where precision isn't critical. Additionally, unlike 8-bit or 16-bit integer formats, half-floats have a dynamic range: they offer relatively fine precision for values near zero and progressively coarser precision for values far from zero.

Half precision floats can express values in the range ±65,504. The precision ranges from as fine-grained as 0.0000000596046 for values nearest zero up to 32 for values in the range 32,768 – 65,536 (meaning a value in this range will be rounded to a multiple of 32). If you are dealing with values outside the range ±65,504, it is recommended that you do not use half-floats. For more information on the specifics of the half-float format, see the IEEE 754-2008 standard.

2. Using Half-Floats

Because the half precision floating-point format is a storage format, the only operation performed on half-floats is conversion to and from 32-bit floats. The 3rd generation Intel® Core™ processor family introduced two half-float conversion instructions: vcvtps2ph for converting from 32-bit float to half-float, and vcvtph2ps for converting from half-float to 32-bit float.

You can utilize these instructions without writing assembly by using the corresponding intrinsics: _mm256_cvtps_ph for converting from 32-bit float to half-float, and _mm256_cvtph_ps for converting from half-float to 32-bit float (_mm_cvtps_ph and _mm_cvtph_ps for 128-bit vectors). If you use the Intel® Compiler, these intrinsics are supported even when compiling for processors that don't support the half-float conversion instructions; in this case the conversions are done in software via optimized library calls. Even though the performance will be slower than the native hardware instructions, this approach offers a consistent source code implementation across processor products. To direct the Intel® Compiler to produce the conversion instructions for execution on the 3rd generation Intel® Core™ family (or later), you can either compile the entire file with the -xCORE-AVX-I flag (/QxCORE-AVX-I on Windows*), or use the Intel-specific optimization pragma with target_arch=CORE-AVX-I for the individual function(s) (see Figure 1).

Figure 1. Example of using the Intel® specific optimization pragma to direct the Intel® compiler to utilize the half-float conversion instructions. The resulting assembly is also shown.

#pragma intel optimization_parameter target_arch=CORE-AVX-I
__declspec(noinline) void float2half(float* floats, short* halfs) {
	__m256 float_vector = _mm256_load_ps(floats);
	__m128i half_vector = _mm256_cvtps_ph(float_vector, 0);
	_mm_store_si128((__m128i*)halfs, half_vector);
}
vmovups (%rax), %ymm0
vcvtps2ph $0x0, %ymm0, (%rbx)

The vcvtps2ph instruction and the _mm[256]_cvtps_ph intrinsic take an immediate byte argument for rounding control, which is encoded as follows:

Table 1. Immediate byte encoding for half-float conversion instructions.

Bits      Value (binary)  Description
imm[1:0]  00              Round to nearest even
          01              Round down
          10              Round up
          11              Truncate
imm[2]    0               Use imm[1:0] for rounding
          1               Use MXCSR.RC for rounding

3. Performance Benefits

3.1. Background

Half precision floats have several inherent advantages over 32-bit floats when accessing memory: 1) they are half the size and thus may fit into a lower level of cache with a lower latency, 2) they take up half the cache space, which frees up cache space for other data in your program, and 3) they require half the memory bandwidth, which frees up that bandwidth for other operations in your program. Half precision floats also have advantages when stored to disk because they require half the storage space and disk IO. The disadvantage of half precision floats is that they must be converted to/from 32-bit floats before they’re operated on. However, because the new instructions for half-float conversion are very fast, they create several situations in which using half-floats for storing floating-point values can produce better performance than using 32-bit floats.

Consider the common generic operation where floating-point data is repeatedly loaded from an array, operated on, and stored back to the array. This generic operation is shown in Figure 2 using 32-bit floats. With Intel® AVX, 256 bits (8 single-precision floats) can be loaded or stored in one operation. Note that if the size were not divisible by 8, a prologue and epilogue would be needed to handle the remaining elements.

Figure 2. Generic operation on single-precision data.

float* array; size_t size;
for (size_t i = 0; i < size; i += 8) {
	__m256 vector = _mm256_load_ps(array + i);
	// computation(s) on vector
	_mm256_store_ps(array + i, vector);
}

This same operation on the same number of array elements can be implemented with half-floats by using 128-bit loads/stores (8 half-floats) and adding instructions for converting to/from 32-bit float, as shown in Figure 3.

Figure 3. Generic operation on half-precision data, using 128-bit loads and stores.

uint16_t* array; size_t size;
for (size_t i = 0; i < size; i += 8) {
	__m256 vector = _mm256_cvtph_ps(_mm_load_si128((__m128i*)(array + i)));
	// computation(s) on vector
	_mm_store_si128((__m128i*)(array + i), _mm256_cvtps_ph(vector, 0 /*rounding*/));
}

Using half-floats to perform this generic operation can provide a performance benefit over 32-bit floats in certain situations, depending on the size of the array.

3.2. Results

Using half-floats provides a performance benefit over 32-bit floats when 32-bit float data does not fit into the L1 cache. Specifically, half-floats provide an average speedup of 1.05x when 32-bit data would fit in the L2 cache, an average speedup of 1.3x when 32-bit data would fit in the L3 cache, and an average speedup of 1.6x when the 32-bit data would only fit in main memory. Additionally, while half-floats may not provide a direct performance benefit when 32-bit data would fit into the L1 cache, you may still experience an auxiliary benefit when using half-floats in your program because half-floats use half as much space, which allows significantly more of your program's data to reside in L1.

The graph below shows speedups for array sizes from 8K array elements up to 8M elements, where each 32-bit float element is 4 bytes, and each half-float element is 2 bytes.

These performance statistics are based on tests that were run on a 1.8 GHz 3rd generation Intel® Core™ processor with a 32 KB L1 data cache, 256 KB L2 cache, 4 MB L3 cache, and 4 GB of memory. The code was compiled with Intel® Compiler 12.1 with these flags: -O3 -xCORE-AVX-I.

Figure 4. Graph of the speedups of half-floats over 32-bit floats.

4. Recommendations

Assuming that the precision and range of half precision floats are acceptable for your application, we recommend using half precision floats when 32-bit data would not fit into the L1 cache and the data is accessed in a manner similar to our generic linear load-operate-store usage pattern. We also recommend that you always align your arrays to the proper alignment: 16 bytes for 128-bit data and 32 bytes for 256-bit data.

5. Summary

Half precision floats are 16-bit floating-point numbers, which are half the size of traditional 32-bit single precision floats, and have lower precision and smaller range. When high precision is not required, half-floats can be a useful format for storing floating-point numbers because they require half the storage space and half the memory bandwidth. Using the new half-float conversion instructions introduced with the 3rd generation Intel® Core™ processor family, it is possible to gain better performance in applications that store floating-point values in certain usage scenarios.

6. About the Author

Patrick Konsor is an Application Engineer on the Apple Enabling Team at Intel in Santa Clara, and specializes in software optimization. Patrick received his BS in computer science from the University of Wisconsin-Eau Claire. He enjoys cycling in his free time (go Schlecks!).

7. Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2012 Intel Corporation. All rights reserved.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804