In today’s world, many applications involve graphics in one way or another. High-resolution graphics and game applications may require a huge amount of disk space and memory to store graphics data. The half-precision floating-point format can reduce the amount of graphics data and the memory bandwidth an application requires; however, the half-precision floating-point format can only be used to store data, not to operate on it. To perform operations on such data, a half-precision floating-point value must first be converted back to a single-precision floating-point value. This blog describes where the half-precision floating-point format is used and how Intel's new half-precision floating-point (float16) conversion instructions optimize the half-to-single and single-to-half conversion processes.
What is Half-Precision Floating-Point Format?
Half precision floating point is a 16-bit binary floating-point format. It is half the size of traditional 32-bit single precision floats. More information about half-precision floating-point format can be found at .
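As a rough illustration of the 16-bit layout (1 sign bit, 5 exponent bits with a bias of 15, and 10 fraction bits, as defined for binary16 in IEEE 754-2008), the following C sketch decodes a raw half-precision bit pattern in software. The function names half_bits_to_float and pow2i are made up for this example and are not part of any Intel API.

```c
#include <math.h>    /* NAN and INFINITY macros only */
#include <stdint.h>

/* Hypothetical helper: returns 2^e for a small integer e (exact in float). */
static float pow2i(int e)
{
    float r = 1.0f;
    while (e > 0) { r *= 2.0f; e--; }
    while (e < 0) { r *= 0.5f; e++; }
    return r;
}

/* Hypothetical helper: decode a raw binary16 bit pattern into a float. */
float half_bits_to_float(uint16_t h)
{
    int sign     = (h >> 15) & 0x1;   /* 1 sign bit      */
    int exponent = (h >> 10) & 0x1F;  /* 5 exponent bits */
    int fraction = h & 0x3FF;         /* 10 fraction bits */
    float value;

    if (exponent == 0) {
        /* Zero or subnormal: fraction * 2^-24. */
        value = (float)fraction * pow2i(-24);
    } else if (exponent == 0x1F) {
        /* All-ones exponent: infinity or NaN. */
        value = fraction ? NAN : INFINITY;
    } else {
        /* Normal: (1 + fraction/1024) * 2^(exponent - 15). */
        value = (float)(fraction + 1024) * pow2i(exponent - 25);
    }
    return sign ? -value : value;
}
```

For example, half_bits_to_float(0x3C00) yields 1.0f, and half_bits_to_float(0x7BFF) yields 65504.0f, the largest finite half-precision value.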
Where is Half-Precision Floating-Point Format Useful?
This format is used in many graphics environments, such as OpenEXR, JPEG XR, and OpenGL.
OpenEXR is a high dynamic-range (HDR) image file format developed by Industrial Light & Magic for use in computer imaging applications. OpenEXR was used in movies such as Harry Potter and the Sorcerer's Stone and Men in Black II. More information about OpenEXR can be found at .
JPEG XR, per Wikipedia, is a still-image compression standard and file format for continuous-tone photographic images, based on technology originally developed and patented by Microsoft* under the name HD Photo (formerly Windows Media Photo). More information about JPEG XR can be found at .
OpenGL is a cross-platform application programming interface for defining 2-D and 3-D graphic images. Before OpenGL, any company developing a graphical application typically had to rewrite the graphics part of it for each operating system. Since OpenGL is cross-platform, an application can create the same effects on any operating system using any OpenGL-adhering graphics adapter. More information about OpenGL can be found at .
Use Cases for Half-Precision Floating-Point Format
In this section, we will talk about how the half-precision floating-point format can be used in digital imaging applications such as Computed Tomography (CT) scanning. CT, also known as Computed Axial Tomography (CAT), is an x-ray procedure. Multiple images are taken during a CT scan, and a computer reconstructs them into complete, cross-sectional pictures ("slices") of soft tissue, bone and so on. More information about CT scanning can be found at .
CT has four major steps:
1) Scanning to generate images in memory
2) Saving images to disk
3) Loading images to memory
4) Reconstructing based on images.
By utilizing the half-precision floating-point format in steps 2 and 3, the disk space and the memory bandwidth required are each reduced by half compared with single precision. Step 4 has three major sub-steps: convolution, matrix transpose, and backprojection. Backprojection is the main step in reconstructing images. Here we are concerned only with the backprojection step, since it involves both loading images and computing on them. As images are loaded from disk into memory, they are still in half-precision floating-point format. After the load, they need to be converted back to single-precision (32-bit) floating-point format before they can be reconstructed. The backprojection step is computationally very intensive. More information about backprojection can be found at .
To speed up these conversions, Intel introduced new instructions in newer generations of Intel® processors.
Intel® Half-Precision Floating-Point Format Conversion Instructions
Newer Intel® processors, such as the Intel® Xeon® processor E5-2600 v2 family, have two new instructions that convert half-precision (16-bit) data to single-precision (32-bit) data and vice versa.
VCVTPS2PH: Converts data in single-precision floating-point format to half-precision floating-point format.
VCVTPH2PS: Converts data in half-precision floating-point format to single-precision floating-point format.
More information about these instructions can be found at  and 
To determine whether an Intel® processor supports these instructions, execute the CPUID instruction with register EAX set to 1. If bit 29 of the value returned in register ECX is 1, the processor supports these instructions.
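The CPUID check described above can be sketched in C as follows. This is a minimal sketch that assumes a GCC or Clang toolchain on x86, where the <cpuid.h> header provides __get_cpuid; the helper names ecx_reports_f16c and cpu_has_f16c are made up for this example.

```c
#include <cpuid.h>   /* GCC/Clang-specific header; provides __get_cpuid */

/* Hypothetical helper: test bit 29 of the ECX value returned by CPUID leaf 1. */
int ecx_reports_f16c(unsigned int ecx)
{
    return (int)((ecx >> 29) & 1u);
}

/* Hypothetical helper: returns 1 if the CPU reports the half-precision
   conversion instructions, 0 otherwise. */
int cpu_has_f16c(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx_reports_f16c(ecx);
}
```

An application could call cpu_has_f16c() once at startup and fall back to a software conversion when it returns 0. A production check would usually also confirm that the operating system has enabled AVX register state, which this sketch omits.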
The two new instructions are assembly language instructions, and not all applications are written in assembly language. Therefore, Intel also provides equivalent intrinsic functions (intrinsics) that can be used in C/C++. They are:
Converting from single precision to half precision
_mm256_cvtps_ph (for 256-bit vector)
_mm_cvtps_ph (for 128-bit vector)
Converting from half precision to single precision
_mm256_cvtph_ps (for 256-bit vector)
_mm_cvtph_ps (for 128-bit vector)
In the CT case above, if we want to use intrinsics, we first use the 128-bit load intrinsic, _mm_load_si128, to load 8 half-precision values, and then use _mm256_cvtph_ps to convert them to 8 single-precision values for the computation. When the computation finishes, we use _mm256_cvtps_ph, which also takes a rounding-control argument, to convert the results back to half-precision values, and _mm_store_si128 to store them to memory, from where they can be written to disk. Note that _mm_load_si128 and _mm_store_si128 require 16-byte-aligned addresses.
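The load, convert, compute, convert, store sequence can be sketched as below. This is an illustration under stated assumptions, not the CT reconstruction code itself: the function name scale_half_buffer and the multiply-by-a-constant "computation" are placeholders, the buffer length is assumed to be a multiple of 8, and the unaligned variants _mm_loadu_si128 and _mm_storeu_si128 are used so the sketch does not depend on 16-byte alignment. The target attribute is specific to GCC and Clang; with other compilers, enable AVX and F16C through compiler options instead.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: multiply n half-precision values (stored as raw
   16-bit patterns) by `scale`, in place. Assumes n is a multiple of 8;
   any leftover elements are not processed. */
__attribute__((target("avx,f16c")))
void scale_half_buffer(uint16_t *data, size_t n, float scale)
{
    const __m256 factor = _mm256_set1_ps(scale);

    for (size_t i = 0; i + 8 <= n; i += 8) {
        /* Load 8 half-precision values (128 bits). */
        __m128i halves = _mm_loadu_si128((const __m128i *)(data + i));

        /* Widen to 8 single-precision values and compute. */
        __m256 singles = _mm256_cvtph_ps(halves);
        singles = _mm256_mul_ps(singles, factor);

        /* Narrow back to half precision, rounding to nearest. */
        halves = _mm256_cvtps_ph(singles,
                                 _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

        /* Store the 8 half-precision results. */
        _mm_storeu_si128((__m128i *)(data + i), halves);
    }
}
```

Only the widened values occupy 256-bit registers; the data in memory stays at 16 bits per element before and after the computation.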
Details on how to use these instructions can be found at ,  and .
Utilizing the half-precision floating-point format halves the size of the data stored to disk. Note that the half-precision floating-point format is suitable only for applications that can tolerate some loss of precision from the conversion between half precision and single precision. Intel's new half-precision floating-point conversion instructions speed up the conversion from half precision to single precision and vice versa.
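To get a feel for the precision loss mentioned above, the following sketch pushes a single value through half precision and back using the 128-bit intrinsics. The function name roundtrip_through_half is made up for this example, round-to-nearest is an assumed choice of rounding mode, and the target attribute again assumes GCC or Clang on a processor that supports the conversion instructions.

```c
#include <immintrin.h>

/* Hypothetical helper: convert x to half precision (round to nearest)
   and back to single precision, returning what survives the round trip. */
__attribute__((target("f16c")))
float roundtrip_through_half(float x)
{
    __m128 single = _mm_set_ss(x);   /* x in the lowest lane, zeros elsewhere */
    __m128i half = _mm_cvtps_ph(single,
                                _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm_cvtss_f32(_mm_cvtph_ps(half));
}
```

With round-to-nearest, 1.0f survives unchanged, 0.1f comes back as 0.0999755859375, and 2049.0f comes back as 2048.0f, because half precision keeps only 11 significant bits.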
 Intel® 64 and IA-32 Architectures Optimization Reference Manual