Choose FP16, FP32 or int8 for Deep Learning Models

ID 660191
Updated 4/4/2020
Version Latest
Public

By mustang

[Image: Raspberry Pi* 3A+ with NCS]

Deep learning neural network models are available in multiple floating-point precisions. For the Intel® OpenVINO™ toolkit, both FP16 (half precision) and FP32 (single precision) are generally available for pre-trained and public models.
 
This article explores these floating-point representations in more detail and answers questions such as which precisions are compatible with different hardware devices.

Compatible precisions for CPU, GPU, HDDL-R, and NCS2 target hardware devices

  • Central Processing Unit (CPU): supports FP32 and Int8. The CPU plugin uses the Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN) and OpenMP.
  • Graphics Processing Unit (GPU): supports FP16 and FP32; FP16 is preferred.
  • HDDL-R (8 Vision Processing Units, MYRIAD): supports FP32 and FP16.
  • Vision Processing Unit (VPU, MYRIAD): supports FP16.
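The target device is chosen when the network is loaded with the Inference Engine. The minimal Python sketch below (the file names and the MYRIAD device choice are placeholders for illustration, assuming a 2020-era Inference Engine Python API) loads an FP16 IR onto an Intel® NCS 2:

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# Read an IR produced by the Model Optimizer (use an FP16 IR for MYRIAD/HDDL targets)
net = ie.read_network(model="model.xml", weights="model.bin")
# Change device_name to "CPU", "GPU", or "HDDL" for the other targets listed above
exec_net = ie.load_network(network=net, device_name="MYRIAD")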

Why do some precisions work with certain types of hardware and not with others?

When developing for the Intel® Neural Compute Stick 2 (Intel® NCS 2), Intel® Movidius™ VPUs, and the Intel® Arria® 10 FPGA, make sure that the model you deploy uses FP16 precision.

The Open Model Zoo, provided by Intel and the open-source community as a repository for publicly available pre-trained models, has nearly three dozen FP16 models that can be used right away with your applications. If these don’t meet your needs, or you want to download one of the models that are not already in an IR format, then you can use the Model Optimizer to convert your model for the Inference Engine and Intel® NCS 2.
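For example, a public model can be fetched with the Open Model Zoo downloader and converted to an FP16 IR with the Model Optimizer along the following lines (the model name, output directory, and script locations are illustrative, and flags can differ between OpenVINO versions):

python downloader.py --name squeezenet1.1
python mo.py --input_model public/squeezenet1.1/squeezenet1.1.caffemodel --data_type FP16 --output_dir ir_fp16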

How can choosing the right floating point precision improve performance with my application?  

Half-precision floating-point numbers (FP16) have a smaller range and lower precision than single-precision (FP32) numbers. Where half precision is sufficient for a model's accuracy, FP16 can result in better performance.
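As a quick illustration of the smaller range, NumPy can be used to inspect the FP16 format (the printed values are properties of the IEEE 754 half-precision format itself, not of OpenVINO):

import numpy as np

print(np.finfo(np.float16).max)   # about 65504, the largest representable FP16 value
print(np.float16(70000.0))        # inf: this value overflows the FP16 range
print(np.float16(3.14159265))     # about 3.14: roughly three decimal digits of precision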

Advantages of FP16

  • FP16 improves speed (TFLOPS) and performance
  • FP16 reduces memory usage of a neural network
  • FP16 data transfers are faster than FP32

  • Memory access: FP16 is half the size of FP32.
  • Cache: FP16 values take up half the cache space, which frees cache for other data.
  • Bandwidth: FP16 requires half the memory bandwidth, which frees bandwidth for other operations in the application.
  • Storage and disk I/O: FP16 requires half the storage space and disk I/O.

Disadvantages of FP16

The disadvantage of half-precision floats is that they must be converted to/from 32-bit floats before they are operated on. However, because the instructions for half-float conversion are very fast, there are still many situations in which storing floating-point values as half-floats produces better performance than using 32-bit floats.
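The following NumPy sketch (array size chosen only for illustration) shows both effects: the FP16 copy of a weight tensor occupies half the memory, and the round trip back to FP32 introduces only a small rounding error:

import numpy as np

weights_fp32 = np.random.rand(1_000_000).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)    # store in half precision

print(weights_fp32.nbytes, weights_fp16.nbytes)   # 4000000 vs 2000000 bytes

# Values are widened back to FP32 before arithmetic on hardware without
# native FP16 math; the round trip loses a little precision.
round_trip = weights_fp16.astype(np.float32)
print(np.abs(weights_fp32 - round_trip).max())    # on the order of 1e-4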

Using the Model Optimizer also produces a more compact model for inference. Among other optimizations, it merges layers where possible, for example by fusing linear operations such as batch normalization and scale shifts into adjacent convolutions.

Calibration tool and Int8

The Inference Engine calibration tool is a Python* command-line tool located in the following directory:

~/openvino/deployment_tools/tools 

The calibration tool is used to calibrate an FP32 model to a low-precision 8-bit integer (Int8) mode while keeping the input data of the model in the original precision. The tool reads the FP32 model and a calibration dataset and creates a low-precision model, which differs from the original model in the following ways:

1. Per-channel statistics are defined.

2. A quantization_level layer attribute is defined. This attribute specifies the precision used during inference.
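To make the idea concrete, the NumPy sketch below illustrates one common way per-channel statistics can be turned into an Int8 quantization (a generic symmetric min/max scheme for illustration; this is not the calibration tool's actual implementation):

import numpy as np

def quantize_per_channel(activations):
    # activations: float32 tensor of shape (channels, n)
    # 1. Per-channel statistics: the maximum absolute value on the calibration data
    max_abs = np.abs(activations).max(axis=1, keepdims=True)
    scale = max_abs / 127.0                                # map each channel onto [-127, 127]
    # 2. Quantize to int8 for inference; keep the scale for dequantization
    q = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
    return q, scale

data = np.random.randn(16, 1000).astype(np.float32)       # stand-in calibration activations
q, scale = quantize_per_channel(data)
reconstructed = q.astype(np.float32) * scale               # dequantize to check the error
print(np.abs(data - reconstructed).max())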

8-bit Inference Topologies

The 8-bit inference feature was validated on the topologies listed below.

Classification models:

  • Caffe Inception v1, Inception v4
  • Caffe ResNet-50 v1, ResNet-101 v1
  • Caffe MobileNet
  • Caffe SqueezeNet v1.0, SqueezeNet v1.1
  • Caffe VGG16, VGG19
  • TensorFlow Inception v3, Inception v4, Inception ResNet v2
  • Caffe DenseNet-121, DenseNet-161, DenseNet-169, DenseNet-201

Object detection models:

  • Caffe SSD_SqueezeNet
  • Caffe SSD_MobileNet
  • Caffe SSD_Vgg16_300
  • TensorFlow SSD Mobilenet v1, SSD Mobilenet v2

Semantic segmentation models:

  • Unet2D

For more information about running inference with Int8, see the Use the Calibration Tool article.