Getting Started with AutoMixedPrecisionMkl

By Niranjan Hasabnis, Preethi Venkatesh, and Rachel Oberman

AutoMixedPrecisionMkl is a grappler pass that automatically converts a model written in the FP32 data type to operate in the BFloat16 data type. To be precise, it scans the data-flow graph corresponding to the model, looks for nodes in the graph (also called operators) that can operate in BFloat16 type, and inserts FP32-to-BFloat16 and BFloat16-to-FP32 Cast nodes in the graph where appropriate (a minimal sketch of such casts follows the list below). This feature will be supported starting with TensorFlow* 2.3. We will demonstrate it with examples illustrating:

  • How to convert the graph to BFloat16 on-the-fly as you train the model

  • How to convert a pre-trained FP32 model to BFloat16
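
To give a concrete sense of what the inserted Cast nodes do, here is a minimal hand-written sketch of an FP32-to-BFloat16 round trip. This is ordinary TensorFlow* code, not the grappler pass itself; the pass automates exactly this kind of conversion around BFloat16-capable operators.

import tensorflow as tf

# Hand-written FP32 <-> BFloat16 round trip; AutoMixedPrecisionMkl inserts
# equivalent Cast nodes into the graph automatically.
x_fp32 = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)
x_bf16 = tf.cast(x_fp32, tf.bfloat16)  # FP32 -> BFloat16
y_bf16 = x_bf16 * x_bf16               # compute in BFloat16
y_fp32 = tf.cast(y_bf16, tf.float32)   # BFloat16 -> FP32
print(y_fp32)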

Let's consider a simple neural network consisting of a typical pattern: a Conv2D with bias addition, whose output is clipped using ReLU. Inputs x and w of Conv2D and input b of bias_add are TensorFlow* Variables. Notice that the default type of a Variable in TensorFlow* is FP32, so this neural network model operates completely in the FP32 data type. The TensorFlow* data-flow graph corresponding to this model looks like the figure below.


Before running the examples, build or install the latest TensorFlow* with BFloat16 support.

Once TensorFlow* 2.3 is released on the Anaconda channel, follow the instructions below to create an environment and install TensorFlow 2.3:

conda create -n tf_mkl_2 tensorflow==2.3 python=3.7 -c intel 
source activate tf_mkl_2 
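
You can verify the conda installation with a quick check:

python -c "import tensorflow as tf; print(tf.__version__)"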

(or)

Build TensorFlow* from source, based on the master branch, with the following flags:

bazel build --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 --copt=-O3 --copt=-Wformat --copt=-Wformat-security --copt=-fstack-protector --copt=-fPIC --copt=-fpic --linkopt=-znoexecstack --linkopt=-zrelro --linkopt=-znow --linkopt=-fstack-protector --config=mkl --define build_with_mkl_dnn_v1_only=true --copt=-DENABLE_INTEL_MKL_BFLOAT16 --copt=-march=native //tensorflow/tools/pip_package:build_pip_package
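
After the bazel build completes, build the pip package and install the resulting wheel (the standard TensorFlow*-from-source steps; the /tmp/tensorflow_pkg output directory is just an example):

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl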

(or)

Use Intel's tf2.2-nightly container, which has BFloat16 support: docker image

Let's create a sample workload, conv2D_fp32.py, with Conv2D and ReLU layers.

import tensorflow as tf 
from tensorflow.core.protobuf import rewriter_config_pb2  

# Disable Eager execution mode 
tf.compat.v1.disable_eager_execution()  

def conv2d(x, w, b, strides=1): 
    # Conv2D wrapper, with bias and relu activation 
    x = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME') 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x)  

X = tf.Variable(tf.compat.v1.random_normal([784])) 
W = tf.Variable(tf.compat.v1.random_normal([5, 5, 1, 32])) 
B = tf.Variable(tf.compat.v1.random_normal([32])) 
x = tf.reshape(X, shape=[-1, 28, 28, 1])  

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto()) as sess: 
    sess.run(tf.compat.v1.global_variables_initializer()) 
    sess.run([conv2d(x, W, B)]) 

Run the script:

python conv2D_fp32.py 

This runs a regular FP32 Conv2D graph.
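
If you want to confirm that the graph is indeed FP32, one option is the following small diagnostic sketch (not part of the original sample), appended to the end of conv2D_fp32.py after the graph has been built:

# Diagnostic sketch: print the dtype attribute of the compute nodes.
# DT_FLOAT is enum value 1 in TensorFlow's types.proto.
graph_def = tf.compat.v1.get_default_graph().as_graph_def()
for node in graph_def.node:
    if node.op in ("Conv2D", "BiasAdd", "Relu"):
        print(node.name, node.op, node.attr["T"].type)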

1. Steps to Train a BFloat16 Model

Porting the above workload to the BFloat16 data type using AutoMixedPrecisionMkl requires adding the following lines of Python* code to the model:

graph_options=tf.compat.v1.GraphOptions( 
        rewrite_options=rewriter_config_pb2.RewriterConfig( 
            auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON)) 

In a nutshell, the AutoMixedPrecisionMkl grappler pass is controlled through the RewriterConfig proto of the GraphOptions proto. Possible values for auto_mixed_precision_mkl are rewriter_config_pb2.RewriterConfig.ON and rewriter_config_pb2.RewriterConfig.OFF.

The default value is rewriter_config_pb2.RewriterConfig.OFF.

Add the code snippet given above just before initializing a TensorFlow* session, and pass graph_options to the session to enable the grappler pass.

The complete code for the neural network model ported to BFloat16 type using AutoMixedPrecisionMkl is below.

import tensorflow as tf 
from tensorflow.core.protobuf import rewriter_config_pb2 

tf.compat.v1.disable_eager_execution()

def conv2d(x, w, b, strides=1): 
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME') 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x)  

X = tf.Variable(tf.compat.v1.random_normal([784])) 
W = tf.Variable(tf.compat.v1.random_normal([5, 5, 1, 32])) 
B = tf.Variable(tf.compat.v1.random_normal([32])) 
x = tf.reshape(X, shape=[-1, 28, 28, 1])  

graph_options=tf.compat.v1.GraphOptions( 
        rewrite_options=rewriter_config_pb2.RewriterConfig( 
            auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON))  

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto( 
        graph_options=graph_options)) as sess: 
    sess.run(tf.compat.v1.global_variables_initializer()) 
    sess.run([conv2d(x, W, B)]) 

Notice that the graph_options variable created by turning AutoMixedPrecisionMkl ON is passed to ConfigProto, which is eventually passed to the tf.compat.v1.Session API.

Save the code above as conv2D_bf16.py and run it:

python conv2D_bf16.py 

Console output:

2020-06-12 08:44:07.306404: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.306987: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do 
2020-06-12 08:44:07.309752: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.309866: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do 
2020-06-12 08:44:07.326082: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.326278: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 2/11 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 
2020-06-12 08:44:07.327683: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-12 08:44:07.327814: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/14 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 

The data-flow graph after porting the model to BFloat16 type looks like the figure below.


Notice that two operators, Conv2D+BiasAdd and ReLU, are automatically converted to operate in BFloat16 type. Also note that appropriate Cast nodes are inserted into the graph to convert TensorFlow* tensors from FP32 to BFloat16 and vice versa.

2. Steps to Convert a Pre-trained FP32 Model to BFloat16

In the previous section, we saw how AutoMixedPrecisionMkl can automatically convert certain nodes to BFloat16 while training a sample model. This section covers how to convert a pre-trained FP32 model to BFloat16.

2.1 Modify conv2D_fp32.py to Save the Trained Model

import tensorflow as tf 
import tensorflow.python.saved_model 
from tensorflow.python.saved_model import tag_constants 
from tensorflow.python.saved_model.signature_def_utils_impl import predict_signature_def 

# Disable Eager execution mode 
tf.compat.v1.disable_eager_execution() 

def conv2d(x, w, b, strides=1): 
    # Conv2D wrapper, with bias and relu activation 
    x = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME') 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x, name="myOutput")  

X = tf.Variable(tf.compat.v1.random_normal([784]), name="myInput") 
W = tf.Variable(tf.compat.v1.random_normal([5, 5, 1, 32])) 
B = tf.Variable(tf.compat.v1.random_normal([32])) 
x = tf.reshape(X, shape=[-1, 28, 28, 1]) 

export_dir='./model'  

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto()) as sess: 
    sess.run(tf.compat.v1.global_variables_initializer()) 
    y=conv2d(x,W,B) 
    sess.run([y]) 
    builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_dir) 
    signature = predict_signature_def(inputs={'myInput': X}, 
                                  outputs={'myOutput': y}) 
    builder.add_meta_graph_and_variables(sess=sess, 
                                     tags=["myTag"], 
                                     signature_def_map={'predict': signature}) 
    builder.save() 

2.2 Run the Python* Script and Check for saved_model.pb Stored at the ./model Location

python conv2D_fp32.py
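
Optionally, you can inspect the saved model with TensorFlow's saved_model_cli tool to confirm that the myTag tag set and the predict signature were written:

saved_model_cli show --dir ./model --all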

2.3 Run AutoMixedPrecisionMkl on the Saved Model

2.3.1 Create gen_bf16_pb.py Python* Script

Initialize graph_options with AutoMixedPrecisionMkl, then load and convert the saved FP32 model.

from argparse import ArgumentParser 
from tensorflow.core.protobuf import config_pb2 
from tensorflow.core.protobuf import rewriter_config_pb2 
from tensorflow.python.grappler import tf_optimizer 
from tensorflow.python.tools import saved_model_utils 
import tensorflow as tf 
import time   

parser = ArgumentParser() 
parser.add_argument("input_dir", help="Input directory containing saved_model.pb.", type=str) 
parser.add_argument("as_text", help="Output graph in text protobuf format." 
                    "If False, would dump in binary format", type=bool) 
parser.add_argument("output_dir", help="Directory to store output graph.", type=str) 
parser.add_argument("output_graph", help="Output graph name. (e.g., foo.pb," 
                    "foo.pbtxt, etc)", type=str) 
args = parser.parse_args()   

graph_options = tf.compat.v1.GraphOptions(rewrite_options=rewriter_config_pb2.RewriterConfig(
                               auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON)) 
optimizer_config = tf.compat.v1.ConfigProto(graph_options=graph_options) 
metagraph_def = saved_model_utils.get_meta_graph_def(args.input_dir, "myTag") 
output_graph = tf_optimizer.OptimizeGraph(optimizer_config, metagraph_def) 
tf.io.write_graph(output_graph, args.output_dir, args.output_graph, 
                  as_text=args.as_text) 

2.3.2 Run the Conversion Script

python gen_bf16_pb.py ./model True ./model saved_model_bf16.pbtxt

Console output:

2020-06-15 11:45:19.612781: I tensorflow/core/grappler/devices.cc:78] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support) 
2020-06-15 11:45:19.612978: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session 
2020-06-15 11:45:19.663030: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3300000000 Hz 
2020-06-15 11:45:19.675988: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5592f0fe00f0 initialized for platform Host (this does not guarantee that XLA will be used). devices: 
2020-06-15 11:45:19.676016: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-15 11:45:19.696569: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 2/59 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 
2020-06-15 11:45:19.700334: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1924] Converted 0/49 nodes to bfloat16 precision using 0 cast(s) to bfloat16 (excluding Const and Variable casts) 

Again, notice that two operators are converted to BFloat16 type. The script also supports dumping the model as a protobuf binary file (.pb) or a text file (.pbtxt).
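
As a quick sanity check, the following sketch (assuming the text-format output saved_model_bf16.pbtxt from the previous step) parses the dumped graph and counts the nodes that now carry a BFloat16 dtype attribute:

from google.protobuf import text_format
from tensorflow.core.framework import graph_pb2, types_pb2

# Parse the text-format GraphDef dumped by gen_bf16_pb.py.
graph_def = graph_pb2.GraphDef()
with open("./model/saved_model_bf16.pbtxt") as f:
    text_format.Parse(f.read(), graph_def)

# Count nodes whose T attribute is DT_BFLOAT16.
bf16_nodes = [n.name for n in graph_def.node
              if n.attr["T"].type == types_pb2.DT_BFLOAT16]
print(len(bf16_nodes), "bfloat16 node(s):", bf16_nodes)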

Controlling the Operators That Are Automatically Ported to BFloat16 Type

We provide some more details about AutoMixedPrecisionMkl for interested readers. An important point to note is that not all of TensorFlow's* operators for the CPU backend support BFloat16 type. This could be because the support is missing (and is a work in progress), or because the BFloat16 version of an operator may not offer much performance improvement over the FP32 version.

Furthermore, BFloat16 type for certain operators could lead to numerical instability of a neural network model. We therefore categorize the TensorFlow* operators that the MKL backend supports in BFloat16 type by whether 1) they are always numerically stable, 2) they are always numerically unstable, or 3) their stability depends on the context. The AutoMixedPrecisionMkl pass captures these categories in White, Black, and Grey lists of operators, respectively. The exact lists can be found in the auto_mixed_precision_lists.h file in the TensorFlow* GitHub repository.

The default contents of these lists already capture the most common BFloat16 usage models while ensuring numerical stability of the model. There is, however, a way to add or remove operators from any of these lists by setting the environment variables that control them. For instance, executing

export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_BLACKLIST_ADD=Conv2D 

before running the model adds the Conv2D operator to the Black list, while executing

export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_WHITELIST_REMOVE=Conv2D 

before running the model removes Conv2D from the White list. Executing both of these commands before running a model moves Conv2D from the White list to the Black list.

In general, the template for the names of the environment variables controlling these lists is:

TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_${LIST}_${OP}=operator

where ${LIST} would be any of {WHITE, BLACK, GREY}, and ${OP} would be any of {ADD, REMOVE}.
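
The same variables can also be set from Python, provided this happens before the session (and hence the grappler pass) runs:

import os

# Equivalent to the shell exports shown below; must run before the
# session executes so the grappler pass sees the variables.
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_BLACKLIST_ADD"] = "Conv2D"
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_WHITELIST_REMOVE"] = "Conv2D"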

To test this feature of adding an op to the Black list and removing it from the White list, run the code sample conv2D_bf16.py with the environment variables set for the Conv2D op:

export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_BLACKLIST_ADD=Conv2D 
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_WHITELIST_REMOVE=Conv2D  
python conv2D_bf16.py 

As you can see by comparing the console output below with that of Step 1, the Conv2D ops have been removed from the White list and added to the Black list, which prevents those nodes from being converted to their BFloat16 equivalents.

2020-06-16 17:13:57.655035: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3300000000 Hz 
2020-06-16 17:13:57.668690: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f592d69360 initialized for platform Host (this does not guarantee that XLA will be used). devices: 
2020-06-16 17:13:57.668723: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version 
2020-06-16 17:13:57.668818: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance. 
2020-06-16 17:13:57.674947: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-16 17:13:57.675318: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do 
2020-06-16 17:13:57.676833: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-16 17:13:57.676945: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do 
2020-06-16 17:13:57.688769: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer 
2020-06-16 17:13:57.688879: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do 
2020-06-16 17:13:57.689922: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1980] Running auto_mixed_precision_mkl graph optimizer
2020-06-16 17:13:57.690025: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1345] No whitelist ops found, nothing to do
