Guide to TensorFlow Runtime optimizations for CPU

Published: 07/30/2020, Last Updated: 07/30/2020

Overview

Runtime settings, particularly those related to threading, can greatly affect the performance of TensorFlow workloads running on CPUs.

OpenMP and TensorFlow both have settings that should be considered for their effect on performance.  

The Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) within the Intel® Optimization for TensorFlow uses OpenMP settings as environment variables to affect performance on Intel CPUs. 

TensorFlow exposes settings that affect performance through the ConfigProto class (version 1.X) or the tf.config module (version 2.X).

This guide describes these settings, their usage, and how to apply them.

OpenMP settings descriptions

  • OMP_NUM_THREADS
    • Maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application.
    • Recommend: start with the number of physical cores/socket on the test system, and try increasing and decreasing
  • KMP_BLOCKTIME
    • Time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.
    • Recommend: start with 1 and try increasing
  • KMP_AFFINITY
    • Restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer. Only valid if Hyperthreading is enabled.
    • Recommend: granularity=fine,verbose,compact,1,0
  • KMP_SETTINGS
    • Enables (TRUE) or disables (FALSE) printing of OpenMP run-time library environment variables during execution
    • Recommend: Start with TRUE to ensure settings are being utilized, then use as needed
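
The starting points recommended above can be collected in one place. The following is a minimal sketch: the helper name openmp_starting_env and its cores_per_socket parameter are hypothetical, the core count is supplied by the caller rather than detected, and the values are only the starting points listed above, intended to be tuned afterwards.

```python
def openmp_starting_env(cores_per_socket):
    """Return the starting OpenMP settings recommended above, as strings.

    cores_per_socket is a hypothetical parameter supplied by the caller;
    this sketch does not detect the hardware topology itself.
    """
    if cores_per_socket < 1:
        raise ValueError("cores_per_socket must be at least 1")
    return {
        "OMP_NUM_THREADS": str(cores_per_socket),
        "KMP_BLOCKTIME": "1",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
        "KMP_SETTINGS": "TRUE",
    }


# Print the settings as shell export lines for a 16-core socket.
settings = openmp_starting_env(16)
for name, value in settings.items():
    print(f"export {name}={value}")
```

Environment variable values are always strings, which is why the thread count is converted with str().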

How to apply OpenMP settings

These settings are applied as environment variables.

  • Can be set in shell
    • Example:
export OMP_NUM_THREADS=16
  • Can be set in Python code
    • Example:
import os
os.environ["OMP_NUM_THREADS"] = "16"
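
The OpenMP runtime generally reads these variables when it initializes, so a value set from Python is best assigned before TensorFlow is imported. When comparing several configurations, one way to keep that ordering is to launch each run in a fresh child process. The sketch below assumes this approach; run_with_openmp_env is a hypothetical helper, and the child command is a stand-in for a real TensorFlow script that simply reports the value it sees.

```python
import os
import subprocess
import sys


def run_with_openmp_env(command, overrides):
    """Run command in a child process whose environment includes overrides."""
    env = os.environ.copy()   # start from the current environment
    env.update(overrides)     # apply the OpenMP settings for this run only
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in for a real workload: the child prints the setting it received.
child = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
result = run_with_openmp_env(child, {"OMP_NUM_THREADS": "16", "KMP_BLOCKTIME": "1"})
print(result.stdout.strip())
```

Because the overrides are applied to a copy of the environment, the parent process and later runs are not affected.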

TensorFlow settings

  • intra_op_parallelism_threads
    • Number of threads used within an individual op for parallelism
    • Recommend: start with the number of cores/socket on the test system, and try increasing and decreasing
  • inter_op_parallelism_threads
    • Number of threads used for parallelism between independent operations.
    • Recommend: start with the number of physical cores on the test system, and try increasing and decreasing
  • device_count
    • Maximum number of devices (CPUs in this case) to use
    • Recommend: start with the number of cores/socket on the test system, and try increasing and decreasing
  • allow_soft_placement
    • Set to True to allow operations to be placed on the CPU instead of the GPU
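
The advice to start from the core count and then "try increasing and decreasing" can be turned into a small list of values to benchmark. This is only a sketch: the helper name thread_count_candidates is hypothetical, and halving and doubling the starting value is an assumption made for illustration, not a rule from this guide.

```python
def thread_count_candidates(start):
    """Return thread counts to try: half of start, start, and double start."""
    if start < 1:
        raise ValueError("start must be at least 1")
    # A set removes duplicates when start is 1 (half of 1 is clamped to 1).
    return sorted({max(1, start // 2), start, start * 2})


# Example: candidate values around a 16-core starting point.
for intra in thread_count_candidates(16):
    print("try intra_op_parallelism_threads =", intra)
```

Each candidate would then be measured on the actual workload, since the best value depends on the model and the system.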

How to apply TensorFlow settings

These settings are applied in Python code using ConfigProto (TensorFlow 1.X) or tf.config (TensorFlow 2.X).

  • Example in TensorFlow version 1.X:
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=16, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count={'CPU': 16})
session = tf.Session(config=config)
  • Example in TensorFlow 2.X:
import tensorflow as tf
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.set_soft_device_placement(True)

Notices and Disclaimers 

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others. 

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804