Accelerating Document Classification (Training) using Intel® Optimization for TensorFlow* on Intel® Xeon® Scalable Processors

Overview

Much of the success of modern AI, and of deep learning algorithms in particular, is due to impressive results in image classification, where near human-level performance has been observed. This capability can be used for document authentication, a common task when opening a bank account, checking in at the airport, or showing a driver's license to a police officer. Today most document authentication tasks are done by humans, but AI is proving effective and is being increasingly employed for this activity.

In this paper we show how to accelerate training for a document classification system built as a five-step pipeline:

  1. Binary Classifier: Labels a given image as Document or Not Document.
  2. Multiclass Classifier: Labels an image classified as a Document as Front, Back, or Unfolded.
  3. OCR: Receives an image and turns it into text.
  4. Image Authentication: Looks for a match between the picture in the document and the person's picture available in a database.
  5. Text Authentication: Looks for a match between the text in the document and the person's data available in a database.

Note: Unfolded means an open document showing both the “Front” and “Back” sides.

Only steps 1 and 2 are covered in this article; they prepare the data to be passed on to subsequent steps 3, 4, and 5.

Solution Architecture and Design

The solution is aimed at identifying a document, labeling its side, and extracting structured information that can be compared against a database holding a certified version of the document.

The block diagram is shown below:

[Figure: Solution block diagram]

Topologies

The Binary and Multiclass Classifiers used in the experiments of this paper were implemented using the Keras* high-level API available in TensorFlow*, and the CNN topologies are shown below:

[Figure: CNN topologies for the binary and multiclass classifiers]

As we can see above, the only difference between the two topologies is that the binary classifier has two neurons in its last layer while the multiclass classifier has six.
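
The exact topology used in the experiments is not reproduced here, but the minimal Keras* sketch below illustrates the point; the convolutional body is hypothetical (layer counts and sizes are illustrative only), and only the number of neurons in the final layer distinguishes the two classifiers:

from tensorflow.keras import layers, models

def build_classifier(num_classes, input_shape=(224, 224, 3)):
    """Illustrative CNN; the real topology is the custom CNN shown above."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        # The only difference between the two topologies:
        # 2 neurons for the binary classifier, 6 for the multiclass classifier
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

binary_model = build_classifier(num_classes=2)
multiclass_model = build_classifier(num_classes=6)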

Setting Up Environments

Optimized Environment (Uses Intel® MKL-DNN in the Backend)

The optimized environment consists of Intel® Distribution for Python* and the Intel® Optimization for TensorFlow*.

Install the Intel® Distribution for Python*

Install Intel® Optimization for TensorFlow*
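
For reference, one common way to set up this environment with Anaconda* is shown below; the channel and package names (intel, intelpython3_full, intel-tensorflow) are the ones documented in references 1 and 2 at the time of writing, so check those pages for current instructions:

# Create a conda environment with Intel® Distribution for Python*
conda create -n idp -c intel intelpython3_full python=3.6
source activate idp
# Install Intel® Optimization for TensorFlow* from PyPI
pip install intel-tensorflow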

Default Environment (Uses EIGEN in the Backend)

To install the default environment execute:

pip install tensorflow

Hardware Configuration

The following is the hardware configuration used for all comparisons in this paper:

Test by: Intel®
Test date: 
Platform: x86_64
# Nodes: 1
# Sockets: 2
CPU: Intel® Xeon® Platinum 8153 CPU @ 2.00 GHz
Cores/socket, Threads/socket: 16, 2
ucode: 0x200004d
HT: On
Turbo: On
BIOS version (including microcode version): SE5C620.86B.00.01.0015.110720180833
System DDR Mem Config: 1 slot / 394 GB / n/a
System DCPMM Config (slots / cap / run-speed): 
Total Memory/Node (DDR + DCPMM): 394 GB
OS: CentOS* Linux* 7 (Core)
Kernel: 3.10.0-693.11.6.el7.x86_64
Mitigation variants (1, 2, 3, 3a, 4, L1TF): 3, L1TF
Workload and version: 
Compiler: GCC 6.4.0
Libraries: Intel® Optimization for TensorFlow*
Framework version: TensorFlow* 1.9
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) version: Intel® MKL-DNN 2018
Dataset: Images provided by Big Data Corp
Topology: Custom CNN
Batch size: 16 to 720

Software Used

The following is the software configuration used:

Optimized Environment

Intel® Distribution for Python* version: Python 3.6.1
Intel® Optimization for TensorFlow* version: 1.9
Anaconda* version: 5.2.0

Default Environment

Python* version: Python 3.6.1
TensorFlow* version (from pip): 1.9
Anaconda* version: 5.2.0

Improving Training Performance

On the CPU, Intel® Distribution for Python* along with Intel® Optimization for TensorFlow* helps achieve better performance.

An improvement of around 70% to 80% was observed simply by installing Intel® Optimization for TensorFlow*.2

It is important to use the full compute capability the CPU provides. The TensorFlow* performance optimization guide3 gives details on optimizing for CPU. The following guidelines were followed during our experiments:

Set intra_op_parallelism_threads and OMP_NUM_THREADS equal to the number of physical cores;

Set inter_op_parallelism_threads equal to the number of sockets;

Set KMP_BLOCKTIME to zero.

Setting Number of Threads to Execute in Parallel for Inter and Intra Operations in TensorFlow* and Keras*

As shown in the Hardware Configuration section, the Intel® Xeon® Platinum 8153 system has 32 physical cores (16 per socket) across 2 sockets, therefore we set intra_op_parallelism_threads to 32 and inter_op_parallelism_threads to 2, as shown in the code snippet below:

import tensorflow as tf
from tensorflow.keras import backend as K

# intra_op threads: parallelism within a single operation (number of physical cores)
# inter_op threads: independent operations run in parallel (number of sockets)
K.set_session(tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=32,
    inter_op_parallelism_threads=2)))

Setting Environment Variables Before Execution

Here we also set OMP_NUM_THREADS to 32 to reflect the number of physical cores, following the instructions provided in the TensorFlow* performance optimization guide for CPU3:

export MKL_VERBOSE=0
export MKLDNN_VERBOSE=0
export KMP_BLOCKTIME=0
export OMP_NUM_THREADS=32
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_SETTINGS=1
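
The same variables can also be set from Python itself, as in the sketch below; note that they must be assigned before TensorFlow* is first imported, otherwise the OpenMP runtime is initialized with its defaults:

import os

# Set OpenMP/MKL tuning knobs before TensorFlow* is imported
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["OMP_NUM_THREADS"] = "32"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_SETTINGS"] = "1"

import tensorflow as tf  # imported only after the environment is configured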

We can also take advantage of the large memory available on Intel® Xeon® Scalable processors and increase the batch size, processing more images at a time while computing the gradients of the neural network. Increasing the batch size can reduce training execution time on CPUs, but it may also impact test accuracy, so this step should be taken carefully, weighing whether the gain in execution time is worth the loss in accuracy.
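
In Keras*, the batch size is simply the batch_size argument of fit. The sketch below reuses the hypothetical build_classifier model from earlier; train_images, train_labels, val_images, and val_labels are placeholders for the actual dataset:

# Larger batches process more images per gradient step; batch sizes
# from 16 to 720 were explored in these experiments
model = build_classifier(num_classes=2)
model.fit(train_images, train_labels,
          batch_size=720,
          epochs=25,
          validation_data=(val_images, val_labels))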

Results

Training Parameters

Train Dataset Size: 5,337 images
Validation Dataset Size: 1,098 images
Number of Epochs for Training: 25 epochs

Performance Analysis

[Figure: Performance analysis graphic]

Accuracy Analysis

[Figure: Accuracy analysis graphic]

Conclusion

This paper shows how to optimize deep learning training on Intel® CPUs. A 3.1x speedup was achieved when training a binary image classifier and a 3.6x speedup when training a multiclass image classifier. The comparison was made between a default environment using libraries from the official pip channel (baseline) and an Intel optimized environment where Intel® Distribution for Python* and Intel® Optimization for TensorFlow* were installed. For even better performance, the batch size was increased in the optimized environment. Increasing the batch size boosted performance further but led to an accuracy drop on both classifiers: validation accuracy went from 98% to 85% on the binary classifier and from 95% to 44% on the multiclass classifier. Some recent papers4 show how to speed up the convergence of optimization algorithms and improve the accuracy of neural networks while increasing the batch size; these approaches could be explored as future work to find a set of hyperparameters that preserves accuracy even for large batch sizes.

References

  1. Using Intel® Distribution for Python* with Anaconda*
  2. Installing Intel® Optimization for TensorFlow*
  3. Optimizing TensorFlow* for CPU
  4. Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le (Google Brain). Don’t Decay the Learning Rate, Increase the Batch Size. Published as a conference paper at ICLR 2018.
For more complete information about compiler optimizations, see our Optimization Notice.