Much of the success of modern AI, especially deep learning, is due to its impressive results in image classification, where near human-level accuracy has been observed. This capability can be applied to document authentication, a common task when opening a bank account, checking in at the airport, or showing a driver's license to a police officer. Today most document authentication is done by humans, but AI is proving effective and is increasingly being employed for this activity.
In this paper we show how to accelerate training for a document classification system built as a five-step pipeline:
- Binary Classifier: Label a given image as a Document or Not Document
- Multiclass Classifier: Label an image classified as a Document as Front, Back, or Unfolded.
- OCR: This module receives an image and turns it into text
- Text Authentication: This module looks for a match between the picture available in the document and the real person's picture available in a database
- Text Authentication: This module looks for a match between the text available in the document and the real person's data available in a database
Note: Unfolded means an open document showing both "Front" and "Back" sides.
Only steps 1 and 2 are covered in this article; they prepare the data to be passed on to subsequent steps 3, 4, and 5.
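The two classification steps covered here can be sketched as a simple dispatch: the binary classifier gates whether the multiclass classifier runs at all. The label names and output indices below are illustrative assumptions; the paper only specifies the classifiers' roles, not their exact label encodings.

```python
# Sketch of the steps 1-2 dispatch logic (label names/indices are assumptions).
DOCUMENT, NOT_DOCUMENT = 0, 1  # hypothetical output indices of the binary classifier

def argmax(probs):
    """Index of the highest value in a softmax output vector."""
    return max(range(len(probs)), key=probs.__getitem__)

def classify_document(binary_probs, multiclass_probs, side_labels):
    """Route an image: reject non-documents, otherwise label the side."""
    if argmax(binary_probs) == NOT_DOCUMENT:
        return "not_document"
    return side_labels[argmax(multiclass_probs)]

# Usage: a confident "document" prediction routed to the side classifier.
sides = ["front", "back", "unfolded"]  # assumed label names
print(classify_document([0.9, 0.1], [0.1, 0.7, 0.2], sides))  # → "back"
```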
Solution Architecture and Design
The solution is aimed at identifying a document, labeling its side, and extracting structured information that can be compared against a database holding a certified version of the document.
The block diagram is shown below:
The Binary and Multiclass Classifier used in the experiments of this paper were implemented using Keras* high-level API available on TensorFlow* and the CNN topologies are shown below:
As we can see above, the only difference between the two topologies is that the binary classifier has two neurons in its last layer while the multiclass classifier has six.
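The exact layer stack appears only as a figure in the original; as a rough sketch, a shared CNN body that differs only in the final Dense layer might look like the code below. The filter counts, layer depths, and input shape are assumptions for illustration, not the paper's actual topology.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(num_classes, input_shape=(150, 150, 3)):
    """Shared CNN body (layer sizes assumed); only the final Dense layer
    differs: 2 neurons for the binary model, 6 for the multiclass model."""
    return models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

binary_model = build_classifier(2)      # Document / Not Document
multiclass_model = build_classifier(6)  # six output neurons, as in the figure
```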
Setting Up Environments
Optimized Environment (Uses Intel® MKL-DNN in the Backend)
The optimized environment consists of Intel® Distribution for Python* and the Intel® Optimization for TensorFlow*.
Install the Intel® Distribution for Python*
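One way to set this environment up is via conda and pip; the channel and package names below reflect Intel's distribution conventions at the time of writing and should be checked against the current install guides (references 1 and 2):

```shell
# Assumed package/channel names — consult Intel's install guides for your platform.
conda create -n idp -c intel intelpython3_full python=3.6
conda activate idp
pip install intel-tensorflow
```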
Default Environment (Uses EIGEN in the Backend)
To install the default environment execute:
pip install tensorflow
The following is the hardware configuration used for all comparisons in this paper:

| Component | Value |
| --- | --- |
| CPU | Intel® Xeon® Platinum 8153 CPU @ 2.00 GHz |
| BIOS version (including microcode version) | SE5C620.86B.00.01.0015.110720180833 |
| System DDR Mem Config | 1 slot / 394 GB / n/a |
| System DCPMM Config: slots / cap / run-speed | |
| Total Memory/Node (DDR+DCPMM) | 394 GB |
| OS | CentOS* Linux* 7 (Core) |
| Mitigation variants (1, 2, 3, 3a, 4, L1TF) | 3, L1TF |
| Workload and version | |
| Libraries | Intel® Optimization for TensorFlow* |
| Framework version | TensorFlow* 1.9 |
| Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) version | Intel® MKL-DNN 2018 |
| Dataset | Images provided by Big Data Corp |
| Topology | Custom CNN |
| Batch Size | 16 to 720 |
The following is the software configuration used:

| Software | Version |
| --- | --- |
| Intel® Distribution for Python* | Python 3.6.1 |
| Intel® Optimization for TensorFlow* | 1.9 |
| Python* | Python 3.6.1 |
| TensorFlow* (from pip) | 1.9 |
Improving Training Performance
On the CPU, Intel® Distribution for Python* along with Intel® Optimization for TensorFlow* helps achieve better performance.
An improvement of around 70% to 80% was observed simply by installing Intel® Optimization for TensorFlow*.2
It is important to use the full compute capacity the CPU provides. The TensorFlow* performance optimization guide3 gives details on optimizing for CPU. The following guidelines were applied during our experiments:
- Set intra_op_parallelism_threads and OMP_NUM_THREADS equal to the number of physical cores;
- Set inter_op_parallelism_threads equal to the number of sockets;
- Set KMP_BLOCKTIME to zero.
Setting Number of Threads to Execute in Parallel for Inter and Intra Operations in TensorFlow* and Keras*
As shown in the hardware configuration section, the Intel® Xeon® Platinum 8153 CPU has 32 physical cores across 2 sockets, so we set intra_op_parallelism_threads to 32 and inter_op_parallelism_threads to 2, as in the code snippet below:
```python
import tensorflow as tf
from tensorflow.keras import backend as K

K.set_session(
    tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=32,
                                     inter_op_parallelism_threads=2)))
```
Setting Environment Variables Before Execution
Here we also set OMP_NUM_THREADS to 32 to reflect the number of physical cores, following the instructions in the TensorFlow* performance optimization guide for CPU3.
```shell
export MKL_VERBOSE=0
export MKLDNN_VERBOSE=0
export KMP_BLOCKTIME=0
export OMP_NUM_THREADS=32
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_SETTINGS=1
```
We can also take advantage of the large memory available on Intel® Xeon® Scalable processors and increase the batch size, so that more images are processed at once while computing the gradients of the neural network. Increasing the batch size can reduce training time on CPUs, but it may also affect test accuracy, so this step should be taken carefully: decide whether the gain in execution time is worth the loss in accuracy.
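Using the training-set size from the table below (5337 images), the effect of batch size on per-epoch work can be made concrete: larger batches mean far fewer gradient updates per epoch, which is where the training-time savings come from.

```python
import math

TRAIN_IMAGES = 5337  # training dataset size from the table below

def updates_per_epoch(batch_size):
    """Number of gradient updates (steps) needed to see every image once."""
    return math.ceil(TRAIN_IMAGES / batch_size)

print(updates_per_epoch(16))   # → 334 updates per epoch at the smallest batch size
print(updates_per_epoch(720))  # → 8 updates per epoch at the largest batch size
```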
|Train Dataset Size||5337 images|
|Validation Dataset Size||1098 images|
|Number of Epochs for Training||25 epochs|
This paper shows how to optimize deep learning training on Intel® CPUs. A 3.1x speedup was achieved when training a binary image classifier and a 3.6x speedup when training a multiclass image classifier. The comparison was made between a default environment with libraries from the official pip channel (baseline) and an Intel optimized environment with Intel® Distribution for Python* and Intel® Optimization for TensorFlow* installed. For even better performance, the batch size was increased in the optimized environment. Increasing the batch size boosted performance but led to an accuracy drop on both classifiers: validation accuracy fell from 98% to 85% on the binary classifier and from 95% to 44% on the multiclass classifier. Some recent papers4 show how to speed up convergence of optimization algorithms and improve neural network accuracy by increasing the batch size; these approaches could be explored as future work to find hyperparameters that preserve accuracy even at large batch sizes.
- Using Intel® Distribution for Python* with Anaconda*
- Installing Intel® Optimization for TensorFlow*
- Optimizing TensorFlow* for CPU
- Smith, S. L., Kindermans, P.-J., Ying, C., & Le, Q. V. (Google Brain). Don't Decay the Learning Rate, Increase the Batch Size. Published as a conference paper at ICLR 2018.