As part of our continued effort to optimize deep learning workloads on Intel® architecture, our engineers explore various paths to maximum performance. Not long ago, a technical preview of optimized AlexNet training on Caffe was published. Now we are sharing another preview of our work, this one focused entirely on the classification path and bringing it to performance levels not previously demonstrated on an Intel CPU.
Once again, we used the Caffe deep learning framework, developed by the Berkeley Vision and Learning Center (BVLC), as the vehicle for demonstrating our results.
We have focused solely on optimizing the classification path of the AlexNet* and CaffeNet* topologies. Neither the training path nor other topologies are part of this preview, and they are not intended to be run with this package.
In the future, highly optimized routines for both scoring and training will become available in Intel® Math Kernel Library (Intel® MKL) and Intel® Data Analytics Acceleration Library (Intel® DAAL).
The performance achieved by this package was made possible by an innovative approach to code generation and by very low-level optimizations applied to critical routines such as the convolution function. The code is parallel at the CPU level (multithreaded), and Caffe has been modified to execute on both sockets of a dual-socket system, but it does not use OpenMP* for this purpose.
The package supports classification with the AlexNet and CaffeNet topologies in batches of 96 and uses the Python* interface provided by Caffe because of its ease of use. This interface has been tweaked to provide more accurate metrics of actual network classification performance. The original version included the overhead of the Python code in the reported metrics. We believe this is a suboptimal way to perform these tests, because actual network input data can be loaded into memory continuously and asynchronously by different devices, without interrupting classification.
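To illustrate the measurement idea described above, here is a minimal sketch that accumulates only the time spent inside the classification call and leaves per-batch Python work outside the timed region. The names `timed_classification`, `images_per_second`, and the `classify` callable are hypothetical stand-ins chosen for this example; they are not taken from the package or from Caffe's Python interface.

```python
import time


def timed_classification(classify, batches):
    """Run classify on each batch and time only the classify calls.

    `classify` is a hypothetical stand-in for whatever function performs
    the forward pass; it is not an API from the package.
    """
    results = []
    inference_seconds = 0.0
    for batch in batches:
        # Loading and preprocessing of `batch` would happen before this
        # point, outside the timed region.
        start = time.perf_counter()
        output = classify(batch)
        inference_seconds += time.perf_counter() - start
        results.append(output)
    return results, inference_seconds


def images_per_second(num_images, seconds):
    """Convert a total image count and elapsed time into throughput."""
    return num_images / seconds if seconds > 0 else float("inf")
```

With batches of 96 images, throughput would then be computed as `images_per_second(96 * number_of_batches, inference_seconds)`, so that Python-side overhead between batches does not lower the reported figure.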
First, you will need all the dependencies required by Caffe and its Python scripts as of the June version (commit SHA: b051ce474425d4e991cf674107ee5f6999a2be38). Then unpack the package, which contains step-by-step instructions on how to run our demo.
For your convenience, we are showing these instructions here:
Consider DEMO_ROOT the main directory of the unpacked archive.
Get reference ilsvrc12 data.
Go to DEMO_ROOT/data/ilsvrc12/ subdir and run:
Get reference networks' parameters:
Go to DEMO_ROOT/models/bvlc_reference_caffenet/ subdir and run:
Go to DEMO_ROOT/models/bvlc_alexnet/ subdir and run:
Go to DEMO_ROOT/python subdir.
Make sure you have sudo access.
sudo is required to change the default thread scheduler from round-robin to FIFO and to change the threads' priority, which gives better performance in our case.
If you don't want to run the demo with sudo, you need to change the execute_classification.sh script.
Get some images in PNG format.
Required once per dataset – run:
sudo ./split_data.sh <path/to/image/dir>
*) Because Python interferes heavily after each batch, performance metrics can be underestimated when using a single run. Running more iterations gives more accurate results that approach maximum performance.
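The scheduler change mentioned in the sudo step above can be sketched from Python as follows. This is only an illustration of requesting a FIFO scheduling policy on Linux through the standard `os` module; how the package's own scripts make this change is not shown in this post, and the function name `try_set_fifo` and the default priority are assumptions made for the example.

```python
import os


def try_set_fifo(priority=None):
    """Try to switch the calling process to the SCHED_FIFO policy.

    Returns True on success and False if the platform lacks the call or
    the process lacks the required privileges (the usual case without sudo).
    The default priority (the minimum allowed for SCHED_FIFO) is an
    assumption for this sketch, not a value taken from the package.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # scheduling API not available on this platform
    if priority is None:
        priority = os.sched_get_priority_min(os.SCHED_FIFO)
    try:
        # A pid of 0 refers to the calling process.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        # PermissionError is a subclass of OSError; raised without privileges.
        return False
```

Run without elevated privileges, the call is expected to fail and the function returns False, which is consistent with the note above that sudo is needed for this step.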
A special thank you to Krzysztof Badziak, Jacek Czaja, Jaroslaw Dukat, Bartosz Kalinczuk, Piotr Majcher, Piotr Majchrzak, Jacek Reniecki and Maciej Urbanski from Intel’s Visual Cloud Computing team and Vadim Pirogov from Intel’s Software Services Group. They were the driving force behind the performance optimizations shown and the Caffe work illustrated in this blog post.
Caffe* is a third-party trademark owned by Berkeley Vision and Learning Center (BVLC). Other names and brands may be claimed as the property of others.
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.