Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family

Deep neural network (DNN) training is computationally intensive and can take days or weeks on modern computing platforms. In a recent article, Single-node Caffe Scoring and Training on Intel® Xeon® E5 Family, we demonstrated a tenfold performance increase of the Caffe* framework on the AlexNet* topology, reducing training time to 5 days on a single node. Intel continues to deliver on the machine learning vision outlined in Pradeep Dubey’s blog, and in this technical preview we demonstrate how Caffe training time can be reduced from days to hours in a multi-node, distributed-memory environment.

This article includes a preview package that has limited functionality and is not intended for production use. The features discussed here are now available as part of Intel® MKL 2017 and Intel’s fork of Caffe.

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and one of the most popular community frameworks for image recognition. Caffe is often used as a benchmark together with AlexNet*, a neural network topology for image recognition, and ImageNet*, a database of labeled images.

The Caffe framework does not support multi-node, distributed-memory systems by default, and running it on such systems requires extensive changes. We implement strong scaling of the synchronous minibatch stochastic gradient descent (SGD) algorithm with the help of the Intel® MPI Library. The computation for one iteration is scaled across multiple nodes in such a way that the multi-threaded, multi-node parallel implementation is numerically equivalent to the single-node, single-threaded serial implementation.
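
To make this equivalence concrete, here is a minimal sketch, in MPI-style C++, of what one synchronous data-parallel SGD iteration can look like. It is illustrative only and is not taken from the package: each rank computes gradients on its share of the minibatch, the partial gradients are summed across ranks with an allreduce, and every rank applies the identical update, matching a serial run over the full minibatch.

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // One synchronous data-parallel SGD step (gradient exchange only).
    // Assumes MPI_Init has been called and every rank starts from
    // identical weights.
    void sgd_step(std::vector<float>& weights,
                  std::vector<float>& local_grad,  // gradient summed over this rank's slice of the minibatch
                  float learning_rate,
                  int global_batch_size) {
        // Sum the partial gradients across all ranks; every rank
        // receives the global total.
        MPI_Allreduce(MPI_IN_PLACE, local_grad.data(),
                      static_cast<int>(local_grad.size()),
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        // Identical update on every rank: average the gradient over
        // the full minibatch, as a serial implementation would.
        for (std::size_t i = 0; i < weights.size(); ++i)
            weights[i] -= learning_rate * local_grad[i] / global_batch_size;
    }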

We use three approaches to scale the computation: data parallelism, model parallelism, and hybrid parallelism. In data parallelism, every node holds a full copy of the weights and processes its own share of the data points in a minibatch, which requires communicating weights and weight gradients. Model parallelism instead partitions the model, or weights, across nodes, so that each node owns part of the weights and processes all the data points in a minibatch; this requires communicating activations and activation gradients rather than weights and weight gradients. Hybrid parallelism combines the two.
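
To illustrate the model-parallel communication pattern, here is a hedged sketch (again not code from the package) of a forward pass through a fully connected layer, where each rank owns a slice of the output neurons and the full activation vector is assembled with an allgather:

    #include <mpi.h>
    #include <vector>

    // Model-parallel forward pass of a fully connected layer.
    // Each rank owns out_per_rank output neurons (the corresponding
    // rows of the weight matrix) and sees the full input activations.
    std::vector<float> fc_forward(const std::vector<float>& input,    // full input activations
                                  const std::vector<float>& my_rows,  // this rank's weight rows, row-major
                                  int out_per_rank, int in_dim, int nranks) {
        // Compute this rank's slice of the output activations.
        std::vector<float> partial(out_per_rank, 0.0f);
        for (int o = 0; o < out_per_rank; ++o)
            for (int i = 0; i < in_dim; ++i)
                partial[o] += my_rows[o * in_dim + i] * input[i];

        // Gather every rank's slice so all ranks hold the complete
        // activation vector for the next layer; the backward pass
        // communicates activation gradients analogously.
        std::vector<float> full(out_per_rank * nranks);
        MPI_Allgather(partial.data(), out_per_rank, MPI_FLOAT,
                      full.data(), out_per_rank, MPI_FLOAT, MPI_COMM_WORLD);
        return full;
    }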

With this additional level of distributed parallelization, we trained AlexNet on the full ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012) dataset and reached 80% top-5 accuracy in just over 5 hours on a 64-node cluster of systems based on Intel® Xeon® processor E5 family.

Getting Started

While we are working to incorporate the new functionality outlined in this article into future versions of Intel® Math Kernel Library (Intel® MKL) and Intel® Data Analytics Acceleration Library (Intel® DAAL), you can use the technology preview package attached to this article to reproduce the demonstrated performance results and even train AlexNet on your own dataset. The preview includes both the single-node and the multi-node implementations. Note that the current implementation is limited to the AlexNet topology and may not work with other popular DNN topologies.

The package supports the AlexNet topology and introduces the ‘intel_alexnet’ and ‘mpi_intel_alexnet’ models, which are similar to ‘bvlc_alexnet’ with the addition of two new layers, ‘IntelPack’ and ‘IntelUnpack’, as well as optimized convolution, pooling, and normalization layers and MPI-based implementations of all these layers. We also changed the validation parameters to facilitate vectorization: the validation minibatch size was increased from 50 to 256 and the number of test iterations reduced from 1,000 to 200, keeping the number of images used in the validation run roughly constant (50 × 1,000 = 50,000 versus 256 × 200 = 51,200). A sketch of these validation settings follows the file list below. The package contains the model files in these locations:

  • models/intel_alexnet/deploy.prototxt
  • models/intel_alexnet/solver.prototxt
  • models/intel_alexnet/train_val.prototxt
  • models/mpi_intel_alexnet/deploy.prototxt
  • models/mpi_intel_alexnet/solver.prototxt
  • models/mpi_intel_alexnet/train_val.prototxt
  • models/mpi_intel_alexnet/train_val_shared_db.prototxt
  • models/mpi_intel_alexnet/train_val_split_db.prototxt
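
For reference, the validation settings described above would look roughly like the following in a Caffe train_val.prototxt. This is an illustrative sketch, and the database path is a placeholder rather than the package's actual value:

    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TEST }
      data_param {
        source: "path/to/ilsvrc12_val_lmdb"  # placeholder path
        batch_size: 256                      # raised from 50 to aid vectorization
        backend: LMDB
      }
    }
    # and in solver.prototxt:
    # test_iter: 200                         # reduced from 1,000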

Both the ‘intel_alexnet’ and the ‘mpi_intel_alexnet’ models allow you to train and test on the ILSVRC-2012 dataset.

To start working with the package, ensure that all the regular Caffe dependencies and Intel software tools listed in the System Requirements and Limitations section are installed on your system.

Running on Single Node

  1. Unpack the package.
  2. Specify the paths to the database, snapshot location, and image mean file in these ‘intel_alexnet’ model files:
    • models/intel_alexnet/deploy.prototxt
    • models/intel_alexnet/solver.prototxt
    • models/intel_alexnet/train_val.prototxt
  3. Set up a runtime environment for the software tools listed in the System Requirements and Limitations section.
  4. Add the path to ./build/lib/libcaffe.so to the LD_LIBRARY_PATH environment variable.
  5. Set the threading environment as follows:
    $> export OMP_NUM_THREADS=<N_processors * N_cores>
    $> export KMP_AFFINITY=compact,granularity=fine

Note: OMP_NUM_THREADS must be an even number and at least 2.
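
For example, on a hypothetical two-socket system with 14 cores per socket, you would set:

    $> export OMP_NUM_THREADS=28
    $> export KMP_AFFINITY=compact,granularity=fine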

  6. Run timing on a single node using this command:
    $> ./build/tools/caffe time \
           -iterations <number of iterations> \
           --model=models/intel_alexnet/train_val.prototxt
  7. Run training on a single node using this command:
    $> ./build/tools/caffe train \
           --solver=models/intel_alexnet/solver.prototxt

Running on Cluster

  1. Unpack the package.
  2. Set up a runtime environment for the software tools listed in the System Requirements and Limitations section.
  3. Add the path to ./build-mpi/lib/libcaffe.so to the LD_LIBRARY_PATH environment variable.
  4. Set the NP environment variable to the number of MPI ranks to be used (with one rank per node, this equals the number of nodes), as follows:

$> export NP=<number-of-mpi-ranks>

Note: the best performance is achieved with one MPI rank per node.
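
For example, for a 64-node run with one rank per node, as in the results above:

$> export NP=64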

  5. Create a node file in the root directory of the application, named x${NP}.hosts. For instance, under a PBS*-based resource manager (which provides the $PBS_NODEFILE variable), run the following command:

$> cat $PBS_NODEFILE > x${NP}.hosts
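
If you are not running under a resource manager, you can also create the node file by hand, listing one hostname per line. The hostnames below are placeholders for a two-node example:

$> printf "node01\nnode02\n" > x2.hosts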

  6. Specify the paths to the database, snapshot location, and image mean file in the following ‘mpi_intel_alexnet’ model files:
    • models/mpi_intel_alexnet/deploy.prototxt
    • models/mpi_intel_alexnet/solver.prototxt
    • models/mpi_intel_alexnet/train_val_shared_db.prototxt

Note: on some system configurations, the performance of a shared-disk file system may become a bottleneck. In that case, pre-distributing the image database to the compute nodes is recommended for best performance; the ‘train_val_split_db.prototxt’ model appears to correspond to this setup. Refer to the readme files included with the package for instructions.

  7. Set the threading environment as follows:
    $> export OMP_NUM_THREADS=<N_processors * N_cores>
    $> export KMP_AFFINITY=compact,granularity=fine

Note: OMP_NUM_THREADS must be an even number and at least 2.

  8. Run timing using this command:
    $> mpirun -nodefile x${NP}.hosts -n $NP -ppn 1 -prepend-rank \
           ./build-mpi/tools/caffe time \
           -iterations <number of iterations> \
           --model=models/mpi_intel_alexnet/train_val.prototxt

Note: ‘-ppn 1’ places one MPI rank per node, and ‘-prepend-rank’ prefixes each output line with the rank that produced it.

  9. Run training using this command:
    $> mpirun -nodefile x${NP}.hosts -n $NP -ppn 1 -prepend-rank \
           ./build-mpi/tools/caffe train \
           --solver=models/mpi_intel_alexnet/solver.prototxt

System Requirements and Limitations

The package has the same software dependencies as non-optimized Caffe.

Intel software tools: the Intel® MPI Library is required for the multi-node runs described above.

Hardware compatibility: systems based on the Intel® Xeon® processor E5 family.

This software was validated with the AlexNet topology only and may not work with other configurations.

Support

Please direct questions and comments on this package to intel.mkl@intel.com.

For more complete information about compiler optimizations, see our Optimization Notice.

Comments

Mingfei M. (Intel): As I train AlexNet, the accuracy fluctuates a little, for example 79.98%, 80.01%, 79.99%. So when you say the top-5 accuracy reached 80%, do you mean the first measurement above 80%, or do you count only the point after which the accuracy does not drop below 80% again?

sathish k.: Oh fine, I figured it out: "accept" is the password.

sathish k.: Same question as Michael: what is the zip password?

Michael A.: The zip file contained in this .tgz requires a password. What is it?

Andrew L.: According to the spec sheet, the Xeon E5-2697 v3 CPU has 14 cores and 35 MB of L3 cache, not 18 cores and 45 MB.
