Ravi Panchumarthy (Intel), Thomas “Elvis” Jones (AWS), Andres Rodriguez (Intel), Joseph Spisak (Intel)
Deep neural networks offer remarkable representational power, achieving state-of-the-art accuracy in areas such as computer vision, speech recognition, natural language processing, and various data analytics domains. Deep networks require large amounts of computation to train, and training often takes days or weeks. Intel is optimizing popular frameworks such as Caffe*, TensorFlow*, Theano*, and others to significantly improve performance and reduce the overall time to train on a single node. In addition, Intel is adding or enhancing multinode distributed training capabilities in these frameworks to share the computational requirements across multiple nodes and further reduce time to train. A workload that previously required days can now be trained in a matter of hours.
Amazon Web Services* (AWS) Virtual Private Cloud (VPC) provides a convenient environment for multinode distributed deep network training. AWS and Intel partnered to create a simple set of cluster-creation scripts that let developers easily deploy and train deep networks, leveraging the scale of AWS. In this article, we provide the steps to set up the AWS CloudFormation* environment to train deep networks using the Caffe framework.
The following steps create a VPC that has an Elastic Compute Cloud (EC2) t2.micro instance as the AWS CloudFormation cluster (cfncluster) controller. The cfncluster controller is then used to create a cluster composed of a master EC2 instance and a number of compute EC2 instances within the VPC.
Figure 1. CloudFormation in Amazon Web Services
Figure 2. Entering the template URL.
name, and a Value, such as,
Figure 3. Selecting the Physical ID from the Resources tab.
Figure 4. AWS EC2 console.
cp config.edit_this_cfncluster_config config
Note that while the master node is not labelled as a compute node, it also acts as a compute node. Therefore, if the total number of nodes to be used in training is 32, then choose a queue_size = 31 compute nodes.
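The node-count arithmetic above can be sketched as a small shell helper. The function name queue_size_for is hypothetical and is not part of cfncluster or the provided scripts; it only illustrates that the master counts as one of the training nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of cfncluster): given the total number of
# nodes that should take part in training, print the number of compute
# nodes to request. The master node also acts as a compute node, so it
# accounts for one of the total.
queue_size_for() {
  total_nodes=$1
  if [ "$total_nodes" -lt 1 ]; then
    echo "total node count must be at least 1" >&2
    return 1
  fi
  echo $((total_nodes - 1))
}

queue_size_for 32   # prints 31
```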
ami-77aa6117; this article will be updated when newer AMIs are provided.
cfncluster create <vpc_name_choosen_in_config_file>. This will launch more AWS CloudFormation templates. You can see them via the AWS CloudFormation page in the AWS Management Console.
After the cloud-formation-setup is complete, if you configured the size of the cluster to be N, N+1 instances will have been created (1 master node and N compute nodes). Note that the master node is also treated as a compute node. The created cluster has a drive shared among all N+1 instances. The instances contain intelcaffe, Intel® Math Kernel Library (Intel® MKL), and sample scripts to train CIFAR-10 and GoogLeNet. To start training a sample network, log in to the master node.
To start training a CIFAR-10 model with provided solver and train_val prototxt files, run:
To start training a GoogLeNet model, download the ImageNet dataset and configure the variables
batchsize_pernode and others if required in the script and run the
#Edit variables path_to_imagenet_train_folder, batchsize_pernode and others if required
aws_ic_mn_run_cifar.sh creates a hosts file (~/hosts.aws) that contains the IP addresses of all the instances in your VPC. It then updates the solver and train_val prototxt files located in ~/models/cifar10/. You can modify these prototxt files to suit your training requirements. The aws_ic_mn_run_cifar.sh script then starts the data server, which provides data to the compute nodes. Running the data server alongside the compute work adds a small overhead on the master node. After the data server is launched, distributed training is started with the mpirun command.
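As a rough illustration of the hosts-file and mpirun steps described above, the following sketch writes a hosts file and prints the launch command instead of running it. It is not the content of aws_ic_mn_run_cifar.sh: the function names, the placeholder IP addresses, the solver file name, and the Intel MPI style flags (-n, -ppn, -machinefile) are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: function names, IPs, solver path, and mpirun flags are
# assumptions for illustration, not taken from aws_ic_mn_run_cifar.sh.

# Write one IP address per line to the given hosts file.
write_hosts_file() {
  hosts_file=$1
  shift
  printf '%s\n' "$@" > "$hosts_file"
}

# Print (rather than run) an mpirun command that starts one Caffe
# process per host listed in the hosts file.
build_mpirun_cmd() {
  hosts_file=$1
  solver=$2
  node_count=$(grep -c . "$hosts_file")   # number of non-empty lines
  echo "mpirun -n $node_count -ppn 1 -machinefile $hosts_file caffe train --solver=$solver"
}

# Dry run in a temporary directory with placeholder private IPs.
demo_dir=$(mktemp -d)
write_hosts_file "$demo_dir/hosts.aws" 10.0.0.10 10.0.0.11 10.0.0.12
build_mpirun_cmd "$demo_dir/hosts.aws" models/cifar10/solver.prototxt
```

Printing the command first makes it easy to check the node count and solver path before launching a real job.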
aws_ic_mn_run_googlenet.sh creates a hosts file (~/hosts.aws) that contains the IP addresses of all the instances in your VPC. Unlike the CIFAR-10 example, where the data server provides the data, in GoogLeNet training each worker reads its own data. The script creates separate solver and train_val prototxt files and a separate train.txt file for each worker, based on the template solver and train_val prototxt files located in ~/models/googlenet/. You can modify these template prototxt files to suit your training requirements. The aws_ic_mn_run_googlenet.sh script then launches the job using the mpirun command.
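One plausible way to give each worker its own train.txt is to split the full image list round-robin, as sketched below. How aws_ic_mn_run_googlenet.sh actually partitions the data is not stated here, so the round-robin scheme, the function name split_train_list, and the train_<k>.txt output names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: the splitting scheme and file names are assumptions, not
# taken from aws_ic_mn_run_googlenet.sh.

# Split a Caffe-style image list ("<file> <label>" per line) round-robin
# into one list per worker: train_0.txt, train_1.txt, ...
split_train_list() {
  train_list=$1
  workers=$2
  out_dir=$3
  awk -v n="$workers" -v dir="$out_dir" '
    { out = dir "/train_" ((NR - 1) % n) ".txt"; print > out }
  ' "$train_list"
}

# Small demonstration with a five-line list and two workers.
demo_dir=$(mktemp -d)
printf '%s\n' "a.JPEG 0" "b.JPEG 1" "c.JPEG 0" "d.JPEG 1" "e.JPEG 0" > "$demo_dir/train.txt"
split_train_list "$demo_dir/train.txt" 2 "$demo_dir"
cat "$demo_dir/train_0.txt"   # lines 1, 3, and 5 of the original list
```

A round-robin split keeps the per-worker lists within one line of each other in length, which helps keep workers evenly loaded.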
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.