In recent years, the scale of datasets and models used in deep learning has increased dramatically. Although larger datasets and models can improve accuracy in many artificial intelligence (AI) applications, they often take much longer to train on a single machine. Yet distributing training across large clusters is still far less common with today's popular deep learning (DL) frameworks than it has long been in the Big Data world, both because access to a large graphics processing unit (GPU) cluster is often hard to obtain and because popular DL frameworks lack convenient facilities for distributed training. By leveraging the cluster distribution capabilities of Apache Spark*, BigDL successfully performs very large-scale distributed training and inference.
In this article, we demonstrate a parameter server (PS) style of parameter synchronization in BigDL (implemented as a peer-to-peer allreduce) that reduces communication overhead, along with coarse-grained scheduling; together these provide significant speedups for large-scale distributed deep learning training.
What is BigDL
BigDL (https://github.com/intel-analytics/BigDL) is a distributed deep learning library for Apache Spark developed by Intel and contributed to the open source community for the purposes of uniting big data processing and deep learning. The goal of BigDL is to help make deep learning more accessible to the Big Data community, by allowing that community to continue using familiar tools and infrastructure to build deep learning applications.
As shown in Figure 1, BigDL is implemented as a library on top of Spark, so that users can write their deep learning applications as standard Apache Spark programs. As a result, it can be seamlessly integrated with other libraries on top of Apache Spark (for example, Apache Spark SQL and DataFrames, Apache Spark ML Pipelines, Apache Spark Streaming, Structured Streaming, and so on), and can directly run on existing Apache Spark or Hadoop* clusters.
Figure 1. BigDL implementation.
Communications in BigDL
In Apache Spark MLlib, a number of machine learning algorithms are based on using synchronous mini-batch stochastic gradient descent (SGD). To aggregate parameters, these algorithms use the reduce or treeAggregate methods in Spark, as shown in Figure 2.
In this process, the time spent at the driver increases linearly with the number of nodes, due to both the CPU and the network bandwidth limitations of the driver. The CPU cost arises from merging partial results, while the network cost comes from transferring one copy of the model from each of the tasks (or partitions). Thus, the centralized driver becomes a bottleneck when there are a large number of nodes in the cluster.
Figure 2. Parameter synchronization in Apache Spark MLlib.
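To make the bottleneck concrete, here is a minimal Python sketch (ours, not Spark or MLlib code) of driver-side aggregation: every task ships a full gradient copy to the driver, so both the driver's merge work and the data it receives grow linearly with the number of partitions.

```python
# Illustrative sketch (not Spark/MLlib code): driver-side gradient
# aggregation, where every task sends a full gradient copy to the driver.

def driver_aggregate(partition_gradients):
    """Sum per-partition gradients on the driver, as a reduce would."""
    total = [0.0] * len(partition_gradients[0])
    for grad in partition_gradients:  # one full model copy per task
        for i, g in enumerate(grad):
            total[i] += g
    return total

def bytes_received_by_driver(num_partitions, model_size_bytes):
    # Network cost at the driver grows linearly with the partition count.
    return num_partitions * model_size_bytes

print(driver_aggregate([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [9.0, 12.0]
# A 100 MB model on 64 partitions already means ~6.4 GB into the driver:
print(bytes_received_by_driver(64, 100_000_000))
```

The constants here are illustrative, but the linear relationship is exactly why the centralized driver becomes the limiting factor as the cluster grows.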
Figure 3. Parameter synchronization in BigDL.
Figure 3 shows how the parameter manager inside BigDL implements a PS architecture (through an AllReduce operation) for synchronous mini-batch SGD. After each task computes its gradients, instead of sending them back to the driver, the gradients from all the partitions within a single worker are aggregated locally. Each node then holds a single gradient, which ensures that the data transferred from each node does not grow as the number of partitions per node increases. Next, the aggregated gradient on each node is sliced into chunks, and these chunks are exchanged between all the nodes in the cluster. Each node is responsible for a specific chunk, which in essence implements a PS architecture in BigDL for parameter synchronization. Each node retrieves, from all the other nodes, the gradients for the slice of the model it is responsible for, and aggregates them in multiple threads. After this pair-wise exchange completes, each node holds its own portion of the aggregated gradients and uses it to update its own portion of the weights. The exchange then happens again to synchronize the updated weights. At the end of this procedure, each node has a full copy of the updated weights.
As the parameters are stored in the Apache Spark BlockManager, each task can get the latest weights from it. Because all nodes in the cluster play the same role and the driver is not involved in the communication, there is no bottleneck in the system. Moreover, as the cluster grows, the data transferred by each node remains the same, so the time spent in parameter aggregation does not grow, enabling BigDL to achieve near-linear scaling. Figure 4 shows that for Inception v1, the throughput of 16 nodes is ~1.92X that of 8 nodes, while for ResNet*, it is ~1.88X. These results show that BigDL achieves near-linear scale-out performance.
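As a quick sanity check, the speedups quoted from Figure 4 translate into scaling efficiency as follows:

```python
# Scaling efficiency implied by the Figure 4 measurements: throughput at
# 16 nodes relative to 8 nodes, divided by the ideal 2x speedup.
inception_speedup = 1.92  # 16 nodes vs. 8 nodes, Inception v1
resnet_speedup = 1.88     # 16 nodes vs. 8 nodes, ResNet

print(f"Inception v1 efficiency: {inception_speedup / 2.0:.0%}")  # 96%
print(f"ResNet efficiency: {resnet_speedup / 2.0:.0%}")           # 94%
```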
However, we find that increasing the number of partitions still leads to an increase in training time. Our profiling showed that this increase is caused by the significant scheduling overhead Apache Spark imposes on low-latency applications. Figure 5 shows the scheduling overhead as a fraction of average compute time for Inception v1 training as we increase the number of partitions. With more than 300 partitions, Apache Spark overheads take up more than 10 percent of the average compute time, slowing down the training process. To work around this issue, BigDL runs a single task (working on a single partition) on each worker, and each task in turn runs multiple threads for the deep learning training.
Figure 4. BigDL scaling behavior.
Figure 5. Apache Spark overheads as a fraction of average compute time for Inception v1 training.
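The workaround described above (one task per worker, with the mini-batch parallelized across local threads) can be illustrated with a small Python sketch. The thread pool here stands in for BigDL's intra-task execution; the names are ours.

```python
# Sketch of BigDL's workaround: instead of many Spark partitions (each a
# separately scheduled task), run one task per worker and split that
# task's work across local threads. Threads add no scheduler overhead.
from concurrent.futures import ThreadPoolExecutor

def local_gradients(batch, num_threads=4):
    """Split one worker's mini-batch across threads and sum the results."""
    shards = [batch[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = pool.map(lambda shard: sum(shard), shards)
    # One aggregated result per node, regardless of thread count.
    return sum(partials)

print(local_gradients(list(range(8))))  # 28
```

The key point is that the Spark scheduler sees one task per worker no matter how many threads run inside it, so the per-task scheduling cost no longer scales with the degree of parallelism.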
What is Drizzle
Drizzle is a research project at the RISELab to study low-latency execution for streaming and machine learning workloads. Apache Spark currently uses a bulk synchronous parallel (BSP) computation model and notifies the scheduler at the end of each task. Invoking the scheduler at the end of every task adds overhead, resulting in decreased throughput and increased latency. We observed that for many low-latency workloads, the same operations are executed repeatedly; for example, processing different batches in streaming or iterative model training in machine learning. Based on this observation, we find that we can improve performance by reducing the number of times the scheduler is invoked, thereby amortizing its cost.
In Drizzle, we introduce group scheduling, where multiple iterations (or a group) of computations are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch. One key challenge here is in launching tasks before their input dependencies have been computed. We solve this using prescheduling in Drizzle, where we proactively queue tasks to be run on worker machines, and rely on workers to trigger tasks when their input dependencies are met.
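A back-of-the-envelope cost model shows why group scheduling helps: if each scheduler invocation carries a fixed overhead, scheduling a group of iterations at once divides that overhead across the group. The constants below are illustrative, not measurements from this article.

```python
# Toy cost model for group scheduling: per-iteration compute plus a fixed
# scheduler overhead paid once per scheduler invocation.

def total_time(iterations, compute_ms, sched_ms, group_size=1):
    invocations = -(-iterations // group_size)  # ceil(iterations / group_size)
    return iterations * compute_ms + invocations * sched_ms

# BSP-style: the scheduler is invoked after every iteration.
bsp = total_time(100, compute_ms=50, sched_ms=10)
# Drizzle-style: a group of 20 iterations is scheduled at once.
grouped = total_time(100, compute_ms=50, sched_ms=10, group_size=20)
print(bsp, grouped)  # 6000 5050
```

With these toy numbers, group scheduling cuts the scheduling component from 1,000 ms to 50 ms while compute time is unchanged, mirroring the overhead fractions shown in Figure 5.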
How Drizzle Works with BigDL
In order to exploit the scheduling benefits provided by Drizzle, we modified the implementation of the distributed optimization algorithm in BigDL. The main changes we made include refactoring the multiple stages of computation (like gradient calculation, gradient aggregation, and so on) to be part of a single DAG (Directed Acyclic Graph) of stages submitted to the scheduler. This refactoring enables Drizzle to execute all the stages of computation without interacting with the centralized driver for control plane operations. Thus, when used in conjunction with the above-described parameter manager, we can execute BigDL iterations without any centralized bottleneck in the control plane and data plane.
UC Berkeley RISELab executed performance benchmarks to measure the benefits of using Drizzle with BigDL. These benchmarks were run using Inception v1 on ImageNet and Visual Geometry Group (VGG) on Cifar-10, on Amazon EC2* clusters of r4.xlarge machines with four cores each. BigDL is configured to use one partition per core.
Figure 6. Drizzle with VGG on Cifar-10.
Figure 7. Drizzle with Inception v1 on ImageNet.
Figure 6 shows that for VGG on 32 nodes, there is a 15 percent improvement when using Drizzle with a group size of 20. For Inception v1 on 64 nodes (Figure 7), there is a consistent performance improvement as the group size in Drizzle increases, with a 10 percent improvement at a group size of 10. These improvements map directly to the scheduling overheads that we observed without Drizzle in Figure 5.
This article demonstrated how BigDL performs parameter aggregation and how Drizzle reduces Apache Spark scheduling overhead. To get started, try out the BigDL parameter manager with Drizzle on its GitHub* page; and to learn more about our work, please check out the BigDL GitHub page and the Drizzle GitHub page.