Spark Summit is a professional conference that typically draws thousands of developers, scientists, analysts, researchers and executives from all over the world. Attendees come together to understand how big data, machine learning and data science can deliver new insights. The 2016 mid-year edition of Spark Summit concluded today in San Francisco, California. Now that the summit is over, I am reflecting on how the experience has changed my perspective on the ecosystem of hardware and software for big data and machine learning.
Apache Spark* is a large-scale cluster computing framework for big data, streaming, machine learning and analytics applications. It is open source and was originally developed at the University of California, Berkeley’s AMPLab. Now maintained by the Apache Software Foundation, Apache Spark has the largest open-source community in big data, with over 1000 contributors from 250+ organizations.
Apache Spark is fast and easy to use. It supports multiple programming languages (such as Python*, Java* and Scala*) and provides many libraries for machine learning, streaming, structured query language (SQL) and graph processing. The Application Programming Interface (API) in Apache Spark is centered on the notion of the resilient distributed dataset (RDD). RDDs provide fault tolerance and, unlike Hadoop* MapReduce, allow a working dataset to be cached in memory, which can significantly improve overall program performance.
The availability of RDDs makes Apache Spark well suited to iterative algorithms, that is, algorithms that loop over a dataset multiple times, as is common in machine learning. Apache Spark also shines in interactive data analysis applications that involve exploratory or repeated querying of data, for example with databases like HBase* or Cassandra*.
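To make the benefit of in-memory caching for iterative algorithms concrete, here is a minimal pure-Python sketch. It does not use Apache Spark's API: `LazyDataset`, `load_lines`, `parse` and `fit_slope` are hypothetical names invented for this illustration, and the three data points are made up. The toy class only mimics two ideas described above, a dataset defined by the recipe that builds it and an optional in-memory copy of the result.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores the recipe for the data, not the data."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that (re)builds the data
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        # Lazy transformation: only extends the recipe, nothing runs yet.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Ask for the first materialized result to be kept in memory.
        self._use_cache = True
        return self

    def collect(self):
        if not self._use_cache:
            return self._compute()          # rebuild from the source every time
        if self._cached is None:
            self._cached = self._compute()  # built once; the recipe could rebuild it if lost
        return self._cached


loads = {"count": 0}   # how often the (pretend) storage layer is read


def load_lines():
    loads["count"] += 1
    return ["1.0,2.0", "2.0,4.0", "3.0,6.0"]   # made-up points on y = 2x


def parse(line):
    x, y = line.split(",")
    return float(x), float(y)


def fit_slope(points, iterations=20, lr=0.05):
    """Gradient descent for y = w * x; every iteration reads the whole dataset."""
    w = 0.0
    for _ in range(iterations):
        data = points.collect()
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Without caching, each of the 20 iterations reloads and reparses the input.
w_uncached = fit_slope(LazyDataset(load_lines).map(parse))
loads_without_cache = loads["count"]

# With caching, the parsed data is built once and reused by later iterations.
loads["count"] = 0
w_cached = fit_slope(LazyDataset(load_lines).map(parse).cache())
loads_with_cache = loads["count"]

print(loads_without_cache, loads_with_cache)
```

In this sketch the uncached run reads the source 20 times and the cached run reads it once, while both arrive at the same slope. A real cluster adds partitioning, memory limits and failure recovery, none of which this toy models.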
Release of Apache Spark 2.0
Compared with the current release, Apache Spark 1.6.1, Apache Spark 2.0 features a number of performance improvements. The announcement was made at the summit by Matei Zaharia, CTO of Databricks and creator of Apache Spark. The performance optimizations in Apache Spark 2.0 include improvements to Datasets, DataFrames and SQL, in addition to its Structured Streaming API.
In the rest of this blog, I will briefly highlight four key presentations from the summit that I personally found enlightening. I chose them because they overlap with my work at Intel. They cover four areas: Ecosystem, Accelerators, Scheduling and Algorithms.
- Apache Spark Ecosystem (Use Case)
Because Apache Spark is modular and extensible, AirStream, from Airbnb*, can run atop Spark Streaming and Spark SQL (two of the application components in Apache Spark). Airbnb combined these two components with custom-built modules to allow its engineers and data scientists to build real-time insights and feedback loops. The room was full to the brim, and several attendees stood in the hallway to catch a glimpse of the presenter's PowerPoint slides. The key takeaway from this use case is that it is possible to build Apache Spark-based applications that are not only scalable but also meet the real-time requirements of large organizations like Airbnb.
- Accelerators
Field Programmable Gate Array (FPGA) accelerators are gradually finding use cases in the cloud as a result of their high computing performance and energy efficiency. However, deploying such accelerators at scale can require significant effort owing to the long design times of FPGAs. BLAZE, an accelerator-aware software and runtime system, aims to enable large-scale accelerator deployment for Hadoop and Apache Spark clusters. It uses YARN (a cluster management framework) for resource allocation and accelerator management. Under the hood, BLAZE makes calls to an OpenCL* runtime if one has been installed from a vendor. Currently, BLAZE mostly supports GPUs. Its developers have also previously evaluated it with the Intel® QuickAssist Technology Accelerator Abstraction Library (AAL), a set of drivers and a software development kit similar to OpenCL, for accelerator programming.
- Scheduling
Many machine learning and stream processing workloads require low latency. Apache Spark uses the Bulk Synchronous Parallel (BSP) computation model, in which each invocation of the scheduler adds overhead that can penalize an application’s throughput and latency. UC Berkeley is aiming to improve the performance of Spark by amortizing this scheduling overhead, that is, by reducing the number of scheduler invocations during task execution, especially for machine learning algorithms where many identical operations are performed repeatedly.
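A back-of-the-envelope cost model can show why fewer scheduler invocations help when the work per iteration is small. The sketch below is only arithmetic, not a description of how Apache Spark or the UC Berkeley work is implemented. The function `total_time_ms`, its parameters and every number passed to it are assumptions chosen for illustration.

```python
import math


def total_time_ms(iterations, compute_ms, schedule_ms, group_size=1):
    """Estimated run time when one scheduler call covers `group_size` iterations.

    Hypothetical model: each scheduler call costs `schedule_ms`, each
    iteration costs `compute_ms`, and nothing overlaps.
    """
    scheduler_calls = math.ceil(iterations / group_size)
    return scheduler_calls * schedule_ms + iterations * compute_ms


# Made-up workload: 100 identical iterations, 5 ms of compute each,
# 10 ms of scheduling overhead per scheduler call.
per_iteration = total_time_ms(100, compute_ms=5, schedule_ms=10)
grouped = total_time_ms(100, compute_ms=5, schedule_ms=10, group_size=20)

print(per_iteration, grouped)
```

Under these assumed numbers, scheduling every iteration costs 1500 ms in total, while scheduling once per 20 iterations costs 550 ms. The compute time is unchanged; only the overhead term shrinks.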
- Algorithms
One of the most commonly used machine learning algorithms is the decision tree. Decision trees, along with their variants (Gradient-Boosted Trees and Random Forest), work well with large datasets. The decision tree implementation in Apache Spark’s machine learning library (MLlib) uses a framework from Google called PLANET. The creators of a newly proposed Apache Spark package called Yggdrasil demonstrated how communication costs in decision tree algorithms can be reduced by partitioning data vertically (by columns) rather than horizontally (by rows), as the current implementation in Apache Spark does.
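The intuition behind column-wise partitioning can be sketched in a few lines of plain Python. This is not Yggdrasil's code or MLlib's API. The function names, the two "workers", the tiny dataset and the choice of Gini impurity are all assumptions made for this illustration, and the sketch also assumes that every worker holds a copy of the labels. When a worker owns whole columns, it can rank every split of its own features locally and report a single candidate. Under a row-wise layout each worker sees only some rows of every feature, so per-feature statistics have to be combined across workers before any split can be scored.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split_for_column(values, labels):
    """Score every threshold in one column; return (weighted impurity, threshold)."""
    best = (float("inf"), None)
    n = len(values)
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, t)
    return best


def worker_best_split(columns, labels):
    """A worker owns whole columns, so it ranks its features without communication."""
    candidates = []
    for name, values in columns.items():
        score, threshold = best_split_for_column(values, labels)
        if threshold is not None:
            candidates.append((score, name, threshold))
    return min(candidates) if candidates else None


# Made-up data: six rows, three features, binary labels.
labels = [0, 1, 0, 1, 0, 1]
worker_a = {"age": [22, 25, 47, 52, 46, 56], "income": [30, 60, 35, 80, 50, 40]}
worker_b = {"clicks": [1, 9, 2, 8, 3, 7]}

# Each worker sends exactly one candidate to the driver, however many
# columns it holds; the driver simply keeps the lowest impurity.
candidates = [worker_best_split(w, labels) for w in (worker_a, worker_b)]
best = min(c for c in candidates if c is not None)

print(best)
```

Here the driver receives two small tuples and picks the split on `clicks` at threshold 3, which separates the two classes perfectly in this toy dataset. A full tree builder would still need to tell the other workers which rows went left and which went right, a step this sketch leaves out.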
Overall, Spark Summit 2016 was a huge success. Industry (both open-source communities and commercial firms) and research organizations are working more collaboratively than ever before to share insights and advances in big data and machine learning. Today, there is an almost seamless unification of big data and machine learning systems into a single ecosystem. While machine learning and big data are converging on the one hand, data scientists, developers and engineers are on the other hand collaborating even more closely to build complex systems that enable smarter applications. As far as I can foresee, a new spectrum around big data and machine learning is rapidly evolving.
As research communities maintain the drive for innovation, companies continue to reinvent themselves with new ideas and spin-offs that create new business opportunities in the fast-changing field of machine learning and big data. Some of these innovations and opportunities would likely not have been possible without the full community of engineers, developers, data scientists and software foundations like Apache that together make it happen. The future continues to look very promising for a new wave of computing in cloud-based learning systems and artificial intelligence.