This blog post was jointly written by Jiangang Duan, Jie Huang and Weihua Jiang (Intel), Alex Gutow (Cloudera), and Dale Kim (MapR)
A lot of noise has been made about Apache Spark*, and for good reason: it has become one of the most popular tools in the Apache Hadoop* ecosystem. It complements the existing Hadoop ecosystem by adding easy-to-use APIs and data-pipelining capabilities to Hadoop data. And project support continues to grow. Since its launch in 2009, Spark has seen over 400 contributors from more than 50 different companies.
This strong community effort has secured Spark’s place as an open standard in Hadoop. Backed by a robust engineering focus, its quality and popularity have ensured its portability, and it is supported by all the major Hadoop vendors. Its production use has also led to the development and certification of Spark applications by leading software companies, opening up Spark to more use cases and users. See https://databricks.com/certified-on-spark for more details.
One of the most exciting developments is the Spark community’s effort to improve batch processing with Spark as the execution backend. As a powerful batch processing engine, Spark not only improves the performance of several popular projects such as Apache Hive*, Apache Pig*, and Apache Sqoop*, but also drives standardization as an execution backend, making management and development more efficient. In July 2014, Cloudera, Databricks, IBM, Intel, and MapR announced an industry-wide collaboration to port the open source MapReduce tools to Spark. Since the initial announcement, a lot of progress has been made. Here’s a look at what’s been accomplished:
- Apache Crunch* 0.11 was released with a SparkPipeline, making it easy to migrate data processing applications from MapReduce to Spark.
- Spark support was added in the Apache Kite* 0.16 release, so Spark jobs can read from and write to Kite datasets.
- Sigmoid Analytics has been driving the development of Pig on Spark, successfully passing 100% of Pig’s end-to-end test cases. Sigmoid Analytics is now working to merge its work into a future release.
- Another open standard, Apache Solr*, added a Spark-based indexing tool for fast, easy indexing, ingestion, and serving of searchable complex data. We also expect to see a Solr-on-Spark solution in the near future.
- The first demo of Hive on Spark is available, the result of a strong community effort with over 140 commits to the main project.
- Based on joint work from Cloudera, Intel, and MapR, the first Hive-on-Spark AMI is now available on Amazon. The VM lets you quickly try out Spark in conjunction with one of the most widely used Hadoop tools.
- The core of the plan is to work with the community to improve Spark’s robustness so that it can better support extensions built on top of it (e.g., Hive, HBase, etc.). Among these projects, our big data benchmark suite HiBench is being extended to Spark as a large-scale performance evaluation tool. Work is in progress to provide more insight into Spark performance regression tests and analysis results. More data will be published on 01.org soon.
- Another goal is to make Spark easier to use by adding or improving its related libraries. For example, SparkR provides APIs to Spark for data scientists who are familiar with the R language. Intel is collaborating with AMPLab to make it production-ready for a future release. In addition, Tachyon is a memory-centric distributed store and the only off-heap solution for Spark. Intel is contributing hierarchical storage support to drive Tachyon’s even wider adoption.
- Last but not least, with last month’s announcement of Spark packages, Databricks is driving a trend toward more customized APIs and utilities that work with Spark. Intel is continuously contributing to innovation projects (e.g., StreamSQL) to better support Spark and make it easier to use.
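To make the migration story behind projects like Crunch, Pig, and Hive on Spark concrete, here is a toy, self-contained sketch, in plain Python rather than real Spark or Hadoop code, contrasting the explicit map/shuffle/reduce phases of a MapReduce job with the chained pipeline style Spark enables. The `MiniRDD` class is purely hypothetical, a stand-in for Spark’s RDD API, not anything shipped by the projects above.

```python
from collections import defaultdict

# MapReduce style: three explicit phases, each materializing
# intermediate key/value pairs.
def mapreduce_word_count(lines):
    mapped = [(word, 1) for line in lines for word in line.split()]  # map
    grouped = defaultdict(list)                                      # shuffle
    for word, one in mapped:
        grouped[word].append(one)
    return {word: sum(ones) for word, ones in grouped.items()}       # reduce

# Spark style: one chainable pipeline. MiniRDD is a hypothetical toy
# mimicking the shape of Spark's RDD operations, not Spark itself.
class MiniRDD:
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduce_by_key(self, f):
        acc = {}
        for key, value in self.data:
            acc[key] = f(acc[key], value) if key in acc else value
        return MiniRDD(acc.items())

    def collect(self):
        return dict(self.data)

lines = ["to be or not to be"]
counts = (MiniRDD(lines)
          .flat_map(str.split)
          .map(lambda w: (w, 1))
          .reduce_by_key(lambda a, b: a + b)
          .collect())
# counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Both functions compute the same result; the difference is that the pipeline form composes into a single chained expression, which is what makes ports such as Crunch’s SparkPipeline straightforward for application code.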
Spark has come a long way at an impressive rate thanks to the community rallying behind it as an open standard in Hadoop. With such robust developer support, we expect to see continued advancements around Spark, especially as it continues to progress as a standard execution engine for key workloads.