Analytics Zoo: Unifying Analytics + AI for Apache Spark*

By Jinquan Dai, published on September 10, 2018

By Jason Dai, senior principal engineer, Intel Corporation

Continued advancements in artificial intelligence applications have brought deep learning to the forefront of a new generation of data analytics development. In particular, we are seeing increasing demand from organizations to apply deep learning technologies (such as computer vision, natural language processing, and generative adversarial networks) to their big data platforms and pipelines.

Today, this often requires manually “stitching together” many separate components (e.g., Apache Spark*, TensorFlow*, Caffe*, Apache Hadoop* Distributed File System (HDFS), Apache Storm*/Kafka*, and others), which can be a complex and error-prone process.

At Intel, we have been working extensively with open source community users and with several partners and customers, including UCSF, Mastercard*, and many others, to build deep learning (DL) and AI applications on Apache Spark. To streamline end-to-end development and deployment, Intel developed Analytics Zoo, a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline that can transparently scale out to large Apache Hadoop/Spark clusters for distributed training or inference.

Analytics Zoo also provides developers and users alike a rich set of analytics and AI support for the end-to-end pipeline, including:

  • Easy-to-use abstractions, such as Spark DataFrame and ML pipeline support, transfer learning support, and a POJO-style model serving API

  • Common feature engineering operations for image, text, and 3D image data

  • Built-in deep learning models, such as text classification, recommendation, and object detection

  • Reference use cases, such as time-series anomaly detection, fraud detection, and image similarity search

I am pleased to announce that we have recently open sourced Analytics Zoo, making this platform available for wide community use and contributions.

Early users such as World Bank, Cray, Talroo, Baosight, Midea/KUKA, and others have built analytics + AI applications on top of Analytics Zoo for a wide range of workloads. These include transfer-learning-based image classification, sequence-to-sequence prediction for precipitation nowcasting, neural collaborative filtering for job recommendations, and unsupervised time-series anomaly detection.
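
To give a feel for one of these workloads, unsupervised time-series anomaly detection is often framed as forecasting each point from its recent history and flagging the points with the largest forecast error. The sketch below illustrates that idea in plain Python. It is not the Analytics Zoo implementation and uses none of its APIs: a trailing moving average stands in for the deep learning forecaster a real pipeline would train, and the function name `flag_anomalies` and the parameters `window` and `top_k` are hypothetical names chosen for this illustration.

```python
from statistics import mean


def flag_anomalies(series, window=3, top_k=1):
    """Return the indices of the top_k points with the largest forecast error.

    The "forecast" for each point is the mean of the previous `window` points,
    a simple stand-in (assumption) for a trained forecasting model.
    """
    if window < 1 or len(series) <= window:
        raise ValueError("series must be longer than window")

    errors = []
    for i in range(window, len(series)):
        forecast = mean(series[i - window:i])  # predict from recent history only
        errors.append((abs(series[i] - forecast), i))

    # Keep the points whose observed value is furthest from the forecast.
    worst = sorted(errors, key=lambda e: e[0], reverse=True)[:top_k]
    return sorted(index for _, index in worst)


if __name__ == "__main__":
    readings = [10, 11, 10, 11, 10, 50, 11, 10, 11, 10]
    print(flag_anomalies(readings, window=3, top_k=1))  # prints [5]
```

No labels are needed, which is what makes the approach unsupervised. In a distributed setting, the forecaster would be trained and applied across the cluster, while the final ranking of errors remains the same simple step.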

Intel is committed to continuing the collaboration with the open source community and users, as we together advance the convergence of analytics and AI on Apache Spark.

