Deliver the Value of Analytics & AI at Scale with Intel®-optimized Open Source Software

By Ziya Ma

Published: 09/11/2018 | Last Updated: 09/11/2018

How will organizations turn the deluge of data into value for a sustainable competitive advantage? Intel® technology innovations use open source software as a catalyst to help deliver that value. Read on to learn how Intel®-optimized Apache Spark*, BigDL, and Analytics Zoo can bring greater capacity and performance to your big data analytics and AI solutions.


Open Source Software for Data Analytics and AI

As an open source software leader, Intel is the number-one upstream contributor to the Linux* kernel, and we have provided a steady stream of source code contributions and optimizations for Spark*, the emerging unified analytics and AI engine for large-scale data processing.

Spark fills a role at the intersection of AI, streaming analytics, and batch analytics, offering ease of use for developers writing in Java*, Scala*, or Python*. Adding even more value to the Spark platform, Intel open source contributions also include the Optimized Analytics Package (OAP) to accelerate Spark queries, the BigDL deep learning library/framework, and Analytics Zoo, an analytics and AI platform for Apache Spark and BigDL.
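For a sense of that ease of use, the short PySpark sketch below expresses a batch aggregation with the DataFrame API; the input path and column names are hypothetical placeholders.

# Minimal PySpark batch query; dataset path and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders")  # hypothetical structured dataset
daily = (orders
         .groupBy(F.to_date("order_ts").alias("day"))  # hypothetical timestamp column
         .agg(F.sum("amount").alias("revenue"))        # hypothetical amount column
         .orderBy("day"))
daily.show()

spark.stop()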

A Key Trend: Fast Data

Companies are transitioning from big data to fast data. Fast data is data in motion. It provides the ability to learn continuously, make decisions, and take actions as soon as data arrives, typically in milliseconds.

Imagine the scenario for a credit card company when a person swipes a card to purchase something: analytics or AI applications need to immediately run hundreds of input variables such as location, time, recent purchases, and previous transactions through complex logic to determine whether to approve or decline the transaction—all within milliseconds.
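A rough sketch of such a pipeline on Spark Structured Streaming might look like the following; the Kafka topic, transaction schema, reference table, and the simple approval rule are hypothetical placeholders standing in for a real scoring model.

# Hedged sketch: scoring streaming transactions with Spark Structured Streaming.
# Broker address, topic, schema, and the approval rule are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("txn-scoring").getOrCreate()

schema = (StructType()
          .add("card_id", StringType())
          .add("amount", DoubleType())
          .add("merchant", StringType())
          .add("ts", TimestampType()))

# Stream of incoming transactions, one JSON record per Kafka message.
txns = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "transactions")               # hypothetical topic
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
        .select("t.*"))

# Join each transaction with per-card reference features and apply a placeholder
# decision rule; in practice this would be a trained model applied in the stream.
card_profiles = spark.read.parquet("hdfs:///profiles/cards")  # hypothetical table
scored = (txns.join(card_profiles, "card_id", "left")
          .withColumn("approve", F.col("amount") <= F.col("typical_max_amount")))

query = (scored.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()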

Implementing such a use case can present an extreme processing bottleneck. Many learning algorithms iterate a computation over a training dataset and update the model parameters until the model converges. To accelerate training performance, it’s common to cache the huge dataset and parameters into memory. However, memory constraints are a common challenge.
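The sketch below shows the pattern with Spark MLlib: a hypothetical labeled feature table is cached in cluster memory so that every pass of the iterative algorithm rereads it from RAM rather than from storage.

# Illustrative example: cache a training DataFrame so an iterative algorithm
# (here, MLlib logistic regression) reuses it from memory on each pass.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("iterative-training").getOrCreate()

raw = spark.read.parquet("hdfs:///data/transactions_labeled")  # hypothetical dataset
assembler = VectorAssembler(
    inputCols=["amount", "hour", "distance_km"],  # hypothetical feature columns
    outputCol="features")
train = assembler.transform(raw).select("features", "label").cache()
train.count()  # materialize the in-memory cache before training begins

lr = LogisticRegression(maxIter=100)  # many iterations over the cached data
model = lr.fit(train)
print(model.summary.accuracy)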

That is exactly why we believe Intel® Optane™ DC persistent memory can be a real game changer for fast data. Our benchmark testing shows that Spark SQL (Spark's module for working with structured data) performs eight times faster1 at a 2.6 terabyte (TB) data scale using Intel Optane DC persistent memory versus a comparable system using dynamic random-access memory (DRAM) dual in-line memory modules (DIMMs) and solid-state drives (SSDs)4. Even greater improvements were noted with the Apache Cassandra* NoSQL database.2, 3

[Figure: Spark SQL and Apache Cassandra performance comparison with Intel Optane DC persistent memory]

Intel Software Advancements

Intel advancements for analytics workloads go beyond our silicon innovations to include in-memory database optimizations and upstream contributions to numerous open source projects.

As an ecosystem leader and open source software contributor, Intel aims to optimize all major deep learning frameworks and topologies, including TensorFlow*, Caffe*, MXNet*, and Chainer*, to run well on Intel® architecture.

As a top contributor to Apache Spark, Intel open sourced the BigDL deep learning library/framework and Analytics Zoo.

BigDL was created natively for Apache Spark, which makes it very easy to perform deep learning model training and inference on existing Intel® Xeon® processor-based big data clusters. It is highly optimized through the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). BigDL is the latest software to be included in the Intel® Select Solutions family, delivering faster, easier, optimized capabilities that are pre-tested and verified by Intel and our ecosystem partners.
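The following sketch gives a feel for that workflow, assuming the BigDL 0.x Python API; the synthetic data, layer sizes, and hyperparameters are illustrative only.

# Hedged sketch of distributed training with BigDL on Spark (BigDL 0.x Python API).
# The random data, network shape, and hyperparameters are placeholders.
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()  # initialize BigDL on the Spark cluster

# Synthetic training data: 10 features, 2 classes (BigDL labels are 1-based).
samples = sc.parallelize(range(1000)).map(
    lambda i: Sample.from_ndarray(np.random.rand(10),
                                  np.array(float(i % 2 + 1))))

model = (Sequential()
         .add(Linear(10, 32)).add(ReLU())
         .add(Linear(32, 2)).add(LogSoftMax()))

optimizer = Optimizer(model=model,
                      training_rdd=samples,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(5),
                      batch_size=32)
trained_model = optimizer.optimize()  # distributed training on the Spark executors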

To unify analytics and AI on one platform, we recently open sourced Analytics Zoo, which unites Spark, TensorFlow, Keras, and BigDL programs into one pipeline. The entire pipeline can transparently scale out to a large Spark/Hadoop cluster for distributed training or inference. Analytics Zoo also provides high-level pipeline APIs, built-in deep learning models, and reference use cases, making it an end-to-end analytics and AI platform.
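A minimal sketch of the Keras-style API (using the Analytics Zoo Python package; the data shapes, layers, and training settings are placeholders) looks like this:

# Hedged sketch of Analytics Zoo's Keras-style API running on Spark.
# The synthetic data, layer sizes, and training settings are illustrative only.
import numpy as np
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

sc = init_nncontext("analytics-zoo-sketch")  # SparkContext prepared for Analytics Zoo

# Synthetic data: 200 samples with 10 features each, binary labels.
x = np.random.rand(200, 10)
y = np.random.randint(0, 2, (200,))

model = Sequential()
model.add(Dense(32, activation="relu", input_shape=(10,)))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, batch_size=32, nb_epoch=2)  # training scales out on the Spark cluster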

Analytics Zoo: Analytics + AI Pipelines for Apache Spark and BigDL

- Reference use cases (pre-built end-to-end pipelines): anomaly detection, sentiment analysis, fraud detection, chatbot, sequence prediction...
- Built-in algorithms and models: image classification, object detection, text classification, recommendation, GAN...
- Feature engineering and transformations: image, text, speech, 3D imaging, time series...
- High-level pipeline APIs: DataFrames, ML Pipelines, Autograd, Keras/Keras2, transfer learning...
- Runtime environment: Apache* Spark*, BigDL, Python...

Next Steps for Developers

With open source software as a catalyst, Intel innovations in communications, storage/memory, and compute can help you move faster, store more, and process everything, turning the data deluge into value. Get your solution development started with Intel® analytics open source software.

Related Content

Intel® AI Framework Optimizations. Explore installation guides and other learning material for popular artificial intelligence (AI) frameworks optimized for Intel® architecture.

Intel® AI Academy. Get essential learning materials, tools and technology to boost your AI development.

Intel® Artificial Intelligence. Learn more about Intel technologies for analytics and AI.

Intel® Advanced Analytics. See what Intel architecture based analytics solutions can do for business.

Make Business Smarter with Advanced Data Analytics. Learn how advanced analytics can help organizations create a competitive advantage in the new era of data-driven business.

Author

Ziya Ma is vice president of the Intel Architecture, Graphics and Software group and director of Data Analytics Technologies in System Software Products at Intel Corp. Ma is responsible for optimizing big data solutions on the Intel® architecture platform, leading open source efforts in the Apache community, and bringing about optimal big data analytics experiences for customers. Her team works across Intel, the open source community, industry, and academia to further Intel's leadership in big data analytics. Ma is a co-founder of the Women in Big Data forum. At the 2018 Global Women Economic Forum, she was honored as Women of the Decade in Data and Analytics. Ma received her bachelor's degree in computer engineering from Hefei University of Technology in China and a master's degree and Ph.D. in computer science and engineering from Arizona State University.

Footnotes

1. 8x (8/2/2018)

2., 3. 9x reads/11x users (5/24/2018)

Performance results are based on testing and may not reflect all publicly available security updates. No product can be absolutely secure. See the detailed configurations below. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

4. Spark SQL configurations (listed as AEP/Intel Optane DC persistent memory system vs. DRAM system)

Hardware:
- DRAM: 192 GB (12x 16 GB DDR4) vs. 768 GB (24x 32 GB DDR4)
- Apache Pass (AEP): 1 TB (ES2: 8x 128 GB) vs. N/A
- AEP mode: App Direct (Memkind) vs. N/A
- SSD: N/A vs. N/A
- CPU: Worker: Intel® Xeon® Platinum 8170 @ 2.10 GHz (threads per core: 2; cores per socket: 26; sockets: 2; CPU max MHz: 3700; CPU min MHz: 1000; L1d cache: 32K; L1i cache: 32K; L2 cache: 1024K; L3 cache: 36608K)
- OS: 4.16.6-202.fc27.x86_64 (BKC: WWW26, BIOS: SE5C620.86B.01.00.0918.062020181644)

Software:
- OAP cache: 1 TB AEP-based OAP cache vs. 620 GB DRAM-based OAP cache
- Hadoop: 8x HDD (ST1000NX0313), 1-replica uncompressed and plain-encoded data on Hadoop
- Spark: 1 driver (5 GB) + 2 executors (62 cores, 74 GB), spark.sql.oap.rowgroup.size=1MB
- JDK: Oracle* JDK 1.8.0_161

Workload:
- Data scale: 2.6 TB (data for the 9 queries totals 729.4 GB)
- TPC-DS queries: 9 I/O-intensive queries (Q19, Q42, Q43, Q52, Q55, Q63, Q68, Q73, Q98)
- Multi-tenants: 9 threads (fair scheduled)

Apache Cassandra configurations

Configurations are listed as NVMe system vs. Apache Pass system.

Server hardware:
- System: Intel® Server Board, Purley platform (2 socket)
- CPU: Dual Intel® Xeon® Platinum 8180 processors, 28 cores/socket, 2 sockets, 2 threads per core
- Hyper-Threading: enabled
- DRAM: DDR4 dual rank, 192 GB total (12x 16 GB DIMMs @ 2667 MHz) vs. DDR4 dual rank, 384 GB total (12x 32 GB DIMMs @ 2667 MHz)
- Apache Pass: N/A vs. AEP ES.2, 1.5 TB total (12x 128 GB DIMMs; each single rank, 128 GB, 15 W)
- Apache Pass mode: N/A vs. App Direct
- NVMe: 4x Intel® P3500 1.6 TB NVMe devices vs. N/A
- Network: 10 Gbit onboard Intel NIC

Server software:
- OS: Fedora 27
- Kernel: 4.16.6-202.fc27.x86_64
- Cassandra version: Cassandra 3.11.2 release vs. Cassandra 4.0 trunk with App Direct patch version 2.1 (https://github.com/shyla226/cassandra/tree/13981) and the PCJ library (https://github.com/pmem/pcj)
- JDK: Oracle* HotSpot JDK 1.8 u131
- Spectre/Meltdown: patched for variants 1/2/3

Cassandra parameters:
- Number of Cassandra instances: 1 vs. 14
- Cluster nodes: one per cluster
- Garbage collector: CMS vs. Parallel
- JVM options (difference from default): -Xms64G -Xmx64G vs. -Xms20G -Xmx20G -Xmn8G -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=5
- Schema: cqlstress-insanity-example.yaml
- Database size per instance: 1.25 billion entries vs. 100 K entries

Client hardware:
- Number of client machines: 1 vs. 2
- System: Intel® Server Board model S2600WFT (2 socket)
- CPU: Dual Intel® Xeon® Platinum 8176M CPU @ 2.1 GHz, 28 cores/socket, 2 sockets, 2 threads per core
- DRAM: DDR4, 384 GB total (12x 32 GB DIMMs @ 2666 MHz)
- Network: 10 Gbit onboard Intel NIC

Client software:
- OS: Fedora 27
- Kernel: 4.16.6-202.fc27.x86_64
- JDK: Oracle* HotSpot JDK 1.8 u131

Workload:
- Benchmark: cassandra-stress
- cassandra-stress instances: 1 vs. 2
- Command line to write database (NVMe): cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(insert=1\) n=1250000000 cl=ONE no-warmup -pop seq=1..1250000000 -mode native cql3 -node <ip_addr> -rate threads=10
- Command line to write database (Apache Pass): cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(insert=1\) n=100000 cl=ONE no-warmup -pop seq=1..100000 -mode native cql3 -node <ip_addr> -rate threads=10
- Command line to read database (NVMe): cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(simple1=1\) duration=10m cl=ONE no-warmup -pop dist=UNIFORM\(1..1250000000\) -mode native cql3 -node <ip_addr> -rate threads=300
- Command line to read database (Apache Pass): cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(simple1=1\) duration=3m cl=ONE no-warmup -pop dist=UNIFORM\(1..100000\) -mode native cql3 -node <ip_addr> -rate threads=320

Product and Performance Information

1. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804