How will organizations turn the deluge of data into value for a sustainable competitive advantage? Intel® technology innovations use open source software as a catalyst to help deliver that value. Read on to learn how Intel®-optimized Apache Spark*, BigDL and Analytics Zoo can bring greater capacity and performance to your big data analytics and AI solutions.
Open Source Software for Data Analytics and AI
As an open source software leader, Intel is the number-one upstream contributor to the Linux* kernel, and we have provided a steady stream of source code contributions and optimizations for Spark*, the emerging unified analytics and AI engine for large-scale data processing.
Spark fills a role at the intersection of AI, streaming analytics, and batch analytics, offering ease-of-use for developers writing in Java*, Scala*, or Python*. Adding even more value to the Spark platform, Intel open source contributions also include the Optimized Analytics Package to accelerate Spark queries, the BigDL deep learning library/framework and the Analytics Zoo analytics and AI platform for Apache Spark and BigDL.
A Key Trend: Fast Data
Companies are transitioning from big data to fast data. Fast data is data in motion. It provides the ability to learn continuously, make decisions, and take actions as soon as data arrives, typically in milliseconds.
Imagine the scenario for a credit card company when a person swipes a card to purchase something: analytics or AI applications need to immediately run hundreds of input variables such as location, time, recent purchases, and previous transactions through complex logic to determine whether to approve or decline the transaction—all within milliseconds.
Implementing such a use case can present an extreme processing bottleneck. Many learning algorithms iterate a computation over a training dataset and update the model parameters until the model converges. To accelerate training performance, it’s common to cache the huge dataset and parameters into memory. However, memory constraints are a common challenge.
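The iterate-until-convergence pattern described above can be sketched in a few lines of plain Python. This is only an illustration, not Spark or BigDL code: the `train` helper, the toy dataset, the learning rate, and the tolerance are all assumptions made for the example. The dataset lives in an in-memory list, standing in for a cached dataset, so every pass rereads it from memory rather than from storage.

```python
# Minimal sketch: fit y = w * x by gradient descent over a dataset held in memory.
# All names and values are illustrative assumptions, not taken from Spark or BigDL.

def train(data, lr=0.01, tol=1e-9, max_iters=10_000):
    """Repeat full passes over the cached dataset until w stops changing."""
    w = 0.0
    for iteration in range(1, max_iters + 1):
        # One full pass over the in-memory dataset per iteration.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        new_w = w - lr * grad
        if abs(new_w - w) < tol:  # converged: the parameter update is negligible
            return new_w, iteration
        w = new_w
    return w, max_iters

# Toy dataset generated from y = 3x, kept in a list so each pass reads from memory.
cached_data = [(float(x), 3.0 * x) for x in range(1, 11)]
w, iterations = train(cached_data)
print(f"w = {w:.4f} after {iterations} passes over the data")
```

Every iteration touches the whole dataset, which is why keeping it resident in memory matters: if the data had to be reloaded from disk on each pass, the I/O cost would be paid once per iteration instead of once overall.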
That is exactly why we believe Intel® Optane™ DC persistent memory can be a real game changer for fast data. Our benchmark testing shows that Spark SQL (Spark's module for working with structured data) performs eight times faster [1] at a 2.6 terabyte (TB) data scale using Intel Optane DC persistent memory versus a comparable system using dynamic random-access memory (DRAM) dual in-line memory modules (DIMMs) and solid-state drives (SSDs) [4]. Even greater improvements were noted with the Apache Cassandra* NoSQL (not only SQL) database [2, 3].
Intel Software Advancements
Intel advancements for analytics workloads go beyond our silicon innovations to include in-memory database optimizations and upstream contributions to numerous open source projects.
As an ecosystem leader and open source software contributor, Intel aims to optimize all major deep learning frameworks and topologies, including TensorFlow*, Caffe*, MXNet*, and Chainer*, to run well on Intel® architecture.
As a top contributor to Apache Spark, Intel open sourced the BigDL deep learning library/framework and Analytics Zoo.
BigDL was created natively for Apache Spark, which makes it very easy to perform deep learning model training and inference on existing Intel® Xeon® processor-based big data clusters. It is highly optimized through the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). BigDL is the latest software to be included in the Intel® Select Solutions family to deliver faster, easier, optimized capabilities that are pre-tested and verified by Intel and our ecosystem partners.
To unify analytics and AI on one platform, we recently open sourced Analytics Zoo. It unites Spark, TensorFlow, Keras, and BigDL programs into one pipeline. The entire pipeline can transparently scale out to a large Spark/Hadoop cluster for distributed training or inference. In addition, Analytics Zoo includes high-level pipeline APIs, built-in (pre-trained) deep learning models, and reference use cases to provide an end-to-end analytics and AI platform.
Analytics + Pipelines for Apache Spark and BigDL
|Reference Use Cases (pre-built end-to-end pipeline)||Anomaly detection, sentiment analysis, fraud detection, chatbot, sequence prediction...|
|Built-in Algorithms and Models||Image classification, object detection, text classification, recommendations, GAN...|
|Feature Engineering and Transformations||Image, text, speech, 3D imaging, time series...|
|High-level Pipeline APIs||DataFrames, ML Pipelines, Autograd, Keras/Keras2, Transfer Learning...|
|Runtime Environment||Apache* Spark*, BigDL, Python...|
Next Steps for Developers
With open source software as a catalyst, Intel innovations in communications, storage/memory and computer processing can help you move faster, store more, and process everything, to turn the data deluge into value. Get your solutions development started with Intel® Analytics open source software.
Intel® AI Frameworks optimizations. Explore installation guides and other learning material available for popular Artificial Intelligence (AI) frameworks optimized on Intel architecture.
Intel® AI Academy. Get essential learning materials, tools and technology to boost your AI development.
Intel® Artificial Intelligence. Learn more about Intel technologies for analytics and AI.
Intel® Advanced Analytics. See what Intel architecture-based analytics solutions can do for business.
Make Business Smarter with Advanced Data Analytics. Learn how advanced analytics can help organizations create a competitive advantage in the new era of data-driven business.
Ziya Ma is vice president of the Intel Architecture, Graphics and Software group and director of Data Analytics Technologies in System Software Products at Intel Corp. Ma is responsible for optimizing big data solutions on the Intel® architecture platform, leading open source efforts in the Apache community, and bringing about optimal big data analytics experiences for customers. Her team works across Intel, the open source community, industry, and academia to further Intel’s leadership in big data analytics. Ma is a co-founder of the Women in Big Data forum. At the 2018 Global Women Economic Forum, she was honored as Women of the Decade in Data and Analytics. Ma received her bachelor’s degree in computer engineering from Hefei University of Technology in China and a master’s degree and Ph.D. in computer science and engineering from Arizona State University.
1. 8x (8/2/2018)
2., 3. 9x reads/11x users (5/24/2018)
Performance results are based on testing and may not reflect all publicly available security updates. No product can be absolutely secure. See detailed configurations in backup slides for details. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
 Spark SQL Configurations
|Hardware||DRAM||192 GB (12x 16 GB DDR4)||768 GB (24x 32 GB DDR4)|
|Apache Pass||1 TB (ES2: 8x 128 GB)||N/A|
|AEP Mode||App Direct (Memkind)||N/A|
|CPU||Worker: Intel® Xeon® Platinum 8170 @ 2.10GHz (Thread(s) per core: 2, Core(s) per socket: 26, Socket(s): 2, CPU max MHz: 3700.0000, CPU min MHz: 1000.0000, L1d cache: 32K, L1i cache: 32K, L2 cache: 1024K, L3 cache: 36608K)|
|OS||4.16.6-202.fc27.x86_64 (BKC:WWW26, BIOS: SE5C620.86B.01.00.0918.062020181644)|
|Software||OAP||1 TB AEP based OAP cache||620 GB DRAM based OAP cache|
|Hadoop||8 * HDD disk (ST1000NX0313, 1-replica uncompressed and plain encoded data on Hadoop)|
|Spark||1 * Driver (5 GB) + 2 * Executor (62 cores, 74 GB), spark.sql.oap.rowgroup.size=1MB|
|JDK||Oracle* JDK 1.8.0_161|
|Workload||Data Scale||2.6 TB (the data related to the 9 queries totals 729.4 GB)|
|TPC-DS Queries||9 I/O intensive queries (Q19,Q42,Q43,Q52,Q55,Q63,Q68,Q73,Q98)|
|Multi-Tenants||9 threads (Fair scheduled)|
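For readers who want to try a similar setup, the Spark rows of the table above can be written out as Spark configuration properties. The sketch below is a hedged reading of the table, not the configuration file used in the test: the mapping to `spark.driver.memory`, `spark.executor.instances`, `spark.executor.cores`, `spark.executor.memory`, and `spark.scheduler.mode` is an assumption, the core and memory figures are read here as per-executor values, and `to_submit_args` is a hypothetical helper. Only `spark.sql.oap.rowgroup.size` appears verbatim in the source.

```python
# Sketch: the table's Spark settings expressed as Spark properties.
# The row-to-property mapping is an assumption; only
# spark.sql.oap.rowgroup.size is taken verbatim from the table.

spark_conf = {
    "spark.driver.memory": "5g",           # 1 * Driver (5 GB)
    "spark.executor.instances": "2",       # 2 * Executor
    "spark.executor.cores": "62",          # 62 cores, assumed to be per executor
    "spark.executor.memory": "74g",        # 74 GB, assumed to be per executor
    "spark.scheduler.mode": "FAIR",        # assumed from "Fair scheduled"
    "spark.sql.oap.rowgroup.size": "1MB",  # as listed in the table
}

def to_submit_args(conf):
    """Render a property dict as a list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

submit_args = to_submit_args(spark_conf)
print(" ".join(submit_args))
```

Keeping the settings in a dictionary makes it easy to vary one value at a time, for example the executor memory, when comparing runs.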
Apache Cassandra configurations
|Server||Hardware||System Details||Intel® Server Board Purely Platform (2 socket)|
|CPU||Dual Intel® Xeon® Platinum 8180 Processors, 28 core/socket, 2 sockets, 2 threads per core|
|DRAM||DDR4 dual rank 192GB total = 12 DIMMs 16GB@2667Mhz||DDR4 dual rank 384GB total = 12 DIMMs 32GB@2667Mhz|
|Apache Pass||N/A||AEP ES.2 1.5TB total = 12 DIMMs * 128GB Capacity each: Single Rank, 128GB, 15W|
|Apache Pass Mode||N/A||App-Direct|
|NVMe||4 x Intel P3500 1.6TB NVMe devices||N/A|
|Network||10Gbit on board Intel NIC|
|Cassandra Version||3.11.2 release||Cassandra 4.0 trunk, with App Direct patch version 2.1, software found at https://github.com/shyla226/cassandra/tree/13981 with PCJ library: https://github.com/pmem/pcj|
|JDK||Oracle Hotspot JDK (JDK1.8 u131)|
|Spectre/Meltdown Compliant||Patched for variants 1/2/3|
|Cassandra Parameters||Number of Cassandra Instances||1||14|
|Cluster Nodes||One per Cluster|
|JVM Options (difference from default)||-Xms64G|
|DataBase Size per Instance||1.25 Billion entries||100 K entries|
|Clients||Hardware||Number of Client machines||1||2|
|System||Intel® Server Board model S2600WFT (2 socket)|
|CPU||Dual Intel® Xeon® Platinum 8176M CPU @ 2.1Ghz, 28 core/socket, 2 sockets, 2 threads per core|
|DRAM||DDR4 384GB total = 12 DIMMs 32GB@2666Mhz|
|Network||10Gbit on board Intel NIC|
|JDK||Oracle Hotspot JDK (JDK1.8 u131)|
|Command line to write database||cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(insert=1\) n=1250000000 cl=ONE no-warmup -pop seq=1..1250000000 -mode native cql3 -node <ip_addr> -rate threads=10||cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(insert=1\) n=100000 cl=ONE no-warmup -pop seq=1..100000 -mode native cql3 -node <ip_addr> -rate threads=10|
|Command line to read database||cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(simple1=1\) duration=10m cl=ONE no-warmup -pop dist=UNIFORM\(1..1250000000\) -mode native cql3 -node <ip_addr> -rate threads=300||cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(simple1=1\) duration=3m cl=ONE no-warmup -pop dist=UNIFORM\(1..100000\) -mode native cql3 -node <ip_addr> -rate threads=320|
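The two read commands in the table differ only in population size, duration, and thread count. A small sketch of how they could be assembled as argument lists follows. `build_read_command` is a hypothetical helper introduced for illustration; every flag and value is copied from the table, and the node address is left as the table's `<ip_addr>` placeholder.

```python
# Sketch: assemble the table's cassandra-stress read commands as argument lists.
# build_read_command is a hypothetical helper; flags and values come from the table.

PROFILE = "/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml"

def build_read_command(node, entries, duration, threads):
    """Return the read command as an argv-style list (no shell quoting needed)."""
    return [
        "cassandra-stress", "user",
        f"profile={PROFILE}",
        "ops(simple1=1)",  # the table's backslashes are shell escapes only
        f"duration={duration}",
        "cl=ONE", "no-warmup",
        "-pop", f"dist=UNIFORM(1..{entries})",
        "-mode", "native", "cql3",
        "-node", node,
        "-rate", f"threads={threads}",
    ]

# First value column of the table: 1.25 billion entries, 10 minutes, 300 threads.
baseline_cmd = build_read_command("<ip_addr>", 1_250_000_000, "10m", 300)
# Second value column of the table: 100 K entries, 3 minutes, 320 threads.
app_direct_cmd = build_read_command("<ip_addr>", 100_000, "3m", 320)

print(" ".join(baseline_cmd))
print(" ".join(app_direct_cmd))
```

Passing a list like this to a process launcher avoids the shell, so the parentheses need no backslash escaping. Replace `<ip_addr>` with a real node address before running.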