Deliver the Value of Analytics & AI at Scale with Intel®-optimized Open Source Software

How will organizations turn the deluge of data into value and a sustainable competitive advantage? Intel® technology innovations use open source software as a catalyst to help deliver that value. Read on to learn how Intel®-optimized Apache Spark*, BigDL, and Analytics Zoo can bring greater capacity and performance to your big data analytics and AI solutions.

Open Source Software for Data Analytics and AI

As an open source software leader, Intel is the number-one upstream contributor to the Linux* kernel, and we have provided a steady stream of source code contributions and optimizations for Spark*, the emerging unified analytics and AI engine for large-scale data processing.

Spark fills a role at the intersection of AI, streaming analytics, and batch analytics, offering ease-of-use for developers writing in Java*, Scala*, or Python*. Adding even more value to the Spark platform, Intel open source contributions also include the Optimized Analytics Package to accelerate Spark queries, the BigDL deep learning library/framework and the Analytics Zoo analytics and AI platform for Apache Spark and BigDL.

A Key Trend: Fast Data

Companies are transitioning from big data to fast data. Fast data is data in motion. It provides the ability to learn continuously, make decisions, and take actions as soon as data arrives, typically in milliseconds.

Imagine the scenario for a credit card company when a person swipes a card to purchase something: analytics or AI applications need to immediately run hundreds of input variables such as location, time, recent purchases, and previous transactions through complex logic to determine whether to approve or decline the transaction—all within milliseconds.
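A minimal sketch of this kind of in-line decision, assuming a handful of made-up signals rather than the hundreds of input variables mentioned above. The feature names, weights, threshold, and the 50 ms budget are all hypothetical; the sketch only illustrates the shape of the logic: score the transaction, approve or decline, and track how long the decision took.

```python
import time

# Hypothetical latency budget; the text only says "within milliseconds".
LATENCY_BUDGET_MS = 50.0


def score_transaction(txn, history):
    """Return a risk score in [0, 1] from a few illustrative signals."""
    score = 0.0
    amounts = [h["amount"] for h in history]
    average = sum(amounts) / len(amounts) if amounts else 0.0
    # Signal 1: purchase is far larger than the recent average (hypothetical rule).
    if average and txn["amount"] > 5 * average:
        score += 0.5
    # Signal 2: location differs from the most recent transaction.
    if history and txn["country"] != history[-1]["country"]:
        score += 0.3
    # Signal 3: unusual time of day (hypothetical rule).
    if txn["hour"] < 5:
        score += 0.2
    return min(score, 1.0)


def decide(txn, history, threshold=0.7):
    """Approve or decline, and report whether the decision met the budget."""
    start = time.perf_counter()
    score = score_transaction(txn, history)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    decision = "decline" if score >= threshold else "approve"
    return decision, elapsed_ms <= LATENCY_BUDGET_MS


history = [
    {"amount": 20.0, "country": "US"},
    {"amount": 30.0, "country": "US"},
]
suspicious = {"amount": 500.0, "country": "BR", "hour": 3}
ordinary = {"amount": 25.0, "country": "US", "hour": 14}
```

Calling `decide(suspicious, history)` trips all three signals and declines, while `decide(ordinary, history)` trips none and approves. A production system would replace the hand-written rules with a trained model, but the latency budget applies in the same way.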

Implementing such a use case can present an extreme processing bottleneck. Many learning algorithms iterate a computation over a training dataset and update the model parameters until the model converges. To accelerate training performance, it’s common to cache the huge dataset and parameters into memory. However, memory constraints are a common challenge.
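The iterate-until-convergence pattern can be sketched in a few lines. This is an illustration only, assuming a tiny synthetic dataset and plain batch gradient descent for a one-variable linear model; the function name, learning rate, and tolerance are hypothetical choices, not anything prescribed by Spark or BigDL. The point is that every iteration re-reads the whole dataset, which is why keeping it in memory matters (in Spark, a dataset is typically kept in memory with `cache()`).

```python
def train_linear(data, lr=0.02, tol=1e-8, max_iters=50_000):
    """Fit y ~ w*x + b by batch gradient descent over an in-memory dataset.

    Each iteration scans every (x, y) pair, so the cost of one pass over
    the data is paid once per iteration until the parameters stop moving.
    """
    w, b = 0.0, 0.0
    n = len(data)
    for iteration in range(1, max_iters + 1):
        # Full pass over the cached dataset to compute the gradient of the MSE.
        errors = [(w * x + b - y, x) for x, y in data]
        grad_w = 2.0 * sum(e * x for e, x in errors) / n
        grad_b = 2.0 * sum(e for e, _ in errors) / n
        new_w, new_b = w - lr * grad_w, b - lr * grad_b
        # Converged when neither parameter moved by more than the tolerance.
        converged = abs(new_w - w) < tol and abs(new_b - b) < tol
        w, b = new_w, new_b
        if converged:
            return w, b, iteration
    return w, b, max_iters


# Synthetic training set held entirely in memory: y = 2x + 1.
cached_dataset = [(float(x), 2.0 * x + 1.0) for x in range(10)]
w, b, iterations = train_linear(cached_dataset)
```

Even this toy problem needs many full passes over the data before it converges. If each pass had to reload the dataset from disk, the I/O cost would be multiplied by the iteration count, which is the bottleneck the paragraph above describes.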

That is exactly why we believe Intel® Optane™ DC persistent memory can be a real game changer for fast data. Our benchmark testing shows that Spark SQL (Spark's module for working with structured data) performs eight times faster [1] at a 2.6 terabyte (TB) data scale using Intel Optane DC persistent memory versus a comparable system using dynamic random-access memory (DRAM) dual in-line memory modules (DIMMs) and solid-state drives (SSDs) [4]. Even greater improvements were noted with the Apache Cassandra* not only SQL (NoSQL) database [2, 3].

Intel Software Advancements

Intel advancements for analytics workloads go beyond our silicon innovations to include in-memory database optimizations and upstream contributions to numerous open source projects.

As an ecosystem leader and open source software contributor, Intel aims to optimize all major deep learning frameworks and topologies, including TensorFlow*, Caffe*, MXNet*, and Chainer*, to run well on Intel® architecture.

As a top contributor to Apache Spark, Intel open sourced the BigDL deep learning library/framework and Analytics Zoo.

BigDL was created natively for Apache Spark, which makes it very easy to perform deep learning model training and inference on existing Intel® Xeon® processor-based big data clusters. It is highly optimized through the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). BigDL is the latest software to be included in the Intel® Select Solutions family to deliver faster, easier, optimized capabilities that are pre-tested and verified by Intel and our ecosystem partners.

To unify analytics and AI on one platform, we recently open sourced Analytics Zoo. It unites Spark, TensorFlow, Keras, and BigDL programs into one pipeline. The entire pipeline can transparently scale out to a large Spark/Hadoop cluster for distributed training or inference. In addition, it includes high-level pipeline APIs, built-in pre-trained deep learning models, and reference use cases, making it an end-to-end analytics and AI platform.

Analytics + Pipelines for Apache Spark and BigDL

Reference Use Cases (pre-built end-to-end pipeline): Anomaly detection, sentiment analysis, fraud detection, chatbot, sequence prediction...
Built-in Algorithms and Models: Image classification, object detection, text classification, recommendations, GAN...
Feature Engineering and Transformations: Image, text, speech, 3D imaging, time series...
High-level Pipeline APIs: DataFrames, ML Pipelines, Autograd, Keras/Keras2, Transfer Learning...
Runtime Environment: Apache* Spark*, BigDL, Python...

Next Steps for Developers

With open source software as a catalyst, Intel innovations in communications, storage/memory, and computer processing can help you move faster, store more, and process everything, to turn the data deluge into value. Get your solutions development started with Intel® Analytics open source software.

Related Content

Intel® AI Frameworks optimizations. Explore installation guides and other learning material available for popular Artificial Intelligence (AI) frameworks optimized on Intel architecture.

Intel® AI Academy. Get essential learning materials, tools and technology to boost your AI development.

Intel® Artificial Intelligence. Learn more about Intel technologies for analytics and AI.

Intel® Advanced Analytics. See what Intel architecture based analytics solutions can do for business.

Make Business Smarter with Advanced Data Analytics. Learn how advanced analytics can help organizations create a competitive advantage in the new era of data-driven business.

Author

Ziya Ma is vice president of the Intel Architecture, Graphics and Software group and director of Data Analytics Technologies in System Software Products at Intel Corp. Ma is responsible for optimizing big data solutions on the Intel® architecture platform, leading open source efforts in the Apache community, and bringing about optimal big data analytics experiences for customers. Her team works across Intel, the open source community, industry, and academia to further Intel’s leadership in big data analytics. Ma is a co-founder of the Women in Big Data forum. At the 2018 Global Women Economic Forum, she was honored as Women of the Decade in Data and Analytics. Ma received her bachelor’s degree in computer engineering from Hefei University of Technology in China and a master’s degree and Ph.D. in computer science and engineering from Arizona State University.

Footnotes

1. 8x (8/2/2018)

2., 3. 9x reads/11x users (5/24/2018)

Performance results are based on testing and may not reflect all publicly available security updates. No product can be absolutely secure. See detailed configurations in backup slides for details. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

[4] Spark SQL Configurations

| Category | Setting | AEP | DRAM |
|---|---|---|---|
| Hardware | DRAM | 192 GB (12x 16 GB DDR4) | 768 GB (24x 32 GB DDR4) |
| Hardware | Apache Pass | 1 TB (ES2: 8x 128 GB) | N/A |
| Hardware | AEP Mode | App Direct (Memkind) | N/A |
| Hardware | SSD | N/A | N/A |
| Hardware | CPU | Worker: Intel® Xeon® Platinum 8170 @ 2.10GHz (Thread(s) per core: 2, Core(s) per socket: 26, Socket(s): 2, CPU max MHz: 3700.0000, CPU min MHz: 1000.0000, L1d cache: 32K, L1i cache: 32K, L2 cache: 1024K, L3 cache: 36608K) | Same as AEP |
| Hardware | OS | 4.16.6-202.fc27.x86_64 (BKC:WWW26, BIOS: SE5C620.86B.01.00.0918.062020181644) | Same as AEP |
| Software | OAP | 1 TB AEP based OAP cache | 620 GB DRAM based OAP cache |
| Software | Hadoop | 8 * HDD disk (ST1000NX0313, 1-replica uncompressed and plain encoded data on Hadoop) | Same as AEP |
| Software | Spark | 1 * Driver (5 GB) + 2 * Executor (62 cores, 74 GB), spark.sql.oap.rowgroup.size=1MB | Same as AEP |
| Software | JDK | Oracle* JDK 1.8.0_161 | Same as AEP |
| Workload | Data Scale | 2.6 TB (data related to the 9 queries totals 729.4 GB) | Same as AEP |
| Workload | TPC-DS Queries | 9 I/O intensive queries (Q19, Q42, Q43, Q52, Q55, Q63, Q68, Q73, Q98) | Same as AEP |
| Workload | Multi-Tenants | 9 threads (Fair scheduled) | Same as AEP |
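As a rough aid to reading the configuration above, the following sketch compares the two OAP cache sizes with the 729.4 GB of data that the nine queries touch. It is simple capacity arithmetic under stated assumptions: "1 TB" is treated as 1000 GB, the helper name is hypothetical, and real cache behavior depends on many factors not modeled here, so it should not be read as an explanation of the measured speedup.

```python
# Figures taken from the Spark SQL configuration above.
QUERY_DATA_GB = 729.4   # data related to the 9 TPC-DS queries
AEP_CACHE_GB = 1000.0   # "1 TB" AEP based OAP cache, assuming decimal units
DRAM_CACHE_GB = 620.0   # DRAM based OAP cache


def cache_coverage(cache_gb, working_set_gb):
    """Fraction of the working set that could fit in the cache (capped at 1.0)."""
    return min(cache_gb / working_set_gb, 1.0)


aep_coverage = cache_coverage(AEP_CACHE_GB, QUERY_DATA_GB)
dram_coverage = cache_coverage(DRAM_CACHE_GB, QUERY_DATA_GB)
dram_shortfall_gb = QUERY_DATA_GB - DRAM_CACHE_GB

print(f"AEP cache coverage:  {aep_coverage:.0%}")
print(f"DRAM cache coverage: {dram_coverage:.0%}")
print(f"Data that cannot fit in the DRAM cache: {dram_shortfall_gb:.1f} GB")
```

Under these assumptions the query data fits entirely in the AEP based cache, while the DRAM based cache can hold about 85% of it, leaving roughly 109 GB that would have to be read from the HDD-backed Hadoop storage.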

Apache Cassandra configurations

Server

| Category | Setting | NVMe | Apache Pass |
|---|---|---|---|
| Hardware | System Details | Intel® Server Board Purely Platform (2 socket) | Same as NVMe |
| Hardware | CPU | Dual Intel® Xeon® Platinum 8180 Processors, 28 core/socket, 2 sockets, 2 threads per core | Same as NVMe |
| Hardware | Hyper-Threading | Enabled | Same as NVMe |
| Hardware | DRAM | DDR4 dual rank 192GB total = 12 DIMMs 16GB@2667MHz | DDR4 dual rank 384GB total = 12 DIMMs 32GB@2667MHz |
| Hardware | Apache Pass | N/A | AEP ES.2 1.5TB total = 12 DIMMs * 128GB Capacity each: Single Rank, 128GB, 15W |
| Hardware | Apache Pass Mode | N/A | App-Direct |
| Hardware | NVMe | 4 x Intel P3500 1.6TB NVMe devices | N/A |
| Hardware | Network | 10Gbit on board Intel NIC | Same as NVMe |
| Software | OS | Fedora 27 | Same as NVMe |
| Software | Kernel | 4.16.6-202.fc27.x86_64 | Same as NVMe |
| Software | Cassandra Version | 3.11.2 release | Cassandra 4.0 trunk, with App Direct patch version 2.1, software found at https://github.com/shyla226/cassandra/tree/13981 with PCJ library: https://github.com/pmem/pcj |
| Software | JDK | Oracle Hotspot JDK (JDK1.8 u131) | Same as NVMe |
| Software | Spectre/Meltdown Compliant | Patched for variants 1/2/3 | Same as NVMe |
| Cassandra Parameters | Number of Cassandra Instances | 1 | 14 |
| Cassandra Parameters | Cluster Nodes | One per Cluster | Same as NVMe |
| Cassandra Parameters | Garbage Collector | CMS | Parallel |
| Cassandra Parameters | JVM Options (difference from default) | -Xms64G -Xmx64G | -Xms20G -Xmx20G -Xmn8G -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=5 |
| Cassandra Parameters | Schema | cqlstress-insanity-example.yaml | Same as NVMe |
| Cassandra Parameters | DataBase Size per Instance | 1.25 Billion entries | 100 K entries |

Clients and workload

| Category | Setting | Value |
|---|---|---|
| Client Hardware | Number of Client machines | 12 |
| Client Hardware | System | Intel® Server Board model S2600WFT (2 socket) |
| Client Hardware | CPU | Dual Intel® Xeon® Platinum 8176M CPU @ 2.1GHz, 28 core/socket, 2 sockets, 2 threads per core |
| Client Hardware | DRAM | DDR4 384GB total = 12 DIMMs 32GB@2666MHz |
| Client Hardware | Network | 10Gbit on board Intel NIC |
| Client Software | OS | Fedora 27 |
| Client Software | Kernel | 4.16.6-202.fc27.x86_64 |
| Client Software | JDK | Oracle Hotspot JDK (JDK1.8 u131) |
| Workload | Benchmark | Cassandra-Stress |
| Workload | Cassandra-Stress Instances | 12 |

Command line to write database (NVMe):
cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(insert=1\) n=1250000000 cl=ONE no-warmup -pop seq=1..1250000000 -mode native cql3 -node <ip_addr> -rate threads=10

Command line to write database (Apache Pass):
cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(insert=1\) n=100000 cl=ONE no-warmup -pop seq=1..100000 -mode native cql3 -node <ip_addr> -rate threads=10

Command line to read database (NVMe):
cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(simple1=1\) duration=10m cl=ONE no-warmup -pop dist=UNIFORM\(1..1250000000\) -mode native cql3 -node <ip_addr> -rate threads=300

Command line to read database (Apache Pass):
cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops\(simple1=1\) duration=3m cl=ONE no-warmup -pop dist=UNIFORM\(1..100000\) -mode native cql3 -node <ip_addr> -rate threads=320

For more complete information about compiler optimizations, see our Optimization Notice.