Big Data & Analytics

Gathering and analyzing “big data” helps you forecast market conditions, make critical decisions, and gain insight into your customers.


Apache Hadoop*

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
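
For Hadoop, a "simple programming model" most commonly means MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process to show the shape of the model. It is not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between the phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce functions would run in parallel on many machines, each working on the data stored locally. Here everything runs in one process so that the flow of data is easy to follow.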

Apache Hadoop Project


Hadoop Optimization

Optimizing Java* and Apache Hadoop* for Intel® Architecture (PDF)
Because Apache Hadoop is built on Java*, one of the most effective ways to increase performance is to optimize Java itself to take advantage of Intel architecture enhancements.

Performance for Server-Side Java* Applications (PDF)
This paper describes the key architectural advancements of the latest Intel® Xeon® processors and Intel® Atom™ processors C2000 that are beneficial to Java applications, and discusses fast, cost-effective ways to maximize the performance of Java applications.

Big Data Analysis (PDF)
Scale up Apache Hadoop clusters to handle the increasing volume, variety, and velocity of data. Learn how to reduce the complexity and total cost of ownership (TCO) of your clusters by using fewer, more powerful servers.

Accelerating Hadoop* Performance (Video)
This presentation from the 2014 Intel® Developer Forum (IDF) covers technical details and best known methods for optimizing big data clusters and Hadoop* workloads on platforms based on the Intel® Xeon® processor E5 v3.

Tuning Java Garbage Collection for HBase*
A look at how to tune Java garbage collection (GC) for HBase, focusing on a workload of 100% YCSB reads.

Measure Ceph* RBD Performance in a Quantitative Way
To better understand Ceph performance and identify future optimization opportunities, we ran many experiments with different workloads and I/O patterns.
Part I - Random IO Performance on Ceph
Part II - Sequential Read/Write


Hadoop Storage

Virtual Storage Manager
This web-based management application for Ceph storage systems creates, manages, and monitors a Ceph cluster. It simplifies the creation and day-to-day management of a Ceph cluster for cloud and data center storage administrators.

Physical Server Provisioning with OpenStack*
Explore the internal details of provisioning a physical machine and setting it up on OpenStack*.

Single-Node Encryption Performance (Case Study)
Eddie Garcia, Cloudera chief security architect, explains how Intel® Solid State Drive Pro 3700 series storage can be used to improve I/O throughput and performance.


Apache Spark*

Apache Spark is a fast, general engine for large-scale data processing that runs programs up to 100 times faster than Hadoop MapReduce in memory, or up to ten times faster on disk. Write applications quickly in Java, Scala, Python, and R, and combine SQL, streaming, and complex analytics.

Get Apache Spark


Spark News Spotlight

Hands-on Hive-on-Spark in the AWS* Cloud
The Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292, which is one of the most popular JIRAs in the Hadoop ecosystem.
Overview
Demo - Hive on Spark
Jira - HIVE-7292


Optimizing Spark Projects

Gearpump, the Real-Time Big Data Streaming Engine
Gearpump adds a key ingestion capability to the Trusted Analytics Platform (TAP). It handles use cases that involve either complicated workflows or low-latency, fault-tolerant processing of many types of ingestion streams.
Program Details
GitHub Repository

Large-Scale Graph Analysis using GraphX (PDF)
Read about the lessons learned while building real-world, large-scale graph analysis applications using GraphX for some of the largest organizations and websites in the world, including both algorithm-level and framework-level optimizations.

Innovation: Driving a Stronger Community Standard
Apache Spark complements the existing Hadoop ecosystem by adding easy-to-use APIs and data-pipelining capabilities to Hadoop data. Since its launch in 2009, Spark has attracted more than 400 contributors from more than 50 different companies.

StreamSQL on Spark (Video)
This presentation shows Intel's implementation of StreamSQL, built on the Spark Streaming and Catalyst modules, which lets SQL users pick up stream processing with ease. Find out what StreamSQL is and what benefits it offers.

Download the presentation (PDF)

Building Real-World Spark Applications (PDF)
Explore what we've learned about managing memory and networks, improving disk I/O, and optimizing computations in real-world Spark applications.