Performance and Agility with Big Data in a Containerized Environment

There is great business value in the insights that can be gained from analyzing large data sets with Apache* Hadoop*, Apache* Spark*, and other Big Data frameworks. But because these data sets are so large, jobs can take hours to execute on large compute clusters. Those compute resources are expensive, and for a fixed volume of data the cost of a job is inversely proportional to throughput: the faster the job completes, the less it costs. So performance is of the utmost importance.
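That cost relationship is simple enough to sketch. The snippet below is purely illustrative, with hypothetical cluster pricing and throughput numbers (none of these figures come from the benchmark discussed later):

```python
# Illustrative only: hypothetical pricing, not benchmark data.
def job_cost(data_tb, throughput_tb_per_hour, cluster_cost_per_hour):
    """Cost of processing a fixed data volume: runtime * hourly rate.

    Runtime = data volume / throughput, so for a fixed volume the
    cost is inversely proportional to throughput.
    """
    runtime_hours = data_tb / throughput_tb_per_hour
    return runtime_hours * cluster_cost_per_hour

# Doubling throughput halves the cost of the same 10 TB job.
baseline = job_cost(10, 2.0, 100.0)   # 5 hours at $100/hr = $500
faster = job_cost(10, 4.0, 100.0)     # 2.5 hours at $100/hr = $250
```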

To ensure the highest possible performance, most enterprises have deployed their on-premises Big Data analytics using bare-metal physical servers. Until recently, many IT departments were reluctant to use virtual machines or containers for their Big Data implementations. This is largely due to the processing overhead and input/output (I/O) latency that is typically associated with virtualization or containerization.

As a result, most on-premises Big Data initiatives have been limited in terms of agility. Deployments on a traditional bare-metal setup often take weeks or even months to implement. This has impacted the adoption of Hadoop as well as Spark and other Big Data deployments in the enterprise. The need for greater agility has also led more analysts and data scientists to use the public cloud for Big Data – despite any potential performance loss that may entail, since most cloud services run on virtual machines.

Intel Collaboration with BlueData*

About one and a half years ago, Intel announced our investment and collaboration agreement with BlueData to address these challenges. BlueData’s EPIC* software platform uses Docker* containers to help accelerate Big Data deployments – leveraging the inherent agility and deployment flexibility of containers. Container-based clusters in the BlueData platform look and feel like standard physical clusters in a bare-metal deployment, with no modification to Hadoop or other Big Data frameworks. The platform can be implemented on-premises, in the public cloud, or in a hybrid architecture.

With BlueData, enterprises can quickly and easily deploy Big Data – providing a Big-Data-as-a-Service experience with self-service, elastic, and on-demand Hadoop or Spark clusters – while at the same time reducing costs. And the BlueData platform is specifically tailored to the performance needs of Big Data. For example, BlueData boosts the I/O performance and scalability of container-based clusters with hierarchical data caching and tiering. It also allows multiple user groups to securely share the same cluster resources, avoiding the complexity of each group requiring its own dedicated Big Data infrastructure.  

As part of our strategic technology and business collaboration, Intel has helped to test, benchmark, and enhance the BlueData EPIC platform to ensure flexible, elastic, and high-performance Big Data deployments. We’ve worked closely with BlueData to prove – using validated and quantified benchmarking results – that their software innovations can deliver performance comparable to bare-metal deployments for Hadoop, Spark, and other Big Data workloads.

Benchmark Performance Results

Intel ran benchmark tests to determine the performance of on-premises Big Data workloads running on BlueData (using containers) versus the same workloads running on a bare-metal environment. The most recent tests were performed using the BigBench benchmarking kit – with identical configurations on Intel® Xeon® processor-based architecture for both test environments to ensure an apples-to-apples comparison.

This in-depth study shows that performance ratios for container-based Hadoop workloads on BlueData EPIC are equal to (or, in some cases, slightly better than) bare-metal Hadoop. For example, the BlueData EPIC platform demonstrated an average 2.33% performance gain over bare metal when benchmarked with 50 Hadoop compute nodes and 10 terabytes of data. This is a great milestone, and the result of an ongoing collaboration between the Intel and BlueData software engineering teams.
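For readers who want to see what a figure like that means in practice, here is a minimal sketch of how a per-workload performance gain and its average might be computed from run times. The function and the timings below are hypothetical illustrations, not the published BigBench results:

```python
# Illustrative only: the query timings here are hypothetical,
# not the published BigBench measurements.
def performance_gain(bare_metal_secs, container_secs):
    """Percent gain of the containerized run versus bare metal.

    A positive value means the container-based run finished faster
    than the bare-metal run for the same workload.
    """
    return (bare_metal_secs / container_secs - 1.0) * 100.0

# Hypothetical per-query run times (seconds): (bare metal, containers).
queries = [(120.0, 118.0), (300.0, 294.0), (95.0, 96.0)]
gains = [performance_gain(bm, c) for bm, c in queries]
average_gain = sum(gains) / len(gains)
```

An average gain near zero means the two environments are effectively at parity; a small positive average, as reported in the study, means the containerized runs were on balance slightly faster.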

This means that enterprises no longer need to choose between performance and agility.  Now they can have it all ... ensuring both performance AND agility for Big Data analytics in an on-premises deployment. With BlueData EPIC software and Intel Xeon processors, they gain the flexibility and cost-efficiency benefits of Docker containers – while ensuring bare-metal performance. Data science teams can benefit from on-demand access to their Big Data environments, while leveraging enterprise-grade data governance and security in a multi-tenant architecture. As a result, BlueData EPIC software running on Intel architecture is becoming the solution stack of choice for many Big Data initiatives.

To learn about the performance benchmark results, download this new Intel white paper: Bare-metal performance for Big Data workloads on Docker containers.

*Other names and brands may be claimed as the property of others.

- by Michael Greene, vice president and general manager of System Technologies and Optimization in the Software and Services Group, Intel Corporation. Follow me on Twitter and join the conversation with me @Greene1of5.

For more complete information about compiler optimizations, see our Optimization Notice.