As any good engineer knows, “if you cannot measure it, you cannot improve it.” And a representative benchmark suite is key to measuring any computer system. That’s exactly why we have constructed HiBench, a Hadoop* benchmark suite consisting of both micro-benchmarks and real-world applications.
We released HiBench 2.1 to open source during Hadoop Summit 2012, and have received a lot of good feedback from our users. (If you have not done so already, please check out our website at https://github.com/intel-hadoop/hibench and give it a try.) One of the most frequently asked questions about HiBench is how this approach differs from trace-based workloads (such as GridMix3). In this post, I’ll discuss some of the tradeoffs between the two approaches.
First, a brief overview of trace-based workloads. These workloads aim to model the characteristics of a Hadoop cluster by (1) collecting traces of the jobs running in the cluster over an extended period of time, (2) synthesizing a workload based on those traces, and (3) executing the workload by replaying the synthesized traces. There are obvious pros and cons to this tracing-synthesis-replay approach. It is tempting to capture traces of the entire job mix in your production Hadoop cluster and replay them for benchmarking and analysis – nothing is more representative than a production cluster. On the other hand, the major issue is that this approach is only as good as the model it uses to synthesize Hadoop jobs into traces, and it is almost always impossible to design a model that precisely captures all the intrinsic properties of a Hadoop job.
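To make the tracing-synthesis-replay idea concrete, here is a minimal Python sketch. It is not actual GridMix3 code – the trace fields and the `submit` callback are hypothetical stand-ins – but it shows the shape of the replay step: walk the synthesized trace in submission order and launch a synthetic job at each recorded offset (scaled down here so the sketch runs quickly):

```python
import time

# Hypothetical, simplified trace; in GridMix3 each entry would come from
# Rumen's analysis of the JobTracker history logs.
trace = [
    {"job": "job_0001", "submit_offset_s": 0.0, "maps": 4, "reduces": 1},
    {"job": "job_0002", "submit_offset_s": 0.5, "maps": 8, "reduces": 2},
]

def replay(trace, submit, speedup=100.0):
    """Replay a synthesized trace: wait until each job's recorded
    submission offset (scaled by `speedup`), then submit a synthetic job."""
    start = time.monotonic()
    submitted = []
    for entry in sorted(trace, key=lambda e: e["submit_offset_s"]):
        delay = entry["submit_offset_s"] / speedup - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        submitted.append(submit(entry))
    return submitted

# A stand-in "submit" that just records the job name; a real replayer
# would launch a synthetic MapReduce job with the recorded task counts.
names = replay(trace, submit=lambda e: e["job"])
```

The fidelity of the whole exercise rests on what the trace entries record – which is exactly the modeling limitation discussed above.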
For instance, in GridMix3, the trace is generated by Rumen, which mines the job execution history logs collected by the JobTracker. For each job, it records a wealth of information, such as the job submission time, the numbers of map/reduce tasks, the numbers of bytes and records read/written by each task, and so on. The framework then replays the trace by running synthetic Hadoop jobs that simulate the recorded information. However, with this information alone, the synthetic jobs can only simulate the I/O load. To emulate other resource usage (e.g., CPU and memory), Rumen also collects the CPU and memory usage of each task, and when replaying the traces, each task in the synthetic jobs launches a separate thread to emulate that resource usage. Unfortunately, even this approach does not precisely capture all the intrinsic properties of a Hadoop job.
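As an illustration of the emulation mechanism described above, here is a hedged Python sketch (real GridMix3 synthetic tasks are Java MapReduce tasks; the record fields and function names here are hypothetical) of a synthetic task that reproduces the recorded I/O volume while a background thread burns CPU for the recorded duration:

```python
import threading
import time

def emulate_cpu(target_seconds):
    """Busy-loop for roughly `target_seconds` of wall time, mimicking how a
    synthetic task spins a thread to approximate recorded CPU usage."""
    deadline = time.monotonic() + target_seconds
    x = 0
    while time.monotonic() < deadline:
        x += 1  # meaningless arithmetic, just to burn cycles
    return x

def run_synthetic_task(record):
    """Simulate the recorded I/O by allocating/writing the recorded number
    of bytes, while a background thread emulates the recorded CPU time."""
    cpu = threading.Thread(target=emulate_cpu, args=(record["cpu_seconds"],))
    cpu.start()
    data = bytearray(record["bytes_written"])  # stand-in for HDFS writes
    cpu.join()
    return len(data)

written = run_synthetic_task({"cpu_seconds": 0.05, "bytes_written": 1024})
```

Note what this emulation misses: the busy loop reproduces the *amount* of CPU time but not the actual computation, memory access patterns, or cache behavior of the real job – which is the limitation being argued here.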
In a sense, the only way to precisely capture all the intrinsic properties of Hadoop applications is to run the actual application code; and this is exactly what some big Hadoop users do in their benchmarking – re-running a day’s worth of jobs that were actually executed on their Hadoop cluster. Unfortunately, this approach does not scale, and those actual Hadoop jobs are not publicly available; trace-based workloads such as GridMix3 help address this issue by synthesizing the actual Hadoop jobs into traces for replay. There is nothing wrong with them, but we need to be aware of the tradeoffs they have made and understand their limitations.
HiBench takes a different approach – it provides a collection of real-world Hadoop applications that are both representative and diverse, so that users can understand their characteristics by running the actual application code on the underlying Hadoop cluster. On the other hand, users typically run each workload in HiBench in isolation, as it currently does not provide an obvious way to generate a representative mix of Hadoop jobs. One approach we have experimented with before is to continuously submit Hadoop jobs so as to saturate the cluster and understand its peak performance.
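That saturation experiment can be sketched as a simple bounded submission loop. This is a minimal Python sketch, assuming a generic `job` callable that stands in for launching a HiBench workload (not actual HiBench code); it keeps up to `max_in_flight` jobs running at once so the cluster stays busy:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def saturate(job, n_jobs, max_in_flight):
    """Keep up to `max_in_flight` copies of `job` running until `n_jobs`
    have completed -- a simple way to drive a cluster at peak load."""
    results = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(job, i) for i in range(n_jobs)]
        for f in as_completed(futures):
            results.append(f.result())
    return results

# Stand-in job; in practice this would invoke a HiBench workload script
# and return its measured throughput or duration.
results = saturate(lambda i: i * i, n_jobs=8, max_in_flight=3)
```

With a real workload launcher plugged in, the aggregate throughput under this steady backlog approximates the cluster’s peak performance.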
For more details and example results of HiBench, please refer to (an updated version of) our WBDB2012 paper in the attachment, as well as our “Optimizing Hadoop Deployments” whitepaper (http://communities.intel.com/docs/DOC-5645).
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804