As any good engineer knows, “if you cannot measure it, you cannot improve it.” And a representative benchmark suite is the key for measuring any computer systems. That’s exactly why we have constructed HiBench, a Hadoop benchmark suite consisting of both micro-benchmarks and real world applications, including:
- Search Indexing
- Nutch Indexing
- Page Rank
- Machine Learning
- Bayesian Classification
- K-Means Clustering
- Analytical Query
- Hive Join
- Hive Aggregation
We released HiBench 2.1 to open source during Hadoop Summit 2012, and got a lot of good feedbacks from our users. (If you have not done so, please check our website at https://github.com/intel-hadoop/hibench and give it a try). One of the frequently asked questions about HiBench is how this approach is different from other trace-based workloads (such as GridMix3). In this post, I’ll discuss some tradeoffs between these two approaches.
First, some brief overview of trace-based workloads. These workloads intend to model the characteristics of a Hadoop cluster by (1) collecting traces of the jobs in the Hadoop cluster over an extended period of time, (2) synthesizing the workload based on the traces, and (3) executing the workloads via replaying the synthesized traces. There are obviously pros and cons with this tracing-synthesis-replay approach. It is often tempting to get the traces of all the job mix in your production Hadoop cluster, and be able to replay these production traces for benchmarking/analysis – nothing is more representative than a production Hadoop cluster. On the other hand, the major issue here is that this approach is only as good as the model it uses to synthesize Hadoop jobs into traces, and it is always almost impossible to design an appropriate model that precisely captures all the intrinsic properties of a Hadoop job.
For instance, in GridMix3, the trace is generated by Rumen which mines the job execution history logs collected by the JobTracker. For each job, it records a lot of information about the job, such as job submission time, numbers of map/reduce tasks, numbers of bytes and records read/written by each task, etc. Then the framework replays the trace by running synthetic Hadoop jobs simulating the recorded information. However, with the above information, the synthetic jobs can only simulate the I/O load. To emulating other resource usage (e.g., CPU and memory), Rumen also collects CPU and memory usages for each task, and when replaying the traces, each task in the synthetic jobs launches a separate thread to emulate these resource usages. Unfortunately, this approach does not precisely capture all the intrinsic properties of a Hadoop job; for instance:
- The numbers of tasks is not purely an intrinsic property of the Hadoop job – it depends on both the job itself (e.g., how much data to process) and the underlying Hadoop framework (e.g., how to partition the data & computation). For instance, one common Hadoop performance issue is when the Hadoop job contains a lot of small map tasks (probably due to a lot of small input files), and one way to address this issue is to enhance the Hadoop framework to automatically samples the input data and generate reasonably large input data splits; however, this improvement cannot be evaluated by simply replaying the traces of the original job.
- The CPU usage of each task is also not purely an intrinsic property of the Hadoop job – it depends on both the job itself (e.g., how much computation to perform) and the underlying HW/SW platform. For instance, Rumen collects the cumulative CPU time (in milliseconds) for each task, and then Gridmix3 launches a thread for each task to consume such time. However, the same amount of CPU time on different processors can represents very different computing power; by fixing the CPU time, the resource simulation Gridmix3 cannot evaluate the impacts of different processors in the Hadoop cluster.
- It cannot capture the actual computation performed by the tasks – an important property of the Hadoop job. Without this information, Gridmix3 just conducts some random math operations (e.g., multiply, square root, etc.), which does not represent real world processing in a Hadoop job. For instance, Hive queries are often CPU bottlenecked (most likely due to serialization, compression, sorting, etc.), and a processer can significant speeds up these queries if it provide better support for Java serialization (e.g., new instructions or accelerators); however, this improvement cannot be evaluated by performing random math operations.
In a sense, the only way to precisely capture all the intrinsic properties of Hadoop applications is to run the actual codes in the applications; and this is exactly some big Hadoop users do in their benchmarking – re-run a day’s worth of Hadoop jobs that are actually executed in their Hadoop cluster. Unfortunately, this approach does not scale and the actual Hadoop jobs are not publicly available; trace-based workloads such as GridMix3 help address this issue by synthesizing actual Hadoop jobs into traces for replay – there is nothing wrong with them, but we need to be aware of the tradeoffs they have made and understand their limitations.
HiBench has taken a different approach – it provides a collection of real-world Hadoop applications that are both representative and diverse, so that users can understand their characteristics by running the actual application codes on the underlying Hadoop cluster. On the other hand, the user typically runs each workload in HiBench in isolation, as it currently does not provide an obvious way to generate a representative mix of Hadoop jobs. One approach we have experimented before is to continuously submit the Hadoop jobs to saturate the Hadoop cluster, so as to understand the peak performance.
For more details and example results of HiBench, please refer to (an updated version of) our WBDB2012 paper in the attachment, as well as our “Optimizing Hadoop Deployments” whitepaper (http://communities.intel.com/docs/DOC-5645).