This is part 1 of a multi-part post in which we look at how to tune Java garbage collection (GC) for HBase, focusing on a 100% read YCSB workload. Part 2 will look at 100% writes, and part 3 will tune Java GC for a 50/50 mix of reads and writes. We are using YCSB, which seems to be the de facto NoSQL benchmark workload. We won't go into much detail on how to install and configure YCSB with HBase, as there is already a lot of literature on that topic.
Finally, Liqi Yi and Yanping Wang did all the heavy lifting for this blog post! I couldn't pass up the opportunity to share it with a wider audience (I have 2 readers!!! :))
HBase is an Apache open source project offering NoSQL data storage. Often used together with the Hadoop Distributed File System (HDFS), HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more. From the developer's perspective, HBase is a "distributed, versioned, non-relational database modeled after Google's Bigtable, a distributed storage system for structured data". HBase can easily handle very high throughput by either scaling up (i.e., deployment on a larger server) or scaling out (i.e., deployment on more servers).
From a user's point of view, the latency of each single query matters very much. As we work with users to test, tune, and optimize HBase workloads, we now encounter a significant number who want 99th-percentile operation latencies within 100 milliseconds. That means a round trip, from the client request to the response back to the client, completes within 100 milliseconds for 99% of operations.
Several factors contribute to variation in latency. One of the most devastating and unpredictable latency intruders is the Java Virtual Machine’s (JVM’s) “stop the world” pauses for garbage collection (i.e., memory clean-up).
To address that, we tried some experiments using the G1 (Garbage First) collector in Oracle jdk7u21 and jdk7u60. The server system we used was based on Intel® Xeon® Ivy Bridge EP processors with Hyper-Threading (40 logical processors). It had 256GB of DDR3-1600 RAM and three 400GB SSDs as local storage. This small setup contained one master and one slave, configured on a single node with the load appropriately scaled. We used Apache HBase version 0.98.1 and the local file system for HFile storage. The HBase test table was configured with 400 million rows, and it was 580GB in size. We used the default HBase heap strategy: 40% for blockcache, 40% for memstore. YCSB was used to drive 600 work threads sending requests to the HBase server.
The following chart shows jdk7u21 running 100% read for one hour using "-XX:+UseG1GC -Xms100g -Xmx100g -XX:MaxGCPauseMillis=100". We specified the garbage collector to use, the heap size, and the desired garbage collection (GC) "stop the world" pause time.
Figure 1: Wild swings in GC Pause time
In this case, we got wildly swinging GC pauses. The GC pause had a range from 7 milliseconds to 5 full seconds after an initial spike that reached as high as 17.5 seconds.
The following chart shows more details, during steady state:
Figure 2: GC pause details, during steady state
Figure 2 tells us the GC pauses actually come in three different groups: (1) between 1 and 1.5 seconds; (2) between 0.007 and 0.5 seconds; (3) spikes between 1.5 and 5 seconds. This was very strange, so we tested the most recently released jdk7u60 to see if the data would be any different:
We ran the same 100% read tests using exactly the same JVM parameters: "-XX:+UseG1GC -Xms100g -Xmx100g -XX:MaxGCPauseMillis=100"
Figure 3: Greatly improved handling of pause time spikes
Jdk7u60 greatly improved G1's ability to handle pause time spikes after the initial spike during the settling-down stage. Jdk7u60 made 1029 Young and mixed GCs during a one-hour run, so a GC happened about every 3.5 seconds. Jdk7u21 made 286 GCs, with a GC happening about every 12.6 seconds. Jdk7u60 was able to keep pause times between 0.302 and 1 second without major spikes.
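The GC intervals quoted above follow directly from the GC counts and the one-hour run length. A minimal Python sketch of that arithmetic (the function name is ours, purely for illustration):

```python
RUN_SECONDS = 60 * 60  # one-hour run, as in the tests described above


def mean_gc_interval(gc_count, run_seconds=RUN_SECONDS):
    """Average number of seconds between consecutive GCs over the run."""
    return run_seconds / gc_count


# GC counts as reported in the text for each JDK build.
for jdk, gc_count in (("jdk7u60", 1029), ("jdk7u21", 286)):
    print(f"{jdk}: {gc_count} GCs, one every {mean_gc_interval(gc_count):.1f} s")
```

So jdk7u60 collects roughly 3.6 times as often as jdk7u21, trading more frequent pauses for shorter ones.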
Figure 4, below, gives us a closer look at 150 GC pauses during steady state:
Figure 4: Better, but not good enough
During steady state, jdk7u60 was able to keep the average pause time around 369 milliseconds. It was much better than jdk7u21, but it still did not meet our requirement of 100 milliseconds given by -XX:MaxGCPauseMillis=100.
To determine what else we could do to get our 100-millisecond pause time, we needed to understand more about the behavior of the JVM's memory management and the G1 (Garbage First) garbage collector. The following figures show how G1 works on Young Gen collection.
Figure 5: from the 2012 JavaOne presentation by Charlie Hunt and Monica Beckwith: “G1 Garbage Collector Performance Tuning”
When the JVM starts, based on the JVM launching parameters, it asks the operating system to allocate a big contiguous memory chunk to host the JVM's heap. That memory chunk is partitioned by the JVM into regions.
Figure 6: from the 2012 JavaOne presentation by Charlie Hunt and Monica Beckwith: “G1 Garbage Collector Performance Tuning”
As Figure 6 shows, every object that the Java program allocates using the Java API first comes to the Eden space in the Young generation on the left. After a while, the Eden becomes full, and a Young generation GC is triggered. Objects that still are referenced (i.e., “alive”) are copied to Survivor space. When objects survive several GCs in the Young generation, they get promoted to the Old generation space.
When a Young GC happens, the Java application's threads are stopped in order to safely mark and copy live objects. These stops are the notorious "stop-the-world" GC pauses, which make the application unresponsive until the pauses are over.
Figure 7: from the 2012 JavaOne presentation by Charlie Hunt and Monica Beckwith: “G1 Garbage Collector Performance Tuning”
The Old generation can also become crowded. At a certain level (controlled by -XX:InitiatingHeapOccupancyPercent=?, where the default is 45% of the total heap), a mixed GC is triggered. It collects both Young gen and Old gen. The mixed GC pauses are controlled by how long the Young gen takes to clean up when a mixed GC happens.
So we can see that in G1, the "stop the world" GC pauses are dominated by how fast G1 can mark and copy live objects out of Eden space. With this in mind, we will analyze how the HBase memory allocation pattern can help us tune G1 GC to get our desired 100-millisecond pause.
In HBase, there are two in-memory structures that consume most of its heap: the BlockCache, which caches HBase file blocks for read operations, and the Memstore, which caches the latest updates.
Figure 8: In HBase, two in-memory structures consume most of its heap
The default implementation of HBase's BlockCache is the LruBlockCache, which simply uses a large byte array to host all the HBase blocks. When blocks are "evicted", the reference to that block's Java object is removed, allowing the GC to reclaim the memory.
New objects forming the LruBlockCache and Memstore go to the Eden space of the Young generation first. If they live long enough (i.e., if they are not evicted from the LruBlockCache or flushed out of the Memstore), then after several Young generation GCs, they make their way to the Old generation of the Java heap. When the Old generation's free space is less than a given threshold (InitiatingHeapOccupancyPercent to start with), a mixed GC kicks in and clears out some dead objects in the Old generation, copies live objects from the Young gen, and recalculates the Young gen's Eden and the Old gen's HeapOccupancyPercent. Eventually, when HeapOccupancyPercent reaches a certain level, a full GC happens, which makes huge "stop the world" GC pauses to clean up all dead objects inside the Old gen.
After studying the GC log produced by "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy", we noticed HeapOccupancyPercent never grew large enough to induce a full GC during HBase 100% read. The GC pauses we saw were dominated by Young gen "stop the world" pauses and by reference processing, which increased over time.
Upon completing that analysis, we made three groups of changes in the default G1 GC setting:
(1) Use -XX:+ParallelRefProcEnabled
When this flag is turned on, GC uses multiple threads to process the increasing references during Young and mixed GC. With this flag for HBase, the GC remarking time is reduced by 75%, and overall GC pause time is reduced by 30%.
(2) Set -XX:-ResizePLAB and -XX:ParallelGCThreads = 8 + (logical processors - 8) × 5/8
Promotion Local Allocation Buffers (PLABs) are used during Young collection. Multiple threads are used. Each thread may need to allocate space for objects being copied either to Survivor or Old space. PLABs are required to avoid competition among threads for the shared data structures that manage free memory. Each GC thread has one PLAB for Survivor space and one for Old space. We would like to stop resizing PLABs to avoid the large communication cost among GC threads, as well as variations during each GC.
We would like to fix the number of GC threads at the value calculated by 8 + (logical processors - 8) × 5/8. This formula was recently recommended by Oracle.
With both settings, we are able to see smoother GC pauses during the run.
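To make the thread-count formula concrete, here is a small Python sketch that evaluates it for the 40 logical processors of our test server. The function name is hypothetical, and two details are assumptions on our part rather than something stated above: the result is rounded down to a whole thread, and machines with 8 or fewer logical processors simply get one GC thread per processor.

```python
def parallel_gc_threads(logical_processors):
    """Evaluate 8 + (logical processors - 8) * 5/8 as a whole thread count.

    Assumptions (not from the text): integer division rounds down, and
    machines with 8 or fewer logical processors use one thread each.
    """
    if logical_processors <= 8:
        return logical_processors
    return 8 + (logical_processors - 8) * 5 // 8


# The test server described earlier exposes 40 logical processors.
threads = parallel_gc_threads(40)
print(f"-XX:ParallelGCThreads={threads}")
```

For our 40 logical processors this gives 8 + 32 × 5/8 = 28 GC threads.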
(3) Change -XX:G1NewSizePercent default from 5 to 1 for 100GB heap
Based on the output from -XX:+PrintGCDetails and -XX:+PrintAdaptiveSizePolicy, we noticed the reason for G1's failure to meet our desired 100-millisecond GC pause time was the time it took to process Eden. In other words, G1 took an average of 369 milliseconds to empty 5GB of Eden during our tests. We then changed the Eden size using the -XX:G1NewSizePercent=<positive integer> flag from 5 down to 1. With this change, we saw GC pause time reduced to 100 milliseconds.
From this experiment, we found that G1 cleans Eden at about 1GB per 100 milliseconds, or 10GB per second, for the HBase setup that we used.
Based on that speed, we can set -XX:G1NewSizePercent=<positive integer> so that the Eden size is kept around 1GB. For our 100GB heap, that works out to a value of 1.
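The sizing rule above can be sketched in Python: derive an Eden budget from the pause target and the measured cleaning speed, then express it as a whole percentage of the heap. The function names are hypothetical, the 32GB and 64GB heaps are illustrative sizes we did not test, and rounding to the nearest integer with a floor of 1 is our assumption. Like the text, the sketch treats the young generation size as the Eden size.

```python
def eden_budget_gb(pause_ms, eden_rate_gb_per_s=10.0):
    """Eden size G1 can clean within pause_ms, given the measured rate.

    The 10 GB/s default is the speed observed for our HBase setup;
    other hardware and workloads will differ.
    """
    return eden_rate_gb_per_s * pause_ms / 1000.0


def g1_new_size_percent(heap_gb, target_eden_gb):
    """Whole-number percentage of the heap closest to the Eden target.

    Assumption: round to the nearest integer and never go below 1,
    since the flag takes a positive integer.
    """
    return max(1, round(100 * target_eden_gb / heap_gb))


target = eden_budget_gb(pause_ms=100)  # 1.0 GB for a 100 ms pause target
for heap_gb in (32, 64, 100):  # 32 and 64 are illustrative heap sizes
    percent = g1_new_size_percent(heap_gb, target)
    print(f"{heap_gb}GB heap: -XX:G1NewSizePercent={percent}")
```

The 10GB per second figure is specific to our hardware and workload, so it is worth re-measuring from your own GC logs before reusing these values.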
Here is the GC pause time chart for running 100% read operations for one hour:
Figure 9: The highest initial settling spikes were reduced by more than half
In this chart, even the highest initial settling spikes were reduced from 3.792 seconds to 1.684 seconds. Most of the initial spikes were less than 1 second. After settling, GC was able to keep pause time around 100 milliseconds.
The chart below compares jdk7u60 runs with and without tuning, during steady state:
Figure 10: jdk7u60 runs with and without tuning, during steady state
The simple GC tuning we described above gives ideal GC pause times of around 100 milliseconds, with an average of 106 milliseconds and a standard deviation of 7 milliseconds.
HBase is a response-time-critical application that requires GC pause time to be predictable and manageable. With Oracle jdk7u60, based on the GC information reported by "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy", we are able to tune the GC pause time down to our desired 100 milliseconds.
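Putting the pieces together, the following Python sketch assembles the flags discussed in this post into a single option string. The helper name is hypothetical. Passing the string through HBASE_REGIONSERVER_OPTS in hbase-env.sh is our assumption about where you would use it. On some JDK builds G1NewSizePercent is classed as experimental and needs -XX:+UnlockExperimentalVMOptions, so the sketch makes that optional; please check your own JDK.

```python
def region_server_gc_opts(heap_gb, pause_ms, gc_threads, new_size_percent,
                          unlock_experimental=False):
    """Join the G1 flags discussed in this post into one option string."""
    flags = [
        "-XX:+UseG1GC",
        f"-Xms{heap_gb}g",
        f"-Xmx{heap_gb}g",
        f"-XX:MaxGCPauseMillis={pause_ms}",
        "-XX:+ParallelRefProcEnabled",               # change (1)
        "-XX:-ResizePLAB",                           # change (2)
        f"-XX:ParallelGCThreads={gc_threads}",       # change (2)
        f"-XX:G1NewSizePercent={new_size_percent}",  # change (3)
        # GC logging flags used for the analysis in this post.
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCTimeStamps",
        "-XX:+PrintAdaptiveSizePolicy",
    ]
    if unlock_experimental:
        # Assumption: only needed where G1NewSizePercent is experimental.
        flags.insert(0, "-XX:+UnlockExperimentalVMOptions")
    return " ".join(flags)


# Values for our setup: 100GB heap, 100 ms target, 28 GC threads, 1% new size.
print(region_server_gc_opts(100, 100, 28, 1))
```

The values passed in the last line are the ones from our 40-logical-processor, 100GB-heap test system; other systems need their own numbers.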
2012 JavaOne presentation: “G1 Garbage Collector Performance Tuning” by Charlie Hunt and Monica Beckwith