Hadoop and HBase Optimization for Read Intensive Search Applications

by Abinasha Karana, Director
Bizosys Technologies Pvt Ltd, Bangalore

Abstract

Traditionally, web scale is achieved by master-slave replication and data sharding, which becomes a huge challenge as data scales beyond 500 GB. MapReduce-based technologies such as HBase* and Hadoop* achieve web scale and address this challenge through transparent partitioning, distribution and replication. However, most of the currently available recommended HBase and Hadoop server configurations are targeted towards write-intensive applications.

Bizosys Technologies* has built a search engine whose index is stored in HBase on Hadoop and is deployed in a cluster environment. Search applications by nature involve read-intensive operations. Bizosys experimented with its search engine across the latest hardware options, software configurations and cluster deployment provisioning. These tests were conducted on Intel hardware at the Intel® Innovation Labs in June 2010.

Understanding the Application

The Bizosys search engine provides a single window to content and applications in an enterprise. Not only does it find relevant information, it presents it in visually rich views, integrating the relevant applications as actions embedded inside the results. This dynamic portal is built on the fly from the user's search.



System Configuration

The Bizosys search engine was tested in both cluster and standalone environments. The cluster consisted of homogeneous machines. Refer to the deployment architecture and machine specifications below. The Intel® Xeon® processor 5600 series was used for deployment.



Test Case Scenario & Analysis of Results

Three broad test scenarios were explored, viz.
  1. I/O optimization using SSD, caching and compression
  2. Read optimized server settings
  3. Optimal cluster deployment provisioning
TEST SCENARIO - I/O optimization using SSD, caching and compression

In a read intensive operation, CPU I/O WAIT is the biggest bottleneck. We resolved this by using:
  • SSD for a higher throughput (Read only),
  • SSD for a higher throughput (Simultaneous Read Write), and
  • Compression and in-memory caching
SSD for a higher throughput (Read only)

To find out the impact of disks, we experimented with:

  1. Using 1 SATA disk to hold the index
  2. Using 1 SSD disk mounted with the "noatime" option
  3. Using 3 SSD disks mounted with the "noatime" option.
    (The dfs.data.dir property of hdfs-site.xml was pointed to the "data/1/dfs", "data/2/dfs" and "data/3/dfs" folders, where data/1, data/2 and data/3 are the directories on which the SSDs were mounted. Similarly, in mapred-site.xml, the mapred.local.dir property was pointed to "data/1/mapred", "data/2/mapred" and "data/3/mapred".)
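A minimal sketch of the corresponding entries is shown below; the absolute mount paths are an assumption based on the folder names above:

  <!-- hdfs-site.xml: spread DataNode block storage across the three SSD mounts -->
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs,/data/2/dfs,/data/3/dfs</value>
  </property>

  <!-- mapred-site.xml: spread MapReduce local scratch space across the same mounts -->
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred,/data/2/mapred,/data/3/mapred</value>
  </property>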
In a test environment consisting of:
  1. 1 Million Wikipedia® English documents
  2. Tuned Hadoop, HBase and Bizosys Search with no cache implementation.
  3. The "dfs.datanode.max.xcievers" property of hdfs-site.xml configuration file was at 4096. The allowed open file limit for the process was set at 16384.
  4. Both server and client sitting in the same machine
  5. A variable workload of 1, 5, 10, 25, 50 and 100 concurrent clients each consisting of different query string. (Query sampling consists of 28 single word, 42 double word, 19 three word, 3 four word, 4 six word, 1 seven word and 2 8 words.)
  6. Out of 100 queries 15 had no matching results.
  7. The result page size is 100.
  8. Readings are taken in a stabilized environment averaged of 9 iterations.
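As a reference for item 3 above, a minimal sketch of the xcievers entry follows; the OS-level open file limit mentioned there is raised separately for the process (for example via ulimit), which is an operational detail outside these XML files:

  <!-- hdfs-site.xml: allow each DataNode to serve many concurrent block readers/writers -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>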


Analysis

SSD performed better than SATA. The performance gain was significant as the number of concurrent users increased, since the SSDs reduced CPU I/O WAIT time. Spreading the index across multiple disks rather than a single disk gave an overall better response time for the queries.

Currently Bizosys search engine keeps data in two buckets:
  • Index Section: Finds relevant documents based on query term.
  • Preview Section: Shows previews in the result page of matching documents.
The Index section is accessed far more frequently. Provisioning SSD mounts for this section would therefore be the better design choice from an I/O usage point of view.

Current SSDs come in capacities of 60 GB, 120 GB and so on, and cost more than regular SATA disks of the same capacity. Mixing SATA and SSD disks together is therefore the better value proposition from a cost perspective.

SSD for a higher throughput (Simultaneous Read Write)

Bizosys Search is designed to add, update and delete information in a real-time environment. This allows Bizosys Search to serve queries while the index is undergoing changes (additions and modifications). Across the software stack, Hadoop, HBase and the Bizosys search engine are all designed to handle simultaneous read-write.

This test compares:
  1. Querying the index
  2. Querying the index while it is undergoing modifications
In a test environment consisting of:
  1. 1 Million Wikipedia English documents
  2. Tuned Hadoop, HBase and Bizosys search engine with no cache implementation.
  3. In each one-second interval, 10 Wikipedia documents were modified, for the duration of the test.
  4. The index was distributed across 3 slave machines coordinated by a master node. The updating client ran on one slave machine while another slave machine served queries.
  5. A variable workload of 50 and 100 concurrent clients, each issuing different query terms.
  6. Maximum 100 documents retrieved per result set.
  7. Performance readings were taken in a stabilized environment averaging eight iterations.


Analysis

The read-only and simultaneous read-write response timings are almost the same: the Bizosys Search index can undergo modifications while serving queries with no impact on query response time. In fact, we saw a slightly better response in the simultaneous read-write case, possibly because the write operations had already loaded the index into memory. This could be critical in an enterprise context, where permissions and content change very frequently.

Compression and In-memory Caching

Hadoop and HBase provide two ways to decongest I/O:

Compressing the Index: Compression minimizes the resident data size. However, each data block load then requires decompression, which is a CPU-intensive task. This can slow down response time in situations where content undergoes frequent changes, and it is pure overhead when the index size and the number of concurrent clients are not sufficient to create CPU I/O WAIT in the first place.

HBase and Hadoop provide the zlib and LZO compression codecs. In addition to these, we compiled the Hadoop native code against the Intel® Integrated Performance Primitives (Intel® IPP) zlib libraries to leverage multiple processors during decompression.
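Which codec implementations Hadoop can use is governed by the io.compression.codecs property in core-site.xml; a hedged sketch is shown below. The LzoCodec class name assumes the separately installed hadoop-lzo library, and a custom IPP-linked codec would be registered following the same pattern:

  <!-- core-site.xml: compression codec implementations available to Hadoop -->
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec</value>
  </property>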

Bizosys Search "Preview data blocks" are compressed by default. These Preview data blocks take up considerable storage capacity, while very few of these are read per search.

Caching: Caching clears the disk bottleneck by offloading I/O WAIT to RAM. The complexity arises from the limited amount of RAM relative to the index size. HBase manages this memory with an eviction algorithm; however, constant swapping to work within the constrained RAM will slow down response time. Sufficient RAM should be provided to avoid swapping.
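In HBase, the size of this cache is controlled by a single hbase-site.xml property expressed as a fraction of the region server heap. A minimal sketch follows; 0.2 (20% of the heap) is the usual default and matches the setting discussed later in this article:

  <!-- hbase-site.xml: fraction of the RegionServer heap reserved for the block cache -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.2</value>
  </property>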

We tested Bizosys search engine query response time on a compressed and cached index. The test environment consisted of:
  1. 180 million short records of 50 characters each, created by splitting Wikipedia English pages.
  2. The preview and index sections were compressed with the zlib compression algorithm.
  3. Tuned Hadoop, HBase and Bizosys search engine.
  4. Index was distributed in 3 slave machines coordinated by a master node.
  5. Maximum 100 documents retrieved per result set.


Analysis
  1. During initialization (the first run), the cache is built by loading the index from disk. From the second run onwards, once the index is loaded, I/O WAIT is eliminated. Caching improved performance by around 50%.
  2. During this test, the average CPU idle was 98% and the I/O WAIT was only 2%.
Decompression increases CPU usage. However, the last test indicated 98% idle CPU, which implies largely sequential processing. These sequential processing zones consist of HBase region scanner data block parsing.

One million Wikipedia pages broken into 200 million short records showed a poorer query response than 7 million full-length Wikipedia pages in the index.
In all our testing we used a block size of 32 MB (the dfs.block.size property in hdfs-site.xml). With compression, however, more records get packed into each data block, and sequentially parsing these larger decompressed data blocks resulted in poorer query response time and higher CPU idle time.

TEST SCENARIO - Read optimized server settings

Hadoop, HBase and Bizosys Search are highly configurable systems.
The primary areas of tuning are:
  • Configuration Settings
  • JVM Settings
  • Custom Indexing on HBase
Hadoop, HBase, ZooKeeper, Bizosys Search and test client instances all ran on a single machine, with 1 million Wikipedia documents indexed. Readings were taken in this environment under various server settings.

Configuration Settings

We focused our tuning on the following areas to remove software bottlenecks:
  1. Insufficient handlers choke parallel processing and queue requests for serial execution, leaving hardware resources under-used.
  2. Improper buffering causes more time-consuming trips to I/O resources such as the network and the file system, while an excessively high buffer setting keeps the CPU in a wait state.
  3. Parsing a data block is a sequential activity, so smaller blocks increase parallel processing; however, the number of open file handles also increases. The nature of the documents, the cluster size and the compression settings are a few of the variables that influence parsing response time.
Some of the configurations carried out are:

Configuration file: hdfs-site.xml
  • dfs.block.size = 33554432: A lower value offers more parallelism (32 MB blocks).
  • dfs.datanode.handler.count = 100: Number of handlers dedicated to serving data block requests on Hadoop DataNodes.
Configuration file: core-site.xml
  • io.file.buffer.size = 16384: Read and write buffer size; limiting it to 16 KB allows continuous streaming.
Configuration file: hbase-site.xml
  • hbase.regionserver.handler.count = 100: Number of RPC server instances spun up on HBase RegionServers.
  • hfile.min.blocksize.size = 65536: A small block size increases the index size but reduces the amount fetched on a random access.
Configuration file: default.xml (Bizosys Search)
  • SCAN_IPC_CACHE_LIMIT = 300: Number of rows cached by the Bizosys search engine for each scanner next() call over the wire; caching 300 rows per trip reduces network round trips by a factor of 300.
  • LOCAL_JOB_HANDLER_COUNT = 100: Number of parallel queries executed at one time; query requests above this limit are queued.
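Expressed in the respective configuration files, the Hadoop and HBase entries above take the following form (the Bizosys default.xml entries are product-specific and therefore not shown):

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>  <!-- 32 MB data blocks -->
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>100</value>
  </property>

  <!-- core-site.xml -->
  <property>
    <name>io.file.buffer.size</name>
    <value>16384</value>  <!-- 16 KB read/write buffer -->
  </property>

  <!-- hbase-site.xml -->
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
  </property>
  <property>
    <name>hfile.min.blocksize.size</name>
    <value>65536</value>  <!-- 64 KB HFile block size -->
  </property>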

JVM Settings

The Hadoop JVM instances ran with the "-server" option and a heap size of 4 GB. HBase nodes ran with a 4 GB heap and the following JVM settings:

"-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError"

The parallel GC leverages multiple CPUs.

Custom Indexing on HBase

Bizosys Search has an HBase extension indexing module. This index module is the first step for all HBase requests: it acts as a binary gate that determines whether the query term is present inside the region. If the term is not present, all further processing at the HBase region server is avoided.

This index is designed to consume a minimal memory footprint while determining query term availability in 1-2 milliseconds.



Analysis
  1. IPC optimization gave 33% better performance.
  2. JVM GC tuning improved performance by a further 16% over and above the IPC optimization.
  3. Custom Indexing on HBase further increased performance by 62.5% over and above JVM GC.
  4. Finally, removing I/O bottlenecks using SSDs boosted performance by a further 66% over and above the custom indexing results.
The block cache was set at only 20% (around 800 MB) of the allocated heap, and considerable cache eviction was noticed at this setting. To avoid this, allocating more memory for caching is recommended.

With the generally recommended configuration settings, 100 concurrent users experienced an average response time of 46 seconds, while in the tuned environment it came down to 5 seconds. It can be improved further by increasing in-memory caching and provisioning distributed clients.

TEST SCENARIO - Optimal Cluster Deployment Provisioning

Hadoop and HBase are designed to scale in a cluster environment. HBase is currently in use at several places, including Bing and Yahoo!'s cluster of 10,000 machines (see: http://www.slideshare.net/teste999/distributed-databases-overview-2722599).

Our cluster had 1 master machine and 3 slave machines. The master machine ran the Hadoop NameNode, ZooKeeper and the HBase master. Each slave machine ran a Hadoop DataNode and an HBase RegionServer.

We primarily tested three scenarios:
  1. HBase, Hadoop and Bizosys Search all in standalone mode
  2. HBase, Hadoop in distributed mode and Bizosys Search running in a standalone mode.
  3. HBase, Hadoop and Bizosys Search all deployed in a distributed mode
Bizosys Search result quality is configurable. Computing user-specific (dynamic) relevance ranking on top of the regular weight-based static ranking takes an extra payload (e.g., favoring documents from the same role as the searcher, departmental proximity, location proximity). This finer refinement depends on knowing the user and presenting relevant information; however, it also involves accessing more meta information.

We tested both of these scenarios:
  1. For all users
  2. User-specific (personalization computed on the top 1000 documents)
In a test environment consisting of:
  1. 4 machines having similar configurations.
  2. Each machine had an SSD; one machine had a combination of 1 SSD and 1 SATA disk, and the master had only a single SATA disk.
  3. 1 Million documents were indexed.
Legend:
1M = 1 Standalone Machine, 4M = 1 Master and 3 Slaves
1C = 1 Bizosys Search process serving concurrent users in threads, 3C = the load is distributed across 3 Bizosys Search processes running on 3 slave machines, 100U = 100 concurrent users, 50U = 50 concurrent users
R1 = first run (not initialized), RX = average of runs 2 through 9



Analysis

In the cluster environment the average I/O WAIT was lower; however, I/O WAIT spikes were noticed.

The Bizosys search engine deployed in a distributed environment demonstrated better performance than on a single machine.

Understanding Westmere Processor Architecture

The Intel® Xeon® processor 5600 series automatically regulates power consumption and intelligently adjusts server performance according to your needs. Enabled by Intel® Intelligent Power Technology, these processors shift into the lowest available power state, while Intel® Turbo Boost Technology intelligently raises performance when needed.

Designed to enable more secure server deployments, this series includes Advanced Encryption Standard New Instructions (AES-NI), providing hardware-based acceleration for secure transaction servers, while hardware-based Intel® Trusted Execution Technology (Intel® TXT) helps guard against malicious software attacks.

Intel® Xeon® processor 5600 feature overview and benefits

Dynamic scalability, managed cores, threads, cache, interfaces, and power for energy-efficient performance on-demand.
  • Design and performance scalability for servers, workstations, notebooks, and desktops with support for 2-8+ cores and up to 16+ threads with Intel® Hyper-Threading Technology (Intel® HT Technology), and scalable cache sizes, system interconnects, and integrated memory controllers.
  • Intelligent performance on-demand with Intel® Turbo Boost Technology taking advantage of the processor's power and thermal headroom. This enables increased performance of both multi-threaded and single-threaded workloads.
  • Increased performance on highly-threaded applications with Intel® HT Technology, bringing high-performance applications into mainstream computing with 1-16+ threads optimized for new generation multi-core processor architecture.
  • Scalable shared memory features memory distributed to each processor with integrated memory controllers and Intel QuickPath Technology high-speed point-to-point interconnects to unleash the performance of future versions of next-generation Intel® multi-core processors.
  • Multi-level shared cache improves performance and efficiency by reducing latency to frequently used data.
  • Flexible virtualization: Combine servers from multiple generations into the same virtualized server pool to extend failover, load balancing, and disaster recovery capability.
  • Intel® Virtualization Technology (Intel® VT)
    • Enables more operating systems and software to run in today's virtual environments.
    • Developed with virtualization software providers to enable greater functionality and compatibility compared to non-hardware-assisted virtual environments.
    • Get the performance and headroom to improve the average virtualization performance over previous generations of two-processor servers.
  • Intel® Trusted Execution Technology: Provides enhanced virtualization security through hardware-based resistance to malicious software attacks at launch.
For more information on the Intel® Xeon® processor 5600 series, please visit http://www.intel.com/Assets/en_US/PDF/prodbrief/323501.pdf

Intel® X25-E Extreme SATA Solid-State Drive

The Intel® Extreme SATA Solid-State Drive (SSD) offers outstanding performance and reliability, delivering the highest IOPS per watt for servers, storage and high-end workstations. Enterprise applications place a premium on performance, reliability, power consumption and space. Unlike traditional hard disk drives, Intel Solid-State Drives have no moving parts, resulting in a quiet, cool storage solution that also offers significantly higher performance than traditional server drives. Imagine replacing up to 50 high-RPM hard disk drives with one Intel® X25-E Extreme SATA Solid-State Drive in your servers - handling the same server workload in less space, with no cooling requirements and lower power consumption. That space and power savings, for the same server workload, will translate to a tangible reduction in your TCO.

For product brief and more information on the product, please visit: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf

Conclusion

Achieving optimal results from the Hadoop and HBase based Bizosys Search engine begins with choosing the appropriate combination of disk throughput, in-memory caching, cluster deployment and multi-CPU machines.

We found SSDs to be very effective for both read and write operations. In-memory caching resulted in better response times when the heap block cache was sized to achieve a higher cache hit percentage. The cluster environment served requests faster, although CPU I/O WAIT spikes were noticed. Overall, most of the CPUs remained idle during the tests.

References

Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products.