Enhancing the Scalability of Memcached

Enhancing the Scalability of Memcached (PDF 1.26MB)

1 Introduction - Memcached and Web Services

Memcached is a Key-Value cache used by cloud and web service delivery companies, such as Facebook [1], Twitter [2], Reddit [3], and YouTube [4], to reduce latency in serving web data to consumers and to ease the demand on database and computational servers [5]. In addition to lowering latency, memcached’s scale-out architecture supports increased throughput by simply adding memcached servers. Scaling up, however, is less rewarding as more than 4 cores causes performance degradation [1] [6].

In this paper, we introduce memcached optimizations to utilize concurrent data structures, an updated cache replacement strategy, and network optimizations to increase throughput over 6X versus the current open-source version (v1.6) on Intel® Xeon® E5 server processors while maintaining latency Service-Level Agreements (SLAs) of 1ms or less for Round-Trip Time (RTT). Figure 1 illustrates the 6X speedup of the optimized version over the v1.6 memcached baseline, with the optimized version on 32 logical cores providing 3.15 Million requests per second (RPS), with a median round-trip time(RTT) less than one millisecond.


Figure 1 – Best throughput measured using any configuration from the base open-source code version 1.6.0_beta1 and the optimized version

Memcached operates on simple data types of key-value pairs, similar to NoSQL databases, but is not persistent like NoSQL. Memcached stores all the key-value pairs in non-persistent memory, so in the event of failure, all the stored data is lost. Throughout this paper we use the term cache item to specify the combination of a key and its associated value. Keys are, by definition, unique values. In a web-serving architecture, the memcached application sits between the Front-End Web Servers, or Web Tier, and the Back-End Database as shown in figure 2.


Figure 2 – memcached server sits between the Front-end Web Tier and Back-end Database

Memcached’s purpose is to intercept data requests and satisfy them out of its logical cache (i.e. system memory) when possible, avoiding trips to disk storage attached to the back-end database. There is also the potential to avoid compute-intensive tasks by retrieving a pre-computed value from the logical cache. In both cases, the time spent retrieving or computing data results is reduced (i.e. transaction latency). A cluster of servers participates in a memcached logical cache, with each server offering its system memory as a portion of the complete logical cache [7] [8] [9]. For example, a memcached cluster of 10 nodes, with each node having a 128 GB memory footprint, provides a 1.28 TB logical cache for servicing data requests. The logical cache is populated from data in the back-end database, the persistent data store for web-serving, or from computational servers. Cache items are maintained using Least Recently Used (LRU) policy as well as Time To Live (TTL). When an item is evicted its slot is filled by a more recent item.

A web service request can require multiple trips to memcached, the database, and other services. If the caching strategy is efficient, the number of trips to memcached will be an order of magnitude greater than the required trips to the database (i.e. 10:1). Data retrieval from system memory, as provided by memcached, is an order of magnitude faster than retrieving the same data from the database (microseconds versus milliseconds) and, in most cases, orders of magnitude faster than computing the data on-the-fly. Therefore, avoiding database accesses and computation is necessary to provide acceptable user response times for web service requests.

1.1 Terminology

Throughout the paper, we will reference several items with abbreviations and short-hand.

LRU (Least Recently Used) -- This is the eviction scheme used in memcached to determine which items are removed from the cache to free space for additional cache items.

Base (Baseline) – This is the unmodified 1.6.beta_0 version of Memcached, which can be downloaded directly from memcached.org.

Bags – This is short-hand for the modified version of memcached optimized to increase performance. Bags is the modified LRU cache update strategy, called a “bag LRU”. This was the first section of code developed.

RTT (Round Trip Time) – This metric is the elapsed time of a Memcached request, including the elapsed time from request from client to server until response back to client.

SLA (Service Level Agreement) – This metric is the maximum RTT allowed to maintain acceptable web service request response time to the User. Response time above the SLA will result in an unacceptable response time for generating the web service request. There can be multiple memcached requests (to memcached logical cache) per web service request (to Front-end Web Tier). For the purpose of this paper, the memcached SLA is 1 millisecond response time to the requesting Front-end Web Tier client.

NIC (Network Interface Card) -- This is the network card used in the system.

2 Current Memcached Architecture

When memcached was introduced in 2003 [5], there were few X86 processors with multiple cores, and multi-socket servers were extremely expensive. Intel® Xeon® dual-core and quad-core processors weren’t released until 2005 and 2009 respectively. Today, dual-socket, 8 and 16-core server systems are the building block of web service delivery, with roadmaps for processors showing increasing core counts [10]. These advanced systems provide ample opportunity for workloads to execute in parallel using multiple threads or processes, yet memcached is unable to do so using the current data structures and software architecture [1] [6].

2.1 Data Structures

Memcached has 4 main data structures, including

  • A hash table to locate a cache item.
  • Least Recently Used (LRU) list to determine cache item eviction order when the cache is full.
  • A cache item data structure for holding the key, data, flags, and pointers.
  • A slab allocator, which is the memory manager for cache item data structures.

The hash table data structure is an array of buckets. The array size (k) is always a power of 2 to make finding the correct bucket quicker by taking the value of 2^k-1 and using it as a hash mask. Executing a bit-wise AND (e.g. hash_value & hash_mask) quickly determines the bucket that contains the hash value. The buckets are constructed as single linked-lists of cache items that are NULL-terminated. Figure 3 illustrates this data structure.


Figure 3 – Data Structure of hash table used to lookup cache items

The LRU, used to determine eviction order, is a double linked-list which holds all cache items in a single-sized slab (i.e. unit of memory allocation, described below). The double linked-list pointers are maintained in each cache item structure, similar to the hash table, and are modified each time the cache item is manipulated. In the event of an eviction, the tail of this list is checked for the oldest cache item, which is removed for reuse. Figure 4 Illustrates the LRU data structure.


Figure 4 – Data Structure of the current open-source LRU used to determine cache item eviction order

The cache item data structure holds the key-value pair data. It is an intersection of the following reference data:

  • pointers used in the hash table for the single linked-list
  • pointers used in the LRU for the double linked-list
  • reference counter to determine how many threads are currently accessing a cache item
  • flags for cache item status
  • the key
  • length of the value in bytes
  • value

The slab allocator provides memory management for cache items. Since the cache items are relatively small, allocating and freeing of these small chunks of memory using system calls would be slow and likely cause thrashing. Instead, memcached uses slabs as the unit of memory allocation. A slab is a larger piece of memory that can hold many cache items within it. For example, a slab could be a 1024 byte chunk of memory which will hold 16 cache items that are 64 bytes or less. The slab allocator uses these larger memory allocations, and then maintains a list of free cache items in the slabs. Each time a cache item is requested, the slab allocator checks the size of the value being stored and returns a cache item from a slab large enough to accommodate. In some cases this solution wastes space, but the time saved and memory thrashing avoided offsets this disadvantage.

2.2 Commands

  1. GET to retrieve a cache item from the cache
  2. STORE to add a cache item to the cache
  3. DELETE to remove a cache item from the cache

Additionally, there are 15 other specialty requests including stats, replacements, arithmetic, flush, and updates. These code paths generally follow some order of the above primary commands. For instance, a replacement first DELETEs the cache item and then STOREs a new cache item into the hash table and LRU. In analysis, we focus effort on GETs, as this operation constitutes the bulk of the command requests in memcached operation. [11]

For the three primary commands (GET, STORE, and DELETE), the flow of memcached is as follows:

  1. Request arrives at NIC and is processed by libevent
  2. A worker thread
    • Takes connection and data packet
    • Determines the command and data
  3. A hash value is created from the key data.
  4. The cache lock is acquired to begin hash table and LRU processing. (Critical Section)
    • (STORE only)
      • Cache item memory is allocated.
      • Data is loaded into the cache item data structure (flags, key, and value).
    • Hash table is traversed to GET, STORE, or DELETE the cache item from the hash table.
    • LRU is modified to place move cache item in the front of the LRU (GET, STORE) or remove it (DELETE).
    • Cache item flags are updated
  5. Cache lock is released. (End of Critical Section)
  6. Response is constructed.
  7. (GET only) Cache lock is asserted.
    • Cache item reference counter is decremented.
    • Cache lock is released.
  8. Response is transmitted back to requesting client (i.e. Front-end Web server).

2.3 Process Flow

Figure 5 documents the process flow of memcached when processing data requests from the perspective of a memcached server in the cluster. Multiple servers work in conjunction to act as single larger logical data cache. In essence, these servers constitute a large, distributed hash table. Reviewing figure 5 from top to bottom, the clients (these are usually in the form of web tier servers building a response for a user or computational servers needing data for calculations) are responsible for submitting operation requests to memcached. When memcached is initiated, a static number of worker threads are created to process these client requests, normally aligned with the number of physical processors in the server system.


Figure 5 – Process flow for cache item operations (STORE/GET/DELETE) in open-source v1.6

Client requests are distributed to worker threads for processing. For a GET request, each thread must complete a hash table lookup to identify where the value data resides associated with the key. Additionally, to indicate that the key-value is recently accessed and update the eviction order, it is moved to the front of the Least Recently Used (LRU) list. Unfortunately, both of these activities require protection (i.e. locks) for these shared data structures. After the data is located, it is retrieved from system memory by the worker thread and transmitted back to the requesting client.

Hash table thread safety is provided by a single global cache lock that is asserted for hash table access. Additionally, the LRU linked-list, maintained to manage cache evictions, uses the same global cache lock when cache items’ LRU linked-list is manipulated to ensure thread safety.

Threads are initiated by libevent, which is a portable network protocol wrapper. Threads wait in libevent for a request, and upon receiving one, load the data from the packet and decode the command. Assuming a request for a cache item (GET/STORE/DELETE/ARITHMETIC), the thread computes a hash value from the key and acquires the global cache lock. The thread will hold the lock until the command is completed, then release the lock and create a response. Doing this, memcached is effectively serialized to ensure data consistency and thread safety.

2.3.1 Hash Table Lookup

After acquiring the lock, the hash table traversal and/or manipulation is executed. Serialized hash table entry and exit eliminates concern for thrashing pointers and corrupting the linked-list bucket during a STORE or DELETE. If we leave the hash table implementation as-is and remove the locks, there is a scenario where two cache items “close” to each other (i.e. adjacent in the linked-list) are being removed and, upon completion of the removals, the linked-list no longer points to cache items in the correct chain. This is illustrated in Figure 6, with each row representing a time progression.


Figure 6 - Linked-list corruption from two threads removing cache items concurrently.

In the second row, one thread is removing the yellow cache item (#3) and one is removing the orange cache item (#4). They both modify the pointer of the previous cache item. In row 3, they both NULL out the next pointer. Finally, in the fourth row, the cache items are removed and the remaining hash chain is left with a gap in the middle of the chain, causing item #5 to be lost from the hash chain.

The lock is also required for GET requests so they are not traversing a bucket while the linked-list is changing. Traversing a linked-list during update could cause the thread to follow a pointer into non-existent memory, causing a segmentation fault or a miss of cache item that does exist. Refer to Figure 3 for the layout of the hash table data structure.

2.3.2 LRU Manipulation

Upon completion of the hash table operation, the cache item is either added (STORE) or removed (DELETE) from the double linked-list structure (Figure 4) used for the LRU eviction scheme. When processing a GET request, the cache item is removed from its current spot in the LRU and added to the head of the list. The cache lock is held for all LRU manipulation as two cache items being removed from the same location, or inserted at the head simultaneously, can cause list corruption. This is the same corruption described above with the hash table linked-list, except with an additional pointer to corrupt.

2.3.3 Other Serial operations

The Hash table lookup and LRU Manipulation operations are the primary code segments protected by the global cache lock, but there are others. Cache evictions, allocations, and flushes also require holding the global cache lock. These operations access the either the LRU or hash table, so simultaneous access by multiple threads can cause corruption. Additionally, cache item flags that are modified by these operations are protected by the global cache lock to ensure thread safety. An example of flag use is the ITEM_LINKED flag, which needs to be cleared when removing a cache item from the hash table. The global cache lock ensures only one worker thread can manipulate this flag at a time.

2.3.4 Parallel operations

The bulk of the process flow for memcached operation requires protection by the global cache lock. Outside of this functionality, there are some operations that are completed in parallel, including: network transmission (both receive and transmit), packet decode, key hashing, and building of the data response.

2.4 Performance Bottleneck

The global cache lock is a performance bottleneck for more than four threads, as discussed in [1] and [6]. This behavior was also confirmed in testing (Figure 7 and Table 1). Figure 7 illustrates that application throughput degrades for open-source memcached past 4 cores. The lock is coarse-grained; that is, time spent in serial execution is substantial, and contention for the lock among worker threads is high.

To verify this, Intel® Vtune™ Amplifier XE was used to produce stack traces on this code and indicated that a majority of the execution time spent in a lock (i.e. waiting for the global cache lock). Looking deeper into the stack trace, we determined the global cache lock was causing contention. To quickly confirm the problem, we removed the global cache lock (which we knew would be unsafe, but effective) and got a substantial throughput increase on a read-only workload.


1 Vtune is a trademark of Intel Corporation in the U.S. and other countries.
2 http://www.opencompute.org
3 Other brands and names are the property of their respective owners.

Abundant research illustrates shared lock contention can cripple a program’s parallel performance. Reference [12] provides an in-depth analysis of contention and how it affects a program and degrades performance. We concluded that avoiding the contention for the shared global cache lock would improve parallel performance and application scalability.


Table 1 - Captured "perf top" from a server running memcached with 8 threads


Figure 7 - Scaling performance of Base (version 1.6) memcached showing performance peak around 3-4 cores

3 Memcached Optimizations

This section outlines the optimizations to memcached to improve application scalability for multi-threaded execution on many-core servers. After examining the current memcached architecture, we investigated techniques to minimize or eliminate the global cache lock, the primary performance bottleneck.

Performance goals include:

  • Equivalent performance regardless of number of keys (i.e. 1 key versus 1 million key lookups)
  • Updated cache replacement strategy for LRU Eviction should provide equivalent (or better) hit rates. The expectation is a 90%+ hit rate for cached value lookups in the synthetic workload we are using.
  • LRU correctness for cache management; that is, the correct value is evicted based on its LRU status.

To measure progress versus these goals, we have baseline measurements using a synthetic workload to replay a set of transactions and measure the hit rate.

3.1 Data Structures

For our modifications two data structures are modified: the hash table and the LRU. The following optimizations were implemented:

  • Hash table locking mechanism changed to allow for parallel access
  • Bag LRU – The data structure is changed to an array of different sized LRU’s with Single linked-list bags (i.e. collections) of cache items.

While we are changing the algorithms in the hash table, we make no physical changes to the hash table data structures. We implement striped locks which are required to modify the hash table buckets. A striped lock is a fine-grain set of locks (versus the global hash table lock) implemented on sections of the hash table. For this implementation, we first create and initialize a set of Z locks, where Z is a power of 2. Keeping this a power of 2, a bitmask is used to calculate the lock number quickly. To traverse the linked-list, the lock guarding that bucket is acquired the performing a bit-wise AND of the value Z-1 with the bucket to-be-locked. This protocol provides a shared lock for every Zth bucket (Figure 8), reducing contention for hash table locks. In testing, Z=32 was a suitable value for the number of locks.

The LRU is where the bulk of the data structure changes are required. To build a more parallel LRU to take advantage of the parallel hash table, we researched ideas for parallel LRU data structures. There are two good ideas for lock-mitigating LRU algorithms and data structures described in [13] and [14].


Figure 8 - Addition of the striped locks that guard each bucket

Taking inspiration from both of these, and the concept of a Bag LRU, we implement single linked-list “bags” of cache items. Each “bag” contains cache items with similar insertion or last-touched times, and can be inserted atomically. There is a cache item “clean-up” functionality required with this model; a cleaner thread exists for this purpose. It evicts expired/invalid cache items and performs maintenance on the bags. A more detailed explanation of the new data structure including diagrams and logic is in section 3.5.

3.2 Commands

There are no changes to any of the memcached commands or protocols described in section 2.2. The difference is solely in the methodology for command execution.

3.3 Process Flow

Figure 9 relates the updated process flow with modifications. From top to bottom, the client, libevent, and worker thread functionality remains the same. The serialized operations for hash table lookup and LRU processing are modified to operate in a parallel fashion.

DELETE and STORE operations now use a parallel hash table approach with striped locks (explained earlier) (Figure 8), and GET operations now execute non-blocking and parallel.


Figure 9 - Process flow for cache item operations (STORE/GET/DELETE) in optimized version of 1.6.0_beta1

The algorithmic changes, described below, prevent hash chain corruption and maintain pointers of cache items being removed. In the event a thread is pre-empted on a removed cache item, it will likely continue in the correct bucket. There is a slight nuance to the unblocked GET. If a thread is preempted for a long duration on a cache item, and that cache item is moved, the GET can return a “false negative” (i.e. memcached will return “no cache item found” when the cache item is actually in the hash table).

The pros and cons of this were weighed, and, since memcached is a cache for data that should be persistent elsewhere (e.g. Database Tier); the extra throughput realized from removing the locks outweighs the possibility of a false negative. If there is a case where the false negative presents a problem, you could use the striped locks, which minimally decrease performance. However, the expectation is, as core counts increase, that striped locks could again become a point of contention.

LRU processing is now non-blocking and parallel, as described above in section 3.2, and below in section 3.5. In bags, a single linked-list of cache items is stored so they can be inserted atomically, removing the need for a lock while placing a new cache item into a bag. There is still a Compare-and- Swap (CAS) command required, which could be a point of contention if only STOREs are performed. However, optimizing for the majority of requests (i.e. GETs), the Bag LRU is implemented with no locks or CAS required for GETs.

3.4 Parallel Hash Table Lookup

The first performance impediment encountered when executing a memcached transaction is the serialized hash table lookup. Previous work on increasing parallel performance of memcached has focused on a partitioned hash table in which threads access only specific portions (i.e. partitions) of the cache. This works well for removing lock contention and increasing performance as seen in [16]. The disadvantage of this approach is that multiple frequently accessed cache items in a single partition can overwhelm the threads available to service that partition. Similarly, traffic patterns that favor a partition can be starved for processing power. Overall performance of a partition is limited to the threads that can access that cache without causing lock contention.

This implementation optimizes throughput for the majority of traffic, GET operations. Reference [11] shows that traffic on the majority of Facebook’s cache servers are GET requests. This balance is as high as 99.8% GETs in some cases, with only a few caching tiers being write-dominated. This is consistent with the performance testing done by [15] and [16] as both testing methodologies only measure throughput for GET requests.

With this in mind, a parallel hash table is designed that doesn’t require any locks when processing GET requests. Lock contention, no matter how small, impacts performance (transaction latency, overall throughput under Latency SLA), as shown in [12]. In the future, as CPUs execute faster and core-counts increase, lock contention will increase. The new designs accounts for these trends by completely eliminating lock contention for GETs.

3.4.1 GET Lock Removal

In order to remove locks on GETs, there are two main cases to address: expansions and hash chain modifications (i.e. insertions and removals).

An expansion is required when the hash table has a much larger amount of cache items then buckets. In this case, adding more cache items will increase the length of the hash chain, increasing the time required to find a cache item and RTT. Instead of suffering this time penalty, a new hash table with twice the amount of buckets is allocated, as follows:

  • all cache items are taken from the old table
  • the key is hashed /li>
  • the cache item is inserted into the new table/li>

Assoc_get logic:

1. mask hash to get the bucket

2. If (!expanding)

3. Check hash chain bucket – return cache item if found

4. If (cache item not found)

5. If(expanding && expansion bucket is greater than the bucket)

6. Find bucket in new larger table

7. Wait for old bucket to be empty

8. Check new hash bucket – return cache item if found

9. else if expanding

10. Ensure bucket is in old table

11. Check hash chain bucket – return cache item if found

12. If (cache item not found)

13. If(expansion bucket is greater than the bucket

14. Find bucket in new larger table

15. Wait for old bucket to be empty

16. Check new hash bucket – return cache item if found

To accomplish cache item retrieval, a check for an expansion is required. If there is no expansion in progress, the GET request executes normally and returns the cache item, if found. However, there is the possibility the expansion started after the initial check and moved the cache item. In this case, the process waits for the old bucket (the bucket where the item would be located in the previous hash table) to be empty and then checks the new hash table and bucket for the cache item.

If the hash table is currently being expanded, there is a check to see if the old bucket has been moved already. If the old bucket has not been moved, it is checked for the cache item. Like before, if the cache item isn’t found, a check is required to see if a “bucket move” occurred on the old bucket. If so, the new hash table will then be checked for the item.

The second case to address, hash chain modifications, is handled in the STORE and DELETE code. As long as pointers are changed in correct order, GETs will have no problem with hash chain traversal.

3.4.2 STORE/DELETE Lock Removal

The STORE and DELETE commands are very similar to the current open-source implementation. To ensure safety in hash chains and multiple inserts/deletes on the same chain or cache item, STOREs and DELETEs use the striped locks that we described earlier (Figure 8).

Assoc_delete logic:

1. Determine bucket of cache item

2. Get the bucket lock

3. If(expanding)

4. If(expansion bucket is greater than the bucket){

5. Release the bucket lock

6. Determine new bucket

7. Get bucket lock

8. Delete cache item from bucket if found

9. Release bucket lock

10. Else if not expanding

11. Delete cache item from bucket if found

12. Release bucket lock

The striped lock is required for all hash chain manipulation, including moving buckets during an expansion. This makes STOREs and DELETEs simpler since they get the bucket lock and then check for an expansion. If the expansion is moving the bucket the STORE or DELETE is going to manipulate, they are locked out until the old bucket is empty and know to check the new one. If the move started after the lock was acquired, then the expansion waits until the lock is released and can be acquired by the expansion to move the bucket. The logic is identical for contending STOREs, DELETEs, and MOVEs as they are all locked out of the buckets until the manipulations are complete.

Assoc_insert logic:

3.5 Parallel LRU Eviction

A parallel replacement for the double linked-list LRU implementation is required to determine cache item eviction order. The concept behind the “bag” implementation is to group cache items into “bags” based on their timestamp. In these bags, a single linked-list of cache items is stored so they can be inserted atomically to the end of the list using a CAS. This removes the need for a lock while placing a new cache item into a bag although the CAS could be a point of contention if only STOREs are performed. Again, optimizing for the majority (GET requests), the Bag LRU is implemented with no locks or CAS for GETs.

The new design also needs a way to evict old cache items and maintain new bag LRU structures. To accomplish this, we implement a cleaner thread to run in the background. This thread removes expired cache items, invalid cache items, and performs maintenance on the LRU. Work that GET requests were doing in the old version (i.e. Removal on expirations) is moved to this background thread, increasing the throughput of GET requests and decreasing overhead per GET request transaction. The cleaner thread also provides a benefit of increasing hit rates of the cache. Cache items inserted with quick expirations are now cleaned up within a few minutes of expiring, instead of taking space until aging out of the cache.

3.5.1 Linked-list structure

The linked-list structure reuses the previous and next pointers from the current LRU, so no extra memory allocation is required. The next pointer still points to the next cache item in the new single linked-list and the previous pointer is used to mark the cache item’s bag membership which is the newest bag at the time the cache item was last accessed. The cleaner thread uses the residence pointer while cleaning the bags.

3.5.2 Additional Data structures


Figure 10 - Bag data structure


Figure 11 – Bag List data structure

There are two data structures required for the bag LRU, an array of LRUs for bag identification, and the actual bag data structure.

The first data structure is the bag LRU list, an array of bag heads, with one per slab id (Figure 10). These structures are referenced in a global variable and initialized when the LRU is initialized. Each bag head in this array maintains:

  • A pointer to the newest bag
  • An alternate to the newest bag
  • The oldest bag in the bag list
  • A count of the number of bags for statistics

The newest bag is the bag that is currently filled with newly-allocated cache items. The oldest bag is where the cleaner thread starts cleaning bags and where eviction threads evict cache items from. The newest alternate bag is used by the cleaner thread to avoid locks and contention with inserts into the newest bag. Any cache item that already exists in a bag, but has been accessed recently and needs to be moved to the newest bag by the cleaner thread to maintain eviction order (the one currently being filled), will be placed into the newest alternate bag.

The second data structure created is a bag, which maintains a single linked-list of cache items (Figure 11). Each bag has:

  • A pointer to the newest and oldest cache items in the bag
  • A counter for the number of cache items in the bag
  • A lock
  • Bag open and close timestamps that are currently set, but not used.

When cleaning a bag, the cleaner thread begins its operation from the oldest cache item in the bag. The evictions done by worker threads begin in the same location. The newest cache item pointer is to allow for cache items to be added to the end of the bag quickly when being inserted or as bags are merged. The count is for determining when to open a new bag and when to merge an old bag into a newer one. This count can also be used for statistics and debugging as needed. The reason for a lock on each bag is to ensure thread safety between the cleaner thread and evictions done by worker threads. If an eviction thread and a cleaner thread are moving cache items in the same bag, it is possible for one of the threads to corrupt a bag (we have shown linked list corruption in Figure 6). Since the cleaner thread spends very little time in each bag, it is unlikely that there is contention on this lock between cleaner and worker threads. Also, since worker threads are serialized higher up in the call stack, so there is no contention on this lock between these threads.

3.5.3 Operations on the Data Structures

This section reviews operations on the new data structures and algorithms for initialization, inserting cache items, and cleaning the LRU List.

First, all added locks need to be initialized and an initial bag is created for each slab size to be used. After all the structures are initialized, the cleaner thread is created and launched in a cleaning loop.

Bag LRU initialization

1. Initialize the global eviction lock

2. Initialize the cleaner locks – (currently only one lock could be one cleaner thread per slab id)

3. Create the array of bag head structures

4. For each slab id

5. Initialize the first bag

6. Initialize the bag lock

7. Point the bag array head to this as the newest bag

8. Increment the bag counter

9. Create and launch the cleaner thread

In the cleaning loop, the thread continually checks to see if bags are full and adds new bags to those slabs as needed. Periodically, the cleaner thread traverses every cache item in the bags, removing invalid or expired cache items, and merging bags below a minimum cache item count threshold.

Cleaner thread loop

Cleaner thread loop

1. Loop forever

2. Acquire cleaner lock

3. For each slab id

4. if the newest bag is full

5. create and initialize a new bag

6. point the newest bags next bag pointer to the newly initialized bag

7. set a bag holder = newest bag

8. Newest bag = newly initialized bag

9. Newest alternate = bag holder

10. atomic increment of bag count

11. Every Nth iteration, clean the bags

12. Release cleaner lock

13. Sleep for a given interval

When a cache item is first inserted into a LRU, it goes into the newest bag available for that cache item’s slab size. Once the bag is located, the cache item is placed at the end of the newest cache item in the newest bag using a CAS. If this fails, the thread traverses the chain for that bag until it finds a NULL pointer (indicating the end of the bag), and tries the CAS again. This continues until the CAS passes indicating a successful insertion.

Inserting New Cache item not already in LRU

1. New cache item next pointer = NULL

2. If(current open bag is empty)

3. If( CAS (current bags newest cache item, NULL, new cache item)){

4. New cache item previous pointer = current bag

5. Current bag newest cache item = new cache item

6. Atomic increment bag cache item counter

7. Return

8. Swap cache item = current bags newest cache item

9. While( !CAS(swap cache item next pointer, NULL, new cache item)){

10. While(swap cache item next pointer != NULL)

11. Follow the swap cache item next pointer until you find a null and try the swap again

12. New cache item previous pointer = current bag

13. Current bag newest cache item = new cache item

14. Atomic increment bag cache item counter

For a GET, since cache item already exists in the LRU, we are simply touching the cache item to update its timestamp and location in the LRU. The touch copies the new timestamp into the cache item last accessed field and points the cache item’s previous pointer (the bag the cache item should now reside in) to the newest bag in that slab size.

Updating Cache item already in the LRU

1. Cache item previous pointer = current bag

2. Update cache item timestamp

3.5.3.1 Cleaning the LRU list

Within the context of cleaning the LRU list, there are several operations to support; including cache item reordering, merging buckets, cache item eviction, and dumping the cache.

As the cleaner thread traverses the bag lists, it checks the cache item previous pointer. If the pointer points to the current bag it is cleaning, it leaves the item in its current location. If the item points to different bag, that means that the item has been touched since it was first inserted. In this case, the cleaner thread will remove the item from its current bag, and insert it into the bag it should be residing in. As described above, if this is the newest bag in the LRU size, it will be placed into the newest alternate to avoid contention.

Merging two buckets

1. Older buckets newest cache item next pointer = newer buckets oldest cache item

2. Newest buckets oldest cache item = older buckets oldest cache item

3. For each cache item we added

4. If the cache item was pointing to the old bag

5. Point it to the new bag

6. Increment the newer bags counter by the amount of cache items in the older bag

The logic to merge two buckets is to concatenate the two linked-lists together, put the combined linked-list into the newer of the two bags. Then it must ensure every cache item that had its previous pointer pointing to the old bag now points to the newer bag and reclaim the older bag.

Evicting a cache item

1. Get the eviction lock for the slab id

2. Get the oldest bag from the slab id

3. Find the first bag in the list that is before the newest alternate and has cache items in it

4. if we can’t find the previous bag, add two new bags to this slab id, then check those two bags

5. if we still can’t find a bag

6. stop trying to evict cache items just return a failure

7. release eviction lock

8. lock the bag we found

9. starting at the oldest cache item in the bag

10. try to find an cache item with a refcount of 1 or 0 or a delete locked cache item in the first N cache items of the bag

11. if cache item is found evict it and free it

12. try and get the newly free cache item from the slab allocator

13. if we evicted an cache item but didn’t get it from the allocator –

14. evict another cache item and try again

15. else

16. stop trying to evict cache items just return a failure

17. unlock the bag

18. release eviction lock

To evict a cache item from the cache, the worker threads are serialized with a global eviction lock to ensure correctness. The bag is then locked to avoid conflicts with the cleaner thread. After this is done, the thread is the only thread touching the bag and can remove any cache item that meets the eviction criteria (which is the same as the open-source version).

Dumping the cache

1. get cleaner lock

2. get eviction lock

3. add new bags so we can evict without effecting other threads

4. for each slab id

5. for each cache item

6. If (cache item older than the cache dump time)

7. Free cache Item

8. release eviction lock

9. release cleaner lock

Dumping the cache is the same as the eviction of a cache item as before, but this time all cache items older then the time specified in the command are evicted.

3.6 Removal of global cache locks

With functionality in place to manage parallel hash lookups, LRU ordering, and parallel cache evictions, the global cache locks are nearly ready to be removed from the GET/STORE/DELETE code paths. Management of reference counts must by changed; instead of a lock to change the reference count, atomic operations are used to increment and decrement the count. Upon completion, the global cache lock/unlock pairs are removed from the operations, as the global lock is no longer necessary.

1. Lock (global_cache_lock) <- Remove these locks

2. . . . . . (GET/STORE/DELETE)

1. 3. Unlock (global_cache_lock) <- Remove these unlocks

4 Test Environment Configuration and Metrics

This section reviews the hardware configuration of all the systems being used.

4.1.1 System under Test (i.e. SUT)

Based on early evaluation, a 1 GbE (Gigabit Ethernet) NIC is easily saturated, so a 10GbE NIC is required to remove the network as a performance bottleneck.

Memcached is primarily run in large server farms where Total Cost of Ownership (TCO) is important. The objective is to maximize performance/watt results to decrease the TCO. The test system is a power-optimized 2-Socket Open Compute 2.0 System Board manufactured by Quanta Computer.

System Hardware:

  • Dual-Socket Open Compute 2.0 System board (half width) in 1.5U chassis2
  • Quantity 2 -- Intel® Xeon® E5-2660 2.2Ghz Processors (95W TDP) [10]
  • 128 GB DRAM – Quantity 16, 8 GB DDR3 PC3 12800 ECC
  • Intel® “Niantic” 82599 10Gb Direct Attach SFP+ Ethernet Controller [10]
  • Arista 7124 Switch – 24 Port SFP+4

4.1.2 Load generators

With the open-source version of memcached on the previous generations of servers, a single load generator is sufficient. However, with the modifications, an increase in the number of clients is required to stress a single SUT. Using one client, it is possible to get approximately 800K requests per second (RPS); therefore, we configure four clients for driving the load. All load generators are configured identically and confirmed individually to be capable of generating 800K+ RPS with the following Hardware configuration:

  • Dual-Socket Open Compute 2.0 System board (half width) in 1.5U chassis5
  • Quantity 2 -- Intel® Xeon® E5-2660 2.2Ghz Processors (95W TDP) [10]
  • 16 GB DRAM – Quantity 2, 8 GB DDR3 PC3 12800 ECC
  • Intel® “Niantic” 82599 10Gb Direct Attach SFP+ Ethernet Controller [10]

4.2 Software (SW) Configuration

This section reviews the SW configurations for the SUT and client systems, network optimizations, and power monitoring configuration.

4.2.1 Test System Configuration

The Following software is installed to run the Memcached tests:


2http://www.opencompute.org
3Other brands and names are the property of their respective owners.
4http://www.aristanetworks.com/en/products/7100sbr /> 5http://www.opencompute.org

  • CentOS (Community ENTerprise Operating System), version 6.2 – Linux Kernel 2.6.326
  • Libevent 1.4.14b-stable
  • Intel® ixgbe 3.8.14 NIC driver
  • Memcached-1.6.0_beta1 and Memcached-1.6.0_beta1-bags
  • Memcached –p 11211 -u nobody –t -m 128000 –d

With this command, thread is the number of worker threads to run in Memcached. For all tests, the only variable modified for testing memcached is the number of threads; this workload driver and input parameters are utilized to determine peak throughput and scalability.

4.2.2 Client System Configuration

The client systems are configured with the following software:

  • CentOS (Community ENTerprise Operating System), version 6.2 – Linux Kernel 2.6.32 7
  • Intel® ixgbe (10 GbE) version 3.8.14 NIC driver8
  • Mcblaster9

With the operating system and network driver software installed, mcblaster is utilized to drive a synthetic cache lookup workload and measure performance (i.e. transaction throughput, latency/response time) of the memcached instance. There are two steps involved in this testing. First, the keys and values must be loaded into the memcached instance, using:

  • ./mcblaster –p 11211 -t 16 -z 64 –k 1000000 –r 10 –W 75000 –d 20

This command loads the cache with 1 million keys, all with 64 Byte values. In the next phase of testing, all clients send requests for those keys, in random order, simultaneously at a given rate to measure throughput and latency at that rate.

  • ./mcblaster –p 11211 -t 16 -z 64 –k 1000000 –r –W 0 –d 120

With these tests, the rate is increased as memcached’s average RTT is monitored. Maximum throughput is the highest rate achieved without exceeding the 1ms average RTT SLA.

Using previous work on measuring performance of memcached [15] and [16] as a guide, all measurements are gathered using GET requests. Many key-value caching servers are GET-dominated [11] and the number of STOREs in the workload mix has little impact on GET performance [16].


6http://www.centos.org/
7http://www.centos.org/
8http://sourceforge.net/projects/e1000/files/ixgbe%20stable/ - 3.8.14 was replaced by 3.8.21 (2-24-2012)
9https://github.com/fbmarc/facebook-memcached-old/tree/master/test/mcblaster

4.2.3 Network Setup

For maximum throughput, there are two different networking configurations depending on the Intel® Hyper-Threading (HT) Technology setting (enable/disable).

The following setup is used when Intel® HT Technology was disabled:

  • Reduce the Ring buffer size to 96 for both rx (receive) and tx(transact, send)
    • Goal: Reduce latency ( RTT)
    • Command
      • ethtool –G ethX rx 96
      • ethtool –G ethX tx 96
  • Disable generic receive offload
    • Goal: Reduce latency ( RTT)
    • Command: ethtool –K ethX gro off
  • Affinitize NIC interrupts to CPUs by executing the set_irq_affinity.sh10shell script
    • Goal: Reduce network interrupt contention
    • Command: ./set_irq_affinity.sh ethX
    • Note: Ensure that the irqbalance service is disabled in the kernel or it will overwrite these changes. (killall irqbalance)

These settings provide best throughput when combined with the latest ixgbe driver (3.8.14 when the test was run). Flow director (included with the ixgbe driver) is used to steer incoming packets evenly across the physical CPU cores distributing the load of network interrupts. This allows CPUs to maintain balanced utilization across cores.

With Intel® HT Technology enabled, the optimal network configuration uses 1 memcached thread per physical CPU and 2 NIC queues which are now affinitized to the additional 16 logical cores (Table 2). Using process affinity, this configuration allows a majority of the NIC handling to be off-loaded to the logical cores adjacent to the Memcached threads, providing a boost in performance vs. Intel® HT Technology disabled.


Table 2 – Logical Cores share Memcached and Network traffic in an Intel® HT Technology memcached setup


10This script is provided in the ixgbe driver download.

4.2.4 Power measurements

For testing, power is monitored with a Yokogawa WT210 digital power meter11 . The wattage is collected every 5 seconds while the system is under load. The power is then averaged to give the power readings listed in the charts in the next section.

As a note, all of these servers were run with a video card and a less-than-optimal cooling solution. We expect less power consumption per server in a data center environment versus the numbers reported in this paper.

5 Findings

To summarize the performance gains, Figure 12 provides the maximum server capacity (i.e. RPS) for the SUT while maintaining the Latency SLA of <1ms RTT (Round-trip time). Consistent with our research, these measurements are taken with GET requests.

Testing confirmed that the optimized memcached performs best with one thread per physical processor, so optimized test results use 16 threads to align with 16 physical cores in the SUT. Intel® HT Technology and Turbo mode both provide performance upside when enabled.

The two open-source baseline results illustrate no performance gain from increasing 8 to 16 threads (i.e. lack of scalability due to lock contention). The optimized “Bags” results show increasing performance gains as we add Intel® HT Technology (+8%) and Turbo (+23%). The performance improvement with Intel® HT Technology is attributed to additional logical processors handling the network interrupts concurrently with worker threads on the same physical core, removing the need for context swaps. Turbo mode provides an increase in processor frequency, contributing a significant performance improvement as well. Our best throughput result, at 3.15M RPS, is achieved with Intel® HT Technology and Turbo enabled (+31%), with each feature providing an additive increase in performance. When we compare the best optimized result (3.15M) with an identically-configured baseline (.525M), we realize a 6X performance gain.

Figure 13 relates the average round-trip time (RTT) for the four optimized memcached cases from Figure 12. It is interesting to note that there is a point (260 us, 285 us, 215 us, and 205 us, respectively) where RTT degrades immediately past the 1 ms SLA.

The reality is that, regardless of RPS load, the response time SLA must be managed to less than 300 us to avoid the point on the curve where RTT degrades; however, our “capacity” or RPS per server provides consistent RTT under the 300us threshold, allowing each server to be fed millions of RPS to maximize per-server throughput.

Figure 14 shows the maximum RPS throughput with a median RTT < 1ms SLA as core counts increase. Turbo was enabled and Intel® HT Technology disabled for these tests. As core counts are increased, memcached scales linearly due to parallel data structures and algorithms operating on many-core processors. We expect this linear scaling to continue on next-generation many-core processors, as there is no blocking required between threads for execution of the GET requests.


11http://tmi.yokogawa.com/products/digital-power-analyzers/digital-power-analyzers/wt210wt230-digital-power-meters/#tm-wt210_01.htm


Figure 12 - Maximum GET requests while maintaining <1ms RTT SLA


Figure 13 - Median RTT as GET RPS rate is increased


Figure 14 - Maximum throughput with a median RTT < 1ms SLA as core counts increase

Table 3 adds power metrics to the best baseline and optimized memcached results at maximum load under the SLA. In addition to the 6X performance delta, performance per Watt improves 3.4X over the baseline. In the data center, where power and cooling is a large portion of the total cost of ownership, maximizing the work completed per watt invested is critical. These per-server power numbers would be lower in a data center environment with no graphics card and a better cooling solution than our SUT.


Table 3 - Power measurements at the wall while running at max throughput < 1ms SLA

6 Conclusions and Future Work

Optimized memcached increases throughput by 6X increase and performance per watt by 3.4X over the baseline. Extrapolating these numbers from the half-width server board to a full chassis (i.e. two boards), the SUT configuration supplies 6.3 Million RPS per 1.5U server with 256GB of memory, with capability to scale to 1 TB (32 GB * 16 DIMM slots) of memory per 1.5U chassis. The speedup is attributed to the redesign of the hash table and LRU to utilize parallel data structures. These data structures allow for the removal of all locks for GET requests, and mitigate the contention of locks for SET and DELETE requests, enabling linear speedup when measured from 1 to 16 cores on a 2-socket Intel® system.

Future research should examine memcached on 4-socket servers and micro-server platforms to determine if the bag version will continue to scale with additional cores, and if an increase in performance per watt occurs with the amortization of the power drawn from shared components.

A second area for research is the cache replacement policy. With updated cache item ordering and eviction, as well as the addition of a cleaner thread, new replacement policies could be assessed (e.g. LFU, Adaptive Replacement Cache, and MQ). A key feature improvement of the cleaner thread is its autonomous nature, allowing cache management without worker threads’ involvement. Assessing new replacement policies can now be completed without modifying the flow of the program, potentially increasing the cache hit rate.

7 Acknowledgment

The authors recognize the contributions of project team members; including Rajiv Kapoor (Intel Corporation), George Chaltas (Intel Corporation), Alex Duyck (Intel Corporation), Dean Chandler (Intel Corporation), and Mike Bailey (Oregon State University)

8 References

[1] P. Saab, "Scaling memcached at Facebook," 12 December 2008. [Online]. Available: https://www.facebook.com/note.php?note_id=39391378919&ref=mf. [Accessed 1 April 2012].

[2] Twitter Engineering, "Memcached SPOF Mystery," Twitter Engineering, 20 April 2010. [Online]. Available: http://engineering.twitter.com/2010/04/memcached-spof-mystery.html. [Accessed 1 April 2012].

[3] Reddit Admins, "reddit's May 2010 "State of the Servers" report," Reddit.com, 11 May 2011. [Online]. Available: http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html. [Accessed 1 April 2012].

[4] C. Do, "YouTube Scalability," Youtube / Google, Inc., 23 June 27. [Online]. Available: http://video.google.com/videoplay?docid=-6304964351441328559. [Accessed 1 April 2012].

[5] memcached.org, "memcached - a distributed memory object caching system," Memcached.org, 2009. [Online]. Available: http://memcached.org/about. [Accessed 27 February 2012].

[6] N. Gunther, S. Subramanyam and S. Parvu, "Hidden Scalability Gotchas in Memcached and Friends," in Velocity 2010 Web Performance and Operation Conference, Santa Clara, 2010.

[7] B. Fitzpatrick, "Distributed caching with memcached," Linux J., vol. 2004, no. 1075-3583, p. 124, 2004.

[8] B. Fitzpatrick, "LiveJournal's Backend A history of scaling," August 2005. [Online]. Available: http://www.slideshare.net/vishnu/livejournals-backend-a-history-of-scaling. [Accessed 1 April 2012].

[9] N. Shalom, "Marrying memcached and NoSQL," 24 October 2010. [Online]. Available: http://natishalom.typepad.com/nati_shaloms_blog/2010/10/marrying-memcache-and-nosql.html. [Accessed 1 April 2012].

[10] Intel Corporation, "ARK - Your Source for Intel Product Information," Intel.com, 2012. [Online]. Available: http://ark.intel.com/. [Accessed 2012 1 April].

[11] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang and M. Paleczny, "Workload Analysis of a Large-Scale Key-Value Store," in ACM SIGMETRICS/Performance 2012 Conference, London, 2012.

[12] J. Preshing, "Locks Aren't Slow; Lock Contention Is," Preshing on Programming, 18 November 2011. [Online]. Available: http://preshing.com/20111118/locks-arent-slow-lock-contention-is. [Accessed 1 April 2012].

[13] B. Agnes, "A High Performance Multi-Threaded LRU Cache," codeproject.com, 3 February 2008. [Online]. Available: http://www.codeproject.com/Articles/23396/A-High-Performance-Multi-Threaded-LRU-Cache. [Accessed 1 September 2011].

[14] M. Spycher, "High-Throughput, Thread-Safe, LRU Caching," Ebay Tech Blog, 30 August 2011. [Online]. Available: http://www.ebaytechblog.com/2011/08/30/high-throughput-thread-safe-lru-caching/. [Accessed 1 September 2011].

[15] D. R. Hariharan, "Scaling Memcached – Harnessing the Power of Virtualization," in VMWorld 2011, Las Vegas, 2011.

[16] M. Berezecki, E. Frachtenberg, M. Paleczny and K. Steele, "Many-Core Key-Value Store," in Green Computing Conference and Workshops (IGCC), 2011 International, Orlando, 2011.

[17] S. Hart, E. Frachtenberg and M. Berezecki, "Predicting Memcached Throughput using Simulation and Modeling," Orlando, 2012.

Tags:
For more complete information about compiler optimizations, see our Optimization Notice.

Comments

Hi!

I realize this is hard to follow and it looks like your article is very thorough and very well done, but the version of memcached you're comparing against is wrong.

1.6 is our future development tree, and it has an older version of the storage engine. I apologize for it being in this state.

The 1.4 tree is the one everyone runs, and while it still shouldn't be nearly as fast as your example, it has much improved read/write performance compared to 1.6's baseline. This ends up with your results being exaggerated, since you're comparing a version nobody runs to a highly optimized version you wrote.

Any chance of getting this fixed, or was this intentional?