Ceph Cache Tiering Introduction

Published: 03/04/2015, Last Updated: 03/03/2015

Ceph is a distributed, unified storage platform: it supports block, file, and object storage in the same system. These characteristics make it attractive to enterprise users. With the emergence of fast but costly storage devices such as SSDs, storage tiering technology has become increasingly important in the enterprise storage market. Ceph introduced this feature, known as cache tiering, in its Firefly v0.80 release.

Cache tiering aims to improve IO performance by using fast storage devices as a cache for an existing larger pool. It creates a pool on the fast/expensive storage devices (mostly SSDs for now); this is known as the cache tier. The existing backing pool, composed of slower/cheaper storage devices, can be either an erasure-coded pool or a replicated pool; this is known as the storage tier or base tier. The cache tier holds a subset of the data in the base tier. Its architecture is shown below in Pic 1.

Pic 1: Cache tiering architecture

The cache tier is transparent to client operations. Once a cache tier has been configured on top of the storage tier, the ‘Objecter’ component in the Ceph client routes all IOs to the cache tier. If the needed data is missing from the cache tier, it is promoted from the storage tier to the cache tier. After promotion, data is accessed and updated in the cache tier. A tiering agent runs as part of cache tiering: when data in the cache tier becomes inactive (cold), the agent flushes it to the base tier and finally removes it from the cache tier. These operations are known as flush and evict.

Cache tiering supports two modes in the initial release in Firefly: writeback mode and read-only mode.

  • Writeback mode: In this mode, data is read from and written to the cache tier. When data is missing from the cache tier, it is promoted from the base tier. After promotion, all IOs on the promoted data are served by the cache tier until the data becomes inactive, at which point it is flushed and evicted to the base tier by the tiering agent as stated above. This mode suits most workloads that consistently update their data.
  • Read-only mode: Writes are forwarded to the base tier in this mode, but reads are handled in the cache tier. Missing data is promoted from the base tier as in writeback mode. The tiering agent performs no flush operations in this mode because writes always go to the base tier. However, it does perform evict operations to remove stale data from the cache tier according to the defined policy. This mode works better for read-mostly scenarios.

Cache tiering setup

To set up cache tiering, you need to have two pools: the base tier pool and the cache tier pool. Assuming the base tier pool is already there, the cache tiering setup involves the following five steps:

  1. Add SSDs as OSDs
  2. Edit the crush map
  3. Create the cache pool
  4. Create the cache tier
  5. Configure the cache tier

Adding SSDs as OSDs

This is the normal process of adding an OSD to the Ceph cluster. See the official documentation (http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and http://ceph.com/docs/master/install/manual-deployment/#adding-osds) for instructions.
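As a minimal sketch of that process, the following uses ceph-deploy to prepare and activate a single SSD as an OSD. The host name and device path are hypothetical placeholders; adapt them to your cluster.

```shell
# Hypothetical host/device names -- adjust to your environment.
# Prepare the SSD (partitions and formats it for use as an OSD):
ceph-deploy osd prepare host1:/dev/sdb

# Activate the prepared device so it joins the cluster:
ceph-deploy osd activate host1:/dev/sdb1

# Verify the new OSD appears in the cluster topology:
ceph osd tree
```

Repeat for each SSD on each node; the manual-deployment document linked above covers the equivalent step-by-step procedure without ceph-deploy.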

Edit the crush map

New buckets and rules need to be added to the crush map so that the cache tier pool can make use of the SSDs. Here is an example of adding new buckets and rules on a cluster with two OSD nodes, each of which has four SSD OSDs.

  • New buckets:
host host1-ssd {
        id -4           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item osd.16 weight 1.000
        item osd.17 weight 1.000
        item osd.18 weight 1.000
        item osd.19 weight 1.000
}

host host2-ssd {
        id -5           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item osd.20 weight 1.000
        item osd.21 weight 1.000
        item osd.22 weight 1.000
        item osd.23 weight 1.000
}

root ssd {
        id -6           # do not change unnecessarily
        # weight 8.000
        alg straw
        hash 0  # rjenkins1
        item host1-ssd weight 4.000
        item host2-ssd weight 4.000
}
  • New rules
rule cachetier {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take ssd  # select osd from the bucket ‘ssd’
        step chooseleaf firstn 0 type host
        step emit
}

For the definitions of buckets and rules and the meaning of each field, see http://ceph.com/docs/master/rados/operations/crush-map/. These buckets and rules vary with your hardware configuration, but they can easily be added by imitating the existing buckets and rules in your crush map.
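The crush map is stored in compiled form, so editing it involves an export/decompile/edit/recompile/inject cycle. A typical workflow (the file names are arbitrary):

```shell
# Export the current compiled crush map from the cluster:
ceph osd getcrushmap -o crushmap.bin

# Decompile it to editable text:
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt to add the buckets and rule shown above ...

# Recompile the edited map (crushtool reports syntax errors here):
crushtool -c crushmap.txt -o crushmap.new

# Inject the new map back into the cluster:
ceph osd setcrushmap -i crushmap.new
```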

Create the cache pool

This is the same as creating a normal pool, but remember to set the crush rule to the one you defined in the previous step after creating the pool. For example, to use the ‘cachetier’ rule defined above, whose ruleset id is 3, run the following command:

ceph osd pool set {cache-pool-name} crush_ruleset 3
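Putting both steps together, here is a hypothetical example; the pool name ‘cachepool’ and the placement-group count are placeholders to adapt to your cluster.

```shell
# Create the cache pool with 128 placement groups (name and PG count
# are hypothetical -- size the PG count for your OSD count):
ceph osd pool create cachepool 128

# Point the pool at the 'cachetier' crush rule (ruleset 3 above):
ceph osd pool set cachepool crush_ruleset 3

# Verify the pool's crush_ruleset took effect:
ceph osd dump | grep cachepool
```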

Create the cache tier

The following 3 commands are used to create the cache tier:

ceph osd tier add {storage-pool-name} {cache-pool-name}
ceph osd tier cache-mode {cache-pool-name} {cache-mode}
ceph osd tier set-overlay {storage-pool-name} {cache-pool-name}

The first command adds the cache pool as the cache tier of the storage pool. The second command sets the cache mode to either ‘writeback’ or ‘readonly’. The third command sets the overlay of the storage pool, so that all IOs are now routed to the cache pool.
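For instance, with a hypothetical existing storage pool named ‘basepool’ and the cache pool ‘cachepool’ created above, a writeback tier would be wired up as:

```shell
# Hypothetical pool names: 'basepool' (existing) and 'cachepool'.
ceph osd tier add basepool cachepool          # attach the cache tier
ceph osd tier cache-mode cachepool writeback  # choose writeback mode
ceph osd tier set-overlay basepool cachepool  # route client IOs via the cache
```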

Configure the cache tier

Hit set configuration

The hit set records which objects have been accessed recently in the cache tier. It is used by the tiering agent to decide when to flush and evict. Ceph uses the bloom filter hit set type in production systems, which uses the ‘bloom filter’ data structure in its internal implementation. hit_set_count defines how many hit sets to persist, and hit_set_period defines how much time in seconds each hit set should cover. The default values are 4 hit sets, each covering 1200 seconds.

ceph osd pool set {cache-pool-name} hit_set_type bloom
ceph osd pool set {cache-pool-name} hit_set_count 6
ceph osd pool set {cache-pool-name} hit_set_period 600
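You can read the settings back to confirm they were applied; for example, for the hypothetical ‘cachepool’ used above:

```shell
# Query the hit set parameters on the cache pool:
ceph osd pool get cachepool hit_set_type
ceph osd pool get cachepool hit_set_count
ceph osd pool get cachepool hit_set_period
```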

Cache sizing configuration

Several parameters configure the sizing of the cache tier. ‘target_max_bytes’ and ‘target_max_objects’ set the maximum size of the cache tier in bytes or in number of objects; when either limit is reached, the cache tier is considered ‘full’. ‘cache_target_dirty_ratio’ controls when flushing starts: when the fraction of dirty data (in bytes or objects, relative to those target maximums) reaches this ratio, the tiering agent starts to flush. ‘cache_target_full_ratio’ works the same way, but triggers the evict operation.

ceph osd pool set {cache-pool-name} target_max_bytes {#bytes}
ceph osd pool set {cache-pool-name} target_max_objects {#objects}
ceph osd pool set {cache-pool-name} cache_target_dirty_ratio {0.0..1.0}
ceph osd pool set {cache-pool-name} cache_target_full_ratio {0.0..1.0}

There are other cache tiering parameters, such as ‘cache_min_flush_age’ and ‘cache_min_evict_age’. These settings are optional; set them as needed.
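As a concrete illustration, here is one plausible sizing for the hypothetical ‘cachepool’; all the numbers are assumptions to tune for your own SSD capacity and workload, not recommendations.

```shell
# Hypothetical sizing for a ~1 TiB cache tier:
ceph osd pool set cachepool target_max_bytes 1099511627776   # 1 TiB cap
ceph osd pool set cachepool target_max_objects 1000000       # 1M objects cap
ceph osd pool set cachepool cache_target_dirty_ratio 0.4     # flush at 40% dirty
ceph osd pool set cachepool cache_target_full_ratio 0.8      # evict at 80% full

# Optional age thresholds (seconds) mentioned above:
ceph osd pool set cachepool cache_min_flush_age 600
ceph osd pool set cachepool cache_min_evict_age 1800
```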

After the previous five steps, you can start to run workloads on Ceph without any changes on the client side. You can even set up cache tiering while your workload is running; that is to say, you can set up and remove cache tiering without disrupting your service.
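Removal follows the setup steps in reverse. A sketch for a writeback tier, again using the hypothetical ‘basepool’/‘cachepool’ names:

```shell
# Stop admitting new writes to the cache (reads fall through to the base):
ceph osd tier cache-mode cachepool forward

# Flush and evict all remaining objects from the cache pool:
rados -p cachepool cache-flush-evict-all

# Stop routing client IOs through the cache, then detach it:
ceph osd tier remove-overlay basepool
ceph osd tier remove basepool cachepool
```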

