Ceph Erasure Coding Introduction

Ceph introduction

Ceph, The Future of Storage™, is a massively scalable, open source, software-defined storage system that runs on commodity hardware. Ceph has been developed from the ground up to deliver object, block, and file system storage in a single software platform that is self-managing and self-healing and has no single point of failure. Because of its highly scalable, software-defined storage architecture, Ceph is an ideal replacement for legacy storage systems and a powerful object and block storage solution for cloud computing environments.

Ceph grew out of Sage Weil’s PhD research, started in June 2004. Ceph is currently backed by Red Hat, but it remains open source software, and the Ceph community is very active (https://github.com/ceph/ceph).

Below is the architecture of Ceph. The core is RADOS (Reliable Autonomic Distributed Object Store). On top of RADOS, Ceph provides several interfaces:

  • LibRADOS: the native API for Ceph, providing operations such as read, write, append, and truncate.
  • RGW: the object storage API for Ceph; it is RESTful and compatible with the Swift and S3 APIs.
  • RBD: the block storage API for Ceph. Its driver is already merged into the Linux kernel, and a driver for QEMU is also provided.
  • CephFS: the filesystem API for Ceph; it is POSIX compatible.

These interfaces are effectively ‘clients’ of RADOS, since all of them are implemented on top of the RADOS protocol. A quick sketch of native object access is shown below.
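As an illustration of what native object access looks like, the rados command-line tool (which is built on top of LibRADOS) can read and write objects directly. This is only a sketch; the pool, object, and file names are placeholders:

rados -p mypool put myobject ./local_file        # write an object from a local file
rados -p mypool append myobject ./more_data      # append more data to the same object
rados -p mypool get myobject ./readback_file     # read the object back
rados -p mypool ls                               # list the objects in the pool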

Erasure code introduction

Erasure coding is a technique whose theory dates back to the 1960s. The most famous algorithm is Reed-Solomon. Over time many variations have appeared, such as Fountain Codes, Pyramid Codes, and Locally Repairable Codes.

An erasure code is usually defined by the total number of disks (N) and the number of data disks (K); it can tolerate N – K failures with a storage overhead of N/K.

For example, take a typical Reed-Solomon scheme RS(8, 5), where 8 is the total number of disks and 5 is the number of data disks. In this case, the data layout across the disks looks like this:

RS(8, 5) can tolerate 3 arbitrary failures: if some data chunks are missing, the remaining available chunks can be used to restore the original content.

In cloud storage, replication is commonly used to guarantee availability, but the storage requirement becomes quite high once capacity reaches the PB level. With erasure coding, much storage space can be saved while keeping the same availability, which dramatically reduces TCO: RS(8, 5), for instance, needs only 8/5 = 1.6x the raw capacity and still survives 3 failures, whereas 3-way replication needs 3x the raw capacity and survives only 2. Ceph has supported erasure coding since the Firefly release.

How does Ceph support Erasure Code

The general read/write flow in Ceph looks like this:

 

With the erasure code feature enabled, the read/write flow changes to:

EC write:

Data is encoded on the primary OSD and then spread to the corresponding OSDs.

 

 

EC read:

Data is gathered from the corresponding OSDs and then decoded.

If some data chunks are missing, Ceph automatically reads the parity chunks and decodes them to recover the original data.
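One convenient way to see where the chunks of an object end up is the ceph osd map command, which prints the placement group and the set of OSDs serving that object; the pool and object names below are placeholders:

ceph osd map <ec_pool> <object_name>    # shows the PG and the OSDs holding the object's chunks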

For now, EC is recommended only for object storage. For filesystem and block storage, Ceph does not recommend using EC, since performance would suffer a lot.

Currently there are several different EC plugins in Ceph: Jerasure, ISA-L, and LRC.

Jerasure is an open source EC library developed by Prof. James Plank; it supports many EC techniques and its performance is good.

ISA-L is optimized for Intel platforms using platform-specific instructions. It is also open source software.

LRC sits at a different layer than Jerasure and ISA-L, since it can use either Jerasure or ISA-L as the backend encoding/decoding library; an example profile is sketched below.
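For illustration only, an LRC profile could be defined as follows; the profile name and the k/m/l values are placeholder choices, where l groups the data and coding chunks into local sets so a single failure can be repaired from fewer chunks:

ceph osd erasure-code-profile set lrc_profile \
     plugin=lrc k=4 m=2 l=3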

How to use the Erasure Code feature in Ceph?

Ceph EC is set at the pool level; all the EC parameters are defined when creating the pool. For example:

ceph osd pool create test_pool \
   erasure-code-directory=<dir>          \ # mandatory
   erasure-code-plugin=jerasure          \ # mandatory
   erasure-code-m=1                      \ # optional and plugin-dependent
   erasure-code-k=2                      \ # optional and plugin-dependent
   erasure-code-technique=reed_sol_van     # optional and plugin-dependent

All objects stored in this test_pool will be erasure coded, and this is completely transparent to the clients; the regular rados commands work unchanged, as shown below.
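A minimal sketch of writing and reading an object in the EC pool (the object and file names are placeholders); the client issues exactly the same commands it would use against a replicated pool:

rados -p test_pool put hello_object ./hello.txt          # the object is encoded and spread across OSDs
rados -p test_pool get hello_object ./hello_readback     # the chunks are gathered and decoded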

Currently Ceph provides its own EC plugin management system, which makes it easy to add more EC plugins in the future. The interfaces are defined as follows:

// Return the minimum set of chunks needed to decode the requested chunks.
set<int> minimum_to_decode(const set<int> &want_to_read, const set<int> &available_chunks);

// Same as above, but take the retrieval cost of each available chunk into account.
set<int> minimum_to_decode_with_cost(const set<int> &want_to_read, const map<int, int> &available);

// Encode the input buffer into data and coding chunks, keyed by chunk index.
map<int, buffer> encode(const set<int> &want_to_encode, const buffer &in);

// Reconstruct the requested chunks from the chunks that are available.
map<int, buffer> decode(const set<int> &want_to_read, const map<int, buffer> &chunks);

Ceph can cleanly load your own EC plugin once it implements these interfaces; the custom plugin can then be selected at pool creation time, as sketched below.
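As a sketch of how a custom plugin would be selected, reusing the erasure-code-plugin and erasure-code-directory parameters shown above (the pool, plugin, and directory names are placeholders):

ceph osd pool create custom_ec_pool \
   erasure-code-directory=<path_to_plugin_dir> \
   erasure-code-plugin=<your_plugin_name>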

To make the erasure code parameters easier to manage, Ceph provides the concept of an EC profile:

ceph osd erasure-code-profile set {name} \
     [{k=data-chunks}] \
     [{m=coding-chunks}] \
     [{directory=directory}] \
     [{plugin=plugin}] \
     [{key=value} ...] \
     [--force]

One can then easily create an EC pool with this erasure code profile:

ceph osd pool create ecpool PG_NUM PGP_NUM erasure ecprofile
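Putting it together, here is a minimal end-to-end sketch; the profile name, the k/m values, and the PG count of 128 are arbitrary choices for illustration:

ceph osd erasure-code-profile set myprofile \
     k=2 m=1 plugin=jerasure technique=reed_sol_van
ceph osd erasure-code-profile get myprofile             # verify the profile settings
ceph osd pool create ecpool 128 128 erasure myprofile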

 


 
