Measuring Ceph RBD performance in a quantitative way (Part I)

Providing a high-performance and cost-effective block storage service is an important and challenging job for every cloud service provider. There are many open source projects trying to satisfy this requirement. Among those, Ceph is an interesting one for its decent unified architecture. To better understand Ceph performance and identify future optimization opportunities, we conducted a number of experiments with different workloads and IO patterns. We are glad to summarize our findings in a series of short articles and share them with the rest of the community, and we are happy to hear comments and feedback. Please note that in this series we only cover Ceph RBD (RADOS Block Device) performance, not object storage or the file system, because Ceph is more widely used as a block device than as the other two. In this first post, we start by describing our testing configuration and methodology, followed by random read/write performance with a micro workload.

Figure 1 describes the testing environment. We have four Xeon UP servers as the storage node cluster and five client servers loaded with VMs (virtual machines) to generate the IO traffic. The storage nodes and clients are connected with a full 10Gb network to make sure we have enough bandwidth available. Table 1 shows the details of the storage node hardware configuration. Each storage node has one Intel Xeon E3 3.5GHz processor with 4 cores and 8 threads, plus 16GB of memory. One external bay with 10x 1TB enterprise SATA disks is connected to the storage node through an LSI 9205 HBA in JBOD mode. Each disk carries a single partition serving one OSD daemon. We also have 3 Intel SSDs used as the Ceph journal. Two of them are connected to 6Gb SATA ports and each serves as the journal for 4 HDDs; the third SSD is connected to a 3Gb SATA port and serves as the journal for the remaining 2 HDDs. For each OSD, 20GB of journal space is allocated (a sample journal mapping is sketched after Table 1).

Testing environment

Figure 1 Testing Environment

Component | Hardware details
CPU       | 1x Intel Xeon E3 3.5GHz (4 cores / 8 threads)
Memory    | 16GB
Storage   | 10x 1TB SATA HDD through LSI 9205 HBA for the filestore; 3x SSD on the on-board SATA controller for the journal
Network   | 1x 10Gb NIC

Table 1 storage node HW configuration
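
For readers who want to reproduce a similar layout, the snippet below is a minimal ceph.conf sketch of how such an OSD/journal mapping could look. The device paths and OSD IDs are illustrative placeholders, not our exact configuration: each OSD keeps its filestore on one HDD partition (XFS) and points its 20GB journal at a dedicated partition on one of the SSDs.

[osd]
# 20GB of journal space per OSD (value in MB)
osd journal size = 20480

[osd.0]
# filestore on the first HDD partition
osd data = /var/lib/ceph/osd/ceph-0
# journal on the first 20GB partition of SSD #1
osd journal = /dev/sdk1

[osd.1]
osd data = /var/lib/ceph/osd/ceph-1
osd journal = /dev/sdk2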

On the software side, Ubuntu 12.10 is deployed on the storage nodes, client hosts and client VMs, with kernel version 3.6.6. XFS is selected as the storage node file system due to its better stability. The Ceph version we tested is 0.61.2 (Cuttlefish) with replica=2. Figure 2 summarizes the tuning we applied to Linux and Ceph (a small example of the replication and XFS settings follows Figure 2).

Tuning Summary

Figure 2 Tuning summary
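
As a concrete illustration of two of the settings above, the replication factor and the XFS filestore mount can be configured roughly as shown below. This is a sketch using common choices: the pool name rbd and the mount options noatime,inode64 are typical defaults rather than a literal copy of the tuning in Figure 2.

# keep two copies of each object in the pool backing the RBD volumes
ceph osd pool set rbd size 2

# format and mount one OSD filestore partition as XFS
mkfs.xfs -f /dev/sdb1
mount -t xfs -o noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0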

To define a reasonable test methodology, we raise the question "If I am an IT guy responsible for building an AWS EBS-like storage cluster service, what kinds of metrics are important to me?" To answer that question, we try to understand what AWS EBS offers today. Amazon provides several different EBS volume types to satisfy different customers' requirements. According to Amazon (link), standard volumes deliver approximately 100 IOPS on average with a best-effort ability to burst to hundreds of IOPS. We conducted some tests on AWS to verify that. Over 7 days, we periodically started VMs, ran each for 2 hours, and tested standard EBS performance with four different IO patterns (random read, random write, sequential read and sequential write). The results differ across VM flavors (micro, small, medium and large) and vary over time, but in general they match Amazon's declared SLA. We observed an average of ~300 IOPS for random IO (either the AWS data center is not busy enough or they use SSDs to speed it up) and ~60MB/s of sequential bandwidth. Based on the test data, we set the following expectation: one 60GB volume should support ~100 IOPS of random IO with reasonable latency (<20ms) and 60MB/s of sequential bandwidth. The storage cluster performance is then determined by how many volumes can be supported at such a predefined per-VM performance level.

Details of the test methodology can be found in Figure 3, and a sketch of the throttled FIO invocation is shown after it. To better measure the end-user performance, we conduct all the tests inside the VMs and use the QEMU RBD driver. FIO is selected as the workload running inside each VM. Four different IO patterns (4KB RR/RW and 64KB SR/SW) are tested, and the FIO --rate and --rate_iops parameters are used to cap the throughput (100 IOPS for RR/RW and 60MB/s for SR/SW) to emulate the throttling of a real multi-tenant environment and provide better isolation. By gradually increasing the number of volumes and stress VMs, we get a load line for the storage cluster. The final performance metric is the aggregated throughput across all the volumes with two QoS criteria satisfied: 1) the average random latency is less than 20ms, and 2) the average per-VM throughput is larger than 90% of the predefined target (i.e. 54MB/s for SR/SW and 90 IOPS for RR/RW).

Test methodology

Figure 3 Test methodology
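
For reference, the two key pieces of the per-VM setup look roughly like the sketch below. The pool name, image name and device path are placeholders, and the FIO lines only illustrate the --rate_iops / --rate throttling described above; the full job file we used is posted in the comments.

# attach a 60GB RBD volume to the guest through the QEMU RBD driver
qemu-system-x86_64 ... \
  -drive file=rbd:rbd/vol-001,format=raw,cache=none,if=virtio

# inside the VM: 4KB random read capped at 100 IOPS
fio --name=rand-read-4k --filename=/dev/vdb --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=8 --runtime=300 --rate_iops=100

# inside the VM: 64KB sequential read capped at 60MB/s
fio --name=seq-read-64k --filename=/dev/vdb --rw=read --bs=64k \
    --ioengine=libaio --direct=1 --iodepth=8 --runtime=300 --rate=60m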

OK, now let's move to the fun part: the performance data. Figure 4 shows the 4KB random read result. As the number of volumes/VMs increases, the average latency also increases and the average per-VM throughput gradually goes down. Per our previous QoS definition, the peak throughput is reached at ~30 VMs, with an average of 95 IOPS and 21ms latency per VM. The corresponding total throughput of the cluster is 2858 IOPS. Figure 5 shows the 4KB random write result. Compared to random read, random write shows a much lower latency (3ms) when the load is not very high (fewer than 30 VMs). But when the volume count grows beyond 30 VMs, the latency jumps to 24ms and the average per-VM throughput drops to 82 IOPS. So we pick 30 VMs as the reporting point, with a total cluster throughput of 3000 IOPS. We believe the reason random write shows such a sharp change is that the Ceph SSD journal and the filestore page cache hide a lot of the HDD latency. However, when the load becomes high enough, the cache can no longer hold all the requests and the IO still needs to be completed on the HDDs, resulting in longer latency.

RR performance

Figure 4 4K Random Read

RW performance

Figure 5 4KB random write

Is the performance good enough? We did some tests to understand the native disk performance. The HDD we use is a 1TB Seagate enterprise SATA drive. FIO tests show it can support ~90 IOPS for read and ~200 IOPS for write (thanks to the write cache) at 20ms latency per disk. Based on this data, we can calculate the Ceph random read/write efficiency in Table 2 (the arithmetic is spelled out after the table). The first column shows the maximum throughput we measured at 80 VMs/volumes. In theory, creating more volumes/VMs should yield a higher throughput, but due to the QoS requirement we did not push the stress load further. The second column shows the peak throughput with the QoS taken into account, which is 2858 IOPS for read and 3000 IOPS for write. The fourth column shows the theoretical throughput for 40 SATA disks, which is 3600 IOPS for read and 4000 IOPS for write (replica=2). Thus we can calculate the efficiency as 79% and 75%. The write efficiency is a little misleading due to the SSD journal cache impact; however, it still reflects the capability of the storage system, so we keep it there for reference.

Ceph random efficiency

Table 2 Random efficiency analysis
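
The efficiency numbers in Table 2 come from a simple back-of-the-envelope calculation based on the per-disk figures above:

Read : 40 HDDs x ~90 IOPS             = ~3600 IOPS theoretical; 2858 / 3600 ≈ 79%
Write: 40 HDDs x ~200 IOPS / 2 copies = ~4000 IOPS theoretical; 3000 / 4000 = 75%

The division by 2 on the write side accounts for replica=2: every client write is committed to two OSDs.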

In theory, the throughput of random IO depends on the latency if the load pressure is fixed. Figure 6 gives more details of the latency trend for random read. The red line (inside-VM latency) is what we measure from the testing VM (with FIO), and the green line (OSD latency) represents the disk latency we get from the storage node (with iostat; a sample invocation follows Figure 6). There is about 2ms of latency overhead coming from the network transmission, the Ceph stack or the client-side code, which definitely leaves some opportunities for optimization. The good thing is that Ceph shows good scalability in handling random IO: as the load pressure increases, the latency overhead stays almost the same. The other thing we can learn here is that for random IO the bottleneck is the mechanical latency of the spinning HDDs, which also suggests an opportunity to mix SSDs and HDDs in the data filestore to achieve a balance between capacity and performance.

Latency trend analysis

Figure 6 Random Read latency trend analysis
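
For readers who want to reproduce the OSD-side measurement, the per-disk latency can be read from the await column reported by iostat on each storage node. This is a generic example rather than our exact collection script:

# print extended per-device statistics every 5 seconds on a storage node
iostat -x -d 5

# the 'await' column is the average time (in ms) an IO request spends
# queued plus being serviced, which is what we plot as the OSD latency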

In summary, we think Ceph does a pretty good job of handling random IO. The traffic is well distributed across all the nodes, and the SSD journal does speed up writes a lot. In the next post, we will look at the sequential performance of Ceph. We are happy to hear any feedback or comments. We will also give a talk about Ceph performance at the coming OpenStack conference in Hong Kong (link); it would be nice to see you there. This report is truly a team effort. Thanks to Shu Xinxin, Zhang Jian, Chen Xiaoxi, Xue Chendi, Thomas Barnes and others for their help in conducting the performance evaluation and providing valuable suggestions. Special thanks to Mark Nelson (blog) from Inktank for his insightful suggestions that made this better.


Comments


Hello,

Thanks for this article, it is very interesting. I've noticed you made your tests with HDD + SSD (for journals). On a pure SSD environment, would you expect similar efficiency? And what would you change from the design?


Some people asked for the FIO parameters. Here they are:

Command line :  QUEUE_DEPTH=${qd} RAMP_TIME=${warm_time} DISK=${disk} SIZE=${size} RECORD_SIZE=${record_size} RUNTIME=${run_time} fio --output ${your_output_file_name} --section ${your_job_name} all.fio
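
For example, to run the 4KB random-read job from the file below (saved as all.fio) against a 60GB volume attached as /dev/vdb, with an assumed queue depth of 8, a 60-second warm-up and a 300-second run, the invocation would look like this (all values are illustrative):

QUEUE_DEPTH=8 RAMP_TIME=60 DISK=/dev/vdb SIZE=60g RECORD_SIZE=4k RUNTIME=300 \
  fio --output rand-read-4k.log --section rand-read-4k all.fio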

FIO configuration file (all.fio):

# Common options for all jobs. The ${VAR} placeholders are substituted by fio
# from the environment variables set on the command line; --section selects
# which single job to run.
[global]
iodepth=${QUEUE_DEPTH}
runtime=${RUNTIME}
ioengine=libaio
direct=1
size=${SIZE}
filename=${DISK}
ramp_time=${RAMP_TIME}

[seq-read-64k]
rw=read
bs=${RECORD_SIZE}
iodepth_batch_submit=8
iodepth_batch_complete=8

[seq-write-64k]
rw=write
bs=${RECORD_SIZE}
iodepth_batch_submit=8
iodepth_batch_complete=8

[rand-write-64k]
rw=randwrite
bs=${RECORD_SIZE}
iodepth_batch_submit=1
iodepth_batch_complete=1

[rand-read-64k]
rw=randread
bs=${RECORD_SIZE}
iodepth_batch_submit=1
iodepth_batch_complete=1

[rand-write-4k]
rw=randwrite
bs=${RECORD_SIZE}
iodepth_batch_complete=1
iodepth_batch_submit=1

[rand-read-4k]
rw=randread
bs=${RECORD_SIZE}
iodepth_batch_submit=1
iodepth_batch_complete=1

[seq-read-4k]
rw=read
bs=${RECORD_SIZE}
iodepth_batch_submit=8
iodepth_batch_complete=8

[seq-write-4k]
rw=write
bs=${RECORD_SIZE}
iodepth_batch_submit=8
iodepth_batch_complete=8