Enabling Persistent Memory in Cloud Software Architectures
Intel launched the first generation of Intel® Optane™ persistent memory, and it is widely available from several vendors with full operating system support from Microsoft Windows*, most Linux* distros, and file systems like ext4, XFS, and NTFS. Persistent memory is a technology that is byte-addressable, like DRAM, and has high-capacity non-volatile storage capabilities. So, how do developers leverage persistent memory?
What You Will Learn
- Get a link to a downloadable Programming Persistent Memory eBook
- Learn how Aerospike uses libpmem to add support for persistent memory
- Discover the lessons learned when enabling an Apache Spark* SQL stack to use persistent memory
- Find out how better operational efficiency was achieved in the data center by adding support for persistent memory in the caching software, along with the changes made to Redis
Welcome, everyone. Thank you for taking the time to attend this webinar entitled Enabling Persistent Memory Usages in Cloud Software Architectures. Intel launched the first generation of Optane Persistent Memory, and it is widely available from several vendors with full operating system support from Microsoft Windows, most Linux distros, and file systems like ext4, XFS, and NTFS. Persistent Memory is a technology that is byte-addressable, like DRAM, and has high-capacity non-volatile storage capabilities. So, how do developers leverage Persistent Memory?
Hi, I'm Usha Upadhyayula. I'm part of the Software Architecture Group in Memory and Storage Products with Intel. I'll be the moderator for this session. Our excellent lineup of speakers, Ginger Gilsdorf, Piotr Balcer, and Kuba, will tell us how they enabled some of the cloud software stacks using the Persistent Memory Development Kit, or PMDK. PMDK is a growing collection of open source libraries that help developers ease into enabling their apps for Persistent Memory. As you see on the slide, going from bottom to top, you see libpmem. The libpmem library provides low-level Persistent Memory support. Libpmemobj provides a transactional object store with memory allocation, transactions, and general facilities for Persistent Memory programming. Libvmemcache, an embedded and lightweight in-memory caching solution, is also part of PMDK. Libmemkind, though not part of PMDK, simplifies Persistent Memory usage in a volatile mode.
Hang around to the end, and we'll give you a link to a downloadable Programming Persistent Memory eBook that describes all these libraries in depth. First, Ginger will tell us how Aerospike uses libpmem to add support for Persistent Memory, Piotr will go over lessons learned when enabling a Spark SQL stack to use Persistent Memory, and Kuba will talk about how we achieved better operational efficiency in the data center by adding support for Persistent Memory in the caching software, along with the changes we’re making to Redis.
As a courtesy to speakers, all lines are muted for the duration of the session. You may ask questions using the Q&A box at any time throughout the webinar, and we'll answer them after the main presentation.
With that, I'll hand over to Ginger and we'll come back at the end to answer any questions in the remaining time. Ginger, take it away.
Thanks, Usha. As Usha mentioned, I'll be discussing the Aerospike NoSQL Database. Sorry, that title slide went by pretty quickly, but I don't have a whole lot of time, so I just want to give you a background on what Aerospike is, in case you're not familiar with them, and then discuss how they use Intel's Optane Persistent Memory along with the PMDK library in order to implement AppDirect support for their database, in multiple ways actually.
So, Aerospike is a NoSQL key-value database. They've been in the Silicon Valley area for more than 10 years now. They do have an open source Community Edition, but for most of the support and features that you're going to want for Persistent Memory, you will want the Enterprise Edition, which is a paid subscription. I've listed a few of their customer types, but essentially their customers are anyone who relies on fast access to very large amounts of data, and they need that to be completely reliable and available all the time. The architecture, in the simplest of terms, consists of some place for storing records, or values. Each record has a corresponding index entry, or key, that helps you get to the record quickly. They call it the hybrid memory architecture, and in that sense, you can use any storage or memory tier to store both the index and the actual records. The architecture is clustered and shared-nothing, and they achieve fault tolerance through replication and self-healing clusters. It's very much a read/write request workload from Aerospike clients to the Aerospike database.
So, what I want to describe here is how Aerospike was configured before version 4.5, and this is, again, a very simplified view, but in essence, each data node that is in an Aerospike cluster contains a set of data. That data can be in DRAM, but more likely it's on SSD, and each entry in the database requires an entry in the index. There are two options to store the index, in DRAM or on SSD, but customers come back and say, well, hold on, DRAM is really expensive. If I want to store more records in my database, I have to either buy more expensive DRAM or scale out to more nodes. On the other hand, yes, SSD is cheaper, but it's not nearly as fast as what a lot of customers need to access the data. And you can see that every access to the database does require going through the index, so when Persistent Memory was first coming out and Aerospike was hearing about the features, they immediately latched onto the fact that Persistent Memory is going to be less expensive than DRAM, while faster than SSD. But then, of course, they said, oh, it's got that persistence feature, that will really solve this main pain point that, if I store the index in DRAM, every time I reboot the system, I lose it. The only option is to go back through the data and rebuild the index, which can be a very costly experience. So, Aerospike said, great, Persistent Memory will save our customers money, it's faster than SSD, and it reduces that index rebuild time.
And then, just of note, Aerospike and Intel worked together on a prototype before this hardware was actually available. That helped us estimate the performance difference between storing the index in DRAM and in Persistent Memory, and I'll cover that a little bit later, and it also allowed Aerospike to get ahead by doing some optimizations to their index data structure to prepare for that.
So, the idea here is moving the index—and you can see this is in Enterprise Edition 4.5 and above, moving the index from memory into Persistent Memory. So, what we started with was that the index in-memory consists of blocks of Linux shared memory and updated in place. Since the index was in shared memory, if you wanted to shut down Aerospike and maybe do a process upgrade, you could contain or keep that index in shared memory and be able to just reattach to it on restart. That solved one problem, but that doesn't help when you have to reboot the whole system, then you still have to go through rebuilding that index.
So, here I've highlighted in the process what's different when we're dealing with Persistent Memory. Instead of allocating blocks of shared memory, we're allocating files in a Persistent Memory aware file system, and these files are then memory mapped. Once they're memory mapped, you can just update them in place, similar to how the memory version works. When you shut down Aerospike, it's really important that Aerospike goes through this process to flush the index to persistence. This is something that's covered in PMDK, but if you were to close Aerospike, or any process using Persistent Memory, without flushing, some of those changes could be left in the CPU caches and wouldn't actually be flushed to persistence. So, that is an important step. And then you can see you can just reattach on restart, but the big benefit here is that on a system reboot, you still retain that index, even across the power cycle. All you have to do is simply reattach once you've restarted. And soon I'll share some performance numbers on how big of an impact that does make. So, Aerospike said this is what we want to do, now let's look at the code and see how we can implement this in our code base.
We looked at a couple of different libraries from PMDK and it turns out that libpmem was a good fit for Aerospike. It is considered the lowest level of support, but it also gives the developers a high level of control over what they're doing, and Aerospike said we've got a lot of things in place to manage the index once we've got a block of memory, Persistent Memory, to work with. Aerospike can track pointers to those blocks and work with them. What they needed libpmem to do is actually allocate the memory and then flush it to persistence. This is just a very simple example of how one might work with Persistent Memory in the libpmem library. You can see it can be as easy as creating a PMEM file, memory mapping it, writing to the file, and then when you do this unmap and exit, PMDK underneath handles the flushing that I've talked about.
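The create, map, write, flush, and unmap flow described above can be sketched without the slide. This is only an illustration under stated assumptions: it uses Python's standard `mmap` module on an ordinary temporary file as a stand-in for the C libpmem calls (`pmem_map_file`, `pmem_persist`, `pmem_unmap`), the file name `index.pool` and the helper names are hypothetical, and it flushes explicitly before unmapping rather than relying on the library to do so.

```python
import mmap
import os
import tempfile

def write_and_flush(path, size, payload):
    """Create a file, memory-map it, store bytes in place, flush, unmap."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)             # reserve the space for the mapping
        with mmap.mmap(fd, size) as m:     # map the file into our address space
            m[0:len(payload)] = payload    # update in place, like ordinary memory
            m.flush()                      # push the stores to the backing file
        # leaving the "with" block unmaps the region
    finally:
        os.close(fd)

def read_back(path, length):
    """Re-open the file, as a restarted process would, and read the data."""
    with open(path, "rb") as f:
        return f.read(length)

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "index.pool")   # hypothetical file name
write_and_flush(demo_path, 4096, b"hello, pmem")
print(read_back(demo_path, 11))
```

On real persistent memory the file would live on a DAX-mounted file system, and the flush step is what guarantees that stores have left the CPU caches before shutdown.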
So, I haven't mentioned this yet, but Aerospike’s first foray into Persistent Memory was to work with the index, and since version 4.5, Aerospike has also added the ability to store data in Persistent Memory. That came out with version 4.8. So, you can see that this opens a wide variety of ways you can deploy their hybrid memory architecture. Right now, though, I want to highlight that with the index in Persistent Memory, there are two cases that work out really well. One is to store data in Persistent Memory as well, and one is to keep data on SSD. Both of those cases offer a high degree of performance at a cost-effective point, and they also offer that warm restart that I've mentioned before. So, you can see there are a lot of possibilities, but in this slide, I'm purely talking about the case where the index is in Persistent Memory.
The benefit of persistence comes in this restart time. If you need to reboot your entire system, rebuilding the index in the DRAM version can take you hours, depending on the size of your database, because like I said, you have to walk through each record in the database and rebuild the index piece by piece. With Persistent Memory, that can be done in seconds, and in our measurements, we found a restart time reduction of 130x, which is quite impressive. This leads to the ability, if you have a cluster of Aerospike nodes and you need to do an upgrade on every single one of them, to get that done in an afternoon instead of a week or two.
In terms of capacity, because the Persistent Memory DIMMs have a larger capacity per DIMM than typical DRAM DIMMs, you can have a larger index on a single node, which also means that you can expand the amount of data you store per node. We've seen cases where customers have gone from a 20-node cluster down to 10 nodes. They're just more dense nodes. So, this is clearly a big TCO play here.
And then in terms of performance, I mentioned that Aerospike was very careful to know upfront that performance with Persistent Memory is not identical to DRAM, but it is close. So, people often ask, do you get a performance gain from using Persistent Memory? In this case, since we're replacing DRAM, the answer's no. There's not a gain, but you're able to maintain that performance while getting all the other benefits. Depending on what read/write mix you're using, you can expect to see about a five percent reduction in performance, like I said, while also getting the restart times and the larger capacity.
So, I will leave you with just a few resources here. Aerospike has written blogs on both of their releases around Persistent Memory. You can look at the actual libpmem source code on GitHub, and Intel's Developer Zone has an article written about how Aerospike implemented PMDK as well.
I'll leave that for a second, and I will go ahead and pass it off to Piotr, who is going to discuss Apache Spark SQL.
Thanks, Ginger. So, let's talk caching in Spark.
So, let's introduce Spark SQL a little bit. So, Spark is a big data analytics framework that distributes large queries over compute servers, but the key thing to understand for the purpose of this presentation is that we have queries that potentially consume vast amounts of data, and the compute needs to be colocated alongside that data so that the processing is as fast as possible.
So, Intel developed a version called Spark OAP, and that provides cache-aware scheduling, which means that the scheduling of those big queries is cache aware. Meaning that the compute nodes that the queries are executing on are colocated closely with the data, so that we minimize cache misses. The other part of the OAP solution is off-heap caching. That is going to be the focus of the rest of my talk.
So, let’s look over some of the problems that we want to solve and how our solution improves on them. So, in Spark OAP, we have the unmodified SQL layer, you have Spark itself, and then we have OAP, which is a planner for Spark that provides the scheduling that I talked about. And then you can have multiple layers of caching. The most typical solution will just use a DRAM cache that is backed by some local disk storage, and missing in our DRAM cache will mean that we either have to look on the disk, or worse, go over the network, which would very much impact performance.
So, the solution with Persistent Memory allows you to have vastly more capacity in your node, thus minimizing cache misses, but now that we have PMEM, the problem becomes that caching vast amounts of data isn't simple. So, what we set ourselves with Spark OAP was that we needed a very scalable in-memory solution, because you typically have many [unclear speech] in a node, and we also needed a fast algorithm so that all the data can be efficiently utilized.
Caches already exist, so why can't we just use some off-the-shelf solution? Well, most in-memory databases and in-memory caches use some form of malloc, that is, manual memory management. Malloc, in its original implementation, isn't really that well suited for caching, which we will talk about later. The other problem with just using existing solutions is that PMEM is typically exposed by the operating system as a file, so you can't just connect the file to a heap. You have to memory map that file and then tell the heap to consume that, and not many existing heaps can actually do that, so we had to come up with a solution that would allow us to connect to a file on PMEM.
So, one of the problems of just using memory allocators, so typical implementations of malloc, is fragmentation. Fragmentation is when you allocate multiple blocks of some size, then free those blocks in a pattern that prevents further allocations of larger sizes. So, on this picture here, we allocate three objects, and then two of them are freed, leaving holes in the heap, so that we cannot allocate an object larger than the small ones from the first step. So, this isn't ideal if our workload doesn’t have the exact same sizes, and not all workloads suffer from fragmentation, but many do. So, the existing solutions try to solve those problems, and they do that either by compacting garbage collection, which can be expensive and is actually very difficult if your heap is potentially in terabytes, or by defragmentation, like Redis does, by just trying to reallocate objects opportunistically so that maybe those objects happen to land in better places on the heap. This is also a potentially very expensive operation and has to be tightly integrated with the memory allocator.
Other solutions are the slab allocators, but those also rely on assumptions about the range of sizes that the workload will actually allocate. If our workload doesn’t match that spread of allocation sizes, then fragmentation can still occur for much larger allocations.
So, in the solution that we proposed for Apache Spark, we used extent allocation, which is what file systems typically do. So, instead of requiring objects to be contiguous, if the heap doesn’t allow us to allocate a large object, it simply allocates two smaller ones and connects them through a linked-list kind of mechanism. So, typically this would be a problem because of small object sizes, but PMEM allows us to do that because we are not restricted by large units or very small sizes.
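To make the idea of extent allocation concrete, here is a toy sketch in Python. It is not the libvmemcache implementation: the class name `ExtentHeap`, the fixed block size, and the handle format are all assumptions chosen for illustration. It shows only the core property described above, namely that one logical value can be stored across non-adjacent free blocks instead of failing for lack of one contiguous region.

```python
class ExtentHeap:
    """Toy heap over fixed-size blocks; a value may span non-adjacent blocks."""

    def __init__(self, total_blocks, block_size):
        self.block_size = block_size
        self.buf = bytearray(total_blocks * block_size)
        self.free_blocks = list(range(total_blocks))

    def put(self, data):
        needed = max(1, -(-len(data) // self.block_size))  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free blocks")
        extents = [self.free_blocks.pop() for _ in range(needed)]
        for i, blk in enumerate(extents):
            chunk = data[i * self.block_size:(i + 1) * self.block_size]
            start = blk * self.block_size
            self.buf[start:start + len(chunk)] = chunk
        return extents, len(data)          # handle: block list plus length

    def get(self, handle):
        extents, length = handle
        out = bytearray()
        for blk in extents:                # stitch the extents back together
            start = blk * self.block_size
            out += self.buf[start:start + self.block_size]
        return bytes(out[:length])

    def free(self, handle):
        self.free_blocks.extend(handle[0])

heap = ExtentHeap(total_blocks=4, block_size=4)
a = heap.put(b"aaaa")
b = heap.put(b"bbbb")
c = heap.put(b"cccc")
d = heap.put(b"dddd")
heap.free(a)                               # free two blocks that are
heap.free(c)                               # not next to each other
big = heap.put(b"ABCDEFGH")                # needs two blocks; still succeeds
print(big[0], heap.get(big))
```

A contiguous allocator would reject the final `put`, because no two adjacent blocks are free; the extent version trades one extra indirection on read for freedom from that kind of fragmentation.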
So, the other problem we had to solve was a scalable replacement policy, because, like I mentioned, we wanted our solution to scale across many cores. The replacement policy we wanted to implement was LRU, and typically how you implement LRU is you take a doubly-linked list and then just move objects around, but the problem with that implementation is lock contention. Every single time you read an object from the cache, you have to modify the linked list, and if you do that from multiple threads, everything is going to be bottlenecked on the locks on the linked list. So, this version wasn’t acceptable for us.
So, we addressed this problem by adding a ring buffer in front of the LRU list, and the ring buffer was lock-free, so that the get operation will simply append an element to the ring buffer, which is a lock-free operation, so multiple gets could each just append a new element to the ring buffer. And then that ring buffer was processed whenever we needed to look into the LRU list, and this allowed us to have a consistent LRU at a lower cost.
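The deferred-update idea can be sketched as follows. This is a single-threaded illustration, not the libvmemcache code: `BufferedLRU` is a hypothetical name, and a `collections.deque` stands in for the lock-free ring buffer, so the sketch shows the ordering logic only, not the concurrency properties. Reads append the key to the buffer without touching the LRU order; the buffer is drained only when an eviction decision is needed.

```python
from collections import OrderedDict, deque

class BufferedLRU:
    """LRU cache whose reads only log accesses; reordering happens lazily."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # least recently used entry comes first
        self.touched = deque()       # stand-in for the lock-free ring buffer

    def get(self, key):
        if key not in self.items:
            return None
        self.touched.append(key)     # record the access, do not reorder yet
        return self.items[key]

    def _drain(self):
        # Replay the buffered accesses so the LRU order becomes consistent.
        while self.touched:
            key = self.touched.popleft()
            if key in self.items:
                self.items.move_to_end(key)

    def put(self, key, value):
        self._drain()                # eviction needs an up-to-date order
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

cache = BufferedLRU(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # only appended to the buffer
cache.put("c", 3)     # drains the buffer, then evicts "b" rather than "a"
print(list(cache.items))
```

The design choice is that the hot path (`get`) never takes the list lock; the cost of reordering is paid in batches on the much rarer eviction path.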
So, the solution that I just described is actually implemented as a [unclear speech] library. We called it VMEMCACHE, and it's a lightweight, embeddable in-memory caching solution. It’s generic, it’s not just for Spark, and the interface it exposes is just a simple key-value store. It’s open source, so you can go and get hold of it if you want.
Let’s look at the Spark solution again with VMEMCACHE. So, again, Spark itself remains unmodified. OAP with VMEMCACHE now has a new working mode where data is actually located in system memory, and that is backed by the Intel DIMM, and this solution allows you to have much greater amounts of data on your nodes, meaning that you can get more performance.
So, if there are two things that I'd like you to take away from this presentation, they are that cache-aware scheduling, or [unclear speech], is a very difficult problem, but the OAP solution tries to solve it, and that implementing a caching solution isn't as simple as it sounds, and there are many problems that might not be obvious and that require careful consideration when weighing tradeoffs.
So, the entire solution that I've just described, that includes VMEMCACHE, is also open source and you can find it under the links on the slides.
OK, with that, we can move on to the next presentation by Kuba. He will be talking about Twitter’s caching framework. So, Kuba, take it away.
Thank you, Piotr. Yes, Twitter’s caching framework, called Pelikan, is a caching engine developed and used by Twitter in their infrastructure. It can be used as a replacement for Redis and Memcached, as it supports both protocols. Twitter reports that many of their cache instances are actually memory bound, meaning that increasing memory size would help caching effectiveness. Increasing cache hit rates decreases the cost of other infrastructure, and also decreases latency, as most of the requests don't have to go to a database. There were two reasons why we saw value in integrating Pelikan with Persistent Memory. The first is the bigger capacity available with byte [00:27:24] Intel Optane Persistent Memory. It allows you to store more objects in the cache or run more instances on a single machine. The second reason is that by utilizing the persistency feature of the memory, we were able to rebuild the cache much faster after a graceful shutdown, so in the case of planned maintenance activities like binary replacement or hardware upgrades. Please keep in mind that this is not the same as [00:28:02] algorithm, which is more complex and requires using libpmemobj from PMDK and features like the persistent allocator and transactions.
Now let's move to the details of how the integration with AppDirect was done. I will focus on the Twemcache storage engine, which is more complex than slimcache and is a better example. Both of these engines are part of the product called Pelikan. Twemcache is a slab-based storage for caching key and value pairs in DRAM. The whole memory pool is allocated by the application and divided into parts called slabs. Each slab consists of some number of items, which actually store the data. Each item with existing data has its own entry in the hash table. Our modification introduced an abstraction layer for managing large amounts of memory. With a simple modification in the code, it is possible to use one of two options: Persistent Memory, where slabs are allocated in a memory pool from a memory-mapped file on file system DAX, or volatile memory, which is allocated using the standard [00:29:29] API. So, when an application is configured to use Persistent Memory, it allocates all the slabs and items from there, while the hash table is still in DRAM, and it can be rebuilt after a graceful shutdown from information stored in Persistent Memory.
Here is sample code which shows how the abstraction layer for managing different memory pools was done. If a path to a directory on file system DAX is not provided, then normal allocation happens. If PMEM is configured, then we use a function from the PMDK libpmem library to memory map the file.
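Since the slide code is not reproduced in the transcript, the following is a minimal sketch of what such an abstraction layer might look like. It is not Pelikan's C code: the `DataPool` class is a hypothetical stand-in, a `bytearray` plays the role of volatile memory, and Python's `mmap` on a regular file plays the role of mapping a file on file system DAX with libpmem. The point is that callers see one interface, and the backing store is chosen by whether a path is configured.

```python
import mmap
import os
import tempfile

class DataPool:
    """One interface over a volatile buffer or a memory-mapped file."""

    def __init__(self, size, path=None):
        self.path = path
        self._fd = None
        if path is None:
            self.mem = bytearray(size)             # volatile: plain heap memory
        else:
            self._fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
            os.ftruncate(self._fd, size)           # existing contents are kept
            self.mem = mmap.mmap(self._fd, size)   # file-backed mapping

    def close(self):
        """Tear down the pool; only the file-backed pool needs a flush."""
        if self._fd is not None:
            self.mem.flush()                       # make the stores durable
            self.mem.close()                       # unmap
            os.close(self._fd)
            self._fd = None

# Volatile pool: same slicing interface, contents vanish with the process.
volatile = DataPool(4096)
volatile.mem[0:4] = b"slab"

# File-backed pool: contents survive a close and reopen.
pool_path = os.path.join(tempfile.mkdtemp(), "slabs.pool")  # hypothetical name
persistent = DataPool(4096, pool_path)
persistent.mem[0:4] = b"slab"
persistent.close()

reopened = DataPool(4096, pool_path)
print(bytes(reopened.mem[0:4]))
```

Keeping the choice inside the constructor and `close` means the slab code above this layer does not need to know which kind of memory it is running on.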
After the modification in the slab allocation part is done, it is necessary to take care of some actions on application teardown. The first step is to store some heap information in our data pool header. The most important is the address returned by the pmem_map_file function, which is the base address of the Persistent Memory pool containing all the slabs. The next step is calling msync on the pool file to make sure that all modified data is safely stored on the PMEM device, not in the CPU cache.
Here is the part of the code where pool metadata, including the previous pool address, is stored, and then datapool_close is called. This function abstracts the different memory pools, DRAM and PMEM, and the different actions on them. On DRAM, it just calls free, while on PMEM it uses the pmem_unmap function from the PMDK libpmem library.
More steps need to be done on application start-up. We again map our file stored on file system DAX, but it is possible that the mapping returns a different address than previously. It means that all the pointers that are stored in our memory pool are invalid and we need to fix them. This can be done by calculating the pointer difference between the current and previous mappings and applying that change to every pointer. So, we iterate through all the slabs and all the items in every slab. We add items which are marked as linked to the DRAM hash table, and we also rebuild the slab’s least recently used structure.
This slide shows the general idea of this modification. The slab_lruq_rebuild function first reads the previously saved pool metadata, then there is a calculation of the pointer difference between the previous and current memory pool addresses. Then the difference is applied to every pointer, here in the case of the LRU structure. Next in the code there is a loop where every slab is walked. The slab length is known, so it is easy to jump to the next slab [00:33:26] knowing the address of the first slab, and the first slab is stored at a fixed location after our pool metadata. The same is true for the items inside the slabs, over which we can iterate just by knowing their size. During this iteration, we can rebuild the hash table in DRAM by calling the unmodified function that links an item.
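The pointer fix-up and hash table rebuild can be illustrated with a small simulation. This is a sketch under loud assumptions, not Pelikan's code: the pool is a `bytearray`, "addresses" are plain integers, and the header and item layouts (`HEADER`, `ITEM`) are invented for the example. It shows the two steps described above: compute the difference between the old and new base address and apply it to every stored pointer, then walk the fixed-size items to rebuild a DRAM-side lookup table.

```python
import struct

HEADER = struct.Struct("<Q")     # hypothetical header: previous base address
ITEM = struct.Struct("<QB7s")    # hypothetical item: next pointer, linked flag, key

def build_pool(base, keys):
    """Lay out items back to back, storing absolute pointers based on `base`."""
    count = len(keys)
    pool = bytearray(HEADER.size + count * ITEM.size)
    HEADER.pack_into(pool, 0, base)
    for i, key in enumerate(keys):
        off = HEADER.size + i * ITEM.size
        nxt = base + off + ITEM.size if i + 1 < count else 0   # 0 means no next
        ITEM.pack_into(pool, off, nxt, 1, key)
    return pool

def rebuild(pool, new_base):
    """Rebase stored pointers and rebuild the key -> address table."""
    (old_base,) = HEADER.unpack_from(pool, 0)
    delta = new_base - old_base              # pointer difference between mappings
    table = {}
    for off in range(HEADER.size, len(pool), ITEM.size):   # item size is known
        nxt, linked, key = ITEM.unpack_from(pool, off)
        if nxt:
            nxt += delta                     # fix the stale pointer
        ITEM.pack_into(pool, off, nxt, linked, key)
        if linked:
            table[key.rstrip(b"\0")] = new_base + off
    HEADER.pack_into(pool, 0, new_base)      # remember the base for next time
    return table

pool = build_pool(0x1000, [b"k1", b"k2", b"k3"])   # "previous run" mapped at 0x1000
table = rebuild(pool, 0x9000)                      # "this run" mapped at 0x9000
first_next = ITEM.unpack_from(pool, HEADER.size)[0]
print(hex(first_next), hex(table[b"k2"]))
```

After the rebuild, the first item's next pointer and the table entry for the second item agree, which is the invariant the start-up pass has to restore.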
To summarize, all the changes that we introduced were modest. They focused only on a few places in the code. They didn’t modify the Pelikan algorithm, and finally, we got a single product that can be used on different hardware.
So, what about the results? We are considering two things. The first one is general performance, measured by throughput and latency metrics. From our and Twitter's experiments, we saw that performance is predictable. We got stable numbers when we increased the size of the value objects. Also, when the total size of the cache instance is increasing, we observed throughput and latency to be within the required SLA range.
When we compare these numbers with Memory Mode results, we see that with AppDirect, it's easier to control tail latency. The reason for that is that data placement in AppDirect is explicit. Also, we see that for the workload that we tested, the performance is comparable to DRAM. Most of the commands reaching the cache layer are read commands, and usually the network is the bottleneck and is the main factor in latency.
The other metric that we are considering is restart time. In our tests, for a single instance with 100 gigabytes of slab data, restart time was four minutes. For concurrent access with 18 instances per host, it was five minutes in total. These numbers show that an AppDirect solution can speed up maintenance by one or two orders of magnitude. This result shows that with a simple and small change in application code, there's big value in AppDirect adoption with PMDK.
The next part of my presentation is about adapting Redis to Persistent Memory. When we think about Persistent Memory and how it can be used in Redis, the first idea that shows up is let's make Redis data persistent, but this application actually uses its own specific mechanism for delivering persistency, which is based on storing the data set on disk, which is done by creating compressed files or write [00:37:12]. What should we do with this mechanism when we’re making Redis persistent? Do we disable these native persistency options and replace them with persistency provided by the memory? Actually, there's no simple answer to such a question. Both RDB and AOF are optional and can be disabled, even at runtime. So, our solution should be just as flexible. What's more, the RDB mechanism is the same one used during replication, when the application is doing the initial full synchronization. If we decide to leave the native persistency features in place, then in many approaches that we considered, we are facing a problem with the CopyOnWrite effect due to its specific implementation on file system DAX. CopyOnWrite is frequently used when Redis is forked, when it comes to RDB creation or AOF [00:38:20].
So, our first approach was to make Redis persistent with PMDK and the libpmemobj library. We could use atomic allocations and the transactional API from this library and allocate keys and values from PMEM. The main Redis hash table, which is used to look up keys, can be stored in DRAM and can be rebuilt on application start-up by iterating over all the allocated objects. All of this can be easily done with the API from libpmemobj. With this solution, we are receiving the highest possible consistency level. Every key and value pair is guaranteed to be safely stored on [00:39:23], which is the same or even a higher level than original Redis with AOF enabled with the option fsync always.
So, as our data is already persistent, we disabled the snapshot and logging mechanisms. According to the Persistent Memory programming model, a memory pool is a memory-mapped file on file system DAX. When Redis enters a scenario that uses CopyOnWrite, it duplicates pages that are modified. If the pages that are going to be modified are mapped from file system DAX, then the duplicated pages are actually allocated by the kernel in DRAM. It means that our data ends up migrating from PMEM to DRAM, and this is not acceptable for us, so this mechanism would require some workarounds and would make our solution complex and more difficult to adopt.
On the other hand, disabling the native persistency mechanism limits current Redis functionality, like, for example, the possibility to use RDB files as compressed files for point-in-time backups. Also, giving up a fully volatile configuration is not an option, as this is also an important Redis feature, for example when Redis is used as a cache.
The second approach, that’s the second column, was when we experimented with the action API from the PMDK library. This API allows us to register a list of actions, like reserving memory, updating memory, or freeing, and then we can publish all the actions in one atomic step, so this behavior is similar to the RDB and AOF configuration options, where the user is able to decide which level of consistency they would like to keep in case of a crash. Also, we got quite interesting performance numbers with this prototype. On the other hand, CopyOnWrite still requires some complex workarounds.
The third idea is to give up the persistency of the device and focus on its huge capacity. Using AppDirect in volatile mode has a lot of benefits, including the fact that the application has access to both DRAM and PMEM. To easily utilize them, we used the Memkind library, which is a volatile memory allocator. We decided that user data, which is usually the bigger part, will be allocated from PMEM, while Redis metadata and some internal buffers will be in DRAM. We also did some complex workarounds for CopyOnWrite, and it was possible to support RDB and replication as implemented in Redis. However, in this scenario, the other problem was that there was no control over how much DRAM versus PMEM was actually used. Depending on the value sizes in the traffic, we got different DRAM and PMEM utilization ratios. So, many of our problems were solved when Linux kernel version 5.1 introduced a patch for exposing device memory as system RAM. A new memory kind was added to Memkind, KMEM DAX, and we decided that the placement of an allocation in Redis will depend on its size, bigger objects in PMEM and smaller ones in DRAM. KMEM DAX also natively supports CopyOnWrite, so no workarounds were needed, and all the Redis persistence and replication mechanisms are supported. It simplified the code, adoption, and also future support.
Memkind is a heap manager based on jemalloc. It allows you to allocate memory from various memory pools: DRAM, PMEM, and high-bandwidth memory. It introduces the idea of a memory kind that represents each of these devices. As for the question of why we wanted to use Memkind for both DRAM and PMEM, instead of leaving DRAM allocations as they are originally managed in Redis, the answer is that when data is spread across both memories, we need to know how to free each allocation. We would need to track each structure in the code, or, alternatively, we can rely on Memkind, which does it for us. It means that we can just call a common free function and Memkind will be able to identify how the allocation was done, whether it was in DRAM or in PMEM.
KMEM DAX is a new memory kind connected with how PMEM might be exposed by operating systems. When we configure Persistent Memory as system RAM, it becomes available to every application as normal memory. However, it can be differentiated as a separate NUMA node in the system with no CPU, only memory, and this has one more consequence. When the kernel is duplicating memory pages using CopyOnWrite, the new pages are actually allocated from the same NUMA node to which the application was originally memory bound, so it solves the problems that we were struggling with previously.
In the Redis version that we implemented, the user is able to configure the DRAM/PMEM memory utilization ratio, for example one part in DRAM and four parts in PMEM. The application gathers statistics from the allocator and periodically monitors them. It uses an internal dynamic threshold to decide which allocation goes where: smaller ones to DRAM and bigger ones to PMEM. The threshold is adjusted to keep the memory ratio at the level chosen by the user. As we don’t have to track the location of each allocation, the code becomes very simple and easy to maintain.
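The size-threshold policy can be sketched in a few lines. This is an illustration only and does not reproduce the actual algorithm in the Redis port: the `RatioPolicy` class, the starting threshold of 64 bytes, and the fixed adjustment step are all assumptions. It shows the feedback loop described above, in which placement is decided by comparing the allocation size to a threshold, and the threshold moves periodically to steer the observed PMEM-to-DRAM ratio toward the configured target.

```python
class RatioPolicy:
    """Place allocations by size and nudge the threshold toward a target ratio."""

    def __init__(self, target_pmem_per_dram=4.0, threshold=64, step=8):
        self.target = target_pmem_per_dram   # e.g. four parts PMEM per part DRAM
        self.threshold = threshold           # sizes at or above this go to PMEM
        self.step = step
        self.dram_bytes = 0
        self.pmem_bytes = 0

    def place(self, size):
        """Return which memory a new allocation of `size` bytes should use."""
        if size >= self.threshold:
            self.pmem_bytes += size
            return "pmem"
        self.dram_bytes += size
        return "dram"

    def adjust(self):
        """Periodic step: move the threshold based on the observed ratio."""
        if self.dram_bytes == 0:
            return
        ratio = self.pmem_bytes / self.dram_bytes
        if ratio < self.target:
            # Too little in PMEM: lower the threshold so more goes there.
            self.threshold = max(self.step, self.threshold - self.step)
        elif ratio > self.target:
            # Too much in PMEM: raise the threshold to keep more in DRAM.
            self.threshold += self.step

policy = RatioPolicy()
first = policy.place(16)       # small allocation
second = policy.place(1024)    # large allocation
policy.adjust()                # observed ratio is 64:1, above the 4:1 target
print(first, second, policy.threshold)
```

A real implementation would also account for frees and would read usage from allocator statistics rather than counting bytes itself.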
Here we can see it. In the zmalloc function, which is the Redis wrapper for every allocation, we just check one condition to decide how to allocate a specific structure. In both cases, we use the Memkind malloc, but with a different kind as the input parameter: here it is KMEM DAX for PMEM and MEMKIND_DEFAULT for DRAM.
In the zfree wrapper we can pass NULL as the kind, or, as here in this code, we can call detect_kind for a pointer and pass that kind to the allocator as an input. There is no functional difference whether we pass NULL or call detect_kind first.
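The idea behind these two wrappers, one allocator that serves both kinds and remembers which kind owns each pointer, can be modeled in a few lines. The sketch below is a toy stand-in written for illustration and does not call Memkind: `ToyHeap`, `PMEM_THRESHOLD`, and the integer handles used in place of pointers are all hypothetical.

```python
import itertools

PMEM_THRESHOLD = 64  # hypothetical size cutoff, in bytes


class ToyHeap:
    """Stand-in for a single allocator that serves two memory kinds."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self._owner = {}  # handle -> kind that served the allocation

    def malloc(self, kind, size):
        handle = next(self._next_handle)
        self._owner[handle] = kind
        return handle

    def detect_kind(self, handle):
        """Report which kind an existing allocation came from."""
        return self._owner[handle]

    def free(self, kind, handle):
        """Free an allocation; with kind=None the heap looks up the owner."""
        owner = self._owner.pop(handle)
        if kind is not None and kind != owner:
            raise ValueError("kind does not match the allocation")
        return owner


heap = ToyHeap()


def zmalloc(size):
    # One condition decides the kind; the allocation call is the same.
    kind = "PMEM" if size >= PMEM_THRESHOLD else "DRAM"
    return heap.malloc(kind, size)


def zfree(handle):
    # The caller does not need to know where the allocation lives.
    return heap.free(None, handle)
```

Because the heap keeps the ownership record, the application code never has to tag its own structures with a location, which is the simplification the speaker describes.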
So, as you see, allocations can be done very easily with Memkind. We also introduced some optimizations which improved performance. We observed that it is worth always keeping some structures in DRAM, even if they are big. Some of them are very frequently allocated and then freed. Some are very frequently accessed. One example is the client structure in Redis, which is responsible for storing incoming data, processing it, and sending a response. The optimization is very simple, as you can see here: the structure is just always allocated from DRAM. Also, once we know that the client structure is always in DRAM, we can improve performance by passing this information to Memkind when we free the structure.
So, our code is open source. It's available in the community repository under the name memKeyDB. All the implementations that I talked about are available in different branches under the PMEM repository. These examples show that PMDK gives a lot of possibilities for modifying applications. Even a single application can be approached in different ways.
So, thank you. I will now ask Usha to continue.
Thank you. Great presentation. Again, thank you Ginger, Piotr, and Kuba. We had a few questions come up during your talk. So, I'll try and ask a few for each topic. So, let me go to the Aerospike question. Ginger, this is for you. What happens if the machine crashes? Is Aerospike data protected against that?
Yes, that's a great question. So, the short answer is that if Aerospike crashes or the system goes down unexpectedly, Aerospike would treat the index for that data as not trusted. So, if that happened and the system came back up, you would have to go through and rebuild the index. That was a tradeoff that Aerospike considered. The other option was to flush key data on every single write, but the performance overhead was just not worth it. So, Aerospike's position is that unplanned crashes don't happen as often as planned shutdowns, so you do need to rebuild the index if there's an unplanned crash.
Great, thank you Ginger. A few questions for Kuba. How do we pool data into different slabs?
So, if I understand, this is a question about the Twitter Caching System?
Pelikan. And if I understand the question correctly, different slabs actually have different, let’s say, classes of sizes, so depending on what is the size of the item to be stored, then different slabs will be used. I'm not sure if I'm answering the question, because it looks more like a question about internals of Twitter’s implementation.
No, I think you're on the right path. Basically, I think the question is about how you decide what size of data goes into what kind of slab.
So, it’s by the size.
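As a rough illustration of choosing a slab by item size, the sketch below builds a list of size classes and picks the smallest class that fits an item. It is not Pelikan's implementation: the function names, the 64-byte minimum, the 4096-byte maximum, and the growth factor of 2 are assumptions made for the example.

```python
from bisect import bisect_left


def build_slab_classes(min_size=64, max_size=4096, growth=2.0):
    """Return ascending item sizes, one per slab class (hypothetical values)."""
    sizes = []
    size = min_size
    while size < max_size:
        sizes.append(int(size))
        size *= growth
    sizes.append(max_size)
    return sizes


def pick_slab_class(classes, item_size):
    """Return the index of the smallest class that fits, or None if too big."""
    idx = bisect_left(classes, item_size)
    return idx if idx < len(classes) else None
```

With these defaults, a 100-byte item maps to the 128-byte class, and an item larger than the biggest class is rejected rather than stored.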
OK, cool. The next question I have is, again, it's pertaining to Pelikan, in the hash table, what would happen if the table size grows beyond DRAM size? [00:53:03] Persistent Memory holder, [00:53:06] size or would it write to disk?
So, I expect that since the hash table in Pelikan is allocated using the standard [00:53:24] API, it really depends on the configuration of the system and the parameter called [00:53:37]. Generally, we shouldn't expect the allocator to refuse to allocate data from DRAM. We may only end up in a situation where the data is, for example, swapped to disk.
A few more questions on…
Are all these open source modifications already in the open source stream? Have they been upstreamed, or is this preliminary work that may or may not be adopted? I think it's more about whether it has already been upstreamed, for both Pelikan and Redis.
OK, for Pelikan, the modification is already upstreamed in Twitter's repository, so you can find it there. For Redis, the different prototypes that I described are available in the PMEM repository, under PMEM/Redis, where there are different branches for each of the implementations that I described. The implementation with KMEM DAX is developed in the community repository, memKeyDB. Right now, it is still under development, and we'll decide what is going to be done next with it after receiving some community feedback.
Thank you, Kuba. The other question to Ginger. Have any Aerospike customers deployed with their Persistent Memory? If so, what are their experiences?
Sure, sure. Yes, so Aerospike has a couple of customers that have gone through POCs already with Persistent Memory, and some that have put Persistent Memory into production. One that I mentioned was able to reduce their clusters from 20 nodes down to 10, and that's because they're able to fit more data on each node. Overall, the performance has been as we expected, within five percent of the implementation with DRAM, and we're hearing really good feedback from Aerospike.
Thank you, Ginger. I think that's what we had. There is one question just coming up. This is for you, Piotr. I don't know how much you are aware of this, but has Spark been optimized in any way to determine where to allocate memory, DRAM versus PMEM, to maximize performance?
So, that's a good question. So, as far as I know, right now, you have to pick one or the other, and especially with this solution. With VMEMCACHE, you have to configure Spark, or at least to use MDM [00:57:16] you can't create a multi-tier solution, but while this is not possible, as far as I know, with Spark, it is very much possible with the VMEMCACHE solution itself, because it allows you to stack one cache on another one.
OK, thank you. I think that's all the questions we had. So, thank you all for attending the webinar. We hope you found it useful. If you have any follow up questions, there are links to the Persistent Memory documentation, blogs, and also, we have a Google Group in this resource section that we're showing right now. But remember, the community is open to everyone and we look forward to seeing you there.
And so, with that, we'll wrap up the session. Thank you again and stay safe and healthy. Thank you.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.