Welcome, everyone. Thank you for taking the time to attend this webinar entitled Building Durable Storage Solutions with Intel Optane Persistent Memory. It's part of a series of webinars that we do relating to Persistent Memory, and this session is being recorded. The slides and the replay will be made available through BrightTALK in the next few days. So, as a courtesy to the speaker, all lines are muted for the duration of the session. You may ask questions using the Q&A box at any time throughout the webinar, and we'll answer them at the end of the main presentation. Hang around to the end, we'll give you a link to download the new Programming Persistent Memory eBook for free.
So, hello, my name is Steve Scargall. I'm a Persistent Memory Software Architect at Intel and the primary author of the book, and I'll be the moderator and emcee for this session. I'm joined by my colleague, Chet Douglas. He's a Principal Software Architect for Persistent Memory at Intel, and Chet is one of the main contributors to the book, specifically around remote Persistent Memory in Chapter 18, which is the focus of today's webinar.
Persistent Memory is a technology that has both memory and storage characteristics: it has low latency, nanosecond access times like DRAM, is byte addressable, also like DRAM, and has high-capacity, non-volatile storage capabilities. Intel first released Optane Persistent Memory along with the 2nd Generation Xeon Scalable (Cascade Lake) CPUs in April of 2019, and they're widely available from server OEMs and resellers, with full operating system support from Microsoft Windows and most Linux distributions. On June 18, Intel announced the 3rd Generation Xeon CPUs with 2nd Generation Optane Persistent Memory alongside many new and updated technologies. So, the Optane technology is being used today to accelerate data center and cloud hosted applications.
Applications can leverage Persistent Memory in many different ways. With up to three terabytes per CPU socket, not including the DRAM, applications can use Persistent Memory as volatile memory to displace or augment pricey DRAM. We can use it as a block storage device in high bandwidth, low latency, I/O intensive operations. And finally, applications can use memory mapped files to provide direct access to the in-memory data from user space, and this bypasses the kernel's page cache and avoids any interrupts and context switches for fast access.
As a byte addressable media, we can use Persistent Memory with remote Direct Memory Access, or RDMA, and developers can use libraries such as librpmem and librpma from the Persistent Memory Development Kit (PMDK) to add durability features to applications.
So, with that introduction, I'll hand over to you, Chet, to talk about the details of Persistent Memory with RDMA, and we'll come back at the end to answer any questions in the remaining time.
Thank you, Steve. So, yes, I'm Chet Douglas, and I've been working on Persistent Memory for a while and this is one of my areas of focus, and I'd like to cover a little bit about using Persistent Memory across a network, and I'll try to make this high level to just get your feet wet on the subject, and we'll follow this up with links to way more detail on the topic.
So, here's the agenda. So, I want to talk about the Persistent Memory Programming Model. This is a key part of how we utilize Persistent Memory, whether it's on a local system, or it's attached via network to other systems that have Persistent Memory as well, and a big key takeaway will be that the Persistent Memory Programming Model applies no matter how you're accessing the Persistent Memory. I'll talk a little bit about durable storage, what that means, and what it means in the context of using Persistent Memory on the network. Then I'll cover some of the networking basics to get our feet wet on how networks work in general, and on top of that, how we can implement the Persistent Memory Programming Model over a network. Then we'll get into the PMDK kit a little bit, and I'll go through a sequence of how we have implemented the Persistent Memory Programming Model over a network in our library. I'll also talk a little bit about Next-Generation Remote Persistent Memory, what's coming down the road to make this better and even more performant. And then I'll go through a quick summary, and we'll finish up with some resources to dive in further, and a Q&A session at the end, but you're welcome to send your questions at any time and our moderators will help answer them as we go.
So, the Persistent Memory Programming Model. The Storage Networking Industry Association, called SNIA for short, a crazy acronym, basically standardized what we call the Software Persistent Memory Programming Model, and it really comes down to two things. The first is that when you're writing data to Persistent Memory, in order for the data to be considered persistent, after you've written the data, software is responsible for flushing that data out of the CPU caches and into the Persistence Domain, and I'm going to show a sequence in a second that illustrates what I'm alluding to here.
The second part of the tenet is that once the write data has been flushed from the CPU caches, the platform is responsible for making sure that data is made persistent should power be lost, and so there's a platform component and a software component to implementing the Programming Model.
OK, so how does this look? So, here's my computer, and there are a number of ways we can accomplish this; I'll talk specifically about an Intel platform. My computer has a core with a couple of L1 caches, a level two cache, and a level three cache, and then a memory controller and my Persistent Memory at the bottom of the picture here. So, how are we going to implement this Programming Model? Basically, if you use a MOV instruction, which is the typical way to store data to a location, and you follow that with one of the following, either a CLWB, CLFLUSHOPT, CLFLUSH, or WBINVD (write-back invalidate), these will all perform the flushing aspect. Remember that the rule says if you write data, you have to flush it out of the caches. So, what do these do? CLWB will write back whatever data is in the cache, but it will also leave the contents in the cache. CLFLUSHOPT and CLFLUSH will both flush data out of the cache and invalidate the data in the cache. And then WBINVD is a big hammer, and you can use it in the kernel; if you use it, you will flush all of the caches. The first three instructions operate on a cache line, which is 64 bytes wide, and so that gets you the flush, and that forces the data into the Persistence Domain.
OK, so here's my picture. So, I've done one of these instructions to flush the moved data out of the L1, L2, and L3, and then here's the Persistence Domain. So, on an Intel platform, the Persistence Domain boundary is basically at the top edge of the memory controller, and so software is responsible for pushing all of the data you just wrote into the memory controller, and that's what we call the Persistence Domain.
Now, there's another way you can accomplish this: you can bypass the cache altogether by doing what we call a non-temporal store, or an NT store. What happens then is, by design, you bypass the cache altogether and go directly to the Persistence Domain. To tie this up, software is responsible for flushing data from CPU caches to the Persistence Domain, and the platform is responsible for making sure that the data you pushed into the Persistence Domain is written to the Persistent Memory should power be lost. The finish of my sequence here is that the platform is responsible for ensuring that if there's any pipeline in the hardware here, all of the write data that you committed to this Persistence Domain has been pushed out to the Persistent Memory.
This same model applies when accessing Persistent Memory over a network, and that's a key takeaway. If you learn nothing else from this talk, if you can take that away, that will be good, but basically, hey, this same model applies over a network. You either have to flush your data out of the caches and into the Persistence Domain, or you need to bypass the cache somehow, and how are you going to do that over a network, and that's really what this talk is about, is how do we apply this Persistent Memory Programming Model to a network connection?
Durable storage, so this is my setup. So, here's some more overall vocabulary and some concepts. When I talk about durable storage, I'm really talking about the idea that, certainly in cloud and a lot of enterprise use cases, your data is not really considered durable until you've written multiple copies to multiple different places. In traditional enterprise software, that's usually done with RAID, either hardware or software RAID, and RAID 1 mirroring is a good example, where you're writing the same data to multiple hard drives or multiple SSDs, and you're doing that so if one of those hard drives or SSDs fails, you have another duplicate copy. That's making your data durable. It's found in multiple places.
And then also replicating data to a hard drive or SSD on two or more systems using a network, that's also an example of durable storage because you’re placing multiple copies in multiple locations to allow you to recover should one of those copies fail.
Highly available durable storage I'm going to say is really the same thing as durable storage, except that you're making it highly available, which means multiple people can access the data from multiple places at the same time, and so that's what I mean by durable storage and highly available storage, and then we've got a little sequence here.
So, here's my enterprise durable storage system over here on the left. I have an application that's using a RAID 1 system, and basically you write your data to the RAID, and the RAID copies it to two disks, and then it gets its write responses back, and then the RAID sends the response back to the application and says, OK, your write is done, and so in this case we now have copies of your data on disk zero and disk one.
Now, here's my cloud durable storage network, and so in this case, I've got two systems, and there's going to be this cloud, this network blob in between that's going to connect these two systems together. So, in this case, the application writes this data to the Persistent Memory, gets its acknowledgment back, and then the application uses the network and writes a copy of that data to another system. In this case, I used Persistent Memory here, but this could be an SSD or a disk drive, it doesn't really matter, but the idea is I'm now using the network to accomplish the same thing. I now have a duplicate copy of my data someplace else and it is considered durable.
Now let's do the highly available one; it's the same picture again. The application writes its local data, replicates the data to the Persistent Memory on the other system, gets the acknowledgement back, and then the application could be retrieving data from the volume, from the replicated data, but App 1 over here can also be retrieving data from its copy, right? So, that's what I mean by highly available, and if you had a system or a network with hundreds of nodes in it, and your data is copied in lots of different locations, you can have applications reading from those different locations and you can load balance between them, and you can do a lot of cool things, but it's all reliant on using the network to make multiple copies of things, and then making those multiple copies available to different applications in different places.
So, that's my durable storage and highly available durable storage picture. So, let's move on.
Networking basics. So, this is the setup for talking about the network, and this is some of the vocabulary we're going to need to understand. So, when I talk about an initiator system, or an initiator, basically, in my discussion it is going to be the system that's writing the data to the target system, and the target system is the one that's receiving the write requests from the initiator system. So, that's just for vocabulary's sake, and you'll see it in my sequences too.
TCP/IP stands for Transmission Control Protocol/Internet Protocol. It's been around forever. The hardware is fairly inexpensive, and one reason is that we onload everything to the CPU. Basically, the networking protocol, moving the actual data back and forth, all of it requires software on both ends of the connection. TCP/IP networks are everywhere; they're ubiquitous, all over the world. TCP/IP and these cheap little network cards are great for general networking, but because you're onloading all of this work to the CPU, there's a performance penalty to be paid by both ends of the connection, and I'm going to show a demonstration of that in a minute.
DMA is what we call Direct Memory Access. This is basically when you're offloading the CPU from having to move data around; instead, you use an engine in dedicated hardware to actually move the data. So, Direct Memory Access is typically used on a system to move data from DRAM into your final storage, or to move data from your final storage into DRAM, to make it available for somebody to use.
RDMA just puts the word remote in front of that: with RDMA, you're offloading the CPU from moving the data across the network by using dedicated RDMA hardware, and it's typically used to move data from initiator DRAM to target DRAM, or from target DRAM to initiator DRAM, and I'll show you an example of that in a minute.
InfiniBand, iWARP, and RoCE, which is RDMA over Converged Ethernet—sorry for all the acronyms—these are the main three transports today that use RDMA on the network. A NIC is a Network Interface Controller, basically the hardware that's used with a TCP/IP network. Again, like I said, it's cheap, very simple hardware, and available everywhere.
An RNIC is an RDMA-capable NIC, used with RDMA hardware on an RDMA network. It's more expensive, but it's way more performant, because you're offloading all of this work from the CPU, and we'll talk about this as the backbone of the cloud. The cloud basically uses RDMA with RNICs to move all of the data around the cloud.
OK, here's my networking basics. So, here's my first picture. I have an initiator system on the left, a target system on the right, and a TCP/IP network, and these blue boxes are software pieces that have to run. So, here's my application. It writes data into DRAM, and then it fires off this TCP/IP request, and a bunch of software on this system runs, churns away, uses the hardware, interrupts the other side over here, and a bunch of software runs on that side as well, and you move all of this data into DRAM, and so you've used the CPU on both ends of the connection to get all this data into DRAM. But you're not done yet. It's not durable yet. So, then you basically have to wake up the storage stack and do a local DMA request, and finish the operation of moving the data from DRAM into your final storage, which is the SSD, and then you get a response that comes all the way back, and the application knows that its data has been moved over to that SSD on the other end. But look at all the software involved, and typically in a cloud, you might have 100 initiators talking to one target, or thousands perhaps, and if you have to invoke software for each one of these transfers, that's a huge burden on the target system.
So, now here's my RDMA connection. I get rid of these TCP/IP blue boxes because they're not done in software anymore; it's all done by this RNIC, which is a piece of hardware. So, my application writes to DRAM, it fires off an RDMA request, and the RNIC moves all the data on its own while the CPU is off doing something else. We still have to wake the system up over here, though, because the storage driver still has to run to push the data into the SSD, and then you get your RDMA response back, all the way back, but you can see how we've already eliminated a bunch of software, and this is a lot of software. There's a lot of kernel software to churn through to build these TCP/IP packets and then tear them down on the other end, and we get rid of all of that by using RDMA. So, that's a huge win for using RDMA.
I'm going to keep moving. So, for RDMA I want to introduce a couple of concepts. First is the RDMA Write: basically, you set up your resources ahead of time, you send this RDMA Write across, and the system on the other end is not interrupted. In fact, it has no clue that you've even done it. Remember, we talked about offloading all of this from software, so the CPU on the other end of the connection doesn't even have a clue that you've done this, which is really cool. You can optionally interrupt the target on the other end of the connection, but most likely you don't want to. And then, RDMA Writes can be delivered out of order. You can have writes that pass other writes, writes from other connections, everything can pass other writes, and so your software has to do something to take care of the fact that things can happen out of order.
The second concept is the RDMA Read, and it's a similar thing. You set up your resources ahead of time, so it's really fast, and there's no CPU involvement; the target system is not interrupted and has no idea that you're reading data out of hardware on its system. And RDMA Reads are always delivered in order relative to the writes. We're going to take advantage of this fact when we apply the Persistent Memory Programming Model to our RDMA connection.
So, how does that look? So, here's my network again, with my initiator and my target, and I already talked about this. We want to write the data to Persistent Memory, so we put the data into DRAM, then we tell the NIC, hey, go start this request, and it does an RDMA Write request and sends that across. Now, you'll notice that the NIC basically responds as soon as it gets it, and it says, yes, I've got it, I've sent it on, but this last step here of actually sending it on is not part of this acknowledgement, and I'm going to talk more about that in a second. That's what I point out here: this acknowledgement went back, if you notice the sequence, before the data was written to Persistent Memory. So, just like in the local case, where writing the data doesn't guarantee it's been made persistent, in the remote case we have the same problem. I've written the data, and I've even gotten an acknowledgment back, but that doesn't tell me the data has been made persistent.
So, this is one of our challenges, and that is, as the Programming Model states, the application needs to execute a flush to make sure the data has been flushed to the Persistence Domain. Well, how are we going to do that across the network? Well, I just introduced you to the RDMA Read, and the clue is in the RDMA Read. What we're going to do here is, after that write command, we're going to follow it with a small read request. So, here I tell the NIC to go do an RDMA Read, and it doesn't really matter what location you read from; what you're doing is forcing the NIC. So, here's the read that comes across, and the first thing it does is force any additional writes that were in the pipeline into Persistent Memory. This is just because of the ordering rules of an RDMA Read after an RDMA Write. It basically is a forcing function that forces all the previous writes out of the NIC and into the Persistence Domain.
And so, then the read goes. So, after you've pushed the write data, the read goes, and then you get your read response, and the RDMA Read response comes all the way back to the application, which now knows, oh, those writes have been made persistent. And you can do one read for a number of writes. You could send a whole bunch of writes over and follow them with one read, and because of the ordering rules of the hardware in the pipeline, it would force any write data out ahead of it.
In my deck here, I'm calling this the Appliance Remote Flushing Method. In the book, I think we call it something slightly different, the appliance persistence method, but for the sake of this conversation, we'll call this the appliance method, and I'll explain why in a second.
And here's the challenge: the caching implications. Those previous pictures, though I didn't tell you, basically assumed that there was no CPU cache on that target node. Remember the Programming Model, and the bit about software being responsible for flushing caches after writing the data. The flow on the previous page only works if the target node does not utilize the CPU caches for writes. So, well, why can't I just use the RDMA Read still? Well, if you had a cache in the picture, the RDMA Read could simply return data from the cache, and it wouldn't actually force data to be written out to the Persistence Domain. On an Intel platform, we have something called DDIO, or Data Direct I/O, and this is turned on by default, and what happens is that some amount of your RDMA Write data is placed directly into the target system's CPU cache. You can disable it to get the flow on the previous page, and in fact, we have customers that do that because it's highly performant, but what if it's not disabled? What if you do have a cache in the picture? How are you going to flush it if an RDMA Read isn't going to flush it? So, I'll introduce the RDMA Send command, and this will be the last command we really talk about for RDMA. With an RDMA Send, the data and the buffer resources are contained in the message. You don't have to set anything up ahead of time; you basically just send a blob of data across, and the applications on both ends of the connection decide what that blob means, and we're going to take advantage of this to do the flushing technique when the CPU cache is involved. Basically, what happens is the target system is interrupted when it receives the RDMA Send message, so we're going to take advantage of that, and it's delivered in order relative to writes and reads; it's treated just like an RDMA Read.
It has the same ordering semantics, so that means it's going to push those writes ahead of it just like the read did.
How are we going to take advantage of this RDMA Send? So, here's how this looks. So, here's my same system again, and I do my same writes, but now notice—oh, before I start, I should say, the target system now has the CPU cache blob in the middle here. We've now got to deal with this other piece of hardware that potentially has my data in it. It may not all be in my Persistent Memory; maybe some of it's stuck in the cache here. So, I put my write data into the DRAM, I sent my RDMA Write across, it goes into the CPU cache, and I get my completion back. Great. So, all I know at this point is that the data has made it to the other side, and it's somewhere in this pipeline between the NIC, the CPU cache, and the Persistent Memory. So, I have no indication yet that this data has been made persistent.
So, now here's how we're going to do the flush. So, I've introduced a piece of software over here that I call a daemon. It's like a system service that runs in the background. It's sitting there running, waiting for somebody to connect to it and waiting for something to do. So, now what I'm going to do is an RDMA Send. The RDMA Send comes across and gets to the daemon over here, and inside of that Send message I've put a list of locations that I want it to flush, and remember, the RDMA Send will automatically wake the daemon up. So, the daemon gets woken up, says, oh, I've got this list to flush, and it issues CLFLUSHOPT—remember, that's the optimized cache line flush instruction I talked about—for every cache line in that list, and that forces the data out of the cache and into the Persistent Memory. Remember, CLFLUSHOPT actually invalidates the cache; you could use CLWB instead and it would leave the data in the cache, but the most important point is that you're forcing the data into the Persistence Domain. And then, when it's done, the daemon has to send an RDMA Send command back to the initiator and say, OK, it's done. I flushed that list. All of the data that you sent me in that list can now be considered persistent and you can move on. And so you can see why this might be more painful than the previous one. With the appliance method, there was no daemon you had to wake up over here. It was all done in hardware, and it's really quick: you do a small read, you get that read back, and you're done. Here I have to do an RDMA Send and send it over. I've got to wake this daemon up. It's got to flush everything, wait until it all flushes, and then it's got to build an RDMA Send command and send it all the way back.
This method is much slower than the other one, and we call this the General Purpose Remote Flushing Method, at least in my slides here. The reason we call it the general purpose one is that typically, when you buy an Intel platform, DDIO is turned on by default, and so you have the CPU cache involved, and lots of people are not willing to turn off the DDIO feature. They want to keep DDIO on. If you can turn it off, then you can use the appliance method.
Let me move on. So, basically, here's my summary. We have the Appliance Remote Flushing Method: DDIO is off, you do RDMA Writes followed by a small read, and that small read will push the writes ahead of it and make sure they're flushed. And then we have the general purpose method, where you leave DDIO on, which is the default for an Intel platform, and you send RDMA Writes followed by a Send command with a list of addresses to flush. You wake up the target, it flushes the data, and then it sends you a command back to acknowledge to the application that all that data has been flushed and is now persistent.
There are performance implications. As I mentioned, the appliance method has 50 to 100% lower latency than the general purpose method, because you only have to do an RDMA Read. In the general purpose method, I have to do a Send, and another Send in the other direction, so it's twice as slow. To use this, you have to determine whether DDIO mode is turned on at the target, because the initiator has to know how to flush. Am I sending a small read, or am I sending a Send command with a list? And this is one of the complications of using this today: that information has to be passed to the initiator. It's got to know, before it sends its I/O, what mode the target is in, and this is one of the nuances and complications of this method. Ideally, it would be nice if the initiator didn't have to have any clue as to what mode the target machine was in, and it could just send one command and it would flush, no matter how it worked. I'll talk about that here in a minute when I talk about what's coming in the future.
But this is my wrap-up on the methods that Intel basically supports architecturally, and the methods we validate and the methods that we do our performance analysis on.
Let's jump into PMDK real quick. So, PMDK is the Persistent Memory Development Kit, and basically, it's a collection of libraries that make programming with Persistent Memory easier. The high-level goal is to make life easier for programmers, to let them utilize Persistent Memory as easily as possible, and the libraries handle some of the nuances and subtleties for you. And so PMDK has support for what I just went over, RDMA with Persistent Memory, implementing the Persistent Memory Programming Model over a network. Our PMDK solution implements both the appliance and the General Purpose Flushing Method. librpmem basically runs on the initiator node, and I'll show you a picture in a minute of where it sits inside of the library architecture, but it implements a simple synchronous replication model. What happens is you write your data to your local memory, and then when your application calls pmem_persist on the local node, behind the scenes this library will say, oh, this is going to be written to this other node too, and I'm going to make a replica of that data before I return and say that we've persisted your data, and I'll show you a sequence in a minute.
rpmemd is a daemon, a service that runs on every target system that you're replicating data to, and librpmem and rpmemd communicate with each other using a secure socket connection. This could be over an RDMA connection, and there are a number of ways you could have a secure socket, but essentially, these two components need to exchange buffer resource information with each other. Remember, you're setting up all the buffers and resources you need so the RDMA connection is all set up ahead of time, and I didn't talk at all about how that's done. It's not really important for this discussion; just realize that this is part of what's exchanged between these two applications, and it's outside of the RDMA spec as to how you do this. But these two apps communicate with each other and share this knowledge, and one of the things we share is that each of the targets tells librpmem, hey, I've got DDIO on, or I've got DDIO off, and then librpmem is going to decide how it's going to flush the data based on that knowledge. And again, you're obtaining all of this knowledge beforehand so that all of this is out of the I/O path. The I/O path is purely a performance path, and all of the exchanging of information to set everything up is done ahead of time.
The way we use RDMA today is that we don't code all of the RDMA ourselves. There are a number of OpenFabrics Alliance libraries in Linux that implement what we call libibverbs and libfabric, and these are two libraries that basically serve as a generic mechanism for working with RDMA that abstracts away the details of whether you're using RoCE versus iWARP versus InfiniBand for your underlying connection. You hide those subtleties from the applications on top, and so we utilize a number of open source libraries to make working with the network easier, and that's very typical, certainly in the Linux environment, to use these libraries.
Also, we take advantage of other PMem technology. There are the poolsets that allow an administrator to set up the files and the buffer locations, and we utilize those here in our solution as well. Behind the scenes, the administrators set up this network of PMem poolsets on each of these systems, and that's what PMDK is going to utilize for moving data around between the connections.
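For illustration, a poolset file with a remote replica might look like the following. The sizes, paths, hostname, and remote poolset name are hypothetical; check the PMDK poolset documentation for the exact syntax your version supports:

```
PMEMPOOLSET
16G /mnt/pmem0/mypool.part0

REPLICA user@example-target remote-pool.set
```

The application opens the local poolset as usual; the `REPLICA` line is what tells PMDK to push persisted updates to the remote Persistent Memory pool on the target system.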
We're almost done. So, PMDK is basically what we implemented. This is our implementation. It is a very basic synchronous copy. We have one active initiator that has an application writing data to a local Persistent Memory DIMM, and then behind the scenes PMDK replicates that data to one or more passive target systems. What I mean by passive is that you're writing the data to these systems, but unlike the highly available picture that I showed before, we're not actually reading from these other systems. They're purely replicating the data, so it's not a highly performant cloud type of application where you would be writing to lots of different nodes and accessing the data asynchronously from lots of different nodes. Our PMDK model is a very, very basic synchronous copy mechanism.
The performance of it is heavily dependent on the application that sits on top of it. So, on top of the PMDK library is an application that's actually issuing these local Persistent Memory writes, and if you send lots of tiny little writes, we're replicating each one of those across the RDMA network, so you'll get horrible performance if you send really tiny writes and lots of them, and we have to keep flushing each one of those across the network. So, it has limitations, but it's a technology demonstrator. It provides sample code on how to implement the flush mechanisms, and I will say that architecturally, it's a sound, proven architecture, and we have cloud customers shipping cloud solutions today that make use of this technology. They may not be using PMDK, but they're using the technology that PMDK demonstrates.
So, here's my sequence. Here's my initiator system, with my NVDIMM over here—actually, that should be PMem—and my RNIC here, and on top of that, I've got libibverbs and libfabric, those OFA, OpenFabrics Alliance, libraries that make working with RDMA a lot easier. They handle some of the low-level work of setting up the connections and managing the resources for the connections, and make our life easier.
On top of that is the librpmem that I talked about. This is the component that's going to replicate whatever data comes down from the app through libpmemobj. Typically you would be going down this path on the left, through libpmem, and writing data to local Persistent Memory. We now introduce this path on the right, where we also send traffic down librpmem, and it uses the network to do the replication to the other side. The other side consists of my target system with an RNIC and Persistent Memory, and I have verbs and fabrics on top of that, and then I have this daemon, this service that runs in the background and silently sits over here waiting for somebody to connect to it on the secure socket. When librpmem connects to it, they exchange resource information and set up the connection. This target says, hey, I've got DDIO turned off, and so librpmem over here says, OK, cool, I can use the RDMA Read to do my flushing. So, here the application sends its write data to the local Persistent Memory, or NVDIMM, it doesn't really matter, and then it does a pmem_persist call, like you normally would, to make the data persistent on the local machine. Libpmem gets involved, figures out which flushing mechanism to use, even on the local machine, which instruction is the best, sends the flush, and now you've flushed that previously written data to the local Persistent Memory.
At the same time—or as soon as that's done—libpmemobj also sends a request over to librpmem, which sends a request down to do the RDMA Write, and then the NIC hardware pulls the data out of the DIMM. And now I send the read. The library can go ahead and send the read; it doesn't have to wait for the write to complete. It just puts the read on the wire, because remember, the RNIC is going to process these in order, and so the read goes out. The NIC sends the write across, it goes into Persistent Memory, and then the RDMA Read goes to do the flush. The flush happens, the read response comes back, the library says, oh, the read's done, and then we tell the application, your pmem_persist is done. And at this point, the application…well, actually, it doesn't even know. It didn't have to make any code changes to utilize the RDMA. You just had to set up all these PMem poolsets and configure the system, but the application itself, all it did was write data to the local system and call pmem_persist like it always does, but in this case it took a lot longer, because behind the scenes we did an RDMA Write and an RDMA Read, and we had to wait for that read to come back before pmem_persist could complete. It's a limited implementation. It's very simple, what we did, but it allows you to see how these flushing mechanisms work, and it can be highly performant. If you send large writes down, you can get very good performance out of this, but if you send tiny writes down, then the performance starts to go downhill.
But remember, this is a very simple initial solution to get people interested in what's going on here.
I have one slide on futures, and really what's coming is what we call the RDMA Memory Placement Extensions. Everything I showed you on the previous pages you can do with today's RDMA hardware. You don't need anything special to utilize Persistent Memory, but for the future, we've made changes to the networking protocols. The standards have actually been changed to now support an RDMA Atomic Write, which provides an eight-byte atomicity guarantee from the remote NIC. We use it for writing pointer updates, usually in the cloud—typically, or maybe not typically, but one way to use it is you update data and then you send a pointer update, and software on the other end is waiting for this pointer to be updated. This is a lockless way of synchronizing data, and the RDMA Atomic Write guarantees that the NIC will write the full eight bytes of data to persistent or volatile memory. This is something folks have been asking for, for a long time, for use with volatile or Persistent Memory. Then we now have an RDMA Flush, which flushes all the previous writes. Unlike an RDMA Read or an RDMA Send, you send this new command instead, and the RNIC will understand it, and because of that, it can control the ordering and make sure that the flush command isn't executed until all the previous writes have been completed. It takes away some of the waiting, and that read that the application had to do is much more efficient now. And then there is RDMA Verify, which allows you to verify data that you've previously written without having to move it back across the network. The NIC on the other end verifies all the data and sends you a response back saying whether the verify was successful or not.
So, this is cool stuff that's coming down the road that makes this even more performant.
Here's my summary. RDMA is a critical underpinning of today's cloud architecture. I mentioned that a little bit, but really, TCP/IP is everywhere, it's inexpensive, it's very easy to use, easy to set up, and all of the operating systems have complete support for it, but it requires software to run, and so if you have a lot of connections, like in the cloud where you may have a hundred or a thousand connections to a single server, having all that TCP/IP software running on the other end is really going to slow things down.
Persistent Memory can outperform an SSD when accessed over a network. I showed you that picture where the NVMe stack had to be woken up and had to do a DMA operation to move the data from DRAM into the SSD. With Persistent Memory, you can do it all in hardware. You just write directly to the Persistent Memory and it's persistent. You don't have to write it to DRAM first and then move it to the SSD, and so it can be significantly faster than accessing an SSD over a network.
The Programming Model still applies. When you're accessing Persistent Memory over a network, everyone should realize that the same Programming Model that the PMDK library adheres to—a standardized model that's out there now and will be used going forward for Persistent Memory—has to be adhered to when you're using Persistent Memory over a network. That's probably the most important thing to remember from this.
And PMDK implements the Persistent Memory Programming Model with a simple synchronous data replication model, where we write data to multiple remote nodes, and if a system crashes, when the application restarts, it will resynchronize all the nodes again. So, we have some basic fundamental synchronization and replication built into PMDK, but, like I said, it's a simplified model to get started.
So, that's my presentation. Hopefully, you found it interesting and it wasn't too complicated. I know there's a lot of buzzwords I threw out there, but the Programming Persistent Memory book has lots of details, including a detailed chapter 18 on the RDMA networking parts of PMDK, and it covers in detail what I just went over at a very high level.
So, here's the resources—
—yeah, go ahead.
I was going to say great talk there, Chet. I think you really got the message across and, as you say, there's plenty of information out there with the resources we have listed here on the slides plus the book, and we've got Google Groups and documentation and Slack channels and so on, so everybody's able to join the community and ask really good questions. Yes, a great presentation.
We did have some excellent questions actually come in, so we've got about maybe 10, 15 minutes, if we can go through those real quick, Chet. We'll try and get through as many as we can. Yes.
Go for it.
So, I'll go through them in order because some of them tend to lead onto each other, which is nice. So, the first question we got was, “I noticed that the PMDK RDMA support is currently marked as experimental. Can you explain a little bit more about what the experimental tag means?”
Yes, and, in fact, Steve, you may be able to help me out here, but I'll take a stab at it. The reason it's marked experimental in our current solution really goes to what we implemented, as I said. It's a very simple synchronous replication model. In a way, it was really a technology demonstrator for ourselves as much as anything, as we learned how to use Persistent Memory. It was our first attempt at it, and, in general, we tend to leave it experimental. We don't really have a major customer using it, or a really compelling use case that's driving customers to use our PMDK solution. I think that's the main reason it's marked experimental today. I didn't really mention that we have a new library being implemented right now that's a much simpler, thinner layer that I think will be more palatable to more application developers, so it's going to get easier to use, and we will certainly remove the experimental label from that new library.
Yes. No, I mean, just to echo that, it’s really just customer demand, and the feedback we get is it is a technology demonstration, as you mentioned, and we're happy to receive feedback. It's an open source project. If you have improvements or you want to take it and implement something different, feel free. It definitely seems to be working for the customers that are currently using it today, but yes.
Yes, exactly. It's very performant. If you do it right and take advantage of this technology—like a cloud vendor who really knows what their application is doing—you can get amazing performance, and we have customers that are doing it.
Yes, very much so. Yes, there's quite a few customers out there. Not all of them tell you that they're using PMDK, but most of them are using it to some degree or another. So, yes, that's great.
So, actually, I'm going to jump ahead because there was a good question along these lines that I think I can answer. The question was: can apps use libpmem with rpmem directly, or do they have to use libpmemobj to access rpmem? The way our current implementation works is that librpmem, as shown on this diagram, uses the poolset implementation, which is the ability to use one or more memory mapped files, whether local or remote, and we use this remote concept to take any updates that are given to us from libpmemobj into our local pool, and then push those over the wire to the remote Persistent Memory pool. So, currently, librpmem does require libpmemobj, but like Chet said, the new librpma library does not. It's a very low-level library that anybody can use to issue those sends and receives that we talked about.
So, hopefully that answers that question. So, I mean, we covered this as well, Chet, but a couple of questions relating to performance and latency and that type of stuff.
Let's see. Early estimates on performance improvements with new RDMA extension. So, is that the first one or…?
It's more to do with the performance and latency of Persistent Memory and RDMA in general.
Yes. OK, so I can tell you that on a 100-gig RDMA network, the time to do that small RDMA Read adds around two microseconds or so, in that ballpark, and on a 40-gig network, it was in the four to six microsecond ballpark. So, that's not the speed of the actual Persistent Memory itself, but the additional latency over the network of doing that small read to flush your data adds two to six microseconds, depending on your network, and if you use the RDMA Send method, the general purpose method, it doubles that. But you can easily achieve whatever the back-end Persistent Memory write speed is—you can easily saturate the Persistent Memory over the network. There's no problem in doing that.
Yes, and it depends on the number of hops and where geographically your initiator and your target are. If they're halfway across the world, of course you'll have more latency, but if they're relatively local, performance can be pretty good compared to current technologies.
So, just going back to some of the early slides there where we talked about the CLWBs and CLFLUSHOPTs and stuff, we had on there the plus SFENCE, and I don't think we covered what the SFENCE actually meant.
Oh, yes, I did skip over that. Basically, it's another instruction you need to send. After you flush your data, you need to put a fence in place that tells the rest of the hardware, the underlying fabric underneath, that other reads and writes can't pass this point and somehow break your ordering. The fence is there to serialize other reads and writes and make sure they don't get in front of the flushes. So, yes, I didn't mention that, but it's another instruction, typically SFENCE, though there are several fence instructions you can use. That's part of the Programming Model as well.
Yes, and the PMDK libraries take care of this for you, which is one of the benefits of using the kit: all of this architectural information is embedded into the libraries, so they will correctly and semantically do the right operation, whether it needs an SFENCE or not, and optimally choose whichever machine instruction, whether it be a CLWB or a CLFLUSHOPT, depending on the CPU and the architecture you're running on. So, developers don't need to worry about architectural differences between platform A and platform B, or generation A and generation B.
And we continue that model with the RDMA. Since we have those details of how you flush all this stuff, we're doing the same thing: we're going to hide that from the application to make its life easier.
Yes, sounds good. So, the next question was, “Does the daemon—that's going to be rpmemd—have to run on the main CPU, or could an intelligent NIC flush through the cache into the PMem in your model?” So, it covers the diagram that you're showing here plus the futures model.
Yes, I guess…I mean, I think rpmemd would still be needed to set up the resources and the connection, and today, it's not part of the performance path, so I'm not sure that would really change. But a SmartNIC certainly could help. Like the RDMA Verify—that seems like a good thing a SmartNIC might handle for you. But I think rpmemd will always be needed, because there will always be a software overhead of setting up an RDMA connection that doesn't go away even in the future. All of the complications of setting the connection up will still be there.
The next one is, “If we have a mechanism to guarantee security, could that affect the performance a lot, or do we have a good way to improve the performance at this time?” So, I think this is more about the protocol: as we're sending data across the wire, is that data, or could that data be, encrypted—encrypted by the NIC, I guess, or maybe by some software?
Yes, I think certainly today you can put encryption on top of this. RDMA is just the tunnel, just a pipeline to move data back and forth across the network. You certainly can have software or hardware do it. I don't know if there are RNICs today that encrypt data across the wire, but you certainly could architect it that way. You could have a piece of software or a piece of hardware on each end of the connection to encrypt and decrypt the data, perhaps. So, it's not part of RDMA; it's something you would implement on top of RDMA.
Yes, kind of like TCP, really. You're just using it as the mechanism to get from point A to point B, and what you're sending is really part of that payload, so it's up to the app and libraries to encrypt it. Sounds good.
Next question was, “How much is the performance gain from RDMA Send versus the RDMA Flush?” I think that's the read after write operation.
Yes, so I think I already mentioned that—I'm sorry, an RDMA Read instead of an RDMA Send was twice as performant, and RDMA Flush is probably of similar performance to an RDMA Read, but the beauty of the flush is that you give that command to the NIC and it allows you to pipeline these operations. I didn't really go into the detail of that—that would be a follow-up presentation, or it's in the book as well—but basically, there are other aspects of flushing that you have to take care of today that the flush command handles for you.
Yes. So, the next question is a little interesting, but I'll read it more or less verbatim because I think I know the intent: “the persistence seems to be deliberately not permanent, so does Persistent Memory have a use, or does permanent memory have a use?” So, I think basically what they're asking is, why do we have to do the read after write with Persistent Memory today versus the futures, and that's really just historical, right, Chet? Persistent Memory is brand new to the market, RDMA is not, and we're working now to update all the NICs and the protocols and everything else to make it Persistent Memory aware.
Exactly, and it takes time. The networks move slowly because they're conservative, which you want to be, and don't introduce too many new technologies at once. So, yes, that's just it. Persistent Memory is already available, so what tools do I have today with an existing RNIC and an existing RDMA implementation that I can use? And that's what we used: an RDMA Read or an RDMA Send. RDMA Flush is the future, and it will make things better, but today, we had to take advantage of what was already available on the network.
Yes, like you say, it's just the evolution of technology. So, that's good. We've only got two minutes left, so I think we'll call it there so we don't go over time. Again, thank you, Chet, for your time and for putting this all together—a lot of work, and very informative. So, thank you to all the attendees of both this live session and anybody watching the replay in the future, and again, we hope you found it informative.
If you have any questions, follow-up questions after this session, there are plenty of links in the documentation. Like I say, we've got some blogs, we've got the book obviously, we have Google Groups, an open forum, we've got Slack channels. So, come join the Persistent Memory community, learn from others, ask questions, help others. We definitely welcome feedback on the open source projects we have out there. So, yes, looking forward to some very, very good discussions with some of you.
I think with that, we'll wrap it up today, and thanks again, Chet, and everybody else and have a good rest of the day.
Thank you, guys. Everyone have a good day. Thanks.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804