Performance Guidelines for Intel® Optane™ Persistent Memory

In this webinar we explain how to write persistent-memory-optimized applications in a way that maximizes performance. We will also share techniques, used throughout the Persistent Memory Development Kit codebase, for achieving fail-safety without compromising performance.

  • Important performance characteristics of Intel® Optane™ DC Persistent Memory.
  • How to design your algorithms in a way that optimizes performance on persistent memory.
  • High performance techniques for achieving power-fail safety in your code.

Hello, everyone.

Thanks for joining us today.

Welcome to today's webinar entitled "Performance Guidelines for Intel Optane DC Persistent Memory." My name is Steve Scargall.

I'm a persistent memory software architect in the Intel Data Center Group and will be the moderator for this session. Our speaker today is Piotr Balcer. He's a lead software developer and architect of the Persistent Memory Development Kit.

Now, each quarter, the data center group at Intel presents persistent memory-related topics, and you can find links to previous webinars in this series on our software.intel.com/pmem page. We are recording this session, which will be available along with the slides and the resources in a week or so and you'll receive an email with a URL when it's available.

To avoid interruptions and background noise, we will mute all lines for the duration of this session. You can use the Q&A feature to ask questions throughout the session, and I'll put them forward to Piotr at the end of his presentation.

Now, Intel officially launched the Intel Optane Data Center Persistent Memory product in April this year. Server platforms based on Intel's second-generation Xeon Scalable processors, code-named Cascade Lake, and Optane Data Center Persistent Memory are now available in the market from your preferred OEM server supplier.

More recently, we announced a second generation of persistent memory product, code-named Barlow Pass, along with newer Cooper Lake and Ice Lake Xeon CPUs to support it.

So look out for more announcements in early 2020. And with that, I'll hand it over to you, Piotr, to begin today's webinar.

Thank you very much.

Thank you, Steve, for this wonderful introduction. So today, we are going to be talking about performance aspects of persistent memory programming. That includes the various instructions for flushing data from the CPU caches into the persistence domain; the transfer unit of Intel Optane DC Persistent Memory and how it affects bandwidth; the importance of avoiding cache misses; various approaches to achieving failure atomicity at a relatively low cost; how memory interleaving can impact performance; and finally, we will show how to correctly measure the performance of persistent memory applications. Throughout this presentation, we are going to be showing the impact of the suggested approaches using the example of a single-producer, single-consumer ring buffer.

This is a relatively simple data structure: a contiguous buffer that stores the entries, plus two variables that make it work, a read position and a write position. The read position indicates the element from which data is read, and the write position indicates the element to which data is written.

After the respective operation is finished, the corresponding position is incremented. We are going to show two implementations of this data structure later on as we learn about different performance optimizations.

With that out of the way, let's move on to the cache flush instruction, clflush.

It's the basic instruction that invalidates the cache-line and stalls the CPU because of the built-in fence. In the example shown on the slide, we can see two cache-lines being written to and then flushed out of the CPU cache and into the persistent domain.

The persistent variable in the data structure is used to indicate that the previous stores are persistent. In this example, there's no need to issue any extra ordering instructions because clflush itself is ordered. It's not the best choice for performance, but it gives us a baseline for optimizations. Now, let's see how the enqueue operation of the persistent ring buffer looks when using clflush. The full source code for this solution can be found under the link displayed on the slide. First, we load the two position variables using atomic operations to ensure thread safety. To ensure that the data is not only visible from this thread but is also persistent, we have to flush it. We can issue only one flush because the variables are located on the same cache-line.

Afterwards, we calculate the next write position by incrementing it.

I'm using a bitwise AND with the number of entries minus one, which is a faster equivalent of modulo for values that are a power of two.

If the next write position is equal to the read position, then the ring buffer is full, and we cannot proceed. But if it's not full, we can calculate the pointer for the entry we want to write to and then copy the data. Once that's done, we flush it.

Because clflush operates on a single cache-line, we have to create a loop to cover the entire entry. And finally, we write and flush the new write position.
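The full implementation is in the linked source; as a rough sketch of the approach (the names and sizes below are illustrative, not the actual example code), a clflush-based enqueue might look like this:

    #include <emmintrin.h>    /* _mm_clflush */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE  64
    #define ENTRY_SIZE 256    /* illustrative entry size */
    #define CAPACITY   1024   /* must be a power of two */

    struct ring_buffer {
        _Atomic uint64_t read_pos;   /* both positions share one cache-line */
        _Atomic uint64_t write_pos;
        char entries[CAPACITY][ENTRY_SIZE];
    };

    /* clflush covers a single cache-line, so larger ranges need a loop */
    static void flush_range(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE - 1);
        for (; p < (uintptr_t)addr + len; p += CACHELINE)
            _mm_clflush((const void *)p);
    }

    static int enqueue(struct ring_buffer *rb, const char *data)
    {
        uint64_t wpos = atomic_load(&rb->write_pos);
        uint64_t rpos = atomic_load(&rb->read_pos);

        /* one flush persists both positions, as they share a cache-line */
        _mm_clflush(&rb->read_pos);

        /* bitwise AND: a cheap modulo for power-of-two capacities */
        uint64_t next = (wpos + 1) & (CAPACITY - 1);
        if (next == rpos)
            return -1;                       /* ring buffer is full */

        memcpy(rb->entries[wpos], data, ENTRY_SIZE);
        flush_range(rb->entries[wpos], ENTRY_SIZE);

        /* clflush is ordered, so no explicit fence is needed here */
        atomic_store(&rb->write_pos, next);
        _mm_clflush(&rb->write_pos);
        return 0;
    }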

We ran a benchmark with one producer thread and one consumer thread, pushing through larger blocks of data. That operation took about nine seconds. This is going to be our baseline for comparison with an improved version of the same code.

But let's get back to the cache flush instructions. The next instruction is clflushopt. Opt stands for optimized.

This instruction is similar to clflush but does not have a built-in fence, allowing applications to take advantage of hardware parallelism. To ensure ordering, the sfence instruction has to be used explicitly, as shown in the example. The next instruction is clwb, cache-line write back. In addition to having the same properties as clflushopt, this instruction does not guarantee that the data is going to be evicted from the CPU cache. This means that in certain circumstances the data will be retained in the cache, so a read after a flush won't necessarily incur a guaranteed cache miss.
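As a small illustration using the compiler intrinsics (a sketch only; it assumes a compiler and build flags that enable clflushopt and clwb), the key difference from clflush is that a single explicit fence is issued after all of the flushes:

    #include <immintrin.h>   /* _mm_clflushopt, _mm_clwb, _mm_sfence */

    /* Flush two cache-lines with clflushopt: the two flushes can proceed in
     * parallel, and one explicit sfence at the end establishes ordering. */
    void flush_two_lines_clflushopt(void *line_a, void *line_b)
    {
        _mm_clflushopt(line_a);
        _mm_clflushopt(line_b);
        _mm_sfence();            /* wait for both flushes to complete */
    }

    /* clwb is used the same way, but the lines may stay in the cache, so a
     * later read of the same data does not have a guaranteed cache miss. */
    void flush_two_lines_clwb(void *line_a, void *line_b)
    {
        _mm_clwb(line_a);
        _mm_clwb(line_b);
        _mm_sfence();
    }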

This instruction is available on Cascade Lake CPUs, but its current implementation is identical to that of clflushopt.

This might change in future generations of Xeon Scalable. Using raw compiler intrinsics to properly implement the programming model is rather difficult. For this reason, we've developed the Persistent Memory Development Kit, and more specifically libpmem, which provides convenient functions that automatically detect the platform features and do the right thing depending on the situation.

The flushing primitives exposed by this library take care of the alignment of the buffers, so developers don't have to think about cache-lines anymore.
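For example, a minimal sketch of updating and persisting a value with libpmem might look like this (assuming the pointer comes from a pmem_map_file() mapping):

    #include <libpmem.h>
    #include <stdint.h>

    /* dst points into a mapping obtained from pmem_map_file() */
    void update_and_persist(uint64_t *dst, uint64_t value)
    {
        *dst = value;

        if (pmem_is_pmem(dst, sizeof(*dst))) {
            /* picks clwb, clflushopt or clflush based on the CPU, and
             * handles cache-line alignment of the range internally */
            pmem_persist(dst, sizeof(*dst));
        } else {
            /* fall back to msync() when the mapping is not real pmem */
            pmem_msync(dst, sizeof(*dst));
        }
    }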

In addition to flushing functions, libpmem also has persistent memory optimized implementations of memcpy and memset. Those functions, in addition to copying the data, also flush it when told to do so, and only when it's necessary. That was it about cache flush instructions. Let's now talk about bandwidth and the ECC block size. On this slide, I cite a paper entitled "Basic Performance Measurements of Intel Optane DC Persistent Memory." I encourage everyone that's interested to read this publication.

It's an excellent third-party evaluation of Intel's hardware. From this publication, we can conclude two things: the bandwidth is asymmetric, and it only saturates at a specific transfer size.

Writes are slower than reads, and the bandwidth peaks when using 256-byte transfer units. The reason the bandwidth can only get saturated at 256-byte transfers is that the hardware itself operates on 256-byte ECC blocks.

This means that applications that want to fully take advantage of the available bandwidth must operate on 256-byte blocks.

If that's not the case, the DIMMs have to first fetch the remaining part of the ECC block, modify or read the desired cache-lines, and, for writes, write back the entire block. The best approach to ensure full utilization of the available bandwidth is to take advantage of non-temporal stores.

Most applications take advantage of non-temporal stores by using libc memset/memcpy.

This happens transparently.

But it also means that the application itself has no control over if and when stores are actually non-temporal.

That's one of the reasons why we implemented the persistent memory optimized memcpy in libpmem. It allows the application to explicitly request that the operation use non-temporal instructions. Let's look at an example in which we want to modify 256 bytes on the persistent memory media. There are many ways of accomplishing that: we could use memset, we could do it by hand, or we can use pmem_memset.

As we discussed, the hardware operates on 256-byte blocks, so that's the transfer unit we should be aiming for. But we also have to be mindful of cache misses, which can happen on writes of less than a cache-line. Non-temporal stores help us accomplish both of those goals.
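A minimal sketch of the pmem_memset variant (assuming a pointer into a persistent memory mapping) could look like this:

    #include <libpmem.h>

    /* Zero one 256-byte block on persistent memory.  Requesting non-temporal
     * stores writes the full ECC block in one go, avoiding both the
     * read-modify-write inside the DIMM and a cache miss on the destination. */
    void clear_block(void *pmem_block)
    {
        pmem_memset(pmem_block, 0, 256, PMEM_F_MEM_NONTEMPORAL);
        /* without PMEM_F_MEM_NODRAIN or PMEM_F_MEM_NOFLUSH, pmem_memset
         * drains before returning, so the zeroes are persistent here */
    }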

Here is a slightly more complex example of code that wants to reliably store some data into persistent memory. It uses a persistent flag to indicate whether or not the data itself was written correctly. This example is functionally correct but suffers from several performance issues, such as a large number of potential cache misses and fences that could be avoided. Here's how it can be improved.

Just like in the previous slide, we want to reliably store some data into persistent memory. But instead of relying on a persistent flag to indicate whether the data is correct, we are going to be using a checksum to verify it. So first, we prepare our buffer on the stack with the data that we want.

We calculate the checksum and store it alongside the data using a single non-temporal copy. On recovery, we simply check whether the checksum is correct. The performance of this code is going to be better because it has no cache misses and requires only a single fence at the end of the memory copy.
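A sketch of this pattern, with an illustrative record layout and a simple stand-in checksum (the real code may use a different checksum and layout), might look like this:

    #include <libpmem.h>
    #include <stdint.h>
    #include <string.h>

    #define DATA_SIZE 248                 /* data + checksum fill 256 bytes */

    struct record {
        char     data[DATA_SIZE];
        uint64_t checksum;
    };

    /* simple stand-in checksum (FNV-1a); any strong checksum would do */
    static uint64_t checksum64(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* prepare the record in DRAM, then store it with one non-temporal copy:
     * no cache misses on the destination, one fence at the end of the copy */
    void write_record(struct record *pmem_dst, const char *payload)
    {
        struct record r;
        memcpy(r.data, payload, DATA_SIZE);
        r.checksum = checksum64(r.data, DATA_SIZE);

        pmem_memcpy(pmem_dst, &r, sizeof(r), PMEM_F_MEM_NONTEMPORAL);
    }

    /* on recovery, the record is valid only if the checksum matches */
    int record_is_valid(const struct record *pmem_src)
    {
        return checksum64(pmem_src->data, DATA_SIZE) == pmem_src->checksum;
    }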

Now, let's see how we can apply the things we just talked about to our ring buffer implementation. Again, the full source code is available at the link displayed on the slide. We streamlined the implementation, using the libpmem flushing primitives for the position variables, and we replaced libc memcpy with the non-temporal pmem_memcpy. The result is that the new implementation is over three times faster than the baseline. The last code example I'm going to show is about failure atomicity of a single cache-line. Current CPUs guarantee that stores placed in the same cache-line cannot be reordered with respect to each other, and we can take advantage of that when designing failure-atomic algorithms. In this example, we pack both the data and the persistent flag into the same cache-line, and then we simply store the flag last; that way we can rely on the persistent flag being set only if the data stores before it have made it out completely.
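A minimal sketch of that single cache-line technique (with an illustrative 64-byte layout, not the actual example code) might look like this:

    #include <libpmem.h>
    #include <stdint.h>
    #include <string.h>

    /* Data and validity flag packed into a single 64-byte cache-line.  Stores
     * within one cache-line cannot be reordered with respect to each other,
     * so if the flag ever becomes persistent, the data written before it in
     * the same line is persistent as well. */
    struct __attribute__((aligned(64))) flagged_entry {
        char     data[56];
        uint64_t valid;
    };

    void store_entry(struct flagged_entry *e, const char *payload)
    {
        memcpy(e->data, payload, sizeof(e->data));   /* write the data first */
        e->valid = 1;                                /* set the flag last */
        pmem_persist(e, sizeof(*e));                 /* one flush for the line */
    }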

The current generation of Intel Optane DC Persistent Memory can be configured with an interleave size of 4 kilobytes. Optimized applications can be made aware of that and avoid constantly writing to the same physical DIMM, improving overall throughput.

One example of such an optimization is taking one global variable, which all application threads write to, and splitting it into a set of distributed, padded variables.
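As a hypothetical sketch of that idea (the names and layout are illustrative; in practice the slots would live at interleave-aligned offsets inside the pmem mapping):

    #include <stdint.h>

    #define INTERLEAVE_SIZE 4096   /* interleave granularity of current DIMMs */
    #define NTHREADS 8

    /* Instead of one global counter that every thread updates (and that always
     * lands on the same physical DIMM), give each thread its own slot padded
     * out to the interleave size, so consecutive slots map to different DIMMs. */
    struct padded_counter {
        uint64_t value;
        char     pad[INTERLEAVE_SIZE - sizeof(uint64_t)];
    };

    static struct padded_counter counters[NTHREADS]
            __attribute__((aligned(INTERLEAVE_SIZE)));

    void bump(int thread_id)
    {
        counters[thread_id].value++;
    }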

We can see the impact of interleaving with our ring buffer benchmark.

If we set the entry size to 2 kilobytes as opposed to 4 kilobytes, we observe lower performance because it is now more likely that two threads will operate on the same physical DIMM.

That's it for optimization guidelines. Let's now talk about how to properly measure performance. When using App-Direct, make sure that the filesystem is mounted with the DAX option.

Forgetting it might silently make the filesystem fall back to using the page cache, causing performance fluctuations. Avoid benchmarking in shared or unstable environments. In PMDK, we always prefer to measure performance in an isolated, controlled environment.

For best reproducible results, disable hyper-threading and set a fixed clock-rate for the CPU.

This doesn't necessarily mean that you should set the clock-rate to the highest possible value, but you have to set it to a value that is stable for all cores so that there are no fluctuations whatsoever when running the benchmark. Make sure that all memory pages are allocated and faulted prior to performance measurements.

You can accomplish this by using the MAP_POPULATE mmap flag or by manually faulting in all the pages by touching them. This helps to avoid the fluctuations that page-fault handling in the kernel would otherwise add to the measured time.
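A sketch of prefaulting a DAX-mapped file before the timed section starts (both MAP_POPULATE and manual touching are shown; error handling is minimal) might look like this:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a file from a DAX-mounted filesystem and make sure every page is
     * faulted in before the benchmark's timed section starts. */
    void *map_prefaulted(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;

        /* MAP_POPULATE asks the kernel to fault the pages in up front */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, 0);
        close(fd);
        if (addr == MAP_FAILED)
            return NULL;

        /* alternatively (or additionally), touch every page by hand */
        volatile char *p = addr;
        for (size_t off = 0; off < len; off += 4096)
            p[off] = p[off];

        return addr;
    }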

You also have to remember to set the appropriate NUMA node on which to execute the application.

Ideally, the benchmark should be run on the same NUMA node to which the physical DIMMs are connected.

When benchmarking PMDK, make sure to use release libraries and that the prefault configuration option is enabled.

Let's now see how an improperly run benchmark can impact our results.

If we forget to pin the application to the NUMA node, we get massively lower results because the kernel is now free to schedule the application threads on any CPU it wants. If we forget to mount with DAX, the results actually improve. This is because the benchmark now uses the page cache in DRAM, and performance would only drop if we added the flushing and fencing that would be required for correctness, or once the workload exceeded the available DRAM capacity.

In summary, efficiently using Persistent Memory can be tricky.

But luckily, Persistent Memory Development Kit provides the necessary tools to fully utilize the hardware. Yeah.

Thanks, Piotr, for that detailed and very informative presentation.

So this time I'll open it up for questions. As a reminder, the phone lines will remain muted, so please ask your questions using the Q&A box, and I can ask those to Piotr.

I see that we had several questions asked during the presentation, which is great to see, and thank you very much.

So let's just start off here.

Regarding the three flushing operations, when using the PMDK libraries, do I need to choose and specify the flush operation, or will PMDK choose one for me?

So that's an excellent question.

In libpmem, we implement the logic to detect the features of the CPU and automatically choose which instructions will be used. So on most platforms, the best available instruction is picked for you automatically. Fantastic.

So did that answer the question?

Yes?

Yes, cool.

Question on the pmemobj prefault configuration option, can you explain or briefly explain how that works? So for example, if we open a pool with the prefault operation enabled, will the open return immediately and the prefault happens in the background, or will the open call return only once the entire memory mapped address space has been faulted?

So in libpmemobj, and this is the same for our other libraries, we wait for the prefaulting to finish because we mostly use it for benchmarking purposes.

So we set the variable, and we do want to wait before we start the time measurements. It cannot happen in the background. The implementation is fairly simple: we simply touch the individual pages and write to them so that subsequent accesses don't fault. OK, interesting.

So there was a follow-up question to that. So the prefault operation, is that serialized through a single thread, or is there some parallelism involved there?

It's single-threaded.

Well, we have to be careful here, because there is locking in the kernel around page faults. So even if we threw a couple of threads at the faulting, it would still be serialized, because every single fault gets serialized. OK, cool.

Thank you.

So next question is, so the code examples given in the presentation are mostly written in C, or all of them were in C, I think. So what other languages can we use for persistent memory programming?

So the lessons that we presented in this webinar are universal.

If you don't use PMDK, for example, then the same lessons apply to everything, regardless of the language. But PMDK itself is mostly implemented in C. For some libraries, we have bindings for higher-level languages like C++, Python, and so on. Yeah.

OK, fantastic.

OK, cool.

Thank you.

Next question, so are non-temporal stores efficient for small write operations, such as 256 bytes? Yes.

So the nice thing about non-temporal stores is that they allow us to avoid the cache miss. And this might seem unintuitive, because on DRAM, non-temporal stores can actually reduce performance for small stores.

With persistent memory, we found that if we can manage to avoid that cache miss by using non-temporal stores, the performance difference can be massive. For example, the implementation of transactions in libpmemobj was originally written using the bad-atomicity method that I described in one of the slides, and we then moved to the method I presented with good atomicity performance. So orders of magnitude-- well, maybe not orders of magnitude, but there was something like a more-than-5x performance improvement after we changed the algorithm from the naive one that caused many cache misses to the one that never had any cache miss, and that holds for small stores too, maybe even especially for small stores. OK, great.

That's cool.

So OK, fantastic.

There's a follow-on question then. So we just talked about non-temporal stores. What about non-temporal loads?

Are they suggested and recommended? So for loads, I'm assuming you mean copying from pmem, from persistent memory, into DRAM. And in that case, you are going to have a cache miss anyway, because you are reading the data.

So no, I'd say that you should just use libc memcpy to transfer data from pmem to DRAM.

OK.

Thank you.

Perfect.

There was a footnote on the clflushopt, clwb slide. So is clwb actually implemented, or is it a feature for a future CPU implementation?

That's a tricky question.

So yes, clwb is present in the CPU. Invoking that instruction doesn't cause an invalid opcode exception.

But that's about it.

The implementation is exactly the same as clflushopt's. As I said, this might change in future generations of the CPU. Right.

That's cool.

Thank you.

Question here, could you explain those no-drain and no-flush flags?

No-flush and no-drain flags.

Sure.

So memcpy, normal libc memcpy, has to comply with a certain definition, right? And if we are using non-temporal stores in memcpy, at the end of that memcpy we have to issue a fence to comply with that definition, because otherwise the memcpy wouldn't be coherent. The definition is that once memcpy finishes, other threads on the architecture need to be able to see the changes that we made in the memcpy. So we issue a fence so that the behavior between temporal stores and non-temporal stores is identical, because for normal applications it doesn't even matter whether temporal or non-temporal stores are used; it's mostly a performance decision. So that's why, in libc memcpy, if it decides to use non-temporal stores, it has to issue a fence at the end to order things and keep coherence. Now, in the case of the persistent memory aware pmem_memcpy, it's similar.

We want to retain the same semantics as memcpy has. So by default, if we use non-temporal stores, we issue a fence at the end.

Meaning that we wait for the stores to finish. But in some scenarios you might not need that fence, for example if you don't care about ordering between two different data stores.

So in the examples on the slides, the no-drain flag was mostly used when I wanted to store two separate variables using memcpy.

So I used one memcpy with the no-drain flag, then another memcpy with the no-drain flag, and then I called drain because I wanted to wait for the data to become persistent.

Then I modified the persistent flag and drained again, which means that I didn't care about the stores of the two memory copies being ordered with respect to each other.

That doesn't matter.

All I really cared about is that those two stores, those two memory copies, were finished before I set the persistent flag. I hope that answers the question. It's a fairly difficult topic that has to do with the underlying programming model.
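As a rough illustration of this pattern (hypothetical buffer names; it assumes libpmem's pmem_memcpy, pmem_drain, and pmem_persist):

    #include <libpmem.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Two buffers copied with PMEM_F_MEM_NODRAIN: the copies themselves don't
     * need to be ordered with respect to each other, so we skip the fence
     * after each one and issue a single pmem_drain() before setting the flag. */
    void store_two_then_flag(void *pmem_a, void *pmem_b, uint64_t *pmem_flag,
                             const void *src_a, const void *src_b, size_t len)
    {
        pmem_memcpy(pmem_a, src_a, len, PMEM_F_MEM_NODRAIN);
        pmem_memcpy(pmem_b, src_b, len, PMEM_F_MEM_NODRAIN);

        pmem_drain();                  /* wait for both copies to be persistent */

        *pmem_flag = 1;                /* only now mark the data as valid */
        pmem_persist(pmem_flag, sizeof(*pmem_flag));
    }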

So if that answer wasn't comprehensive enough, I suggest that you go to our website, and you can find much more material on this topic. Yeah, that was a great answer.

Thank you very much.

There's an open question here about how much effort may be needed to modify existing code for persistent memory.

It really depends on the application, right? So-- Yeah.

And what you want to do with it.

I guess I can start off the answer by saying we have a book available, Programming Persistent Memory: A Comprehensive Guide for Developers.

If you go to pmem.io/book, you can download the preview of many of the chapters today.

And we'll be releasing that book hopefully in December, so you'll be able to download the eBook PDF for free and purchase the printed copy if you like the paper version. But that's my thought on that.

PMDK makes things easier, right?

I mean, that's the whole point of the library. I don't know if you want to add anything to that, Piotr? Yeah, so modifying an application-- making any changes to existing software is difficult, right? And obviously introducing a completely new technology is going to be even more challenging. So that's why PMDK exists.

So if you see a fit between one of the PMDK components and your application, then use it, because we really tried to make it as simple as possible so that you don't have to care about the kind of stuff we discussed today.

Because those are very advanced things that really only the people developing PMDK have to care about. But in general, if you want to really take more advantage of the hardware and optimize every little thing, yeah, it's going to require a significant amount of effort. Yeah, and I agree, it really depends on the app and what features and functionality you need or want to implement with persistent memory, so. But we're happy to answer that question, I guess, in more detail if you want to hop on over. I think we have Slack channels for persistent memory. We also have Google Groups as well. You can find those available in the resources. We'd be happy to answer those questions online and provide more information to help you.

So sounds good.

More of a hardware related question on this one. So App Direct with interleave mode uses multiple DIMMs. Can you explain a little bit more about how the interleaving mode works? Yeah.

So interleaving is-- fundamentally, it's just striping across the hardware.

So at a set granularity, you eventually land on another DIMM. And from the perspective of the application, it's transparent.

In general, applications shouldn't have to care about interleaving-- obviously this is a little bit contrary to what I said in the webinar, but fundamentally, applications should be written in a generic way.

Obviously, if you are fully optimizing for this particular generation of hardware, then what really matters is that for every 4 kilobytes, you land on another DIMM.

And that alignment is physical.

Though I'm not sure if I can expand on that much more. Yeah.

I mean, I would just summarize it like you said.

I mean, the interleaving of persistent memory DIMMs is no different than the interleaving of DDR, so if you have six DIMMs on a socket, we will interleave the I/O across those six DIMMs. I believe, though I could be wrong here, the interleave stripe size is going to be about 1K.

So we'll write 1K to the first DIMM and then we'll move on to the second DIMM, et cetera, so. But that might change as generations come and go, so. So I hope that helps answer that particular question. I guess there's a few more here.

So a question about utilizing DRAM and NVM. The question was specifically around a DRAM cache for persistent memory.

So I can start by answering this one. There is. If you put the system into Memory Mode, we use the capacity of the DRAM as a cache between the CPU and the persistent memory, one that is not visible to the operating system. So the operating system sees the capacity of the persistent memory and does not see the DRAM capacity. But we use that DRAM intelligently as a cache.

In App Direct mode, which is really what Piotr's talk was about, you have access to DRAM and persistent memory. So it's entirely up to the application as to where it locates data and data structures. If they're really hot, they should probably be located in DRAM for faster access, and then you can tier off into persistent memory and your NVMe SSDs.

So hopefully that answers that question. There were a few questions saying that some of the links don't work, so we'll fix those and update the slide deck before we push out with a recording of this webinar as well.

So apologies for that.

I think that's all of the questions that I can see here, so.

There was one question about the status of pmemkv. So obviously somebody's looking at the wider project we have. So pmemkv for those that are not aware is the persistent memory key value store. It's not directly part of the PMDK bundle yet. It's a separate project maintained separately, but that is becoming available or is available today for download.

So you can use that.

It has multiple different storage engines that you can use, and it is all based off of PMDK. So it's a great example of how to use PMDK in production-quality code.

And that kind of wraps it up.

I think that's all the questions that I can see. Some of them were duplicates or similar to ones we've already answered, so I think we covered most of that. So again, thanks, everyone, for attending. We'll make the slides and presentation available for you in the next week or two.

And we will leave you to have a good rest of your day. Thanks again for attending.