Error Recovery in Persistent Memory Applications

By Eduardo Berrocal Garcia De Carellan

Published: 04/01/2020   Last Updated: 04/01/2020

Introduction

In this article, I revisit the topics of recovery and fault tolerance that were discussed in Recovery and Fault-Tolerance for Persistent Memory Pools Using Persistent Memory Development Kit (PMDK). In that article, I introduced pmempool, a tool available in the Persistent Memory Development Kit (PMDK) to prevent, diagnose, and recover from unplanned data corruption caused by hardware issues. The article covered pmempool check, pmempool sync, and error injection in Linux* using sysfs.

This article is the result of new information made public, as well as the introduction of new features in PMDK since the last article was published. In particular, the topic of memory errors is revisited, this time using standard and stable tools such as ndctl, daxio, and fallocate for injection, detection, and repair, instead of relying on sysfs. This article also covers the issue of unsafe shutdowns and how to handle them from the application's point of view.

 

Memory Errors for Mapped Memory

When a running application encounters a memory error in a memory device (persistent or not), one of two things can happen: either we get a recoverable machine check, or an unrecoverable one. Although the typical case is a recoverable machine check, there are some rare circumstances where it is unrecoverable and can cause a system crash. Fortunately, the number of unrecoverable cases decreases with every new generation of CPUs. If we get a recoverable machine check, the OS sends a SIGBUS signal (bus error) to the application, killing it.

Catching a SIGBUS is possible, and some advanced applications do it in order to keep running without crashing. Handling SIGBUS may also be useful if you want your application to write a final fatal message or flush something before dying (there is, however, a risk of running into another SIGBUS while handling the first one). For most applications, though, catching SIGBUS is too error-prone to be useful. It is very hard, and sometimes not even possible, to handle this case correctly. The recommended course of action for persistent memory applications is to die and handle the error on restart.

However, for persistent memory, simply restarting your application won't work. In the case of volatile memory, the OS removes the affected physical pages from the pool of available ones to make sure that the same memory does not get allocated again. It can do that because volatile memory can be considered empty, meaning its content has no meaning before it is allocated. This is not true in the case of persistent memory. Even though the OS is involved during the memory-mapping phase, it is the application that tells the OS what pages to map. These pages usually belong to a pre-allocated file that has meaningful data [1]. If an application that was killed due to a memory error in a persistent memory device gets executed again, it will encounter the same error over and over while accessing the same portion of data, unless something is done about it.

 

Discovering Bad Blocks

If your persistent memory application crashes due to a memory error, you will see the following:

Bus error (core dumped)

After that happens, the affected blocks will be marked as “bad”. If your application is using a pool created with PMDK, you can check them with pmempool info:

# pmempool info --bad-blocks=yes /mnt/pmem1/pmem-file
Part file:
path                     : /mnt/pmem1/pmem-file
type                     : regular file
size                     : 16777216
bad blocks:
        offset          length
        96              8
...

Note: For the devices corresponding to the namespaces created with ndctl, blocks are always 512 bytes long.

In this case, we can see a range of 8 bad blocks starting at offset 96; the bad blocks section lists one affected range (offset and length) per line.

The same goes for a pool located in a devdax device:

# pmempool info --bad-blocks=yes /dev/dax0.0
Part file:
path                     : /dev/dax0.0
type                     : device dax
size                     : 811746721792
alignment                : 2097152
bad blocks:
        offset          length
        96              1
...

In this case, only one bad block is listed at offset 96 [2].

If your application is not using PMDK pools at all, you will need to query bad blocks relative to the namespace, and then figure out how those offsets translate to relative offsets within your file. Since with devdax there is no file system at all, the relative and global offsets match. For example, for the devdax case above, the bad block offset relative to the namespace is also 96 (namespace0.0 corresponds to device /dev/dax0.0):

# ndctl list --media-errors
[
  ...
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"mem",
    ...
    "badblock_count":1,
    "badblocks":[
      {
        "offset":96,
        "length":1,
        "dimms":[
          "nmem2"
        ]
      }
    ]
  },
...
]

With fsdax, the offsets differ (namespace1.0 corresponds to device /dev/pmem1, mounted at /mnt/pmem1):

# ndctl list --media-errors
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "map":"mem",
    ...
    "badblock_count":1,
    "badblocks":[
      {
        "offset":295008,
        "length":1,
        "dimms":[
          "nmem8"
        ]
      }
    ]
  },
...
]

To figure out the relative offset to the file, you can use filefrag:

# filefrag -v -b512 /mnt/pmem1/pmem-file
Filesystem type is: ef53
File size of /mnt/pmem1/pmem-file is 16777216 (32768 blocks of 512 bytes)
ext:     logical_offset:        physical_offset: length:   expected: flags:
  0:        0..   16383:     294912..    311295:  16384:
  1:    16384..   32767:     311296..    327679:  16384:             last,unwritten,eof
/mnt/pmem1/pmem-file: 1 extent found

We pass -b512 for a block size of 512 bytes, and -v for verbose. We can see that the file has two extents of 16384 blocks each. The first extent corresponds to the logical (relative) offsets 0-16383 and physical offsets 294912-311295. The physical offsets are what interest us here. If we subtract 294912 from 295008, we get exactly 96.

 

Fixing Bad Blocks

Cleaning bad blocks is done by finding healthy physical blocks within the device and remapping the bad offsets to those healthy blocks. The exact details of how this happens internally (a process called clearing the poison) are product specific and vary by vendor. From the application's point of view, the process involves writing to the affected offsets, replacing the lost data by restoring it from backups or some other redundant copy. One way to do this in fsdax mode is to simply overwrite the pool file with the backup, which can be accomplished with a regular file copy operation. If the pool is very large (e.g., terabytes), or if we are using devdax mode, a second option is to overwrite only the affected blocks.

The recommended logic for applications to overwrite (i.e., repair) only the affected blocks is the following:

  1. If a previous repair operation was interrupted, go to 3.
  2. Record persistently what ranges (offset + length) are under repair.
  3. Clear the poison (see next section for more details):
    • For a devdax device: Zero-initialize the affected blocks directly.
    • For a fsdax device (file system mounted with dax): Deallocate the affected blocks and allocate them again. This operation will zero-initialize the blocks.
  4. Write data from a backup copy.
  5. Delete persistent record indicating that a repair is in progress.

If restoring from a backup is not possible, the above logic skips step 4 and the repaired blocks will be all zeros. In that case, the application will need to repair any damaged data structures itself. If the application is using a pool created with PMDK, and the pool's metadata is corrupted, recovery may be possible with pmempool check. To check for metadata corruption, run:

# pmempool check -v /mnt/pmem1/poolfile
checking pool header
incorrect pool header
/mnt/pmem1/poolfile: not consistent

To attempt a recovery, pass the -r and -a options:

# pmempool check -v -r -a /mnt/pmem1/poolfile
...
/mnt/pmem1/poolfile: repaired

Find more information about pool repairing in the article Recovery and Fault-Tolerance for Persistent Memory Pools Using Persistent Memory Development Kit (PMDK).

Finally, another option available for pools created with PMDK is to use pmempool sync. This will only work if the application is using a pool set with replicas. For more information about poolsets, please read the man page for poolset.

To use a healthy replica to recover from bad blocks, run:

$ pmempool sync --bad-blocks ./poolset.file

Note: This feature is available only for pools created using libpmemobj.

 

Clearing the Poison

Devdax

We can clear the poison on a devdax device using either ndctl or daxio (included with PMDK). The latter offers a little bit more flexibility in terms of what data can be written to the device.

To use ndctl to zero-initialize the affected blocks in namespace0.0 (device /dev/dax0.0), run:

# ndctl clear-errors namespace0.0

The following will do the same using daxio, but only for one offset at a time:

# daxio --output=/dev/dax0.0 --zero --seek=96 --len=512
daxio: copied 512 bytes to device "/dev/dax0.0"

Similarly, you can use an input file with --input instead of zeroing the blocks with --zero. For more information, run daxio --help.

Note: Using dd (or write()) to clear blocks by writing directly to /dev/dax0.0 is not supported. Devdax has limited functionality; data can only be read and written through memory mapping.

 

Fsdax

To be able to write to bad blocks on a file system mounted with the dax option, we need to first deallocate those blocks (this process is also known as punching a hole), and then allocate new ones. For this we use fallocate.

Note: Writing to bad blocks in a file directly using dd—or write()—before calling fallocate first will produce an Input/Output error (EIO). This happens because the persistent memory aware file system checks for bad blocks and returns EIO before reads or writes are attempted against a bad block.

To deallocate/allocate at a particular offset, run:

# fallocate --punch-hole -o 49152 -l 512 --keep-size /mnt/pmem1/pmem-file
# fallocate -o 49152 -l 512 --keep-size /mnt/pmem1/pmem-file

For fallocate, offsets and lengths need to be specified in bytes. In the snippet above, 96 × 512 = 49152. The same can be done programmatically:

...
int fd = open(filename, O_RDWR);
/* deallocate the bad blocks (punch a hole), keeping the file size */
fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length);
/* allocate them again; the new blocks are zero-initialized */
fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, length);
close(fd);
...

After this, we can go ahead and write to the blocks using dd (or write()), for example to restore the corrupted blocks from a backup.

Note: You should not use dd (or write()) to clear blocks by writing directly to a device (e.g., /dev/pmem1) with a file system on it. Doing so has the potential to corrupt the file system.

If unmounting the filesystem first is possible, we can use ndctl as we did for the devdax case:

# ndctl clear-errors namespace1.0

 

Bad Block Injection

To test the robustness of your application against data corruption produced by bad blocks, you can use ndctl to inject them into a namespace:

# ndctl inject-error --block=295008 --count=1 namespace1.0

To know where to inject when targeting a file residing in a persistent memory aware file system, use filefrag as explained in the Discovering Bad Blocks section above. To check that the injection was successful, run the command with the --status option:

# ndctl inject-error --status namespace1.0
{
  "badblocks":[
    {
      "block":295008,
      "count":1
    }
  ]
}

After that, calling mmap() for the file will not fail. However, the OS will send a SIGBUS to your application if a read from an affected block is attempted.

Removing an injected error is possible. However, if the OS detected the problem before the un-inject was run, and the bad blocks are already accounted for in the system, this command will not make the OS forget about them; you will need to go through the process described above for fixing bad blocks:

# ndctl inject-error --uninject --block=295008 --count=1 namespace1.0
Warning: Un-injecting previously injected errors here will
not cause the kernel to 'forget' its badblock entries. Those
have to be cleared through the normal process of writing
the affected blocks

{
  "dev":"namespace1.0",
  "mode":"fsdax",
...

 

Unsafe Shutdowns

Flushing data out of the CPU caches does not imply that the data will be written all the way to persistent media. Before that, the data will reside, at least for some time, in the write queues of the memory controller (MC).

To avoid corruption, MC queues are protected with Asynchronous DRAM Refresh (ADR). In the case of Intel® Optane™ Persistent Memory modules, ADR ensures that, on a power failure, all the queued data in the MC is written to persistent media. ADR also puts DRAM into self-refresh, ensuring that data is backed up correctly in other persistent memory products such as non-volatile dual in-line memory modules (NVDIMMs).

Although extremely rare, ADR can fail. For example, consider a scenario where the heating, ventilation, and air conditioning (HVAC) system of the room where your server is located is down. Due to high temperatures, the CPUs are throttled, leaving ADR unable to finish flushing before its stored energy runs out. A failure in ADR produces an unsafe shutdown. Unfortunately, there is no way to know what is and what is not corrupted after an unsafe shutdown; there is no list of affected files or blocks. If an unsafe shutdown is detected, the recommended course of action is to delete the data and restore it from a backup, if possible.

 

Discovering Unsafe Shutdowns

We can check the health state of our modules with ndctl:

# ndctl list -DH
[
  {
    "dev":"nmem1",
    ...
    "health":{
      "health_state":"ok",
      "temperature_celsius":31.0,
      "controller_temperature_celsius":33.0,
      "spares_percentage":100,
      "alarm_temperature":false,
      "alarm_controller_temperature":false,
      "alarm_spares":false,
      "alarm_enabled_media_temperature":true,
      "temperature_threshold":82.0,
      "alarm_enabled_ctrl_temperature":true,
      "controller_temperature_threshold":98.0,
      "alarm_enabled_spares":true,
      "spares_threshold":50,
      "shutdown_state":"dirty",
      "shutdown_count":3
    }
  },
  {
    "dev":"nmem3",
...

For brevity, only one module is shown in the above snippet. We can see that the value of shutdown_state is dirty for module nmem1. Likewise, shutdown_count is 3. This counter is incremented for every new unsafe shutdown detected.

The recommended logic to detect unsafe shutdowns in persistent memory programs is the following (this is the same logic used in the PMDK libraries):

  1. On file creation: store as metadata the UUID of the namespace and shutdown count for every module in the namespace.
  2. On file opening:
    • Check UUID of namespace. If different, file was moved. Store new UUID and shutdown count for every module in the namespace.
    • If UUID is the same, check shutdown count. If higher, the previous shutdown should be considered unsafe.

This is how libpmemobj-cpp will fail on pool opening when an unsafe shutdown is detected:

# ./pmem-program /mnt/pmem1/pmem-file
terminate called after throwing an instance of 'pmem::pool_error'
  what():  Failed opening pool: an ADR failure was detected, the pool might be corrupted
Aborted (core dumped)

To change the shutdown_state from dirty to clean, a new safe shutdown of the system is needed.

 

Unsafe Shutdown Injection

To test the robustness of your application against unsafe shutdowns, you can use ndctl to inject them into any module:

# ndctl inject-smart nmem6 --unsafe-shutdown
[
  {
    "dev":"nmem6",
    "id":"8089-a2-1835-00002529",
    "handle":4097,
    "phys_id":50,
    "security":"disabled",
    "health":{
      "health_state":"ok",
      "temperature_celsius":30.0,
      "controller_temperature_celsius":31.0,
      "spares_percentage":100,
      "alarm_temperature":false,
      "alarm_controller_temperature":false,
      "alarm_spares":false,
      "alarm_enabled_media_temperature":true,
      "temperature_threshold":82.0,
      "alarm_enabled_ctrl_temperature":true,
      "controller_temperature_threshold":98.0,
      "alarm_enabled_spares":true,
      "spares_threshold":50,
      "shutdown_state":"clean",
      "shutdown_count":3
    }
  }
]

The inject-smart command will output the current health state of the module, which does not reflect the injected error. In order for the error to manifest, a system shutdown is necessary (a simple reboot is not enough).

If we change our mind, we can also remove an injected error:

# ndctl inject-smart nmem6 --unsafe-shutdown-uninject
[
  {
    "dev":"nmem6",
    "id":"8089-a2-1835-00002529",
...

 

Summary

In this article, I revisited the topics of recovery and fault tolerance that were discussed previously in Recovery and Fault-Tolerance for Persistent Memory Pools Using Persistent Memory Development Kit (PMDK). This was necessary given the new information and tools available. I showed how to discover, fix, and inject bad blocks using standard and stable tools such as ndctl, daxio, and fallocate. The issue of unsafe shutdowns was also covered, showing how unsafe shutdown errors can be detected and injected.

 

Footnotes

  1. They can also belong to a persistent memory device if the device was configured in devdax mode.
  2. The difference is because of the amplification effect caused when a file system block is set as bad. In this case, the file system block is 4096 bytes long versus 512 bytes for the namespace.

 

Notices

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel, the Intel logo, Intel Optane, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2020 Intel Corporation
