Modern Memory Subsystem Benefits for Database Codes, Linear Algebra Codes, Big Data, and Enterprise Storage

The overall advantages of Multi-Channel DRAM (MCDRAM) and High-Bandwidth Memory (HBM) are:

  • They offer higher bandwidth to Intel® Xeon Phi™ cores than is available from off-package dynamic random-access memory (DRAM).
  • You can use them to cache off-package memory.

In terms of performance, the overall advantages that may result from future 3D XPoint™ memory devices are:

  • Density: You may be able to position terabytes of memory near the processors.
  • Persistence: You may be able to keep data through a power failure.

In terms of cost of ownership, the overall advantages that may result from future 3D XPoint technology are:

  • Fewer processors and motherboards needed to supply a required amount of memory
  • Lower cost per bit
  • Lower idle-power per bit

The overall advantages of Intel® Omni-Path Fabric (Intel® OP Fabric) are:

  • You do not need special code to move data between nodes.
  • Code still runs, and runs well, if a few accesses are across nodes.

[Figure: NUMA conceptual model]

The following sections address specific advantages for database codes, linear algebra codes, big data, and enterprise and cloud storage.

Database Codes

Database systems must:

  • Translate user requests into very specific algorithms.
  • Obtain the data needed to run the algorithms.
  • Run the algorithms.
  • Persist changes to the data.

The higher capacities, performance, and persistence capabilities of modern memory subsystems can improve each of these functions.

Of special interest is how Intel OP Fabric, together with future 3D XPoint technology and similar non-volatile dual in-line memory modules (NVDIMMs), affects the problem of persisting data.

Database servers achieve reliability by replicating data onto some form of persistent storage, usually by sending the data to at least two memory subsystems on nodes far enough apart that a single physical disaster does not damage both. The number of nodes, their proximity, and the storage they use to hold the data depend on the perceived threat and the required reliability. For instance, replicating merely into DRAM on a nearby machine with a separate power supply may be adequate for some purposes, while replicating over several continents may be necessary to meet more stringent requirements.
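The replication rule described above can be sketched as follows. This is a minimal illustration, not a real database protocol: the `Replica` class, the `replicate` function, and the `min_acks` parameter are hypothetical names, and each "node" is just an in-memory dictionary standing in for a remote persistent store.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for a remote node's persistent store."""
    name: str
    reachable: bool = True
    store: dict = field(default_factory=dict)

    def write(self, key, value):
        # Return True to acknowledge the write, False if the node is down.
        if not self.reachable:
            return False
        self.store[key] = value
        return True


def replicate(key, value, replicas, min_acks=2):
    """Send a write to every replica; succeed if at least min_acks acknowledge."""
    acks = sum(1 for r in replicas if r.write(key, value))
    return acks >= min_acks


nearby = Replica("same-room")
remote = Replica("other-continent")
offline = Replica("maintenance", reachable=False)

# Two of the three replicas acknowledge, so the write counts as committed.
committed = replicate("order:17", b"paid", [nearby, remote, offline])
```

Raising `min_acks`, or choosing replicas that are farther apart, corresponds to the more stringent reliability requirements mentioned above.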

3D XPoint Memory and Storage Devices

Future 3D XPoint technology may be able to speed replication onto persistent media because:

  • It may match the speeds of the fastest long-distance interconnects.
  • Locally, it may be faster to write to persistent regions or storage regions on 3D XPoint dual in-line memory modules (DIMMs) than to write to 3D XPoint solid-state drives (SSDs), which, in turn, are likely to be faster and more reliable than other SSDs and hard disk drives (HDDs).
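To make "writing to a persistent region" concrete, here is a hedged sketch that uses an ordinary memory-mapped file as a stand-in for a persistent memory region. It assumes nothing about how 3D XPoint DIMMs are actually exposed to software; the function names, `REGION_SIZE`, and the one-counter record layout are all illustrative choices.

```python
import mmap
import os
import struct
import tempfile

REGION_SIZE = 4096            # one page; an arbitrary size for this sketch
RECORD = struct.Struct("<Q")  # a single little-endian 64-bit counter


def persist_counter(path, value):
    """Store value in the mapped region, then flush it to the backing file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), REGION_SIZE) as region:
            RECORD.pack_into(region, 0, value)  # looks like an ordinary memory store
            region.flush()                      # push the change to the backing file


def load_counter(path):
    """Re-map the region, as a restarted process would, and read the value back."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), REGION_SIZE, access=mmap.ACCESS_READ) as region:
            return RECORD.unpack_from(region, 0)[0]


# Create a zero-filled backing file, write one record, and read it back.
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * REGION_SIZE)
os.close(fd)

persist_counter(path, 42)
restored = load_counter(path)
os.remove(path)
```

The point of the sketch is the programming model: the update is a store into mapped memory followed by an explicit flush, rather than a write call through a storage stack.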

As the database server runs, there is less delay when scanning large tables because you can keep more data close to the processors. More threads can keep data in the larger memory, so there is more pending work available when some thread is delayed by waiting for data.

When starting a node, you may merely need to check that data in 3D XPoint memory is up to date instead of fetching it from a remote node. This may dramatically decrease startup time.
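One way to picture the startup check is to compare a small digest of the local data with a digest supplied by a remote node, and to transfer the full data only on a mismatch. The sketch below assumes this digest-based scheme purely for illustration; `start_node`, `digest`, and `fetch_remote` are hypothetical names, and a real system might use version numbers or log sequence numbers instead.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint of the data; cheap to exchange compared with the data itself."""
    return hashlib.sha256(data).hexdigest()


def start_node(local_data, remote_digest, fetch_remote):
    """Return (data, fetched); reuse local data when its digest matches the remote's."""
    if local_data is not None and digest(local_data) == remote_digest:
        return local_data, False   # local copy is current; no bulk transfer needed
    return fetch_remote(), True    # stale or missing; fall back to a full fetch


current = b"table contents v2"
fetches = []


def fetch_remote():
    fetches.append(1)              # count bulk transfers for the demonstration
    return current


# Up-to-date local copy: only the digest crosses the network.
data, fetched = start_node(b"table contents v2", digest(current), fetch_remote)

# Stale local copy: the full data must be fetched.
stale_data, stale_fetched = start_node(b"table contents v1", digest(current), fetch_remote)
```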

Intel OP Fabric

The Intel OP Fabric makes it possible for the same code to efficiently spread data across a room as well as across continents. You can fetch data from a remote site into a local cache with minimal programming and at high speed. Similarly, the Intel OP Fabric makes it easier and faster to send data for replication to multiple remote sites.

Linear Algebra Codes

Linear algebra codes can vary from running in seconds on single systems to running for months on huge systems.

MCDRAM and HBM

The combination of the many cores of the Intel Xeon Phi processor with MCDRAM or HBM on-package memory allows easy and effective programming involving large data structures accessed by the many cores. Often you can tile the code to keep most memory accesses within the per-core caches, and use the MCDRAM or HBM to cache large data structures stored on more distant DIMMs. As a result, you can attack significantly larger problems on a single node, especially if future 3D XPoint memory dramatically increases the storage of the local DIMMs.
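As an illustration of tiling, the following sketch multiplies two square matrices one block at a time, so that each block of the operands is reused while it is still close to the core. It is written in pure Python only to show the loop structure; the function name and the default tile size are arbitrary, and real performance work would use a compiled language or an optimized library.

```python
def matmul_tiled(a, b, tile=32):
    """Multiply square matrices (lists of lists) one tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay inside one tile.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


product = matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1)
```

The tile size would normally be chosen so that the working set of three tiles fits in the per-core cache; the `min(...)` bounds handle matrix sizes that are not a multiple of the tile size.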

Intel OP Fabric

For problems that require more computation or data than fit on a single Intel® Xeon® or Intel Xeon Phi processor, the usual approach is to distribute code and data using MPI.
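Distributing data typically starts with deciding which part of the data each process owns. The sketch below shows one common block decomposition; it does not call MPI, and `block_range` is a hypothetical helper in which `rank` and `n_ranks` play the roles of an MPI rank and communicator size.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open [start, stop) range of items owned by rank."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Ten rows split across four ranks.
ranges = [block_range(10, 4, r) for r in range(4)]
```

Each rank would then compute on its own rows and exchange only boundary or reduction data with the other ranks.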

The Intel OP Fabric makes it easier to share data using code that works well if threads are running on a single node or on many nodes.

3D XPoint and Intel OP Fabric

For applications that run for hours or days, it is highly desirable for computation to continue even if you deliberately stop one or more machines or you lose a machine because of a crash or power failure.

You may be able to use future 3D XPoint memory to provide a faster checkpoint than using SSDs.
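A checkpoint-and-resume loop can be sketched as follows. The example writes checkpoints to an ordinary file, which is an assumption made so the code runs anywhere; the names `save_checkpoint`, `load_checkpoint`, and `run`, and the JSON state layout, are illustrative only. The faster the medium behind `save_checkpoint`, the more often a long-running computation can afford to checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device
    os.replace(tmp, path)     # readers see either the old or the new checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, steps, stop_after=None):
    """Sum 0 + 1 + ... + (steps - 1), resuming from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < steps:
        if stop_after is not None and state["step"] >= stop_after:
            break             # simulates a deliberate shutdown or a crash
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = run(ckpt, steps=10, stop_after=4)  # interrupted after four steps
final = run(ckpt, steps=10)                # resumes at step 4 and finishes
```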

The Intel OP Fabric can quickly replicate data onto other nodes, so you can quickly shut down a node and resume its work on another node.

Big Data (Graph 500, Social Media, Apache* Hadoop* Framework, and Some Biochem)

These frameworks and applications often combine many aspects of the database and linear algebra cases, with the additional complication that the data is less regular and the algorithms are less focused.

3D XPoint Memory and Storage Devices

The expected capacity and lower cost-per-bit of future 3D XPoint memory may make it possible to have more indices available locally than before, reducing the access times for more distant data. More threads can keep data in the larger memory, so there is more pending work available when some thread is delayed by waiting for data.

3D XPoint technology may also speed replication onto persistent media because:

  • It may match the speeds of the fastest long-distance interconnects.
  • Locally, it may be faster to write to persistent regions or storage regions on 3D XPoint DIMMs than it is to write to 3D XPoint SSDs (which, in turn, are likely to be faster and more reliable than other SSDs and HDDs).

When starting a node, you may merely need to check that data in 3D XPoint memory is up to date instead of fetching it from a remote node. This may dramatically decrease startup time.

Intel OP Fabric

The Intel OP Fabric makes it possible for the same code to efficiently spread data across a room as well as across continents. You can fetch data from a remote site into a local cache with minimal programming and at high speed. Similarly, the Intel OP Fabric makes it easier and faster to send data for replication to multiple remote sites.

Enterprise and Cloud Big Storage

These frameworks and applications often combine many aspects of the database and linear algebra cases, with the additional complication that the range of applications is more diverse within a single system.

Fortunately, future 3D XPoint technology may be useful, and Intel OP Fabric technology is useful, across a wide spectrum of applications. Together they can unify everything from an Intel Xeon Phi system solving a difficult optimization problem to a highly scalable, highly reliable storage subsystem holding business-critical data.

Summary

The previous article, NUMA Hardware Target Audience, showed that most applications can benefit from new memory subsystem hardware technologies. This article dives deeper into how specific application styles can benefit. The next article, Memory Performance in a Nutshell, starts providing the information you need to get the best performance out of the new hardware.

About the Author

Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. He enjoys improving performance by finding out where applications are spending the most time, and measuring the performance gains that result from finding better algorithms and great ways to implement them on available hardware.
