Part 1: Efficient Data Movement and Control Using the oneAPI Programming Framework
In this blog, we look at data management, which is especially critical in programming models such as oneAPI that target cross-architecture systems with hardware accelerators. Data Parallel C++ (DPC++), oneAPI’s cross-architecture language and associated runtime, provides easy-to-use methods for data management that should be familiar to developers who program in C++ or other languages such as CUDA* and OpenCL™. (For a detailed look at DPC++, see our last blog on the language and its kernel-based approach.)
Data Management Problem Across Device Architectures
In an open, standards-based development solution, functional portability is a key priority. The DPC++ framework is set up to provide you with an easier path to getting code running on different types of accelerators from multiple vendors, thus eliminating single-vendor lock-in.
Let’s start with memory. Figure 1 shows that traditional single-socket CPU-based systems have a single memory, but accelerator devices often have their own attached memories that cannot be directly accessed from the host. That means that parallel programming models supporting discrete devices must provide mechanisms to manage these multiple memories and move data between them.
Figure 1. Multiple Discrete Memories
Parallel programs running on a device prefer local accesses so they can read and write data to memory attached directly to the device. Remote accesses tend to be slower because they must travel over data links with lower bandwidth and higher latency. So, there is an advantage in co-locating a computation and its data, but that means you must somehow ensure data is copied or migrated between different memories—to move it closer to the computation.
Explicit and Implicit Management of Multiple Memories
Multiple memories can be managed implicitly by the runtime or explicitly by you. Each method has its own advantages and drawbacks.
With implicit data movement, which is controlled by a parallel runtime or driver, the DPC++ runtime does the heavy lifting and automatically transfers data to the appropriate memory before it is used. This approach takes less programming effort and leaves fewer opportunities for errors when getting your application to use memory local to the device.
The implicit approach, however, could hurt program performance because you have little to no control over the runtime’s implicit mechanisms. That hands-off behavior is both its advantage and its disadvantage: the runtime provides functional correctness, but it might not maximize the overlap of computation with data transfer, because it doesn’t have as much information about the application as you do.
Alternatively, with the explicit approach, you manually insert explicit copies in code between different memories, and you can use this approach with optimization and tuning to achieve your desired performance. But in which situations should you do this?
- Explicit copies or transfers are needed only for “device” allocations in unified shared memory (USM, which we discuss in just a bit).
- Buffers and other types of USM also support “copy” functions, but these are only needed to make copies of memory for algorithm correctness.
- Any copies relating to transfers between memory spaces get inserted by the runtime automatically.
For example, with a discrete GPU (Figure 2), you explicitly copy data from host memory to GPU memory before the kernel runs. After the kernel computes new results, the data must be copied back to the host before the host program can use it. If you insert explicit code to trigger the data movements, you have full control over when data is transferred between the different memories. From there, you can overlap computations and data transfers to obtain the best performance.
Figure 2. Data Movement and Kernel Execution
Now, getting all this data movement correct upfront can be time-consuming and tedious. And transferring an incorrect amount of data, or not ensuring that all data is transferred before a kernel begins computing, can lead to incorrect results.
You can use a mix of explicit and implicit methods for different data in a program. You might choose to begin with implicit data movement to simplify porting an application to a new device, or to simplify writing an application from scratch. Once the application is working, and when it’s time to tune performance, you could replace implicit with explicit data movement in performance-critical parts of the code, identified using a profiler or other approach.
Unified Shared Memory, Buffers and Images
DPC++ provides three abstractions for managing memory: USM, buffers, and images.
- USM is a pointer-based approach. An advantage of using it is easier integration with existing C++ code that operates on pointers, and familiarity for C and C++ developers.
- Buffers describe one-, two-, or three-dimensional arrays. They are a “raw” representation of memory and provide an abstract view of it that can be accessed on either the host or a device.
- Images are more abstract and “opaque.” They are used for a limited set of special functions, such as reading image data through sampler objects.
For this blog, we focus more on USM and buffers, which are sufficient for most applications.
Choosing a Data Management Strategy
The first decision in defining your data management strategy is whether to use explicit or implicit data movement. The implicit approach is generally easier, because DPC++ will handle the data movement, letting you focus on expressing the computation. But if you’d rather have full control over data movement from the beginning, then the explicit approach using USM device allocations is the way to go. Just be sure to add the necessary copies between host and devices.
When selecting an implicit data movement strategy, you can choose to use buffers or USM. If you’re porting an existing C/C++ program that uses pointers, USM might be an easier path because most code won’t need to change.
Another decision is how to express dependences between kernels. Choose buffers if you prefer to think about data dependences between kernels. Choose USM if you prefer to think about dependences as performing one computation before another. To order computation, DPC++ supports in-order queues that execute kernels sequentially in the order they are submitted, and out-of-order queues that require explicit dependences between kernels and may have multiple valid execution orders.
- In-order queues are simpler and more intuitive, but they constrain the runtime and may run slower due to a lack of overlapped execution.
- Out-of-order queues are more complex because you must define the execution dependences, but they give the runtime more freedom to reorder and overlap executions, and therefore may run more efficiently.
Out-of-order queues may be the right choice for performance-critical code with complex dependences between kernels, provided you’re OK with managing the complexity. Otherwise, if your program runs many kernels one after another, in-order queues are easier to reason about and may be the better option.
Watch for more blogs ahead on different aspects of cross-architecture programming. For an even deeper guide, see our 4-chapter preview book on DPC++, and watch for the full book from Apress later this year.
To get started using DPC++, you can access the language and APIs in two ways:
- Intel® DevCloud: Use to prototype your project, test code and workloads across a variety of Intel® processors and accelerators—FREE.
- Intel’s oneAPI reference implementation: Intel® oneAPI Toolkits Beta
More DPC++ Development Resources
- oneAPI Programming Guide
- Video: Data Parallel C++: An Open Alternative for Cross-Architecture Development [12.05]
- Free oneAPI and DPC++ webinars and how-to’s
- Dive Into Data Parallel C++: An Open, Standards-Based, Cross-Architecture Programming Solution
- DPC++ Data Management across Multiple Architectures Part 2: Ordering Execution Using the oneAPI Programming Framework
Ben Ashbaugh is an Intel software architect focused on parallel programming models for general purpose computation on Intel graphics processors, including oneAPI and DPC++. He has authored numerous OpenCL™ extensions and is an Intel representative within The Khronos Group, where he contributes to the OpenCL, SPIR-V™, and SYCL™ industry standards.
James Brodman is an Intel software engineer researching languages, with a focus on the oneAPI initiative and DPC++. He has written extensively on programming models for SIMD/vector processing, languages for parallel processing, distributed memory theory & practice, programming multi-core systems, and more.
Mike Kinsner is an Intel software engineer working on parallel programming models for a variety of architectures, and high-level compilers for spatial architectures. He is an Intel representative within The Khronos Group where he contributes to the SYCL™ and OpenCL™ industry standards, and is currently focused on DPC++ within the oneAPI initiative.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.