Researchers at Universidad de Málaga find that developing Markov Decision Process (MDP) solvers for low-power systems is up to five times easier with Data Parallel C++ (DPC++) than with OpenCL™.
Heterogeneous computing with multiple types of processors can benefit systems ranging from large-scale supercomputers to embedded devices. However, in smaller systems (such as robots), compute-intensive, automated decision-making algorithms need to be efficient, and schedulers need to be aware of power as they assign tasks to different processing architectures to optimize throughput. These requirements call for new approaches to optimizing solutions for low-power systems.
The oneAPI programming model creates new opportunities to improve performance and efficiency in low-power systems. Professor Rafael Asenjo and Ph.D. student Denisa-Andreea Constantinescu, both from the Universidad de Málaga’s Department of Computer Architecture, are optimizing algorithms used in reinforcement learning (RL) to develop a heterogeneous scheduler for low-power use cases.
“The novelty of our solution,” Professor Asenjo stated, “is that we solve large problems on low-power SoCs. We actively seek to achieve energy efficiency and not just make code run fast.”
During development, they evaluated three different programming models, including oneAPI with the DPC++ programming language. They found DPC++ to be up to five times easier to program in than OpenCL and, when the scheduling strategy is carefully selected, to incur only three to eight percent overhead.
When you think of oneAPI use cases that leverage the performance of heterogeneous computing, you might first envision powerful workstations and multi-megawatt supercomputers loaded with big CPUs, GPUs, and FPGAs training complex neural networks. Professor Asenjo and Ms. Constantinescu, however, are working at the opposite end of the spectrum: compute-intensive RL algorithms optimized for battery-powered heterogeneous computing devices.
Ms. Constantinescu is passionate about robotics and mechatronics. Professor Asenjo has focused for many years on productively exploiting heterogeneous chips leveraging Intel® Threading Building Blocks (Intel® TBB) as the orchestrating framework and developing heterogeneous scheduling strategies. They are part of a team that includes Professors Juan-Antonio Fernández-Madrigal and Angeles Navarro and Associate Professors Francisco Corbera and Ana Cruz exploring methods in robotics and optimization to create efficient solutions using multiple processors in low-power systems. They are using the oneAPI programming model, the Intel® oneAPI Base Toolkit, and Intel® DevCloud.
“The announcement of oneAPI,” Asenjo explained, “was immediately received in our group as an enticing opportunity to raise the level of abstraction in our implementations of heterogeneous schedulers. I see oneAPI as a strong endorsement to SYCL and modern C++ as the basis for the ‘homogeneous programming of heterogeneous platforms’ idea.”
Their project is to create optimized methods for autonomous decision-making under uncertainty. That involves algorithms, data structures, and low-power heterogeneous computing. Their solutions will enable an intelligent agent (such as a robot) to act autonomously in environments where the effects of its actions are not deterministic (Figure 1). For example, a rover taking samples from the surface of Mars may not know if a sample is worth taking or if the direction of travel will lead to worthy specimens. Or, a drone looking for survivors trapped after a natural disaster may not know if it is following a path that will eventually lead to a person.
Figure 1. Robots must learn to navigate in uncertain environments, such as the ones above, using reinforcement learning methods. (The black object is the robot.)
“For low-power embedded systems and mobile robotics computing,” Constantinescu said, “memory and power are scarce. Despite this, applications often need close to real-time performance, while balancing power use. Our goal is to make it feasible and easy to implement intelligent agents and autonomous decision-making applications on mobile platforms.”
Their work will end up benefiting developers who care about runtime performance and energy efficiency. Their solutions will help enable future products like domestic robots that can navigate the home, pick up items, and put them where they belong—without ever having been in your home.
Developing the Solution—Navigating New Spaces
“I like robots and optimizing processes,” Constantinescu added. “For this project, we looked at which problems in robotics would highly benefit from a runtime and energy-efficient heterogeneous implementation.”
They discovered real challenges worth tackling. Many automated planning and decision-making algorithms rely on Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs). MDPs describe how an agent with a goal in mind learns by doing, even without knowing the map of its environment. Both MDPs and POMDPs can cope with uncertainty, such as the agent not knowing what lies ahead or whether its actions will be beneficial. The literature showed them that solving real-world instances of these problems is not even considered for low-power platforms, and that computing an optimal solution for medium to large problems is intractable.
“These cover the majority of the interesting use cases and have many practical applications, including autonomous robots, deep-space navigation, search and rescue, inspection and repair, toxic-waste cleanup, and much more,” Constantinescu concluded.
They started with the Value Iteration (VI) algorithm, commonly used in MDPs and a core kernel in many RL methods, optimizing its data structures for memory use and access. Then, they approached parallel implementation strategies on low-power SoCs (System on Chip) to improve the VI runtime and energy use. They initially implemented multicore parallelism, using OpenMP and Intel TBB, and then included GPU accelerators programmed with OpenCL.
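As a point of reference, the following is a minimal, sequential sketch of textbook Value Iteration, not the team's optimized implementation. The sparse per-state transition lists, the function name `value_iteration`, the discount factor, and the three-state `corridor` example are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sparse layout: transitions[s][action] lists only the reachable
# outcomes as (probability, next_state, reward) triples.
Outcome = Tuple[float, int, float]
Transitions = List[Dict[str, List[Outcome]]]


def value_iteration(
    transitions: Transitions,
    gamma: float = 0.9,
    tol: float = 1e-6,
    max_sweeps: int = 10_000,
) -> Tuple[List[float], List[Optional[str]]]:
    """Repeat Bellman backups until the largest value change drops below tol."""
    num_states = len(transitions)
    values = [0.0] * num_states
    policy: List[Optional[str]] = [None] * num_states
    for _ in range(max_sweeps):
        new_values = [0.0] * num_states
        delta = 0.0
        # Each state's backup reads only the previous sweep's values, so the
        # iterations of this loop are independent of one another.
        for s in range(num_states):
            best_value, best_action = 0.0, None
            for action, outcomes in transitions[s].items():
                q = sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                if best_action is None or q > best_value:
                    best_value, best_action = q, action
            new_values[s] = best_value  # states without actions keep value 0
            policy[s] = best_action
            delta = max(delta, abs(best_value - values[s]))
        values = new_values
        if delta < tol:
            break
    return values, policy


# Illustrative corridor: state 2 is the goal and has no outgoing actions.
corridor: Transitions = [
    {"right": [(0.8, 1, 0.0), (0.2, 0, 0.0)], "stay": [(1.0, 0, 0.0)]},
    {"right": [(0.8, 2, 1.0), (0.2, 1, 0.0)], "left": [(1.0, 0, 0.0)]},
    {},
]
values, policy = value_iteration(corridor)
```

Because every state is updated from the previous sweep's values, the loop over states is the natural candidate for splitting across CPU cores or a GPU.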
Next, they added heterogeneous scheduling to balance and optimize the computing resource utilization to minimize runtime and energy consumption.
Figure 2. RL flow with a Value Iteration algorithm.
To date, they have covered the MDP category of problems and are working on POMDPs. Once they achieve that milestone, they will implement benchmarks on a real robot. The resulting software will be released as open source under a GNU General Public License, most likely in a public GitHub or Bitbucket repository.
Evaluating Three Programming Models
Asenjo and Constantinescu approached the scheduler code development using three different programming models: OpenCL, oneAPI with SYCL-style buffers written in DPC++, and oneAPI with unified shared memory (USM) written in DPC++.
“This last addition makes the code more readable, shorter, and easier to debug,” Asenjo added. “And, we think that a process or method that is compact, easy to understand and reproduce is far more valuable than a method that works slightly better, at the cost of added complexity.”
They found that DPC++ code is more compact than the equivalent OpenCL code and as much as five times easier to program. With a careful scheduling strategy, it added only three to eight percent overhead.
“We use the Cyclomatic complexity (CC) and Programming Effort (PE) metrics to measure how easy (or difficult) it is to program a code,” Constantinescu explained. “CC is the number of predicates plus one, and PE is a function of the number of unique operands, unique operators, total operands, and total operators. The operands correspond to constants and identifiers, while the symbols or combinations of symbols that affect the value of operands constitute the operators. Higher values for CC and PE mean that it is more complicated for a programmer to code the algorithm.”
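The article does not give the exact PE formula. The sketch below assumes it corresponds to Halstead's classical effort measure (difficulty times volume), which is built from the same four counts; the function names and the sample counts are illustrative only.

```python
import math


def cyclomatic_complexity(num_predicates: int) -> int:
    """CC as described above: the number of predicates plus one."""
    return num_predicates + 1


def halstead_effort(
    unique_operators: int,
    unique_operands: int,
    total_operators: int,
    total_operands: int,
) -> float:
    """Halstead effort E = D * V, used here as an assumed stand-in for PE."""
    if unique_operators <= 0 or unique_operands <= 0:
        raise ValueError("need at least one unique operator and one unique operand")
    vocabulary = unique_operators + unique_operands
    length = total_operators + total_operands
    volume = length * math.log2(vocabulary)                            # V
    difficulty = (unique_operators / 2) * (total_operands / unique_operands)  # D
    return difficulty * volume


# Illustrative counts for a small kernel (not measurements from the study).
example_cc = cyclomatic_complexity(num_predicates=3)
example_pe = halstead_effort(
    unique_operators=4, unique_operands=4, total_operators=8, total_operands=8
)
```

Under this assumption, a "five times easier" comparison would be a ratio of such scores between two implementations of the same algorithm.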
They used platforms based on Intel® Core™ i5-8250U processors with Intel® UHD Graphics 620 in Intel® DevCloud to evaluate and measure their work with oneAPI.
“We chose Intel low- to medium-power processors as test beds in our experiments because they are energy efficient and powerful enough to run at least some AI benchmarks onboard a mobile robot,” Asenjo explained. “Besides, the quality and ease of use of Intel’s profiling and debugging tools are, in our opinion, ahead of competitors. Intel® Parallel Studio and other supporting tools now included in the Intel® oneAPI toolkits (Intel® VTune™ Profiler, Intel® Advisor and its Flow Graph Analyzer feature, Intel® Inspector, and others) are also part of the ‘productivity’ factor that we consider key to democratizing parallel and heterogeneous programming.”
Figure 3. Speedup and energy results across multiple implementations of heterogeneous schedulers using OpenCL and oneAPI.
Figure 3 shows results for MDP problems using various heterogeneous schedulers (HO, HD, HL).
“HO and OpenCL have nothing else under the hood,” Constantinescu explained. “It's brute force programming with problem-specific know-how and optimizations. There’s no overhead. Performance is good, but more painful and time-consuming to get it right. However, applications using oneAPI and HL are extremely easy to code, even though we pay with some performance loss compared to HO-OCL due to the abstraction overhead,” she said.
According to Asenjo, of the three scheduling strategies evaluated, static scheduling delivers the best runtime performance and energy efficiency, though it requires an exhaustive offline search. Adaptive scheduling provides good results with no previous training, particularly for large problem sizes and when the kernels and scheduler are coded with the USM approach. As future work, they are looking into more complex decision-making procedures, namely partially observable Markov decision processes (POMDPs).
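The trade-off between static and adaptive scheduling can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the HO, HD, or HL schedulers from the study: the device throughputs in `rates` are made-up numbers, energy is not modeled, and all function names are hypothetical.

```python
from typing import Dict, Optional


def static_makespan(num_items: int, rates: Dict[str, float], gpu_fraction: float) -> float:
    """Finish time when a fixed fraction of the items is sent to the GPU."""
    gpu_items = round(num_items * gpu_fraction)
    cpu_items = num_items - gpu_items
    return max(gpu_items / rates["gpu"], cpu_items / rates["cpu"])


def best_static_split(num_items: int, rates: Dict[str, float], steps: int = 100) -> float:
    """Exhaustive offline search over candidate GPU fractions."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda f: static_makespan(num_items, rates, f))


def adaptive_makespan(num_items: int, rates: Dict[str, float], base_chunk: int = 64) -> float:
    """Idle devices pull chunks sized by the throughput observed so far."""
    clock = {device: 0.0 for device in rates}        # time each device becomes idle
    observed: Dict[str, Optional[float]] = {device: None for device in rates}
    remaining = num_items
    while remaining > 0:
        device = min(clock, key=clock.get)           # first device to go idle
        known = [r for r in observed.values() if r is not None]
        if observed[device] is None:
            chunk = base_chunk                       # no measurement yet: probe
        else:
            # Scale the chunk by this device's speed relative to the slowest seen.
            chunk = max(1, round(base_chunk * observed[device] / min(known)))
        chunk = min(chunk, remaining)
        elapsed = chunk / rates[device]              # simulated execution time
        clock[device] += elapsed
        observed[device] = chunk / elapsed           # measured items per time unit
        remaining -= chunk
    return max(clock.values())


# Assumed throughputs (items per time unit); purely illustrative.
rates = {"cpu": 1.0, "gpu": 3.0}
split = best_static_split(4000, rates)
static_time = static_makespan(4000, rates, split)
adaptive_time = adaptive_makespan(4000, rates)
```

In this toy model the static split needs the device rates in advance, which mirrors the offline search mentioned above, while the adaptive variant only uses throughput it measures at run time and ends up close to the static result.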
“The main result of this project is a set of lessons learned from evaluating and testing the limits of different cross-platform strategies for parallel and heterogeneous execution of the Value Iteration algorithm on low-power SoCs. Our plan is to have a generic time- and energy-efficient POMDP solver by the end of this year,” Asenjo concluded.
They have also created a set of robot navigation benchmarks for testing decision-making for basic navigation to a goal. Their findings were published in conferences and journals, including “Efficiency and Productivity for Decision-Making on Low-Power Heterogeneous CPU and GPU SoCs” and “Performance Evaluation of Decision-Making Under Uncertainty for Low-Power Heterogeneous Platforms.” The code is available on Bitbucket.
Software
- Intel® oneAPI Base Toolkit
- Intel® oneAPI HPC Toolkit
- Intel® Parallel Studio
- oneAPI 2021.1-beta03
- Processor Counter Monitor (PCM) library
- Intel® VTune™ Profiler
- Intel® Graphics Driver, 9th generation
Hardware
- Intel® Core™ i5-8250U processor with Intel® UHD Graphics 620
- Intel® Core™ i7-5557U processor with Intel® Iris® 6100 GPU
- Intel® Core™ i7-5775C processor with Intel® Iris® Pro 6200 GPU
- Intel® Core™ i9-9900K processor with Intel® UHD Graphics 630
Resources and Recommendations
Video Tutorials with Code Samples
- DPC++ Part 1: An Introduction to the New Programming Model
- DPC++ Part 2: Programming Best Practices
- oneAPI and DPC++ Training by Colfax International (modules 1 to 4 include code samples)
PCM library: opcm/pcm: Processor Counter Monitor
Intel® GPU Compute Samples: An open source set of example applications that illustrate general-purpose computing on Intel® Processor Graphics.
- Data-Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL | James Reinders et al. (a preview of chapters 1 to 4 is freely available)
- Pro TBB: C++ Parallel Programming with Threading Building Blocks | Michael Voss et al.
1 Professor Rafael Asenjo is a co-author of Pro TBB: C++ Parallel Programming with Threading Building Blocks along with Michael Voss and James Reinders.