An old Jewish fable tells of a poor man who asks his rabbi for advice. The family is large, the house is small, and it feels very crowded. The rabbi tells the man to bring a goat into the house and to come see him again after a month. The man is confused, but does not argue. He brings a goat to live in the house. A month later, the rabbi tells the man to take the goat back out of the house and to see him again in a week. Sure enough, a week later the man thanks the rabbi: he feels much better, and the house doesn’t feel as crowded anymore.
As long as GPGPUs can only be programmed in dedicated languages, programmers may feel that the reason they cannot write code that runs across both CPUs and GPUs is the lack of a language that supports all HW targets. This is the goat living in the house. Removing the goat merely restores the original problem and provides only a false sense of a solution.
We are seeing some excited reactions to advances in languages such as OpenCL™ and OpenACC and, soon, to the ability to program GPUs in C++. Each of these advances is typically accompanied by papers and blog posts about portable code, and even about performance portability. Recent news about advancements in C++ support for GPGPUs might lead the reader to conclude that it is becoming possible to maintain a single-source code base across CPUs and GPUs.
What is the real problem?
Even if there were a common language that supported both CPUs and GPGPUs, the problem would remain: the HW architectures of most CPUs on the one hand and most GPGPUs on the other diverge in multiple ways, precluding performance portability.
The difference between just “portability” and “performance portability” is, of course, performance. In practical terms, we can each decide what that means for ourselves, depending at least to some extent on application requirements. Suppose you have two different applications, one for a CPU and one for a GPGPU, and both perform reasonably well. You are, however, aware that long-term maintenance of two code bases is more expensive than maintaining a single code base, and you would be interested in a performance-portable application to replace both. How much performance would you be willing to give up by replacing the two optimal applications with a single one?
Probably more of us would be willing to lose five percent of performance, and fewer of us would be happy to lose a factor of five.
Some of the differences in HW architecture support for parallel computing are well known and broadly understood:
CPUs such as dual-socket, current-generation Intel® Xeon® processor-based servers provide a few tens of cores that are two-way hyper-threaded. The cores are high-performance out-of-order cores. By contrast, GPGPUs provide hundreds of low-performance scalar processors.
GPGPUs tend to hide the latency of long-latency operations, including memory access, by switching among many HW threads. CPUs use far fewer hyper-threads, if any, and rely on their cache architecture to absorb memory latency.
GPGPUs tend to use the HW for switching between threads, whereas CPUs use OS threading and user mode schedulers for scheduling tasks.
Many CPUs architecturally expose two levels of parallelism to SW: the core count and SIMD (single instruction, multiple data) support. SIMD processes multiple data elements within a single instruction. By contrast, running the same sequence of instructions on multiple cores allows the cores to execute independently of each other and to progress at different paces.
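As a rough illustration of these two levels on a CPU, the sketch below splits a simple y = a * x + y update into coarse blocks that can be spread across cores, and marks the inner unit-stride loop as a candidate for SIMD. It assumes a compiler with OpenMP enabled (for example via an option such as -fopenmp); the function name saxpy_two_level and the block size of 4096 are hypothetical choices made only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes y = a * x + y with two levels of parallelism.
// Assumes OpenMP is enabled; if it is not, the pragmas are ignored and the
// loops simply run serially with the same result.
void saxpy_two_level(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::ptrdiff_t n =
        static_cast<std::ptrdiff_t>(std::min(x.size(), y.size()));
    const std::ptrdiff_t block = 4096;  // assumed coarse block size per task
    const float* xp = x.data();
    float* yp = y.data();

    // Outer level: coarse blocks distributed across cores. Each thread
    // progresses through its blocks independently of the others.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t start = 0; start < n; start += block) {
        const std::ptrdiff_t stop = std::min(start + block, n);

        // Inner level: independent, unit-stride iterations that the compiler
        // is asked to map onto SIMD instructions.
        #pragma omp simd
        for (std::ptrdiff_t i = start; i < stop; ++i) {
            yp[i] = a * xp[i] + yp[i];
        }
    }
}
```

On most GPGPUs the same computation would typically be written as a single kernel with one element per work item, so the two levels shown here collapse into one.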
These HW architectural differences dictate different algorithmic choices. Algorithmic choices are typically made in conjunction with choices on how to parallelize the algorithms. As is well known, some algorithms are efficient when run sequentially but cannot be parallelized at all, or cannot be parallelized efficiently for some HW targets, while alternative algorithms are less efficient sequentially but parallelize better.
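A prefix sum is one familiar example of this trade-off; it is chosen here purely as an illustration and is not taken from any particular application. The sketch below compares a sequential scan, which does the minimum number of additions but forms a dependency chain, with a Hillis-Steele style scan, which does more additions in total but makes every update within a step independent. The function names are hypothetical, and the "parallel" steps are executed serially here just to count the work.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential inclusive prefix sum: n - 1 additions, but each addition
// depends on the result of the previous one.
std::vector<long> scan_sequential(const std::vector<long>& in, std::size_t& additions) {
    std::vector<long> out(in);
    additions = 0;
    for (std::size_t i = 1; i < out.size(); ++i) {
        out[i] += out[i - 1];
        ++additions;
    }
    return out;
}

// Hillis-Steele style inclusive scan: about log2(n) steps. Within one step
// every element update is independent of the others, so the inner loop could
// be run in parallel. It is run serially here only to count the additions.
std::vector<long> scan_step_parallel(const std::vector<long>& in, std::size_t& additions) {
    std::vector<long> cur(in);
    std::vector<long> next(in.size());
    additions = 0;
    for (std::size_t stride = 1; stride < cur.size(); stride *= 2) {
        for (std::size_t i = 0; i < cur.size(); ++i) {
            if (i >= stride) {
                next[i] = cur[i] + cur[i - stride];
                ++additions;
            } else {
                next[i] = cur[i];
            }
        }
        std::swap(cur, next);  // double buffering between steps
    }
    return cur;
}
```

For eight elements the sequential version performs 7 additions and the step-parallel version performs 17, in 3 steps. Whether that extra work pays off depends on how many independent operations the target HW can execute at once.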
The implications of the HW differences on algorithmic choices are many, and it is hard to provide a succinct summary. The implications on the parallelization of algorithms are also many, but at least a subset of them is well understood:
Parallelization for GPGPU requires many more parallel tasks than for a CPU.
Parallelization for GPGPUs can be fine-grained, while parallelization for CPUs needs to be coarse-grained to amortize the overhead of the SW scheduler. For CPUs, vectorization provides the fine-grained level, while for most GPGPUs parallelization and vectorization are done jointly, via kernel programming.
Improving the cache efficiency of the parallelization, using techniques such as tiling, blocking, and prefetching, is relatively more impactful for CPUs. A GPGPU programmer may be able to get closer to optimal performance without working as hard on the cache efficiency of their implementation.
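To make the tiling point concrete, here is a minimal sketch of a tiled matrix transpose for a CPU. It assumes a row-major matrix stored in a flat vector of exactly rows * cols elements; the function name transpose_tiled and the tile parameter are hypothetical, and a suitable tile size depends on the cache sizes of the actual target.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: transposes a rows x cols row-major matrix tile by tile.
// Assumes a.size() == rows * cols. Working on one small tile at a time keeps
// both the lines being read and the lines being written resident in cache.
std::vector<double> transpose_tiled(const std::vector<double>& a,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t tile) {
    std::vector<double> out(rows * cols);
    if (tile == 0) tile = 1;  // guard against a zero step
    for (std::size_t i0 = 0; i0 < rows; i0 += tile) {
        for (std::size_t j0 = 0; j0 < cols; j0 += tile) {
            const std::size_t i1 = std::min(i0 + tile, rows);
            const std::size_t j1 = std::min(j0 + tile, cols);
            // Transpose one tile; the result is a cols x rows matrix.
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    out[j * rows + i] = a[i * cols + j];
                }
            }
        }
    }
    return out;
}
```

The result is identical to an untiled transpose; only the order of memory accesses changes. The outer tile loops are also natural candidates for coarse-grained distribution across cores.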
In summary, deficiencies in language syntax can certainly get in the way of productivity while trying to optimize performance, and at the same time there are considerations in performance engineering, especially in parallel programming, that are outside the scope of the language. Advancements in languages and their cross-platform availability are good progress, but do not by themselves deliver performance portability.