In the past couple of years I've noticed a trend to "re-invent" technology or re-brand old ideas and concepts from previous computing generations. For example, IBM had virtual machines on their big mainframes and as part of their OS back in the '80s; cloud computing seems to be a revamping of the idea behind computer service agencies in the '70s that leased machine time to those that couldn't afford to buy their own mainframe; and some parts of the recent machine learning phenomenon are neural networks that I studied back in the '90s. Not that this is all a bad thing. I expect that some of the ideas that were expressed in the previous century, while highly popular and intense areas of research, just didn't have the computing power to sustain them or make them viable, but today we have multi-core processors to drive the needed computations. Now, I think I've found another old idea that could be revived for today's computing environment.
Back in November 2016 I had the honor to take part in a half-day tutorial at SC16. This was "Programming Irregular Applications with OpenMP: A Hands-on Tutorial" with Tim Mattson (Intel), Alice Koniges (NERSC) , and Jeremy Kemp (Univ. Houston). As the title denotes, this tutorial went into some detail about using OpenMP tasks, especially in situations with irregular computations that can be assigned to OpenMP tasks and coordinating between those tasks. Afterwards, on my flight home, I was trying to think up other examples that could be used in the tutorial if we were given the opportunity to present the training in the future. I remembered an old parallel architecture model called "systolic arrays" that I thought could be an example for OpenMP task dependences.
A systolic array is a "mesh" of processors arranged in some network configuration. The configuration and how data flowed through the array determined what the hardware computed. The name systolic comes from the idea that data was "pumped" through the array of processors in lockstep fashion, much like blood is pumped through the body. (Fun fact: Systolic is the top number when you get your blood pressure taken.) This restriction of lockstep execution wasn't all that implausible since even theoretical models of parallel computation (e.g., PRAM) typically relied on the idea of lockstep execution of instructions across the model in order to make it easier to prove things about an algorithm's timing or correctness.
The original systolic arrays were implemented in hardware (when actually implemented and not just theorized on paper). However, I was thinking that an OpenMP task could serve as a stand-in for the systolic processing element (PE) in software. The dependence between tasks, as designed by the programmer, would provide the coordination of data movement between PEs. Whether it would be in lockstep or not, carefully crafting the tasks and their data dependencies should give a valid working algorithm that is based on a given systolic array design and execution.
One reason that systolic arrays didn't take off was due to the need for specialized hardware that was set into a single algorithm. This led to a dearth of algorithm development and eventually the idea was pushed aside by cheaper and more generic parallel platforms. Now I'm wondering if the computational power is at hand to make this idea more marketable. I eventually realized that FPGAs could pick up the idea and more easily implement a systolic array computation, but development of new algorithms could be better undertaken by synthesizing a virtual systolic array though OpenMP tasks before commiting anything to a design. As I pondered longer, I realized that any kind of parallel algorithm that relied on a network configuration within a parallel machine (i.e., mesh, hypercube, grid, torus, etc.) could now be simulated by designating OpenMP tasks as virtual processors and using task dependences to define the network topology. I've got dozens of old parallel computing books that are chock full of network-defined parallel algorithms, so there is a plethora of ideas that can be investigated.
Can I say whether or not this approach will amount to anything? Will it be better than an equivalent algorithm that simply utilized shared-memory for data motion and coordination between tasks? Nope, I can't. At least not without trying it first. The one thing I do know is that the architecture available will drive the design of an algorithm. If you consider a different architecture, you design a different algorithm (think shared-memory versus distributed memory, Pthreads versus GPU). Perhaps our parallel algorithm forebearers had great ideas, but were hampered by the restrictions they faced in the computing hardware available to them. With today's faster processors, more numerous cores, and software that can connect implicit PEs in any way we desire, we might be able to dust off some of those ideas and put them to effective use.
That's my new idea for an old technology. Now I just need to rename it something that sounds really cool.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804