I got an email request to write a blog about three things I would advocate to a programmer that could be used to speed up her program. My first flippant thought was “Location! Location! Location!” That got me thinking about real estate and led my meandering mind to answer the original query with a paraphrase of Blake (played by Alec Baldwin) from the film Glengarry Glen Ross: “A-B-C. A-Always, B-Be, C-Concurrent. Always be concurrent.”
I quickly realized such a simple quote was packed with so much more than just three simple things. I’ve written dozens of IDZ blog posts on individual items (granularity, load balance, task decomposition, parallelizing loops, etc.) that would be much more relevant. (If you can find them online, feel free to pick three and don’t bother to finish reading the rest of this post. Or, better yet, finish reading this one now and search online later for more focused recommendations.) I even wrote a book on the topic of concurrent and parallel programming. If you’ve got a copy, take out any three pages from the latter half of the book and you’ll likely have three different things you can do to speed up code. (Also, now that you’ve ruined your copy by tearing out those three pages, feel free to buy another copy.)
Putting aside my money-grubbing levity, I reconsidered the challenge. Thinking a bit deeper, I came up with three things that programmers should know and do to speed up their code’s execution. These are 1) Know the architecture, 2) Organize your data structures, and 3) Organize your computations. Allow me to expand and explain since these topics are a bit “meta” and cover many different, specific details.
While I was involved with teaching parallel programming and the use of Intel software tools, whenever there was a new Intel processor released, one of my colleagues would put together some slides on the details of the new architecture. These presentations were set for a full hour and went into excruciating and precise detail on all the bits and pieces and (non-moving) parts that went into the new processor. Before the third revision of this material, I told the course planners that, as a software person, I was being put to sleep 10 minutes into all of this hardware detail. And since we were teaching software developers, if we couldn’t relate bus speeds and clock ticks and memory refresh rates to anything concrete that a programmer could do to improve program performance on the chips, then I suspected the audience was nodding off, too.
On the other hand, no programmer, whatever the field of science or artistic endeavor they hail from, should be completely ignorant of at least some of the basics of processor architecture. Even though the specific details may change, it is important to know some rudiments about how cache memory works, how the memory hierarchy is structured, how registers are used in the CPU, and, most recently, how vector registers are different and what you can accomplish with them. In a computer science degree program, students should take a computer architecture course and get an in-depth exposure to these topics. Students in other disciplines that learn programming as part of their education will likely not get that exposure.
I figure a ten-minute overview would be sufficient to give a programmer enough knowledge of these topics and how they relate to writing and executing code. These ideas and principles will go a long way toward improving the execution of applications, whether serial or parallel, and across almost any programming language. After they’ve had this ten-minute exposure, programmers have my permission to fall asleep during anything more in-depth on the architecture du jour.
With an understanding of how data moves within the computer system, it behooves the modern coder to ensure that the organization of her data is aligned with that data motion during program execution. Most of the time, the way humans think about organizing the data within their applications will be just fine. Other times, the “obvious” way may actually work against the way the processor accesses data and result in suboptimal performance.
My “go to” example for this is working with a collection of complex numbers. A complex number has both a real and imaginary value. The obvious way to work with a complex number in a program is to create a structure/class that has a part/member to hold the real value and a part/member to hold the imaginary value. When you need to use a collection of complex numbers, you create an array of these structures. This array of structures will not hold all the real values in consecutive memory locations, nor will it hold all the imaginary values in consecutive memory locations. They will be interleaved with each other. If your code computes across all the real parts or just the imaginary parts, it will be most beneficial for loading cache and for vectorization to have the relevant data values in consecutive memory locations.
The fix for the above “problem” is to reorder the data structure you are using to be a structure of arrays. That is, create an array of real values and an equivalent array to hold the imaginary values; then collect these two arrays into a single structure. Since all the real (and imaginary) values are within a single array, they will be in consecutive memory locations. For computations that run through the whole set of real (and imaginary) values the compiler can generate code that loads the cache more efficiently and can even realize that vector instructions could be possible for those computations.
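To make the contrast concrete, here is a minimal sketch in C of the two layouts described above. The type names, the fixed size N, and the sum-of-real-parts computation are all illustrative choices of mine, not anything prescribed by a particular library:

```c
#include <stddef.h>

#define N 1024

/* The "obvious" array-of-structures (AoS) layout: real and
 * imaginary parts are interleaved in memory (r0 i0 r1 i1 ...). */
typedef struct {
    double re;
    double im;
} ComplexAoS;

/* The structure-of-arrays (SoA) layout: all real parts are
 * contiguous, and all imaginary parts are contiguous. */
typedef struct {
    double re[N];
    double im[N];
} ComplexSoA;

/* Summing just the real parts touches every other double in the
 * AoS layout, so half of each cache line fetched is wasted. */
double sum_real_aos(const ComplexAoS *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i].re;
    return sum;
}

/* The SoA version streams through consecutive memory locations,
 * so every byte of each cache line is useful and the compiler has
 * a straightforward pattern to vectorize. */
double sum_real_soa(const ComplexSoA *c, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += c->re[i];
    return sum;
}
```

Both functions compute the same answer; the difference shows up only in how much of each cache line fetched from memory does useful work.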
The downside of this fix, if you haven’t already realized it for yourself, is that the code needed is no longer “obvious.” Such a change may require a modicum of effort to alter the entire code base or to perform maintenance at a later date. When contemplating such a major reorganization of a data structure, weigh the potential execution benefits against the work needed to perform the proposed code changes. For code that will be run hundreds, thousands, or millions of times after such a change, the cost-benefit ratio should tip toward making the change.
To take best advantage of modern coding principles and the processors that support such features, you may need to reorder some of the computational steps within your code. Modifying the structure you impose on your data will necessitate some self-evident coding changes. But I’m thinking beyond that.
Take, for example, loops of independent operations. These are easily exploited for parallel execution. What about a loop where only most of the loop body is independent? It may be an all-or-nothing proposition, depending on the logic of the statements in the loop. Or, based on the execution order required by the algorithm, it may be possible to break up the loop into separate loops, some of which can be parallelized. If a loop body is long and complex, it might be advantageous to break that loop into smaller pieces for clarity and for a better chance that some of those smaller loops will contain independent iterations.
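A small sketch of that loop-splitting idea (the technique is usually called loop fission): the code below is a hypothetical example of mine, where one statement in a loop is independent across iterations and another carries a dependence, assuming a simple scale-and-prefix-sum workload:

```c
#include <stddef.h>

/* One loop mixing independent work (scaling b into c) with a
 * dependent recurrence (a running prefix sum into p).  As
 * written, the recurrence keeps the whole loop serial. */
void fused(const double *b, double *c, double *p, size_t n) {
    double running = 0.0;
    for (size_t i = 0; i < n; ++i) {
        c[i] = 2.0 * b[i];   /* independent across iterations */
        running += b[i];     /* carries a loop dependence */
        p[i] = running;
    }
}

/* Fissioning the loop isolates the dependence: the first loop
 * now has fully independent iterations, so it can be
 * parallelized or vectorized; only the second stays serial. */
void fissioned(const double *b, double *c, double *p, size_t n) {
    for (size_t i = 0; i < n; ++i)     /* parallelizable */
        c[i] = 2.0 * b[i];

    double running = 0.0;
    for (size_t i = 0; i < n; ++i) {   /* serial prefix sum */
        running += b[i];
        p[i] = running;
    }
}
```

The two versions produce identical results; the fissioned form simply hands the compiler (or an OpenMP pragma) a loop it can safely run in parallel.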
With regard to placing data in consecutive memory locations, you also need to access that data in consecutive order. If you have a logically multi-dimensional data structure stored in a one-dimensional block of memory, performance can suffer when the elements are not accessed in the same order in which the values are laid out in memory. Can you reorder the nested loops that run over the indices of the array to get a better access pattern, for instance, reversing the loop order to change access from column-wise to row-wise? Some algorithms require a specific order of access to compute the correct answer, so you may need to get creative in restructuring your code to ensure you are accessing elements in a beneficial order a majority of the time.
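A short C sketch of that loop reordering (loop interchange). C lays out two-dimensional arrays in row-major order, so the row-wise walk matches the memory layout; the matrix dimensions and the simple summation here are illustrative assumptions of mine:

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* C stores 2-D arrays row-major: m[i][0..COLS-1] are consecutive
 * in memory.  Walking column-first jumps COLS doubles between
 * touches, so nearly every access can land in a new cache line. */
double sum_column_wise(double m[ROWS][COLS]) {
    double sum = 0.0;
    for (size_t j = 0; j < COLS; ++j)
        for (size_t i = 0; i < ROWS; ++i)
            sum += m[i][j];
    return sum;
}

/* Interchanging the two loops visits elements in the same order
 * they sit in memory, making full use of every cache line
 * fetched and giving the compiler a vectorizable inner loop. */
double sum_row_wise(double m[ROWS][COLS]) {
    double sum = 0.0;
    for (size_t i = 0; i < ROWS; ++i)
        for (size_t j = 0; j < COLS; ++j)
            sum += m[i][j];
    return sum;
}
```

Both functions add up the same numbers; only the traversal order, and therefore the cache behavior, differs.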
If you’ve read through to the end of this, you might be wondering what each of the above advice topics has to do with “location.” The most important part of knowing processor architecture, for purposes of getting better execution performance, is to understand the relation of the parts of the memory hierarchy, i.e., their “location” relative to each other. Influencing the layout of data structures for better access is controlling the location of the values within the structure. Finally, the location of code blocks and how those interact with the data structures can not only boost performance of the serial execution, but can give the compiler a better chance of recognizing code that can be executed in parallel and code that can be vectorized. For me, the three things I’ve touched on above are at the heart of any successful attempt at code modernization.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804