Establish standard coding practices in terms of things to avoid (as opposed to desirable actions, which are covered in a separate item, identified below).
Make the following five practices standard procedure:
Avoid cache-stripe conflicts. When data is loaded from RAM into cache, it is moved in chunks called "cache stripes" (more commonly known as cache lines). The size of a cache stripe varies with the processor architecture and sometimes with the specific model of chip. On Intel® Pentium® 4 processors and Intel Xeon® processors, the stripe is 128 bytes. Every time a stripe in cache is modified, a check is made to see whether that stripe is also loaded into any other caches. If so, those caches receive a signal to refresh their copy of the cache stripe. This process is known as cache coherency; it ensures that all copies of a data stripe are identical across the system.
Suppose now that two different threads each maintain a private data field located in the same 128-byte stripe. If both threads are modifying their private fields, the stripe will be constantly updated in all caches, even though these updates serve no purpose: the first thread does not need to be informed that the other thread has modified its own private field. On a multiprocessing system, these constant and pointless cache updates will soon swamp the other work. This problem, commonly called false sharing, can cripple performance and must be watched for carefully. The solution is simple: fields used by separate threads should not be collocated in the same cache stripe. Know the size of the cache stripe and code accordingly.
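The layout rule above can be sketched in C++ as follows. This is a minimal illustration, not a definitive implementation: it assumes the 128-byte stripe size quoted in the text, and the type names (PerThreadCounter, SharedStripe, SeparateStripes) and the helper same_stripe are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 128-byte stripe, the figure the text gives for
// Pentium 4 and Xeon processors. Adjust for the target chip.
constexpr std::size_t kStripeSize = 128;

// Hypothetical per-thread field. alignas pads and aligns the
// type so that each instance occupies a stripe of its own.
struct alignas(kStripeSize) PerThreadCounter {
    std::uint64_t value = 0;
};

// Layout to avoid: both private fields are adjacent, so they
// normally land in the same stripe.
struct SharedStripe {
    std::uint64_t thread_a_count = 0;
    std::uint64_t thread_b_count = 0;
};

// Safer layout: each thread's field sits in a separate stripe.
struct SeparateStripes {
    PerThreadCounter thread_a_count;
    PerThreadCounter thread_b_count;
};

// Returns true when two addresses fall inside the same stripe.
inline bool same_stripe(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    const auto ua = reinterpret_cast<std::uintptr_t>(a);
    const auto ub = reinterpret_cast<std::uintptr_t>(b);
    return ua / kStripeSize == ub / kStripeSize;
}

static_assert(sizeof(PerThreadCounter) == kStripeSize,
              "each counter should fill exactly one stripe");
static_assert(alignof(SeparateStripes) == kStripeSize,
              "the container must start on a stripe boundary");
```

The padding costs memory (128 bytes per counter instead of 8), so it is worth applying only to fields that separate threads actually write to frequently.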
Do not use excessively tight loops. For years, developers have been told to make their loops as tight as possible: the fewer the instructions, the faster the loop. On today's processors, however, excessively tight loops can actually slow performance. The classic case is the spin-wait loop, which simply keeps checking a variable to see whether its value has changed. (Such loops are often used to synchronize activity between two threads.) The principal problem with such tight probing is fundamental: the execution pipeline is tied up, looping until the variable changes. At an execution rate of 3 GHz, a spinning loop that lasts even a fraction of a second costs the program a considerable number of cycles that could have gone to useful work. If your loop is spinning just to check a variable, consider pausing the thread for intervals of half a second (or longer). This frees up the processor to do other work while the waiting thread is suspended.
If you must keep a loop spinning, use the pause instruction. Out-of-order execution can issue numerous requests probing the same variable, and these can be difficult to resolve. To avoid the situation where multiple probes are all waiting on each other, the pause instruction creates a one-clock delay in each spin of the loop, enough time for the requests to resolve properly.
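A spin-wait that issues the pause instruction on every iteration might look like the sketch below. It assumes an x86 compiler that provides the _mm_pause intrinsic in emmintrin.h and falls back to a no-op elsewhere. The helper names cpu_relax and spin_until_ready, and the idea of a spin budget (max_spins), are assumptions made for this illustration and do not come from the article.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
#include <emmintrin.h>  // declares _mm_pause
// Emits the pause instruction inside the spin loop.
inline void cpu_relax() { _mm_pause(); }
#else
// Fallback for non-x86 targets: no pause instruction available here.
inline void cpu_relax() {}
#endif

// Hypothetical helper: probes `ready` at most `max_spins` times,
// pausing between probes. Returns true if the flag was seen set.
inline bool spin_until_ready(const std::atomic<bool>& ready,
                             std::uint64_t max_spins) {
    assert(max_spins > 0);
    for (std::uint64_t spin = 0; spin < max_spins; ++spin) {
        if (ready.load(std::memory_order_acquire)) {
            return true;
        }
        cpu_relax();  // give outstanding probes time to resolve
    }
    // One last probe after the budget is exhausted.
    return ready.load(std::memory_order_acquire);
}
```

When the function returns false, the caller can suspend the thread instead of spinning further, which matches the earlier advice to pause the thread when the wait is likely to be long.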
Avoid overloading the cache. It is important to recognize that every item loaded into cache dislodges one already there. Consequently, overuse of explicit preloading (prefetching) will lead to slower performance when the displaced items are still needed. For this reason, Intel recommends that for most applications, the cache's own built-in preloading mechanism be allowed to do its work without interference. This means that developers should explicitly preload only the initial data they need, and then only immediately before use.
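One way to read this advice in code is a single explicit preload of the first data touched, with no per-iteration preloads inside the loop. The sketch below assumes an x86 compiler that provides _mm_prefetch and _MM_HINT_T0 in xmmintrin.h and does nothing on other targets. The names prefetch_for_read and sum_values are hypothetical, and whether the one prefetch helps at all depends on the workload, so it should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // declares _mm_prefetch and _MM_HINT_T0
// Requests that the stripe containing `p` be loaded into cache.
inline void prefetch_for_read(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
}
#else
// Fallback for non-x86 targets: no explicit preload is issued.
inline void prefetch_for_read(const void*) {}
#endif

// Hypothetical example: sums an array after preloading only its
// first element. The rest of the sequential walk is left to the
// processor's built-in preloading mechanism.
inline long long sum_values(const int* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    if (count == 0) {
        return 0;
    }
    prefetch_for_read(data);  // initial data only, just before use
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += data[i];     // no per-iteration prefetch here
    }
    return total;
}
```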
Don't forget to use SIMD instructions. Since the original Intel Pentium® processor was released, Intel has extended the instruction set to provide single instruction, multiple data (SIMD) operations. The first generation, called MMX™ technology, was followed by Streaming SIMD Extensions (SSE) for the Intel Pentium® III processor, and Streaming SIMD Extensions 2 (SSE2) for the Pentium 4 processor and Intel Xeon processor. All these extensions have the ability to apply the same arithmetic instruction across multiple data items. Four 32-bit values or even 16 eight-bit values can be added in one swoop via SIMD.
Developers often overlook SIMD instructions, because they have an unusual syntax and are rarely discussed in the programming trade press. Almost all the SIMD instructions are callable directly from C and C++, however, and they do not require knowledge of assembly language. They can be coded using Microsoft*, Borland*, and Intel® Compilers and they are accessible on Windows* and Linux* platforms.
The importance of using SIMD instructions becomes obvious when you consider the fact that the code most in need of performance optimization frequently crunches numbers. In addition, once you identify the handful of SIMD instructions you need, you will find yourself using them frequently.
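As an illustration of calling SIMD instructions directly from C++, the sketch below adds four 32-bit floats, and then 16 eight-bit values, with one instruction each. It uses the SSE and SSE2 intrinsics from emmintrin.h (_mm_add_ps and _mm_add_epi8) and falls back to plain loops where SSE2 is not available. The function names add_floats and add_bytes16 and the macro SIMD_SKETCH_HAVE_SSE2 are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE and SSE2 intrinsics
#define SIMD_SKETCH_HAVE_SSE2 1
#endif

// Element-wise sum of two float arrays, four values per instruction
// where SSE is available, with a scalar loop for the remainder.
inline void add_floats(const float* a, const float* b, float* out,
                       std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#ifdef SIMD_SKETCH_HAVE_SSE2
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned loads
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Sum of two blocks of exactly 16 bytes; each byte wraps on overflow.
inline void add_bytes16(const std::uint8_t* a, const std::uint8_t* b,
                        std::uint8_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#ifdef SIMD_SKETCH_HAVE_SSE2
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(va, vb));
#else
    for (int i = 0; i < 16; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
#endif
}
```

The unaligned load and store intrinsics are used so that the sketch works on any buffer. Code that can guarantee 16-byte alignment may prefer the aligned variants.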
Avoid thinking you know what the performance issues are. Processors today are so complex that performance snags can occur in places that even experienced developers would never consider.
Beyond the places already discussed, there are still many pitfalls. Consider, for example, efforts to recode a C function in assembly language. One temptation might be to make more extensive use of the enlarged register set offered by the Intel® NetBurst® microarchitecture. This use of registers must be done with extreme care, however – it is no longer true that keeping many items in registers automatically delivers better performance.
Processor performance can be adversely affected by excess register allocation; using too many registers makes it difficult for the processor to move data around for optimal execution sequencing. This situation, known as register pressure, is nearly impossible to detect, except through the otherwise unexplained drop in execution speed. (For this reason, most C/C++ compilers today ignore the 'register' keyword, which at one time was a command to compilers to place specific variables inside registers.) The Intel® VTune™ Performance Analyzer is one of the few tools that can diagnose register pressure.
The prevalence and impact of cache-stripe errors, excessively tight loops, register pressure, and many similar performance snags suggest that automated hot-spot location is vital to the developer's enterprise. The VTune Analyzer or other good profiling tools are clearly the best means of locating hot spots and of quantifying the improvements once they are fixed. Developers who rely only on their experience are likely to spend fruitless hours tweaking the wrong code.
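A profiler is the right tool for locating hot spots, but the "quantifying the improvements" step can be approximated with a small timing harness once a candidate fix exists. The sketch below uses only the standard chrono library. The function name average_microseconds is hypothetical, and a wall-clock average like this is far coarser than what a profiler reports, so treat it as a sanity check rather than a substitute.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` the given number of times and
// returns the average wall-clock time per run, in microseconds.
template <typename Fn>
double average_microseconds(Fn&& work, int repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;  // monotonic clock
    const auto start = clock::now();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repetitions;
}
```

Timing the same workload before and after a change, with enough repetitions to smooth out noise, gives a rough figure for whether the change actually helped.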
The following item is related to this article: How to Establish Sound Coding Practices: Things to Do.
When Performance Really Counts: 5 Things to Do, 5 To Avoid