Challenge: Establish standard coding practices in terms of desirable actions (as opposed to things to avoid, which are covered in a separate item, identified below).
Solution: Make the following five practices standard procedure:
- Make sure your algorithm works with the hardware, not against it. Common wisdom about which algorithms to use was formulated many years ago, when no great disparity existed between the speed of processors and that of memory access. Hence, algorithms were assessed strictly by the number of operations they performed. Today, this approach is no longer valid. Processors run at 3 GHz, while fast memory, such as DDR 2400, runs at 300 MHz. In theory, then, the processor can execute 10 instructions in the time a single memory fetch takes; in practice, the ratio is a fair bit worse. This differential is referred to as the 'memory gap', and it must become an important factor in algorithm selection.
Consider, for example, the problem of resolving collisions when inserting an element into a closed hash table. The traditional wisdom has been that a linear (that is, sequential) search for an open slot is less efficient than rehashing the entry and probing the new hashed-to location. This view was based on the perception that, because hash table entries often clump together, a sequential search might have to probe numerous slots before finding one available, whereas rehashing would more likely find an empty slot on the first try.
Today, this logic no longer holds true. Inspecting a random slot will frequently mean that the slot is not in cache and that the processor will have to wait while it is fetched from memory. If that slot happens to be full, a new rehash and another memory fetch ensue. With the linear search, the next slot is likely to be in level 1 cache (the cache closest to the processor's execution units), meaning no delay at all is needed to access it. As such, the linear probe executes at the fastest possible rate, while the rehash runs at the slowest rate, more than an order of magnitude slower. In fact, because the linear probe can frequently be executed out of order by the processor's speculative execution capabilities, it can take even less time than suggested here.
In view of this striking difference, reject conventional wisdom and choose algorithms that accommodate the processor's overwhelming need to keep data and instructions in cache.
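To make the contrast concrete, here is a minimal sketch of the cache-friendly approach in C++ (the class name LinearProbeTable and the use of std::optional are illustrative choices, not from the original text). After a probe miss, the next slot examined is adjacent in memory, so it is very likely already in cache; a rehash, by contrast, jumps to an unrelated slot and often forces a fresh memory fetch.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// A closed (open-addressing) hash table that resolves collisions
// with linear probing: on a miss, try the next adjacent slot.
class LinearProbeTable {
public:
    explicit LinearProbeTable(std::size_t capacity) : slots_(capacity) {}

    bool insert(int key) {
        std::size_t idx = hash(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            // Adjacent slots share cache lines, so successive probes
            // rarely stall on memory.
            std::size_t probe = (idx + i) % slots_.size();
            if (!slots_[probe].has_value()) {
                slots_[probe] = key;
                return true;
            }
        }
        return false;  // table full
    }

    bool contains(int key) const {
        std::size_t idx = hash(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            std::size_t probe = (idx + i) % slots_.size();
            if (!slots_[probe].has_value()) return false;  // empty slot ends the run
            if (*slots_[probe] == key) return true;
        }
        return false;
    }

private:
    std::size_t hash(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }
    std::vector<std::optional<int>> slots_;
};
```

A production table would also handle deletion tombstones and resizing; the sketch keeps only the probing strategy the text discusses.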
- If you cannot change the algorithm, preload the data. The processor core contains a hardware capability that predicts data fetches and preloads the expected data into cache. This mechanism is triggered once it detects a pattern of memory accesses; it then starts loading the next blocks in the pattern into cache in anticipation of need. While undeniably useful, this mechanism needs to observe several memory fetches before it recognizes a pattern and starts up.
To avoid the performance penalty of these first few memory fetches, developers can preload the data into cache in anticipation of the processor's need. The instructions to do this – the prefetchx series – were introduced with the Intel® Pentium® III processor. Their use is described in detail in Chapter 10 of the IA-32 Intel Architecture Software Developer's Manual—Volume 1: Basic Architecture. The prefetch instruction is implemented as an intrinsic function, which means it is an assembly instruction that can be called as a function from C and C++ using the Microsoft or the Intel® C/C++ compiler. Use it when you know a data structure is about to be accessed repeatedly.
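As a sketch of how the intrinsic is used, assuming the _mm_prefetch form from <xmmintrin.h> (the C/C++ spelling of the prefetch instructions under the Microsoft, GCC, Clang, and Intel compilers on x86; the function name prefetch_sum and the prefetch distance of 16 elements are illustrative and would need tuning per workload):

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0 (x86 SSE intrinsic)
#include <cstddef>

// Sum an array while hinting the hardware to pull upcoming elements
// into cache a fixed distance ahead of the current index.
double prefetch_sum(const double* data, std::size_t n) {
    const std::size_t kAhead = 16;  // prefetch distance in elements (tunable)
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
            // T0 hint: fetch into all cache levels in anticipation of need.
            _mm_prefetch(reinterpret_cast<const char*>(&data[i + kAhead]),
                         _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}
```

For a loop this simple the hardware prefetcher will usually catch the pattern on its own; explicit prefetching pays off mainly when the access pattern is regular but not one the hardware detects, such as striding through linked structures.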
- Reduce jumps and make them predictable. Fetching instructions is the only standard operation the processor performs more slowly than fetching data from memory. On Pentium® 4 and Intel® Xeon® processors, a mechanism scans upcoming instructions and executes them out of order wherever that can be done efficiently. The results of these instructions are then folded back into the stream of retired instructions, reconstructed just as if all the instructions had executed in their original order.
This mechanism enables the processor to execute more than one instruction at a time and is frequently used on branches. For instance, the processor might see an if-else test coming up, so it speculates which branch will be taken. It then races ahead, executes as many of the instructions on the predicted branch as it can, and stores the results. When the main execution pipeline reaches the branch, the processor compares the actual branch path with the one it guessed. If it guessed right, the processor puts the already executed results into the stream of retired instructions and resumes processing at the end of this series.
If it guessed wrong, however, a large penalty looms. The processor has to flush its results queue of all pre-executed instructions and jump to the new branch, where it must start execution over. Since the processor pipeline is 20 stages long, it takes the first jumped-to instruction 20 clock cycles to be executed, and during those 20 cycles the processor retires no instructions. The problem is vastly compounded if the jumped-to location is not in cache: then the processor must wait for the memory fetch before the 20-cycle wait even begins. In such a scenario, the first jumped-to instruction will have taken more than 30 clock ticks to process, meaning that for that one brief period, the processor ran at 1/30th its usual speed.
To avoid this, design code so that jumps (such as if/else clauses, switch statements, and for loops) all perform the minimum number of comparisons. Try to make the tests as predictable as possible, and make the branches short.
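One common way to apply this advice can be sketched as follows (the function names are illustrative): when a branch's outcome depends on unpredictable data, recasting the conditional as arithmetic removes the jump entirely, giving the branch predictor nothing to mispredict.

```cpp
#include <cstddef>

// Sum only the positive elements of an array, two ways.

// Branching version: whether the branch is taken depends on each
// data value, so on random data it is mispredicted often.
long sum_positive_branchy(const int* a, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] > 0)
            sum += a[i];
    }
    return sum;
}

// Branchless version: the condition becomes a 0-or-1 multiplier,
// turning the control dependency into a data dependency the
// pipeline can execute without guessing a jump target.
long sum_positive_branchless(const int* a, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * (a[i] > 0);  // adds 0 when a[i] <= 0
    }
    return sum;
}
```

The branchless form trades a possible misprediction for an unconditional multiply; it wins when the data is unpredictable, while a highly predictable branch (say, mostly-positive data) can make the branching form just as fast.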
- Use threads. By means of threads, the processor can execute other tasks while one task, for example, is waiting for disk I/O to occur. Any time there are tasks that could run simultaneously, you should consider using threads. Thread candidates include downloading files, handling multiple GUI events, and disk activity, as well as processing that involves queues, stacks, and other double-ended data structures.
There is no doubt that threads accelerate performance, but they do come at the cost of complexity. Threads make programs more challenging to design correctly, to debug, and to test. They are unarguably the wave of the future, however, due to Intel's delivery of Hyper-Threading Technology for the desktop. Within a few years, all desktop programs will use threads, as server applications already do. Developers must get up to speed on programming for threads and put them to use today.
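A minimal sketch of the idea, assuming C++11 std::thread (which postdates the original text; contemporary code would have used Win32 threads or pthreads): two threads each work on an independent half of a problem, so no locking is needed, and the main thread simply waits for both.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Split a summation across two threads. Each thread writes only
// its own result variable and reads only its own half of the data,
// so the threads never contend for shared state.
long parallel_sum(const std::vector<int>& data) {
    long lo = 0, hi = 0;
    std::size_t mid = data.size() / 2;
    std::thread t1([&] {
        lo = std::accumulate(data.begin(), data.begin() + mid, 0L);
    });
    std::thread t2([&] {
        hi = std::accumulate(data.begin() + mid, data.end(), 0L);
    });
    t1.join();  // wait for both halves before combining
    t2.join();
    return lo + hi;
}
```

The complexity cost the text warns about appears as soon as the threads must share mutable state; partitioning the work so they do not, as here, is the simplest way to keep threaded code correct.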
- Start designing loops with Hyper-Threading Technology in mind. Analysts expect Hyper-Threading Technology to become a mainstream desktop feature quickly. This technology executes two separate threads simultaneously on a processor's execution pipeline. As long as the two threads are not contending for the same resources, they will both run fine. If, however, one thread is using resources required by the other, the second thread has no choice but to wait. In such a case, there is little performance gain.
To avoid this situation, put threads with similar needs on different physical processors. Ideally, a processor would have one thread performing floating-point math while the other compares strings. Such different operations will cause very little contention, and good leverage of the technology will ensue. However, if both threads are doing floating-point operations, they will interfere with each other, and the benefit from Hyper-Threading Technology will drop off significantly.
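The pairing described above can be sketched as follows (the struct and function names are illustrative; real code would also need platform-specific affinity calls to control which threads share a physical core): one thread exercises the floating-point units while the other does integer and string work, so on a Hyper-Threaded core they contend for different execution resources.

```cpp
#include <cmath>
#include <cstring>
#include <thread>

struct Results {
    double fp_total;  // written only by the floating-point thread
    int matches;      // written only by the string thread
};

Results run_dissimilar_threads(const char* const* words, int nwords,
                               const char* target) {
    Results r{0.0, 0};
    // Floating-point workload: keeps the FP units busy.
    std::thread fp_thread([&] {
        double s = 0.0;
        for (int i = 1; i <= 1000; ++i)
            s += std::sqrt(static_cast<double>(i));
        r.fp_total = s;
    });
    // String workload: keeps the integer/load-store units busy.
    std::thread str_thread([&] {
        for (int i = 0; i < nwords; ++i)
            if (std::strcmp(words[i], target) == 0)
                ++r.matches;
    });
    fp_thread.join();
    str_thread.join();
    return r;
}
```

Each thread writes a distinct member of Results, so there is no data race; if both threads instead ran the floating-point loop, they would queue for the same execution units and the Hyper-Threading benefit would largely vanish, as the text describes.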
The following item is related to this one:
How to Establish Sound Coding Practices: Things to Avoid.