Determine the optimum size for cache blocks in applications for systems that do not support Hyper-Threading Technology. For optimal performance, cache blocks should be sized so that each data element is reused while its block is still resident in the cache.
This issue is addressed for systems that support Hyper-Threading Technology in a separate item, How to Optimize Cache Block Size on Processors that Support Hyper-Threading Technology.
Size cache blocks at approximately one-half to three-quarters of the physical cache size on systems that do not support Hyper-Threading Technology. In general, it is better to err on the side of a block size that is too small than one that is too large. Refer to the IA-32 Intel Architecture Software Developer's Manual – Volume 2: Instruction Set Reference for instructions on how to gather cache size information using the CPUID instruction.
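The sizing guideline above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name cache_block_bytes and its parameters are hypothetical, the cache size and line size are passed in by the caller (for example, after being read with CPUID) rather than queried here, and rounding down to a whole number of cache lines is a choice made for this sketch, not something the guideline requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original item): choose a block size as
   the fraction numer/denom of the cache size, then round down to a whole
   number of cache lines. Rounding down follows the advice to err on the
   side of a block that is too small. */
size_t cache_block_bytes(size_t cache_bytes, size_t line_bytes,
                         size_t numer, size_t denom)
{
    assert(line_bytes > 0 && denom > 0 && numer <= denom);

    /* Divide before multiplying so the intermediate value cannot overflow;
       any truncation only makes the block slightly smaller. */
    size_t raw = cache_bytes / denom * numer;

    /* Drop the partial cache line at the end, if any. */
    return raw - raw % line_bytes;
}
```

For a 512 KB cache and an assumed 64-byte line size, cache_block_bytes(512 * 1024, 64, 1, 2) yields 262144 bytes (256 KB) and cache_block_bytes(512 * 1024, 64, 3, 4) yields 393216 bytes (384 KB). These are starting points only; the best value for a given application still has to be found by measurement.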
The figure below shows the results of cache blocking with varying block sizes on a sample application. At the sweet spot of around 450-460 KB, the block size closely matches the size of the unified L2 cache, and the application almost doubles in performance. The block-size sweet spot for any given application will vary based on how much of the L2 cache is used by other cached data within the application, as well as by cached instructions from the application.
The performance of the data cache-blocking technique scales well with multiple processors if the algorithm is threaded using data decomposition. Because each block of data can be processed independently of the other blocks, the work divides naturally into blocks that can be processed in separate threads of execution.
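One way to picture this data decomposition is a function that gives each thread a contiguous run of blocks. The sketch below is an assumption for illustration only: the names block_range and blocks_for_thread are hypothetical, thread creation is left to whatever threading API the application already uses, and an even contiguous split is just one possible assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the run of cache blocks owned by one thread. */
typedef struct {
    size_t first_block;  /* index of the first block for this thread */
    size_t block_count;  /* number of consecutive blocks it processes */
} block_range;

/* Split total_blocks as evenly as possible across nthreads. The first
   (total_blocks % nthreads) threads each receive one extra block, and
   every thread gets a contiguous, non-overlapping range. */
block_range blocks_for_thread(size_t total_blocks, size_t nthreads, size_t tid)
{
    assert(nthreads > 0 && tid < nthreads);

    size_t base  = total_blocks / nthreads;  /* blocks every thread gets */
    size_t extra = total_blocks % nthreads;  /* leftover blocks to spread */

    block_range r;
    r.block_count = base + (tid < extra ? 1 : 0);
    r.first_block = tid * base + (tid < extra ? tid : extra);
    return r;
}
```

Each thread would then loop over its own range and process one block at a time. Because the ranges do not overlap, threads that only touch their own blocks need no synchronization while processing them.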
The figure also shows the performance improvement of the cache-blocking algorithm for two threads running on a system with two physical processors. The performance curve for two threads matches the curve for a single-processor system very closely, with the block-size sweet spot at around 450-460 KB per thread, but at approximately twice the performance.
The two processors have independent caches of equal size; in this case, each has 512 KB of L2 cache available. Assuming that very little synchronization is necessary between the two threads, it is therefore reasonable to expect that the block-size sweet spot would not vary significantly.
This item is meant to be used in conjunction with a data-blocking technique, an example of which is given in the separate item, How to Create Cache-Data Blocks.