Determine whether performance degradation (or lower-than-expected performance benefit) from Hyper-Threading Technology is due to exceeding the write-combining buffer capacity. A write-combining (WC) store buffer accumulates multiple stores in the same cache line before eventually writing the combined data farther out into the memory hierarchy, to accelerate processor write performance.
If an application writes to more than four cache lines at about the same time, the write-combining store buffers will begin to be flushed to the second-level cache. The Intel® Pentium® 4 Processor and Intel® Xeon® Processor Optimization Guide recommends writing to no more than four distinct addresses or arrays in an inner loop (in essence writing to no more than four cache lines at a time) for best performance.
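The recommendation above can be illustrated with loop fission. The sketch below is hypothetical (the array names and sizes are assumptions, not from the guide): one inner loop writing six distinct arrays can exceed the four write-combining buffers, while splitting it into two loops keeps each inner loop's write streams within four cache lines at a time.

```c
#include <stddef.h>

#define N 4096  /* assumed array length for illustration */

/* Six write streams in one inner loop: the WC buffers may be
 * flushed to the second-level cache before lines fill. */
void fill_fused(float *a, float *b, float *c, float *d, float *e, float *f,
                const float *src)
{
    for (size_t i = 0; i < N; i++) {
        a[i] = src[i];
        b[i] = src[i] * 2.0f;
        c[i] = src[i] + 1.0f;
        d[i] = src[i] - 1.0f;
        e[i] = src[i] * src[i];
        f[i] = -src[i];
    }
}

/* Loop fission: at most four write streams per inner loop,
 * so each stream can keep a write-combining buffer to itself. */
void fill_fissioned(float *a, float *b, float *c, float *d, float *e, float *f,
                    const float *src)
{
    for (size_t i = 0; i < N; i++) {
        a[i] = src[i];
        b[i] = src[i] * 2.0f;
        c[i] = src[i] + 1.0f;
        d[i] = src[i] - 1.0f;
    }
    for (size_t i = 0; i < N; i++) {
        e[i] = src[i] * src[i];
        f[i] = -src[i];
    }
}
```

Both versions produce identical results; only the grouping of the write streams differs.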
Use the VTune™ Performance Analyzer to obtain clock-tick and instructions-retired event sampling runs. Conduct this sampling for the same workload with one and two threads on a single processor, or with two and four threads on a dual processor, with Hyper-Threading Technology enabled. Functions that execute about the same number of instructions in the two runs but require far more processor clocks when running two threads per physical processor should be examined more closely.
While there are several other possible causes of increased execution time with Hyper-Threading Technology, examining the annotated source code from the analyzer should quickly indicate whether the likely cause is a stall from exceeding the capacity of the write-combining store buffers. If an inner loop has a significantly higher CPI (the ratio of clock-tick events to instructions-retired events) when run multi-threaded with Hyper-Threading Technology enabled than when run single-threaded, or multi-threaded on a dual physical processor system, the loop is likely writing to too many locations to achieve write combining, resulting in processor stalls.
A method for ensuring that write-combining store buffers are optimally available for use by the execution engine on processors with Hyper-Threading Technology is covered in a separate item:
How to Take Full Advantage of Write-Combining Store Buffers on Hyper-Threading Technology-Enabled Systems

It is also possible to examine the disassembly view of code from the analyzer to see how many assembly instructions within a loop actually write to memory, and how far apart those writes are. With 64-byte cache lines, two writes fairly far apart may still combine in a single write-combining store buffer. A compiler may use processor registers to hold frequently used local variables, which do not count against the number of writes using the write-combining store buffers.
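When inspecting write addresses from the disassembly, the question is whether two writes land in the same 64-byte cache line. A minimal sketch of that check, assuming 64-byte lines as stated above (the helper name is hypothetical):

```c
#include <stdint.h>

/* Two writes can share a write-combining store buffer only if their
 * target addresses fall within the same 64-byte cache line. */
#define CACHE_LINE 64

static int same_cache_line(const void *p, const void *q)
{
    /* Dividing the address by the line size yields the line index;
     * equal indices mean the writes combine into one buffer. */
    return ((uintptr_t)p / CACHE_LINE) == ((uintptr_t)q / CACHE_LINE);
}
```

For example, writes to offsets 0 and 63 of a 64-byte-aligned buffer share a line (and thus a buffer), while offsets 0 and 64 do not.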
Conversely, some compilers may apply optimizations that combine multiple inner loops into a single inner loop, negating attempts to apply loop fission. This generally happens when the compiler has enough information to realize that the order of writes between two or more inner loops does not matter, because all writes are to different areas of memory. It should be possible to write the code in such a way that the compiler is unsure whether the order of writes is important, such as by passing pointers into a function so that the compiler cannot be certain the data blocks pointed to do not overlap.
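The pointer-passing approach can be sketched as follows. This is an assumed example, not code from the source: because the destinations arrive as plain pointers (not declared `restrict`), a compiler working on this translation unit alone cannot prove the two blocks do not overlap, so it must preserve the order of writes and cannot legally fuse the two inner loops back together.

```c
#include <stddef.h>

/* dst1 and dst2 may alias each other (or src) as far as the
 * compiler can tell, so the two loops below cannot be fused:
 * fusion would reorder writes whose order might matter. */
void scale_pair(float *dst1, float *dst2, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst1[i] = src[i] * 2.0f;   /* first write stream */
    for (size_t i = 0; i < n; i++)
        dst2[i] = src[i] * 0.5f;   /* second write stream */
}
```

Note the flip side: adding `restrict` to these parameters would tell the compiler the blocks cannot overlap, re-enabling exactly the fusion this pattern is meant to prevent.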
Retesting with the analyzer can quickly reveal whether performance has been brought up to match or exceed single-threaded performance.
Source: Hyper-Threading Technology and Write Combining Store Buffers: Understanding, Detecting and Correcting Performance Issues