Challenge
Ensure that write-combining store buffers are optimally available for use by the execution engine on processors with Hyper-Threading Technology. To accelerate processor write performance, a write-combining (WC) store buffer accumulates multiple stores in the same cache line before eventually writing the combined data farther out into the memory hierarchy.
If an application writes to more than four cache lines at about the same time, the write-combining store buffers will begin to be flushed to the second-level cache. The Intel® Pentium® 4 Processor and Intel® Xeon® Processor Optimization Guide recommends writing to no more than four distinct addresses or arrays in an inner loop (in essence writing to no more than four cache lines at a time) for best performance. The following code loop would violate that precept by writing to six distinct locations:
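As a minimal sketch of such a loop (the array names `a` through `f`, the function names, and the arithmetic are illustrative assumptions, not the guide's own listing), the first function below stores to six arrays in a single loop body, and the second splits the same work into two loops that each store to three arrays:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: one loop body stores to six different arrays,
   so six cache lines are being written at about the same time. */
void fill_six_streams(float *a, float *b, float *c,
                      float *d, float *e, float *f,
                      const float *src, size_t n)
{
    assert(src != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] + 2.0f;
        c[i] = src[i] + 3.0f;
        d[i] = src[i] + 4.0f;
        e[i] = src[i] + 5.0f;
        f[i] = src[i] + 6.0f;
    }
}

/* Same results, but the work is split into two loops.
   Each loop stores to only three arrays per iteration. */
void fill_split(float *a, float *b, float *c,
                float *d, float *e, float *f,
                const float *src, size_t n)
{
    assert(src != NULL);
    for (size_t i = 0; i < n; i++) {   /* first loop: a, b, c */
        a[i] = src[i] + 1.0f;
        b[i] = src[i] + 2.0f;
        c[i] = src[i] + 3.0f;
    }
    for (size_t i = 0; i < n; i++) {   /* second loop: d, e, f */
        d[i] = src[i] + 4.0f;
        e[i] = src[i] + 5.0f;
        f[i] = src[i] + 6.0f;
    }
}
```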
When the loop is split in two, each inner loop writes to just three distinct locations per iteration, so it uses only three write-combining store buffers at a time and write combining remains effective.
With Hyper-Threading Technology-enabled processors, the WC store buffers are shared between two logical processors on a single physical processor. Therefore, the total number of simultaneous writes by both threads running on the two logical processors must be counted in deciding whether the WC store buffers can handle all the writes.
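The counting rule above can be sketched as simple arithmetic. The helper names below are hypothetical, the budget of four comes from the guideline quoted earlier, and dividing that budget evenly between the two logical processors is an assumption made for illustration rather than a documented requirement:

```c
#include <assert.h>

/* Guideline from the text: write to at most four cache lines at about
   the same time. On a Hyper-Threading processor this budget covers
   both logical processors together. */
enum { WC_STREAM_BUDGET = 4 };

/* Returns 1 if the combined number of store streams written by the two
   threads sharing one physical processor stays within the guideline. */
int wc_budget_ok(int streams_thread0, int streams_thread1)
{
    assert(streams_thread0 >= 0 && streams_thread1 >= 0);
    return streams_thread0 + streams_thread1 <= WC_STREAM_BUDGET;
}

/* Number of split loops needed so that each loop writes to at most
   per_loop_limit distinct streams (rounded up). */
int loops_needed(int total_streams, int per_loop_limit)
{
    assert(total_streams >= 0 && per_loop_limit > 0);
    return (total_streams + per_loop_limit - 1) / per_loop_limit;
}
```

For example, two threads that each write to three arrays per iteration total six concurrent streams, so `wc_budget_ok(3, 3)` returns 0. If each thread is instead limited to two streams (the assumed even split), a loop that writes six arrays needs `loops_needed(6, 2)`, that is three loops per thread.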
Source
Threading Methodology: Principles and Practices
Hyper-Threading Technology and Write Combining Store Buffers: Understanding, Detecting, and Correcting Performance Issues