Write-Combining Store Buffers on Hyper-Threading Technology-Enabled Systems

April 30, 2009 9:00 PM PDT


Challenge

Ensure that write-combining store buffers are optimally available for use by the execution engine on processors with Hyper-Threading Technology. To accelerate processor write performance, a write-combining (WC) store buffer accumulates multiple stores to the same cache line before eventually writing the combined data farther out into the memory hierarchy.

If an application writes to more than four cache lines at about the same time, the write-combining store buffers begin to be flushed to the second-level cache. The Intel® Pentium® 4 Processor and Intel® Xeon® Processor Optimization Guide therefore recommends writing to no more than four distinct addresses or arrays in an inner loop (in essence, writing to no more than four cache lines at a time) for best performance. The following loop violates that guideline by writing to six distinct locations:


for( )
{
    for( i = 1 to N )
    {
        A[i] = data1;
        B[i] = data2;
        C[i] = data3;
        D[i] = data4;
        E[i] = data5;
        F[i] = data6;
    }
}

Solution

Split inner-loop code into multiple inner loops, each of which writes to no more than two regions of memory. Generally, look for data being written to arrays with an incrementing index, or for stores via pointers that move sequentially through memory. Writes to elements of a modest-sized structure, or to several sequential data locations, can usually be counted as a single write, since they will often fall into the same cache line and be write-combined in a single WC store buffer.
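To make the "counts as a single write" rule concrete, the following minimal sketch counts how many distinct cache lines a handful of store addresses touch. The helper names and the 64-byte line size are assumptions made for illustration; they do not come from the article, and the line size should be checked against the target processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes (illustrative; verify for your processor). */
#define CACHE_LINE_BYTES 64u

/* Hypothetical helper: index of the cache line that contains address p. */
static uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE_BYTES;
}

/*
 * Hypothetical helper: number of distinct cache lines touched by n store
 * addresses. The quadratic scan is acceptable because an inner loop only
 * has a handful of store targets.
 */
static size_t count_distinct_lines(const void *const addrs[], size_t n)
{
    size_t distinct = 0;

    assert(addrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (cache_line_of(addrs[j]) == cache_line_of(addrs[i])) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Passing the addresses of several fields of one small structure would typically yield a count of one if the structure does not straddle a line boundary, whereas the current elements of six separate arrays would typically yield six.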

The inner loop of the code given in the Challenge section above needs to be split into two loops, like this:

for( ) // over the height of frame
{
    for( )
    {
        A[i] = data1;
        B[i] = data2;
        C[i] = data3;
    }

    for( )
    {
        D[i] = data4;
        E[i] = data5;
        F[i] = data6;
    }
}

Each inner loop now writes to just three distinct locations per iteration, so it uses only three write-combining store buffers and stays within the four-cache-line guideline, which lets the stores be combined effectively.
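The split loop can be written as compilable C along the following lines. This is only a sketch: the function name, the `int` element type, the `data` parameter, and the 0-based indexing are assumptions made for illustration, and the outer loop is left to the caller because the article does not specify it. An optimizing compiler may also fuse or otherwise transform these loops, so the generated code is worth inspecting.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical example: fill six arrays of n elements using two inner
 * loops, so that each loop has only three store streams in flight.
 * data[0..5] play the role of data1..data6 in the pseudocode.
 */
static void fill_split(size_t n,
                       int *A, int *B, int *C,
                       int *D, int *E, int *F,
                       const int data[6])
{
    assert(A && B && C && D && E && F && data);

    /* First pass: three store streams (A, B, C). */
    for (size_t i = 0; i < n; i++) {
        A[i] = data[0];
        B[i] = data[1];
        C[i] = data[2];
    }

    /* Second pass: the remaining three store streams (D, E, F). */
    for (size_t i = 0; i < n; i++) {
        D[i] = data[3];
        E[i] = data[4];
        F[i] = data[5];
    }
}
```

The result is identical to that of the single six-store loop; only the order in which memory is written changes.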

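With Hyper-Threading Technology-enabled processors, the WC store buffers are shared between the two logical processors on a single physical processor. Therefore, the total number of simultaneous writes by both threads running on the two logical processors must be counted when deciding whether the WC store buffers can handle all the writes. For example, if both logical processors run a three-store inner loop at the same time, together they write to six cache lines, which again exceeds the four-line guideline; limiting each inner loop to two write streams keeps the combined total at four.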
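The counting described here can be sketched as simple arithmetic. The helper names below are hypothetical, the budget of four comes from the four-cache-line guidance quoted earlier in this article, and splitting that budget evenly between the logical processors is an assumption made for illustration rather than something the hardware enforces.

```c
#include <assert.h>

/* Write streams that can be combined at once, per the four-cache-line guidance. */
#define WC_STREAM_BUDGET 4u

/*
 * Hypothetical helper: store streams each thread can use when
 * active_threads logical processors share the budget evenly
 * (an assumption made for illustration). Never returns less than 1.
 */
static unsigned streams_per_thread(unsigned active_threads)
{
    assert(active_threads > 0);
    unsigned share = WC_STREAM_BUDGET / active_threads;
    return share > 0 ? share : 1u;
}

/*
 * Hypothetical helper: number of inner loops needed so that each loop
 * writes to at most per_loop_budget streams (ceiling division).
 */
static unsigned loops_needed(unsigned total_streams, unsigned per_loop_budget)
{
    assert(per_loop_budget > 0);
    return (total_streams + per_loop_budget - 1u) / per_loop_budget;
}
```

Under these assumptions, a six-stream loop needs two inner loops when one thread has the buffers to itself, `loops_needed(6, streams_per_thread(1))`, and three inner loops when two threads share them, `loops_needed(6, streams_per_thread(2))`.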

Source

Threading Methodology: Principles and Practices

Hyper-Threading Technology and Write Combining Store Buffers: Understanding, Detecting, and Correcting Performance Issues