Context: the Nehalem/Sandy Bridge CPU architecture.
One thread (bound to core 1) writes data to the same cache line in sequence:
while(1)
{
..........
Write A
Write B
Write C
}
Another thread (bound to core 2) reads the same data in the same sequence:
while(1)
{
...........
Read A
Read B
Read C
...........
}
Assume that the cache line in which A, B, and C reside is marked Shared (S) before the first core
starts writing A, B, C.
The interesting question is: when is the store to C going to be retired (written to the L1 cache) so that it becomes visible to the second core?
Logically, it should be retired right after the write to C appears in the store buffer, but I could imagine that the CPU might not retire data from the store buffer as soon as it appears there, and instead waits until, say, the buffer has at least 2 elements to retire, so that it could combine B and C in one shot.
In my dev environment, an update to C is read with an 80 ns delay, so knowing how store retirement works might help me find a way to improve the latency.
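To make the 80 ns figure reproducible, here is a minimal ping-pong sketch of how the delay could be measured. It is not my production code: the names `Payload`, `Ack`, `Result`, and `measure_round_trip` are made up for illustration, the 64-byte line size is an assumption, and thread pinning is left out because it is OS-specific. The writer stores A, B, C into one cache line, the reader spins on C and acknowledges through a second cache line, so the measured time is a full round trip (two cache line handoffs), not the one-way delay.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Assumption: 64-byte cache lines. A, B, and C share one line.
struct alignas(64) Payload {
    std::uint64_t a{0};
    std::uint64_t b{0};
    std::atomic<std::uint64_t> c{0};  // written last, acts as the "ready" flag
};

// The acknowledgement lives in its own line to avoid false sharing.
struct alignas(64) Ack {
    std::atomic<std::uint64_t> seq{0};
};

struct Result {
    double avg_round_trip_ns;   // average writer -> reader -> writer time
    std::uint64_t mismatches;   // times the reader saw C but stale A or B
};

Result measure_round_trip(std::uint64_t iterations) {
    assert(iterations > 0);
    Payload data;
    Ack ack;
    std::uint64_t mismatches = 0;  // written only by the reader, read after join

    std::thread reader([&] {
        for (std::uint64_t i = 1; i <= iterations; ++i) {
            // Spin until the store to C becomes visible on this core.
            while (data.c.load(std::memory_order_acquire) != i) {}
            // Read A and B after C here; the acquire load makes that safe.
            if (data.a != i || data.b != i) ++mismatches;
            ack.seq.store(i, std::memory_order_release);
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 1; i <= iterations; ++i) {
        data.a = i;                                   // Write A
        data.b = i;                                   // Write B
        data.c.store(i, std::memory_order_release);   // Write C
        // Wait for the reader before touching the line again.
        while (ack.seq.load(std::memory_order_acquire) != i) {}
    }
    auto stop = std::chrono::steady_clock::now();
    reader.join();

    double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return Result{total_ns / static_cast<double>(iterations), mismatches};
}
```

Calling `measure_round_trip(1000000)` from `main` and halving `avg_round_trip_ns` gives a rough estimate of the one-way visibility delay, assuming both directions cost about the same. The numbers are only meaningful if the two threads actually run on different cores, so pin them the same way as in the real setup.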



