This code is not thread-safe, and without locks or sync points in place no guarantees or predictions can be made about the order of retirement.
Store buffers are per-hardware-thread structures and provide ordering only at that level. In your writer thread code, there is no requirement for when A, B, and C need to be retired, since there are no read instructions in that same thread. In fact, it is equally possible that the loop will continue without writing back to cache at all, since the hardware may notice that the writes to A, B, and C are never read before being overwritten. It is likely that the cache write-back (which is what makes the data visible to the reader thread) only happens because of a completely unrelated event, such as preemption of the writer thread.
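
As a rough illustration of what a sync point could look like here, the following is a minimal C++11 sketch, not the asker's actual code. The names `Line`, `Snapshot`, `publish`, `wait_for`, and `run_once` are hypothetical, and the 64-byte cache line size is an assumption. C is made a `std::atomic` and used as the publication flag: the writer stores A and B normally and then stores C with release ordering, and the reader polls C with acquire ordering before reading A and B. Note that the reader checks C first, which differs from the A, B, C read order in the question.

```cpp
#include <atomic>
#include <thread>

// Hypothetical layout: A, B and C share one cache line (64 bytes assumed).
struct alignas(64) Line {
    int a = 0;
    int b = 0;
    std::atomic<int> c{0};  // C doubles as the "data is ready" flag
};

struct Snapshot {
    int a;
    int b;
    int c;
};

// Writer side: plain stores to A and B, then a release store to C.
// The release store orders the earlier writes before the write to C.
void publish(Line& line, int value) {
    line.a = value;
    line.b = value;
    line.c.store(value, std::memory_order_release);
}

// Reader side: spin until C holds the expected value. The acquire load
// pairs with the release store, so A and B are safe to read afterwards.
// `expected` must be non-zero, because zero is the initial value of C.
Snapshot wait_for(Line& line, int expected) {
    while (line.c.load(std::memory_order_acquire) != expected) {
        // busy-wait; a real loop might add a pause or backoff here
    }
    return Snapshot{line.a, line.b, expected};
}

// Runs one writer and one reader thread for a single publication.
Snapshot run_once(int value) {
    Line line;
    Snapshot seen{0, 0, 0};
    std::thread reader([&] { seen = wait_for(line, value); });
    std::thread writer([&] { publish(line, value); });
    writer.join();
    reader.join();
    return seen;
}
```

Calling `run_once(42)` should return a snapshot in which A, B, and C are all 42. The sketch covers a single publication only. A writer that keeps overwriting A, B, and C in a loop would need something more, for example a sequence counter that the reader rechecks. The release/acquire pair guarantees ordering and visibility once the new C is observed. It makes no promise about how quickly the store leaves the store buffer.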




Memory store retirement
In the context of the Nehalem/Sandy Bridge CPU architecture.
One thread (bound to core 1) writes data to the same cache line in sequence:
while(1)
{
..........
Write A
Write B
Write C
}
Another thread (bound to core 2) reads the same data in the same sequence:
while(1)
{
...........
Read A
Read B
Read C
...........
}
It is assumed that the cache line in which A, B, and C reside is marked as Shared (S) before the first core
starts writing A, B, and C.
A very interesting question is when C is going to be retired (written to the L1 cache) so that it becomes visible to the second core.
Logically, it should be retired right after the write to C appears in the store buffer, but I could imagine that the CPU does not retire data from the store buffer as soon as it appears there, and instead waits until, say, the buffer holds at least 2 elements to retire, so that it can combine B and C in one shot.
In my dev environment, updates to C are being read with an 80 ns delay, so knowing how store retirement works might help me find a way to improve the latency.