The Intel 64 and IA-32 Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2, states:
"Any two stores are seen in a consistent order by processors other than those performing the stores."
But can this be so?
The reason I ask is this: Consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share one L1/L2 cache, and its logical processors 2 and 3 share a different L1/L2 cache -- whereas all four logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache -- write to the same memory location at about the same time, and that the writes go no deeper than L2 for the moment. Could not logical processors 1 and 3 (which are "processors other than those performing the stores") then see the two stores in inconsistent orders, each observing its own core's store first?
To achieve consistency, must not logical processors 0 and 2 issue SFENCE instructions, and logical processors 1 and 3 issue LFENCE instructions? Nevertheless, the Manual seems to think otherwise, and indeed supports its position with a seemingly clear example in sect. 8.2.3.7.
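To make the question concrete, here is a minimal sketch of the kind of test I have in mind. It is my own illustration, not code from the Manual: the function name `count_inconsistent_runs` and the thread names are hypothetical, and it uses two locations `x` and `y` (as the Manual's example does) rather than a single location. It also relies on C++ `std::atomic` with its default sequentially consistent ordering, which is stronger than plain x86 stores and loads, so it checks the language-level guarantee rather than proving anything about the hardware.

```cpp
#include <atomic>
#include <thread>

// Hypothetical litmus test: two writers store to x and y, and two
// observers read the locations in opposite orders. A run counts as
// "inconsistent" if observer o1 sees x written before y while
// observer o3 sees y written before x.
int count_inconsistent_runs(int runs) {
    int inconsistent = 0;
    for (int i = 0; i < runs; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

        // Default memory order for store() and load() is seq_cst.
        std::thread w0([&] { x.store(1); });                    // like logical processor 0
        std::thread w2([&] { y.store(1); });                    // like logical processor 2
        std::thread o1([&] { r1 = x.load(); r2 = y.load(); });  // like logical processor 1
        std::thread o3([&] { r3 = y.load(); r4 = x.load(); });  // like logical processor 3

        w0.join();
        w2.join();
        o1.join();
        o3.join();

        // o1 saw x == 1 then y == 0; o3 saw y == 1 then x == 0.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) {
            ++inconsistent;
        }
    }
    return inconsistent;
}
```

Calling `count_inconsistent_runs(1000)` should return 0 under the C++ memory model, since sequentially consistent operations form a single total order. Nothing here pins threads to particular logical processors, so the mapping to processors 0 through 3 is only notional. My question is whether the hardware gives the same answer for ordinary stores and loads, with no fences at all.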
Does this mean that every cache at every level snoops all writes to every other cache, even across cores, even across packages? If so, would this not imply that every store to a valid local line of cache must lock the global address bus? This does not sound right. Is it right? After all, what is the point of multithreading, what is the point of caching, when ordinary stores must always lock global resources? I am confused.
(Incidentally, I have indeed checked the Manual's latest errata, which do not seem to address the matter.)