I realized that I don't fully understand the nature of the cache line contention performance problem, and I failed to find a solid paper on the subject, so maybe the experts in this forum can help.

Note that I'm not talking about the false-sharing case, where it is trivial to discover the problem, estimate its impact, and fix it. Rather, I'm talking about a regular spin-lock mutex vs. a queuing spin-lock.

Consider the regular spin-lock case first. Say thread1 unlocks the mutex: its core issues an RFO that invalidates the copies held by its neighbors, takes the line in state E, and Modifies it. Then thread2 and thread3, which were spinning on the variable in this line, would see that their copy is Invalidated, fetch the new line content, and set it to Shared. Now both of them want to enter the critical region and both know that the mutex is unlocked, so they'll both try to modify the mutex variable.

Where would the overall performance impact come from? Would both caches (for the cores that run thread2 and thread3) try to set the Exclusive state for the cache line? Do they waste cycles doing this, or is only one "E" allowed, so that the second one sees it immediately and assumes "I" for the line?

In some other discussions that I've read on the subject, I'm seeing that the performance problem comes from "line/snoop traffic". Is that a measurable value? In other words, if I know how many lock()/unlock() calls happened in one second and I know the average number of threads waiting on the mutex, would I be able to estimate this "traffic value"? It looks like I should be able to, at least for the MESI single-socket case, no?

Any help is greatly appreciated. And yes, I can be wrong about anything I wrote while describing the example, so don't hesitate to correct me or rewrite the whole scenario. Thanks!

P.S. I mentioned the queuing mutex and then never got back to it. The point was: when looking at the traffic, how different is the spin-lock traffic value from the queuing spin-lock traffic value?
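To make the two cases concrete, here is a minimal C++ sketch of the two kinds of lock I have in mind: a regular test-and-test-and-set spin-lock, where every waiter spins on the same cache line, and an MCS-style queuing spin-lock, where each waiter spins on a flag in its own node. All names (`SpinLock`, `McsLock`, `McsNode`, `count_with_spinlock`, `count_with_mcs`) are made up for illustration, the 64-byte line size is an assumption, and the `yield()` calls are only there so the sketch behaves on an oversubscribed machine. It is not taken from any particular library and is not tuned.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Regular spin-lock (test-and-test-and-set): all waiters share one line.
struct SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // Spin on a plain load so waiters can keep the line in Shared state.
            while (locked.load(std::memory_order_relaxed))
                std::this_thread::yield();  // assumption: a real lock would use a pause hint
            // The lock looks free; the exchange needs exclusive ownership of the line.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }

    // This store invalidates the copy held by every spinning waiter.
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Queuing spin-lock (MCS style): each waiter spins on its own node.
struct alignas(64) McsNode {  // assumption: 64-byte cache lines
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> waiting{false};
};

struct McsLock {
    std::atomic<McsNode*> tail{nullptr};

    void lock(McsNode& me) {
        me.next.store(nullptr);
        me.waiting.store(true);
        McsNode* prev = tail.exchange(&me);  // join the queue
        if (prev) {
            prev->next.store(&me);           // link behind the predecessor
            while (me.waiting.load())        // spin on our own line only
                std::this_thread::yield();
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load();
        if (!succ) {
            McsNode* expected = &me;
            // No visible successor: try to mark the queue as empty.
            if (tail.compare_exchange_strong(expected, nullptr))
                return;
            // A successor is mid-enqueue; wait until it links itself.
            while (!(succ = me.next.load()))
                std::this_thread::yield();
        }
        succ->waiting.store(false);          // hand off: touches one waiter's line
    }
};

// Hypothetical harness: several threads increment a counter under the lock.
long count_with_spinlock(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) { lock.lock(); ++counter; lock.unlock(); }
        });
    for (auto& th : pool) th.join();
    return counter;
}

long count_with_mcs(int threads, int iters) {
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            McsNode me;  // per-thread queue node, reused for every acquisition
            for (int i = 0; i < iters; ++i) { lock.lock(me); ++counter; lock.unlock(me); }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

My rough understanding, which may well be wrong, is that in the first lock every release is observed by all N waiters, so the number of line transfers per hand-off should grow with N. In the queuing lock a release writes to a single successor's node, so it should stay roughly constant. Whether that is the right way to think about the "traffic value" is exactly what I'm asking.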
What is the nature and performance impact of cache line contention really?