On a single-core system with HyperThreading, a thread whose KMP_BLOCKTIME has not yet expired would be doing something like a SpinLock, which could functionally be called a "SpinGo". To accomplish this, the stalled thread must be looking at a shared memory variable, and in order to force cache coherency with the other processors (the other HT thread in this case) an instruction is issued (LOCK?) that forces all processors to invalidate their caches so that all can see the potential change in the value of the variable used for the "SpinGo".
There are two opposing forces in effect. The waiting thread wants to get going as soon as possible, so it keeps asking "Are we there yet, are we there yet, ..." in order to get the answer as early as it can. The opposing force is the other thread, which is trying to execute up to the synchronization point and flag "we are there now". But in the process of getting "there", its cache keeps getting flushed due to the activities of the impatient thread.
One way to fix this is to reduce the frequency of polling the flag.
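To make the two alternatives concrete, here is a minimal sketch of what I mean. This is not taken from the OpenMP runtime; the names spin_tight and spin_backoff and the max_delay parameter are made up for illustration, and I am using std::this_thread::yield() as a portable stand-in for whatever the runtime might do between polls (on x86 a PAUSE instruction would be the more likely choice). Both functions return the number of times the flag was read, so the polling frequency can be compared.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical helper: the "impatient" wait. The flag is re-read on every
// loop iteration with nothing in between.
inline std::uint64_t spin_tight(const std::atomic<bool>& go) {
    std::uint64_t polls = 1;                       // counts the read below
    while (!go.load(std::memory_order_acquire)) {
        ++polls;                                   // counts the next read
    }
    return polls;
}

// Hypothetical helper: the "patient" wait. After each unsuccessful read the
// waiter backs off for a growing number of yields (capped at max_delay), so
// the flag is polled less and less often while the other thread works.
inline std::uint64_t spin_backoff(const std::atomic<bool>& go,
                                  unsigned max_delay = 1024) {
    std::uint64_t polls = 1;
    unsigned delay = 1;
    while (!go.load(std::memory_order_acquire)) {
        for (unsigned i = 0; i < delay; ++i) {
            std::this_thread::yield();             // stand-in for PAUSE or similar
        }
        if (delay < max_delay) {
            delay *= 2;                            // exponential backoff
        }
        ++polls;
    }
    return polls;
}
```

The working thread would set the flag with go.store(true, std::memory_order_release) when it reaches the synchronization point. The trade-off is the one described above: the backoff version leaves more of the shared core to the working thread, at the cost of noticing the flag a little later.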
The question I have is: which way is implemented?
The reason I ask is that OpenMP on a single-core HT processor runs significantly slower than a single thread. From the literature it would seem that some improvement would be expected (10%-20%, depending on the application).