Spinning

Spinning

Although this is directly related to "Additions to atomic" I decided to spin it off into a thread of its own (please forgive the word play).

I have been confused about the best way to spin either for changing an atomic variable or for acquiring a mutex. Notice the different characteristics in the duration of the critical section (extremely short resp. the programmer thought it would be fairly short). Another characteristic that may be relevant is the level of contention, and this is currently being considered dynamically by the use of a backoff mechanism. Then there's the consideration for possible other work in another thread on the same core, so there's not just bandwidth to be considered when spinning. And at the hardware side: do all platforms in scope use write-and-update rather than write-and-invalidate, and what happens on each one for a failed CAS, and for the whole algorithm?

TBB itself recently made a change (details omitted here) that would indicate that its current approach needs further vetting. In my most recently contributed change for atomic I made a big mistake (quickly retracted) that alerted me to the fact that I'm not quite the expert on the matter (either?).

When I was wondering about AtomicBackoff/ExponentialBackoff (they're actually almost identical, so that refactoring could make one a subclass or typedef of the other, doubling the previous delay each time pause() is called), I thought that perhaps atomics/mutexes should be approached like what is done on a shared medium like Ethernet (the original shared coaxial cable anyway). But there are some differences in the techniques employed. On Ethernet, the network card first listens for traffic, and when it detects a silence it waits for a random amount of time before it will send anything (the level of congestion affects the mean value of the delay, but it's still random). On TBB, there's no listening step (which may lead to decreased overall performance due to increased chatter), and there's no randomness (which introduces unfairness against, potentially up to the point of starvation of, threads that have been waiting the longest).

Unfortunately, any introduction of complexity where contention is low will probably needlessly increase latency, and for lack of the resources and dedication that a scholar might have I'm leaving out the randomness here, except for the consideration that medium delay should probably be bounded so as to avoid gross unfairness and starvation (TBB's LOOPS_BEFORE_YIELD is not the same because it will increase unfairness even more, and it won't even do anything useful unless there are other users than TBB on the system), and also wondering whether __TBB_Pause() should not have a default implementation other than __TBB_Yield(). Likewise, I don't know how useful and difficult it would be to base the initial pause on the last pause of a previous acquisition.

So, without further ado, here's a proposal for your consideration about a first possible improvement, only for mutex at this time:

for( ExponentialBackoff b; read()==HELD /*short-circuit*/|| test_and_set()==HELD; b.pause() ) {
 do {
 fixed_small_pause(); // be nice to other hyperthreads, save the planet
&nbs
p; } while( read()==HELD );
}

For reference, the TBB status quo is:

for( AtomicBackoff b; test_and_set()==HELD; b.pause() );

If you're not already familiar with the issues, remember that, individually, a thread can get better service by never reading or pausing, but, like on Ethernet (the original one anyway) or Internet (where you can install some evil device on your computer to browse faster), overall throughput will suffer, especially if everybody does this. If you are an expert, please chime in, because this may well be just reinventing the wheel.

Benchmarks, anyone?

17 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

Thanks for another interesting idea, Raf.

Though initially I had doubts if the proposed mechanism will be efficient in case of spinning (lost package in the network should cost much more time that unlucky CAS), the results shown by an old TBB timing test for the spin_mutex on {2x4 core WinXP 64, full utilization} are impressive.

Here's how I changed __TBB_TryLockByte:

inline uintptr_t __TBB_LockByte( unsigned char& flag ) {
#if 0
if ( !__TBB_TryLockByte(flag) ) {
tbb::internal::AtomicBackoff b;
do {
b.pause();
} while ( !__TBB_TryLockByte(flag) );
}
#else /* 0 */
const size_t pause = 10000;
for( tbb::internal::AtomicBackoff b; flag==1 /*short-circuit*/|| !__TBB_TryLockByte(flag); b.pause() ) {
do {
__TBB_Pause(pause);
} while( flag==1 );
}
#endif /* 0 */
return 0;
}

The current TBB implementation gives the following results (column delay denotes the lock duration):

p delay flavor time
8 0 spin_mutex 159.94
8 1 spin_mutex 456.44
8 10 spin_mutex 826.10
8 100 spin_mutex 1046.29
8 1000 spin_mutex 3644.30

And here are results for some of the pause values:

pause: 100
8 0 spin_mutex 355.10
8 1 spin_mutex 361.72
8 10 spin_mutex 696.43
8 100 spin_mutex 740.71
8 1000 spin_mutex 1261.12

pause: 200
8 0 spin_mutex 110.02
8 1 spin_mutex 96.84
8 10 spi
n_mutex 218.88
8 100 spin_mutex 415.71
8 1000 spin_mutex 1031.56

pause: 500
8 0 spin_mutex 31.97
8 1 spin_mutex 34.51
8 10 spin_mutex 64.42
8 100 spin_mutex 175.61
8 1000 spin_mutex 1039.08

pause: 1000
8 0 spin_mutex 23.29
8 1 spin_mutex 23.47
8 10 spin_mutex 43.19
8 100 spin_mutex 121.86
8 1000 spin_mutex 930.04

pause: 10000
8 0 spin_mutex 18.13
8 1 spin_mutex 19.20
8 10 spin_mutex 32.07
8 100 spin_mutex 104.88
8 1000 spin_mutex 832.86

After I changed the test to eliminate caches effects the picture remained essentially the same for shorter locks, but for longer ones the difference became much less prominent (but still is noticeable):

Current TBB:
8 0 spin_mutex 126.83
8 1 spin_mutex 299.21
8 10 spin_mutex 309.74
8 100 spin_mutex 501.65
8 1000 spin_mutex 2551.02

With Raf's fix:
pause: 10000
8 0 spin_mutex&
nbsp; 18.56
8 1 spin_mutex 19.12
8 10 spin_mutex 42.05
8 100 spin_mutex 250.16
8 1000 spin_mutex 2311.13

I was able to achive similar speedup by tweaking the AtomicBackoff implementation, but even in the best case it worked twice slower in case of short locks.

LOOPS_BEFORE_YIELD = 8192, count is reset after each yield
8 0 spin_mutex 47.30
8 1 spin_mutex 44.34
8 10 spin_mutex 55.22
8 100 spin_mutex 269.65
8 1000 spin_mutex 2358.83

At last in the oversubscription case (I simply used optimal values from fully utilized case, which is obviously incorrect, so these data are for quick demonstration only):

LOOPS_BEFORE_YIELD = 16
16 0 spin_mutex 229.82
16 1 spin_mutex 413.27
16 10 spin_mutex 404.77
16 100 spin_mutex 640.70
16 1000 spin_mutex 2693.99

LOOPS_BEFORE_YIELD = 8192
16 0 spin_mutex 98.14
16 1 spin_mutex 162.45
16 10 spin_mutex 140.06
16 100 spin_mutex 330.84
16 1000 spin_mutex 2485.79

With Raf's fix:
pause: 10000
16 0 spin_mutex 83.35
16 1 spin_mutex 89.01
16 10&nbsp
; spin_mutex 120.91
16 100 spin_mutex 315.96
16 1000 spin_mutex 2620.73

To make any final conclusions, it will take additional investigations and much more testing on different platforms, but your proposal looks very promising. I'll contact other people in the TBB team who have more experience in this particular area.

BTW, Arch is out of office until the beginning of August, so he may not be able to comment on your proposal until then.

Thanks for putting this to the test, and I'm happy it worked out well, at least on that "old TBB timing test for the spin_mutex". I do think that the expression for reading "flag" should be const_cast(flag) instead of just flag==1, but perhaps the compiler didn't optimise the read away after all... probably worth trying, anyway. Also, what I meant where I mentioned LOOPS_BEFORE_YIELD is that you might look into cutting off ExponentialBackoff::pause() to some maximum value for a number of iterations before starting to __TBB_Yield(), and deciding on the length of that maximum pause based on the number of contestants (threads), because the cut-off should probably be shorter for fewer threads.

Of course, what I am really looking for is optimising spinning atomics, so I'm not there yet, but this seems like a validation of the kind of solution that should be explored.

I agree, the cast to volatile reference is necessary. Though most of the compilers are unlikely to use registers because flag is the pointer (reference) and __TBB_pause may have side effects from their perspective. I've talked to Wooyoung who is working on a new suite of performance tests, and your proposal is in the pipeline now :)

Before extending the measurement range of "pause" (which is a bit confusing next to ExponentialBackoff::pause()) to tune it to produce an optimum (for various levels of contention!), ExponentialBackoff's implementation should also be changed, to pause() for a time proportional to, e.g., a fixed value count_min (perhaps the more likely explanation of why times improved with larger "pause" values) plus rand_r()%count (with count the value that grows exponentially between calls, starting at count0) to avoid starvation (although there is still a bias in favour of newcomers) up to a count_max for N iterations before it starts to yield(). Apparently there is some time available (more than enough?) for computing a good sequence of values, so... It would be nice if a thread-specific state for rand_r() could be kept between ExponentialBackoff instantiations, unless there is some reason that competitors for the same lock will not start simultaneously; likewise it would be nice to make all those parameters be self-tuning by, e.g., making count_min a fraction of the previous winning count_min+rand_r()%count value. There's a good opportunity to do some research here and perhaps make good use of what knowledge already exists about stochastic processes. Maybe some (minimal) statistics relevant for self-tuning could be recorded next to the lock by the previous winner?

It seems this very promising solution has been left out in the cold!  Current stable release does not show anything of this proposition..

We have a similar spinning problem here where there is oversubscription.

It might be interesting to test hybrid spinning as suggested in this article.

A quick test with time_locked_work.cpp on my 32 cores AMD opteron yields significant speedup.

Thanks for bringing this to the surface again, but have you tried adapting __TBB_LockByte in include/tbb/tbb_machine.h (assuming you are not using Itanium) with my suggestion as well as Dmitry's? It may or may not improve matters, because on recent hardware the underlying hardware instruction may already incorporate the idea to first obtain a read-only copy of the cache line, as has been suggested to me. Perhaps the benchmark should be tried again (rhetorical question)? Even if test-and-test-and-set is not relevant anymore on Intel hardware, the implementation should still be modified for more general usefulness.

It also depends on the particular algorithm (duration of holding a lock, etc.) and the hardware being used of course (NUMA?), and especially in the case of oversubscription hybrid spinning is probably a lot more significant than any improvements during the initial active spinning, so I don't expect much there. Then again, maybe your attention should primarily be spent on getting rid of oversubscription rather than merely trying to alleviate some of its ill effects?

Hi Raf, I'm glad you are still around to debate on this topic!

Our problem arise primarly because of overscription in a mixed bag of thread priorities. We are working on this to have a more efficient use of the resources.  Meanwhile, it does not make sense that all cores starts spinning around without doing any real work.  I'd like to point out that this problem is more evident when thread active cpu time slice is larger (Windows Server for instance).

I've tried the approach you suggested and it did have little impact on the synthetic spin_lock test (only validated on my 32 cores opteron but our 64 cores intel beast can't wait to try it out ;) ).

Another interesting approach I've seen suggests to store the id of the tread currently holding the lock and calling ResumeThread (windows) on it after some spinning.  This could help with overscription and unlucky context switches.

I suggest that you inspect all uses of tbb::spin_mutex and change those that hold their resources for more than a negligible fraction of a time slice to tbb::mutex, which may very well already have a hybrid implementation, although I didn't readily find that confirmed for Windows.

Problem comes from tbb arena itself.  When a significant amount of threads launch tbb algorithms (parallel fors, sorts, reduce, etc..), everything gets jammed in arena_in_need() and similar functions.

On a side note, your suggested implementation performs really badly when mixing thread priorities with oversubscription. Hybrid approach works better in this case.

I already indicated that test-and-test-and-set is no solution in this case, but I hope that at least it does not perform worse than the existing implementation?

A lot worse in my case..  I had to kill the process because it was taking too much time.

Note that I modified the test to lower one fifth of the thread priorities.  I also run the test with 128 concurrent threads.  This way, we will preempt a thread holding the lock.  In this case, you implementation is a lot worse than tbb's default.  However, running all threads with normal priorities, your implementation perform similarly or a tad faster on AMD and 5-10 times better on Intel.

Most peculiar! It was suggested to me that recent Intel processors internalise the TTS strategy, so I would have expected no difference there and a possible improvement on AMD and others. I have no idea how the algorithm could possibly perform worse in case of increased convoying, because TTS should leave those hapless contenders spinning without any attempt to obtain a writeable cache line, thus avoiding a lot of coherent-cache data traffic. Maybe somebody from the TBB team could shed some light on that?

It does not have anything to do with cache lines.. It backoff and yield cpu less often than original implementation.  Then if the CPU is busy executing code, preempted thread will not have the chance to release the lock.

Granted, it's a corner case, but one we stumble upon too often to ignore.

If you want to have fun experimenting with spinlock strategies, here a quick and dirty modification to the test to reproduce it:

<code>

//! template test unit for any of TBB mutexes
template<typename M>
struct TBB_Mutex : TestLocks {
M mutex;

double test(int testn, int threadn)
{
srand(threadn);
for(int r = 0; r < repeat_until(testn); ++r) {
do_work(outer_work[testn]);
{
typename M::scoped_lock with(mutex);
do_work(/*inner work*/value);
#if 1
if (rand() % 5 == 0) {
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
__TBB_Yield();
}
#endif
}
}
return 0;
}
};

</code>

Of course... sorry, I should have checked what I actually wrote at the time. Maybe you could just limit the duration of the inner while loop to a fraction of a time slice before yielding?

The main idea in this initial step was to avoid incurring a pathological increase of the backoff period by the mere duration of exclusion, and instead only backing off when actual contention is detected (to avoid the problem so colourfully named "thundering herd"). In the current implementation, many contenders will very quickly and equally degrade unnecessarily in their ability to react to a release, which is probably the reason why it seems to perform less well than my proposal.

The full idea is a bit more complicated, because it tries even more strongly to avoid favouring new contenders over old ones, without going as far as enforcing absolute fairness. But I haven't yet explored it further.

Laisser un commentaire

Veuillez ouvrir une session pour ajouter un commentaire. Pas encore membre ? Rejoignez-nous dès aujourd’hui