I think this is about flushing out the writes somehow, so that the other thread gets to see the new Ready value, which makes at least some sense to me. But I'm also in favour of using real atomics, and you can write those without a single "volatile" (I did, anyway), because the qualifier would be redundant with the inline-assembler store instructions. Still, what prevents the optimiser from reordering even those stores with later code, perhaps code that waits for a value that cannot appear until another thread has seen the store, which would lead to deadlock? I checked my code again: I thought I had at least a compiler fence at both ends to prevent just that, but apparently not, and I now consider that an oversight. I'm aware that language-level atomics can do other kinds of things as well, if only the specification were readable...
(Added) "I did, anyway": or not yet... the "volatile" qualifiers are still there in tbb_machine.h.
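To make the reordering concern concrete, here is a minimal sketch of the Ready handoff written with C++11 language-level atomics instead of "volatile" or inline assembler. All names (payload, ready, producer, consumer, run_handoff) are made up for illustration and have nothing to do with what is actually in tbb_machine.h.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration only; not taken from tbb_machine.h.
static int payload = 0;                  // ordinary, non-atomic data
static std::atomic<bool> ready{false};   // the "Ready" flag, no volatile needed

void producer() {
    payload = 42;  // plain write that must become visible before the flag
    // Release store: neither the compiler nor the hardware may move the
    // payload write after this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: reads after the loop cannot be moved before the load
    // that observes ready == true.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // sees the producer's write once the flag is observed
}

int run_handoff() {
    // Reset state so the sketch can be run more than once.
    payload = 0;
    ready.store(false, std::memory_order_relaxed);

    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

With hand-rolled inline-assembler stores, I believe the equivalent of the compiler-only part would be a compiler fence on each side of the store (in standard C++, std::atomic_signal_fence restricts compiler reordering without emitting a hardware fence), but that is an assumption about how one would retrofit it, not a description of existing code.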