Support for std::atomic to enable data race detection

Support for std::atomic to enable data race detection

Hello,

We are trying to verify some of our programs that use TBB for data races using ThreadSanitizer data race detector (https://clang.llvm.org/docs/ThreadSanitizer.html). However, currently we are seeing lots of data race reports inside of TBB library itself which makes further verification impossible. See below for examples. The root problem is that TBB uses home-grown implementation of atomic operations which is not understood by ThreadSanitizer.

Have you considered implementing TBB's atomic on top of language/compiler-provided atomic operations? If not, I would like to make a feature request for this. This can probably help with testing of TBB itself as well.

On implementation level this is just a matter of mapping one interface onto another and can use either C++'s std::atomic, C's atomic operations or the new style __atomic compiler built-ins (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html) (supported by both clang and gcc). Any of these will work for ThreadSanitizer. This probably should be a build-time option enabled when building at least for ThreadSanitizer.

Thanks

---

WARNING: ThreadSanitizer: data race (pid=720823)
Read of size 8 at 0x7b30000093b0 by thread T10:
#0 load_with_acquire tbb/include/tbb/tbb_machine.h:613:23
#1 __TBB_load_with_acquire<long> tbb/include/tbb/tbb_machine.h:708
#2 operator tbb::internal::rml::private_worker * tbb/include/tbb/atomic.h:301
#3 propagate_chain_reaction tbb/src/tbb/private_server.cpp:161
#5 tbb::internal::rml::private_worker::thread_routine(void*) tbb/src/tbb/private_server.cpp:228:11

Previous write of size 8 at 0x7b30000093b0 by thread T11:
#0 store_with_release tbb/include/tbb/tbb_machine.h:619:18
#1 __TBB_store_with_release<long, long> tbb/include/tbb/tbb_machine.h:712
#2 store_with_release tbb/include/tbb/atomic.h:328
#3 operator= tbb/include/tbb/atomic.h:492
#4 tbb::internal::rml::private_server::wake_some(int) tbb/src/tbb/private_server.cpp:391
#5 propagate_chain_reaction tbb/src/tbb/private_server.cpp:162:13
#6 tbb::internal::rml::private_worker::run() tbb/src/tbb/private_server.cpp:266
#7 tbb::internal::rml::private_worker::thread_routine(void*) tbb/src/tbb/private_server.cpp:228:11

WARNING: ThreadSanitizer: data race (pid=57663)
Read of size 8 at 0x7b7000000c08 by thread T5:
#0 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task(long&) tbb/src/tbb/custom_scheduler.h:240:35
#1 tbb::internal::arena::process(tbb::internal::generic_scheduler&) tbb/src/tbb/arena.cpp:154:21
#2 tbb::internal::market::process(rml::job&) tbb/src/tbb/market.cpp:677:16
#3 tbb::internal::rml::private_worker::run() tbb/src/tbb/private_server.cpp:275:23
#4 tbb::internal::rml::private_worker::thread_routine(void*) tbb/src/tbb/private_server.cpp:228:11

Previous write of size 8 at 0x7b7000000c08 by main thread:
#0 tbb::internal::machine_load_store_relaxed<tbb::task**, 8ul>::store(tbb::task** volatile&, tbb::task**) tbb/include/tbb/internal/../tbb_machine.h:687:18
#1 void tbb::internal::__TBB_store_relaxed<tbb::task**, tbb::task**>(tbb::task** volatile&, tbb::task**) tbb/include/tbb/internal/../tbb_machine.h:738
#2 tbb::internal::generic_scheduler::leave_task_pool()
#3 tbb::internal::generic_scheduler::cleanup_master(bool) tbb/src/tbb/scheduler.cpp:1224
#4 tbb::internal::governor::terminate_scheduler(tbb::internal::generic_scheduler*, tbb::task_scheduler_init const*) tbb/src/tbb/governor.cpp:229:12
#5 tbb::task_scheduler_init::terminate() tbb/src/tbb/governor.cpp:385
#6 tbb::task_scheduler_init::~task_scheduler_init() tbb/include/tbb/task_scheduler_init.h:124:13

19 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

ping

Hi Dmitry,

Sorry for the late response. You can enable compiler specific builtins with TBB_USE_GCC_BUILTINS or TBB_USE_ICC_BUILTINS define macros. Try this and tell what you got, because something that you need may be missed.

Thanks,

Nikita

Great! Thanks, will try.

Number of races greatly reduced with TBB_USE_GCC_BUILTINS, however we still see few like:

WARNING: ThreadSanitizer: data race (pid=608931)
  Read of size 8 at 0x7b3000009430 by thread T11:
    #0 load_with_acquire tbb_machine.h:613:23
...
  Previous write of size 8 at 0x7b3000009430 by thread T10:
    #0 store_with_release tbb_machine.h:619:18

 

 

It seems we are hitting this case in machine/gcc_generic.h:

// TODO: implement with __atomic_* builtins where available
#define __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE            1
#define __TBB_USE_GENERIC_RELAXED_LOAD_STORE                1
#define __TBB_USE_GENERIC_SEQUENTIAL_CONSISTENCY_LOAD_STORE 1

 

We are also using clang, and it seems the header file only checks gcc version when selecting between old and new atomics. My clang 5.0 pretends that it is a very old gcc, so it probably gets the old builtins which can't express relaxed stores, etc. I think we also need something along these lines:

#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

#if __TBB_GCC_VERSION < XXXXX || __has_builtin(__atomic_load_n)
  ... use new builtins ...
#endif

 

 

 

 

To make it clear, we only care about the case when compiler support the new atomics and the new atomics can directly express relaxed/acquire/release store/loads. Compilers that don't have new atomics, don't support ThreadSanitizer either (and are just too old to bother).

Thanks Dmitry, we will take a look. And if you know some commonly used macro that can be checked for presence of new atomic built-ins, I will appreciate if you share with us. The __has_builtin() is an option, but maybe there is some feature macro available.

I don't know what's the right __has_feature value for this. After searching internet I see that some code just checks __has_builtin for all atomic function, e.g.:

https://code.woboq.org/llvm/libcxx/src/include/atomic_support.h.html

and some code checks __has_feature(c_atomic), e.g.:

https://gist.github.com/galek/68480cb04b432443cbad

However, I think strictly saying __has_feature(c_atomic) tells about presence of __c11_atomic_* functions, which is a parallel set of functions with the different prefix.

I would say __has_builtin(__atomic_load_n) should work well in practice.

Is there any progress on this?  We really want to use TBB for our application, but we can't because of the errors thrown by the ThreadSanitizer.

up

still relevant

Hi Dmitry,

I have recently evaluated if TBB can be analyzed with TSan. Indeed there is a lot of diagnostics, and some should be addressed on our side. But, as a part of it, I tried to apply TSan to a C++ implementation of Peterson's lock algorithm, and to my surprise TSan did not pass this test and reported a race, i.e. a false positive diagnostic.

I started with your own implementation I took from https://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C+... (as far as I understand, Anthony took it from some other place and adjusted to use C++11 atomics); then I modified it as in the attached file (without losing correctness, I believe) to be a closer match (operation- and fence-wise) to the THE work stealing implementation in TBB. For both implementations, TSan reported races for the variable protected by the lock. I used GCC 7.3 and Clang 6 for testing; both gave the same result.

So, unfortunately, for now it is not possible to analyze TBB programs with ThreadSanitizer.

Attachments: 

AttachmentSize
Downloadtext/x-c++src test_peterson_lock.cpp1.6 KB

Yes. This is true.

This code should work under tsan if we replace:

        while (flag1.load(std::memory_order_relaxed)

        std::atomic_thread_fence(std::memory_order_acquire);

with:

        while (flag1.load(std::memory_order_acquire)

This should have the same performance on x86 (faster on Itanium?), but can slow down ARM for unsuccessful steal operation.

The question is: how many such places are in TBB code? If we are talking about few, then we could do some #ifdef's. E.g. a weird:

#ifdef THREAD_SANITIZER

        flag1.load(std::memory_order_acquire);

#else

        std::atomic_thread_fence(std::memory_order_acquire);

#endif

should work too.

But if you are up to not relying on standalone acquire/release fences then, of course, we can both make it work with tsan and not sprinkle the code with ifdef's.

`while (flag1.load(std::memory_order_acquire)` was in your original code, and as far as I recall it still has false positives.

The work stealing protocol is the main place where we try to keep atomic operations as lightweight as possible. There are also a few cases where we use epoch-based lazy synchronization - but overall, not so many perhaps.

Using ifdef's is in principle possible, though improving the tool to better handle standalone fences has its own value I believe. If not that, I think we would need to better understand what exactly cannot be handled well by the tool, in order to recognize these patterns in the code and find out how to change it.

Also, I think the model where one need to recompile TBB in order to analyze their own code for thread safety is, let's say, imperfect. First, it's extra burden on application developers; second, and perhaps more important, is that they do not test what they ship. I believe most of our customers who want to use ThreadSanitizer do not want to analyze TBB but interested in their own code. From that perspective, having some way to teach the tool about semantics of TBB operations - something like a config file specified via command line - could be useful. It would be pretty much the same as handling e.g. standard memory allocation routines, which "synchronize-with" each other in a certain way. Perhaps if the tool has some way to learn and take into account such relations for 3rd libraries like TBB, we could provide some configuration file (and also formally document these relations along the way).

> `while (flag1.load(std::memory_order_acquire)` was in your original code, and as far as I recall it still has false positives.

This version works without false positives.

Attachments: 

AttachmentSize
Downloadtext/x-c++src test.cpp1.03 KB

> Using ifdef's is in principle possible, though improving the tool to better handle standalone fences has its own value I believe.

Not handling standalone acquire/release fences is somewhat intentional. The problem is that they have global effect on thread synchronization. Namely, a release fence effectively turns all subsequent relaxed stores into release stores, and synchronization operations are costly to model as compared to relaxed atomic operations. An acquire fence is even worse because it turns all preceding relaxed loads into acquire loads, which means that when we handle all relaxed loads as not-yet-materialized acquire loads just in case the thread will ever execute an acquire fence later.

Even if we implement handling of acquire/release fences in tsan as an optional feature, chances are that we won't be able to enable this mode on our projects because they are large in all respects (to put it mildly).

Combining memory ordering constraints with memory access also leads to more readable code. Which is usually a better trade-off for everything except the most performance-critical code (e.g. TBB).

But, yes, it's incomplete modelling of C/C++ memory model. We just happened to get away that far (checking 100+MLOC) without stand-alone fences.

 

> If not that, I think we would need to better understand what exactly cannot be handled well by the tool, in order to recognize these patterns in the code and find out how to change it.

As far as I can tell that's only stand-alone acquire/release fences. seq_cst fences should work as long as they are not used also as acquire/release fences, e.g. in the test.cpp above we use seq_cst fence but also annotate memory accesses with acquire/release as necessary.

Of course, it's possible that TBB will uncover some other interesting corner cases. But we do verify things like epoch-based GCs under tsan and it all works fine.

> Also, I think the model where one need to recompile TBB in order to analyze their own code for thread safety is, let's say, imperfect.

FWIW this will work for us as we tend to compile everything from source (which has other advantages as compared to fixed, non-transparent binary blobs).

But we have options for annotations too. Tsan has a set of annotations for "user code" for acquire/release (synchronizes-with) and mutexes. E.g. ABSL mutex uses these:

https://github.com/abseil/abseil-cpp/blob/master/absl/synchronization/mu...

These annotations tell tsan about user semantics of the mutex but otherwise make it ignore everything that happens inside of mutex impl (both synchronization and memory accesses). This has advantages for both checking precision, performance and reports quality. But mutex impl itself is obviously left unchecked.

Tsan also has interceptors for common library routines. E.g. pthreads:

https://github.com/llvm-mirror/compiler-rt/blob/c593dcbd691be4a3d922575f...

C++ ABI __cxa_guard_acquire:

https://github.com/llvm-mirror/compiler-rt/blob/c593dcbd691be4a3d922575f...

Some ObjectiveC builtin synchronization primitives:

https://github.com/llvm-mirror/compiler-rt/blob/c593dcbd691be4a3d922575f...

And actually the whole of libdispatch/GCD, which is quite similar to TBB in nature:

https://github.com/llvm-mirror/compiler-rt/blob/c593dcbd691be4a3d922575f...

But interceptors work well for well-defined, stable C APIs. I am not sure it will be a good fit TBB C++ APIs. we do can intercept mangled C++ names, but I guess you may still not want to commit to exact names and signatures.

IIRC TBB has some substantial parts of code in header files. This means that even if we don't instrument cpp files, the header parts will end up instrumented as part of user code. This can potentially lead to false positives depending on what route we take.

Also annotations enabled at runtime with some kind of config will have runtime overhead even when not enabled. This is suboptimal.

What do you think about the following?

All TBB entry functions are placed in header files and annotated with macros:

// in header file
void queue::push(void* obj) {
  TBB_TSAN_ENTER();
  ...
  TBB_TSAN_RELEASE(obj);
  push_impl(obj); // in cpp file
  ...
  TBB_TSAN_EXIT();
}

TBB_TSAN_ENTER/EXIT will disable tsan checking for both synchronization operations and memory accesses within the dynamic scope. And TBB_TSAN_RELEASE/ACQUIRE will communicate the user-visible semantics.

Since this all is in headers, we can enable/disable annotations at compile time depending on if the user code is compiler with tsan or not:

#ifdef THREAD_SANITIZER
# define TBB_TSAN_ACQUIRE(p) __tsan_acquire(p)
# define TBB_TSAN_RELEASE(p)  __tsan_release(p)
#else
# define TBB_TSAN_ACQUIRE(p)
# define TBB_TSAN_RELEASE(p)
#endif

 

But there will be a bunch of interesting questions. E.g. what to use as synchronization address if queue element is not a pointer. Or how to wrap user tasks into annotated tasks.

 

Just so that this option is not lost from consideration, making tsan understand TBB atomic operations would solve our problem at hand too.

Leave a Comment

Please sign in to add a comment. Not a member? Join today