Bug report - TBB runs incorrectly on Athlon XP


Hi, everybody,
My system: AMD Athlon XP 2000+ / 1 GB RAM / Windows XP SP3.
The compiler is MinGW g++ 4.40.
TBB builds fine, but when I try to run an example (for instance, parallel_reduce/primes), an error message box appears: "unknown software exception (0xc000001e)".
I tried tbb22_004oss, tbb22_013oss and tbb30_20100314oss, and got the same error with each.

Fortunately, when I copy the built executables to another computer (Intel Core2 CPU), they run correctly.

Any ideas on how to resolve this? Thanks!


Thank you for reporting the issue.

Would it be possible for you to rebuild your application with another compiler, preferably Visual C++ or the Intel C++ Compiler? I'd like to understand whether you hit some subtle difference in HW behavior that TBB inadvertently relies upon (in which case it may also exist in the pre-built TBB binaries), or whether the issue is specific to MinGW.

Another idea is to check whether we accidentally pass some option to MinGW that causes it to emit instructions incompatible with your CPU. Looking at build/windows.gcc.inc, the only suspicious option I found is -msse; would you mind removing it and trying again?

Thank you, Alexey.
I tried rebuilding the TBB library and the example with VC2005 Express, but got the same error (only the memory address is different).
I also removed the -msse option from windows.gcc.inc and rebuilt everything with MinGW, but the behavior did not change.
By the way, I also built TBB on Linux with gcc; the result (close to the Windows result) is here:


So it does not work for you in any of the combinations you tried. It seems then that my first guess (a subtle HW difference) is more likely to be true, unfortunately :(

Our internal testing on an Opteron machine did not reveal any issue, so we need your further help. The problem might be specific to your system settings. Could you please run a couple more experiments?

First, could you please take the pre-built binaries of TBB (from the Windows-specific package supplied with a commercial-aligned or a stable release) and see if those work?

Second, do you have a stack trace of the fault (the one from VC is preferable, but any other would probably work as well)?

OK, I will try that later. And sorry for my poor English:-P

I built example/primes (rev tbb22_013oss) with VC9; the stack trace is below:

> tbb_debug.dll!__TBB_rel_acq_fence() line 43 + 0x3 byte C++
tbb_debug.dll!tbb::internal::GenericScheduler::get_task() line 2570 C++
(tbb::task & parent={...}, tbb::task * child=0x7ff87aa0) line 2945 + 0x8 byte C++
tbb_debug.dll!tbb::internal::GenericScheduler::local_spawn_root_and_wait(tbb::task & first={...}, tbb::task * & next=0x00000000) line 2521 C++
tbb_debug.dll!tbb::internal::GenericScheduler::spawn_root_and_wait(tbb::task & first={...}, tbb::task * & next=0x00000000) line 1462 C++
primes.exe!tbb::task::spawn_root_and_wait(tbb::task & root={...}) line 581 C++
const >::run(const SieveRange & range={...}, Sieve & body={...}, const tbb::simple_partitioner & partitioner={...}) line 144 + 0x85 byte C++
primes.exe!tbb::parallel_reduce(const SieveRange & range={...}, Sieve & body={...}, const tbb::simple_partitioner & partitioner={...}) line 262 + 0x11 byte
primes.exe!ParallelCountPrimes(unsigned long n=100000000) line 300 + 0x2e byte C++
primes.exe!main(int argc=1, char * * argv=0x003a60b0) line 388 + 0x9 byte C++
primes.exe!__tmainCRTStartup() line 582 + 0x19 byte C
primes.exe!mainCRTStartup() line 399 C

Execution breaks on line 43 in file tbb\machine\windows_ia32.h:
inline void __TBB_rel_acq_fence() { __asm { __asm mfence } }

The error message is:
Unhandled exception at 0x1000f8c3 (tbb_debug.dll) in primes.exe: 0xC000001E: An attempt was made to execute an invalid lock sequence

The pre-built binaries of TBB (tbb_13oss_win) do not work either.

I also tried debugging under Linux; GDB gave the following info:
Program received signal SIGILL, Illegal instruction.
__TBB_rel_acq_fence () at ../../include/tbb/machine/linux_ia32.h:42
42 inline void __TBB_rel_acq_fence() { __asm__ __volatile__("mfence": : :"memory"); }

The stack trace is here:
#0 __TBB_rel_acq_fence () at ../../include/tbb/machine/linux_ia32.h:42
#1 0xb7fc4961 in tbb::internal::GenericScheduler::get_task (this=0x804d600)
at ../../src/tbb/task.cpp:2569
#2 0xb7fc683c in tbb::internal::CustomScheduler::local_wait_for_all (this=0x804d600, parent=..., child=0x8058d20)
at ../../src/tbb/task.cpp:2945
#3 0xb7fbf1fe in tbb::internal::GenericScheduler::local_spawn_root_and_wait (
this=0x804d600, first=..., next=@0x8058d1c) at ../../src/tbb/task.cpp:2519
#4 0xb7fc3855 in tbb::internal::GenericScheduler::spawn_root_and_wait (
this=0x804d600, first=..., next=@0x8058d1c) at ../../src/tbb/task.cpp:1461
#5 0x0804995a in tbb::task::spawn_root_and_wait (root=...)
at /home/ymao/tbb22_013oss/include/tbb/task.h:580
#6 0x0804a663 in tbb::internal::start_reduce::run (range=..., body=..., partitioner=...)
at /home/ymao/tbb22_013oss/include/tbb/parallel_reduce.h:144
#7 0x0804a5cb in tbb::parallel_reduce (range=...,
body=..., partitioner=...)
at /home/ymao/tbb22_013oss/include/tbb/parallel_reduce.h:262
#8 0x080491e1 in ParallelCountPrimes (n=100000000) at primes.cpp:300
#9 0x080494f2 in main (argc=1, argv=0xbffff804) at primes.cpp:388


It seems that "mfence" is indeed an illegal instruction for your processor - as well as for old Intel processors that do not have SSE2.

Since you build TBB from sources anyway, my recommendation is to remove this instruction from your copy of the TBB code (but otherwise leave the inlined assembly intact). Unfortunately, this opens the possibility of subtle races due to instruction reordering by the processor, but at least it should get you somewhere.

If you develop software intended to run on various CPUs, consider limiting the use of the modified library to processors that lack SSE2.
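
To decide at run time which build of the library a given machine should load, the SSE2 feature flag can be queried through CPUID. The following is only a sketch: cpu_has_sse2, choose_fence and the FenceKind enum are hypothetical names that do not exist in TBB. It relies on the documented fact that CPUID leaf 1 reports SSE2 in bit 26 of EDX.

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>      // __cpuid
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>       // __get_cpuid
#endif

// Hypothetical helper (not a TBB API): true if the CPU advertises SSE2,
// the instruction set extension that introduced mfence.
inline bool cpu_has_sse2() {
    const unsigned kSse2Bit = 1u << 26;  // CPUID leaf 1, EDX bit 26
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (static_cast<unsigned>(regs[3]) & kSse2Bit) != 0;
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (edx & kSse2Bit) != 0;
#else
    return false;                        // not an x86 target
#endif
}

// Hypothetical selector: which fence flavor is safe on this machine.
enum FenceKind { kFenceMfence, kFenceLockedOp };

inline FenceKind choose_fence() {
    return cpu_has_sse2() ? kFenceMfence : kFenceLockedOp;
}

// x86-64 includes SSE2 in its baseline, so detection must succeed there.
inline void sanity_check_sse2_detection() {
#if defined(__x86_64__) || defined(_M_X64)
    assert(cpu_has_sse2());
#endif
}
```

On an Athlon XP, which has SSE but not SSE2, cpu_has_sse2() would be expected to return false, so an application could fall back to the modified library there.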

Another possible substitution is to replace mfence with an xchg instruction that modifies memory. The memory location can be a local variable. That way, the code will still work correctly on modern platforms too. The drawback is that xchg is more expensive than mfence.

The xchg instruction implies a memory fence even without a lock prefix. Using xchg for a memory fence should work on processors all the way back to the 8086, though the point is moot for machines without caches.

Do you mean replacing the original __TBB_rel_acq_fence with the code below? Thanks.
inline void __TBB_rel_acq_fence() {
    __asm {
        __asm lock xchg eax, ebx
        __asm lock xchg eax, ebx
    }
}

Or can I just remove this instruction, so that the code becomes "inline void __TBB_rel_acq_fence(){}"? If so, would that introduce any bugs?

Thanks all.

Best Reply

No, you should not remove the inlined assembly and make the method a no-op. The inlined assembly also serves as a compiler fence, i.e. it prevents an optimizing compiler from reordering instructions around the call.

For the purpose of a memory fence, any lock-prefixed operation also suffices. Indeed, xchg is the only operation that implies a fence without an explicit lock prefix, but its disadvantage is that it requires at least one register. I guess it is the compiler's job to save the value of the register and restore it afterwards, but this can be avoided altogether by using an operation applied to a memory location and an immediate value. E.g. I think the following should work:

inline void __TBB_rel_acq_fence() {
    int tmp;
    __asm {
        __asm lock add tmp,1
    }
}

The xchg instruction has to have a memory operand to imply a lock prefix. Alexey's use of an explicitly locked add seems better. Perhaps "LOCK INC tmp" would be a minimalist solution.
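
For GCC or MinGW builds, the memory-operand form of xchg described above might look like the sketch below. The function name fence_via_xchg and the local variables are made up for illustration, and the non-x86 branch simply falls back to a standard C++11 fence so that the snippet compiles elsewhere.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical name; a sketch of a full fence that avoids mfence.
inline void fence_via_xchg() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int slot = 0;  // dummy memory operand living on the stack
    int reg  = 1;  // register operand to swap with it
    // xchg with a memory operand is implicitly locked, so no lock prefix
    // is written. The "memory" clobber stops the compiler from moving
    // loads and stores across this statement.
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(reg), "+m"(slot)
                         :
                         : "memory");
    assert(reg == 0 && slot == 1);  // the two values were exchanged
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

Note that a register-to-register xchg, as in the earlier question, gives no fence, and a lock prefix on an instruction without a memory destination is not a valid encoding.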

To expand on Alexey's point about not using a noop, there are two common causes of instruction reordering:

  • The hardware
  • The compiler

The LOCK'd instruction prevents the hardware from reordering. The inline assembly prevents the Microsoft compiler from reordering.
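
The two causes listed above can be addressed separately, which a GCC-style sketch makes visible. Here compiler_barrier and full_fence_locked_add are hypothetical names, not TBB functions. The sketch adds 0 to an initialized dummy variable rather than 1 to an uninitialized one, which is an illustrative choice and not what the thread's code does.

```cpp
#include <atomic>
#include <cassert>

// Stops only the compiler from reordering memory accesses across this
// point; no machine instruction is emitted.
inline void compiler_barrier() {
#if defined(__GNUC__)
    __asm__ __volatile__("" : : : "memory");
#else
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Stops both the compiler and the hardware: the locked read-modify-write
// acts as a full fence on x86, and the "memory" clobber is the compiler
// barrier. It works on x86 CPUs without SSE2, unlike mfence.
inline void full_fence_locked_add() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int tmp = 0;  // dummy stack location, very likely already in cache
    __asm__ __volatile__("lock; addl $0, %0"
                         : "+m"(tmp)
                         :
                         : "memory", "cc");
    assert(tmp == 0);  // adding zero leaves the dummy value unchanged
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

Replacing the fence body with nothing at all would lose both properties, which is why an empty function is not a safe substitute.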

Thank you, Alexey. It worked.

It's frequently done as
I think it has the minimal code footprint, and the location is always in cache.


Thanks all. I recommend adding a make switch to choose between SSE2 optimization and i386 support. Best regards!
