Intel® Threading Building Blocks

Larrabee and TBB

Hi, we've been watching Larrabee with a lot of interest, particularly
for general-purpose computation. What can we expect with regard to
TBB support? Will it support TBB right out of the box, and how does performance look? The SIGGRAPH paper is a bit vague in this respect.
Thanks for any information,


Undefined symbol error

I searched the forum for an answer to my question, but I didn't find it:

I am trying to write a DSO for another program (RenderMan) to use.
I have downloaded and compiled tbb21_20080605oss, and I have written a small program using parallel_for; I am able to compile it using gmake without any errors or warnings.

However, when I try to call it from my other program, I receive the following error:
undefined symbol: _ZTVN3tbb4taskE
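For what it's worth, the mangled name can be decoded with c++filt, and it points at a symbol that lives in the TBB shared library itself, which suggests the DSO was built without linking against libtbb. A sketch of the diagnosis, with the link commands shown only as hypothetical examples (file names and paths below are placeholders, not from the original post):

```shell
# Decode the mangled symbol; it demangles to "vtable for tbb::task",
# which is defined inside libtbb.so:
c++filt _ZTVN3tbb4taskE
# So the DSO must be linked against libtbb, e.g. (placeholder names):
#   g++ -shared -fPIC my_dso.cpp -o my_dso.so -I"$TBB_INC" -L"$TBB_LIB" -ltbb
# and libtbb.so must be on the loader path when the host program loads the DSO:
#   export LD_LIBRARY_PATH="$TBB_LIB:$LD_LIBRARY_PATH"
```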

tachyon example

When I run the tachyon example (tbb21_20080605/examples/parallel_for/tachyon on OSX 10.4), it seems to run fine on my 2-core machine, showing the expected speedup, but I'm curious about the visual output. The first rendering has a narrow sliver (single row of pixels?) of color at the very bottom, otherwise black; the next rendering has a narrow sliver in the middle row; the third has a "fat L" shaped block in the left-center.

Anyone have similar experience? Is this expected?

thanks, Randy

several basic questions

Hi there,
I am programming some image-processing algorithms and thinking about using TBB to speed them up. The data I am dealing with is typically a 400*400 2D pixel grid (each pixel represented as a gsl_vector), or sometimes a collection of free-drawn lines in the image. The machine is a 2-core/4-processor Intel box.
My questions are:
1. If I parallelize by launching multiple jobs from the bash command line, then I won't benefit from TBB at the algorithm level, am I right?
2. Will TBB benefit me on a single-core, single-processor CPU?

More bugs in concurrent_queue

TBB 2.1 (20080825)

I've finally applied the Relacy Race Detector to concurrent_queue's signaling mechanism (the __TBB_NO_BUSY_WAIT_IN_CONCURRENT_QUEUE + _WIN32 version), and it reveals two interesting issues.

I. The n_waiting_consumers and n_waiting_producers variables are never decremented. So if there has ever been even one blocked thread, every subsequent signaling operation executes a CRITICAL_SECTION lock/unlock and a SetEvent() kernel call. And signaling is on the fast path of both the push and pop operations!
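The problem can be paraphrased in portable C++ (the real code uses a CRITICAL_SECTION and SetEvent on _WIN32; all names below are illustrative, not TBB's). The fix is for the woken thread to decrement the counter again, so the producer's fast-path check returns to the cheap branch:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Portable paraphrase of the signaling mechanism described above.
struct queue_signals {
    std::atomic<int> n_waiting_consumers{0};
    std::mutex mtx;
    std::condition_variable items_avail;

    // Called by a consumer that found the queue empty; lk must hold mtx.
    void consumer_wait(std::unique_lock<std::mutex>& lk) {
        ++n_waiting_consumers;
        items_avail.wait(lk);
        --n_waiting_consumers;  // the missing decrement: without it, every
                                // later push takes the slow signaling path
    }

    // Called by a producer after every push (the fast path).
    void producer_signal() {
        if (n_waiting_consumers.load() > 0) {  // cheap check when no one waits
            std::lock_guard<std::mutex> guard(mtx);
            items_avail.notify_one();
        }
    }
};
```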

Bug in concurrent_queue

TBB 2.1 (20080825)

Consider the following code from concurrent_queue.cpp:

void concurrent_queue_base_v3::internal_push( const void* src ) {
    // it's a sequentially consistent atomic RMW,
    // so we can consider that it's followed by a #StoreLoad style memory fence
    ticket k = r.tail_counter++;

    // it's a relaxed load
    if( r.n_waiting_consumers>0 ) {
        EnterCriticalSection( &r.mtx_items_avail );

Asynchronous message processing while maintaining single-threaded support

I have a publish/subscribe system designed to allow loosely coupled, asynchronous communication between components. The best practice recommended by Intel, Herb Sutter, and others is to write code that works single-threaded but scales to multiple/many threads without changing the code's logic.

Using TBB with maps

I would appreciate any comments on whether my conceptual thinking is correct.

My code currently has a large number of objects that "live" in a std::map. The function I would like to parallelize reads and updates one of the values in an object. Which objects are processed, and in which order, is not known in advance, so the map cannot be split into blocks for parallel processing. Obviously, I do not want to lock the whole map for each transaction.

What do you think about creating a spin_mutex within each of the individual objects to control access to them?


This problem seems to have been raised before without a satisfactory solution: there is currently no build configuration for Windows with MinGW distributed with TBB.

I've tried to make my own .inc files to get it working, but I am, how you say, bad at it.

I have an angle on how it could be done, but I'd appreciate it if a more skilled person either helped or did it himself. If I knew what TBB requires of the compiler and how to satisfy those needs via GCC, this would be trivial. Alas, the .inc files are a little opaque to my amateur eyes.


Who would (be interested to find out and) explain why "test_mutex.exe 1" succeeds, but "test_mutex.exe 2" and "test_mutex.exe 4" just keep spinning after printing "Spin RW Mutex readers & writers time =", unless spin_rw_mutex is allowed to (time out and) call __TBB_Yield(), or, e.g., calls printf() as part of the spin loop (probably also because of an implied yield)?
