tbb::pipeline instance using excessive CPU when idle

We noticed that our application, which is built on the tbb::pipeline class, was using excessive processing cycles in an interesting and predictable manner. After attempting to localize the problem in our code without success, we attempted to duplicate the problem with the text_filter example provided in the distribution, and we succeeded.

First, a description of the behavior: the CPU (or a core) is kept busy doing nothing when the pipeline should be idle, i.e. when there is no input data available and all processing for the previous data is complete. However, this excessive utilization only occurs on every other input item. If the data rate is slow enough, one can watch the CPU utilization bounce from 0 to 50% (on my Core 2 Duo).

I have included the slightly modified text_filter.cpp to illustrate the issue. It was modified to allow more time between iterations (so you can actually see the CPU activity) and to not run in serial mode first. Note that this is the same behavior we see in our application, where the input data rate is controlled externally (a socket interface) and not with artificial delays.

Any thoughts on why the pipeline is behaving in this manner?

Modification summary:
- added #include "tbb/tbb_thread.h"
- added in MyOutputFilter::operator():
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t(10.0));
printf("Done\n");
- commented out serial run in main()

Here is the full modified text_filter.cpp example code:

#include "tbb/pipeline.h"
#include "tbb/tick_count.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"
#include <cstring>
#include <cstdlib>
#include <cstdio>
#include <cctype>

using namespace std;

//! Buffer that holds block of characters and last character of previous buffer.
class MyBuffer {
    static const size_t buffer_size = 10000;
    char* my_end;
    //! storage[0] holds the last character of the previous buffer.
    char storage[1+buffer_size];
public:
    //! Pointer to first character in the buffer
    char* begin() {return storage+1;}
    const char* begin() const {return storage+1;}
    //! Pointer to one past last character in the buffer
    char* end() const {return my_end;}
    //! Set end of buffer.
    void set_end( char* new_ptr ) {my_end=new_ptr;}
    //! Number of bytes a buffer can hold
    size_t max_size() const {return buffer_size;}
    //! Number of bytes appended to buffer.
    size_t size() const {return my_end-begin();}
};

static const char* InputFileName = "input.txt";
static const char* OutputFileName = "output.txt";

class MyInputFilter: public tbb::filter {
public:
    static const size_t n_buffer = 8;
    MyInputFilter( FILE* input_file_ );
private:
    FILE* input_file;
    size_t next_buffer;
    char last_char_of_previous_buffer;
    MyBuffer buffer[n_buffer];
    /*override*/ void* operator()(void*);
};

MyInputFilter::MyInputFilter( FILE* input_file_ ) :
    filter(serial_in_order),
    input_file(input_file_),
    next_buffer(0),
    last_char_of_previous_buffer(' ')
{
}

void* MyInputFilter::operator()(void*) {
    MyBuffer& b = buffer[next_buffer];
    next_buffer = (next_buffer+1) % n_buffer;
    size_t n = fread( b.begin(), 1, b.max_size(), input_file );
    if( !n ) {
        // end of file
        return NULL;
    } else {
        b.begin()[-1] = last_char_of_previous_buffer;
        last_char_of_previous_buffer = b.begin()[n-1];
        b.set_end( b.begin()+n );
        return &b;
    }
}

//! Filter that changes the first letter of each word from lower case to upper case.
class MyTransformFilter: public tbb::filter {
public:
    MyTransformFilter();
    /*override*/ void* operator()( void* item );
};

MyTransformFilter::MyTransformFilter() :
    tbb::filter(parallel)
{}

/*override*/ void* MyTransformFilter::operator()( void* item ) {
    MyBuffer& b = *static_cast<MyBuffer*>(item);
    int prev_char_is_space = b.begin()[-1]==' ';
    for( char* s=b.begin(); s!=b.end(); ++s ) {
        if( prev_char_is_space && islower((unsigned char)*s) )
            *s = toupper(*s);
        prev_char_is_space = isspace((unsigned char)*s);
    }
    return &b;
}

//! Filter that writes each buffer to a file.
class MyOutputFilter: public tbb::filter {
    FILE* my_output_file;
public:
    MyOutputFilter( FILE* output_file );
    /*override*/ void* operator()( void* item );
};

MyOutputFilter::MyOutputFilter( FILE* output_file ) :
    tbb::filter(serial_in_order),
    my_output_file(output_file)
{
}

void* MyOutputFilter::operator()( void* item ) {
    MyBuffer& b = *static_cast<MyBuffer*>(item);
    int n = fwrite( b.begin(), 1, b.size(), my_output_file );
    if( n<=0 ) {
        fprintf(stderr,"Can't write into %s file\n", OutputFileName);
        exit(1);
    }
    // Modification: stall after each buffer so idle-time CPU usage is visible.
    tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t(10.0));
    printf("Done\n");
    return NULL;
}

static int NThread = tbb::task_scheduler_init::automatic;
static bool is_number_of_threads_set = false;

void Usage()
{
    fprintf( stderr, "Usage:\ttext_filter [input-file [output-file [nthread]]]\n");
}

int ParseCommandLine( int argc, char* argv[] ) {
    // Parse command line
    if( argc>4 ) {
        Usage();
        return 0;
    }
    if( argc>=2 ) InputFileName = argv[1];
    if( argc>=3 ) OutputFileName = argv[2];
    if( argc>=4 ) {
        NThread = strtol(argv[3],0,0);
        if( NThread<1 ) {
            fprintf(stderr,"nthread set to %d, but must be at least 1\n",NThread);
            return 0;
        }
        is_number_of_threads_set = true; // Number of threads is set explicitly
    }
    return 1;
}

int run_pipeline( int nthreads )
{
    FILE* input_file = fopen(InputFileName,"r");
    if( !input_file ) {
        perror( InputFileName );
        Usage();
        return 0;
    }
    FILE* output_file = fopen(OutputFileName,"w");
    if( !output_file ) {
        perror( OutputFileName );
        return 0;
    }

    // Create the pipeline
    tbb::pipeline pipeline;

    // Create file-reading stage and add it to the pipeline
    MyInputFilter input_filter( input_file );
    pipeline.add_filter( input_filter );

    // Create capitalization stage and add it to the pipeline
    MyTransformFilter transform_filter;
    pipeline.add_filter( transform_filter );

    // Create file-writing stage and add it to the pipeline
    MyOutputFilter output_filter( output_file );
    pipeline.add_filter( output_filter );

    // Run the pipeline
    tbb::tick_count t0 = tbb::tick_count::now();
    pipeline.run( MyInputFilter::n_buffer );
    tbb::tick_count t1 = tbb::tick_count::now();

    // Remove filters from pipeline before they are implicitly destroyed.
    pipeline.clear();

    fclose( output_file );
    fclose( input_file );

    if (is_number_of_threads_set) {
        printf("threads = %d time = %g\n", nthreads, (t1-t0).seconds());
    } else {
        if ( nthreads == 1 ) {
            printf("single thread run time = %g\n", (t1-t0).seconds());
        } else {
            printf("parallel run time = %g\n", (t1-t0).seconds());
        }
    }
    return 1;
}

int main( int argc, char* argv[] ) {
    if( !ParseCommandLine( argc, argv ) )
        return 1;
    if (is_number_of_threads_set) {
        // Start task scheduler
        tbb::task_scheduler_init init( NThread );
        if( !run_pipeline(NThread) )
            return 1;
    } else { // Number of threads wasn't set explicitly. Run single-thread and fully subscribed parallel versions
        { // single-threaded run
            //tbb::task_scheduler_init init_serial(1);
            //if( !run_pipeline(1) )
            //    return 1;
        }
        { // parallel run (number of threads is selected automatically)
            tbb::task_scheduler_init init_parallel;
            if( !run_pipeline(0) )
                return 1;
        }
    }
    return 0;
}

I think the root of the problem is that we could not allow the master thread to fall asleep inside the scheduler, so it spins even if no work is available. If we put it to sleep, we would have to invent complicated logic to wake it up in all the proper cases; otherwise the program would not complete at all.

Therefore, if a worker thread executes the input stage and gets blocked there, the master thread will still do busy-waiting. But if the master thread blocks, all workers will fall asleep. I think that explains the "instability" of timing you got.

TBB and blocking don't mix very well. Maybe you should consider waiting for a limited time only (zero or non-zero), and then restarting the pipeline when new data is detected, if you can suffer the extra latency associated with draining the pipeline (average and maximum). Depending on the statistical characteristics of the input, some tuning of the input time limit may be required for best results. Even if you aren't looking at energy consumption, a stalled pipeline should be stopped to prevent different degrees of starvation elsewhere in the program.
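
In sketch form, that might look like the following, reusing the old-style tbb::filter API from the example above; wait_for_data(), read_next_packet(), and shutting_down() are hypothetical helpers (e.g. thin wrappers around select() on the socket), not TBB calls:

#include "tbb/pipeline.h"
#include <cstddef>

// Hypothetical helpers, not part of TBB: wait_for_data() wraps select()
// with a timeout in ms (negative = wait forever), read_next_packet()
// returns one buffered packet, shutting_down() is an application flag.
bool wait_for_data( int timeout_ms );
void* read_next_packet();
bool shutting_down();

class TimedInputFilter: public tbb::filter {
public:
    TimedInputFilter() : tbb::filter(serial_in_order) {}
    /*override*/ void* operator()(void*) {
        // Bounded wait only: returning NULL drains and stops the pipeline,
        // so nothing is left spinning while the input is quiet.
        if( !wait_for_data( /*timeout_ms=*/100 ) )
            return NULL;
        return read_next_packet();
    }
};

void pump( tbb::pipeline& pipeline, size_t n_tokens ) {
    while( !shutting_down() ) {
        wait_for_data( -1 );        // cheap OS-level block while idle
        pipeline.run( n_tokens );   // returns when the input times out again
    }
}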

Well, that is very interesting information, and it does explain the behavior we are seeing. However, I would have to say that although this behavior may be expected, given insight into the implementation, it is a shortcoming of the implementation. I have not looked at the implementation, so I cannot comment on why, or even how, it does what it does, but I believe that TBB needs to deal with blocking (I/O in particular) for it to be a successful abstraction relieving the programmer of the details of threaded programming. The pipeline abstraction appeared to be a very good match for our number-crunching application, relieving us of the glue and housekeeping code of a traditional threaded approach. It did in fact greatly reduce the amount of code we had to generate, but unless I can find a workaround I fear that we are going to have to reimplement the application in the traditional manner, where we can ensure that idle tasks (threads) are in fact idle.

Starting and stopping the pipeline is not a reasonable approach from my perspective. Again, it basically complicates what seemed to be an elegant solution. I could easily cause the pipeline to terminate, but all of the additional changes that would be required to deal with the input interface would be complicated by the fact that things are starting up and terminating repeatedly.

Any other ideas on a workaround?

Thanks.

"our number crunching application"
Then where does the blocking arise?

Quoting - Raf Schietekat

"our number crunching application"
Then where does the blocking arise?

Blocking occurs on the input. I do not control the rate at which data arrives. It's variable, and we block (via a select) on a socket waiting for the arrival of the data. Generally two "packets" of data will arrive one after the other, and we can then process them concurrently (minimizing latency) before sending the processed data on to the final output stage, which is also a socket interface. The data processing is pure number crunching. It could be further decomposed into multiple pipeline stages, but initially we chose not to rework the algorithm until we determined whether the pipeline approach would work in our application. Future uses of the application will bring an increase in the input data rate. I believe the pipeline architecture would allow us to deal with the increased data rate (and increased processing) by either providing the ability to run more processing stages concurrently and/or allowing us to further decompose the processing into multiple stages.

So even if we were to move the socket interface out of the input stage and implement multiple stages for the processing, the data feed to the pipeline will still be variable, resulting in a situation where the pipeline is idle at times and the master thread is spinning. There is no way to eliminate this variability in our application.

"Starting and stopping the pipeline is not a reasonable approach from my perspective."
It's somewhat unfortunate, but it seems difficult to find a generally applicable mechanism that's also at the same time the best performing and the most elegant in each particular situation, so this one doesn't really look unreasonable to me, although I wouldn't be surprised, just interested, if somebody could come up with something better. We're not yet talking about several pipelines getting entangled and starving each other, just a few more lines of code...

"I do not control the rate at which data arrives."
Still, the statistical characteristics are important for deciding on a good solution. For example, if you have long waits between bursts, it makes more sense to drain and stop the pipeline. Otherwise there may be an optimum timeout based on the statistical characteristics of the input, the acceptable latency, the amount of work for each data item, the level of optional parallelism for each work item, the number of cores, and perhaps other parameters. Difficult to say. But please take this only at face value and make up your own mind!

Still, perhaps if a task could say, hey, I'm going to be waiting now, and when I wake up I'm not going to be doing a whole lot to compete with other threads until I finish, so meanwhile you might as well dispatch another worker from among the excess ones, TBB might just be more blocking-friendly than it is now...

Characteristics of the data are straightforward:
- 2 packets (~10 KBytes) received periodically (350 ms interval, shortest period today)
- Data rate is generally constant for some period of time (minutes to hours)
- Data rate can slow to a pair of packets every >1 s
- Each input data packet is processed and generates an output data packet of ~2 MBytes
- The processing stage takes ~50 msec to process an input packet into an output packet (3.17 GHz Core 2 Duo)
- The processing of a packet is primarily a sequential operation (it can be divided into stages)

Yes, I could have implemented this as a single-threaded process that waits for a message, processes it, and finally outputs the processed data before looping back and starting all over. However, that implementation has scaling issues as we move forward in terms of increased data rates and sizes, as well as algorithm refinements that may require additional processing time. Additionally, there are other aspects of the application that complicate the solution.

"just a a few more lines of code..."
Well the input interface is a bit more complicated than may be apparent from my simple description. Additionally, there would be more overhead in this approach. I'm not sure if just re-running a cleared pipeline would still incurr the initial startup overhead we have seen, but there is at least some additional burden with the approach although it may be less than the wasted CPU cycles currently being burned. It's not that the approach itself is completely unreasonable, I just think the benefits of the pipeline may be starting to evaporate. The appealing aspects of the pipeline were minimal management and code while being provided concurrency.

"Still, perhaps if a task could say, hey..."
Sorry, I'm not following the thought in this paragraph.

"- The processing stage takes ~50 msec to process an input packet into an output packet (3.17 GHz Core 2 Duo)"

Sorry, I only described the smaller input data packet. The worst-case processing time is ~150 msec per input packet for our large (~100 KBytes) packets, while our smaller input packets (~10 KBytes) can be processed in ~50 msec. It's with the smaller packets, and thus lower processing time, that we see the impact of the pipeline thread spinning, since there is significantly more idle time in that scenario.

"Characteristics of the data are straightforward:"
"Characteristics of the data are straightforward:"
Those are indeed long waits in computer time. I would still think of running select for 50-100 ms (ballpark, should be corrected by tuning, I suppose) before shutting down the pipeline (hmm, how long does a restart take? what are the latency requirements?). But if arrival times are predictable, they could be used instead of simple fixed timeouts.

"I'm not sure if just re-running a cleared pipeline would still incurr the initial startup overhead we have seen"
I hope you're not confusing pipeline startup with task_scheduler_init startup... a good approach is to have at least one long-lived task_scheduler_init somewhere in main(), i.e., for the lifetime of the process.
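
In other words, something like this minimal sketch:

#include "tbb/task_scheduler_init.h"

int main() {
    // One scheduler instance for the lifetime of the process: the worker
    // threads are created once, so each pipeline.run() afterwards pays
    // only pipeline cost, not scheduler startup.
    tbb::task_scheduler_init init;
    // ... build the pipeline once, call pipeline.run() per batch ...
    return 0;
}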

"there is at least some additional burden with the approach although it may be less than the wasted CPU cycles currently being burned"
Is that an environmental or an electricity cost concern (I'm all for the former, and sympathetic to the latter), or is it related to other processes on the same system? The main thread wouldn't be violently twiddling its thumbs if there were an eligible task somewhere ready for the stealing.

"The appealing aspects of the pipeline were minimal management and code while being provided concurrency."
You might also have a look at parallel_while (which might as well be undeprecated), but I guess it would have the same problem (no time to go and verify that now, though).

"Sorry, I'm not following the though in this paragraph."
That was just about a possible change to TBB itself, although minutes after that I started to wonder what it would do to the transitive relation for the sake of affinity between tasks and caches by way of threads... Another suggestion I would make is to have a way to concurrently revive a pipeline while it is shutting down; I might have a look at that one myself sometime.

OK, a possible workaround to reduce the spinning: when you detect that there is no work pending, check whether you're in the main thread; if you're not, send a dummy through the pipeline, thus giving another thread, hopefully the main thread, the chance of detecting the same condition, at which time you just block.

Quoting - Raf Schietekat

OK, a possible workaround to reduce the spinning: when you detect that there is no work pending, check whether you're in the main thread; if you're not, send a dummy through the pipeline, thus giving another thread, hopefully the main thread, the chance of detecting the same condition, at which time you just block.

Sorry for butting in, but we have a similar issue with pipeline blocking, and your workaround sounds like it fits the bill. Just some questions.

1) It sounds like if the main thread (the thread that did task_scheduler_init and started the pipeline) blocks on input, then the pipeline won't spin. On the other hand, if a worker thread blocks on the input filter first, the pipeline will spin. I hope I read your message correctly?

2) How do we determine whether or not a particular filter is executing in the main thread? (Sorry if this is a basic question; I can't find an answer in the docs or in the O'Reilly book.)

3) If we send a sufficiently large number of such dummies (say 1000), can we be sure of the main thread seeing one and blocking? Is there a way to send a dummy specifically targeted at the main thread?

My problem is similar to the OP's, i.e. input blocking. I will post details in another post because I don't want to steal the OP's thread (no pun intended).

"1) It sounds like if the main thread (the thread that did task_scheduler_init and started the pipeline) blocks on input, then the pipeline wont spin. On the other hand if a worker thread blocks on the input filter first the pipeline will spin. I hope I read your message correctly ?"
That's what I understood from Alexey Kukanov in #1.

"2) How do we determine whether or not a particular filter is executing in the main thread ? (Sorry if this is a basic question, I cant find an answer in the doc or in the O Reilly book)"
Record the main thread's tbb_thread::get_id() for later reference, I suppose.

"3) If we send a sufficiently large number of such dummies (say 1000), can we be sure of the main thread seeing one and blocking ? Is there a way to send a dummy specifically targeted at the main thread ?"
No, and therefore the total busy-waiting may get a lot worse, but I think the likelihood of that happening may be low enough to try this (a necessary condition is that processing of data items be bounded). But just to be sure: when I suggested to "send a dummy through the pipeline", I meant a specially marked pointer or referent (NULL from the input filter is reserved for shutting down the pipeline) that later filters know to ignore, just to give hopefully the main thread a chance to execute the input filter (wait with a zero or short timeout; if nothing is detected and you are the main thread, block forever). This is only if the pipeline is the main event loop of the process, of course; otherwise other complications may arise. And note that it's just an idea, I haven't tried it myself yet, and I would still go with my first suggestion instead (shut it down), or try something more radical.
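
A rough sketch of that scheme, with the same old-style filter API; the dummy marker, the recorded main thread id, and the wait_for_data()/read_next_packet() helpers are all inventions for illustration:

#include "tbb/pipeline.h"
#include "tbb/tbb_thread.h"

bool wait_for_data( int timeout_ms );   // hypothetical select() wrapper
void* read_next_packet();               // hypothetical packet reader

static char dummy_token;                     // marked referent; NULL is reserved
static tbb::tbb_thread::id main_thread_id;  // recorded in main() before run()

class DodgingInputFilter: public tbb::filter {
public:
    DodgingInputFilter() : tbb::filter(serial_in_order) {}
    /*override*/ void* operator()(void*) {
        for(;;) {
            if( wait_for_data( /*timeout_ms=*/0 ) )
                return read_next_packet();
            if( tbb::this_tbb_thread::get_id() == main_thread_id )
                wait_for_data( -1 );    // main thread: block cheaply, no spinning
            else
                return &dummy_token;    // worker: pass a dummy downstream, hoping
                                        // the main thread takes the input stage next
        }
    }
};
// Later filters must recognize &dummy_token and pass it through untouched.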

"My problem is similar to the OP's i.e input blocking. I will post details in another post because I dont want to steal the OP's thread (nopun)."
I myself would only look at whether the subject/title covers it (create tasks, not threads). And I'll pretend that you didn't suggest that a thread could get stolen (in TBB, it's the thread that does the stealing). :-)

(Added) It also seems fairly easy to see, unless that's just because I would be mistaken, that for this workaround to be at all beneficial, the amount of time spent on work vs. being blocked must be low enough, going down with the number of threads involved; it may help to use a non-zero timeout.

Here is a new development in my pipeline adventure...

On my Core 2 Duo my application performs generally as I indicated in an earlier post (~150 ms per packet @ 2 packets every ~350 ms). Now this "consumed" about 60 % of the available CPU performance as seen from the Windows Task Manager and ran with about six (6) threads from the Task Manager view.

We moved the application (no changes) to a ~3 GHz Dual Quad Core Xeon (its eventual target) for initial testing. I assumed that the "processing" would improve slightly and our "excessive" CPU utilization would drop to about 15% (total over 8 cores) when all the conditions were right. Well, the excessive utilization estimate was pretty much spot on. What I did not expect was that the "total" CPU utilization indicated by Windows Task Manager was still at 60% and the processing time per packet was not really any better. Additionally, Task Manager now reports somewhere in the neighborhood of 60 threads...

Now this is not sitting very well with me at the moment (and we need to get some more info, but it was the end of the day). I expected minimal usage of the new system's CPUs, but at the same data rates its utilization isn't changing; something is amiss.

BTW, I can account for three of the six threads on the Core 2 Duo (main and two tbb_threads I manually create) so I'm assuming the pipeline is creating the other three.

Anyway, life keeps being interesting...

James,

I was hoping that Alexey or Robert would be able to help you with your pipeline problem using TBB.

In QuickThread (a tool I wrote) we have two classes of tasks, compute and I/O, so the I/O portions of the pipeline application can be performed by the I/O class of threads. Also, although the "main" thread generally launches the thread pool, once the pool is launched there is no "main" thread, at least from a task scheduling viewpoint. The launching thread, which may be the main thread or another thread launched by the main thread, does have the responsibility of enqueuing the top-level task (or simply assuming the role of the top-level task). Other than that, there is no distinction. As long as any tasks remain to run, none of the threads will end. When task stealing cannot find a task to run (and after a tunable spinwait time), the thread suspends on WaitForSingleObject; it wakes when an appropriate task is enqueued. So other than for a short spinwait time, you won't see a long burn time when insufficient tasks are enqueued and threads are starved for work. In the case of the pipeline, as long as any tasks remain, be they I/O or compute class, the entire thread pool remains intact. When threads starve for work, they suspend; when new work arrives, they are awoken.

Rewriting the TBB sample upcase words test and running on a 4-core Q6600 (XP x64): 1,000,000 lines of "the quick brown fox jumped over the lazy grey dog's back nnn..." where nnn... is a sequence number to assure the upcased output buffers were collated properly (file size is 64.345 MB):

SerialRunUpcaseWordsTest =   5.440969 seconds
ParallelRunUpcaseWordsTest = 1.415301 seconds
Serial/Parallel = 3.844389

BTW, the I/O buffer sizes are 10,000 bytes (same as in the TBB example). A larger buffer might have yielded better performance (i.e., a multiple of the sector size). I have not run this against the equivalent TBB sample yet.

(Add files does not seem to open a browse dialog, else I would have attached the QuickThread version of the parallel pipeline upcase words test for you to see the difference in programming style.)

Jim Dempsey

www.quickthreadprogramming.com

Quoting - James Grimplin
Here is a new development in my pipeline adventure...

On my Core 2 Duo my application performs generally as I indicated in an earlier post (~150 ms per packet @ 2 packets every ~350 ms). Now this "consumed" about 60 % of the available CPU performance as seen from the Windows Task Manager and ran with about six (6) threads from the Task Manager view.

We moved the application (no changes) to a ~3 GHz Dual Quad Core Xeon (its eventual target) for initial testing. I assumed that the "processing" would improve slightly and our "excessive" CPU utilization would drop to about 15% (total over 8 cores) when all the conditions were right. Well, the excessive utilization estimate was pretty much spot on. What I did not expect was that the "total" CPU utilization indicated by Windows Task Manager was still at 60% and the processing time per packet was not really any better. Additionally, Task Manager now reports somewhere in the neighborhood of 60 threads...

Now this is not sitting very well with me at the moment (and we need to get some more info, but it was the end of the day). I expected minimal usage of the new system's CPUs, but at the same data rates its utilization isn't changing; something is amiss.

BTW, I can account for three of the six threads on the Core 2 Duo (main and two tbb_threads I manually create) so I'm assuming the pipeline is creating the other three.

Anyway, life keeps being interesting...

If you initialize TBB by default (with no explicit arguments passed to task_scheduler_init), it will create 1 worker thread on an Intel Core 2 Duo processor, and 7 worker threads on the second system you mentioned.

My guess about the ~60 threads is that you use another parallel library, such as OpenMP, either implicitly or explicitly, and it initializes its own thread pool for each thread that uses it. Thus the total number of threads is around the square of the number of cores.

Quoting - Alexey Kukanov (Intel)

If you initialize TBB by default (with no explicit arguments passed to task_scheduler_init), it will create 1 worker thread on an Intel Core 2 Duo processor, and 7 worker threads on the second system you mentioned.

My guess about the ~60 threads is that you use another parallel library, such as OpenMP, either implicitly or explicitly, and it initializes its own thread pool for each thread that uses it. Thus the total number of threads is around the square of the number of cores.

Yes, I forgot to mention that we are using the MKL, so it is probably generating some threads...but that's a lot of threads. The TBB thread count you mention is generally what I would have expected.

Anyway, it does not explain why the 8 core machine is being utilized at 60% to perform the same job that consumes 60% of a 2 core system.

Quoting - jimdempseyatthecove

James,

I was hoping that Alexey or Robert would be able to help you with your pipeline problem using TBB.

In QuickThread (a tool I wrote) we have two classes of tasks, compute and I/O, so the I/O portions of the pipeline application can be performed by the I/O class of threads. Also, although the "main" thread generally launches the thread pool, once the pool is launched there is no "main" thread, at least from a task scheduling viewpoint. The launching thread, which may be the main thread or another thread launched by the main thread, does have the responsibility of enqueuing the top-level task (or simply assuming the role of the top-level task). Other than that, there is no distinction. As long as any tasks remain to run, none of the threads will end. When task stealing cannot find a task to run (and after a tunable spinwait time), the thread suspends on WaitForSingleObject; it wakes when an appropriate task is enqueued. So other than for a short spinwait time, you won't see a long burn time when insufficient tasks are enqueued and threads are starved for work. In the case of the pipeline, as long as any tasks remain, be they I/O or compute class, the entire thread pool remains intact. When threads starve for work, they suspend; when new work arrives, they are awoken.

Rewriting the TBB sample upcase words test and running on a 4-core Q6600 (XP x64): 1,000,000 lines of "the quick brown fox jumped over the lazy grey dog's back nnn..." where nnn... is a sequence number to assure the upcased output buffers were collated properly (file size is 64.345 MB):

SerialRunUpcaseWordsTest =   5.440969 seconds
ParallelRunUpcaseWordsTest = 1.415301 seconds
Serial/Parallel = 3.844389

BTW, the I/O buffer sizes are 10,000 bytes (same as in the TBB example). A larger buffer might have yielded better performance (i.e., a multiple of the sector size). I have not run this against the equivalent TBB sample yet.

(Add files does not seem to open a browse dialog, else I would have attached the QuickThread version of the parallel pipeline upcase words test for you to see the difference in programming style.)

Jim Dempsey

Sounds interesting, Jim. So is QuickThread an enhancement to TBB or an additional component? What's the availability of QuickThread?

No, it is a separate threading tool with its own thread scheduler, templates, headers, etc.
If you Google

QuickThread site:intel.com

you will find a .PDF file uploaded several months ago.

Jim

www.quickthreadprogramming.com

Quoting - James Grimplin

Yes, I forgot to mention that we are using the MKL, so it is probably generating some threads...but that's a lot of threads. The TBB thread count you mention is generally what I would have expected.

Anyway, it does not explain why the 8 core machine is being utilized at 60% to perform the same job that consumes 60% of a 2 core system.

MKL uses OpenMP, which was not designed for compatibility with other threading packages. In OpenMP, each parallel region is executed by a "team" of threads. By default, if you have two parallel regions started concurrently by different threads, each one will be served by a separate OpenMP thread team. If each TBB worker thread calls an MKL function parallelized with OpenMP, there are 64 threads in the app. (TBB, on the contrary, uses a single thread pool, so if you called TBB algorithms from an OpenMP region, there would be 15 threads in total: the main thread, 7 OpenMP workers, and 7 TBB workers.)
In addition, in Intel's OpenMP implementation, by default threads do not park immediately after a parallel region ends, on the assumption that another one may start right after. So they idle-spin for a little while. This behavior is controlled by an environment variable, and possibly by an API function as well. I do not know whether MKL zeroes this active wait time for OpenMP or not.

It might make sense to use the single-threaded version of MKL, unless the outer-level parallelism exploited with TBB is not enough to utilize all processors and/or to provide proper load balancing. You also might reduce the number of threads in every OpenMP team (e.g. to 2) to minimize oversubscription. Generally, I would recommend you contact Intel Premier Support for help with using MKL functions from TBB.
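
For example, a minimal sketch (mkl_set_num_threads() is declared in mkl.h; the MKL_NUM_THREADS environment variable has the same effect):

#include "mkl.h"   // declares mkl_set_num_threads()

int main() {
    // Keep each MKL call serial so that TBB workers are the only source of
    // parallelism; avoids spawning one OpenMP team per TBB worker thread.
    mkl_set_num_threads( 1 );   // or set MKL_NUM_THREADS=1 in the environment
    // ... start TBB and run the pipeline as before ...
    return 0;
}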

As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning.

"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."

Alexey, thanks for some insight into MKL. We will attempt to place some control on MKL's configuration and see what happens relative to the threads. However, I am not following your comment on the system utilization. I originally had 2 cores being consumed at a total of 60%, generating a result on an input data packet in about 150 ms. I move to 8 cores, and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how much MKL is decomposing the work it is being given). But to imply that each core is being consumed at 60% and the final result is the same as on a 2-core machine does not add up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?

BTW, the algorithmic portion of this app is a port of existing code that used MKL on a 3-machine cluster of dual Intel Itaniums (SGI Altix). When it runs, it uses about 20% of the total CPU resources available (at least in one operational mode).

Quoting - James Grimplin

"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."

Alexey, thanks for some insight into MKL. We will attempt to place some control on MKL's configuration and see what happens relative to the threads. However, I am not following your comment on the system utilization. I originally had 2 cores being consumed at a total of 60%, generating a result on an input data packet in about 150 ms. I move to 8 cores, and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how much MKL is decomposing the work it is being given). But to imply that each core is being consumed at 60% and the final result is the same as on a 2-core machine does not add up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?

BTW, the algorithmic portion of this app is a port of existing code that used MKL on a 3-machine cluster of dual Intel Itaniums (SGI Altix). When it runs, it uses about 20% of the total CPU resources available (at least in one operational mode).

James,

I think you missed Alexey's remark regarding the OpenMP block time.

If, for a silly example, the block time is set to 200 ms, and you run a #pragma omp parallel, then a #pragma omp for with a qualifier on it that results in one thread using the for loop, then here is what you have:

8 threads start to run the parallel region; one thread is selected to run the loop; seven of the threads skip over the loop and, due to the block time of 200 ms, will sit burning CPU for up to 200 ms while waiting for the thread processing the loop to complete. Should the loop take longer than 200 ms (not likely), then your system lost 7 x 200 ms of computation time. Should the loop take 1 ms, and if the time spent by the controlling thread is 10 ms before it returns to the same section of code (without intervening parallel regions), then those seven other threads will burn those 11 ms doing nothing. Upon entry to the parallel region, those seven threads will have their block time reset to 200 ms.

So in this obscure hypothetical case, you will burn 100% of eight cores when the work was actually being done with one core.
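
To make the shape of that hypothetical concrete, a sketch (do_the_loop_body() is a placeholder payload; the single-iteration loop stands in for the "qualifier that results in one thread using the for"):

#include <omp.h>

void do_the_loop_body();   // hypothetical payload

void silly_example() {
    #pragma omp parallel       // e.g. 8 threads enter the parallel region
    {
        #pragma omp for schedule(static) nowait
        for( int i = 0; i < 1; ++i )   // only one thread gets an iteration
            do_the_loop_body();
        // with nowait, the other 7 threads fall through immediately...
    }
    // ...and at the region's end they spin-wait for up to KMP_BLOCKTIME
    // (default 200 ms) before going to sleep.
}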

This is not to say blocking time isn't important. Correct use of blocking time results in greater throughput for the application (at the expense of available CPU time for other applications).

http://comp.astro.ku.dk/Twiki/view/CompAstro/IntelOpenMP?skin=print.pattern

Here is a link to a short answer.

I do not know why this is not so easy to find in the documentation.

Jim Dempsey

www.quickthreadprogramming.com

"I think you missed Alexey's remark regarding the OpenMP block time."

Well, I did not really miss it; I just did not think MKL would perform to this extreme. We have run some more tests, and here is the summary:

10+ KB packet, MKL_NUM_THREADS=1 => 9 threads, 10% utilization, 75 ms/packet
10+ KB packet, MKL_NUM_THREADS=2 => 18 threads, 22% utilization, 79 ms/packet
10+ KB packet, MKL_NUM_THREADS=4 => 34 threads, 40% utilization, 80 ms/packet
10+ KB packet, MKL_NUM_THREADS=8 => 66 threads, 65% utilization, 80 ms/packet

100 KB packet, MKL_NUM_THREADS=1 => 9 threads, 12% utilization, 146 ms/packet
100 KB packet, MKL_NUM_THREADS=2 => 18 threads, 25% utilization, 150 ms/packet
100 KB packet, MKL_NUM_THREADS=4 => 34 threads, 45% utilization, 150 ms/packet

We also tried the KMP_LIBRARY and KMP_BLOCKTIME environment variables, but they did not seem to have any effect. It could be that we are specifying them incorrectly. I could not find anything in the Intel MKL documentation related to these two environment variables (yet).

So it appears that in this latest configuration, MKL is the major contributor to idle spinning and not TBB.

For our application, it does not appear that MKL (via OpenMP) parallelization in conjunction with TBB is effective at improving overall performance.

Quoting - James Grimplin
We have run some more tests and here is the summary:

10+ KB packet, MKL_NUM_THREADS=1 => 9 threads, 10% utilization, 75 ms/packet
10+ KB packet, MKL_NUM_THREADS=2 => 18 threads, 22% utilization, 79 ms/packet

[snip]

So it appears that in this latest configuration, MKL is the major contributor to idle spinning and not TBB.

A couple of observations here. Not all of MKL is threaded. MKL uses threading for those operations where they expect parallel code to gain performance. I didn't see anything specific about what part of MKL is being used in this test, but the time-per-packet numbers suggest that the portion of MKL being employed runs serially.

The other observation regards Jim's comments on KMP_BLOCKTIME. First, it is a KMP (Intel extension) option, not an OpenMP option, so it's probably not something used by MKL. Also, Jim suggests Intel OpenMP "...will sit burning CPU for up to 200 ms while waiting for the thread processing the loop to complete...." It's actually a little worse than that. The KMP_BLOCKTIME interval starts when the threaded loop completes; that is, after all the threads have made their rendezvous following the loop. Its purpose is to keep threads from paying the penalty of going to sleep and then waking up in regions of code comprising several parallel sections separated by short bursts of serial code. And the default, last time I checked, was 200 ms, which is a long time these days. With the speed of modern computers, I've recommended a setting of 1, 1 ms being sufficient time to hide lots of serial code. At least one application I know of uses a KMP_BLOCKTIME setting of 1 in a couple of places where they have chains of OpenMP serial code, and 0 otherwise (to keep other parallel pools like TBB from yielding while OpenMP spins).

>>The KMP_BLOCKTIME interval starts when the threaded loop completes

!?!?!

DoDinkyFunctionAllThreads();

DoOtherDinkyFunctionAllThreads();

Note, the above does not have a barrier after the parallel for loop.
Six of the 8 threads will do back-to-back function calls (bypassing the for loop).
What you said amounts to those six threads burning cycles for 10 minutes.

I believe the 200 ms timer gets set in one of these ways:
a) for each thread, when it reaches the end of the parallel region
b) for each thread, when/as any thread in the team reaches the end of the parallel region
(same for waiting at internal barriers, etc.)

Those two choices would make sense. This is not to say software vendors do things that make sense.

Note, option b) extends all threads' (from that team) timeouts to +200 ms from the last thread reaching the end of the parallel region. This would seem to make sense.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
>>The KMP_BLOCKTIME interval starts when the threaded loop completes

!?!?!

DoDinkyFunctionAllThreads();

DoOtherDinkyFunctionAllThreads();

Note, the above does not have a barrier after the parallel for loop.
Six of the 8 threads will do back-to-back function calls (bypassing the for loop).
What you said amounts to those six threads burning cycles for 10 minutes.

I believe the 200 ms timer gets set in one of these ways:
a) for each thread, when it reaches the end of the parallel region
b) for each thread, when/as any thread in the team reaches the end of the parallel region
(same for waiting at internal barriers, etc.)

Sorry, I should have said "at the end of the parallel section," not "when the threaded loop completes" though for cases simpler than Jim's example, these may occur at the same time.

Here's the description from the icl 11.0.074 compiler documentation for KMP_LIBRARY "throughput" mode:

The throughput mode allows the program to detect its environment conditions (system load) and adjust resource usage to produce efficient execution in a dynamic environment.

In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads yield to other threads while waiting for more parallel work.

After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Until more parallel work becomes available, sleeping allows processor and resources to be used for other work by non-OpenMP threaded code that may execute between parallel regions, or by other applications.

The amount of time to wait before sleeping is set either by the KMP_BLOCKTIME environment variable or by the kmp_set_blocktime() function. A small blocktime value may offer better overall performance if your application contains non-OpenMP threaded code that executes between parallel regions. A larger blocktime value may be more appropriate if threads are to be reserved solely for use for OpenMP execution, but may penalize other concurrently-running OpenMP or threaded applications.

And the description of KMP_BLOCKTIME:

Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.

Use the optional character suffixes: s (seconds), m (minutes), h (hours), or d (days) to specify the units.

Specify infinite for an unlimited wait time.

See also the throughput execution mode and the KMP_LIBRARY environment variable.

These descriptions seem to favor Jim's option a): each team thread, upon reaching the end of its parallel section, will wait for KMP_BLOCKTIME (minimum 0, minimum discrete interval 1 ms, default 200 ms) before going to sleep.
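
Per the documentation quoted above, the same setting is also available from code; a minimal sketch, assuming the Intel compiler's omp.h, which declares the kmp_set_blocktime() extension:

#include <omp.h>   // with the Intel compiler this also declares kmp_set_blocktime()

int main() {
    kmp_set_blocktime( 1 );   // spin at most 1 ms before sleeping; equivalent
                              // to KMP_BLOCKTIME=1 in the environment
    // ...
    return 0;
}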

Quoting - Robert Reed (Intel)

A couple of observations here. Not all of MKL is threaded. MKL uses threading for those operations where they expect parallel code to gain performance. I didn't see anything specific about what part of MKL is being used in this test, but the time-per-packet numbers suggest that the portion of MKL being employed runs serially.

The other observation regards Jim's comments on KMP_BLOCKTIME. First, it is a KMP (Intel extension) option, not an OpenMP option, so it's probably not something used by MKL. Also, Jim suggests Intel OpenMP "...will sit burning CPU for up to 200 ms while waiting for the thread processing the loop to complete...." It's actually a little worse than that. The KMP_BLOCKTIME interval starts when the threaded loop completes; that is, after all the threads have made their rendezvous following the loop. Its purpose is to keep threads from paying the penalty of going to sleep and then waking up in regions of code comprising several parallel sections separated by short bursts of serial code. And the default, last time I checked, was 200 ms, which is a long time these days. With the speed of modern computers, I've recommended a setting of 1, 1 ms being sufficient time to hide lots of serial code. At least one application I know of uses a KMP_BLOCKTIME setting of 1 in a couple of places where they have chains of OpenMP serial code, and 0 otherwise (to keep other parallel pools like TBB from yielding while OpenMP spins).

We are not using much of MKL, very little in fact. We are using the DFTI and VSL components. I have not evaluated the extent of use of these MKL components at this point (the code already existed).

We did run a test earlier in which we determined that setting KMP_BLOCKTIME to 1 ms or less (we have run at 100 us) greatly improves utilization (eliminating the burning of CPU cycles). We get down to 15-20% utilization in all MKL_NUM_THREADS cases. (The release notes for the Intel compiler briefly mention adjusting KMP_BLOCKTIME; not that I am currently using the Intel compiler, but I could be if it made a difference.)

So we have made headway on the MKL front, but TBB is still burning cycles, about 12% on the eight-core machine.

Hi, I have seen similar issues with the tbb pipeline spinning at 100% CPU when idle.
The first stage of my pipeline pulls data from a blocking queue. Data arrives in bursts (1000s a second), then none for many seconds.

I was interested to know whether any of the ideas suggested above actually worked?

Quoting - derbai
Hi, I have seen similar issues with the tbb pipeline spinning at 100% CPU when idle.
The first stage of my pipeline pulls data from a blocking queue. Data arrives in bursts (1000s a second), then none for many seconds.

I was interested to know whether any of the ideas suggested above actually worked?

And as a follow-up question:

I am thinking about having a number of live pipelines within the same app. Each pipeline would have as its input stage a bursty data set.
Presumably, the approach of sending dummy items down the pipeline until the main thread is located only works in the case of a single pipeline?

"Presumably, the approach of sending dummy items down the pipeline until the main thread is located only works in the case of a single pipeline?"
That was just an idea for a possible solution but without guarantee, and I think I even indicated that it could also exacerbate the problem sometimes, so it's up to you to decide whether you want to experiment with this. When running multiple pipelines at the same time, from different tbb_threads (or equivalent), there's an additional problem of starvation of some of the pipelines, and I'm not sure whether that issue has been solved yet (this has been discussed elsewhere in this forum).

What should always work is to loop over finite batches and run each to completion using TBB, but then you have to trade off low latency vs. good efficiency (bigger batches have higher latency but lead to better efficiency, and vice versa). Left to its own devices, TBB does not provide global progress, and is intentionally unfair in order to provide the best throughput for finite jobs. Perhaps you could dynamically decide batch size: monitor TBB for latency (but how, and how scalable could this be?), and when starvation is suspected cut off all input until the stuck task is sucked through.
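
As a sketch of that batch idea, assuming the old-style filter API and two hypothetical helpers (next_item() popping from the blocking queue, more_input() blocking cheaply until data exists):

#include "tbb/pipeline.h"
#include <cstddef>

void* next_item();   // hypothetical: pops one item; returns NULL when the burst ends
bool more_input();   // hypothetical: blocks until input exists, or false at shutdown

class BatchInputFilter: public tbb::filter {
    size_t remaining;
    const size_t batch_size;
public:
    BatchInputFilter( size_t batch ) :
        tbb::filter(serial_in_order), remaining(0), batch_size(batch) {}
    void start_batch() { remaining = batch_size; }
    /*override*/ void* operator()(void*) {
        if( remaining == 0 )
            return NULL;        // batch done: pipeline drains, run() returns
        --remaining;
        return next_item();     // NULL here also ends the batch early
    }
};

void pump_batches( tbb::pipeline& p, BatchInputFilter& in, size_t tokens ) {
    while( more_input() ) {     // bigger batches: better efficiency, higher latency
        in.start_batch();
        p.run( tokens );        // run one finite batch to completion
    }
}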
