Regarding scalability with number of threads of a simple CnC program

san_isn

Hi, I wrote a simple program that just generates tags and processes them (attached below). However, its performance degrades when I increase the number of threads. I would very much appreciate any insight as to why this happens and how to address the scalability issue.
Thanks,
Sandeep

//Stack.cnc
<int l1stack>;
<int l2stack>;
<l1stack> :: (l1compute);
<l2stack> :: (l2compute);
env -> <l1stack>;
(l1compute) -> <l2stack>;

//Stack.cpp
#include <ctime>
#include <iostream>
#include "stack.h"
#include <cnc/cnc.h>

// Create an instance of the context class which defines the graph
stack_context c;
int ctr = 0;

int main(int argc, char** argv)
{
    clock_t start, end;
    double elapsed;
    start = clock();
    for(int j = 0; j < 4; ++j)
    {
        for(int i = 0; i < 3000000; ++i)
        {
            c.l1stack.put(j*3000000 + i);
        }
    }
    c.wait();
    end = clock();
    elapsed = ((double) (end - start)) / CLOCKS_PER_SEC;
    std::cout << "Elapsed " << elapsed << std::endl;
}

int l1compute::execute(const int & t, stack_context & c) const
{
    c.l2stack.put(t);
    return CnC::CNC_Success;
}

int l2compute::execute(const int & t, stack_context & c) const
{
    return CnC::CNC_Success;
}

Frank Schlimbach (Intel)

Hi Sandeep,
Not sure why performance goes down, but if the steps are doing nothing, then the scalability bottleneck is obviously the concurrent use of the tag collection. How many threads are you using?

There is one feature in the API (but not in the spec syntax yet) to reduce the overhead of CnC. We call it tag-ranges (even though the better term would be tag-sets). It is similar in spirit to TBB's ranges. Instead of putting individual tags, the API also accepts putting a bunch of tags at once and takes care of partitioning the tag-space internally.

In your example, you could do the following:

1. When declaring the tag_collection:
typedef tbb::blocked_range< int > int_range;
CnC::tag_collection< int, int_range > l1stack;

2. In the env code, replace your nested loop with

for(int j = 0; j < 4; ++j)
{
c.l1stack.put_range( int_range( j*3000000, j*3000000+3000000 ) );
}

or, even better, with just

c.l1stack.put_range( int_range( 0, 4*3000000 ) );

This will reduce the pressure on l1stack, but not on l2stack. A slightly more advanced version of tag-ranges could be used to also reduce the overhead on l2stack. But first, please let us know whether the above would be a feasible approach in general.

san_isn

Hi Frank,
Thanks for taking a look. Unfortunately, loading tags in bulk wouldn't be an option. I thought maybe the tag generation was too fast for the synchronization to keep up, so I changed the code to read tags randomly from a vector (to add memory latency). Unfortunately, this didn't help either. I have now tried on two different CPUs. I guess I will wait for newer hardware from Intel for this code to work.
-Sandeep
