Regarding scalability with number of threads of a simple CnC program

Hi,

I wrote a simple program that simply generates tags and processes them (attached below). However, its performance degrades when I increase the number of threads. I would very much appreciate any insight as to how/why to address this scalability issue.

Thanks
Sandeep

// Stack.cnc
<int l1stack>;
<int l2stack>;
<l1stack> :: (l1compute);
<l2stack> :: (l2compute);
env -> <l1stack>;
(l1compute) -> <l2stack>;

// Stack.cpp
#include <iostream>
#include <ctime>
#include "stack.h"

// Create an instance of the context class which defines the graph
stack_context c;
int ctr = 0;

int main(int argc, char** argv)
{
    clock_t start, end;
    double elapsed;
    start = clock();
    for(int j = 0; j < 4; ++j) {
        for(int i = 0; i < 3000000; ++i) {
            c.l1stack.put(j*3000000 + i);
        }
    }
    c.wait();
    end = clock();
    // Note: clock() measures CPU time summed over all threads, not wall-clock time
    elapsed = ((double)(end - start)) / CLOCKS_PER_SEC;
    std::cout << "Elapsed " << elapsed << std::endl;
    return 0;
}

int l1compute::execute(const int & t, stack_context & c) const
{
    c.l2stack.put(t);
    return CnC::CNC_Success;
}

int l2compute::execute(const int & t, stack_context & c) const
{
    return CnC::CNC_Success;
}


Hi Sandeep,
not sure why performance goes down, but if the steps are doing nothing, then the scalability bottleneck is obviously the concurrent use of the tag-collection. How many threads are you using?

There is one feature in the API (but not in the spec syntax (yet)) to reduce the overhead of CnC. We call it tag-ranges (even though the better term would be tag-sets). It is similar in spirit to TBB's ranges. Instead of putting individual tags, the API also accepts putting a bunch of tags at once and takes care of partitioning the tag-space internally.

In your example, you could do the following:

1. When declaring the tag_collection:
typedef tbb::blocked_range< int > int_range;
CnC::tag_collection< int, int_range > l1stack;

2. In the env code, replace your nested loop with

for(int j = 0; j < 4; ++j)
c.l1stack.put_range( int_range( j*3000000, j*3000000+3000000 ) );

or, even better, with just

c.l1stack.put_range( int_range( 0, 4*3000000 ) );

This will reduce the pressure on l1stack, but not on l2stack. A slightly more advanced use of tag-ranges could also reduce the overhead on l2stack. But first, please let us know whether the above would be a feasible approach in general.

Hi Frank,

Thanks for taking a look. Unfortunately, loading tags in bulk wouldn't be an option. I thought maybe tag generation is too fast for the synchronization to keep up, so I changed the code to read tags randomly from a vector (to add memory latency). Unfortunately, this didn't help either. I have now tried on two different CPUs. I guess I will wait for newer hardware from Intel for this code to work.

-Sandeep
