Best practices for improving performance of TBB

I'm facing some performance issues with TBB. One tip I've found is to use local variables inside loops, but my code is still slow. Can you share some tips or best practices for improving performance?

Hi,

For best practices, I think the best approach is to read a book, such as http://www.manning.com/williams/

In your specific case, post some code so the people here have something to work with :)

Cheers!

My code consists of simple uses of concurrent_vector, concurrent_hash_map, parallel_for, and parallel_sort. It produces output for a very small input (5000 lines of text), but with a larger input (around 18 MB of text) it takes ages, consumes more and more memory, and eventually throws bad_alloc. The sequential version (without TBB or any parallel library) runs in the blink of an eye even on the larger input. Any idea why this is happening?

BTW, I've ported this code from TBB to OpenMP. Everything seems fine now.

Just enumerating the parts of Intel Threading Building Blocks that your code happens to be using doesn't really provide us with much insight.  Perhaps some further analysis is called for.  If you have a copy of VTune Amplifier (evaluation copies should be available) and can run a basic hot spot analysis on your test code using Intel TBB, that would at least identify where in the code the computer is spending its time.  Without more details, we are reduced to idle speculation about what might be going on.

Yes, I know that. I've found the culprit: it was parallel_for. After I removed it, everything ran instantly. There are several parallel_for calls, but trust me, each is just one or two lines of code. Most of them just call some function on every element of a concurrent_vector or concurrent_hash_map. Here are some of the functors called by parallel_for.

class ParallelMaximumScoreFinder
{
public:
    ParallelMaximumScoreFinder(tbb::concurrent_hash_map<std::string, Clusters, StringComparer>& clusterTable)
        : clusterTable(clusterTable)
    {
    }
    virtual ~ParallelMaximumScoreFinder()
    {
    }

    void operator()( const tbb::blocked_range<size_t>& r ) const
    {
        for(auto& pair : clusterTable)
        {
            pair.second.FindMaximumScore();
        }
    }
private:
    tbb::concurrent_hash_map<std::string, Clusters, StringComparer>& clusterTable;
};

class ParallelScoreComputer
{
public:
    ParallelScoreComputer(tbb::concurrent_hash_map<std::string, Clusters, StringComparer>& clustersTable)
        : clustersTable(clustersTable)
    {
    }

    virtual ~ParallelScoreComputer()
    {
    }

    void operator()( const tbb::blocked_range<size_t>& r ) const
    {
        for(auto& pair : clustersTable)
        {
            pair.second.ComputeScore();
        }
    }
private:
    tbb::concurrent_hash_map<std::string, Clusters, StringComparer>& clustersTable;
};

class ParallelClusterCreator
{
public:
    ParallelClusterCreator(tbb::concurrent_vector<Operation>& operations,
                           tbb::concurrent_hash_map<std::string, Clusters, StringComparer>& table,
                           std::shared_ptr<Zone::ScoreComputer> computer)
        : computer(computer), operations(operations), table(table)
    {}
    virtual ~ParallelClusterCreator()
    {}

    void operator()( const tbb::blocked_range<size_t>& r ) const
    {
        for(auto& operation : operations)
        {
            tbb::concurrent_hash_map<std::string, Clusters, StringComparer>::accessor a;
            if(table.find(a, operation.GetKey()))
            {
                a->second.AddOperation(operation);
            }
            else
            {
                Clusters clusters(operation.GetKey(), computer);
                table.insert(std::make_pair(operation.GetKey(), clusters));
            }
        }
    }

private:
    std::shared_ptr<Zone::ScoreComputer> computer;
    tbb::concurrent_vector<Operation>& operations;
    tbb::concurrent_hash_map<std::string, Clusters, StringComparer>& table;
};

class ParallelFileReader
{
public:

    ParallelFileReader(const tbb::concurrent_vector<std::string>& filePaths,
                       tbb::concurrent_vector<tbb::concurrent_vector<std::string>>& fileLines)
        : filePaths(filePaths), fileLines(fileLines)
    {}
    virtual ~ParallelFileReader()
    {}

    void operator()( const tbb::blocked_range<size_t>& r ) const
    {
        for(auto& filePath : filePaths)
        {
            FileReader reader(filePath);
            tbb::concurrent_vector<std::string> lines(reader.ReadAll());
            fileLines.push_back(lines);
        }
    }

private:
    const tbb::concurrent_vector<std::string>& filePaths;
    tbb::concurrent_vector<tbb::concurrent_vector<std::string>>& fileLines;
};

Well, no wonder your parallel_for doesn't work. Where's the parallel? Each of the sample functors listed above declares a blocked_range parameter and then never uses it. It looks to me as though the parallel_for calls for these functors spin up the thread pool, and then every invocation of the functor does the entire job rather than dividing the work between threads. It's no wonder that Intel TBB struggled to beat your serial implementation of this code, and that it ate up all the resources you mentioned at the beginning. I'd suggest you spend a little more time with our documentation to get a better idea of how Intel TBB parallel_for works.
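
To make the point about the unused blocked_range concrete, here is a minimal sketch of the difference, not a fix for the posted classes. It deliberately avoids TBB so that it builds with only the standard library: IndexRange and run_in_chunks are hypothetical stand-ins for tbb::blocked_range<size_t> and tbb::parallel_for, and run_in_chunks simply calls the body once per subrange, one after another, where TBB would hand the subranges to worker threads. The two counting functions are also made up for illustration. The first mirrors the posted functors and ignores its range; the second visits only the indices it was given.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>: a half-open index range.
struct IndexRange {
    std::size_t first;
    std::size_t last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical stand-in for tbb::parallel_for: splits [0, n) into `chunks`
// subranges and calls body once per subrange. The calls run sequentially here;
// TBB would distribute them across its worker threads.
template <typename Body>
void run_in_chunks(std::size_t n, std::size_t chunks, const Body& body) {
    assert(chunks > 0);
    for (std::size_t c = 0; c < chunks; ++c) {
        IndexRange r{n * c / chunks, n * (c + 1) / chunks};
        body(r);
    }
}

// Mirrors the posted functors: the range argument is ignored, so every
// invocation walks the whole container.
std::size_t visits_ignoring_range(const std::vector<int>& items, std::size_t chunks) {
    std::size_t visits = 0;
    run_in_chunks(items.size(), chunks, [&](const IndexRange&) {
        for (const int& item : items) {
            (void)item;
            ++visits;
        }
    });
    return visits;
}

// Each invocation touches only the indices in [r.begin(), r.end()).
std::size_t visits_using_range(const std::vector<int>& items, std::size_t chunks) {
    std::size_t visits = 0;
    run_in_chunks(items.size(), chunks, [&](const IndexRange& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            (void)items[i];
            ++visits;
        }
    });
    return visits;
}
```

With 1000 items split into 4 subranges, the range-aware version performs 1000 element visits in total, while the version that ignores its range performs 4000, because each invocation repeats the whole job. In the posted ParallelFileReader the same pattern would mean every invocation re-reads and re-appends every file, which is consistent with the memory growth described earlier. In real TBB code the body would take const tbb::blocked_range<size_t>& r, loop from r.begin() to r.end(), and be passed to tbb::parallel_for(tbb::blocked_range<size_t>(0, n), body).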
