Why does the multithreaded code degrade performance?

There are two matrix classes, mat and pmat. The following is the code fragment.

class mat {
public:
    void lu();
    ...
};

class pmat {
public:
    mat **obMatPtr;

    // Called by parallel_for for each sub-range of block indices.
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            mat *loMatPtr = obMatPtr[i];
            loMatPtr->lu();
        }
    }
    ...
};

pmat loPmat;
parallel_lu(...)
{
    parallel_for(blocked_range<int>(0, nblocks), loPmat, auto_partitioner());
}

The code works correctly. I first ran the program with one thread on a dual-core machine
(tbb::task_scheduler_init init(deferred), ..., init.initialize(1)). Performing lu on
block 0 took 118 seconds. When I ran it with two threads (init.initialize(2)), the wall-clock
time rose to 179 seconds.

Where did the additional 61 seconds come from? I timed only the lu computation for one specific block,
that is, the time spent executing loMatPtr->lu(). That should be the same no matter how many physical
threads are available. I also do not think it has anything to do with the overhead of thread creation or
the implicit synchronization at the end of parallel_for.

Could anybody tell me the reason and how to improve the performance?


A possible cause of the degradation is false sharing. Try padding your structure like this:

class mat {
void lu();
...
char pad [128];
};
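
Another way to express the same idea, if your compiler supports C++11, is alignas instead of a manual pad array. This is a sketch only: PaddedSlot and the helper functions are hypothetical names, and the 64-byte line size is an assumption that you should check against your CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common on x86 but is not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot. alignas rounds sizeof(PaddedSlot) up to a
// multiple of kCacheLine, so adjacent array elements never share a line.
struct alignas(kCacheLine) PaddedSlot {
    double value = 0.0;  // stand-in for data written by a single thread
};

static_assert(sizeof(PaddedSlot) % kCacheLine == 0, "slot must fill whole lines");
static_assert(alignof(PaddedSlot) == kCacheLine, "slot must start on a line boundary");

// Index of the cache line that contains the first byte of *p.
inline std::uintptr_t cache_line_of(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True if the two objects start on different cache lines.
inline bool on_separate_lines(const void* a, const void* b) {
    return cache_line_of(a) != cache_line_of(b);
}
```

Note that this only affects objects that sit next to each other in memory, such as array elements. Objects allocated separately on the heap are placed wherever the allocator decides.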

I modified the code as you suggested. Unfortunately, there was no improvement. mat.lu() makes many malloc calls, and I guess that could be
the problem. I tried both scalable_malloc from TBB and Hoard, a lock-free malloc library, and I also used memalign to avoid false sharing, but the performance stayed the same.
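
If the allocations inside lu() are temporary work buffers, one thing that might be worth trying is reusing a per-thread buffer instead of calling malloc each time. The following is a minimal sketch under that assumption; thread_scratch and sum_of_squares are hypothetical names and do not come from the code above, and it needs C++11 for thread_local.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a per-thread scratch buffer holding at least
// n doubles. The buffer grows on demand and is reused otherwise, so repeated
// calls on one thread do not go back to the allocator.
inline double* thread_scratch(std::size_t n) {
    thread_local std::vector<double> buffer;
    if (buffer.size() < n) {
        buffer.resize(n);
    }
    return buffer.data();
}

// Stand-in for a kernel that needs temporary storage on every call.
inline double sum_of_squares(const double* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    double* tmp = thread_scratch(n);  // no allocation once the buffer is large enough
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = x[i] * x[i];
    }
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += tmp[i];
    }
    return total;
}
```

Because each thread owns its buffer, no lock or shared allocator state is touched after the first few calls. Whether this helps depends on how lu() actually uses its allocations.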

Quoting - Dmitriy Vyukov
Possible cause of degradation is false-sharing. Try to pad your structure as:

class mat {
void lu();
...
char pad [128];
};

You could try the following brute-force approach: run the single-threaded version under a profiler, run the multithreaded version under a profiler, and compare the two profiles to identify which parts of the code take longer in the multithreaded version.
For example:
single-threaded version:
func1() - 40%
func2() - 30%
func3() - 30%

multi-threaded version:
func1() - 80%
func2() - 10%
func3() - 10%

In that case the problem is definitely in func1().

Once you have identified the problematic function, drill down to the machine-code level.
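
The comparison step can also be done mechanically. Here is a rough sketch that takes two profiles as maps from function name to percentage of run time and reports the function whose share grew the most. Profile and largest_growth are hypothetical names, and reading the numbers out of your profiler is left to you.

```cpp
#include <cassert>
#include <map>
#include <string>

// Function name -> share of total run time, in percent.
using Profile = std::map<std::string, double>;

// Returns the function whose share increased the most from `single` to
// `multi`, or an empty string if no share increased. A function missing
// from `single` is treated as having a share of 0.
inline std::string largest_growth(const Profile& single, const Profile& multi) {
    std::string worst;
    double worst_delta = 0.0;
    for (const auto& entry : multi) {
        assert(entry.second >= 0.0);  // shares cannot be negative
        const auto it = single.find(entry.first);
        const double before = (it != single.end()) ? it->second : 0.0;
        const double delta = entry.second - before;
        if (delta > worst_delta) {
            worst_delta = delta;
            worst = entry.first;
        }
    }
    return worst;
}
```

With the example numbers above, this picks func1(). Keep in mind that percentages are relative: if the two runs have different total times, it is safer to convert the shares to absolute seconds before comparing.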

Quoting - zhongzel@gmail.com

As you mentioned, mat.lu() accesses a shared resource. That is why the parallel tasks take so much time: each one has to wait for the others to finish, whether the algorithm is lock-free or lock-based.
