Large overhead in TBB parallel_reduce algorithm

Hello,

When I use parallel_reduce on this simple example, I observe a huge overhead in Amplifier:

double ParallelReduceCompute(uint32_t nb_steps)
{
	double step = 1. / nb_steps;

	double pi =
		tbb::parallel_reduce(
			tbb::blocked_range<uint32_t>(0, nb_steps, 1000),
			double(0), // identity element for summation

			// Transformation : f(x) = 4 / (1 + x²)
			[=](const tbb::blocked_range<uint32_t>& r, double current_sum) -> double
			{
				for (uint32_t i = r.begin(); i != r.end(); ++i)
				{
					double x = (i + 0.5)*step;
					current_sum += 4.0 / (1.0 + x*x);
				}

				return current_sum;
			},

			// Reduction : Sum(f(x)dx)
			std::plus<double>()
		);

	pi *= step;

	return pi;
}

I changed the grainsize and/or the partitioner, but I don't understand why I get such a huge overhead.

I have an 8-CPU machine, and I get only a 4x speedup.

Thanks in advance for your help!

Hi Alain,

Could you attach the picture from VTune Amplifier showing huge overhead, please? What is the model of your CPU?

Regards, Alex

Hi Alex,

The CPU is a Core i7, code-named Haswell (i7-4800MQ @ 2.70 GHz).

Regards.

Alain.

Attachment: Capture.PNG (image/png, 35.79 KB)

Hi Alain,

Thank you for the information. VTune Amplifier shows really strange numbers. I will contact the VTune Amplifier team to investigate the issue.

As for your CPU, it has 4 cores with 2 hyper-threads per core (8 hardware threads in total). The algorithm is compute-bound, and I suppose even a single thread can fully utilize a core's FPU, so the second hyper-thread has little opportunity to extract additional performance. Therefore, a 4x speedup seems a very good result.
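
One way to check the hyper-threading point on a given machine is to run the same computation with an explicit thread count and compare the timings for 1, 4 and 8 threads. The following is only a minimal sketch under stated assumptions: it uses plain std::thread instead of TBB so the thread count can be set directly, and the names PartialSum and ThreadedCompute are hypothetical helpers, not part of TBB or of the original code.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Midpoint-rule partial sum of f(x) = 4 / (1 + x^2) over steps [first, last).
// Hypothetical helper, same arithmetic as the loop in the original post.
double PartialSum(uint32_t first, uint32_t last, double step)
{
	double sum = 0.0;
	for (uint32_t i = first; i != last; ++i)
	{
		double x = (i + 0.5) * step;
		sum += 4.0 / (1.0 + x * x);
	}
	return sum;
}

// Hypothetical helper: splits [0, nb_steps) into nb_threads contiguous
// chunks, runs one std::thread per chunk, then adds the partial sums.
double ThreadedCompute(uint32_t nb_steps, unsigned nb_threads)
{
	assert(nb_steps > 0 && nb_threads > 0);

	double step = 1.0 / nb_steps;
	std::vector<double> partial(nb_threads, 0.0);
	std::vector<std::thread> workers;

	for (unsigned t = 0; t < nb_threads; ++t)
	{
		// 64-bit intermediate avoids overflow in nb_steps * t.
		uint32_t first = static_cast<uint32_t>(uint64_t(nb_steps) * t / nb_threads);
		uint32_t last = static_cast<uint32_t>(uint64_t(nb_steps) * (t + 1) / nb_threads);

		// Each thread writes its own slot exactly once, so no locking is needed.
		workers.emplace_back([&partial, t, first, last, step] {
			partial[t] = PartialSum(first, last, step);
		});
	}

	for (std::thread& w : workers)
		w.join();

	double sum = 0.0;
	for (double p : partial)
		sum += p;

	return sum * step;
}
```

Timing ThreadedCompute(nb_steps, n) for n = 1, 4 and 8 (for example with std::chrono::steady_clock) shows how much the second hyper-thread of each core adds for this loop. std::thread::hardware_concurrency() typically reports the number of logical processors, which would be 8 on this CPU, not the number of physical cores. On Linux with GCC or Clang the program usually needs the -pthread option.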

Regards, Alex

Hi Alex,

Thanks for your answer!

Two more pieces of information: I get an 8x speedup when I switch to the Intel compiler, and I just checked that this overhead problem only occurs with an x86 target, not with x64. With an x64 target, VTune gives me perfect results with no overhead at all.

Regards, Alain.
