Why does parallel_for not scale when moving from 6-core to 32-core machine?

Michael Uelschen:

Hello

I'm going crazy with parallel_for in TBB 3 under Linux.

I have a (more or less) simple matrix calculation with similar operations for each element. Therefore I thought that this would be easy to parallelize, and I expected good scaling. It works fine on my 6-core machine, but when I run the code on a 32-core system (MTL) the performance is similar to my 6-core machine.

I tried to figure out the load balancing using the top command from the shell, and I observed that I did not get fully loaded cores/processors over time. It varies strongly, like 1-7-30-7-11-30-... It should be 32 all the time (there is no other load except the OS).

I got mad with this partitioning inside TBB. There is too much "magic inside" for me. If I have 1000 columns
in my matrix and 32 processors, I need 32 chunks of 1000/32 items each. But the recursive splitting does something else, and I observed that I get both large chunks and very small ones. Why?

Does the partitioning work like this? (I don't think so; just for double-checking purposes.)

1000 --> split 2x500
Assign 500 to the first thread and continue to split the other 500.
500 --> split to 2x250
Assign 250 to the second thread and continue to split the other 250
250 --> split to 2x125
and so on..

In this case the first thread has 50% of the workload, the second thread 25%, and so on. This would at least explain my load pattern, since the threads with small chunks have to wait for the threads with large chunks.

I tried to modify the grainsize and also the {simple|auto|affinity}_partitioner, without any success.

  • How can I force parallel_for to create P similar chunks of size N/P and distribute them over P threads, where P is the number of available cores/processors and N the number of columns/rows of my matrix?
  • Is there a better way to observe the load balancing than using top (without using Intel's complete tool chain)?
  • Has anybody observed a similar "bad scaling" phenomenon with parallel_for?

Any other hint would be also helpful!

Thank you.

Best regards,
Michael

Raf Schietekat:

(Removed non-essential remark) You may see a big difference by partitioning rows instead of columns, perhaps by transposing the matrix, to get rid of false sharing. No need to set grainsize; just let the default auto_partitioner take care of that. To literally see what the partitioner does, have a look at the tachyon example included with TBB (the second one); understanding it requires some more effort, but you're not going to have one thread take half the range for itself unless that's how you set the grainsize (simple_partitioner just keeps cutting any subrange that's larger than the grainsize; auto_partitioner takes more time to explain than I have right now).

(Added) To convince yourself that you won't have big chunks, just log their sizes, you probably won't see anything larger than 2 with auto_partitioner (which starts with a number of chunks equal to a multiple of the number of workers and may chop them up further later on). Having 32 chunks with 1000/32 items would actually provide less opportunity for work balancing than what the partitioners do now. This is where TBB normally shines, so you're far more likely to find a problem elsewhere.

(Added) I don't really know if it would be false sharing, that's just the first thing I thought of when you mentioned going over the columns. If you have a large number of rows, maybe it's thrashing (or was it another term?) in the cache with only part of each cache line used per iteration, but it has the same remedy. I'll leave it to others now.

jimdempseyatthecove:

You may be seeing a diminishing return.

1000 iterations / 32 threads =

1000/32 (parallel work time) + time to launch 31 tasks

1000 iterations / 6 threads =

1000/6 (parallel work time) + time to launch 5 tasks

When your work time is relatively short, the amount of time to launch all the tasks may become a large portion of the process time.

How much work is being done per iteration?

Try structuring a proof test where each iteration takes ~0.1 seconds of compute time. In other words, the serial test runs for ~100 seconds. Run the same test on your 6-core system and on the MTL system.

Also get the serial run times, so that you can compute the scaling relative to the serial time on the respective systems.

Jim Dempsey

www.quickthreadprogramming.com
Arch D. Robison (Intel):

The false sharing issue raised by Raf is a likely culprit if columns are being computed by separate cores. There are also bandwidth issues.

For example, if columns are single precision, then a 64-byte cache line spans 16 columns. 1000 columns partitioned across 32 cores is 31.25 columns per core. So threads updating different columns could frequently hit on the same cache line.

I recommend using a 2D partitioning across both the rows and columns of the output, and a fixed block size. Suppose the matrix is an MxN matrix.

  • Use tbb::parallel_for( tbb::blocked_range2d<int>(0,M,100,0,N,100), ..., tbb::simple_partitioner() ). For a 1000x1000 matrix, that will create at least 100 separate tasks, each of which will compute a 100x100 submatrix or smaller. The submatrices will be at least 50x50, so that's at least 250,000 floating-point operations, which should be plenty to amortize scheduling overheads.
  • Accumulate the submatrix in a local temporary and then copy it out to the global array. That will avoid false sharing in the inner loops. Because simple_partitioner is being used in conjunction with an explicit grain size of 100x100, the local submatrix will not exceed 100x100. So it can be declared as a fixed size 100x100 array. It might even be possible to get away with declaring just a row of it (see next bullet).
  • Accumulate the submatrix in a way that reduces consumption of memory bandwidth. 32 cores all pulling from main memory can easily outrun the memory subsystem. For example, suppose the local computation is local[i][j] += a[i][k]*b[k][j]. I'd make j the innermost loop and i the outermost loop. That way the bandwidth pressure is only on b[k][j], which ranges over a 1000x100 submatrix. Even for double-precision, that will fit in outer-level cache on most modern machines, and so only has to be fetched once. If i is the outermost loop, then local can be declared as a one dimensional array [100] instead of [100][100], and copied out after each row of the conceptual "local" is computed. Never use k as the inner loop because that introduces a loop-carried dependence on the innermost loop that slows down the hardware.
  • If the matrix might be much larger, block over the k loop. I.e., accumulate the local submatrix as the sum of products of approximately 100x100 matrices. That way the inner three loops are operating on three matrices (two inputs and one output) that are only 100x100, which all easily fit in outer-level cache. If you go really crazy with blocking, you can get essentially everything that counts into L1 cache or registers for the innermost loops.
Dmitry Vyukov:

If it were only false sharing and bandwidth saturation, I would expect top to show 100% all the time... There must be something related to task scheduling...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Michael Uelschen:

Hello

sorry for the confusion; I wrote columns but I meant rows. I had already arranged the algorithm in such a way as to minimize negative cache effects.

Michael

Raf Schietekat:

I see I was following a red herring there... Bandwidth is another thing I thought of, but Dmitriy has a good point there. We don't really know how many rows there are, whether the computations are independent for each element (and how much work per element?) or something like matrix multiplication (Arch seems to be making that assumption), how long the computation runs, and whether the variation in top is within a run or across runs. You already tried various grainsizes without success, which seems to disprove Jim Dempsey's point if you went all the way to 200 or so for the grainsize (producing 8 chunks of 125 each); otherwise it might have made sense, because, if I'm not mistaken, auto_partitioner with 32 workers is nearly as drastic as simple_partitioner with a default grainsize of 1 for an initial range of 1000, producing chunks of 1-2 elements.

jimdempseyatthecove:

>>You already tried various grainsizes without success, which seems to disprove Jim Dempsey's point if you went all the way to 200 or so ...

Raf,

My point clearly was:

Take the iteration space and divide it by the number of threads.

I did not say:

Take the iteration space and divide it into n partitions (tasks or non-task aggregations).

Therefore, the grainsize has nothing to do with the number of threads.

When the thread pool, sans the current thread, is sleeping, the TBB library must communicate with the operating system to wake up these threads (set an event on Windows, or a condition variable on Linux (Pthreads)). This communication adds overhead, be it done sequentially by the instigating thread or in a cascade as the additional threads join the task. This communication overhead is independent of the TBB library overhead as its thread pool varies in size.

When the thread pool, sans the current thread, is busy (running something else), then the overhead becomes an "it depends" type of situation; however, the more threads in the TBB thread pool, the larger the overhead.

I think a synthetic test program could sort this out.

Jim Dempsey

Raf Schietekat:

So wouldn't a reduced number of tasks, by way of an increased grainsize, decrease startup time? If such a large setting (grainsize 200) has not been tried after all (and I don't think it unlikely that it hasn't), your point could still be proved right, but if it has, are you saying that 8 tasks still run slower on a 32-thread system than on an 8-thread system? I think I'll abstain now, because I don't think we have enough information.

(2010-10-29 Removed exaggeration.)

Arch D. Robison (Intel):

I erred when I misread "matrix calculation" in the original post as "matrix multiplication".

jimdempseyatthecove:

>>So wouldn't a reduced number of tasks, by way of an increased grainsize, decrease startup time?

There is latency in starting the execution of all the tasks (the chunks performed by the scheduled threads). So when the grainsize reduces the number of threads used from the full complement to less than the full complement, then yes, the startup latency is less. So in the case of dropping from 32 threads to 8 threads you would be reducing the startup latency.

As to whether (startup latency + all threads (chunks) run to completion) is reduced, that depends on the run time of the larger chunks and on contention for the memory bus/subsystem. If this is a simple array copy/addition/scale function, then likely 4 threads per socket would swamp the memory bandwidth.

Jim Dempsey

Andrey Marochko (Intel):

To better understand the possible reason for the poor scalability you observe, try the following two simple experiments: in one of them, increase the amount of calculation during the processing of each matrix element, and in the other, use a larger matrix.

Raf Schietekat:

#10 "If this is a simple array copy/addition/scale function then likely 4 threads per socket would swamp the memory bandwidth."
And Dmitriy's point in #4?

But maybe I wasn't really exaggerating before in #8 about not nearly having enough information (see questions in #6). Giving hypothetical replies may not be the best use of anyone's time until the original poster, who seems to have a preconceived idea about where the problem lies, provides access to what we need to know to say anything meaningful.

Michael Uelschen:

Hello

Now I did some tests with parallel_for. The more I program with TBB, the more I don't understand. My synthetic
test works like this: there is an array with n elements (a 100x100 matrix, so n = 10000). I run a parallel_for on this array, with a
simple body that just puts the size of the chunk into the first element of the chunk. The second element of the chunk
gets the thread id. The remaining elements get a 0. Then I do some senseless calculation to keep the CPU busy.

I ran the program on my notebook under Linux in a virtual machine, and also on my hexacore machine with HT enabled.
There are 2 observations:

  1. The program on my notebook is faster (2 times)
  2. The chunk sizes on the hexacore have a large variance.

Any insight in this behavior is appreciated.

Thanks.

--Michael

This is the output on my notebook:

Parallel Runtime=21.8565
Size Freq.
1250 8
Histogram checking ok.
Chunks per Thread
tid=5345 : 1250 1250 1250 1250 sum=5000
tid=5346 : 1250 1250 1250 1250 sum=5000

This is the output on my hexacore:

Parallel Runtime=38.8367
Size Freq.
1 9
2 22
3 4
5 15
9 1
10 11
19 6
20 5
39 8
78 8
156 13
157 2
312 11
313 9
Histogram checking ok.
Chunks per Thread
tid=3476 : 312 313 39 39 78 2 3 9 10 2 sum=807
tid=3477 : 312 313 312 313 312 2 3 2 2 sum=1571
tid=3478 : 19 20 312 313 5 5 10 10 78 78 sum=850
tid=3479 : 312 313 312 313 19 20 5 5 19 20 sum=1338
tid=3480 : 156 156 156 2 2 3 39 39 5 5 sum=563
tid=3481 : 2 5 5 2 2 78 78 312 313 19 20 sum=836
tid=3482 : 10 10 2 2 312 2 156 156 39 39 sum=728
tid=3483 : 156 156 156 157 39 39 2 2 2 sum=709
tid=3484 : 156 156 156 157 2 10 10 sum=647
tid=3485 : 312 313 312 313 2 3 2 5 5 5 sum=1272
tid=3486 : 2 2 78 78 78 10 10 10 10 sum=278
tid=3487 : 5 5 5 19 2 19 20 5 156 156 sum=392

This is my program:

#include "tbb/tick_count.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
using namespace tbb;

#include <iostream>
#include <iomanip>
#include <map>
#include <vector>
#include <cstring>
using namespace std;

#include <unistd.h>
#include <sys/syscall.h>


// --1. Test: Check the size of the partitioned data.

class SynthBody1 {
	// --Link to the data
	unsigned int* data;
	unsigned int dimx,dimy;

public:
	// --Constructor
	SynthBody1(unsigned int* d, unsigned int dx, unsigned int dy):
		data(d), dimx(dx), dimy(dy) {}

	// --()-operator
	void operator()(const blocked_range<unsigned int>& r) const {
		// --The iterator of the loop.
		blocked_range<unsigned int>::const_iterator it=r.begin();
		// --The first element in the range is set to the number of elements in the range.
		data[it++]=r.size();
		// --The second element in the range is set to the tid.
		if (r.size()>1)
			data[it++]=syscall(SYS_gettid);

		// --Following elements are set to zero.
		while (it!=r.end())
			data[it++]=0;

		// --Just do some silly calculation (does the compiler optimize this?)
		volatile double tmp;
		for(unsigned int i=0;i<(1<<28);i++) {
			double x=1000./double(i);
			tmp=x;
		}
	}
};


void synthPartitionerTest1() {

	// --Create a matrix.
	unsigned int dimx=100,dimy=100;
	unsigned int* matrix=new unsigned int[dimx*dimy];

	// --Time measurement - start
	tick_count t0=tick_count::now(),t1;
	// --Start parallel_for on a chunk of data.
	SynthBody1 body(matrix,dimx,dimy);
	auto_partitioner ap;
	parallel_for(blocked_range<unsigned int>(0,dimx*dimy,2),body,ap);
	// --Time measurement - stop
	t1=tick_count::now();
	cout << "Parallel Runtime=" << (t1-t0).seconds() << endl;

	// --Calculate a histogram of the chunk sizes (serial version).
	unsigned int* chunksize=new unsigned int[dimx*dimy];
	memset(chunksize,0,dimx*dimy*sizeof(unsigned int));

	// --Record the chunks per thread.
	map<unsigned int,vector<unsigned int> > tchunks;

	for(unsigned int i=0;i<dimx*dimy;i++) {
		if (matrix[i]>0) {
			// --The first element of a chunk holds the chunk size.
			chunksize[matrix[i]]++;
			if (matrix[i]>1) {
				unsigned int tid=matrix[i+1];
				// --Insert the chunk into the thread-specific vector.
				tchunks[tid].push_back(matrix[i]);
				// --Skip tid in the matrix.
				i++;
			}
		}
	}

	// --Output the histogram.
	unsigned int counter=0;
	cout << setw(8) << "Size" << setw(8) << "Freq." << endl;
	for(unsigned int i=0;i<dimx*dimy;i++) {
		if (chunksize[i]>0) {
			cout << setw(8) << i << setw(8) << chunksize[i] << endl;
			counter+=i*chunksize[i];
		}
	}
	cout << (counter==dimx*dimy ? "Histogram checking ok." : "Histogram checking failed.") << endl;

	// --Output the chunks per thread.
	cout << "Chunks per Thread" << endl;
	map<unsigned int,vector<unsigned int> >::const_iterator it;
	for(it=tchunks.begin();it!=tchunks.end();it++) {
		unsigned int sum=0;
		cout << "tid=" << it->first << " : ";
		vector<unsigned int>::const_iterator jt;
		for(jt=it->second.begin();jt!=it->second.end();jt++) {
			cout << *jt << " ";
			sum+=*jt;
		}
		cout << "sum=" << sum << endl;
	}

	// --Delete the matrix.
	delete [] chunksize;
	delete [] matrix;
}

Dmitry Vyukov:

> The program on my notebook is faster (2 times)

I guess it's not faster; it just performs much less work. The amount of work in your test is not constant: it depends on the number of parallel tasks. The more hardware threads the machine has, the more parallel tasks it creates and the more work it performs. That's a ridiculous benchmark.

Instead of:
for(unsigned int i=0;i<(1<<28);i++) {

put:
for(unsigned int i=0;i<(1<<24)*r.size();i++) {

Dmitry Vyukov:

> The chunk sizes on the hexacore have a large variance.

Did you kill *all* other processes on the system?

In any case, parallelism support libraries usually do not provide any guarantees to the end user with respect to the uniformity of chunk sizes; they usually only try to achieve linear parallel speedup. Is that really important to you?

Michael Uelschen:

Hello

my intention was not to have a benchmark. I'm still wondering about the partitioning of parallel_for.
If I have n elements and p processors/cores, then I would like to have n/p chunks of data. If I
know that the calculation of each element in the vector/matrix is the same and constant, then I
do not need recursive splitting. I'm a little bit worried about the overhead of this partitioning scheme.
I understand that the current behavior is very flexible and self-adapting to the problem and the
workload on the machine that runs the algorithm. But so far I have not understood how I can control
the partitioning.

Let's have a look at the documentation of the auto_partitioner (Ref. Man. p. 26):

The range subdivision is initially limited to S subranges, where S is proportional to the
number of threads specified by the task_scheduler_init (11.2.1). Each of these
subranges is not divided further unless it is stolen by an idle thread. If stolen, it is
further subdivided to create additional subranges. Thus a loop template with an
auto_partitioner creates additional subranges only when necessary to balance load.

From this documentation, when I have n=10000 elements and 12 cores, I would expect 12 subranges
of approx. 830 elements each. But I got something like this (including the change suggested by Dmitriy):

Run Test1
Parallel Runtime=12.5559
Size Freq.
1 3
2 13
3 7
5 22
9 4
10 28
19 10
20 10
39 25
40 1
78 15
79 1
156 14
312 9
313 6
Histogram checking ok.
Chunks per Thread
tid=4123 : 312 78 78 78 79 39 39 78 78 sum=859
tid=4124 : 312 313 312 sum=937
tid=4125 : 78 312 313 10 10 10 5 5 5 2 3 2 3 9 10 10 10 sum=797
tid=4126 : 312 313 19 19 20 2 2 5 5 5 5 2 3 2 3 19 20 19 20 sum=795
tid=4127 : 19 20 19 20 10 10 10 10 39 39 39 156 156 156 78 78 sum=859
tid=4128 : 39 39 39 39 39 39 156 156 156 2 2 10 10 10 10 39 5 5 9 10 10 39 sum=863
tid=4129 : 156 9 10 10 10 156 156 5 5 78 78 156 sum=829
tid=4130 : 39 39 39 40 39 39 39 39 39 156 156 39 39 19 20 sum=781
tid=4131 : 2 3 2 3 312 313 9 10 2 3 19 2 2 78 sum=760
tid=4132 : 78 78 10 10 78 78 10 10 10 10 156 156 19 20 19 20 20 20 sum=802
tid=4133 : 312 313 312 sum=937
tid=4134 : 5 5 5 5 5 5 5 312 313 5 5 5 5 10 10 39 39 sum=778

The overall workload is nearly uniformly distributed, but there is a huge number of very small partitions.

Does the documentation explain the observed behavior of this simple algorithm? I cannot see how.

Best regards,
Michael

Dmitry Vyukov:

> my intention was not to have a benchmark.

Your program is inappropriate if you are ever going to compare the execution times of two runs. And you do.

> But so far I did not understand how can I control the partitioning.

If you want static partitioning (each thread processes N/P elements), just set the grainsize to N/P.

> Does the documentation explain the seen behavior of the simple algorithm? I cannot see this.

I suspect the root cause is randomized work-stealing: an idle thread steals a piece of work from a *random* thread, not from the thread that has the biggest piece of work. So small partitions are inevitable.

Raf Schietekat:

"If I have n elements and p processors/cores then I would like to have n/p chunks of data."
You mean chunks of size n/p, but that would be suboptimal unless used in an outer loop on a machine that isn't running anything else and where there is negligible variability in execution time across elements. TBB's way is better.

auto_partitioner tends toward smaller range sizes and a higher number of tasks as the number of available threads increases, exposing it to higher parallel overhead. You may see better performance if you set a reasonable grainsize in the blocked_range, to avoid the smallest ranges that you get with the higher number of threads. Tell us how that works out!

Arch D. Robison (Intel):

The assertion "the calculation of each element in the vector/matrix is the same and constant" is generally not enough to guarantee equal load. Cache misses, page faults, and interrupts can create unequal work for cores. Many years back when I was prototyping TBB on a 32-way Altix system, I observed that even for something as simple as a matrix multiply, differences in cache misses made dynamic load balancing pay off in some cases.

The intended usage model of TBB is to not think about the number of cores, but about reasonably amortizing overhead of chunks of work. For example, find chunk sizes so that about 5% of the time is spent on parallel scheduling overhead. The scheduling overhead per chunk is roughly independent of the number of processors, so the chunk size should be good across different processor counts.

For the detailed workings of the partitioners, see the header file "include/tbb/partitioner.h". Method auto_partitioner::partition_type::should_execute_range has the basic subdivide-on-steal logic. The logic in affinity_partitioner::partitioner_type is similar, albeit obscured somewhat by the affinity logic.

Raf Schietekat:

Arch, can you confirm that auto_partitioner (currently) requires a sensible grainsize setting to avoid degraded efficiency on a higher number of cores? The problem seems to be that thieves do not always go for the biggest chunks (except if they stumble upon them by accident), which would make for a partially exponential growth in the number of tasks with the number of cores (and increased overhead as a result). A solution might be to lead the secondary and tertiary thieves etc. back to the original victim for a more even distribution of work, if that were doable. But perhaps I overlooked something when arriving at that conclusion "in cerebro"...

Meanwhile, the recommended use would be to find a suitable grainsize using simple_partitioner to avoid some of the sources of degraded performance with small subranges (not completely reliable without an actual many-core test machine, but still), and then replace it with auto_partitioner for workloads that don't exhibit extreme variance across elements, for improved performance on fewer cores.

Does that seem plausible?

(Added) Hmm, that goes against recursive parallelism, of course. But doesn't the auto_partitioner also, in a way? New idea: instead of splitting up the initial range into N times the number of worker threads and letting each part be stolen, record the current auxiliary grainsize starting at N times the number of cores parts (same division as now), but use recursive parallelism to get down to that size. Only when a smaller piece gets stolen will it be subdivided using the same mechanism. This will get the work divided across workers with less overhead in two ways that both contribute to better performance: recursive parallelism, and a higher probability of stealing a bigger chunk. Well, that's just what I came up with during the drive home, but I'm off again now, so maybe I'll get some more inspiration on the way. :-)

(Added 2010-11-16) So each task would be marked as one of about log_2(N*default_num_threads) generations for normal splitting (unless the range runs into its grainsize first), with the last generation executed completely when not stolen or starting another cycle when stolen. The generation could be recorded inside the local copy of the partitioner, and evaluated by incrementation modulo the value above in the next generation.

(Added 2010-11-18) Now that's embarrassing... auto_partitioner already works almost exactly like I "proposed", sorry about that. Note to self: don't just wing it.

Michael Uelschen:

Hello

I did some more tests, which show that the scaling is not too bad. The difference between the simple and auto partitioner was
small (almost zero). OK, I need more time to understand/simulate my initial problem that had the bad performance on
the MTL.

#Threads SimplePartitioner w/ grain size=50
 1 69.2847 (1.00)
 2 34.6817 (1.99)
 4 18.0429 (3.84)
 6 12.3846 (5.59)
12 12.6946 (5.46)

I tested with 12 threads since HT is enabled on that system. I thought this would bring some more performance, but
the effect is nil, at least for this simple algorithm. The speed-up relative to the single-threaded version is shown in brackets.

Thanks so far.
--Michael

Raf Schietekat:

Hyperthreading would only make a difference if one thread were able to make progress while the other is blocked waiting for access to memory, I guess; so if both threads are doing nothing but fill the write buffer, it doesn't make much of a difference to switch between them. In a more realistic algorithm, with some time spent inside the core between memory accesses, you might be more likely to see an improvement when using 12 threads.

(Added 2010-11-17) Hmm, but why would it scale almost perfectly until 6 threads and then just stop improving? I'd like to retract my answer unless somebody confirms it, because the correct explanation may be a different one.
