Help: with 40 threads, duration of workload jumps to 100x of serial with gcc

Mike Pearce (Intel)

From one of our users:
I have a time loop, and inside it several loops that work on several arrays. Say the sequential time for a given size and number of iterations is ~80 seconds. With 2 threads it is a little more than 40 seconds, and so on. The speedup varies with the size of the problem (cache issues), but that's not the point.

The trouble is that with 40 threads or so, the time jumps to ~8999 seconds. The results are still correct (by correct I mean equal to those of the sequential version).

This happens with both versions of gcc. It doesn't happen with icc. It also doesn't happen with gcc if I use CPU affinity (GOMP_CPU_AFFINITY).

What really puzzles me is that the elapsed times are remarkably consistent, regardless of the problem size. With 100 iterations the wall time tends to be 89.9 seconds, with 1000 iterations it is 899.9, and with 10000 iterations it is 8999, for different matrix sizes.
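To make the pattern in those numbers explicit, here is a minimal sketch that divides each reported wall time by its iteration count. The variable and function names are only illustrative; the three data points are the ones quoted above.

```python
# Wall times (seconds) reported above for the slow runs, keyed by iteration count.
slow_runs = {100: 89.9, 1000: 899.9, 10000: 8999.0}


def per_iteration_cost(runs):
    """Return the average wall time per time-loop iteration for each run."""
    return {iters: seconds / iters for iters, seconds in runs.items()}


costs = per_iteration_cost(slow_runs)
for iters, cost in costs.items():
    print(f"{iters:>6} iterations: {cost:.4f} s per iteration")
```

All three runs come out at roughly 0.9 s per iteration, independent of matrix size. That looks more like a fixed per-iteration delay than extra computation, although the numbers alone do not say where the delay comes from.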

The fact that the problem doesn't happen with CPU affinity may indicate that it is a CPU migration issue. Is there a tool, like perf, that I can use to measure that?
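On Linux, perf exposes migrations as a software event, so `perf stat -e cpu-migrations,context-switches` followed by the program's command line is one place to start. As a rough cross-check, the sketch below samples the `processor` field (field 39) of `/proc/<pid>/task/<tid>/stat` for every thread of a running process and counts how often it changes between samples. It assumes a Linux `/proc` filesystem; the function names, sample count, and interval are illustrative choices, and because it only sees changes between samples it gives a lower bound, not an exact count.

```python
import os
import time


def last_cpu_from_stat(stat_line):
    """Return the 'processor' field (field 39) from a /proc/.../stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split after the LAST closing parenthesis.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is rest[36].
    return int(rest[36])


def sample_thread_cpus(pid):
    """Map thread id -> CPU it last ran on, for every thread of `pid`."""
    cpus = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                cpus[int(tid)] = last_cpu_from_stat(f.read())
        except OSError:
            pass  # thread exited between listing and reading
    return cpus


def count_observed_migrations(pid, samples=100, interval=0.01):
    """Count CPU changes seen between consecutive samples, per thread."""
    migrations = {}
    previous = sample_thread_cpus(pid)
    for _ in range(samples):
        time.sleep(interval)
        current = sample_thread_cpus(pid)
        for tid, cpu in current.items():
            if tid in previous and previous[tid] != cpu:
                migrations[tid] = migrations.get(tid, 0) + 1
        previous = current
    return migrations
```

Calling `count_observed_migrations(pid)` with the PID of the running OpenMP program, once with and once without GOMP_CPU_AFFINITY set, would show whether the threads actually move between CPUs in the slow case.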

Here is an example of execution times in seconds, for a given size (400x200, 10k iterations) and different numbers of threads:

Threads  Time (s)
1 81.3115
2 41.0895
3 27.7585
4 20.9646
5 16.89
6 14.2178
7 12.4336
8 10.8436
9 9.84998
10 8.8406
11 8.26324
12 7.65106
13 7.14509
14 6.72464
15 6.36933
16 6.09183
17 5.83491
18 5.61528
19 5.55352
20 5.28404
21 5.25616
22 5.11808
23 5.07538
24 4.74014
25 4.70499
26 4.81977
27 4.47424
28 4.3797
29 4.4637
30 4.40033
31 4.32332
32 4.40318
33 4.25379
34 4.1882
35 4.47564
36 4.51471
37 4.12109
38 4.40239
39 8999.58
40 8999.46
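For reference, the jump can be quantified as speedup and parallel efficiency relative to the 1-thread time. The sketch below uses a subset of the timings listed above; the helper names are illustrative, not part of the original program.

```python
# Subset of the measured wall times (seconds), keyed by thread count.
times = {
    1: 81.3115, 2: 41.0895, 4: 20.9646, 8: 10.8436, 16: 6.09183,
    32: 4.40318, 38: 4.40239, 39: 8999.58, 40: 8999.46,
}


def speedup(times, n):
    """Serial time divided by the time measured with n threads."""
    return times[1] / times[n]


def efficiency(times, n):
    """Speedup per thread; 1.0 would be perfect scaling."""
    return speedup(times, n) / n


for n in sorted(times):
    print(f"{n:>2} threads: speedup {speedup(times, n):8.3f}, "
          f"efficiency {efficiency(times, n):6.3f}")
```

With these numbers the speedup is about 18x at 38 threads, and the 39-thread run is about 110x slower than the serial one, which matches the "100x" in the title.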
