TBB3: number of threads differs between acano01 and the batch system?


Michael Uelschen

Hello

I was just testing my program on the MTL with TBB3 and found that the number of threads differs depending on the machine I run it on. The method task_scheduler_init::default_num_threads() returns 64 on acano01 and 32 on the batch system.

I thought that both systems were the same (from a hardware point of view).

Maybe someone can tell me whether the hardware is different and/or whether there is a switch to enable the remaining 32 threads on the batch system.

Thank you.

Best regards,
Michael

jimdempseyatthecove

The log-on system has HT enabled. (32 cores, 2 threads per core)
The batch system has HT disabled. (32 cores, 1 thread per core)
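
Both figures are simply the core count multiplied by the hardware threads per core. A minimal sketch of that arithmetic in standard C++ follows. It assumes that the TBB default follows the number of logical processors the OS exposes, and it uses std::thread::hardware_concurrency() as a TBB-free stand-in for that query; logical_processors and visible_logical_processors are hypothetical helper names, not TBB calls.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper (not part of TBB): number of logical processors
// for a given core count and number of hardware threads per core.
unsigned logical_processors(unsigned cores, unsigned threads_per_core) {
    assert(cores > 0 && threads_per_core > 0);
    return cores * threads_per_core;
}

// Standard C++ hint for the number of concurrent threads supported.
// The standard allows this to return 0 when the value is not computable.
unsigned visible_logical_processors() {
    return std::thread::hardware_concurrency();
}
```

With the numbers from this thread, logical_processors(32, 2) gives 64 (log-on system) and logical_processors(32, 1) gives 32 (batch system).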

Jim Dempsey

Blog: The Parallel Void

www.quickthreadprogramming.com
Michael Uelschen

Understood.

However, it would be interesting to know how the system scales when using HT. Is there any experience so far? If I use 64 threads (or TBB tasks), what is the expected maximum (ideal) speed-up? It should be somewhere between 32 and 64. Is it closer to 32 or closer to 64? If I remember correctly, I read some information from Intel that HT increases performance by 30%. Would this mean I can expect a speed-up of at most about 40 with HT enabled? Or am I completely wrong?
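
The estimate in the question can be written out as a small calculation. This is only a sketch under stated assumptions: perfect scaling across the 32 cores and a fixed fractional gain per core from HT (the 30% figure recalled above). The function name ideal_speedup is hypothetical.

```cpp
#include <cassert>

// Ideal speed-up relative to a single thread, ASSUMING perfect scaling
// across cores and a fixed fractional throughput gain per core from HT.
// ht_gain = 0.0 means HT adds nothing; ht_gain = 1.0 means the second
// hardware thread is worth a full extra core.
double ideal_speedup(int cores, double ht_gain) {
    assert(cores > 0 && ht_gain >= 0.0 && ht_gain <= 1.0);
    return cores * (1.0 + ht_gain);
}
```

Under these assumptions, ideal_speedup(32, 0.30) is about 41.6, which sits much closer to 32 than to 64. Real applications may land above or below this figure.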

Any comments?

Best regards,
Michael

jimdempseyatthecove

Michael,

The degree of performance boost from HT is highly dependent on the application and on the programmer's ability to coordinate the threads sharing the available caches. On the MTL system you have 4 processors, each with one L3 cache, 8 L2 caches and 8 L1 caches, for a total of 4 L3, 32 L2 and 32 L1 caches. On the system with HT enabled, the topology is:

nThreads=64
nL3=4
nThreadsPerL3=16
CacheSize_L3=25165824
CacheLineSize_L3=64
nL2=32
nThreadsPerL2=2
CacheSize_L2=262144
CacheLineSize_L2=64
nL1=32
nThreadsPerL1=2
CacheSize_L1=32768
CacheLineSize_L1=64
L3(0) = {0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60}
L3(1) = {1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61}
L3(2) = {2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62}
L3(3) = {3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63}
L2(0) = {0,32}
L2(1) = {1,33}
L2(2) = {2,34}
L2(3) = {3,35}
L2(4) = {4,36}
L2(5) = {5,37}
L2(6) = {6,38}
L2(7) = {7,39}
L2(8) = {8,40}
L2(9) = {9,41}
L2(10) = {10,42}
L2(11) = {11,43}
L2(12) = {12,44}
L2(13) = {13,45}
L2(14) = {14,46}
L2(15) = {15,47}
L2(16) = {16,48}
L2(17) = {17,49}
L2(18) = {18,50}
L2(19) = {19,51}
L2(20) = {20,52}
L2(21) = {21,53}
L2(22) = {22,54}
L2(23) = {23,55}
L2(24) = {24,56}
L2(25) = {25,57}
L2(26) = {26,58}
L2(27) = {27,59}
L2(28) = {28,60}
L2(29) = {29,61}
L2(30) = {30,62}
L2(31) = {31,63}
L1(0) = {0,32}
L1(1) = {1,33}
L1(2) = {2,34}
L1(3) = {3,35}
L1(4) = {4,36}
L1(5) = {5,37}
L1(6) = {6,38}
L1(7) = {7,39}
L1(8) = {8,40}
L1(9) = {9,41}
L1(10) = {10,42}
L1(11) = {11,43}
L1(12) = {12,44}
L1(13) = {13,45}
L1(14) = {14,46}
L1(15) = {15,47}
L1(16) = {16,48}
L1(17) = {17,49}
L1(18) = {18,50}
L1(19) = {19,51}
L1(20) = {20,52}
L1(21) = {21,53}
L1(22) = {22,54}
L1(23) = {23,55}
L1(24) = {24,56}
L1(25) = {25,57}
L1(26) = {26,58}
L1(27) = {27,59}
L1(28) = {28,60}
L1(29) = {29,61}
L1(30) = {30,62}
L1(31) = {31,63}
 

When HT is disabled, logical processors 32:63 are omitted.

When HT is disabled, each L3 has eight logical processors sharing it, and no logical processors share an L2 or L1.

When HT is enabled, you have sixteen logical processors sharing each L3, two logical processors sharing each L2 and two logical processors sharing each L1.
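
The numbering pattern in the listing above can be expressed compactly. The sketch below assumes the mapping is exactly what the listing shows (L3 index = logical processor modulo 4, L2/L1 index = logical processor modulo 32) and that disabling HT simply drops logical processors 32 to 63. All function and constant names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Constants read off the listing (HT enabled).
constexpr int kThreads = 64;  // nThreads
constexpr int kL3 = 4;        // nL3
constexpr int kCores = 32;    // nL2 == nL1 == number of physical cores

// Index of the L3 cache used by a logical processor.
int l3_of(int cpu) {
    assert(cpu >= 0 && cpu < kThreads);
    return cpu % kL3;
}

// Index of the L2 (and L1) cache, i.e. the physical core.
int core_of(int cpu) {
    assert(cpu >= 0 && cpu < kThreads);
    return cpu % kCores;
}

// Logical processors sharing a given L3 cache.
std::vector<int> sharers_of_l3(int l3, bool ht_enabled) {
    const int limit = ht_enabled ? kThreads : kCores;
    std::vector<int> out;
    for (int cpu = 0; cpu < limit; ++cpu)
        if (l3_of(cpu) == l3) out.push_back(cpu);
    return out;
}

// Logical processors sharing a given L2/L1 cache (one physical core).
std::vector<int> sharers_of_core(int core, bool ht_enabled) {
    const int limit = ht_enabled ? kThreads : kCores;
    std::vector<int> out;
    for (int cpu = 0; cpu < limit; ++cpu)
        if (core_of(cpu) == core) out.push_back(cpu);
    return out;
}
```

For example, sharers_of_core(5, true) yields {5, 37}, matching L2(5) and L1(5) in the listing, while sharers_of_l3(0, false) yields eight logical processors.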

The key to optimal performance is the programmer's skill at thread team coordination within or across available caches.

As to the 30% performance boost from HT, some algorithms can drive this up by an order of magnitude or more. See http://software.intel.com/en-us/articles/superscalar-programming-101-mat...

Jim Dempsey

Vladimir Polin (Intel)

Hi Michael,

You can find an example of TBB scalability in my blog: http://software.intel.com/en-us/blogs/2010/06/11/intel-tbb-30-in-intel-manycore-testing-lab/

--Vladimir
