Threads overhead Nehalem vs Sandy-bridge vs Ivy-bridge

Hi all,

After upgrading servers from Dual Xeon E5645 2.4GHz (Nehalem) to Dual Xeon E5-2620 2.0GHz (Sandy Bridge) I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that summarizes the problem. In general, I have a prebuilt LUT with 3000 int rows, and each row contains about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I run it once in the main thread and once in a separate thread (while the main thread waits). I do know that there is thread creation overhead, but I used to think it is up to 1ms. For precise results I average 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64; my application is also x64. The code was compiled with VC++ 2012 Express. The results are:

Dual Xeon E5645 2.4GHz (Nehalem): Main thread - 340.522[ms], Separate thread: 388.598[ms]  Diff: 13%

Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread - 362.515[ms], Separate thread: 565.295[ms]  Diff: 36%

Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread - 234.928[ms], Separate thread: 267.603[ms]  Diff: 13%

My problem is with the 36%. Can anyone explain to me what is wrong with my code? Maybe it is not super optimized, but why does it behave differently on Sandy Bridge?
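
For reference, here is a rough sketch of the structure of the attached test (the names and sizes below are illustrative; the exact code is in main.cpp):

#include <vector>
#include <algorithm>
#include <thread>
using namespace std;

class CorticaTask
{
public:
    CorticaTask() : m_data(3000, vector<int>(2000, 7)), m_buffer(2000) {}

    void Run()
    {
        for (size_t i = 0; i < m_data.size(); i++)
        {
            vector<int> &row = m_data[i];
            copy(row.begin(), row.end(), m_buffer.begin());
            sort(m_buffer.begin(), m_buffer.end());
        }
    }

    vector< vector<int> > m_data;   // prebuilt LUT: ~3000 rows x ~2000 ints
    vector<int> m_buffer;           // preallocated work buffer
};

// one measured iteration: either in the main thread or in a separate thread
void RunOnce(CorticaTask &task, bool separateThread)
{
    if (separateThread)
    {
        thread t(&CorticaTask::Run, &task);
        t.join();                   // main thread just waits
    }
    else
    {
        task.Run();
    }
}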

Many thanks, Pavel.

Attachment: main.cpp (2.22 KB)

Hi Pavel,

I don't have a system with a Xeon Ex-xxxx, but I could try to investigate ( at the end of the week ) what could possibly be wrong. I have an Intel Core i7-3840QM ( Ivy Bridge / 4 cores ); let me know if you're interested.

Could you provide L1, L2 and L3 cache sizes for all CPUs? ( from ark.intel.com )

I think that profiling your program with Xperf should be done first. The main idea is to check how much time is spent in the thread creation stage and in the context switch stage. Please install Xperf, or run it if you have it installed already. Next, start your application. Below are the commands to be entered from an elevated command prompt.

xperf.exe -on -stackwalk PROC_THREAD+CSWITCH

xperf.exe -stop "name of your file".etl

Hi Pavel

I have forgotten to add that you need to set DisablePagingExecutive on Win7 64-bit (it keeps paged kernel code resident, which is needed for stack walking). Use this command:

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Hi Sergey,

Thanks for your offer, I hope to resolve the problem before the weekend, but who knows. On the abovementioned site I found only the L3 cache size. The cache sizes are: Xeon E5645 - 12M (shared between 6 cores), Xeon E5-2620 - 15M (shared between 6 cores), Xeon E3-1230V2 - 8M (shared between 4 cores).

Hello Pavel,

I don't have VS2012+ installed, so I don't have the <thread> header... so I can't build your example.

Have you tried adding timing statements just inside the Run() routine? It seems like this would tell you whether the work itself is running slower or whether the overhead of creating a thread is just much higher in the Sandy Bridge case versus the other cases.

Pat

>>... I found only L3 cache size. The cache sizes are:

All the rest of the numbers should be in the Datasheets ( PDFs / links are on the right side of the web page for a given CPU on ark.intel.com ).

>>Xeon E5645 - 12M (shared between 6 cores) ,
>>Xeon E5-2620 - 15M (shared between 6 cores),
>>Xeon E3-1230V2 - 8M (shared between 4 cores)

It matches my system and it will be interesting to see if the 13% difference in performance can be reproduced.

>>...LUT with 3000 int rows, each row contains about 2000 numbers...

Simply to note, the size of your LUT ( 3000 * 2000 * sizeof(int) = 6,000,000 * 4 = 24,000,000 bytes ) is ~22.89MB and it exceeds the size of the L3 cache on any system you use.

Then, the LUT is created in the primary thread, and in the 2nd case the additional thread could be scheduled on a different CPU ( needs to be investigated! ). In that scenario both threads, scheduled on different CPUs, possibly compete for access to the L3 cache. In terms of common multi-threading problems, two cases are possible:

- Race Conditions ( more likely / consider your input array a "large shared variable"... )
- False Sharing ( less likely )

Could you try to use VTune to review what is going on with the L3 cache? Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some synchronization object has to be used ).

>>...Another option to consider is to pause the primary thread until processing in the 2nd thread is completed ( some
>>synchronization object has to be used ).

Or, with Win32 API something like:
...
::SuspendThread( hPrimaryThread );
...
Note: the 2nd thread should suspend the primary thread and then resume it as soon as the processing is completed.

but I think it could be done in a different way with the API from the <thread> header.
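
For example, a minimal sketch of that "different way" with the <thread> header (the Worker type here is just a stand-in for the class in the attached sample):

#include <thread>

struct Worker                          // stand-in for the task class in main.cpp
{
    void Run() { /* copy + sort the LUT rows here */ }
};

int main()
{
    Worker task;
    std::thread t(&Worker::Run, &task);
    t.join();                          // the primary thread simply blocks here,
                                       // so no SuspendThread()/ResumeThread() is needed
    return 0;
}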

Hi Sergey,

It is true that the whole data set is larger than the L3 cache, however there is no race since only one thread is running and the other is suspended (join). Besides, I am not saying my implementation is super optimized and cache-size aware; I just need to understand the difference between the servers.

Thanks, Pavel

@Pavel

Besides running xperf, you can also profile your code with VTune, as suggested by Sergey. If you need a precise percentage of time spent in thread creation procedures and context switching procedures, it is advised to use xperf.

>>>but I think it could be done in a different way with the API from the <thread> header>>>

This simply means adding another layer of indirection above the Win32 API. Would it not be a better option to call the thread scheduling API directly from his code?

>>...Would it not be a better option to call the thread scheduling API directly from his code?

No. The test is very simple and you could try to run ( or debug ) it in order to see how it works.

Pavel,

I have not reproduced your problem; on my computer, when the command line option '--fast' was used, it ran faster. Here are the test results:

[ Tests - Debug ]

..>main.exe
Average run time: 546.466[ms]

..>main.exe --fast
Average run time: 392.835[ms]

[ Tests - Release ]

..>main.exe
Average run time: 426.612[ms]

..>main.exe --fast
Average run time: 391.799[ms]

Here are details on how executables were compiled:

Notes:

- Visual Studio 2012 environment & Intel C++ compiler XE 13.0.0.089 ( Initial Release )
- No modifications to your source code

[ Compilation - Debug ]

..>icl /MDd main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation. All rights reserved.

-out:main.exe
main.obj

[ Compilation - Release ]

..>icl /MD main.cpp
Intel(R) C++ Compiler XE for applications running on IA-32, Version 13.0.0.089 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

main.cpp
Microsoft (R) Incremental Linker Version 11.00.50727.1
Copyright (C) Microsoft Corporation. All rights reserved.

-out:main.exe
main.obj

Hardware & Software:
OS Name Microsoft Windows 7 Professional
Version 6.1.7601 Service Pack 1 Build 7601
System Model Dell Precision M4700
System Type x64-based PC
Processor Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)

Thanks all, I will be back to work on this problem in a day or two and will update you with the results.

>>>No. The test is very simple and you could try to run ( or debug ) it in order to see how it works>>>

Ok. I will test it on my PC.

>>>>No. The test is very simple and you could try to run ( or debug ) it in order to see how it works...
>>
>>Ok. I will test it on my PC.

That would be nice.

You will need a C/C++ compiler that has the <thread> header file. So far I have seen it only in Visual Studio 2012. Please take into account that the Express Edition ( available for free ) can be used ( this is what I have ) and you can compile the test with the default Microsoft C++ compiler ( you don't need the Intel C++ compiler ). Let me know if you need a Visual Studio 2012 project for your tests.

Thanks in advance.

Hi all, 

I noticed that changing thread t(&CorticaTask::Run, task) to thread t(&CorticaTask::Run, &task) makes things run significantly faster (on Sandy), which is understandable since passing the task by value copies the whole object (including the LUT) for the new thread. However, it is still very strange that it runs slower at some working point on the better and newer server.

Regards, Pavel
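
To illustrate the difference (a simplified sketch, assuming a CorticaTask-like class): with std::thread, passing the object by value copies it, including the LUT, into the new thread's storage, while passing a pointer shares the original object.

#include <thread>
#include <vector>

struct CorticaTask                     // simplified stand-in for the attached class
{
    std::vector< std::vector<int> > m_data;   // large LUT
    void Run() { /* copy + sort each row */ }
};

int main()
{
    CorticaTask task;
    task.m_data.assign(3000, std::vector<int>(2000, 0));

    std::thread t1(&CorticaTask::Run, task);   // copies the whole task (and its LUT)
    t1.join();

    std::thread t2(&CorticaTask::Run, &task);  // shares the original object, no copy
    t2.join();
    return 0;
}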

Hello Pavel,

Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop.

Pat

>>...it is still very strange that it runs slower at some working point on the better and newer server...

Pavel,

My Dell Precision M4700 with Windows 7 Professional 64-bit OS is highly optimized for different performance evaluations. It means that I turned off as many Windows Services as possible, and when the computer is not connected to the network ( I simply disable the network card ) only 33 Windows Services are running. It makes sense for you to check how many Windows Services are running on your computers. By default, right after a Windows installation is completed, at least 50-60 different Windows Services are running, and that number could be even greater. Please also check the settings of your anti-virus software.

If you need a detailed list of my software configuration(s) I could provide it.

>>...how much of the runtime variation is due to thread creation overhead

Patrick,

Windows creates threads very fast. I don't have an exact number but it should be done in a couple of hundred microseconds, or less. Pavel's differences in performance are too big. However, such a verification with the RDTSC instruction would be useful.
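
For example, a rough sketch of such a check with QueryPerformanceCounter ( RDTSC could be used in the same place ) around just the creation call:

#include <windows.h>
#include <thread>
#include <cstdio>

static void Work() { /* the copy + sort loop would go here */ }

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    std::thread t(Work);               // time only the creation of the thread
    QueryPerformanceCounter(&t1);
    t.join();

    double us = 1.0e6 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    printf("Thread creation took %.1f us\n", us);
    return 0;
}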

My overall conclusion is that something else is wrong and some software or hardware affects performance.

Note: Pavel, Did you install all updates for Visual Studio 2012? I did it last weekend...

>>>My Dell Precision M4700 with Windows 7 Professional 64-bit OS is highly optimized for different performance evaluations. It means that I turned off as many Windows Services as possible, and when the computer is not connected to the network ( I simply disable the network card ) only 33 Windows Services are running>>>

Disabling the network adapter is a wise decision, because servicing network card interrupts and the subsequent packet processing can hog the CPU. I would also recommend running general system monitoring from time to time; with the help of the Xperf tool you will get a very detailed breakdown of various activity. Moreover, it is recommended to disable your AV software (when you are not connected to the Internet). It is known that, for example, Kaspersky AV uses system-wide hooks and detours to check system function callers, and this activity can add to the load on the CPU. AV software also often installs custom drivers used to gain access to various internal OS structures implemented in the kernel, and this activity is sometimes done at IRQL == DPC_LEVEL ( mostly for synchronization ) and can block the scheduler, which also runs at DPC_LEVEL, so uninstalling AV on a developer's machine is highly recommended.

That would be nice.

>>>You will need a C/C++ compiler that has the <thread> header file. So far I have seen it only in Visual Studio 2012.>>>

Thanks for informing me about this. I completely failed to take it into account.

>>>Hello Pavel,

Have you tried adding timing statements inside the Run() routine? This would tell us how much of the runtime variation is due to thread creation overhead versus how much time is spent actually doing the work in the loop>>>

Hi Patrick!

Xperf has some thread creation and context switching timing and monitoring abilities. By default it is system-wide, but I think there is a possibility to launch the monitored process directly by using xperf. Or it could be done programmatically.

Pavel, here is another piece of advice:

Add a call to the _getch() CRT function at the very beginning of the main function, like:
#include <conio.h>
...
int main( int argc, char* argv[] )
{
    _getch();   // wait here so the process can be selected in Task Manager
    ...
}

While the test application waits for input from the keyboard, open Windows Task Manager and select the test application. Then, "force" execution of the test on just one CPU ( use the Set Affinity item from the popup menu ). Also, take a look at how many threads are created when the test continues execution.

So this is really simple. 

Change the Run routine from:

 void Run()
 {
     for (int i = 0; i < m_data.size(); i++)
     {
         vector<int> &row = m_data[i];
         copy(row.begin(), row.end(), m_buffer.begin());
         sort(m_buffer.begin(), m_buffer.end());
     }
 }

to something like:

 void Run()
 {
     LARGE_INTEGER start2, finish2, freq;
     QueryPerformanceFrequency(&freq);
     QueryPerformanceCounter(&start2);
     for (int i = 0; i < m_data.size(); i++)
     {
         vector<int> &row = m_data[i];
         copy(row.begin(), row.end(), m_buffer.begin());
         sort(m_buffer.begin(), m_buffer.end());
     }
     QueryPerformanceCounter(&finish2);
     // accumulate elapsed milliseconds into the global timeMs2
     timeMs2 += 1000.0 * (double)(finish2.QuadPart - start2.QuadPart) / (double)freq.QuadPart;
 }

 where timeMs2 is a global variable.

Then you can compare the time inside Run() with the time outside Run() and see if (as I expect) the time spent inside the Run() code is exactly the same for the 2 cases (--fast and not --fast).

No need to mess with xperf or anything complicated yet.

Pat

 

Thanks, Patrick! I'll run another set of tests on my Ivy Bridge and the results will be posted by Monday.

>>>No need to mess with xperf or anything complicated yet.>>>

Hi Pat!

QueryPerformanceCounter used exactly as in your code snippet will not provide any timing information about the time spent in the thread creation routines. One of the thread starter's initial questions was how to measure the latency (overhead) of the thread creation routines.

The code with my suggested changes is basically:

start_timer1
create thread (or not)
start_timer2
do_work_in_loop
end_timer2
end thread (if created)
end_timer1

If you create a thread, it seems like the difference between timer1 and timer2 should be the overhead of creating the thread.

And the 2 timers would verify whether the same amount of time is spent in 'do_work_in_loop'. If the time is not the same, then something unexpected (but not unprecedented) is going on.
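
An illustrative C++ version of that two-timer structure (the names here are hypothetical, not from the attached sample) could look like:

#include <windows.h>
#include <thread>
#include <cstdio>

static double ElapsedMs(LARGE_INTEGER a, LARGE_INTEGER b, LARGE_INTEGER f)
{
    return 1000.0 * (double)(b.QuadPart - a.QuadPart) / (double)f.QuadPart;
}

static void DoWorkInLoop() { /* copy + sort the LUT rows here */ }

int main()
{
    LARGE_INTEGER f, t1a, t1b, t2a, t2b;
    QueryPerformanceFrequency(&f);

    QueryPerformanceCounter(&t1a);          // start_timer1
    std::thread worker([&]
    {
        QueryPerformanceCounter(&t2a);      // start_timer2
        DoWorkInLoop();
        QueryPerformanceCounter(&t2b);      // end_timer2
    });
    worker.join();                          // end thread
    QueryPerformanceCounter(&t1b);          // end_timer1

    printf("outer: %.3f ms, inner: %.3f ms, creation+join overhead: %.3f ms\n",
           ElapsedMs(t1a, t1b, f), ElapsedMs(t2a, t2b, f),
           ElapsedMs(t1a, t1b, f) - ElapsedMs(t2a, t2b, f));
    return 0;
}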

Hi all,

The performance problem can be resolved with an affinity mask - I tried it and it worked. I have two sockets in the server and most probably the problem is in the L3 cache. However, I can't explain why Sandy Bridge behaves worse than Nehalem - there must be an Intel bug.

Regards, Pavel
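
One way to do this programmatically, sketched with the Win32 affinity API (the mask value below is just an example), is to pin the worker thread to a fixed logical CPU:

#include <windows.h>
#include <thread>

static void Work()
{
    // Keep this worker on logical CPU 0 (example mask), i.e. on the same
    // socket/node as the thread that built the LUT, so accesses stay local.
    SetThreadAffinityMask(GetCurrentThread(), 1);

    /* copy + sort the LUT rows here */
}

int main()
{
    std::thread t(Work);
    t.join();
    return 0;
}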

If an affinity mask fixes the performance and you have 2 sockets, then perhaps the Sandy Bridge system has NUMA enabled and for some reason the new thread runs on the other socket? This would cause the Sandy Bridge system with the new thread to do remote memory accesses, whereas the single-threaded version does local memory accesses.

Do you have NUMA enabled on the Sandy Bridge box?

Do you have NUMA enabled on the Nehalem box?

Hi Patrick,

What is NUMA and where can it be enabled? In the BIOS?

Thanks, Pavel

Thanks for the update, Pavel.

When I tested your code I saw that 4 threads were created, one thread per CPU, and I expect ( sorry, I didn't have time for an investigation with VTune ) they were "fighting" for access to the data, but overall the test with the '--fast' switch worked faster on my Ivy Bridge.

>>... in the server and most probably the problem is in the L3 cache. However, I can't explain why Sandy Bridge behaves worse than
>>Nehalem - there must be an Intel bug.

Of course it is possible but it needs to be proven. Please provide more details and a new test case if you think so.

Best regards,
Sergey

>>>then perhaps the Sandy Bridge system has NUMA enabled and for some reason the new thread runs on the other socket?>>>

Yes, it could also be a NUMA-related issue.

>>>What is NUMA and where can it be enabled?>>>

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

@Pavel

Here is a very interesting discussion about NUMA performance: http://software.intel.com/en-us/forums/topic/346334

Pavel, could you check the specs of your hardware to confirm that you have NUMA system(s)? Thanks in advance.

Pavel has a dual-Xeon motherboard, so it is a NUMA system.

If both systems are NUMA and I am not using affinity masks in my code, how can one system run faster than the other?

How can I check if NUMA is enabled? In BIOS? Can I check it from Windows with some program?

Thanks, Pavel

>>>How can I check if NUMA is enabled? In BIOS? Can I check it from Windows with some program?>>>

You can check for NUMA nodes programmatically. Please consult this reference: msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
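
For example, a small sketch using the Win32 NUMA API ( GetNumaHighestNodeNumber and GetNumaNodeProcessorMask ) to see how many nodes the OS reports:

#include <windows.h>
#include <cstdio>

int main()
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode))
    {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    // highestNode == 0 means the OS sees a single node (no remote memory)
    printf("Highest NUMA node number: %lu\n", highestNode);

    for (UCHAR node = 0; node <= (UCHAR)highestNode; node++)
    {
        ULONGLONG mask = 0;
        if (GetNumaNodeProcessorMask(node, &mask))
            printf("Node %u processor mask: 0x%llx\n", node, mask);
    }
    return 0;
}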

UPD: Both servers are NUMA enabled.

Can the cost of remote memory access be so much higher on Sandy Bridge?

@Pavel

It is not an easy question to answer. There is also very scarce information about NUMA in the Intel SDM. I'm posting a link to a very interesting discussion about NUMA-related performance. I posted a few links there; one of them gives a detailed explanation of NUMA performance degradation.

Link to the post: http://software.intel.com/en-us/forums/topic/346334

Very interesting information regarding NUMA performance degradation: http://communities.vmware.com/thread/391284

@Pavel

I posted a few links to a very interesting discussion also related to NUMA and performance degradation. Unfortunately my posts are still queued for admin approval, so I'm posting below a part of my answer from that discussion.

 

>>>Probably the NUMA architecture - related memory distances, coupled with a thread being executed on a different node and forced to access non-local memory, could be responsible for any performance degradation related to memory accesses. When the number of nodes is greater than 1, some performance penalty is to be expected. IIRC the penalty is measured in units of "NUMA distance" with a normalized local value of 10; every access to local memory has a cost of 10 (normalized), i.e. 1.0. When the process accesses off-node (remote) memory, from the NUMA "point of view" a penalty is added because of the overhead of moving data over the NUMA interlink. Accessing a neighbouring node can add up to 0.4, so the total relative cost can reach 1.4. More information can be found in the ACPI documentation>>>

@Pavel

Can you add NUMA API functions to your test case and test it on both servers?
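
For instance, a hedged sketch of making the test NUMA-aware with VirtualAllocExNuma ( the node number and sizes below are illustrative ): allocate the LUT memory on a chosen node and then run the worker pinned to that same node.

#include <windows.h>
#include <cstdio>

int main()
{
    const SIZE_T lutBytes = 3000ULL * 2000ULL * sizeof(int);   // ~22.9 MB, as in the test

    // Commit memory preferably backed by NUMA node 0.
    void *lut = VirtualAllocExNuma(GetCurrentProcess(), NULL, lutBytes,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                   0 /* preferred node */);
    if (lut == NULL)
    {
        printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }

    // ... build the LUT in 'lut' and run the copy/sort loop on a thread
    //     pinned to the same node, then compare with the other node ...

    VirtualFree(lut, 0, MEM_RELEASE);
    return 0;
}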

Just an inquiry... but how many threads are you running? Are you running with hyperthreading? What's the IPC of the threads, and what's the average B/instruction in each thread? I wonder whether you're running out of ILD bandwidth on SB. SB is more prone to this than IB. Just a thought..

perfwise

@perfwise

Can the pressure build-up on one of the execution ports trigger rescheduling of the threads and moving them further away in NUMA space?

NUMA-related information is also located in the KPRCB structure.

@Pavel

Regarding setting processor affinity, you can use the so-called "Interrupt-Affinity Policy Tool". You can download it from the Microsoft website. Bear in mind that those settings are related to interrupt priority and should be used only when it is known that some driver's ISR is consuming too much processor time.

iliyapolak,

SB differs from IB because, if your IPC is high enough and you're not hitting in the DSB, you can become starved for instructions from the front end. That's one of the biggest differences between SB and NH/IB. One can identify whether this is an issue by monitoring the # of uops delivered by the DSB and also the IPC. If you're at 3+ IPC and you're not hitting in the DSB, you may degrade performance. My experience tells me that if you're trying to pull more than 8B per cycle from the ILD then you're not going to be a happy camper on SB, but IB can do so. There must have been some issue which shipped with SB that, while not a functional problem, degraded performance and was fixed in IB. This is a hard issue to identify, and I'm sure most don't know it exists, simply because it's likely rare given the large % of time the DSB is delivering uops.

perfwise

Thank you all for the very professional and useful feedback. We have decided to stop the migration to SB until the affinity issue is fixed in our code.

Pavel 

@perfwise

What "DSB" stands for?

 

>>@perfwise
>>
>>What "DSB" stands for?

When somebody uses an abbreviation, like DSB, and doesn't explain what it means for everybody, it is the most unpleasant thing on any forum. Personally, I don't have time to "hunt down" all these unexplained abbreviations on the Internet if I don't know what they mean.
