After upgrading our servers from Dual Xeon E5645 2.4GHz (Nehalem) to Dual Xeon E5-2620 2.0GHz (Sandy Bridge), I see a serious performance decrease in my multithreaded application. I have created a small C++ sample (attached) that reproduces the problem. I have a prebuilt LUT with 3000 rows of ints, each row containing about 2000 numbers. The function just copies each row to a preallocated buffer and sorts it. I ran it once in the main thread and once in a separate thread (while the main thread waits). I know there is thread creation overhead, but I thought it was at most about 1 ms. For precise results I average over 100 iterations. I tested the same code on 3 servers running Windows Server 2008 R2 x64, and my application is also x64. The code was compiled with VC++ 2012 Express. The results are:
Dual Xeon E5645 2.4GHz (Nehalem): Main thread: 340.522 ms, Separate thread: 388.598 ms, Diff: 13%
Dual Xeon E5-2620 2.0GHz (Sandy Bridge): Main thread: 362.515 ms, Separate thread: 565.295 ms, Diff: 36%
Single Xeon E3-1230 V2 3.3GHz (Ivy Bridge): Main thread: 234.928 ms, Separate thread: 267.603 ms, Diff: 13%
My problem is the 36% difference. Can anyone explain what is wrong with my code? It may not be super optimized, but why does it behave differently on Sandy Bridge?
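In case the attachment is not available, the test boils down to roughly the following. This is a simplified sketch rather than the exact attached code: the names BuildLut, SortAllRows and AverageMs, the fixed random seed, the checksum, and the use of std::thread and std::chrono::steady_clock are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

typedef std::vector<std::vector<int> > Lut;

// Build the lookup table once. The fixed seed (an assumption) keeps runs repeatable.
Lut BuildLut(int rows, int cols) {
    std::srand(12345);
    Lut lut(rows, std::vector<int>(cols));
    for (auto& row : lut)
        for (auto& value : row)
            value = std::rand();
    return lut;
}

// Copy each row into the preallocated buffer and sort the copy.
// Returns a checksum so the compiler cannot optimize the work away.
long long SortAllRows(const Lut& lut, std::vector<int>& buffer) {
    long long checksum = 0;
    for (const auto& row : lut) {
        if (row.empty()) continue;
        assert(buffer.size() >= row.size());
        std::copy(row.begin(), row.end(), buffer.begin());
        std::sort(buffer.begin(), buffer.begin() + row.size());
        checksum += buffer[0];  // smallest element of the row
    }
    return checksum;
}

// Average wall-clock time per iteration in milliseconds.
// With inThread == true, every iteration runs in a freshly created thread
// while the calling thread blocks in join().
double AverageMs(const Lut& lut, int iterations, bool inThread) {
    std::size_t maxLen = 0;
    for (const auto& row : lut)
        maxLen = std::max(maxLen, row.size());
    std::vector<int> buffer(maxLen);

    long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        if (inThread) {
            std::thread worker([&] { sink += SortAllRows(lut, buffer); });
            worker.join();
        } else {
            sink += SortAllRows(lut, buffer);
        }
    }
    auto end = std::chrono::steady_clock::now();

    volatile long long keep = sink;  // keep the result observable
    (void)keep;
    return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}
```

With a table from BuildLut(3000, 2000), calling AverageMs(lut, 100, false) and AverageMs(lut, 100, true) from main and printing both values corresponds to the "Main thread" and "Separate thread" figures above.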
Many thanks, Pavel.