performance loss

Hi,

I observed an interesting performance loss in my measurements.

I have a system with two sockets; each socket holds an E5-2680 processor. Each processor has 8 cores with Hyper-Threading. Hyper-Threading was ignored (one instance per physical core).

On this system, I started 16 instances of a program at the same time, pinning each instance to a different core. At first, I set all cores to 2.7 GHz and saw:

Program 0 Runtime 7.7s

Program 8 Runtime 7.63s

Then I set the cores on the second socket to 1.2 GHz and saw:

Program 0 Runtime 12.18s

Program 8 Runtime 15.73s

Program 8 ran slower; that is clear, because core 8 now had a lower frequency. But why was program 0 also slower? Its frequency wasn't touched.
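For reference, I launched and pinned the instances roughly like this (a sketch; "./program" stands in for my actual benchmark, and lowering the frequency assumes root access and the userspace cpufreq governor):

# start 16 instances, one pinned to each physical core 0-15
for core in $(seq 0 15); do
    taskset -c $core ./program &
done
wait

# second measurement: slow cores 8-15 (the second socket) to 1.2 GHz
for core in $(seq 8 15); do
    echo userspace > /sys/devices/system/cpu/cpu$core/cpufreq/scaling_governor
    echo 1200000   > /sys/devices/system/cpu/cpu$core/cpufreq/scaling_setspeed
done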

 

Regards,

Bo

Did you verify that you actually can set different clock rates per socket? (measure the rates too)

Jim Dempsey

www.quickthreadprogramming.com

Yes. I get the following output with "cat /proc/cpuinfo | grep MHz":

cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000

With "numactl --hardware", I get:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 32735 MB
node 0 free: 30458 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
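Note that the MHz value in /proc/cpuinfo can lag behind the governor; the cpufreq sysfs interface reports the frequency the governor actually requested (in kHz), e.g.:

for c in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq; do
    echo "$c: $(cat $c)"    # 2700000 = 2.7 GHz, 1200000 = 1.2 GHz
done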

BTW, you can also recognize the new frequency from the different runtimes of the two measurements.


 

Did you allocate memory on the fast node or the slow one?

--Vladimir

Each program allocates its own local memory, i.e., the memory is distributed over the two sockets.
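To confirm that, each instance's page placement can be checked while it runs (sketch; <pid> is the instance's process ID):

# the N0=<pages> / N1=<pages> fields show how many pages of each
# mapping reside on node 0 and node 1
grep -E 'N[0-9]+=' /proc/<pid>/numa_maps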

In your motherboard BIOS, you can configure the memory in two different ways:

UMA: all banks attached to both sockets are interleaved (sequential addresses are distributed round-robin across all banks), so that on average memory access is uniform everywhere. Depending on who wrote the BIOS user guide, what "interleaved" means is occasionally mistranslated; some guides list this backwards.

NUMA: the memory attached to each socket forms a contiguous address block, meaning CPU 0 can access its locally attached block faster than the remotely attached one.

Now then, in your sample program: if your memory system is configured as UMA, slowing down one CPU will slow down both CPUs' access to memory. If it is configured as NUMA, then, provided each program's memory is allocated from addresses local to its CPU, each CPU should show the results you expected.

You will have to read up on how to configure your memory system (UMA or NUMA), as well as on the rules to follow to ensure that your memory allocations, and their use, reside on the socket you expect.
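One way to see this effect directly (sketch; "./program" again stands in for the benchmark) is to keep the compute core fixed and move only the memory between nodes:

# run on a fast core (node 0) with memory local to node 0
numactl --physcpubind=0 --membind=0 ./program
# same fast core, but memory forced onto the slowed node 1
numactl --physcpubind=0 --membind=1 ./program

If the second run is noticeably slower, that would be consistent with the explanation above: memory traffic that crosses to (or interleaves with) the slowed socket costs program 0 performance as well.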

Jim Dempsey

www.quickthreadprogramming.com

Right. The simplest way to check what's going on is to take VTune Amplifier and look at the difference in hotspots between the two runs.
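For example, something along these lines with the VTune Amplifier command-line tool (the result-directory name is arbitrary, and "./program" stands for the benchmark):

amplxe-cl -collect hotspots -result-dir r001hs -- ./program
amplxe-cl -report hotspots -result-dir r001hs

Comparing the hotspot profiles of the all-2.7 GHz run and the mixed-frequency run should show where program 0 spends the extra time.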

--Vladimir
