NUMA on Xeon E5-2620

Jason K.'s picture

My apologies in advance.  This is the first time I'm posting on the Intel developer forums, and in all fairness, I'm not 100% sure that all parts of my question belong in this forum.  Please bear with me.

I work in a university CS and Eng. department.  A faculty member recently purchased an Intel server with the W2600CR motherboard, and dual Xeon E5-2620 processors along with 4 x Nvidia GTX670 cards to be used with GPU computing experiments (with CUDA) under Linux (Red Hat Enterprise).  The system has 64 GB of DDR3 memory.  Because of the system architecture, 32 GB is "attached" to each CPU, and 2 PCI-E slots, hence two GPUs are attached to each CPU (1 x16 and 1 x8 on the first, 2 x8's on the second).

After getting the system set up, the faculty member was shocked that the performance of pageable memory transfers between the system and the graphics cards was extremely poor (simply running CUDA's bandwidthTest program with the --memory=pageable option).  The results were in the range of 1500 MB/s instead of the 4000 MB/s that he was getting on a much older Core i7 system from 2009.  The GTX670 graphics card was taken out of the newer system and moved to the 2009 system, and it worked at the proper performance level there.  The GTX580 graphics card was taken out of the 2009 system and tried in the new system, and it performed just as poorly as the GTX670.

I spoke to the vendor from whom we had purchased the Intel solution, who opened a ticket with Intel, and we've been going back and forth for quite a few weeks now with no resolution to the problem.  I tried many different setups, and finally I removed one of the Xeon processors from the new system, along with the second 32 GB of memory and 3 of the graphics cards.  Now the performance tests worked perfectly, giving the expected results.  This led me to wonder if this was some kind of NUMA issue -- maybe I was somehow running the performance test on CPU 0, using a graphics card connected to CPU 1, and memory connected to CPU 1.  Could this be the reason why the performance numbers were so poor?

I put the second Xeon processor back in the system, but left out the second 32 GB of memory, and left only one GPU in the system, connected to CPU 0.  I used "numactl --hardware", and expected to see two nodes with 12 logical cores each (each E5-2620 has 6 cores, and hyperthreading is enabled).  Instead, I saw 1 node with 24 cores!  I ensured that NUMA was enabled in the BIOS.  I did try running bandwidthTest on all 24 cores, and while there was some minor variation in the numbers, nothing came even close to the 4000 MB/s result that I wanted to see.
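For anyone who wants to repeat the node check above in a script, here is a minimal Python sketch that parses the text printed by "numactl --hardware" and reports which CPUs and how much memory belong to each node. The SAMPLE text is a hypothetical two-node layout made up for illustration (it is not captured from the W2600CR), and the function name parse_numactl_hardware is my own.

```python
import re

# Hypothetical sample of `numactl --hardware` output for a two-socket
# machine; the numbers are illustrative, not taken from the W2600CR.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 32768 MB
node 0 free: 31000 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 32768 MB
node 1 free: 31500 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""


def parse_numactl_hardware(text):
    """Return {node_id: {"cpus": [...], "size_mb": int}} from numactl text."""
    nodes = {}
    for line in text.splitlines():
        # Lines such as "node 0 cpus: 0 1 2 ..."
        m = re.match(r"node (\d+) cpus:(.*)$", line)
        if m:
            cpus = [int(c) for c in m.group(2).split()]
            nodes.setdefault(int(m.group(1)), {})["cpus"] = cpus
            continue
        # Lines such as "node 0 size: 32768 MB"
        m = re.match(r"node (\d+) size: (\d+) MB$", line)
        if m:
            nodes.setdefault(int(m.group(1)), {})["size_mb"] = int(m.group(2))
    return nodes


if __name__ == "__main__":
    for node_id, info in sorted(parse_numactl_hardware(SAMPLE).items()):
        print(node_id, len(info["cpus"]), "cpus,", info["size_mb"], "MB")
```

On a real machine the text would come from running the command itself (for example via subprocess.run), and a single node listing all 24 CPUs would show up as a one-entry dictionary.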

I decided that since I didn't have much luck getting Intel's help to solve the GPU problem, maybe if we could solve the "NUMA" issue, the GPU performance issue would go away.  After more back and forth with the vendor and Intel, Intel apparently set up a W2600CR system with Linux (RH6.1, since apparently Intel doesn't "support" later versions) in a test lab.  I excitedly waited for the response.  Would it be an O/S bug? A BIOS issue that needed to be fixed?  My vendor called me back and explained that the support person said that, after testing all the boards in the W2600 series, this NUMA behaviour was normal for this board and these processors.  My problem is, I just don't understand *WHY* it's normal.  Another vendor, which has an excellent tutorial on using numactl under Linux on its web site, has a system with the same chipset and the same processors, yet they show 2 nodes!  Nobody seems to be able to answer *why* it's different for this Intel board.

In the end, I've been unable to rectify the situation, and have lost a LOT of time trying.  Is my only choice at this point to buy a second server board, move the second Xeon processor, 32 GB of memory, and two GPUs there?  I really don't believe in "solving" problems this way...I suspect the answer *IS* out there, and maybe, just maybe you've got it ... please? ! :)

Jason.

Sergey Kostrov's picture

Hi Jason,

>>...Another vendor which has an excellent tutorial on using numactl under Linux on their web site has a system with the same
>>chipset, and same processors, yet they show 2 nodes! Nobody seems to be able to answer *why* it's different for this Intel board...

Posts related to different problems with NUMA are very interesting and challenging (there were 4 or 5 during the last 3 months), but it is really hard to help because, I'm 99% sure, many experienced IDZ users don't have a NUMA system. Since I belong to the group of IDZ users who don't have a NUMA system, I would prefer to stay away from giving any technical advice.

So, after reading your post, I think you need to work with the other vendor in order to understand what could possibly be wrong with your NUMA system. Personally, I really understand your situation and I know that time is the most unrecoverable asset.

Best regards,
Sergey

iliyapolak's picture

Hi Jason,
In general, NUMA troubleshooting is challenging and not an easy thing to do. I would like to post a few links to various articles and sites which deal with the Linux NUMA performance penalty.

P.S.
Web links are posted without the "http" protocol identifier because of anti-spam filtering.

Link1: ft.ornl.gov/pubs-archive/47-mccurdy-1.pdf
Link2: bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
Link3: lse.sourceforge.net/numa/
Link4: ft.ornl.gov/pubs-archive/numa.pdf // Very insightful article.
Link5: devtalk.nvidia.com/default/topic/481951/cuda-numa-memory-and-34-numa-34-gpus/

Jason K.'s picture

Thanks Sergey and Iliyapolak,

Sergey - First, I'm not sure if the problem is or is not NUMA related.  However, I know that Intel has specialists in this area, and it's not clear why my support ticket doesn't get escalated to someone who can directly help me in this respect, and explain the problem.  At least someone at Intel should be able to explain to me why it's "normal" for this board to see all cores attached to one node.  Unfortunately, I can't return the board.  I wish I could.

Thanks Iliya for the posts -- they are very very helpful, and include some links that I haven't yet read, especially the one that you marked "very insightful article".

However, in the end, I suspect that "maybe" my problem wouldn't be a problem if I could use numa properly with this board. 

Jason.

Sergey Kostrov's picture

>>...However, I know that Intel has specialists in this area, and it's not clear why my support ticket doesn't get escalated to
>>someone who can directly help me in this respect, and explain the problem. At least someone at Intel should be explain
>>to me why it's "normal" for this board to see all cores attached to one node. Unfortunately, I can't return the board...

I simply don't know. Did you try to request Premier support or something like it in your country? I think you need to communicate with a person who has real experience with NUMA and that type of board.

I wonder if Intel software / hardware engineers could take care of the problem? I see that it is too specific.

Jason K.'s picture

Sergey,

As it happens, I have now identified that the problem is not actually NUMA after all.  After adding back 32 GB on the second processor, numactl reports 2 nodes with 12 cores on each node, and 32 GB on each.  Apparently by removing the memory from the second memory bank (in order to force memory access to CPU0), it is normal NUMA behaviour to show all cores on one node. It was actually a person who wrote one of the papers that Iliyapolak (above) had brought to my attention that explained this...

For testing, if I have only one GPU, connected to a PCI-E port attached to CPU 0, and I use numactl to run bandwidthTest on the first CPU, I get one result; if I then use numactl to run bandwidthTest on the second CPU, the result is a little bit slower (since it has to talk to the GPU connected to CPU 0).  This makes sense.  But it doesn't explain the 1000 MB/s reduction in speed in bandwidthTest when a second CPU is inserted.  I still believe this is a chipset flaw. I need to speak to the vendor about escalating this request.  It would sure be nice to have the problem resolved since I've been working on it for weeks!!!  It would be nice if an Intel engineer who reads this message might be able to help.
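The per-node runs described above can be written down as command lines. The following is a small Python sketch that builds the numactl invocation for each node; the binary path ./bandwidthTest, the CUDA device number 0, and the helper name pinned_bandwidth_cmd are assumptions for illustration, not the exact commands used in this thread.

```python
import shlex


def pinned_bandwidth_cmd(node, device, memory="pageable",
                         binary="./bandwidthTest"):
    """Build an argv list that runs bandwidthTest bound to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={node}",  # run only on the CPUs of this node
        f"--membind={node}",      # allocate host memory only from this node
        binary,                   # assumed location of the CUDA sample binary
        f"--memory={memory}",
        f"--device={device}",     # CUDA device index (assumed to be 0 here)
    ]


if __name__ == "__main__":
    # Same GPU, benchmark process first on node 0 and then on node 1.
    for node in (0, 1):
        print(shlex.join(pinned_bandwidth_cmd(node, device=0)))
```

Binding both the CPUs and the memory keeps the pageable host buffer on the same node as the process, so the only thing that changes between the two runs is which socket has to reach the GPU.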

Jason.

Sergey Kostrov's picture

>>...I still believe this is a chipset flaw....
>>...
>>...It would be nice if an Intel engineer who reads this message might be able to help...

I address it to TimP ( Intel ), Patrick Fay ( Intel ) or an engineer from Intel:

Please take a look at it and forward to somebody who could help. Thanks in advance.

iliyapolak's picture

>>>Thanks Iliya for the posts -- they are very very helpful, and include some links that I haven't yet read, especially the one that you marked "very insightful article".>>>

You are welcome.

iliyapolak's picture

>>>This doesn't explain the 1000 MB/s reduction in speed in bandwidthTest when a second  It would sure be >>>

Such behaviour (when a second CPU is present) can be traced back to NUMA memory distances. Bear in mind that NUMA functionality at the physical level resembles a small network with its own protocol, error checking and correction, hardware arbitration, etc.

Link6: www.cs.uchicago.edu/files/tr_authentic/TR-2011-02.pdf

iliyapolak's picture

Please follow this link: stackoverflow.com/questions/7259363/measuring-numa-non-uniform-memory-access-no-observable-asymmetry-why

iliyapolak's picture

Do you have any updates related to your problem?

Jason K.'s picture

Hi.

I still don't have any response from Intel.  Someone (non-Intel) suggested that with one processor installed, there's no QPI.  When I install the second processor, QPI is enabled, and with a "slower" processor (2.0 GHz), it is possible that this isn't enough to run QPI at full speed, hence affecting the result, even though MY program isn't using QPI.  This is, by the way, running bandwidthTest on CPU 0, using memory bank 0, using a GPU in an x16 slot that is controlled by CPU 0.  The problem is not NUMA related because running the test using memory bank 1 or CPU 1 shows the true effects of NUMA.  I suspect I will never really 100% know the answer!
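One way to double-check which CPU a given slot hangs off is to read the numa_node attribute that Linux exposes in sysfs for each PCI device. This is only a sketch: the PCI address 0000:02:00.0 is a made-up placeholder (the real one can be found with lspci), and the helper name pci_numa_node is my own.

```python
from pathlib import Path


def pci_numa_node(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node Linux reports for a PCI device, or None.

    A value of -1 means the kernel has no locality information for the
    device; None means the attribute could not be read at all.
    """
    path = Path(sysfs_root) / pci_addr / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # Placeholder address; substitute the GPU's address from `lspci`.
    print(pci_numa_node("0000:02:00.0"))
```

If the value matches the node passed to numactl, the benchmark process, its memory, and the GPU are all on the same socket, which is the configuration described above.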

Jason.

iliyapolak's picture

>>>Someone (non-Intel) suggested that with one processor installed, there's no QPI>>>

Not exactly. On a single-CPU system QPI is used for interconnecting the processor with the I/O hub (X58 chipset). On a multi-processor system QPI is used to interconnect nodes and I/O hubs.

 >>> Even though MY program isn't using QPI>>>

How can you know that your program does not use QPI?

Jason K.'s picture

Good point.  My program isn't using the processor/memory interconnection of QPI whether running on single processor/dual processor configuration.  I'm told that Intel will still setup a trial and get back to me.  It just takes a long time.

iliyapolak's picture

 >>>My program isn't using the processor/memory interconnection of QPI whether running on single processor/dual processor >>>

Sorry, I was looking at the wrong chipset. I assume that your motherboard is built around the C600 chipset?

Jason K.'s picture

Yes, C600-A to be specific..

iliyapolak's picture

Quote:

Jason K. wrote:

Yes, C600-A to be specific..

So I was wrong. I thought that graphics data (text strings or fonts) is moved over QPI to the I/O hub, which sends it to the GPU for text rendering.

Jason K.'s picture

Hi..

I waited for Intel to respond after trying the single and dual Xeon configurations in the W2600CR with an NVIDIA GTX580 card and the CUDA bandwidthTest program.  In particular, I want to understand why the pageable memory test runs about 1 GB/s slower with dual CPUs.  I gave explicit instructions on where to download CUDA, how to install it, and how to run bandwidthTest, in case a demonstration was needed.

The first response that I got back from Intel: 

>>Thank you for contacting Intel Technical Support.

>>The Intel® QPI technology is not an issue in the Intel® S2600 series of Server boards. The performance and benchmarks you received with one or two CPU are correct.

>>We ran the tests with several Intel® Server board with the C600 chipset and the Intel® QPI technology and we received the same behavior with 2 CPUs in the configuration.

So I write back, and say that  I *know* that they can replicate the result.  I want to know the *reason* for the result.  My ticket gets "escalated" to someone else.  Now I get:

>>The way I read this is the customer’s app or bench is poorly (multi) threaded.

>>Tell them to try a different multi-threaded benchmark and/or different app.

>>We have some performance bench available here: http://www.intel.com/content/www/us/en/benchmarks/server/xeon-e5-2600-summary.html

>>Perhaps try one of them to see if there’s really performance degradation with the second processor installed.

>>Take note that I have seen some HPC benchmark performance affected by HyperThreading but this is again due to an app related issue, not board or processor.

I now give up.  I told my vendor to just close the ticket.  I've been working on this for weeks, getting back responses like the above that don't address my question whatsoever.  All I wanted was for a hardware expert to be able to explain what was causing the discrepancy.  Unfortunately, the Intel support that *I* have access to is unable to explain this to me.

iliyapolak's picture

Hi Jason,

I understand you, but I do not think that a simple benchmark can diagnose the problem. In order to investigate the problem, someone must dig deep into the internal implementation of the Uncore, QPI, NUMA and the GPU's front end, probably at the machine code level. As long as there is no large-scale performance penalty across various apps, nobody will really investigate such issues, and software will be blamed for the performance degradation.

jimdempseyatthecove's picture

Jason,

I have an experiment for you to try.

I've noticed that you have experimented with removing one CPU's attached memory and running with 2 CPUs, as well as removing 1 CPU and getting good or reasonably good performance. Good detective work, by the way. The additional experiment is to configure with 2 CPUs with NUMA enabled (the configuration you want).

Run the CUDA bandwidth test with the bandwidth test app constrained to one NUMA node (and one of the GPUs). Essentially you have done this already. However, the twist is to run, on the other NUMA node, a dummy app that engages all threads on that node performing _mm_pause();

Get results, repeat test using same nodes for apps, different GPU

Get results, repeat test using swapped nodes for apps, different GPU 

Get results, repeat test using same nodes for apps, different GPU

Essentially what the test is doing is assuring that virtually all hardware threads on the other CPU are minimally interfering with the QPI bus.

Should the GPU bandwidth performance improve, then this may provide some insight for the Intel support people to follow up on. However, this will not fix your underlying problem.

An additional test to run is

Run 2 CUDA bandwidth test apps concurrently, each constrained to a NUMA node and using the GPU attached to the CPU on that node.
This test would more likely be representative of your application (as opposed to testing each individually).
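To keep the runs of this experiment straight, the node/GPU combinations can be enumerated up front. Below is a rough Python sketch that only builds the command lines and does not launch anything. The spinner binary ./pause_spinner is hypothetical (it stands for the dummy app that loops on _mm_pause() on every hardware thread of the other node), and the CUDA device numbers 0 and 1 are assumed, not taken from the actual system.

```python
from itertools import product


def numactl_prefix(node):
    """numactl arguments that bind both CPUs and memory to one node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"]


def experiment_plan(nodes=(0, 1), gpus=(0, 1), spinner="./pause_spinner"):
    """List (benchmark_cmd, spinner_cmd) pairs for each node/GPU combination.

    The benchmark runs on one node while the hypothetical spinner binary
    keeps every hardware thread of the *other* node in a pause loop.
    """
    plan = []
    for bench_node, gpu in product(nodes, gpus):
        other = [n for n in nodes if n != bench_node][0]
        bench = numactl_prefix(bench_node) + [
            "./bandwidthTest", "--memory=pageable", f"--device={gpu}"]
        spin = numactl_prefix(other) + [spinner]
        plan.append((bench, spin))
    return plan


if __name__ == "__main__":
    for bench, spin in experiment_plan():
        print(" ".join(spin), "&", " ".join(bench))
```

With two nodes and two GPUs this yields four runs: same nodes with each GPU, then swapped nodes with each GPU. The concurrent test would instead start one bandwidthTest per node at the same time, each with the GPU attached to that node.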

Jim Dempsey

www.quickthreadprogramming.com
