My apologies in advance. This is the first time I'm posting on the Intel developer forums, and in all fairness, I'm not 100% sure that all parts of my question belong in this forum. Please bear with me.
I work in a university CS and Eng. department. A faculty member recently purchased an Intel server with the W2600CR motherboard and dual Xeon E5-2620 processors, along with 4 x Nvidia GTX670 cards to be used for GPU computing experiments (with CUDA) under Linux (Red Hat Enterprise). The system has 64 GB of DDR3 memory. Because of the system architecture, 32 GB of memory and two PCI-E slots are attached to each CPU, so two GPUs hang off each CPU (one x16 and one x8 on the first, two x8s on the second).
After getting the system set up, the faculty member was shocked that the performance of pageable-memory transfers between the system and the graphics cards was extremely poor (simply running CUDA's bandwidthTest program with the --memory=pageable option). The results were in the range of 1500 MB/s instead of the 4000 MB/s he was getting on a much older Core i7 system from 2009. The GTX670 was taken out of the newer system and moved to the 2009 system, and it ran at the proper performance level there. The GTX580 was taken out of the 2009 system and tried in the new system, and it performed just as poorly as the GTX670 had.
I spoke to the vendor from whom we had purchased the Intel solution, who opened a ticket with Intel, and we've been going back and forth for quite a few weeks now with no resolution. I tried many different setups, but finally I removed one of the Xeon processors, the second 32 GB of memory, and three of the graphics cards from the new system, and then the performance tests worked perfectly, giving the expected results. This led me to wonder whether this was some kind of NUMA issue -- maybe I was somehow running the performance test on CPU 0 while using a graphics card connected to CPU 1 and memory attached to CPU 1. Could this be the reason the performance numbers were so poor?
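A minimal sketch of the experiment I had in mind (assuming numactl is installed and the bandwidthTest binary from the CUDA samples sits in the current directory -- adjust the path for your setup):

```shell
# Pin both the CPUs and the memory of bandwidthTest to one NUMA node at
# a time, so local vs. remote host-memory placement can be compared.
for node in 0 1; do
    echo "--- CPUs and memory bound to NUMA node $node ---"
    if command -v numactl >/dev/null 2>&1; then
        numactl --cpunodebind="$node" --membind="$node" \
            ./bandwidthTest --memory=pageable || echo "run failed for node $node"
    else
        echo "(numactl not available; command shown for illustration)"
    fi
done
```

If the bandwidth only collapses when the buffer lives on the node opposite the GPU's PCI-E slot, that would point squarely at NUMA placement.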
I put the second Xeon processor back in the system, left out the second 32 GB of memory, and left only one GPU in the system, connected to CPU 0. I ran "numactl --hardware" and expected to see two nodes with 12 cores each (the E5-2620 is a 6-core part, there are two of them, and hyperthreading is enabled). Instead, I saw one node with 24 cores! I made sure NUMA was enabled in the BIOS. I did try running bandwidthTest on each of the 24 cores, and while there was some minor variation in the numbers, nothing came even close to the 4000 MB/s result I wanted to see.
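For what it's worth, here's a quick cross-check of the node count that doesn't rely on numactl at all -- the kernel exposes one nodeN directory in sysfs per NUMA node it knows about:

```shell
# Count the NUMA nodes the kernel itself sees; on a properly exposed
# dual-socket system this should print 2.
nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
echo "NUMA nodes visible to the kernel: $nodes"
```

On the W2600CR this also reports a single node, so whatever is happening, the kernel genuinely believes there is only one node -- it's not a numactl quirk.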
I decided that since I hadn't had much luck getting Intel's help with the GPU problem, maybe if we could solve the "NUMA" issue, the GPU performance issue would go away. After more back and forth with the vendor and Intel, Intel apparently set up a W2600CR system with Linux (RH6.1, since apparently Intel doesn't "support" later versions) in a test lab. I excitedly waited for the response. Would it be an O/S bug? A BIOS issue that needed to be fixed??? My vendor called me back and explained that the support person said that, after testing all the boards in the W2600 series, this NUMA behaviour was normal for this board and these processors. My problem is, I just don't understand *WHY* it's normal. Another vendor, which has an excellent tutorial on using numactl under Linux on their web site, has a system with the same chipset and the same processors, yet they show two nodes! Nobody seems to be able to answer *why* it's different for this Intel board.
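One thing I can think of checking (an assumption on my part, not something Intel confirmed): Linux learns the node layout from the ACPI SRAT (System Resource Affinity Table) that the firmware provides. Without a usable SRAT -- for example, when the BIOS interleaves memory across the sockets -- the kernel collapses everything into a single node, regardless of what the NUMA switch in the BIOS says. A quick way to look:

```shell
# Grep the boot log for SRAT/NUMA lines; if the firmware exported no
# affinity table, that absence would explain the single-node view.
srat=$(dmesg 2>/dev/null | grep -iE 'srat|numa' \
    || echo "no SRAT/NUMA lines found (or dmesg not permitted)")
echo "$srat"
```

If anyone with a W2600CR could compare this output against a board that does show two nodes, that might narrow down whether it's a firmware table issue.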
In the end, I've been unable to rectify the situation, and I have lost a LOT of time trying. Is my only choice at this point to buy a second server board and move the second Xeon processor, 32 GB of memory, and two GPUs there? I really don't believe in "solving" problems this way... I suspect the answer *IS* out there, and maybe, just maybe, you've got it ... please?! :)