Is it possible to extend the memory of MIC?

Hi,

We are using the 5110P model, which has 8 GB of on-board memory. We have some applications that require more memory than the MIC card has. If an application uses up the 8 GB of on-board memory, the system returns an error and kills the program. For this use case, is it possible to use memory on the host machine as a backup?

Thanks!


Memory on the Xeon Phi Coprocessor card can be accessed directly with a latency of about 275 ns and a bandwidth of up to 175 GB/s, while memory on the host can only be accessed indirectly (by way of lots of software) with a latency of about 5000 ns and a bandwidth of up to 7 GB/s -- in other words about 20 times slower for either small transfers (latency dominated) or large transfers (PCIe bandwidth dominated).

Although it is theoretically possible to create the software infrastructure to make access to host memory transparent (e.g., creating a /dev/swap device that uses host memory as the swap space), this is going to be quite inefficient.   Swapping is based on transfer of individual 4 KiB pages, for which OS overhead dominates and throughput will be only a small fraction of the PCIe bandwidth (which is already only 1/20 of the local memory bandwidth).

Because of the overhead of setting up transfers, the best performance would be obtained with (large) explicit transfers to and from the host (preferably bidirectional to get maximum PCIe throughput) overlapped with computation on the Xeon Phi Coprocessor.   This may not be fully supported by the standard software models (e.g., "offload" from the host), but should be possible using a native application on the Xeon Phi communicating with a "helper" process on the host.

Given the 20:1 bandwidth ratio between local memory and PCIe accesses to host memory, I suspect that relatively few applications have the right balance for this to be an effective approach.
 

John D. McCalpin, PhD
"Dr. Bandwidth"

I would recommend running in symmetric mode, i.e. running on the host CPU and a Xeon Phi. Then split the memory allocation and use MPI for communication between the host and the Xeon Phi.

Or just use two Xeon Phis.

Do you need ECC? If not, you can turn it off and get some extra memory for your program!
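For reference, a symmetric-mode launch with the Intel MPI Library looks roughly like the following. Hostnames, rank counts, and binary names are illustrative; app.mic would be a native (k1om) build of the same application.

```shell
# Hypothetical symmetric-mode launch (names and counts illustrative).
export I_MPI_MIC=enable                   # enable coprocessor support in Intel MPI
mpirun -host localhost -n 16 ./app \
     : -host mic0      -n 30 ./app.mic    # native binary runs on the card
```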

Quote:

Patrick S. wrote:

I would recommend to run in symmetric mode, i.e. run on the host cpu and a xeon phi. Then split the memory allocation and use MPI for communication between the host and the xeon phi.

Or just use two xeon phi.

Do you need ECC? If not you can turn it off and get some extra memory for your program!

Actually, we are already using MPI to run in symmetric mode. We found that if we want to fully utilize all 240 available threads, we use up the on-board memory.

BTW, how much memory can be saved by turning off ECC? And how can we do that?

Quote:

John D. McCalpin wrote:

Memory on the Xeon Phi Coprocessor card can be accessed directly with a latency of about 275 ns and a bandwidth of up to 175 GB/s, while memory on the host can only be accessed indirectly (by way of lots of software) with a latency of about 5000 ns and a bandwidth of up to 7 GB/s -- in other words about 20 times slower for either small transfers (latency dominated) or large transfers (PCIe bandwidth dominated).

Although it is theoretically possible to create the software infrastructure to make access to host memory transparent (e.g., creating a /dev/swap device that uses host memory as the swap space), this is going to be quite inefficient.   Swapping is based on transfer of individual 4 KiB pages, for which OS overhead dominates and throughput will be only a small fraction of the PCIe bandwidth (which is already only 1/20 of the local memory bandwidth).

Because of the overhead of setting up transfers, the best performance would be obtained with (large) explicit transfers to and from the host (preferably bidirectional to get maximum PCIe throughput) overlapped with computation on the Xeon Phi Coprocessor.   This may not be fully supported by the standard software models (e.g., "offload" from the host), but should be possible using a native application on the Xeon Phi communicating with a "helper" process on the host.

Given the 20:1 bandwidth ratio between local memory and PCIe accesses to the host memory, I would suspect that relatively few applications have the right balances for this to be an effective approach.

 

Swapping is, as you pointed out, not practical given the small page size. I am wondering whether it is possible to use DMA to make access to host memory transparent, which seems somewhat promising.

Disabling ECC only increases the available memory by 1/32, so it is unlikely to provide enough memory to make any difference to your application.

MPI sets up communication buffers that chew up a lot of memory on the Xeon Phi.  We have found that the system runs best with a small number of MPI tasks on the Xeon Phi (anywhere from 1 to 10) with OpenMP threading within each MPI task providing the additional parallelism.
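A hybrid launch along those lines might look like this with the Intel MPI Library. The rank counts and thread counts are illustrative (4 ranks x 60 threads covers the card's 240 hardware threads); per-executable environment settings use mpirun's -env option.

```shell
# Hypothetical few-ranks, many-threads launch (values illustrative).
export I_MPI_MIC=enable
mpirun -env OMP_NUM_THREADS 8  -host localhost -n 2 ./app \
     : -env OMP_NUM_THREADS 60 -host mic0      -n 4 ./app.mic
```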

John D. McCalpin, PhD
"Dr. Bandwidth"

Hi YW,

I realize it's been 2 years... did you find a working solution?

The new Xeon Phi devices (x2xx, released recently) have up to 16 GB of on-die memory; additionally, when operated as a server host CPU (on the S7200AP, for example), those devices can use up to 384 GB of additional DDR4 DRAM (http://www.intel.com/content/www/us/en/motherboards/server-motherboards/...).

If you need more than that, you can use Intel's new Software Defined Memory capability on NVMe flash drives (with ScaleMP technology) and easily attach a couple of 1.6 TB NVMe drives to an S7200AP system, which would then appear as a Phi host system with 3 TB of system memory.

Benzi
(Proper disclosure: I am with ScaleMP)
