I wanted to look at using InfiniBand verbs on the MIC card for transfers between two different compute nodes: MIC to MIC, host memory to host memory, or host memory on one node to the MIC on the other. You are supposed to be able to do something like this with GPUDirect for Nvidia cards, but the documentation for GPUDirect that I could find is too sketchy right now to be usable. I wrote a program to do some transfers using RDMA and found that transfers from host memory to host memory are much faster than transfers from MIC memory to MIC memory.
My code was excised from a larger program and is hard to understand and build, but I realized that the standard OFED program ibv_rc_pingpong can be used to demonstrate the issue. It is already installed on the compute nodes and on the MIC coprocessors on Stampede, and the data in the attached plot comes from it. If you try it, you need to specify the device with '-d mlx4_0', because there are two devices and the other one, 'scif0', produces nonsense when trying to connect two separate compute nodes. I also increased the maximum transfer unit to 2048 with '-m 2048', which improved the rates somewhat. Depending on where in the network your two compute nodes are placed, there can be some variability in the rates, but all data in the plot was taken with the same two nodes, and after trying several pairs of nodes I found the rates to be typical.
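For anyone who wants to reproduce this, the invocations were essentially the following (the message size given to -s and the hostname are placeholders; pick whatever transfer size you want to measure):

    # on the first node (acts as the server)
    ibv_rc_pingpong -d mlx4_0 -m 2048 -s 1048576

    # on the second node (the client), pointing at the first node
    ibv_rc_pingpong -d mlx4_0 -m 2048 -s 1048576 first-node-hostname

The same commands work when launched on the MIC instead of the host, which is what makes the comparison easy.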
From looking at the source code for ibv_rc_pingpong, it doesn't use RDMA: the work requests it posts use the IBV_WR_SEND opcode (with matching receives posted on the other side via ibv_post_recv), not IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ. Regardless, the rates look the same as with my RDMA code, where the best host-host transfer rate is around 5.77 GB/s and the best MIC-MIC transfer rate is around 0.92 GB/s.
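For reference, here is a minimal sketch (not taken from ibv_rc_pingpong or from my test code) of how the two kinds of work request differ at the verbs level; it assumes the queue pair, memory region, buffer, and the peer's remote address/rkey have already been set up during connection establishment:

    /* Sketch: posting a two-sided SEND vs. a one-sided RDMA WRITE. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    static int post_send_vs_rdma(struct ibv_qp *qp, struct ibv_mr *mr,
                                 void *buf, size_t length,
                                 uint64_t remote_addr, uint32_t rkey,
                                 int use_rdma_write)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = (uint32_t) length,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;

        if (use_rdma_write) {
            /* One-sided RDMA: data lands directly at remote_addr on the
             * peer; no receive needs to be posted on the other side. */
            wr.opcode              = IBV_WR_RDMA_WRITE;
            wr.wr.rdma.remote_addr = remote_addr;
            wr.wr.rdma.rkey        = rkey;
        } else {
            /* Two-sided send, as ibv_rc_pingpong does: the peer must have
             * posted a matching receive with ibv_post_recv(). */
            wr.opcode = IBV_WR_SEND;
        }

        return ibv_post_send(qp, &wr, &bad_wr);
    }

Either way, the data path through the HCA and the MIC's memory is the same kind of operation, which is why it is not surprising that the two programs see the same rates.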
Details of how ibv_rc_pingpong works aside, the important point is that the program does the same thing whether it runs on the host or on the MIC, and whenever the MIC is involved the transfer rates are much lower. The plot also includes host-to-MIC rates, which are a little better than MIC-to-MIC. Presumably this is just a driver issue and there is no underlying hardware limit that forces the rates involving the MIC to be so low. In fact, based on the SCIF transfer rates I posted elsewhere in this forum, it would be about 2x faster to do a three-step process: send the data from the MIC to the local host over SCIF, then over InfiniBand to the remote host, then up to the remote MIC.
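To spell out the arithmetic behind that comparison: if the three copies are done back to back, the effective rate is roughly

    B_staged = 1 / ( 1/B_scif + 1/B_ib + 1/B_scif )

where B_scif is the host<->MIC SCIF rate on each end and B_ib is the host-to-host InfiniBand rate; if the stages are pipelined it approaches min(B_scif, B_ib). Even the back-to-back estimate, using the SCIF rates from my other post and the 5.77 GB/s host-host rate above, comes out around twice the 0.92 GB/s direct MIC-MIC rate.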
I am working on a large sparse eigensolver for a materials science problem where, roughly speaking, an eigenvector is needed for each electron in the system. This produces a very large collection of vectors, on which dense matrix operations are then performed; however, it is sparse matrix operations that produce the vectors in the first place. Because system memory is so much larger than accelerator memory, I had been thinking it might make sense to do the dense operations on the host (where the vectors must eventually end up anyway, given the larger host memory) and the sparse operations on the MIC. That split is counterintuitive, since accelerators are usually considered poor at sparse matrix computations. The point of this test and of the SCIF tests in my other post is to decide whether it can actually work. At 0.92 GB/s, forget about it. Does anyone at Intel think this is just a driver issue that can be improved through future software updates?