When I put timers around a scif_writeto call, starting the timer before the call (so the call's own overhead is included) shows a transfer rate of 6.6 GB/s. Starting the timer right after the call (so only the sync logic is timed) shows 7.9 GB/s, which would be peak performance for an x16 PCI Express 2.0 link.
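To put a number on the gap, the two rates can be turned into times. This is only a rough sketch: it assumes 1 GB = 1e9 bytes (the post does not say which convention the rates use), and the function names are made up for illustration.

```c
#include <stdio.h>

/* Seconds needed to move `bytes` at `gb_per_s`.
 * Assumption: 1 GB = 1e9 bytes. */
static double transfer_seconds(double bytes, double gb_per_s)
{
    return bytes / (gb_per_s * 1e9);
}

/* Fraction of the total wall time spent inside the call itself.
 * rate_with_call: rate measured with the timer started before the call.
 * rate_sync_only: rate measured with the timer started after the call.
 * With T_total = N / rate_with_call and T_sync = N / rate_sync_only,
 * the fraction is (T_total - T_sync) / T_total = 1 - rate_with_call / rate_sync_only. */
static double call_fraction(double rate_with_call, double rate_sync_only)
{
    return 1.0 - rate_with_call / rate_sync_only;
}
```

Calling call_fraction(6.6, 7.9) gives about 0.16, so roughly 16% of the wall time is spent inside scif_writeto. For a 500e6-byte message, transfer_seconds gives about 76 ms in total versus about 63 ms for the sync part alone, a difference of roughly 12 ms.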
I think my test is OK: the message size is 500 MB, the flags parameter for writeto is set to 0, all memory is aligned on a multiple of the page size, and all memory is initialized with some value on both ends of the connection. Huge pages are enabled according to /proc/meminfo (this is on Stampede). The host is writing to the MIC.
When I test message sizes of 16, 8 and 1 MB, the fraction of time spent in writeto stays roughly the same in each case, so the overhead seems to be proportional to the message size. Furthermore, when I issue the same transfer repeatedly using the same buffer, the overhead does not go away. I would have guessed that any O(N) heavy lifting (such as pinning) would be done by the register function, not the send function, and only once regardless of how many transfers follow.
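One way to get both numbers out of a single run, so the fraction can be compared across message sizes, is to take timestamps at three points. This is only a sketch of the idea, not the code in the attached test program: phase_fn, time_phases and count_call are hypothetical names, and the two callbacks stand in for "issue the transfer" (a wrapper around scif_writeto) and "wait for completion" (whatever sync logic is used).

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical callback type: one phase of the transfer. */
typedef void (*phase_fn)(void *ctx);

struct phase_times {
    double issue_s;  /* seconds spent inside the issuing call */
    double wait_s;   /* seconds spent waiting for completion  */
};

/* Current monotonic time in seconds. */
static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run issue() and then wait(), timing each phase separately. */
static struct phase_times time_phases(phase_fn issue, phase_fn wait, void *ctx)
{
    struct phase_times t;
    double t0 = now_s();
    issue(ctx);              /* e.g. a wrapper that calls scif_writeto */
    double t1 = now_s();
    wait(ctx);               /* e.g. the sync logic                    */
    double t2 = now_s();
    t.issue_s = t1 - t0;
    t.wait_s  = t2 - t1;
    return t;
}

/* Placeholder phase for trying the harness without hardware:
 * it only counts how often it was called. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

The fraction of time spent in the call is then issue_s / (issue_s + wait_s). Bytes divided by (issue_s + wait_s) corresponds to the rate with the call included, and bytes divided by wait_s to the rate with only the sync logic timed.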
I thought this test would show peak transfer rates. Maybe it does, considering the 7.9 GB/s number, but a lot of time is being spent in writeto. Does anyone know of any pitfalls? Has anyone seen better transfer rates with the time spent in writeto included?
The source file is just a simple but messy scif test program; sorry for the mess if you take a look. The lines in question are 68-78.