scif_writeto seems heavy weight and general writeto/readfrom performance

When I put timers around a scif_writeto, starting the timer before the call (so the call overhead is included) shows transfer rates of 6.6 GB/s.  Starting the timer right after the call (so only the synchronization logic is timed) shows 7.9 GB/s, which would be peak performance for a 16x PCIe 2.0 link.
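
For reference, here is a minimal sketch of the two timer placements, assuming an already-connected endpoint epd and buffers registered at offsets loff/roff (the names and the fence-based synchronization are illustrative, not taken from the attached file, which may synchronize differently):

```cpp
#include <scif.h>
#include <sys/time.h>
#include <cstdio>

// Seconds since the epoch as a double.
static double now()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

// Time one host->MIC RMA write two ways: including and excluding
// the scif_writeto call itself. epd, loff, roff, and len are assumed
// to come from the usual scif_connect/scif_register setup.
void time_writeto(scif_epd_t epd, off_t loff, off_t roff, size_t len)
{
    double t0 = now();                        // start A: before the call
    if (scif_writeto(epd, loff, len, roff, 0) < 0)
        perror("scif_writeto");
    double t1 = now();                        // start B: after the call
    int mark;                                 // fence so the transfer is complete
    scif_fence_mark(epd, SCIF_FENCE_INIT_SELF, &mark);
    scif_fence_wait(epd, mark);
    double t2 = now();
    printf("call included: %.2f GB/s\n", len / (t2 - t0) / 1e9);
    printf("sync only:     %.2f GB/s\n", len / (t2 - t1) / 1e9);
}
```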

I think my test is OK: 500 MB message size, the flags parameter for writeto set to 0, all memory aligned to a multiple of the page size, and all memory initialized with some value on both ends of the connection.  Huge pages are enabled according to /proc/meminfo (this is on Stampede).  The host is writing to the MIC.

When I test message sizes of 16, 8, and 1 MB, the fraction of time spent in writeto stays roughly the same in each case, so the overhead seems proportional to the message size.  Furthermore, when I issue the same transfer repeatedly using the same buffer, the overhead sticks around.  I would have guessed that any O(N) heavy lifting (such as pinning) would be done by the register function, not the send function, and only once regardless.

I thought this test would show peak transfer rates.  Maybe it does, considering the 7.9 GB/s number.  But so much time is being spent in writeto.  Does anyone know of any pitfalls?  Has anyone seen better transfer rates where the time spent in writeto is included?

The source file is just a simple but messy scif test program; sorry for the mess if you take a look.  The lines in question are 68-78.

Attachment: scif-test.cpp (4.89 KB)

This computer has some nodes with NVIDIA K20s in them, and the best transfer rates I can get out of those are 6.08 GB/s with CUDA and 6.28 GB/s with OpenCL (odd that they're different).  So 6.6 GB/s with the MIC is on par with that card.  Still, I wish I knew why scif_writeto takes so long.
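
As an aside, one thing that commonly caps measured CUDA host-to-device rates is pageable host memory; a page-locked buffer via cudaMallocHost usually measures noticeably faster. I haven't inspected cudaxfer.cpp, so this may be exactly what it already does; a minimal sketch using the CUDA runtime API (the 500 MB size just mirrors the SCIF test):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t len = 500u << 20;            // 500 MB, matching the SCIF test
    void *host, *dev;
    cudaMallocHost(&host, len);               // pinned (page-locked) host buffer
    cudaMalloc(&dev, len);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, len, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", len / (ms * 1e-3) / 1e9);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```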

Attachments: cudaxfer.cpp (1 KB), xfer.cpp (1.88 KB)

Hi grumkin,

I compiled your first program (scif_test.cpp), but when running it I got the error: "scif register failed". Could you provide more information on your setup, compile options, etc., so that I can have a look at the issue?

Regards.

Hey Loc,

I modified the code to print better error messages and do more tests.  The compilation instructions are in a comment in the new attached file.  I'm using icc 13.1.0 (icc is in a directory with the substring 'composer_xe_2013.2.146', if that gives you more specific info).  Your problem may be a limit on the amount of memory you can lock if you are running this on a workstation; that has made the InfiniBand verbs equivalent of scif_register fail for me in the past.  You can run 'ulimit -a' (assuming Linux) and check the 'max locked memory' value.  Something in the tens of kilobytes range is normal, but the program will try to lock 500 MB.  A system administrator can change the limit.
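
If it does turn out to be the locked-memory limit, the test could also check and report it up front; a minimal sketch using getrlimit (the message text is illustrative):

```cpp
#include <sys/resource.h>
#include <cstdio>

// Print a warning if the locked-memory limit is below what the test
// needs. RLIMIT_MEMLOCK is what 'ulimit -a' reports as 'max locked memory'.
void report_memlock_limit(size_t needed)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return;
    }
    if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < needed)
        fprintf(stderr, "max locked memory is %llu bytes but %zu are needed; "
                        "scif_register will likely fail\n",
                (unsigned long long)rl.rlim_cur, needed);
}
```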

The new program tests all four combinations of transfers: there are two directions data can flow and two ways to start the transfer (scif_readfrom and scif_writeto), depending on whether the host or the MIC calls the function.  I've noticed a few more things.  My overall question is just this: is this expected, or can I do anything differently to improve transfer rates?  The attached python script will generate a plot from the output.host and output.mic files generated by the code; you need the matplotlib library installed.  Here are my observations; see the attached plot transfer_rates_MIC.jpg.

While waiting, the host thread sleeps for 1 microsecond per iteration via usleep, and the MIC thread delays for about 0.1 microseconds via _mm_delay_64.  Given the typical rates and message sizes (detailed in the output files), this polling granularity introduces roughly 10% maximum error for the 8 KB transfers, but negligible error for all the other message sizes.
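
For context, the wait loop is shaped roughly like this (a sketch, not the exact attached code; done stands in for whatever completion flag the program polls, and the cycle count assumes a clock of roughly 1 GHz):

```cpp
#include <unistd.h>
#ifdef __MIC__
#include <immintrin.h>
#endif

// Poll a completion flag, yielding briefly between checks. On the MIC,
// _mm_delay_64 stalls the hardware thread for the given number of cycles
// (about 0.1 us here) without a syscall; on the host, usleep sleeps 1 us.
void wait_for(volatile int *done)
{
    while (!*done) {
#ifdef __MIC__
        _mm_delay_64(100);
#else
        usleep(1);
#endif
    }
}
```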

1. The transfer rate peaks at 6.7 GB/s [better than NVIDIA :)], but never reaches 8 GB/s.

2. When transferring from the MIC to the host, there is a sweet spot at 1 MB message sizes; for larger messages the rate drops sharply, which is very odd and scary.

3. When initiating the transfer from the MIC, small messages are MUCH faster than if the same transfer is initiated from the host, regardless of the direction of data flow.  The rate never drops below 1 GB/s when initiated by the MIC, but can go as low as 50 MB/s when initiated by the host.

4. The curves for the MIC-to-host transfer are highly variable, even though the transfers are issued multiple times and the computed rates are averages.  You can't see this by looking at the attached plots; you must run the program multiple times to see this variability.  You will very likely see those two curves come out differently if you make plots.  I added CPU affinity with sched_setaffinity (a minimal sketch follows this list), but that doesn't help.  The trend for the rate to drop sharply for large messages seems to always be there.

5. About half the time I ran the code, a third plot, the host-to-MIC transfer initiated by the MIC, would come out all wonky (see unusual_MIC_readfrom_transfer_rates_MIC.jpg).  Compiling with -O3 seemed to make this go away most of the time.  I figured the wait loop might be throwing off the MIC processor's performance, so I replaced usleep on the MIC with _mm_delay_64.  This seems to have totally fixed the problem, improving (but not removing) the variability from point 4 above, and I can take away -O3 without the issue coming back.  (CPU affinity was set when I made all of these observations.)  By the way, the MIC-to-host transfer initiated by the MIC is very weird in this plot.  After at least 10 runs, I haven't seen anything like that since I switched to _mm_delay_64.

6. The only absolutely consistent plot is the host-to-MIC transfer where the host initiates.  Certain data points on other plots are just as consistent, though; for instance, a readfrom issued by the MIC for 8192 bytes is always 1.2 GB/s.
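
For the record, the affinity pinning mentioned in point 4 was done roughly like this (a sketch; the core number is illustrative):

```cpp
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one core so the timing loop is not
// migrated mid-measurement.
void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  // pid 0 = this thread
        perror("sched_setaffinity");
}
```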

At this point I think I've learned these rules of thumb:
1. If data must flow from the MIC to the host, the message size should not exceed something like 1 MB (see the chunking sketch after this list).

2. If the MIC processor really is so sensitive to whether you use usleep or _mm_delay_64 (I could be wrong about that, but for the sake of argument...), then something is happening in the core such that issuing some IO and then doing mathematical calculations on the same core (overlapping computation and IO) is risky.  A core might need to be dedicated to IO, though that strategy hinges on how efficiently MIC cores can communicate with each other.

3. If you must use small messages, initiate the transfer from the MIC if possible.
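
If rule 1 holds up, it suggests issuing large MIC-to-host transfers as a series of chunks near the sweet spot; a sketch (the chunk size and fence-based completion are assumptions, and whether back-to-back chunked transfers actually avoid the large-message slowdown is something the SCIF experts would need to confirm):

```cpp
#include <scif.h>

// Issue one large transfer as a series of <=1 MB scif_writeto calls,
// staying near the observed 1 MB sweet spot, then fence once at the end.
int chunked_writeto(scif_epd_t epd, off_t loff, off_t roff, size_t len)
{
    const size_t chunk = 1 << 20;  // 1 MB
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? (len - off) : chunk;
        if (scif_writeto(epd, loff + off, n, roff + off, 0) < 0)
            return -1;
    }
    int mark;
    if (scif_fence_mark(epd, SCIF_FENCE_INIT_SELF, &mark) < 0 ||
        scif_fence_wait(epd, mark) < 0)
        return -1;
    return 0;
}
```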

Any thoughts?


Hi grumkin,

I tested cudaxfer on my CUDA 5.0 environment: openSUSE 12.3 RC1 64-bit, Linux kernel 3.7.6, 2x GeForce GTX 560 Ti, icc 13.1.0, with the same compile options as you.

CUDA SDK 5.0, driver 313.18: the output of cudaxfer is 5.686 GB/s.

Your program works correctly.

Best regards,

Franz

Hi Grady,

Your questions have been forwarded to the SCIF experts. They will get back to you soon.

Regards.
