SCIF API and Knight's Landing

Hi all,

I am currently porting an application to the Xeon Phi that does streaming processing of large data files. I started by profiling native runs of the application on the Phi using VTune, and found that performing native file I/O on the Phi was the bottleneck.

As a workaround, I plan to use the SCIF API to stream data to the card. I have benchmarked SCIF RMA transfers on my cluster using the sample code provided in this post: (scif-test.cpp). The results are both very good (~6 GB/s) and very consistent across runs (images attached).

However, the real target system for my software is the socketed Knight's Landing architecture, and I have a couple of concerns about using SCIF on that platform:

(1) Is the interface for the SCIF API stable? Will it remain the same for Knight's Landing?
(2) Is it expected that native file I/O will be greatly improved in the Knight's Landing version (thus rendering my work with SCIF pointless)?

- Ben

P.S. In case it may be useful to others, I have attached my own "Hello, World!" program based on the SCIF benchmarking code mentioned above. The main difference between the two programs is the way the code is run on the Phi. Two instances of the benchmarking program must be run simultaneously, one on the Phi and one on the host CPU, whereas the "Hello, World!" program only needs to be run on the host CPU. (The "Hello, World!" version spawns a task on the Phi using the asynchronous version of the #offload pragma.)



KNL isn’t released so I really can’t say much about it. But note that a socketed KNL is a different usage model than KNC and has to be viewed in that light.


Hi Ben,

I assume that when you say you are bottlenecked by file I/O, you are using NFS to read/write files on the host. It would be surprising if you found the local RAM disk slow! In this case, if you want a future-proof framework for sharing file data, you can consider one of the following:

1) File I/O using VirtIO, which is quite fast for reading and writing small files (if you stay within the disk cache of the OS). It is also fast for cold reading of large files, but not for writing large files - see the same paper. VirtIO is easy to set up, and your application does not have to do anything special, just write files to a specific mount point.

2) File I/O using Lustre, which can work fast enough that you are bottlenecked by the HDD speed. The downside of Lustre is that the initial setup may be difficult.

3) Reading data on the host and sending it to the coprocessor using MPI. This is fast as long as you use an InfiniBand-backed fabric. In the case of a single compute node, you can just install the InfiniBand software stack called OFED, and it will speed up MPI between host and coprocessor even if you don't have InfiniBand hardware (see ). MPI will be supported for KNL, or I will eat my hat.

4) Pragma-based offload of data read from files, with the code organized in such a way that you can flip a switch and run the application on the host instead of the coprocessor. This can be done with the "if" clause in pragma offload: "#pragma offload target(mic) if(switch==1)". Chances are that #pragma offload will still be supported for KNL, because it will have a PCIe version.
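For illustration, option 4 might look roughly like the sketch below. The function name `sum_maybe_offloaded`, the summation kernel, and the `use_mic` flag are hypothetical. The pragma follows the form quoted above, but the `in`/`inout` clauses and the `__INTEL_OFFLOAD` guard macro are assumptions about the Intel compiler that have not been checked on a coprocessor here. With any other compiler the pragma is skipped and the block simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: sums n floats.
// Assumption: with the Intel compiler and offload enabled, the block below
// runs on the coprocessor when use_mic is true and on the host otherwise.
// With other compilers the pragma is compiled out and the host runs it.
double sum_maybe_offloaded(const float* data, std::size_t n, bool use_mic) {
    assert(data != nullptr || n == 0);
    (void)use_mic;  // unused when the pragma is compiled out
    double sum = 0.0;
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) if(use_mic) in(data : length(n)) inout(sum)
#endif
    {
        for (std::size_t i = 0; i < n; ++i) {
            sum += data[i];
        }
    }
    return sum;
}
```

Calling `sum_maybe_offloaded(v.data(), v.size(), false)` keeps everything on the host, which is the "flip a switch" behaviour described in option 4.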



Thanks for your response. I will take that as a warning not to invest too much time in SCIF or other methods of optimizing data transfers to/from the Phi card, given that the PCIe version is not my real target system.


Thanks for some very helpful information! The Colfax white paper you linked describes some configuration options that I was not aware of (e.g. TMPFS vs. RAMFS), and also has some highly relevant benchmarking results. I was not aware that there was a Lustre client available for the Phi, so that is yet another option.

To answer your question, my application was reading the files using the C++ standard library calls, and (if I understand correctly) those file calls ultimately map to NFS when running on the Phi. The host was itself reading the files over NFS from a remote drive, which means two hops over NFS in total (a pretty suboptimal configuration, I admit). Oddly, I did a test where I preloaded the files onto the Phi RAM disk using scp and, although the performance improved greatly, I still found that the file I/O was the bottleneck.

Thanks also for the suggestions about using the #offload pragma or MPI to transfer data to the Phi. I was aware of those options and probably should have mentioned them in my post. I think that the #offload pragma is probably the best strategy for getting good transfer speeds with minimal development effort.

Additional comments:

Despite being my best option, it seems that the "#offload" pragma doesn't lend itself very easily to streaming data processing applications. Because it requires data to be transferred to the Phi in fixed-size chunks, the application must implement a for loop around an "#offload" code block in order to accommodate inputs that are larger than the available RAM on the Phi card (8GB). This in turn implies that either (i) the state of the Phi program must be maintained between #offload invocations or (ii) the program state on the Phi must be transferred back to the host between iterations (via output variables in the #offload pragma). I don't know whether or not (i) is possible.
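A minimal sketch of variant (ii), assuming the per-chunk work can be reduced to a small state that travels with each offload. The names `StreamState` and `process_stream` and the additive checksum are hypothetical stand-ins for the real processing. The `in`/`inout` clauses and the `__INTEL_OFFLOAD` guard are assumptions about the Intel compiler and have not been checked on a card, so on other compilers the loop body simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // only needed for in-memory usage such as the note below
#include <vector>

// Hypothetical running state carried from one chunk to the next.
struct StreamState {
    unsigned long long bytes = 0;     // total bytes processed so far
    unsigned long long checksum = 0;  // simple additive checksum
};

// Reads the stream in chunks of at most chunk_bytes and processes each chunk
// in its own (optionally offloaded) block. The state is copied in and back
// out on every iteration, which corresponds to variant (ii).
StreamState process_stream(std::istream& in, std::size_t chunk_bytes) {
    assert(chunk_bytes > 0);
    std::vector<char> buf(chunk_bytes);
    StreamState st;
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;  // nothing left to read

        // Plain locals, so the pragma clauses only name simple variables.
        char* p = buf.data();
        unsigned long long bytes = st.bytes;
        unsigned long long checksum = st.checksum;
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) in(p : length(got)) in(got) inout(bytes, checksum)
#endif
        {
            for (std::size_t i = 0; i < got; ++i) {
                checksum += static_cast<unsigned char>(p[i]);
            }
            bytes += got;
        }
        st.bytes = bytes;
        st.checksum = checksum;
    }
    return st;
}
```

For example, feeding it a `std::istringstream` with a chunk size of 4 walks a 10-byte input in three iterations. The cost of this variant is that the state crosses the bus twice per chunk, so it only makes sense when the state is small relative to the chunk.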

I am aware that there is an asynchronous version of the "#offload" pragma, but I can't see any way to use that in order to run code on the Phi and transfer data to the Phi at the same time. The asynchronous #offload pragma seems to be more geared towards running code on the Phi and the host at the same time (or transferring data to the Phi and running code on the host at the same time).
