I/O performance and best way to run Native executable

luckynew

Hi,

In the past, I copied my input files to the MIC's local virtual disk and then ran my application (hybrid MPI/OpenMP) natively on the Xeon Phi.

This approach has some drawbacks:

- It consumes physical memory. In my case the input deck is 2 GB, which leaves 2 GB less physical memory for running my application.

- The input files have to be copied before the application runs. scp between the host and the Xeon Phi is very slow on my system, around 5 MB/s, whereas I get 70 MB/s on my local network. This is far below what PCI Express can deliver. I have no explanation for it and don't know whether it is specific to my system configuration. I would appreciate feedback on this first point.

So I am now using another solution: a directory on the host, exported over NFS and mounted on the Xeon Phi card.

In this case, it is no longer necessary to copy the input deck to the Xeon Phi. I can run my application directly on the Xeon Phi and read the input deck from the exported directory.

However, the read and write performance of my application from the Xeon Phi is really poor. The I/O is done in C code via fread and fwrite. I have been able to reproduce this behaviour with some simple C programs that mimic what my application does.

For example, writing 2 GB from the MIC to this NFS location takes around 200 s, and reading it takes around 100 s.

While this runs, the Xeon Phi is either idle or spending a little system time. The elapsed times vary somewhat from run to run, so these figures are averages.

By contrast, with the local in-memory "file system" it takes only 10 s to write and 5 s to read on the Xeon Phi (and 1.7 s to write, 0.5 s to read on the Xeon).

If, instead of the Xeon Phi, I mount this directory from another Xeon machine on my network, it takes only 40 s to write and 2 s to read.
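Converted to throughput, and assuming the 2 GB figure is the 2,048,000,000 bytes (2048 MB) that the test programs further down write in 10 passes of 256 × 100000 doubles, the timings above work out roughly as follows:

```latex
\begin{align*}
\text{Phi, NFS:}         &\quad \frac{2048\ \text{MB}}{200\ \text{s}} \approx 10\ \text{MB/s write}, &\quad \frac{2048\ \text{MB}}{100\ \text{s}} &\approx 20\ \text{MB/s read} \\
\text{Phi, local:}       &\quad \frac{2048\ \text{MB}}{10\ \text{s}} \approx 205\ \text{MB/s write}, &\quad \frac{2048\ \text{MB}}{5\ \text{s}} &\approx 410\ \text{MB/s read} \\
\text{Xeon, local:}      &\quad \frac{2048\ \text{MB}}{1.7\ \text{s}} \approx 1205\ \text{MB/s write}, &\quad \frac{2048\ \text{MB}}{0.5\ \text{s}} &\approx 4096\ \text{MB/s read} \\
\text{other Xeon, NFS:}  &\quad \frac{2048\ \text{MB}}{40\ \text{s}} \approx 51\ \text{MB/s write}, &\quad \frac{2048\ \text{MB}}{2\ \text{s}} &\approx 1024\ \text{MB/s read}
\end{align*}
```

The read rates of 1 GB/s and above are probably served at least partly from a cache rather than from disk or network, so the write figures are likely the fairer comparison.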

Maybe this is related to the way I mount this directory or to the way I use C to perform the reads and writes, but I don't understand these numbers or why I/O on NFS is so slow.

Extract from my fstab on the Xeon Phi:

172.31.1.254:/home/micshare     /home/micshare  nfs     rsize=8192,wsize=8192,nolock,intr 0 0

My test programs for writing and reading:

cat cw.c

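#include <stdio.h>
#include <stdlib.h>

#define BUFLEN 256
#define DIM 100000

int main(void)
{
    FILE *curfile;
    int i, iter, k, len;
    double a[BUFLEN*DIM];   /* about 205 MB on the stack */

    curfile = fopen("file_test", "w");
    if (curfile == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    printf("begin\n");
    len = BUFLEN*DIM;
    iter = 10;
    for (k = 0; k < len; k++) {
        a[k] = k*2;
    }
    /* Write the array iter times, BUFLEN doubles (2 KB) per fwrite call. */
    for (i = 0; i < iter; i++) {
        printf("iter %d : write %zu bytes with fwrite\n", i+1, len*sizeof(double));
        for (k = 0; k < len; k += BUFLEN) {
            if (fwrite(&a[k], sizeof(double), BUFLEN, curfile) != BUFLEN) {
                perror("fwrite");
                return EXIT_FAILURE;
            }
        }
    }
    fclose(curfile);   /* flushes whatever stdio still holds in its buffer */
    return 0;
}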

cat cr.c

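#include <stdio.h>
#include <stdlib.h>

#define BUFLEN 256
#define DIM 100000

int main(void)
{
    FILE *curfile;
    int i, iter, k, len;
    double a[BUFLEN*DIM];   /* about 205 MB on the stack */

    curfile = fopen("file_test", "r");
    if (curfile == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    len = BUFLEN*DIM;
    iter = 10;
    /* Read the file in iter passes, BUFLEN doubles (2 KB) per fread call. */
    for (i = 0; i < iter; i++) {
        printf("iter %d : read %zu bytes with fread\n", i+1, len*sizeof(double));
        for (k = 0; k < len; k += BUFLEN) {
            if (fread(&a[k], sizeof(double), BUFLEN, curfile) != BUFLEN) {
                fprintf(stderr, "short read\n");
                return EXIT_FAILURE;
            }
        }
    }
    fclose(curfile);
    printf("finished %f\n", a[0]);
    return 0;
}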

To compile them:

icc -O2 -o cr cr.c
icc -O2 -mmic -o cr.mic cr.c

icc -O2 -o cw cw.c
icc -O2 -mmic -o cw.mic cw.c

To run them on the Xeon Phi, first raise the stack limit, because each program keeps a large array on the stack:

ulimit -s 400000

I would appreciate any advice on how to track down and solve this I/O issue.

Thanks for your help,

Eric.

Wei W.

I don't know why your I/O is so slow. If you cannot solve it, there is an alternative: do not have the MIC access the remote files over NFS, and use the SCIF API instead. I guess you want the MIC to read files, do some computation, and write the results back to files.

You can read the files on the host, then use scif_writeto to send the data to the MIC. After the MIC has finished its computation, use scif_readfrom to bring the data back to the host and save it to files there. With a single connection open between the MIC and the host, the bandwidth of scif_writeto and scif_readfrom is 6 GB/s. If you open multiple connections, you can get much better bandwidth. I think that is enough for your application.

Tim Prince

Are you certain that uploading the files to virtual disk is taking so much RAM?

I haven't seen a study on the performance trade-offs between scp and mounting the files.  Both are disagreeably slow, and some people are willing to accept benchmarking which doesn't count scp time but does count time spent reading mounted files.

luckynew

Hi,

Thanks for your answers.

@Wei, I don't want to use the host. I want to run purely native code, not offload code. Furthermore, my code is an industrial application, so I can't rewrite the whole I/O module, especially if you think these I/O performance numbers are not what I should expect.

@Tim, I fully agree that for benchmarking purposes it is better to do the scp beforehand and to compare speedups without counting the transfer times. On the other hand, when the files are big and the application's memory use is close to the limit, this becomes a problem. In my application, the files I need to transfer are "restart" files. They roughly correspond to the memory image of my application (at least to its permanent data), so their sizes depend on the benchmark.

I still don't understand why NFS on my system is so much slower between the host and the Phi than between the host and any other machine on my local network. So maybe there is something wrong in my setup. If someone else could report what they get with my small example programs, it would really help me understand whether I am hitting a specific problem or not.

Thanks,

Eric.

Andrey Vladimirov

I remember hearing in one of Intel's talks that the standard issue TCP/IP stack in MPSS is the cause of the slow speed of NFS and SSH. If I remember correctly, they said that PCIe is a reliable fabric, but TCP is trying to do its thing and maintain the reliability of communication on top of PCIe. This slows things down. I suspect that Intel is working on a new TCP/IP stack that will address the issue. I will try to find a link to this talk.

So, in the meantime, if file output is critical for the application, you can probably do this trick: create an additional MPI process on the host and MPI_Send your data from the MPI processes on the coprocessor to that new process on the host. Then the host process can do file output directly to the disk. With the standard MPI_Send you should be getting 6 GB/s, and you don't have to use offload.

Wei W.

Quote:

Andrey Vladimirov wrote:

I remember hearing in one of Intel's talks that the standard issue TCP/IP stack in MPSS is the cause of the slow speed of NFS and SSH. If I remember correctly, they said that PCIe is a reliable fabric, but TCP is trying to do its thing and maintain the reliability of communication on top of PCIe. This slows things down. I suspect that Intel is working on a new TCP/IP stack that will address the issue. I will try to find a link to this talk.

So, in the meantime, if file output is critical for the application, you can probably do this trick: create an additional MPI process on the host and MPI_Send your data from the MPI processes on the coprocessor to that new process on the host. Then the host process can do file output directly to the disk. With the standard MPI_Send you should be getting 6 GB/s, and you don't have to use offload.

He doesn't want to use the host for file output. I think your suggestion is still a kind of "offload". He needs native.

luckynew

Wei is correct.

My main concern remains whether the performance I observed with my implementation is expected, whether I made a mistake somewhere, or whether NFS can be tuned to perform better over PCI Express between the host and the Phi.

Eric.

Gerben Roest

I can confirm that my scp speed between host and mic is also 5 MB/s.

Gerben Roest

Using "scp -c arcfour" improves the speed a bit, to > 6 MB/s. Arcfour is a simpler cipher, so this shows that it's not ONLY the network.
