Connection problem during HPL over Phi run

francesca.tartaglione:

Dear all,

I've compiled and configured HPL to run on a system with a Xeon Phi coprocessor, but I have a problem with the Linpack run.

I've copied HPL.dat and xhpl to /tmp on mic0, set I_MPI_MIC=enable, and run /sbin/sysctl -w net.ipv4.ip_forward=1 on the host.
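
In other words, something along these lines on the host (assuming the default mic0 hostname and that ssh access to the card is already set up):

 HOST# scp HPL.dat xhpl mic0:/tmp/
 HOST# export I_MPI_MIC=enable
 HOST# /sbin/sysctl -w net.ipv4.ip_forward=1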

Then I've run the following command:

 HOST# mpirun -hosts mic0 -n 114 -wdir /tmp ./xhpl 

I opened top on the MIC card and the run actually started, but after a minute I got the following message:

Connection to mic0 closed by remote host.
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)

After that HPL stopped and all the terminals I had open on mic0 were closed.

Do you have any suggestions?

Thanks as always for your help!

loc-nguyen (Intel):

Hi Francesca,

My suggestion is to try with a small number of ranks (n = 1) to see if the same problem occurs. Thanks.
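
For example, the same command as before with just a single rank:

> mpirun -hosts mic0 -n 1 -wdir /tmp ./xhpl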

francesca.tartaglione:

Hello,

I've run some tests and observed the following behavior.

Usually I calculate the HPL Ns using 80% of the memory, so since the Phi cards have 6 GB, I used 4.8 GB, which gives me Ns = 24495. With this value I get the issue I described in the previous post.
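
For reference, that figure is just the usual sizing arithmetic, assuming the N x N matrix of 8-byte doubles dominates the memory use:

 HOST# echo "sqrt(0.8 * 6 * 10^9 / 8)" | bc -l    # ≈ 24495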

I then tried Ns = 14495 (just as a test) and HPL worked fine without any issue.

During these tests I always used n = 228, so I suppose it's not a matter of the number of ranks but could be a memory-related problem.

What do you think about this? 

Is there any memory limit on the Phi cards?

For the record I'm using hpl-2.1 and CentOS 6.3.

Thank you very much!

francesca.tartaglione:

Hello,

A little update: I've tried with n = 16 and I don't have the issue if I calculate Ns using 50% of the memory (if I do the same test with n = 228 I get the connection problem).

Then I tried n = 16 with Ns calculated from 80% of the memory, and again the connection problem occurred.

Am I missing some configuration step or HPL setting?

Thanks as always!

loc-nguyen (Intel):

Could you point me to where you got your HPL code? I need to test it.

Thanks.

francesca.tartaglione:

Hello, 

I've downloaded the source code from here: http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz 

I've attached the Make file I used to compile it.

I've also found, looking at the Phi monitor, that far more memory is allocated than the amount I use to calculate N. So it seems I got that error and the program was killed because it allocated too much memory (> 6 GB).
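
For reference, the card's memory usage can also be watched live from the host with something like this (assuming ssh to mic0 works and the card's busybox image provides free):

 HOST# watch -n 2 "ssh mic0 free -m"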

Do you know why it behaves like this? Are there any particular memory settings or variables to export?

I've tried to summarize what I saw in a table, which is also attached.

This behavior also happens if I use the Intel mp_linpack from the MKL folder.

Thank you very much for your support!

Attachments:
make.intelmic.txt (9.6 KB)
table.png (8.08 KB)

loc-nguyen (Intel):

Hi,

I tried but failed to build HPL-2.1 using your Makefile. Below are the steps I followed:

1. Download hpl-2.1.tar.gz

2. Untar the file:

> gunzip hpl-2.1.tar.gz; tar -xvf hpl-2.1.tar

That created a directory called "hpl-2.1"

3. I copied that directory under /opt/. Now we have /opt/hpl-2.1

4. Rename your file "make.intelmic.txt" to "Makefile" and place it under /opt/hpl-2.1:

> mv make.intelmic.txt /opt/hpl-2.1/Makefile

5. Build it:

> make

make: *** No targets. Stop

I am not sure why your Makefile doesn't work for me.

francesca.tartaglione:

Hello,

Everything is OK up to step 4. In that step you have to rename make.intelmic.txt to Make.Intel and place it under /opt/hpl-2.1:

mv make.intelmic.txt /opt/hpl-2.1/Make.Intel

Then you can build it (from /opt/hpl-2.1/):

make arch=Intel
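
Putting the whole sequence together (same paths as in your steps; with a stock hpl-2.1 tree the resulting binary should land in bin/Intel/):

tar -xzf hpl-2.1.tar.gz -C /opt
mv make.intelmic.txt /opt/hpl-2.1/Make.Intel
cd /opt/hpl-2.1
make arch=Intel
ls bin/Intel/    # xhpl and HPL.dat should be here after a successful build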

Let me know if it works for you now.

Thank you in advance

loc-nguyen (Intel):

Hi,

Thank you for the instructions. I was able to build and run the application now. After I transferred xhpl and HPL.dat to mic0, I ran it successfully with n=114 and n=256:

> mpirun -host mic0 -n 256 -wdir /tmp ./xhpl

For your information, the system I use is RHEL 6.2 with MPSS 4982-15 installed. What MPSS version do you have? Did you rebuild your MPSS for CentOS 6.3? If so, with which compiler version?

Also, make sure to transfer mpiexec, pmi_proxy, libmpi.so.4 and libmpigf.so.4 to the coprocessor mic0 before running MPI.
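
For example, something along these lines (assuming the stock /opt/intel/impi/4.1.0/mic/ layout on the host; the destination directories on the card are just one common choice):

> scp /opt/intel/impi/4.1.0/mic/bin/mpiexec mic0:/bin/
> scp /opt/intel/impi/4.1.0/mic/bin/pmi_proxy mic0:/bin/
> scp /opt/intel/impi/4.1.0/mic/lib/libmpi.so.4 mic0:/lib64/
> scp /opt/intel/impi/4.1.0/mic/lib/libmpigf.so.4 mic0:/lib64/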

Thank you.

francesca.tartaglione:

Hello, 

Thank you for your tests. How much memory did you use? Which value did you assign to N?

My system runs CentOS 6.3 with MPSS 5889-16. I downloaded the package KNC_gold_update_2-2.1.5889-16-rhel-6.3.tar from the Intel website (it matches the kernel version I have: 2.6.32-279.el6.x86_64), so I did not rebuild MPSS for my distribution.

I'm using icc 13.1.0 20130121 and the mpiicc under /opt/intel/impi/4.1.0/bin64/. Is that the correct mpiicc to use, or should I use the one under /opt/intel/impi/4.1.0/mic/bin/?

I did transfer all the files you mentioned; without them the run can't start at all.

Thanks for your support!

loc-nguyen (Intel):

Hi Francesca,

Using the micsmc tool, I saw that about 25% of the memory is used when the program runs. I just used whatever default value of N was in HPL.dat when I typed "mpirun -host mic0 -n 256 -wdir /tmp ./xhpl".

Your MPSS version is more recent than mine, and your Intel Composer version is the same. Could you verify the MPI version by typing "ls -l /opt/intel/impi", please? Mine is 4.1.0.030.

Thanks.

francesca.tartaglione:

Hello,

I have version 4.1.0.030 of Intel MPI.

In my case, using micsmc, I saw that the memory is over-allocated; that's why HPL crashed.

Have you set any variable like OMP_NUM_THREADS during your run?

Thanks for your support!

loc-nguyen (Intel):

I didn't set any environment variables at all. The only difference is that you ran an MPSS built for RHEL on your CentOS system; maybe that is the cause? Thank you.

PONRAM:

These solutions were useful to me as well.

ashish s.:

Hi, 

I am facing an issue running Linpack on an Intel Phi 5110P. I am using Intel mpss-3.1.2 with N=24000. I used mpirun -n 200 -host mic0 -wdir /tmp ./xhpl, but it gives the following error:

HPL ERROR from process # 0, on line 246 of function HPL_pdtest:
>>> [0,0] Memory allocation failed for A, x and b. Skip. <<<
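
For scale, the matrix A alone at N=24000 already takes a few GB of the card's memory (8-byte doubles), before any per-rank overhead from the 200 MPI processes:

echo "24000^2 * 8 / 1024^3" | bc -l    # ≈ 4.3 GiB just for the N x N matrix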

Thanks in advance.