Problem compiling SuperLU_Dist 3.3 with Intel 14.0 (worked with Intel 2013)

Hi,

I am trying to compile SuperLU_Dist version 3.3 with the OpenMPI 1.6.5 wrapper around the Intel compiler (icc version 14.0.1, gcc version 4.8.0 compatibility), and it fails with a very strange error:

/software6/mpi/openmpi/1.6.5_intel/bin/mpicc -I/software6/mpi/openmpi/1.6.5_intel/include -I/software6/mpi/openmpi/1.6.5_intel/include -O3 -xHost -mkl -fPIC -m64 -fPIC -O3  -DAdd_ -DUSE_VENDOR_BLAS -c pdgstrf.c
make[1]: Leaving directory `/software6/src/petsc-3.4.3/externalpackages/SuperLU_DIST_3.3/SRC'
error #13002: unexpected CFE message argument:  e. The staggered cosine transform may be
warning #13003: message verification failed for: 556; reverting to internal message
pdgstrf.c(2672): warning #556: a value of type "int" cannot be assigned to an entity of type "MPI_Request"
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                      ^     

pdgstrf.c(2672): warning #152: Fatal error: Trigonometric Transform has failed to release the memory.
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                        ^     

compilation aborted for pdgstrf.c (code 1)
make[1]: *** [pdgstrf.o] Error 1

 

I understand that the code of SuperLU_Dist is non-standard (it assigns an int to a type MPI_Request), but why is the compiler crashing with this weird message :

error #13002: unexpected CFE message argument:  e. The staggered cosine transform may be
warning #13003: message verification failed for: 556; reverting to internal message

pdgstrf.c(2672): warning #152: Fatal error: Trigonometric Transform has failed to release the memory.

 

This seems to be a compiler bug, since it worked with Intel icc version 13.0.0 (gcc version 4.1.2 compatibility)

Thanks,

Maxime Boissonneault

Hi Maxime,

I installed OpenMPI 1.6.5 and compiled pdgstrf.c successfully with your command-line options, without any errors, so I could not reproduce the issue. If you can attach the preprocessed file (generated with the -P option) to this issue, I can try again to reproduce it.

Thanks,  
Kittur 

Hi Kittur,

Attached is the pre-processed file.

Also, just to make sure we are compiling the same file, here is the SuperLU_Dist source code that I am trying to compile

http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_3.3.tar.gz

Thanks,

Maxime

Attachment: pdgstrf.i (application/octet-stream, 271.08 KB)

Thanks Maxime for the attachment, I'll take a look at it. BTW, just to make sure, can you also provide the system info (os, gcc version etc) too?

Thanks,

Kittur

Hi Kittur,

This is with CentOS6. GCC was built from source (replacing the default OS compiler) and is version 4.8.1.

Thanks,

Maxime

Hi Maxime,

Well, I tried using your .i file as well as the new SuperLU_Dist tar file on RHEL 5.x as well as 6.2 and couldn't reproduce the issue (see below).

%icc -V

Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008

~/maxime$ ~/intel/openmpi/bin/mpicc -I~/projects/intel/openmpi/include -O3  -xHost -mkl -fPIC -m64 -fPIC -O3  -DAdd_ -DUSE_VENDOR_BLAS -c pdgstrf.i

pdgstrf.c(2672): warning #556: a value of type "int" cannot be assigned to an entity of type "MPI_Request"
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                      ^

pdgstrf.c(2672): warning #152: conversion of nonzero integer to pointer

            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */

%ls *.o

pdgstrf.o

=============================

BTW, I don't have a CentOS 6 system, but RHEL 6 is compatible with it (we don't officially support CentOS). The only difference is that the gcc version on that system is 4.4. In the meantime, I'll see if I can install gcc 4.8 on such a system and try again. Other than that, I don't know what else we can do, since I don't have a CentOS system and it is not officially supported (FYI, we always try to reproduce with compatible EL systems). You could also try compiling the file without OpenMPI and see whether the issue persists. If it doesn't, I wonder whether it's a bug in OpenMPI? Just a thought.

Regards,

Kittur

Hi Kittur,

I can reproduce it using icc directly:

/software6/compilers/intel/composer_xe_2013_sp1/bin/icc -O3 -xHost -no-prec-div -Mipa=fast,safe -xHost -fPIC -DDEBUGlevel=0 -DPRNTlevel=1 -DPROFlevel=0 -DAdd_ -fPIC -DUSE_VENDOR_BLAS -c pdgstrf.c -I/software6/mpi/openmpi/1.6.5_intel/include -pthread
icc: command line warning #10006: ignoring unknown option '-Mipa=fast,safe'
error #13002: unexpected CFE message argument:  e. The staggered cosine transform may be
warning #13003: message verification failed for: 556; reverting to internal message
pdgstrf.c(2672): warning #556: a value of type "int" cannot be assigned to an entity of type "MPI_Request"
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                      ^

pdgstrf.c(2672): warning #152: Fatal error: Trigonometric Transform has failed to release the memory.
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                        ^

compilation aborted for pdgstrf.c (code 1)

 

I got the command line from adding --showme to the mpicc command.

Our OpenMPI was compiled with the same version of Intel, using the configure options :

./configure --prefix=$PREFIX \
     --with-threads --with-openib --enable-shared \
     --enable-static --with-ft=cr --enable-ft-thread \
     --with-io-romio-flags="--with-file-system=testfs+ufs+nfs+lustre" --with-tm

and then make && make install.

Hi Maxime,

I tried it on a EL6 system with gcc 4.8.1 also and couldn't reproduce:

$/cts/tools/bin/gcc --version
gcc (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$~/intel/openmpi/bin/mpicc/mpicc -I/home/cmplr/usr4/kganesh1/projects/intel/openmpi/include -O3 -xHost -mkl -fPIC -m64 -fPIC -O3  -DAdd_ -DUSE_VENDOR_BLAS -c pdgstrf.i

pdgstrf.i(11347): warning #556: a value of type "int" cannot be assigned to an entity of type "MPI_Request"
            U_diag_blk_send_req[krow] = 1;
                                      ^

pdgstrf.i(11347): warning #152: conversion of nonzero integer to pointer
            U_diag_blk_send_req[krow] = 1;
                                        ^

------------------------------

So, basically can't reproduce the issue :-(

Regards,

Kittur

Interesting. RHEL systems are fully compatible with CentOS, so it's surprising that I am not able to reproduce this.
Also, I did build and install OpenMPI using icc. Let me see what else might be going on; I'll ping some of my peers to see if they recognize any further clues.
Regards,
Kittur

Maxime, the only other thing I see is that I left out a few options when building OpenMPI, so I'll try with those and see if that makes any difference. Thanks.

Regards, Kittur

Is there anything else I could try on my side ? To get more verbose output, debugging, etc.

Maybe some particularities of our system : we do not have the OS-provided gcc/libstdc++ dating from gcc 4.4. We have instead built GCC 4.8 and its dependencies using GCC 4.4, then uninstalled GCC 4.4, leaving only the glibc (not c++) since this one is required for many more system packages.

The reason we are doing this is because we do not want our users to rely on old versions (at least not without knowing it).

Maxime

Maxime, well I reinstalled openmpi and tried again - same scenario, couldn't reproduce the issue.

> We have instead built GCC 4.8 and its dependencies using GCC 4.4, then uninstalled GCC 4.4, leaving only the glibc (not c++) 
What you describe could be a factor, but I am not sure; I will need to check with our front-end expert developer.
I'll update you as soon as I get more info. I appreciate your patience until then.

Regards,
Kittur

I can also give you an access to the system I am compiling on if that may help.

Thanks,

Maxime

Hi Maxime,

Well, from the compiler side per se there appears to be no issue, but I've passed this on to the MKL team to find out if there's any issue with MKL. I'll get back to you as soon as I have an update. Much appreciated.

Regards,
Kittur

Hi Maxime,

Our front-end team let me know that this diagnostic comes from the diagnostic infrastructure, which grabs messages from the catalog and verifies their contents. Messages like this usually mean that the compiler is picking up the wrong message catalogs. The catalogs are located via the NLSPATH environment variable.

So, please check whether this variable is set and, if so, that it points to a known location that matches the compiler being invoked.

If it is an NLSPATH problem, you can either set it to the proper value or unset it completely and the internal compiler diagnostics will be used instead of the catalog.
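A minimal sketch of such a check, assuming a POSIX shell. The helper name check_nlspath is hypothetical (it is not an Intel tool), and it only tests the static part of each entry, that is, everything before the first % substitution field such as %l_%t/%N. It cannot tell whether a catalog found there actually matches the compiler version.

```shell
#!/bin/sh
# Hypothetical helper: list every NLSPATH entry and report whether the
# static prefix of the entry (the part before the first '%' field)
# exists on disk. The body runs in a subshell, so the IFS and globbing
# changes do not leak into the calling shell.
check_nlspath() (
    if [ -z "${NLSPATH:-}" ]; then
        echo "NLSPATH is not set"
        exit 0
    fi
    IFS=:
    set -f    # disable globbing while splitting on ':'
    for entry in $NLSPATH; do
        [ -n "$entry" ] || continue
        prefix=${entry%%\%*}    # strip from the first '%' to the end
        if [ -e "$prefix" ]; then
            echo "ok      $entry"
        else
            echo "missing $entry"
        fi
    done
)
```

Calling check_nlspath after sourcing the compiler environment prints one line per entry; an entry marked "missing", or one that points into a different compiler installation, would be the first thing to look at.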

Could you please try the above and let me know if it resolves the issue? Thank you for your patience and for your quick responses.

Regards,
Kittur

Hi Kittur,

The NLSPATH environment variable is not set on our system. What should it be set to ?

Maxime

Hi Maxime,

That's strange, since the compiler is somehow picking up the wrong message catalog. It could be that you have not sourced the icc environment file "compilervars.sh".

Can you do the following:

1) Go to the bin directory where icc is installed and run:

% source compilervars.sh intel64 (on a 64-bit system)

NLSPATH should now point to the message catalogs. Then try compiling again and let me know.

Regards.

Hi Kittur,

I did do this. It does not change anything though :

[mboisson@colosse3 SRC]$ . /software6/compilers/intel/composer_xe_2013_sp1/bin/compilervars.sh intel64

[mboisson@colosse3 SRC]$ env | grep NLS
NLSPATH=/software6/compilers/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/locale/%l_%t/%N:/software6/compilers/intel/composer_xe_2013_sp1.1.106/ipp/lib/intel64/locale/%l_%t/%N:/software6/compilers/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/locale/%l_%t/%N:/software6/compilers/intel/composer_xe_2013_sp1.1.106/debugger/gdb/intel64_mic/py26/share/locale/%l_%t/%N:/software6/compilers/intel/composer_xe_2013_sp1.1.106/debugger/gdb/intel64/py26/share/locale/%l_%t/%N:/software6/compilers/intel/composer_xe_2013_sp1.1.106/debugger/intel64/locale/%l_%t/%N:/software6/compilers/intel/composer_xe_2013_sp1/mkl/lib/intel64/locale/en_US/mkl_msg.cat
[mboisson@colosse3 SRC]$ /software6/compilers/intel/composer_xe_2013_sp1/bin/icc -O3 -xHost -no-prec-div -Mipa=fast,safe -xHost -fPIC -DDEBUGlevel=0 -DPRNTlevel=1 -DPROFlevel=0 -DAdd_ -fPIC -DUSE_VENDOR_BLAS -c pdgstrf.c -I/software6/mpi/openmpi/1.6.5_intel/include -pthread
icc: command line warning #10006: ignoring unknown option '-Mipa=fast,safe'
error #13002: unexpected CFE message argument:  e. The staggered cosine transform may be
warning #13003: message verification failed for: 556; reverting to internal message
pdgstrf.c(2672): warning #556: a value of type "int" cannot be assigned to an entity of type "MPI_Request"
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                      ^

pdgstrf.c(2672): warning #152: Fatal error: Trigonometric Transform has failed to release the memory.
            U_diag_blk_send_req[krow] = 1; /* flag outstanding Isend */
                                        ^

compilation aborted for pdgstrf.c (code 1)
 

Maxime, that's strange. Since we don't have a reproducer, it's hard to see what's going on. The only thing I can think of for now is:

 -> Try the -no-diag-message-catalog option, which should force the compiler not to use the catalog

Can you check whether that makes the bogus diagnostic go away? I'll discuss this further with our FE team to see if they have any other suggestions.

Regards.

Using this flag actually enables successful compilation! So, the problem has to do with the diagnostic message catalog?
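For builds where the compile line is assembled from variables, the flag can be added without duplicating it. The following is only a sketch: add_flag is a hypothetical helper, and whether a given build system (make.inc, pip, and so on) honours CFLAGS from the environment is an assumption to verify for each package.

```shell
#!/bin/sh
# Hypothetical helper: print the flag string $1 with flag $2 appended,
# unless $2 is already one of the space-separated flags in $1.
add_flag() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;            # already present
        *)        printf '%s\n' "${1:+$1 }$2" ;;   # append, no leading space
    esac
}

# Example use (assumes the build reads CFLAGS from the environment):
#   CFLAGS=$(add_flag "${CFLAGS:-}" "-no-diag-message-catalog")
#   export CFLAGS
```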

Hi Maxime,

Maxime, according to our developers, this shows that there is some issue with your environment. It looks like the binary could be corrupt (or there may be a bad copy of the catalog on your system); otherwise, it is an environment issue on your end.

What you can do:

    1) Try uninstalling and re-installing the compiler.
    2) Ensure the environment is configured correctly for icc (source "compilervars.sh <arg>", where <arg> is "intel64" for 64-bit and "ia32" for 32-bit).
    3) Try to do the same on another system (install icc there and try).

If you still have the issue after doing the above, then it has to do with your environment, since we can't reproduce it on our systems with the .i file. Let me know how it goes after you try the above. Thanks.

Regards,
Kittur

Hi Maxime,

Did you try steps 1-3 that I outlined in my previous message? Any update?

Thanks, Kittur 

Hi Kittur,

I am sorry, I did not. Since disabling the diagnostic message catalog fixed the issue, I moved on to a more urgent call. I am going on vacation on Friday for two weeks. I will try to get back to this once I'm back.

Thanks again,

Maxime

Have a nice vacation, Maxime! No hurry, I was just wondering whether the issue got resolved after a reinstall, as the catalog files are probably corrupt (the compiler can generate a wrong message if that's the case) and a reinstall should fix that.

Cheers,
Kittur

Hi Kittur,

I am back from vacation. I reinstalled ICS today, and I get the same error. What I did is

1- I moved the previous installation folder to ".bak"

2- I removed the ~/intel folder from my home

3- I reinstalled l_ics_2013.1.039_intel64. I did this without root privileges. I only changed the installation prefix and added PGI support for MKL. I used the default options for everything else, and provided our license server information. Note that there were unavailable optional prerequisites: the compilers said "unsupported OS" (CentOS, which is not Red Hat per se, but should still work), and the installer did not find gtk or java. I don't think either should be a problem?

4- I installed the following updates: l_ccompxe_2013_sp1.1.106, l_fcompxe_2013_sp1.1.106, l_mkl_11.1.1.106. I disabled ia32 support for the two compilers, and added PGI support for MKL. Once again, I got a warning that java was not found, which I think is fine.

5- I then tried compiling as before, and got the exact same problem with the diagnostic messages.

Maxime

Welcome back Maxime, hope you had a good vacation. Well, since the error occurs even after you re-installed the package, it probably means that the binary is fine and the problem is possibly due to the environment. BTW, I assume you checked that the NLSPATH variable is set to the message catalog (locale) path.

Unfortunately, since we're not able to reproduce this on our systems even with the .i file you attached, it's difficult to know what's wrong with the environment. In the meantime, maybe you can install the same package on a completely different system and see if that works? I'll ping our developer to check if he has any further clues and update you accordingly. I appreciate your patience.

Regards, Kittur

Hi Maxime,

BTW, there's no downside to not using the catalogs unless you need Japanese messages. So you can use the -no-diag-message-catalog option as a workaround to force the compiler not to use the message catalog, and we can close this issue. But if you think we should dig further, we can set up a remote working session (at a convenient time) and try to see what's going on. Let me know what you think.

Thanks,
Kittur

Hi Kittur,

As long as the issue does not resurface every now and then on different codes, I think we can close it. If it resurfaces for some other code, then we may have to dig further, but so far the problem has only occurred for that single code out of about a hundred applications I compiled.

Thanks,

Maxime

Understood, Maxime. It looks like this is an isolated incident on that particular system/code. Sure, I'll close this issue, but if it resurfaces we can dig further through a remote session and such. Again, I appreciate your patience through this.

Cheers, Kittur

Hi Kittur,

I know it has been a while - my apologies for doing thread necromancy - but the problem resurfaced this week when compiling SciPy (pip install scipy) and pysam (pip install pysam) with Python 2.7.5 and the Intel compilers.

I am considering options here, possibly adding -no-diag-message-catalog to the icc.cfg/icpc.cfg files on our cluster. Would this have unwanted impacts ?
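A minimal sketch of how the flag could be added to such a configuration file without duplicating it on repeated runs. The helper name ensure_cfg_option is hypothetical, and the location of icc.cfg/icpc.cfg in the example path is an assumption that depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper: make sure option $2 appears on a line of its own
# in config file $1, appending it only if it is not already there.
ensure_cfg_option() {
    cfg=$1
    opt=$2
    # -x whole-line match, -F fixed string, -e because opt starts with '-'
    if [ -f "$cfg" ] && grep -qxF -e "$opt" "$cfg"; then
        return 0
    fi
    # If the file is non-empty and lacks a trailing newline, add one so
    # the new option does not get glued to the last existing line.
    if [ -s "$cfg" ] && [ -n "$(tail -c 1 "$cfg")" ]; then
        echo >> "$cfg"
    fi
    printf '%s\n' "$opt" >> "$cfg"
}

# Example use (path is an assumption, adjust to the installation):
#   ensure_cfg_option /path/to/intel/bin/icc.cfg  -no-diag-message-catalog
#   ensure_cfg_option /path/to/intel/bin/icpc.cfg -no-diag-message-catalog
```

Keeping the change idempotent makes it safe to run from a cluster deployment script each time the compiler module is rebuilt.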

My other consideration is, is it possible that having

LANG=fr_CA.UTF-8

messes with the message catalog ? Can you try this on your side ?

Best regards,

Maxime

I actually tried unsetting LANG, and I still get the error.

Maxime

Hi Maxime!

Yes, you cannot use LANG=fr_...., as there are no French message catalogs and the lookup will fail if the compiler searches for one. Our developer says there should be no impact at all from using the -no-diag-message-catalog option unless you are using the Japanese message catalogs, which is not the case here.

Also, one other thing you can check: open a brand new terminal (not started from another terminal), source "compilervars.sh intel64" to set up the icc environment, and try again. The reason is that the NLSPATH variable could be very long, and that can cause issues. That said, just use the -no-diag-message-catalog option as a workaround, since we cannot reproduce the issue on our end.

Hope that helps. Let me know if you need any other clarification. Much appreciated.

Kittur
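A quick way to see how long NLSPATH has become, and how many entries it holds, is sketched below. It assumes a POSIX shell; nlspath_stats is a hypothetical helper name, and no particular length threshold is implied, since the thread does not state one.

```shell
#!/bin/sh
# Hypothetical helper: report the length of NLSPATH in characters and
# the number of non-empty ':'-separated entries. Runs in a subshell so
# the IFS and globbing changes stay local.
nlspath_stats() (
    value=${NLSPATH:-}
    count=0
    IFS=:
    set -f    # disable globbing while splitting on ':'
    for entry in $value; do
        [ -n "$entry" ] && count=$((count + 1))
    done
    echo "length=${#value} entries=$count"
)
```

Comparing the output in a fresh login shell with the output in a long-running session can show whether repeated sourcing of the environment script keeps appending entries.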

Hi Maxime,

I'd like to know whether the NLSPATH variable value was too long, and whether you tried reproducing this in a new terminal as well. I appreciate your input, thanks.

Kittur

Hi Kittur,

I tried again by unsetting the LANG environment variable and got the same error. Our NLSPATH once the Intel compiler module is loaded is :

NLSPATH=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64/locale/en_US/mkl_msg.cat

I notice that there is only a path from within MKL. Are we missing an NLSPATH entry for the compiler itself? I do not think so, because it is the same even after running compilervars.sh:

[mboisson@colosse2 ~]$ compilervars.sh intel64
[mboisson@colosse2 ~]$ env | grep NLSPATH
NLSPATH=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64/locale/en_US/mkl_msg.cat

 

Maxime
