Segfault in multithreaded dcsrmv

Hello,

I have a strange problem with the attached code. With OMP_NUM_THREADS=1, there is no segmentation fault. With OMP_NUM_THREADS>1, it segfaults unless I uncomment lines 56 to 60 (i.e. unless I first compute Ax, then A^T x'). Do you see where the problem might be? I'm using icpc version 13.0.0 (gcc version 4.7.0 compatibility) on a Debian machine. Could this be linked to my other (currently unresolved) problem, http://software.intel.com/en-us/forums/topic/336409 ?

Thanks in advance for your help.

Compiler is called as follows: icpc dcsrmv_segfault.cpp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5

ldd returns:

        linux-vdso.so.1 =>  (0x00007fffe37ff000)
        libmkl_intel_lp64.so => XXX/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fc6f1c4a000)
        libmkl_intel_thread.so => XXX/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so (0x00007fc6f0cd5000)
        libmkl_core.so => XXX/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_core.so (0x00007fc6efada000)
        libiomp5.so => XXX/composer_xe_2013.0.079/compiler/lib/intel64/libiomp5.so (0x00007fc6ef7e2000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc6ef54b000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc6ef244000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc6ef02e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc6eeca6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc6eeaa2000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc6ee886000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc6f2398000)

Edit: new .tar.gz

Attachment: dcsrmv-initialized.tar.gz (62.33 KB)

I checked how it behaves on Win7, 64-bit, and in all cases, even with the sequential version of MKL, I saw a lot of NaNs in the output. Please check whether the CSR representation is correct.
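
One way to act on this suggestion is a small structural check of the CSR arrays before they are handed to any library. The sketch below is only an illustration: it assumes 0-based indexing and the common 3-array layout (row pointer with rows + 1 entries), and `check_csr` and its parameter names are hypothetical, not part of MKL or of the attached code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an MKL function). Assumes 0-based, 3-array CSR.
// Returns an empty string if the structure is consistent, otherwise a short
// description of the first inconsistency found.
std::string check_csr(int rows, int cols,
                      const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& values) {
    if (rows < 0 || cols < 0) return "negative dimension";
    if (row_ptr.size() != static_cast<std::size_t>(rows) + 1)
        return "row_ptr must have rows + 1 entries";
    if (row_ptr.front() != 0) return "row_ptr[0] must be 0";
    // Row pointers must never decrease.
    for (int i = 0; i < rows; ++i)
        if (row_ptr[i] > row_ptr[i + 1])
            return "row_ptr decreases at row " + std::to_string(i);
    // The last row pointer is the number of stored entries.
    const std::size_t nnz = static_cast<std::size_t>(row_ptr.back());
    if (col_idx.size() != nnz || values.size() != nnz)
        return "col_idx and values must both have row_ptr[rows] entries";
    // Every column index must address a valid column.
    for (std::size_t k = 0; k < nnz; ++k)
        if (col_idx[k] < 0 || col_idx[k] >= cols)
            return "column index out of range at entry " + std::to_string(k);
    return "";
}
```

For example, `check_csr(2, 3, {0, 2, 3}, {0, 2, 1}, {1.0, 2.0, 3.0})` returns an empty string, while changing a column index to 3 yields an out-of-range message. A check like this only rules out malformed input; it says nothing about what the library does internally.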

Hello,
If you take my program "as is" and simply add a printf at the end of main(), of course you will get NaN, since 'x' and 'xT' are uninitialized arrays (this is just a test program I wrote to upload here; I don't really care about the results stored in 'x' and 'xT' in this piece of code).
I uploaded a new version that initializes both working vectors (and I don't see NaN anymore). To be honest, I don't think my CSR is wrong, as I'm using this kind of matrix in a much larger production code, and with OMP_NUM_THREADS=1 the results are correct.

By the way, I just got a little bit more than "Segmentation fault", here is the stderr:
a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.

Quote: "I uploaded a new version that initializes both working vectors (and I don't see NaN anymore)..."
Where is the new version of your code?

It is updated in my first post (dcsrmv-initialized.tar.gz), sorry if I wasn't clear enough.

Hello,
Is anyone able to reproduce my problem?

The updated example works without problems on my system with MKL 11.0 update 1.
The attached log shows the results I got.

Attachment: res.log (173.42 KB)

Hello,
Are you sure this works even with OMP_NUM_THREADS > 1? As far as I can see, I'm not the only one having this issue, cf. http://software.intel.com/en-us/forums/topic/344909

Here is my backtrace:
OMP_NUM_THREADS=1
[New Thread 10154 (LWP 10154)]
n = 5100 nz = 29748 I.n = 5101 I(I.n) = 29748
5100 x 9915
Program exited normally.

OMP_NUM_THREADS=2
[New Thread 1095 (LWP 1095)]
n = 5100 nz = 29748 I.n = 5101 I(I.n) = 29748
5100 x 9915
[New Thread 1272 (LWP 1272)]
[New Thread 1273 (LWP 1273)]
Program received signal SIGSEGV
mkl_spblas_lp64_mc3_dcsr0tg__c__mvout_par () in XXX/mkl/lib/intel64/libmkl_mc3.so
(idb) bt
#0 0x00002ba77859d90f in mkl_spblas_lp64_mc3_dcsr0tg__c__mvout_par () in XXX/mkl/lib/intel64/libmkl_mc3.so
#1 0x00002ba772978eac in mkl_spblas_lp64_dcsr0tg__c__mvout_omp () in XXX/mkl/lib/intel64/libmkl_intel_thread.so
#2 0x00002ba7729791b1 in mkl_spblas_lp64_dcsr0tg__c__mvout_omp () in XXX/mkl/lib/intel64/libmkl_intel_thread.so
#3 0x00002ba772793cdc in mkl_spblas_lp64_mkl_dcsrmv () in XXX/mkl/lib/intel64/libmkl_intel_thread.so
#4 0x00002ba7736e3d19 in mkl_dcsrmv () in XXX/mkl/lib/intel64/libmkl_rt.so
#5 0x0000000000405f73 in main () at XXX/dcsrmv_segfault.cpp:67
#6 0x000000328141ecdd in __libc_start_main () in /lib64/libc-2.12.so

In both cases, mkl_spblas_lp64_dcsr0tg__c__mvout_omp seems to go wrong (the other thread concerns dcsCmv, but it seems to call some dcsRmv subroutines).

Thanks for your help.

By the way, in another piece of code, on Sandy Bridge-E, I also get segfaults in libmkl_avx.so(mkl_spblas_lp64_avx_dcsr0tg__c__mvout_par+0x281).
I think that there is something wrong in dcsr0tg, whether with SSE or AVX instructions (once again, the problem doesn't show up when OMP_NUM_THREADS=1).
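
To cross-check results independently of the threaded library path, a plain single-threaded reference for both products (Ax and A^T x) can be useful. The following is a minimal sketch, assuming 0-based 3-array CSR; the `Csr` struct and the function names are hypothetical and say nothing about how MKL implements these kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for a 0-based, 3-array CSR matrix.
struct Csr {
    int rows;
    int cols;
    std::vector<int> row_ptr;    // rows + 1 entries
    std::vector<int> col_idx;    // one column index per stored entry
    std::vector<double> values;  // one value per stored entry
};

// y = A * x: each row gathers from x, so row i only ever writes y[i].
std::vector<double> csr_matvec(const Csr& a, const std::vector<double>& x) {
    assert(x.size() == static_cast<std::size_t>(a.cols));
    std::vector<double> y(a.rows, 0.0);
    for (int i = 0; i < a.rows; ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            y[i] += a.values[k] * x[a.col_idx[k]];
    return y;
}

// y = A^T * x: each row scatters into y, so different rows may update
// the same y[j]. The output has a.cols entries, not a.rows.
std::vector<double> csr_matvec_transposed(const Csr& a,
                                          const std::vector<double>& x) {
    assert(x.size() == static_cast<std::size_t>(a.rows));
    std::vector<double> y(a.cols, 0.0);
    for (int i = 0; i < a.rows; ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            y[a.col_idx[k]] += a.values[k] * x[i];
    return y;
}
```

For the 2 x 3 matrix [[1, 0, 2], [0, 3, 0]], `csr_matvec` with x = (1, 1, 1) gives (3, 3) and `csr_matvec_transposed` with x = (1, 2) gives (1, 6, 2). Note that for a non-square matrix such as the 5100 x 9915 one in the output above, the transposed product needs an output vector of length 9915, and that the scatter pattern is what generally makes the transposed case harder to parallelize than the plain one.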

Please, I really don't know what to do here. Could you at least tell me if it is working on your side ? (on Linux, with more than one thread)

Yes, I had accidentally checked only the single-threaded case and didn't see the problem. We do see the problem now when the number of threads is greater than 1. We will investigate and let you know of any update.

Alright, that is great news! I guess fixing this problem will also fix the problem with dcscmv encountered here http://software.intel.com/en-us/forums/topic/344909.
Please let me know as soon as you can about a possible patch/fix. Thanks a lot for your help.

Hello, 

This issue has been fixed in MKL v.11.0 update 2 released yesterday.

You can download this update from the Intel Registration Center and check the problem on your side.

--Gennady
