Issue introduced in MKL 11.0 Update 4 (64-bit Linux only)

After installing MKL 11.0 Update 4 over MKL 11.0 Update 2 on Linux, our QA process crashes with SIGSEGV at:

#0  0x00002aaab745874a in mkl_serv_malloc ()
#1  0x00002aaab7f6bbcc in mkl_blas_mc3_dgemm_get_bufs ()
#2  0x00002aaab6ae8a99 in mkl_blas_mc3_xdgemm_par ()
#3  0x00002aaab4c2cf74 in mkl_blas_xdgemm_par ()
#4  0x00002aaab4b81ecb in mkl_blas_dgemm_2d_bsrc ()
#5  0x00002aaab4b7b489 in gemm_host ()
#6  0x00002aaabb92b4f3 in L_kmp_invoke_pass_parms ()
   from /opt/intel/composer_xe_2013.4.183/compiler/lib/intel64/libiomp5.so

100% reproducible in certain cases.

Reverting to MKL Update 2 solves the issue.

It seems to happen after many iterations, once many computation threads have been created and destroyed.

Note we are running multiple (boost) threads that call MKL. We call MKL_Thread_Free_Buffers at the completion of each thread.
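For readers unfamiliar with this pattern, here is a minimal sketch of what "each thread calls MKL, then frees MKL's per-thread buffers before it exits" might look like. It is not our actual code: it uses std::thread instead of boost threads, the function names (multiply, worker, run_workers) are made up for illustration, and USE_MKL is a hypothetical macro you would define yourself when building against MKL. Without it, a naive loop stands in for dgemm so the sketch compiles anywhere.

```cpp
#include <cstddef>
#include <thread>
#include <vector>
#if defined(USE_MKL)   // hypothetical build switch, not an MKL-defined macro
#include <mkl.h>
#endif

// C = A * B for n x n row-major matrices.
void multiply(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, int n) {
#if defined(USE_MKL)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
#else
    // Stand-in for dgemm when MKL is not available.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
#endif
}

// Body of one short-lived computation thread.
void worker(int n, int iterations, double* result) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<double> a(count, 1.0), b(count, 2.0), c(count, 0.0);
    for (int it = 0; it < iterations; ++it) multiply(a, b, c, n);
#if defined(USE_MKL)
    // Release the buffers MKL allocated for this thread before it exits.
    MKL_Thread_Free_Buffers();
#endif
    *result = c[0];
}

// Creates the threads, waits for them, and sums one element from each result.
double run_workers(int threads, int n, int iterations) {
    std::vector<double> results(threads, 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back(worker, n, iterations, &results[t]);
    for (auto& th : pool) th.join();
    double total = 0.0;
    for (double r : results) total += r;
    return total;
}
```

Calling run_workers repeatedly creates and destroys threads many times over, which is roughly the usage that leads up to the crash described above.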

Andrew, how can we reproduce the issue?

The only way to reproduce is for Intel to have a copy of our software and an evaluation license from us. I will pursue this through premier support.

OK. We will pick this issue up as soon as you submit it there.

OK, I created a ticket. I explained that to reproduce the issue Intel will have to download a 400 MB installer and a license file, but I have had no response to that point.

No doubt this will be a painful process for everyone to reproduce, but I cannot use MKL 11.0 Update 4 until this is resolved.

Premier support issue # 697704

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigse...

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.
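As a rough illustration of the stack remedy, the sketch below reads the main thread's soft stack limit (what `ulimit -s` controls) and starts a thread with an explicitly enlarged stack using POSIX threads. The helper names (main_stack_soft_limit, run_with_stack, touch_big_frame) are made up for this example, and the 16 MiB figure in the usage note is an arbitrary choice, not a recommendation. OpenMP worker threads are sized separately, through the OMP_STACKSIZE environment variable (KMP_STACKSIZE with the Intel runtime).

```cpp
#include <pthread.h>
#include <sys/resource.h>
#include <cassert>
#include <cstddef>

// Soft stack limit of the main thread in bytes, or 0 if it cannot be read.
// May be RLIM_INFINITY when the limit is unlimited.
rlim_t main_stack_soft_limit() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_STACK, &lim) != 0) return 0;
    return lim.rlim_cur;
}

// Runs fn(arg) on a new thread with the requested stack size and waits for it.
// Returns 0 on success, otherwise the pthread error code.
int run_with_stack(std::size_t stack_bytes, void* (*fn)(void*), void* arg) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        pthread_t tid;
        rc = pthread_create(&tid, &attr, fn, arg);
        if (rc == 0) rc = pthread_join(tid, nullptr);
    }
    pthread_attr_destroy(&attr);
    return rc;
}

// Example thread body: touches a 4 MiB stack buffer, one byte per 4 KiB page.
void* touch_big_frame(void* out) {
    assert(out != nullptr);
    volatile char buf[4 * 1024 * 1024];
    for (std::size_t i = 0; i < sizeof buf; i += 4096) buf[i] = 1;
    *static_cast<int*>(out) = buf[0];
    return nullptr;
}
```

For example, `int flag = 0; run_with_stack(16u * 1024u * 1024u, touch_big_frame, &flag);` runs the body on a 16 MiB stack. If a crash disappears with a larger stack, that points at stack exhaustion rather than a library bug.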

>>...Seems to happen after many iterations...

Do you get the SIGSEGV after all threads have released their memory and completed (been destroyed)? Or in the middle, or at the end, of processing?

This is what MSDN says about that very obsolete signal-error processing constant:
...
SIGSEGV
Illegal storage access. The default action terminates the calling program.
...

Not sure what you mean by "obsolete". On Linux, signals such as SIGSEGV are a fundamental part of the OS. A segmentation violation is caused by accessing an illegal address, for example by dereferencing a NULL pointer.

Quote:

TimP (Intel) wrote:

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigse...

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.

The details are that MKL 11 Update 2 passes 300-400 QA tests without failure, MKL Update 4 fails 6+ of those tests with a segmentation violation inside MKL, reproducibly.  I have supplied premier support with a reproducible example. I will update this thread with the results.

Currently I am having to give the Premier support person a tutorial in GDB.

But here's a clue for anyone at Intel who cares about this issue.

Does this look like a race condition in MKL?

Thread 1 is crashing with a segmentation violation in....

#11 0x00002aaab75d40da in mkl_serv_malloc ()
   from /opt/intel/composer_xe_2013.4.183/mkl/lib/intel64/libmkl_core.so
#12 0x00002b93a4980aec in mkl_blas_mc3_dgemm_get_bufs ()

Thread 2 is calling

#0  0x00002aaab75dfe00 in mkl_blas_dgemm_set_blks_size ()
#1  0x00002aaab66135d9 in gemm_host ()

Hi Andrew, we definitely care and the local MKL team is now looking into the issue. We will report back once we have more information. -Shane

I just installed MKL 11 Update 5 and the problem has gone away. It looks like someone found and fixed the issue.

To close the loop on this issue: Intel premier support confirmed there was an issue in Update 4 and that it was fixed in Update 5. Thanks guys!

You are always welcome; we are glad to help :)
