Potential Issue with MKL 11 update 4 with SVD functions on 64-bit Windows

After updating MKL from version 10 to the latest MKL 11 update 4, we noticed performance slowdowns in our application when it is built as a 64-bit application with Visual Studio 2012. Profiling shows that the slowdown seems to stem from the ZBDSQR call in the function below. The slowdown does NOT occur in a 32-bit Release build made with Visual Studio 2012. Is this slowdown an existing defect in the MKL 11.0.4 release? Are there plans to address it? Thanks in advance for your response.

------------------------------------------------------------
Sample Code Snippet where the problem was detected
------------------------------------------------------------

Note: in the test case being profiled, MA = NA = 60 and complex == MKL_Complex16.

void LaSVD(complex *A, complex *U, double *S, complex *VT, int MA, int NA)
{
    char UPLO=(MA>=NA ? 'U':'L'), VECTU='Q', VECTV='P';
    int NCC=0, LDA=MA, NRU=MA, LDU=MA, NCVT=NA, LDVT=NA, LDC=1, LWORK=16*(MA+NA), INFO;
    int NB=(MA>=NA ? NA:MA), SIZEU=MA*NA, SIZEVT=MA*NA, INCX=1, INCY=1, KU=NA, KV=MA;
    double *RWORK, *D, *E;
    complex *WORK, *TAUQ, *TAUP, *C=0;
    int charlen;

    RWORK=new double[4*NA];
    D=new double[MA+NA];
    E=new double[MA+NA];
    WORK=new complex[LWORK];
    TAUQ=new complex[MA+NA];
    TAUP=new complex[MA+NA];

    //zgebrd_(&MA, &NA, A, &LDA, D, E, TAUQ, TAUP, WORK, &LWORK, &INFO);
    //Reduces a general matrix to bidiagonal form.
    GetLAPack64()->ZGEBRD(&MA, &NA, A, &LDA, D, E, TAUQ, TAUP, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZGEBRD");

    //f2c_zcopy(&SIZEU, A, &INCX, U, &INCY);
    GetLAPackBlas()->ZCOPY(&SIZEU, A, &INCX, U, &INCY);
    charlen=1;

    //zungbr_(&VECTU, &MA, &NA, &KU, U, &LDA, TAUQ, WORK, &LWORK, &INFO);
    //Generates the complex unitary matrix Q or P^H determined by ?gebrd.
    GetLAPack64()->ZUNGBR(&VECTU, &MA, &NA, &KU, U, &LDA, TAUQ, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZUNGBR");

    //f2c_zcopy(&SIZEVT, A, &INCX, VT, &INCY);
    GetLAPackBlas()->ZCOPY(&SIZEVT, A, &INCX, VT, &INCY);

    //zungbr_(&VECTV, &NA, &NA, &KV, VT, &LDA, TAUP, WORK, &LWORK, &INFO);
    //Generates the complex unitary matrix Q or P^H determined by ?gebrd.
    GetLAPack64()->ZUNGBR(&VECTV, &NA, &NA, &KV, VT, &LDA, TAUP, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZUNGBR");

    //zbdsqr_(&UPLO, &NB, &NCVT, &NRU, &NCC, D, E, VT, &LDVT, U, &LDU, C, &LDC, RWORK, &INFO);
    //Computes the singular value decomposition of a general matrix that has been reduced to bidiagonal form.
    GetLAPack64()->ZBDSQR(&UPLO, &NB, &NCVT, &NRU, &NCC, D, E, VT, &LDVT, U, &LDU, C, &LDC, RWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZBDSQR");

    charlen=NB;
    //f2c_dcopy(&charlen, D, &INCX, S, &INCY);
    GetLAPackBlas()->DCOPY(&charlen, D, &INCX, S, &INCY);

    delete[] TAUP;
    delete[] TAUQ;
    delete[] WORK;
    delete[] E;
    delete[] D;
    delete[] RWORK;
}

A similar slowdown was detected when using the ZGESVD function, with the same usage as described above.

Hi Murray,

I checked the MKL release notes (C:\Program Files (x86)\Intel\Composer XE 2013\Documentation\en_US\mkl\Release_Notes.htm) and the MKL bug-fix list (http://software.intel.com/en-us/articles/intel-mkl-110-bug-fixes/). There seems to be no directly matching case.

Could you please provide some more details, such as:

1. The exact performance numbers for the slowdown, MKL 10 vs. 11.0 update 4.

2. How do you link MKL: dynamic or static, threaded or sequential?

3. What is the processor?

A workable test case would also be helpful.

Best Regards,

Ying

Murray, did you see how the source code of your test case was reformatted? It is absolutely not readable.

Hi Murray, please attach a cpp file with the source code. Thanks in advance.

Note: Please don't forget to press the Start upload button before submitting your new post.

Quote:

Ying H (Intel) wrote:
1. The exact performance numbers for the slowdown, MKL 10 vs. 11.0 update 4.

      Example for ZGESVD (63% slower; I will update on other functions at a later date):

      MKL 11.0.4  ZGESVD  Total Time 15.47 sec, # Calls 8950, Avg Call Time 1.73 msec, Min Call Time 0.89 msec, Max Call Time 31.73 msec

      MKL 10      ZGESVD  Total Time  9.47 sec, # Calls 8950, Avg Call Time 1.06 msec, Min Call Time 0.77 msec, Max Call Time 12.06 msec
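
For readers who want to collect the same kind of per-call figures (total, average, minimum and maximum time), here is a minimal C++ sketch. It is not the profiler used for the numbers above, which the post does not name. `CallStats`, `summarize`, `time_calls` and `slowdown_percent` are hypothetical names, and the timed callable is only a stand-in for one SVD call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Summary of per-call wall-clock times, all in milliseconds.
// (Hypothetical helper type, not part of MKL.)
struct CallStats {
    double total_ms;
    double avg_ms;
    double min_ms;
    double max_ms;
    std::size_t calls;
};

// Reduce a list of per-call durations to total / average / min / max.
inline CallStats summarize(const std::vector<double>& ms) {
    assert(!ms.empty());  // at least one measurement is required
    CallStats s{0.0, 0.0, ms.front(), ms.front(), ms.size()};
    for (double t : ms) {
        s.total_ms += t;
        s.min_ms = std::min(s.min_ms, t);
        s.max_ms = std::max(s.max_ms, t);
    }
    s.avg_ms = s.total_ms / static_cast<double>(s.calls);
    return s;
}

// Time `calls` invocations of `fn`, where `fn` stands in for one SVD call.
template <typename Fn>
CallStats time_calls(Fn&& fn, std::size_t calls) {
    using clock = std::chrono::steady_clock;
    std::vector<double> ms;
    ms.reserve(calls);
    for (std::size_t i = 0; i < calls; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return summarize(ms);
}

// Relative slowdown of `slow` against `fast`, in percent.
inline double slowdown_percent(double slow, double fast) {
    return (slow / fast - 1.0) * 100.0;
}
```

With the totals quoted above, `slowdown_percent(15.47, 9.47)` comes out at roughly 63, which matches the figure in the post.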

Quote:

Ying H (Intel) wrote:
2. How do you link MKL: dynamic or static, threaded or sequential?

     a) We link dynamically to MKL.

     b) Our final application is threaded at a higher level than the MKL function calls in this instance. Thus these MKL function calls should each be using only one thread. The timing data above was collected with everything single-threaded: both the calling application and MKL had the number of threads set to 1.

Quote:

Ying H (Intel) wrote:
3. what is the processor?

      Intel Core i7-3770K CPU @ 3.5GHz,  16GB RAM, Windows 7 Professional, Service Pack 1 64-bit

Quote:

Ying H (Intel) wrote:
A workable test case would also be helpful.

I would like to create a dynamically linked console test project for you that reads the input data from a binary file. Unfortunately, our application has the MKL libraries in unusual directory locations and is coupled to many other libraries. If you have an example dynamically linked console project set up in Visual Studio 2012 to test an MKL function, this would expedite the process significantly. I will collect the data for the binary file directly from our composite application. Please advise in this regard.

Thanks,

Murray

P.S. I will be on vacation for the next ~7 days. I will try to have someone else at my company respond in my absence.

>>...If you have an example dynamically linked console project set up in Visual Studio 2012 to test an MKL function, this would
>>expedite the process significantly...

Consider a simple, universal console-based test project (don't assume that everybody has VS 2012), because it could be compiled with any version of the Intel or Microsoft C++ compilers. For example, my primary VSs are VS 2005 and 2008, and it takes just a few tens of seconds to compile a small test project from the command line without the overhead of VS projects.

>>...Example for ZGESVD ( 63% slower )...

I could verify it with MKL versions 10 and 11 on at least three platforms if you upload a test case. Also, a 63% performance decrease doesn't look good, and I am concerned that incorrect CPU-dispatching DLLs of MKL were used.

For example, on Platform A with a non-AVX CPU an SSE4 CPU-dispatching DLL was used, while on Platform B with an AVX-capable CPU an SSE CPU-dispatching DLL was used instead of the AVX one.

Hi Murray,  Sergey,

I created a zgesvd sample (A is 60x60) in MSVS 2012 and uploaded it here. Could you please test it and let us know the result?

Best Regards,

Ying

 

Attachments:

Attachment    Size
zgesvd.zip    13.52 KB

Hi Ying,

>>...I create a zgesvd sample (A is 60x60) in MSVS 2012. I upload it here. Could you please test and let us know the result.

Yes and I'll let you know test results for two MKL versions.

Hi Murray,

A new update was released recently. If possible, could you please try it?

There is a known problem regarding SVD which is fixed in that version: http://software.intel.com/en-us/forums/topic/401167.

Best Regards

Ying

Hi Ying, 

Thanks for the update and your support. We did notice the latest release with the mention of the SVD in the notes and we are in the process of testing this to see if this resolves the problem. 

As for an update on our end, we added some temporary code to our custom application (Final.exe) to capture a sequence of data that exhibits the problem. We then created a standalone test class (K_SVD_Test) that loads this file, performs the SVD operations and logs the time. Next we created a standalone console program (SVD_Console.exe) that links to the MKL functions through an abstract interface implemented in a separate DLL (VLAL_MKL.dll). This way of linking to MKL allows the majority of our developers to avoid building any of the MKL/VLAL dependencies; they just connect to it via this custom DLL. Upon running this console program (SVD_Console.exe), we found that we could NOT reproduce the slowdown we see in our custom application. Thus something else going on in our custom application (Final.exe) was affecting the performance. After about a week of work, we were able to isolate another usage of MKL within our custom application (Final.exe) that appeared to trigger the undesired behavior. This second series of calls to MKL was also using the SVD functions (although a bit differently). A second test class, K_SVD_Test2, was created, along with its input data saved to a file. This test class was then used in the console program (SVD_Console.exe), and we were finally able to reproduce the problem.

Essentially, the console program (SVD_Console.exe) runs SVD_Test, then SVD_Test2, and finally SVD_Test again. The timing of SVD_Test before and after running SVD_Test2 shows the previously described slowdown when using MKL 11.0.4.1. MKL 10.3.10.1 does not have this problem.
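
The run order described above (baseline batch, then the suspected trigger, then the same baseline batch again) can be sketched in a few lines of C++. This is only an illustration of the measurement protocol, not the SVD_Console.exe source, which could not be posted. `elapsed_ms`, `BeforeAfter` and `run_before_after` are hypothetical names, and the two callables stand in for the SVD_Test and SVD_Test2 workloads.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Wall-clock time of one invocation of `work`, in milliseconds.
inline double elapsed_ms(const std::function<void()>& work) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    work();
    return std::chrono::duration<double, std::milli>(clock::now() - t0).count();
}

// Timings of the same baseline batch before and after the trigger workload.
// (Hypothetical helper type, not part of MKL.)
struct BeforeAfter {
    double before_ms;  // baseline batch, first run
    double after_ms;   // same batch, rerun after the trigger
    // after/before; returns 0 when the first run was too fast to measure.
    double ratio() const { return before_ms > 0.0 ? after_ms / before_ms : 0.0; }
};

// Run the baseline, then the suspected trigger, then the baseline again.
inline BeforeAfter run_before_after(const std::function<void()>& baseline,
                                    const std::function<void()>& trigger) {
    assert(baseline && trigger);  // both callables must be set
    BeforeAfter r{};
    r.before_ms = elapsed_ms(baseline);
    trigger();  // not timed here; it only has to run in between
    r.after_ms = elapsed_ms(baseline);
    return r;
}
```

A `ratio()` that stays close to 1 means the second baseline run costs about the same as the first; a ratio well above 1 reproduces the kind of before/after slowdown described in this post.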

As I was unable to obtain permission to post our custom linking to MKL ( VLAL_MKL.dll) on a public forum, I am unable to present (SVD_Console.exe and its associated source code) to you for verification on your end. 

Presently, we are attempting to create a second console application (SVD_Console_Standalone.exe) which links directly with the MKL libraries and thus eliminates the custom method of connecting to MKL (VLAL_MKL.dll) mentioned above. We hope to have this new console program working in the next couple of days, and then we can send you an example (including all source code and input data files) that you could run on your machines. This last step would also eliminate any possible errors caused by our custom access to the MKL functions via VLAL_MKL.dll.

In parallel to this effort, we are also attempting to rebuild our custom application using the latest update of MKL (v11.0.5). Unfortunately this effort is not under my direct control, so I cannot guarantee a quick response on this evaluation.

Best regards,

Murray

Hi Murray,

Thanks a lot for the details. You can upload the sample and the DLL via premier.intel.com, which is another official support channel; the code and communication there are IP protected.

Best Regards,

Ying

Hi, Ying

We created this console test engine by removing all the dependencies on our environment. It contains two sets of test matrices (binary) under the \mtx directory.

After compiling MKLconsole.sln, you will need to copy the MKL DLLs to the \x64\release or \x64\debug directory. Mklconsole.exe will run LaSVD, GESVD and GESDD on the newLaSVD_Inputbak dataset, then run GESVD on the RSVD_Inputbak dataset, and then re-run LaSVD, GESVD and GESDD on the newLaSVD_Inputbak dataset. You will see the ~50% performance slowdown with MKL 11.0.4.

Please let me know if you have any questions/comments.

Rgds

Yuan Liu

P.S. The DLLs needed are:

libimalloc.dll

libiomp5md.dll
mkl_avx.dll
mkl_avx2.dll
mkl_cdft_core.dll
mkl_core.dll
mkl_intel_thread.dll
mkl_mc.dll
mkl_mc3.dll
mkl_p4n.dll
mkl_vml_avx.dll
mkl_vml_avx2.dll
mkl_vml_cmpt.dll
mkl_vml_mc.dll
mkl_vml_mc2.dll
mkl_vml_mc3.dll
mkl_vml_p4n.dll

Attachments:

Attachment     Size
tinymkl.zip    32.69 MB

Hi Yuan,

It is a nice test case, and I can run it now. But I am not sure what your exact MKL 10 version is. Could you please add the MKL version information (see the code below) and let me know the result?

Another question: you run the test twice, before_SVR and after_SVR. What is the purpose? Also, as you have high-level threads, the threaded MKL is not needed. How about linking with mkl_sequential_dll.lib directly?

Best Regards,

Ying

// This will be called by the main function when testing SVD.
void Run_SVD_Tests()
{
    MKLVersion Version;
    mkl_get_version(&Version);
    printf("Major version: %d\n",Version.MajorVersion);
    printf("Minor version: %d\n",Version.MinorVersion);
    printf("Update version: %d\n",Version.UpdateVersion);
    printf("Product status: %s\n",Version.ProductStatus);
    printf("Build: %s\n",Version.Build);
    printf("Platform: %s\n",Version.Platform);
    printf("Processor optimization: %s\n",Version.Processor);
    printf("================================================================\n");
    printf("\n");

>>...we are in the process of testing this to see if this resolves the problem...

Here are test results for your review:

[ Test 1 - Intel Pentium 4 ( 1.60 GHz ) ]

Command line to build the test case: icl.exe /O3 /Qmkl /MD lapacke_zgesvd_col.cpp

...
LAPACKE_zgesvd (column-major, high-level) Example Program Results
Major version: 10
Minor version: 3
Product status: Product
Build: 20120831
Processor optimization: Intel(R) Pentium(R) 4 processor
================================================================

SVD takes : 21.252930 seconds
...

[ Test 2 - Intel Core i7-3840QM ( 2.80 GHz ) ]

Command line to build the test case: The same executable compiled for Test 1 was tested

...
LAPACKE_zgesvd (column-major, high-level) Example Program Results
Major version: 11
Minor version: 0
Product status: Product
Build: 20130123
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) Enabled Processor
================================================================

SVD takes : 3.842102 seconds
...

[ Test 3 - Intel Core i7-3840QM ( 2.80 GHz ) ]

Command line to build the test case: icl.exe /O3 /Qmkl /MD lapacke_zgesvd_col.cpp

...
LAPACKE_zgesvd (column-major, high-level) Example Program Results
Major version: 11
Minor version: 0
Product status: Product
Build: 20130123
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) Enabled Processor
================================================================

SVD takes : 3.519592 seconds
...

Note: I don't have MKL version 10 on my Ivy Bridge system.

>>...Note: I don't have MKL version 10 on my Ivy Bridge system.

I just realized that I will be able to test MKL version 10 on the Ivy Bridge system and I'll post results as soon as the test is completed.

Hi Ying,

The previous version of MKL that we were using was v10.3.10.1. We then upgraded to v11.0.4.1 and noticed the degradation. 

We were able to test the latest incremental release, v11.0.5.1, and the problem appears to be resolved in this case. One should note that while the problem noted in http://software.intel.com/en-us/forums/topic/401167 was with the same SVD functions, the problem description was quite different. In our case the slowdown was noticed when using only one thread with MKL, while the linked case involves multiple threads.

Just our luck that, once we were able to produce a repeatable test case for you, the problem had already been resolved.

I would like to spend a bit more time working with the console program to ensure that you are seeing the same problems that we are seeing on our end. Additional suggestions for the console test app would be welcome.

Thanks for your support!

Murray 

Hi, Ying

We intentionally set the thread number to 1 just to remove one uncertainty. In reality, we run SVD on thousands or hundreds of thousands of small matrices, and I guess we do need the multi-threaded version of MKL. Please correct me if I am wrong.

Anyway, as you can see from the code, we run SVD on the same test data twice (before and after); the test data are 8950 60-by-60 square complex matrices. Between the two runs, we run SVD on 15 rectangular matrices of different dimensions (R_SVD).

Here is the result I got. You can see that the performance of lasvd and gesvd slows down ~50% in MKL 11.0.4, whereas that of the gesdd function is fairly consistent.

>>>>>>>>>>>>>>>>>>>>>>+++++++++++++++++++++++++++++++++++++++++++++++

Major version: 11

Minor version: 0

Update version: 4

Product status: Product

Build: 20130517

Platform: Intel(R) 64 architecture

Processor optimization: Intel(R) Core(TM) i7 Processor

================================================================

before R_SVD

lasvd time (ms) =17540   max (ms) =3     min (ms) =1     count =8950

gesvd time (ms) =18176   max (ms) =5     min (ms) =1     count =8950

gesdd time (ms) =14749   max (ms) =3     min (ms) =1     count =8950

R_SVD messing up around  ~!~!~!

gesvd time (ms) =0       max (ms) =0     min (ms) =0     count =15

After R_SVD

lasvd time (ms) =23096   max (ms) =3     min (ms) =2     count =8950

gesvd time (ms) =26358   max (ms) =7     min (ms) =2     count =8950

gesdd time (ms) =15165   max (ms) =4     min (ms) =1     count =8950

+++++++++++++++++++++++++++++++++++<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

FYI, the result for 11.0.5 is much more consistent.

>>>>>>>>>>>>>>>>>>>>>>+++++++++++++++++++++++++++++++++++++++++++++++

Major version: 11

Minor version: 0

Update version: 5

Product status: Product

Build: 20130612

Platform: Intel(R) 64 architecture

Processor optimization: Intel(R) Core(TM) i7 Processor

================================================================

 

before R_SVD

lasvd time (ms) =17682   max (ms) =4     min (ms) =1     count =8950

gesvd time (ms) =17866   max (ms) =3     min (ms) =1     count =8950

gesdd time (ms) =15304   max (ms) =639   min (ms) =1     count =8950

R_SVD messing up around  ~!~!~!

gesvd time (ms) =1       max (ms) =1     min (ms) =0     count =15

After R_SVD

lasvd time (ms) =17434   max (ms) =4     min (ms) =1     count =8950

gesvd time (ms) =18072   max (ms) =4     min (ms) =1     count =8950

gesdd time (ms) =14829   max (ms) =4     min (ms) =1     count =8950

+++++++++++++++++++++++++++++++++++<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Ying, 

Quote from Ying H (Intel) Tue, 07/16/2013 - 20:24

 "Another question: you run the test twice, before_SVR and after_SVR. What is the purpose?"

This was our major obstacle when reproducing the problem in a console application. In our first attempt we only performed the tests before_SVR. When testing in this manner we did not see the slowdown between the MKL versions. Yuan spent quite a bit of time isolating yet another usage of the SVD functions within our final application, which caused the subsequent tests after_SVR to become significantly slower when using MKL v11.0.4. This slowdown between the times before and after SVR did not show up when using MKL v10.3.10. It seems quite strange to us that performing the same series of tests within one application run results in such drastic changes in the time required to perform these tests.

We speculate that running the SVR tests somehow puts the new version of MKL, v11.0.4, into a bad state, which subsequently slows down the second set of tests compared to the first run of these tests.

Hope this helps, 

Murray

Hi Yuan, Murray,

Thank you very much for the clarification. The two issues seem related. The one in DPD20033524 is caused by an error in the code that splits the workload across OpenMP threads. And the one here is, as you speculate, that the first run changes the OpenMP thread state for the second run. I will check with our engineers and get back to you if there is any news.

@Yuan, I am not sure of your exact usage model. If the thousands of matrices are small and the MKL thread number is set to 1, then sequential MKL should be more suitable, because at least it saves the time spent managing OpenMP threads. Or you may need the threaded version to gain performance by setting the MKL thread number to 2, or for other functions.

Best Regards,

Ying

Hi Yuan, Murray,

Here are our engineer's comments regarding the performance drop in the second run.

This seems to correlate with the issue reported on the forum (http://software.intel.com/en-us/forums/topic/373673), which was fixed in Update 5. Even though the diagnostics are different, there was a problem with convergence in one of the internal sub-algorithms for SVD, exposed in the MKL 11.0.x line and fixed in MKL 11.0.5. That problem could lead to extra internal iterations (a performance drop) or to an error report (convergence not reached), depending on the input matrix.

Best Regards,

Ying

hi, Ying

Thanks for the update. We are glad that it is solved in the new update.

Now about another issue: sequential MKL vs. one-thread MKL. I tried to search online but did not find many benchmarks comparing the two. Rather, I found this on your website:

http://software.intel.com/en-us/articles/recommended-settings-for-callin...

where it says "This case (MKL_NUM_THREADS = 1) is equivalent to linking with sequential MKL, that is, disable threading in MKL or linking with the threaded version on MKL but call mkl_set_num_thread( 1 )". So, purely performance-wise, are they equivalent?

On the other hand, people did mention some difference in time, but not how big the difference is or whether it justifies maintaining another library.

http://stackoverflow.com/questions/17563552/how-to-use-simultaneous-of-p...

Do you have an idea of how big the performance improvement would be, or are there some benchmark results available?

Thanks.

Hi Yuan,

Theoretically speaking, yes, MKL_NUM_THREADS=1 has the same effect as sequential MKL. But as discussed on Stack Overflow, once you call parallel MKL, whatever the value of MKL_NUM_THREADS, the OpenMP threading runtime library is needed, and the thread manager needs CPU resources (you can observe the threads in Task Manager: one more thread is created when a new OpenMP thread starts). So it has some effect. But generally the effect is very small; for example, it may be 3.84 s vs. 3.8 s over 1000 loops. So we usually ignore the difference.

On the other hand, in a mixed multi-threaded environment (such as high-level Windows threads or pthreads, with OpenMP threads in the sub-threads), there are many kinds of issues (you can search the forum). So if you have high-level threads and are sure MKL will be used with 1 thread, then the sequential library should be more suitable.

Best Regards,
Ying
