Unexplained result with mkl_cspblas_dcsrsymv

Unexplained result with mkl_cspblas_dcsrsymv

I think that this is a linking error, maybe someone can help.

I'm using VS2008 compiling a 64bit app and callingmkl_cspblas_dcsrsymv in a c++ routine and am getting incorrect results. The relevant libraries I link aremkl_em64t.lib,mkl_intel_lp64.lib,mkl_intel_thread.lib,mkl_solver_lp64.lib,libiomp5md.lib. (I use the Paradiso solver elsewhere and it works perfectly).

If I replacemkl_intel_thread.lib withmkl_sequential.lib all works fine, except of course I don't get any speedup. Whether or not the compiler option for OpenMP support is enabled seems not to matter. I'm compiling with the /MD (multi-threaded dll runtime library) option set.

23 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

It seems you miss mkl_core.lib into your linking line.
Please try the following linking line for the architecture:
mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
--Gennady

Quoting - Gennady Fedorov (Intel)

It seems you miss mkl_core.lib into your linking line.
Please try the following linking line for the architecture:
mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
--Gennady

I've narrowed the problem down further -mkl_cspblas_dcsrsymv seems to fail with repeated calls. The attached MSVC project reproduces the error. The problem does indeed seem to be threading related - with the call mkl_set_num_threads(1) I get the correct results, withmkl_set_num_threads(2), the error appears.

In this example I link:
mkl_intel_lp64.lib
mkl_intel_thread.lib
mkl_core.lib
libiomp5md.lib

Jay,what MKL version you are using? Could you please look into mklsupport.txt file and let us know the PackageID you can find there.
--Gennady

Quoting - Gennady Fedorov (Intel)

Jay,what MKL version you are using? Could you please look into mklsupport.txt file and let us know the PackageID you can find there.
--Gennady

I have:
Package ID: w_mkl_p_10.0.1.015
Package Contents: Intel Math Kernel Library 10.0 Update 1

I have a somewhat of a solution to my problem. Oddly enough, calling the standard interfacemkl_dcsrmv function produces the correct results. It does not seem to use more than 1 cpu however - I never see above 50% cpu usage (on a dual core CPU).

Do you have smth like that?:
Test 0: matrix copy constructor
is equal: 1
Test 1: reference vs mkl
error norm: 1.40368e+069
A equals A0: 1

Quoting - Gennady Fedorov (Intel)

Do you have smth like that?:
Test 0: matrix copy constructor
is equal: 1
Test 1: reference vs mkl
error norm: 1.40368e+069
A equals A0: 1

This is my result:

~~~~~~~~~~~~~~~~~~~~~~
Test 0: matrix copy constructor
is equal: 1
Test 1: reference vs mkl
error norm: 1.23211e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 9.86824e+006
error norm 13: 1.97365e+007
error norm 14: 2.96047e+007
A equals A0: 1
x - x0 norm: 0
~~~~~~~~~~~~~~~~~~~~~~

If on line 57 I change from 2 to 1 I get (the correct result):
~~~~~~~~~~~~~~~~~~~~~~
Test 0: matrix copy constructor
is equal: 1
Test 1: reference vs mkl
error norm: 1.23211e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
~~~~~~~~~~~~~~~~~~~~~~

What compiler are you building this with? If you have VS2008 on a 64 bit Windows system you should just be able to build the project I setup without modification.

i checked the problem with the next versions ( 10.1 and 10.2 beta ). The problem dissapered with these versions.
see the output i've got with version 10.1:
Test 0: matrix copy constructor
is equal: 1
Test 1: reference vs mkl
error norm: 1.39258e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
Press any key to continue . . .

Thanks for the info - guess its time to upgrade.

Quoting - Gennady Fedorov (Intel)

i checked the problem with the next versions ( 10.1 and 10.2 beta ). The problem dissapered with these versions.
see the output i've got with version 10.1:
Test 0: matrix copy constructor
is equal: 1
Test 1: reference vs mkl
error norm: 1.39258e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
Press any key to continue . . .

Now I am running MKL version10.0.5.025 for Windows, and the SPMV problem persists. I should note that this error does not occur in any of the Linux MKL versions I've tried.

A few notes:
* the behavior seems not to change whether or not I have OpenMP support enabled (/openmp)
* with N=100000 and NZ = 41*N, it seems not to use more than 1 CPU, even when mkl_set_num_cpu(2) is chosen.
* with N=200000 and NZ = 41*N, I get the correct results and it seems that both CPUs are utilized.
* withN=100000 and NZ = 41*N, mkl_set_dynamic(0) seems to have no effect (still looks like it will only use 1 cpu)

Sparse matrix-vector multiply is the only function that I really need from MKL. Its the only thing that I haven't found in any other BLAS libraries, and its the bottleneck in my simulation code, so I really need to get this to work.

Here is the compiler (MSVC 2008 SP1) command line:
/Ox /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /Gy /fp:fast /Fo"x64Release" /Fd"x64Releasevc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt

Here is the link line:
/OUT:"C:developmenttestmklx64Releasetestmkl.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"x64Releasetestmkl.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"c:developmenttestmklx64Releasetestmkl.pdb" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /DYNAMICBASE /NXCOMPAT /MACHINE:X64 /ERRORREPORT:PROMPT mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib

I've uploaded my project (including the exe if you really trust me!). I also uploaded the output of Dependency Walker so you can see what its dynamically linked against.

This is the output that I get. The only change between the first and second call is that the first time is called with mkl_set_num_threads(1), and the second is called withmkl_set_num_threads(2).

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 1 thread
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
error norm: 9.07218e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 2 threads
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
error norm: 9.07218e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 9.87039e+006
error norm 13: 1.97408e+007
error norm 14: 1.00678e+009
A equals A0: 1
x - x0 norm: 0

even more troubling, when I do nothing but double the amount of rows I get the correct result:

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 1 thread
Creating square matrix with 200000 rows
Fill factor = 0.012042
Constucting A..........done!
Test 1: reference vs mkl
error norm: 2.39593e-007
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 2 threads
Creating square matrix with 200000 rows
Fill factor = 0.012042
Constucting A..........done!
Test 1: reference vs mkl
error norm: 2.39593e-007
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0

you wrote >>

....
I've uploaded my project (including the exe if you really trust me!). I also uploaded the output of Dependency Walker so you can see what its dynamically linked against.
....
Do you mean attachment you sent at April.13?
--Gennady

Ahh - I see, I missed a step attaching the files.

The problem is quite maddening, I tried everything I could think of, including statically linking everything.

One strange quirk I did notice - if I use the following link linemkl_intel_lp64.lib mkl_intel_thread_dll.lib mkl_core.lib libiomp5mt.lib, then even though I statically linked the new OpenMP library, libguide40.dll gets dynamically linked. If I switch the above mkl_intel_thread_dll.lib to mkl_intel_thread.lib then no DLLs (aside from the standard Windows ones) get linked. In both cases however my test code yields the same (incorrect) result, where given the same input data mkl_cspblas_dcsrsymv gives invalid results for subsequent calls.

Quoting - Gennady Fedorov (Intel)

you wrote >>

....
I've uploaded my project (including the exe if you really trust me!). I also uploaded the output of Dependency Walker so you can see what its dynamically linked against.
....
Do you mean attachment you sent at April.13?
--Gennady

Attachments: 

AttachmentSize
Downloadimage/png dependancies.png143.67 KB
Downloadapplication/zip mkl_spmv.zip398.77 KB

Quoting - jay.oswald

Quote: ".... even when mkl_set_num_cpu(2) is chosen"
Do you mean mkl_set_num_threads(int)?
--Gennady

Quoting - Gennady Fedorov (Intel)

Quote: ".... even when mkl_set_num_cpu(2) is chosen"
Do you mean mkl_set_num_threads(int)?
--Gennady

Yeah, I meant using mkl_set_num_threads(2) with smaller matrix sizes leads to only one CPU being utilized and the wrong result being obtained from the multiplication.

It seems that the case is that when the routine is called with more than one cpu, themkl_cspblas_dcsrsymv function decides whether or not the sparse matrix is large enough to use threads. If threads are used, it seems to work correctly. If only one thread is used, then for any calls after the first one,mkl_cspblas_dcsrsymv returns an incorrect result.

This problem is repeatable on all of the Windows machines I've tried - 32 bit Windows Vista, and two 64 bit Server 2008 machines. The code was compiled with VS 2008 SP1. The only difference is that on the 32 bit machine it seems the threshold matrix size for running on more than one thread is smaller. On any of the Linux machines I have tried, the code runs correctly for all matrix sizes.

This seems to be a very common usage, e.g. anyone trying to run a conjugate gradient solver is going to multiply the same sparse matrix by a vector over many iterations. Is MKL use in Windows uncommon?

And once again, this is time to upgrade

I checked this new test with MKL 10.1 and MKL 10.2 beta
On win32 and WinXP64 with different threads running on ( 1 8 )
All work well. See below the output I ve got

Win32. MKL 10.1 upadate2 and MKL 10.2 beta

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 1 thread
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
error norm: 9.07362e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
Press any key to continue . . .

Win64. MKL 10.1 upadate2 and MKL 10.2 beta

mkl_solver_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

mkl_set_num_threads(1, 2, 4, or 8 )

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 1 thread
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
error norm: 9.07362e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
Press any key to continue . . .

Quoting - jay.oswald

Ah thats it. When I downloaded the newer version the 10.0 branch was selected by default. So far 10.1 seems to work perfectly. Thanks for verifying, I can't tell you how many times I double-checked that code.

-Jay

Quoting - Gennady Fedorov (Intel)

And once again, this is time to upgrade

I checked this new test with MKL 10.1 and MKL 10.2 beta
On win32 and WinXP64 with different threads running on ( 1 8 )
All work well. See below the output I ve got

Win32. MKL 10.1 upadate2 and MKL 10.2 beta

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 1 thread
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
error norm: 9.07362e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
Press any key to continue . . .

Win64. MKL 10.1 upadate2 and MKL 10.2 beta

mkl_solver_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

mkl_set_num_threads(1, 2, 4, or 8 )

****************************************
MKL sparse matrix-vector multiply test
****************************************
Running with 1 thread
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
error norm: 9.07362e-008
A equals A0: 1
Test 2: mkl multiple calls
error norm 12: 0
error norm 13: 0
error norm 14: 0
A equals A0: 1
x - x0 norm: 0
Press any key to continue . . .

Quoting - jay.oswald

Hi Jay,
It's not clear for me what version you are talking about?
Into our registration center I've found that you got MKL 10.1 and I am little confused with it.
We had the similar problem ( MKL 10.0 ) with mkl_cspblas_dcsrsymv routine but it was fixed in 10.1.
Have you had opportunity to check this problem with 10.1?
--Gennady

Quoting - Gennady Fedorov (Intel)

Hi Jay,
It's not clear for me what version you are talking about?
Into our registration center I've found that you got MKL 10.1 and I am little confused with it.
We had the similar problem ( MKL 10.0 ) with mkl_cspblas_dcsrsymv routine but it was fixed in 10.1.
Have you had opportunity to check this problem with 10.1?
--Gennady

I downloaded version 10.1.2.026 yesterday. When I thought I had updated before it was just from10.0.1.015 to 10.0.5.025.

Quoting - jay.oswald

I downloaded version 10.1.2.026 yesterday. When I thought I had updated before it was just from10.0.1.015 to 10.0.5.025.

Jai,
Ive checked the scalability results with 10.1 u2 on win64 ( see below the results )
There is no scalability at all with the matrix you provided.
We will investigate the problem and inform you if any news.
Thanks for the issue.
****************************************
MKL sparse matrix-vector multiply test

#1
****************************************
Creating square matrix with 200000 rows
Fill factor = 0.012042
Constucting A..........done!
Test 1: reference vs mkl
mkl_cspblas_dcsrsymv running is = 0.078933 SIZE == 200000 #THREAD == 1
mkl_cspblas_dcsrsymv running is = 0.079010 SIZE == 200000 #THREAD == 2
mkl_cspblas_dcsrsymv running is = 0.082317 SIZE == 200000 #THREAD == 4
mkl_cspblas_dcsrsymv running is = 0.087804 SIZE == 200000 #THREAD == 8
Press any key to continue . . .

#2
Creating square matrix with 100000 rows
Fill factor = 0.00574441
Constucting A..........done!
Test 1: reference vs mkl
mkl_cspblas_dcsrsymv running is = 0.039769 SIZE == 100000 #THREAD == 1
mkl_cspblas_dcsrsymv running is = 0.040521 SIZE == 100000 #THREAD == 2
mkl_cspblas_dcsrsymv running is = 0.039576 SIZE == 100000 #THREAD == 4
mkl_cspblas_dcsrsymv running is = 0.031763 SIZE == 100000 #THREAD == 8
Press any key to continue . . .

--Gennady

I re-ran my tests (focusing on speedup now that I am getting consistent results).

I'm actually not too surprised that the scaling is poor. The sparsity pattern is completely random, and the matrix is symmetric. This should be a nightmare scenario for thread blocking and memory access patterns.

Here are the speedup factors for a matrix of size N=100000 and B nonzeros per row.

B | speedup (one to two cores)
------------------
2 | 0.88
4 | 0.84
8 | 1.0
16 | 1.2
32 | 1.21
64 | 1.36
128 | 1.25
256 | 1.25
512 | 1.35
1024 | 1.38

and here is what I get when I force the matrix to a band structure. In this case I think the scaling could be better. Still there is a lot of memory transfer happening for not a whole lot of computation. Sparse matrix size goes from 2.3 MB to 1.14 GB which very quickly should overwhelm my 4MB of CPU cache. I'll run the same code on a Linux platform (dual socket quad core and 12MB cache per socket) and report my results.

B | speedup (one to two cores)
------------------
2 | 0.74
4 | 0.75
8 | 0.85
16 | 1.1
32 | 1.23
64 | 1.27
128 | 1.36
256 | 1.5
512 | 1.67
1024 | 1.78

** Edit I've attached my benchmark code.

Attachments: 

AttachmentSize
Downloadapplication/zip mkl_spmv_speed.zip8.88 KB

Here are the results in Linux w/ 8 cores. These results are kind of interesting, the random sparcity pattern actually gets slightly better scaling. Also quad core Xeons running Linux seem to scale much better than my Core 2 Duo running Windows (granted these are E5405 w/ 12 MB cache vs an old E6600 w/ 4 MB cache).

Regardless, I'm fairly happy with the results now, as I run most of my simulations on the Xeon CPUs. I may get an i7 system to tinker around with soon too. If anyone is interested to see the scaling performance on it, I'd be happy to try it.

BANDED matrix
********************************************************
********************* SPEED TEST *********************
********************************************************
# rows = 100000, max threads = 8
B f2 f3 f4 f5 f6 f7 f8 nonzeros
2 0.86 0.87 0.83 0.75 0.68 0.63 0.58 199999
4 1 1.1 1.1 0.97 0.9 0.78 0.76 399994
8 1.3 1.5 1.5 1.4 1.4 1.2 1.2 799972
16 1.4 1.7 2.1 1.8 1.7 1.5 1.5 1599880
32 1.5 1.8 2 1.8 1.8 1.7 1.7 3199504
64 1.8 2 2.2 2 2.1 2 2.1 6397984
128 1.7 2.1 2.5 2.3 2.4 2.4 2.5 12791872
256 1.9 2.3 2.8 2.6 2.7 2.7 2.8 25567360
512 1.9 2.4 2.9 2.7 2.8 2.9 2.9 51069184

RANDOM sparcity
********************************************************
********************* SPEED TEST *********************
********************************************************
# rows = 100000, max threads = 8
B f2 f3 f4 f5 f6 f7 f8 nonzeros
2 0.76 0.78 0.75 0.69 0.65 0.6 0.54 199999
4 1 1.1 1.1 0.98 0.93 0.84 0.74 399994
8 1.4 1.6 1.7 1.5 1.5 1.4 1.3 799972
16 1.7 2.1 2.4 2.2 2.2 2.1 2.1 1599880
32 1.6 2.1 2.5 2.4 2.4 2.5 2.5 3199504
64 1.8 2.4 2.8 2.7 3 3 3.2 6397984
128 1.7 2.2 2.8 2.5 2.8 3 3.2 12791872
256 1.8 2.5 2.9 2.9 3.1 3.3 3.5 25567360
512 1.8 2.4 3.2 2.7 3.4 3.2 3.8 51069184
1024 1.9 2.3 3.1 2.9 3.1 3.3 3.5 101876224

*edit I should explain these tables - B is the number of non-zeros per row (its symmetric so 2B-1 is the actual row size), fx is the speedup factor going from 1 to x cores, i.e. f8 is the time it takes to run with 1 core divided by the time it takes with 8 cores, nonzeros is the total number of nonzeros in the matrix (again this is only counting the upper triangular part, so the actual number is 2*nonzeros - rows).

Hi Jay,
Thanks for the results and for the benchmarks.
We've already encountered with different scaling results linux/windows for this routine and still investigating the problem. We will be back if any news.

--GIF

Leave a Comment

Please sign in to add a comment. Not a member? Join today