I think that this is a linking error, maybe someone can help.

I'm using VS2008 compiling a 64bit app and callingmkl_cspblas_dcsrsymv in a c++ routine and am getting incorrect results. The relevant libraries I link aremkl_em64t.lib,mkl_intel_lp64.lib,mkl_intel_thread.lib,mkl_solver_lp64.lib,libiomp5md.lib. (I use the Paradiso solver elsewhere and it works perfectly).

If I replacemkl_intel_thread.lib withmkl_sequential.lib all works fine, except of course I don't get any speedup. Whether or not the compiler option for OpenMP support is enabled seems not to matter. I'm compiling with the /MD (multi-threaded dll runtime library) option set.

# Unexplained result with mkl_cspblas_dcsrsymv

## Unexplained result with mkl_cspblas_dcsrsymv

For more complete information about compiler optimizations, see our Optimization Notice.

Please try the following linking line for the architecture:

mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

--Gennady

Quoting - Gennady Fedorov (Intel)

*
*

Please try the following linking line for the architecture:

mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

--Gennady

I've narrowed the problem down further -mkl_cspblas_dcsrsymv seems to fail with repeated calls. The attached MSVC project reproduces the error. The problem does indeed seem to be threading related - with the call mkl_set_num_threads(1) I get the correct results, withmkl_set_num_threads(2), the error appears.

In this example I link:

mkl_intel_lp64.lib

mkl_intel_thread.lib

mkl_core.lib

libiomp5md.lib

--Gennady

Quoting - Gennady Fedorov (Intel)

*
*

--Gennady

I have:

Package ID: w_mkl_p_10.0.1.015

Package Contents: Intel Math Kernel Library 10.0 Update 1

I have a somewhat of a solution to my problem. Oddly enough, calling the standard interfacemkl_dcsrmv function produces the correct results. It does not seem to use more than 1 cpu however - I never see above 50% cpu usage (on a dual core CPU).

Test 0: matrix copy constructor

is equal: 1

Test 1: reference vs mkl**error norm: 1.40368e+069**A equals A0: 1

Quoting - Gennady Fedorov (Intel)

Do you have smth like that?:

Test 0: matrix copy constructor

is equal: 1

Test 1: reference vs mkl**error norm: 1.40368e+069**A equals A0: 1

This is my result:

~~~~~~~~~~~~~~~~~~~~~~

Test 0: matrix copy constructor

is equal: 1

Test 1: reference vs mkl

error norm: 1.23211e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 9.86824e+006

error norm 13: 1.97365e+007

error norm 14: 2.96047e+007

A equals A0: 1

x - x0 norm: 0

~~~~~~~~~~~~~~~~~~~~~~

If on line 57 I change from 2 to 1 I get (the correct result):

~~~~~~~~~~~~~~~~~~~~~~

Test 0: matrix copy constructor

is equal: 1

Test 1: reference vs mkl

error norm: 1.23211e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

~~~~~~~~~~~~~~~~~~~~~~

What compiler are you building this with? If you have VS2008 on a 64 bit Windows system you should just be able to build the project I setup without modification.

see the output i've got with version 10.1:

Test 0: matrix copy constructor

is equal: 1

Test 1: reference vs mkl

error norm: 1.39258e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

Press any key to continue . . .

Quoting - Gennady Fedorov (Intel)

i checked the problem with the next versions ( 10.1 and 10.2 beta ). The problem dissapered with these versions.

see the output i've got with version 10.1:

Test 0: matrix copy constructor

is equal: 1

Test 1: reference vs mkl

error norm: 1.39258e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

Press any key to continue . . .

Now I am running MKL version10.0.5.025 for Windows, and the SPMV problem persists. I should note that this error does not occur in any of the Linux MKL versions I've tried.

A few notes:

* the behavior seems not to change whether or not I have OpenMP support enabled (/openmp)

* with N=100000 and NZ = 41*N, it seems not to use more than 1 CPU, even when mkl_set_num_cpu(2) is chosen.

* with N=200000 and NZ = 41*N, I get the correct results and it seems that both CPUs are utilized.

* withN=100000 and NZ = 41*N, mkl_set_dynamic(0) seems to have no effect (still looks like it will only use 1 cpu)

Sparse matrix-vector multiply is the only function that I really need from MKL. Its the only thing that I haven't found in any other BLAS libraries, and its the bottleneck in my simulation code, so I really need to get this to work.

Here is the compiler (MSVC 2008 SP1) command line:

/Ox /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /Gy /fp:fast /Fo"x64Release" /Fd"x64Releasevc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt

Here is the link line:

/OUT:"C:developmenttestmklx64Releasetestmkl.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"x64Releasetestmkl.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"c:developmenttestmklx64Releasetestmkl.pdb" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /DYNAMICBASE /NXCOMPAT /MACHINE:X64 /ERRORREPORT:PROMPT mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib

I've uploaded my project (including the exe if you really trust me!). I also uploaded the output of Dependency Walker so you can see what its dynamically linked against.

This is the output that I get. The only change between the first and second call is that the first time is called with mkl_set_num_threads(1), and the second is called withmkl_set_num_threads(2).

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 1 thread

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

error norm: 9.07218e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 2 threads

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

error norm: 9.07218e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 9.87039e+006

error norm 13: 1.97408e+007

error norm 14: 1.00678e+009

A equals A0: 1

x - x0 norm: 0

even more troubling, when I do nothing but double the amount of rows I get the correct result:

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 1 thread

Creating square matrix with 200000 rows

Fill factor = 0.012042

Constucting A..........done!

Test 1: reference vs mkl

error norm: 2.39593e-007

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 2 threads

Creating square matrix with 200000 rows

Fill factor = 0.012042

Constucting A..........done!

Test 1: reference vs mkl

error norm: 2.39593e-007

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

you wrote >>

....

I've uploaded my project (including the exe if you really trust me!). I also uploaded the output of Dependency Walker so you can see what its dynamically linked against.

....

Do you mean attachment you sent at April.13?

--Gennady

Ahh - I see, I missed a step attaching the files.

The problem is quite maddening, I tried everything I could think of, including statically linking everything.

One strange quirk I did notice - if I use the following link linemkl_intel_lp64.lib mkl_intel_thread_dll.lib mkl_core.lib libiomp5mt.lib, then even though I statically linked the new OpenMP library, libguide40.dll gets dynamically linked. If I switch the above mkl_intel_thread_dll.lib to mkl_intel_thread.lib then no DLLs (aside from the standard Windows ones) get linked. In both cases however my test code yields the same (incorrect) result, where given the same input data mkl_cspblas_dcsrsymv gives invalid results for subsequent calls.

Quoting - Gennady Fedorov (Intel)

you wrote >>

....

I've uploaded my project (including the exe if you really trust me!). I also uploaded the output of Dependency Walker so you can see what its dynamically linked against.

....

Do you mean attachment you sent at April.13?

--Gennady

## Attachments:

Attachment | Size |
---|---|

Download dependancies.png | 143.67 KB |

Download mkl_spmv.zip | 398.77 KB |

Quoting - jay.oswald

Quote: ".... even when mkl_set_num_cpu(2) is chosen"

Do you mean mkl_set_num_threads(int)?

--Gennady

Quoting - Gennady Fedorov (Intel)

* Quote: ".... even when mkl_set_num_cpu(2) is chosen"Do you mean mkl_set_num_threads(int)?--Gennady*

Yeah, I meant using mkl_set_num_threads(2) with smaller matrix sizes leads to only one CPU being utilized and the wrong result being obtained from the multiplication.

It seems that the case is that when the routine is called with more than one cpu, themkl_cspblas_dcsrsymv function decides whether or not the sparse matrix is large enough to use threads. If threads are used, it seems to work correctly. If only one thread is used, then for any calls after the first one,mkl_cspblas_dcsrsymv returns an incorrect result.

This problem is repeatable on all of the Windows machines I've tried - 32 bit Windows Vista, and two 64 bit Server 2008 machines. The code was compiled with VS 2008 SP1. The only difference is that on the 32 bit machine it seems the threshold matrix size for running on more than one thread is smaller. On any of the Linux machines I have tried, the code runs correctly for all matrix sizes.

This seems to be a very common usage, e.g. anyone trying to run a conjugate gradient solver is going to multiply the same sparse matrix by a vector over many iterations. Is MKL use in Windows uncommon?

And once again, this is time to upgrade

I checked this new test with MKL 10.1 and MKL 10.2 beta

On win32 and WinXP64 with different threads running on ( 1 8 )

All work well. See below the output I ve got

Win32. MKL 10.1 upadate2 and MKL 10.2 beta

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 1 thread

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

error norm: 9.07362e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

Press any key to continue . . .

Win64. MKL 10.1 upadate2 and MKL 10.2 beta

mkl_solver_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

mkl_set_num_threads(1, 2, 4, or 8 )

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 1 thread

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

error norm: 9.07362e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

Press any key to continue . . .

Quoting - jay.oswald

Ah thats it. When I downloaded the newer version the 10.0 branch was selected by default. So far 10.1 seems to work perfectly. Thanks for verifying, I can't tell you how many times I double-checked that code.

-Jay

Quoting - Gennady Fedorov (Intel)

And once again, this is time to upgrade

I checked this new test with MKL 10.1 and MKL 10.2 beta

On win32 and WinXP64 with different threads running on ( 1 8 )

All work well. See below the output I ve got

Win32. MKL 10.1 upadate2 and MKL 10.2 beta

****************************************

MKL sparse matrix-vector multiply test

****************************************

Running with 1 thread

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

error norm: 9.07362e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

Press any key to continue . . .

Win64. MKL 10.1 upadate2 and MKL 10.2 beta

mkl_solver_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

mkl_set_num_threads(1, 2, 4, or 8 )

MKL sparse matrix-vector multiply test

****************************************

Running with 1 thread

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

error norm: 9.07362e-008

A equals A0: 1

Test 2: mkl multiple calls

error norm 12: 0

error norm 13: 0

error norm 14: 0

A equals A0: 1

x - x0 norm: 0

Press any key to continue . . .

Quoting - jay.oswald

Hi Jay,

It's not clear for me what version you are talking about?

Into our registration center I've found that you got MKL 10.1 and I am little confused with it.

We had the similar problem ( MKL 10.0 ) with mkl_cspblas_dcsrsymv routine but it was fixed in 10.1.

Have you had opportunity to check this problem with 10.1?

--Gennady

Quoting - Gennady Fedorov (Intel)

*
*

*Hi Jay,It's not clear for me what version you are talking about?Into our registration center I've found that you got MKL 10.1 and I am little confused with it.We had the similar problem ( MKL 10.0 ) with mkl_cspblas_dcsrsymv routine but it was fixed in 10.1.Have you had opportunity to check this problem with 10.1?--Gennady*

I downloaded version 10.1.2.026 yesterday. When I thought I had updated before it was just from10.0.1.015 to 10.0.5.025.

Quoting - jay.oswald

*
I downloaded version 10.1.2.026 yesterday. When I thought I had updated before it was just from10.0.1.015 to 10.0.5.025.
*

Jai,

Ive checked the scalability results with 10.1 u2 on win64 ( see below the results )

There is no scalability at all with the matrix you provided.

We will investigate the problem and inform you if any news.

Thanks for the issue.

****************************************

MKL sparse matrix-vector multiply test

#1

****************************************

Creating square matrix with 200000 rows

Fill factor = 0.012042

Constucting A..........done!

Test 1: reference vs mkl

mkl_cspblas_dcsrsymv running is = 0.078933 SIZE == 200000 #THREAD == 1

mkl_cspblas_dcsrsymv running is = 0.079010 SIZE == 200000 #THREAD == 2

mkl_cspblas_dcsrsymv running is = 0.082317 SIZE == 200000 #THREAD == 4

mkl_cspblas_dcsrsymv running is = 0.087804 SIZE == 200000 #THREAD == 8

Press any key to continue . . .

#2

Creating square matrix with 100000 rows

Fill factor = 0.00574441

Constucting A..........done!

Test 1: reference vs mkl

mkl_cspblas_dcsrsymv running is = 0.039769 SIZE == 100000 #THREAD == 1

mkl_cspblas_dcsrsymv running is = 0.040521 SIZE == 100000 #THREAD == 2

mkl_cspblas_dcsrsymv running is = 0.039576 SIZE == 100000 #THREAD == 4

mkl_cspblas_dcsrsymv running is = 0.031763 SIZE == 100000 #THREAD == 8

Press any key to continue . . .

--Gennady

I re-ran my tests (focusing on speedup now that I am getting consistent results).

I'm actually not too surprised that the scaling is poor. The sparsity pattern is completely random, and the matrix is symmetric. This should be a nightmare scenario for thread blocking and memory access patterns.

Here are the speedup factors for a matrix of size N=100000 and B nonzeros per row.

B | speedup (one to two cores)

------------------

2 | 0.88

4 | 0.84

8 | 1.0

16 | 1.2

32 | 1.21

64 | 1.36

128 | 1.25

256 | 1.25

512 | 1.35

1024 | 1.38

and here is what I get when I force the matrix to a band structure. In this case I think the scaling could be better. Still there is a lot of memory transfer happening for not a whole lot of computation. Sparse matrix size goes from 2.3 MB to 1.14 GB which very quickly should overwhelm my 4MB of CPU cache. I'll run the same code on a Linux platform (dual socket quad core and 12MB cache per socket) and report my results.

B | speedup (one to two cores)

------------------

2 | 0.74

4 | 0.75

8 | 0.85

16 | 1.1

32 | 1.23

64 | 1.27

128 | 1.36

256 | 1.5

512 | 1.67

1024 | 1.78

** Edit I've attached my benchmark code.

## Attachments:

Attachment | Size |
---|---|

Download mkl_spmv_speed.zip | 8.88 KB |

Here are the results in Linux w/ 8 cores. These results are kind of interesting, the random sparcity pattern actually gets slightly better scaling. Also quad core Xeons running Linux seem to scale much better than my Core 2 Duo running Windows (granted these are E5405 w/ 12 MB cache vs an old E6600 w/ 4 MB cache).

Regardless, I'm fairly happy with the results now, as I run most of my simulations on the Xeon CPUs. I may get an i7 system to tinker around with soon too. If anyone is interested to see the scaling performance on it, I'd be happy to try it.

BANDED matrix

********************************************************

********************* SPEED TEST *********************

********************************************************

# rows = 100000, max threads = 8

B f2 f3 f4 f5 f6 f7 f8 nonzeros

2 0.86 0.87 0.83 0.75 0.68 0.63 0.58 199999

4 1 1.1 1.1 0.97 0.9 0.78 0.76 399994

8 1.3 1.5 1.5 1.4 1.4 1.2 1.2 799972

16 1.4 1.7 2.1 1.8 1.7 1.5 1.5 1599880

32 1.5 1.8 2 1.8 1.8 1.7 1.7 3199504

64 1.8 2 2.2 2 2.1 2 2.1 6397984

128 1.7 2.1 2.5 2.3 2.4 2.4 2.5 12791872

256 1.9 2.3 2.8 2.6 2.7 2.7 2.8 25567360

512 1.9 2.4 2.9 2.7 2.8 2.9 2.9 51069184

RANDOM sparcity

********************************************************

********************* SPEED TEST *********************

********************************************************

# rows = 100000, max threads = 8

B f2 f3 f4 f5 f6 f7 f8 nonzeros

2 0.76 0.78 0.75 0.69 0.65 0.6 0.54 199999

4 1 1.1 1.1 0.98 0.93 0.84 0.74 399994

8 1.4 1.6 1.7 1.5 1.5 1.4 1.3 799972

16 1.7 2.1 2.4 2.2 2.2 2.1 2.1 1599880

32 1.6 2.1 2.5 2.4 2.4 2.5 2.5 3199504

64 1.8 2.4 2.8 2.7 3 3 3.2 6397984

128 1.7 2.2 2.8 2.5 2.8 3 3.2 12791872

256 1.8 2.5 2.9 2.9 3.1 3.3 3.5 25567360

512 1.8 2.4 3.2 2.7 3.4 3.2 3.8 51069184

1024 1.9 2.3 3.1 2.9 3.1 3.3 3.5 101876224

*edit I should explain these tables - B is the number of non-zeros per row (its symmetric so 2B-1 is the actual row size), fx is the speedup factor going from 1 to x cores, i.e. f8 is the time it takes to run with 1 core divided by the time it takes with 8 cores, nonzeros is the total number of nonzeros in the matrix (again this is only counting the upper triangular part, so the actual number is 2*nonzeros - rows).

Hi Jay,

Thanks for the results and for the benchmarks.

We've already encountered with different scaling results linux/windows for this routine and still investigating the problem. We will be back if any news.

--GIF