I'm trying to find the fastest way to do a multithreaded sparse matrix-vector multiply. I've written some benchmarking code that forms a large random sparse matrix in CSR format and then times three implementations of y = y + A*x: a serial implementation, an OpenMP implementation, and mkl_dcsrmv. For each one, I compute the average and minimum time over a number of runs, say, 10.
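For readers without the attachment, a row-parallel CSR kernel of the kind described here might look like the sketch below. It is only an illustration, not the attached code: the function name spmv_csr_omp and the array names row_ptr, col_idx and val are hypothetical, and zero-based indexing and static scheduling are assumptions.

```c
#include <assert.h>

/* Hypothetical row-parallel CSR kernel computing y = y + A*x.
 * Assumes zero-based indexing: row i owns entries
 * row_ptr[i] .. row_ptr[i+1]-1 of col_idx and val. */
void spmv_csr_omp(int n_rows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
{
    int i;

    assert(n_rows >= 0);

    /* Rows are independent, so they are divided among the threads.
     * If OpenMP is not enabled at compile time, the pragma is ignored
     * and the loop simply runs serially. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n_rows; i++) {
        double sum = 0.0;
        int k;

        /* Dot product of row i with x, accumulated locally. */
        for (k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];

        y[i] += sum;
    }
}
```

Each thread writes only its own entries of y, so no synchronization is needed inside the loop. Static scheduling is a reasonable default when the nonzeros are spread roughly evenly across rows, as they would be in a random matrix.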
Strangely, though, the OpenMP implementation consistently beats MKL. For the matrix sizes in the code, OpenMP has a minimum time of 0.199272 seconds over 10 runs, while MKL has a minimum time of 0.249399 seconds. This is for a matrix with about 256 million nonzeros.
I'm running this on a machine with 32 cores. I've varied the number of threads and experimented with the KMP_AFFINITY environment variable, and the OpenMP code is faster in every case.
Any idea why I'm getting these results? Perhaps I'm using MKL sub-optimally? Any help would be greatly appreciated.
I've attached the code I'm running. I compile it with "icc -mkl -openmp rand_mat.c".