Temporary arrays in Sparse Matrix Vector Multiply Format Prototype Package

Considering "The Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package", I have two questions:

1) The use of a temporary array for each thread may not pay off as the number of threads increases, especially on Xeon Phi. Does the package solve this problem efficiently? In which call is memory allocated for the temporary arrays: sparseCreateESBMatrix(), sparseDcsr2esb(), or sparseDesbmv()?

2) How is the reduction of the temporary arrays performed on Xeon Phi? Is there an architecture-specific approach? As far as I can tell, it is not mentioned in the paper describing the package.



One of our MKL team members is looking at your question and will get back to you soon.


Hi Kadir,

The package performs sparse matrix-vector multiplication for the general case only, i.e. y = alpha * A * x + beta * y, so this case requires almost no synchronization between threads. For the ESB functionality we need to allocate the following arrays in the sparseDcsr2esb() function: arrays for the matrix representation in ESB format, plus some arrays for workload balancing and synchronization. The sizes of these arrays depend on the sparse matrix structure and do not depend on the number of threads.

Because synchronization may be required between consecutive threads only, the SpMV functionality should demonstrate good scalability on Intel Xeon Phi.

Could you please clarify your questions about temporary arrays?

Regards, Sergey

The 2nd paragraph of Subsection 4.3, entitled "SpMV Kernel with ESB Format", of the paper describing the package reads as follows:

"In practice, we also parallelize the SpMV operation across column blocks. Here, we use a temporary copy of y for each block and use a reduce operation across these temporary copies at the end of the computation."

Caption of Algorithm 2 is as follows:

"Algorithm 2: Multiply the i-th column block of matrix in ESB format"
As far as I understand, there is a synchronization point after the multiplication of the column blocks. After the synchronization point, the temporary copy of y for each block undergoes a reduction operation.


Hi Kadir,

A matrix in ESB format is stored in slices: each slice consists of 8 rows and is stored in ELLPACK format. In fact, each thread computes an update for the output vector Y in the form of a 512-bit vector register variable. So when several threads compute parts of the same slice (parallelization across column blocks), they need to synchronize updates of just these small variables. So we don't need to allocate a lot of additional memory for every thread.

Regards, Sergey

How are the synchronizations and atomic updates implemented? The performance of #pragma omp atomic is too low. Are there any more efficient solutions for synchronization and atomic updates when there are more than 200 threads on Intel Xeon Phi? I was not able to find any information about the implementation of the synchronizations and atomic updates in the paper.
