Considering "The Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package", I have two questions:
1) The use of a temporary array for each thread may not pay off when the number of threads increases, especially when Xeon Phi is considered. Is this problem efficiently solved in the package? In which call are memory allocated for the temporary arrays? In sparseCreateESBMatrix(), sparseDcsr2esb() or sparseDesbmv() function?
2) How is the reduction of temporary arrays performed on Xeon Phi? Is there any architecture-specific way? As far as I know, it is not mentioned in the paper describing the package.