performance numbers MKL 11.0 vs Eigen?

I found the results here a bit surprising, especially the MVM one (matrix-vector multiplication, with and without transposition). How can MKL, which is heavily optimized and even supports AVX, get lower performance than Eigen, which only implements SSE2?

They also state that the benchmarks correspond to the latest MKL 11.0.

I understand that they outperform MKL for "complex expressions" thanks to expression templates, that much is clear, but how can they still outperform MKL on the MVM primitives?

Thanks in advance,

Best regards,



What are the problem sizes in that case?
It might happen for small inputs.

Indeed, the sizes on the MV chart are 100-1000, which is very small and quite unusual for HPC. As you can see, there's a significant drop near 1000, which means the task no longer fits into the last-level cache. Frankly speaking, it makes sense to assess the memory-limited MV operation starting roughly from that point (rather than finishing the measurements there). Another unclear aspect of all those charts is the use of only 1 thread on a machine with 4 cores. I can only guess that the reason is that the majority of Eigen operations are not threaded.

Considering only 1-thread MV performance at such small sizes - yes, it may well be that Eigen is faster than all other libraries in this particular case. But this is because all the libraries have additional overhead associated with the calling stack and, probably, because this case has the lowest priority for real tasks.

BTW, Eigen provides an easy way to use Intel(R) MKL as a backend:
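A minimal sketch of what that looks like: the `EIGEN_USE_MKL_ALL` macro is Eigen's documented switch for routing supported dense operations to MKL. This snippet assumes Eigen 3.1+ and an MKL installation are available, and won't compile without them.

```cpp
// Define before including any Eigen header to dispatch supported
// dense operations (including matrix-vector products) to Intel MKL
// instead of Eigen's own kernels.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::VectorXd x = Eigen::VectorXd::Random(1000);

    Eigen::VectorXd y = A * x;             // backed by MKL's ?gemv
    Eigen::VectorXd z = A.transpose() * x; // transposed MVM, also via MKL
    return 0;
}
```

You also need to link against MKL when building (e.g. via `-lmkl_rt` or the link line suggested by the MKL Link Line Advisor).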

With respect to AVX - please note that the Intel(R) Core(TM)2 Quad CPU Q9400 used in the measurements doesn't support AVX at all.

Indeed, this benchmark is quite old and was performed on a CPU with no AVX support. Activating multi-threading for a matrix-vector operation makes little sense because most of the time the application is parallelized at a higher level (e.g., matrix factorization). The benchmark goes up to matrix sizes of 3000 (not 1000). For larger matrices, all libraries perform poorly since caching strategies cannot be used for level-2 operations. The good performance of Eigen here is mainly due to a clever trick to completely avoid unaligned memory access in all situations: we form one unaligned packet from two aligned loads. More details in the code!
