I ran the VTune Amplifier concurrency analysis on an example dgemm code (link here) and, surprisingly, overhead and spin time accounted for almost 100% of the CPU usage bar (reported here). I also ran the concurrency analysis on the sparse matrix-vector multiplication kernel mkl_dcsrsymv and obtained a similar result. Since the examples mentioned here achieve very high performance, the large reported overhead seems hard to reconcile with that performance. I initially asked for an explanation in the VTune Amplifier forum (here) and was advised to ask in this forum instead.
Do you have any explanation for the large overhead and spin time?
Note: I am using VTune Amplifier update 11 and Intel Composer XE 2013.