While profiling my code, I observed curiously large overhead and spin time. To check, I ran the VTune Amplifier concurrency analysis on a separate example, the dgemm code from the MKL tutorial (link here), and found that overhead and spin time together covered almost 100% of the CPU usage bar (see the figure below).
As I understand overhead and spin time (consistent with the definitions in the Intel® VTune Amplifier help), both metrics should be small, close to zero, in efficient parallel code. So I am surprised that profiling an MKL matrix-matrix multiplication shows almost 100% overhead and spin time. The summary page reports: CPU time: 12.421, Overhead time: 10.125, Spin time: 2.170, concurrency: ideal. Yet the CPU usage histogram shows almost zero usage! Can you please clarify what is going on?
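To make the "almost 100%" figure concrete, here is a quick arithmetic check on the summary numbers quoted above. It is only a sketch: the variable names are mine, and it assumes that VTune counts both overhead and spin time as part of the reported CPU time.

```python
# Values copied from the VTune summary page quoted above.
cpu_time = 12.421
overhead_time = 10.125
spin_time = 2.170

# Assumption: overhead and spin time are both included in CPU time,
# so their sum can be compared directly against it.
non_productive = overhead_time + spin_time
share = non_productive / cpu_time
remaining = cpu_time - non_productive

print(f"Overhead + spin: {non_productive:.3f} ({share:.1%} of CPU time)")
print(f"Remaining CPU time: {remaining:.3f}")
```

Under that assumption, overhead and spin account for roughly 99% of the CPU time, which leaves only about 0.126 of the 12.421 for everything else, including the multiplication itself.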
Note: I am using Update 11 of VTune Amplifier.