The VTune hotspots analysis shows that the majority of my application's time is spent in the call to mkl_blas_mc3_ddot. The VTune General Exploration analysis reports, for this call, a CPI of 1.124, Retire stalls of 0.718, LLC miss of 1.335, and Exec stalls of 0.327. These values seem very high.
The application is running single-threaded on a node reserved for timing runs.
Any suggestions on what might be causing this poor performance?