You got a significant performance boost by optimizing the memory access for the multiply1 function. However, according to the data in the Summary window for your updated result, r001ge, you still have a high CPI Rate and Retire Stalls issues. You can try to optimize your code further by following the steps below:
Analyze Results after Optimization
You see that the multiply2 function (in fact, the updated multiply1 function) is still a hotspot. Double-click this function to view the source code, then click both the Source and Assembly buttons on the toolbar to enable the Source and Assembly panes.
In the Source pane, the VTune Amplifier highlights line 53, which collected the highest number of Clockticks samples. This is again the section where the matrices are multiplied. The Assembly pane is automatically synchronized with the Source pane and highlights the basic blocks that correspond to the code line highlighted in the Source pane. If you compiled the application with the Intel® Compiler, you can see that the highlighted block 156 includes vectorization instructions added after your previous optimization. All vectorization instructions have the p (packed) suffix (for example, mulpd). You may use the /Qvec-report3 option of the Intel compiler to generate the compiler optimization report and see which loops were not vectorized and why. For more details, see the Intel compiler documentation.
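To make the link between memory access order and the packed instructions concrete, here is a minimal sketch of a multiplication loop with a unit-stride inner loop, which is the kind of loop a vectorizing compiler can typically turn into instructions such as mulpd. The function name multiply_unit_stride and the flat row-major layout are assumptions made for illustration; this is not the tutorial sample's multiply2 source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the tutorial's multiply2): computes c += a * b
   for n x n matrices stored row-major in flat arrays. */
void multiply_unit_stride(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            /* a[i][k] does not change inside the inner loop. */
            const double aik = a[i * n + k];

            /* The inner loop walks b and c with stride 1, so consecutive
               iterations touch adjacent doubles in memory. */
            for (size_t j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

The caller is expected to zero c before the call if a plain product is wanted. Whether a given compiler actually vectorizes this loop depends on its options and its aliasing analysis, which is what the vectorization report helps you check.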
Use More Advanced Algorithms
For Visual Studio IDE: From the VTune Amplifier toolbar, click the down arrow next to the New Analysis button and select General Exploration - Nehalem / Westmere Analysis.
For Standalone UI: From the File menu, select New > General Exploration - Nehalem / Westmere Analysis.
You see that the Elapsed time has decreased slightly, from 9.122 seconds to 8.896 seconds, but the hardware issues identified in the previous run, CPI Rate and Retire Stalls, stayed practically the same. This means that there is more room for improvement and you can try other, more effective matrix multiplication algorithms.
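One commonly used candidate for a more effective algorithm is cache blocking (tiling), which works on small sub-matrices so that the data stays in cache while it is reused. The sketch below is an illustration under stated assumptions: the name multiply_blocked, the block parameter, and the flat row-major layout are hypothetical and are not taken from the tutorial sample, and a suitable block size has to be found by measurement on your system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the tutorial sample): computes c += a * b
   for n x n row-major matrices, processing block x block tiles. */
void multiply_blocked(size_t n, size_t block,
                      const double *a, const double *b, double *c)
{
    assert(block > 0);

    for (size_t ii = 0; ii < n; ii += block) {
        const size_t i_end = (ii + block < n) ? ii + block : n;
        for (size_t kk = 0; kk < n; kk += block) {
            const size_t k_end = (kk + block < n) ? kk + block : n;
            for (size_t jj = 0; jj < n; jj += block) {
                const size_t j_end = (jj + block < n) ? jj + block : n;

                /* Multiply one tile of a by one tile of b and accumulate
                   into the matching tile of c. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The edge handling with i_end, k_end, and j_end lets n be any size, not only a multiple of block. After a change like this, rerunning the General Exploration analysis shows whether CPI Rate and Retire Stalls actually improved.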
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804