Performance evaluation: Intel(R) C++ Compiler XE Version 12 vs. Microsoft (R) C/C++ Optimizing Compiler Version 14

I'd like to share some results of a performance evaluation of Intel(R) C++ Compiler XE Version 12 and Microsoft (R) C/C++ Optimizing Compiler Version 14. The tests use a C++ template function for matrix multiplication. Please see the next post for the results.

With the /O2 and /O3 options, the Intel C++ compiler outperformed the Microsoft C++ compiler by about 43% and 44%, respectively.

[ Test 1.1 - Intel C++ compiler - /O2 and Floating Point Model /fp:precise ]
Data Set: TMatrixSetF - 16 x 16
Sub-Test 3 completed in: 24016 ticks

[ Test 1.2 - Microsoft C++ compiler - /O2 and Floating Point Model /fp:precise ]
Data Set: TMatrixSetF - 16 x 16
Sub-Test 3 completed in: 24719 ticks

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

[ Test 2.1 - Intel C++ compiler - /O2 and Floating Point Model /fp:fast ]
Data Set: TMatrixSetF - 16 x 16
Sub-Test 3 completed in: 13547 ticks

[ Test 2.2 - Intel C++ compiler - /O2 and Floating Point Model /fp:fast=2 ]
Data Set: TMatrixSetF - 16 x 16
Sub-Test 3 completed in: 13484 ticks

[ Test 2.3 - Intel C++ compiler - /O3 and Floating Point Model /fp:fast=2 ]
Data Set: TMatrixSetF - 16 x 16
Sub-Test 3 completed in: 13469 ticks

[ Test 2.4 - Microsoft C++ compiler - /O2 and Floating Point Model /fp:fast ]
Data Set: TMatrixSetF - 16 x 16
Sub-Test 3 completed in: 23984 ticks

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Note: The C++ template function for matrix multiplication was executed 262144 times. The function uses the SSE2 _mm_mul_ps intrinsic ( 4 multiplications at the same time in one unrolled iteration ).
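
For reference, here is a minimal sketch in the spirit of that kernel. The actual template function is not posted in this thread, so the row-major layout, the function name MatMulSse2, and the broadcast-and-accumulate scheme are assumptions; only the use of _mm_mul_ps on four floats per unrolled step comes from the note above.

#include <emmintrin.h>   // SSE/SSE2 intrinsics: _mm_mul_ps, _mm_add_ps, ...

// Hypothetical kernel: C = A * B for row-major N x N float matrices.
template < int N >
void MatMulSse2( const float *a, const float *b, float *c )
{
    for( int i = 0; i < N; i++ )
    {
        for( int j = 0; j < N; j += 4 )                        // 4 columns per unrolled iteration
        {
            __m128 acc = _mm_setzero_ps();
            for( int k = 0; k < N; k++ )
            {
                __m128 va = _mm_set1_ps( a[i*N + k] );         // broadcast A( i, k )
                __m128 vb = _mm_loadu_ps( &b[k*N + j] );       // 4 consecutive elements of row k of B
                acc = _mm_add_ps( acc, _mm_mul_ps( va, vb ) ); // 4 multiplications at the same time
            }
            _mm_storeu_ps( &c[i*N + j], acc );
        }
    }
}

A driver in the spirit of the tests above would call MatMulSse2<16>( a, b, c ) 262144 times and report the elapsed ticks.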

In my experience, the Microsoft compiler doesn't perform any vectorizations beyond those performed by ICL at /fp:source or /fp:precise; this appears consistent with your result. Although Microsoft doesn't use any restrict or pragma assertions in optimizations that I've been able to find, it does fairly well at /arch:AVX under those restrictions.

If you dictated all important optimizations by intrinsics, I'd expect it to be possible to bring the Microsoft compiler up to parity. Even without intrinsics, the Intel compiler is now clever about unroll_and_jam optimizations for matrix multiplication, even up into the size range where OpenMP becomes useful. For larger matrices, of course, you would likely use MKL or ( with MSVC ) some other performance library.
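
As an illustration of the no-intrinsics route, a plain triple loop such as the sketch below is the kind of code the Intel compiler can unroll-and-jam and vectorize on its own. The #pragma unroll_and_jam hint and the __restrict qualifiers are aimed at ICL ( MSVC ignores them, per the note above ), and the unroll factor of 4 is an arbitrary choice for illustration, not something from the original post.

// Plain C = A * B, no intrinsics; row-major n x n floats ( assumed layout ).
void MatMulPlain( const float * __restrict a, const float * __restrict b,
                  float * __restrict c, int n )
{
    for( int i = 0; i < n*n; i++ )
        c[i] = 0.0f;

    #pragma unroll_and_jam( 4 )            // ICL hint: unroll the i loop and jam into the nest
    for( int i = 0; i < n; i++ )
        for( int k = 0; k < n; k++ )       // i-k-j order keeps the j loop unit-stride
            for( int j = 0; j < n; j++ )
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}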

>>[ Test 2.1 - Intel C++ compiler - /O2 and Floating Point Model /fp:fast ]
>>Data Set: TMatrixSetF - 16 x 16
>>Sub-Test 3 completed in: 13547 ticks
>>
>>[ Test 2.2 - Intel C++ compiler - /O2 and Floating Point Model /fp:fast=2 ]
>>Data Set: TMatrixSetF - 16 x 16
>>Sub-Test 3 completed in: 13484 ticks

Tim, I wonder if you, or somebody else, could provide some additional details on the /fp:fast and /fp:fast=2 command-line options of the Intel C++ compiler. As you can see, there is a ~0.5% performance improvement when /fp:fast=2 is used. Thanks in advance.
