Performance evaluation of ippsAdd_32f and ippsSub_32f vs. a simple 2-for-loop implementation with /O3 optimization

I've completed a performance evaluation of a linear algebra algorithm that uses the ippsAdd_32f and ippsSub_32f IPP functions vs. a simple 2-for-loop implementation ( the same functionality in the same algorithm ) compiled with /O3 ( Intel C++ compiler ) and /O2 ( Microsoft C++ compiler ) optimizations, and my results are very interesting.
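For illustration only, the loop-based variant could look roughly like the sketch below. The function names add_32f_loop and sub_32f_loop are hypothetical, and the sketch assumes "2-for-loop" means one plain loop for the addition and one for the subtraction; the actual source of the algorithm is not shown in this thread.

```c
/* Hypothetical stand-ins for the loop-based code path; not the original source. */

/* Element-wise addition: dst[i] = a[i] + b[i].
   The IPP counterpart is ippsAdd_32f(pSrc1, pSrc2, pDst, len). */
void add_32f_loop(const float *a, const float *b, float *dst, int len)
{
    for (int i = 0; i < len; ++i)
        dst[i] = a[i] + b[i];
}

/* Element-wise subtraction: dst[i] = a[i] - b[i].
   Note: ippsSub_32f(pSrc1, pSrc2, pDst, len) computes pSrc2[i] - pSrc1[i],
   so the argument order is swapped relative to this loop. */
void sub_32f_loop(const float *a, const float *b, float *dst, int len)
{
    for (int i = 0; i < len; ++i)
        dst[i] = a[i] - b[i];
}
```

Loops this simple are typical candidates for compiler auto-vectorization at /O3 or /O2, which would be one plausible reason for the small gap between the two variants.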

In short: there was only a ~0.30% performance improvement when the IPP functions were used, and I consider it negligible. Test results are provided below.

Thanks, and please ask questions if interested.

 


A question to IDZ Administrators / Moderators:

How can one edit the original ( 1st ) post of a newly created thread? I remember that editing was available in the past.

[ Test results when IPP library is Not Used ]
...
Calculating...
Add - Completed in 3.554 ms
Add - Completed in 3.395 ms
Add - Completed in 3.525 ms
Sub - Completed in 3.367 ms
Sub - Completed in 3.127 ms
Add - Completed in 3.126 ms
Sub - Completed in 3.364 ms
Add - Completed in 3.506 ms
Sub - Completed in 3.491 ms
Add - Completed in 3.441 ms
Add - Completed in 3.103 ms
Sub - Completed in 2.968 ms
Add - Completed in 3.294 ms
Add - Completed in 3.094 ms
Add - Completed in 3.114 ms
Sub - Completed in 2.777 ms
Add - Completed in 2.756 ms
Add - Completed in 3.009 ms
( Algorithm ) - Pass 1 - Completed: 75.89500 secs
Add - Completed in 3.541 ms
Add - Completed in 3.556 ms
Add - Completed in 3.526 ms
Sub - Completed in 3.384 ms
Sub - Completed in 3.143 ms
Add - Completed in 3.148 ms
Sub - Completed in 3.363 ms
Add - Completed in 3.419 ms
Sub - Completed in 3.484 ms
Add - Completed in 3.423 ms
Add - Completed in 3.124 ms
Sub - Completed in 3.084 ms
Add - Completed in 2.904 ms
Add - Completed in 3.202 ms
Add - Completed in 3.128 ms
Sub - Completed in 2.770 ms
Add - Completed in 2.779 ms
Add - Completed in 3.039 ms
( Algorithm ) - Pass 2 - Completed: 75.87800 secs
...

[ Test results when IPP library is Used ]
...
Calculating...
Add - Completed in 3.518 ms
Add - Completed in 3.401 ms
Add - Completed in 3.364 ms
Sub - Completed in 3.280 ms
Sub - Completed in 2.754 ms
Add - Completed in 2.830 ms
Sub - Completed in 3.280 ms
Add - Completed in 3.311 ms
Sub - Completed in 3.305 ms
Add - Completed in 3.062 ms
Add - Completed in 2.954 ms
Sub - Completed in 2.595 ms
Add - Completed in 2.790 ms
Add - Completed in 3.178 ms
Add - Completed in 3.177 ms
Sub - Completed in 2.726 ms
Add - Completed in 2.724 ms
Add - Completed in 2.997 ms
( Algorithm ) - Pass 1 - Completed: 75.63000 secs
Add - Completed in 3.500 ms
Add - Completed in 3.381 ms
Add - Completed in 3.443 ms
Sub - Completed in 3.256 ms
Sub - Completed in 2.773 ms
Add - Completed in 2.839 ms
Sub - Completed in 3.296 ms
Add - Completed in 3.431 ms
Sub - Completed in 3.290 ms
Add - Completed in 3.062 ms
Add - Completed in 2.955 ms
Sub - Completed in 2.594 ms
Add - Completed in 2.844 ms
Add - Completed in 3.173 ms
Add - Completed in 3.181 ms
Sub - Completed in 2.742 ms
Add - Completed in 3.123 ms
Add - Completed in 2.938 ms
( Algorithm ) - Pass 2 - Completed: 75.61300 secs
...

With reduced output details...

[ A larger data set - Test 1 - Algorithm with IPP - 0.29% faster than Test 2 ]
...
Calculating...
Algorithm - Pass 1 - Completed: 114.35901 secs
Algorithm - Pass 2 - Completed: 114.10901 secs
Algorithm - Pass 3 - Completed: 114.07801 secs Note: Best Time ( BT1 )
Algorithm - Pass 4 - Completed: 114.07901 secs
Algorithm - Pass 5 - Completed: 114.09301 secs
...

[ A larger data set - Test 2 - Algorithm without IPP - 0.29% slower than Test 1 ]
...
Calculating...
Algorithm - Pass 1 - Completed: 114.76601 secs
Algorithm - Pass 2 - Completed: 114.40601 secs Note: Best Time ( BT2 )
Algorithm - Pass 3 - Completed: 114.40601 secs
Algorithm - Pass 4 - Completed: 114.46901 secs
Algorithm - Pass 5 - Completed: 114.42201 secs
...
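For reference, the 0.29% figure can be reproduced from the two best times, BT1 = 114.07801 secs and BT2 = 114.40601 secs. The helper below is a minimal sketch; the name percent_faster is hypothetical, and it assumes the difference is expressed relative to the slower time.

```c
/* Relative improvement of the faster run over the slower run,
   expressed as a percentage of the slower time. */
double percent_faster(double slow_secs, double fast_secs)
{
    return (slow_secs - fast_secs) / slow_secs * 100.0;
}
```

Calling percent_faster(114.40601, 114.07801) gives ( 114.40601 - 114.07801 ) / 114.40601 * 100, which is roughly 0.29.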

Hardware & Software details:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit

Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
