AVX Optimizations and Performance: VisualStudio vs GCC


James S. wrote:

Greetings,

   I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:

1. Windows 7 w/ Visual Studio 2010 on a i7-2760QM

   Optimization: Maximize Speed (/O2)

   Inline Function Expansion: Only __inline (/Ob1)

   Enable Intrinsic Functions: No

   Favor Size or Speed: Favor fast code (/Ot)

2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE

   Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx

For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:

In Visual Studio:

C Implementation: 30ms

AVX Implementation: 5ms

In GCC:

C Implementation: 9ms

AVX Implementation: 57ms

As you can tell, my AVX numbers on Linux are very large by comparison. My concern, and the reason for this post, is that I may not have a proper understanding of AVX and of the settings needed to use it properly in both scenarios. For example, take my Visual Studio run: if I change Enable Intrinsic Functions to Yes, my AVX numbers go from 5ms to 59ms. Does that mean that keeping the compiler from optimizing with intrinsics and setting them manually in Visual Studio gives that much better a result? Last I checked there is nothing similar in gcc. Could Microsoft's compiler really produce that much better a compile than gcc in this case? Any ideas why my AVX numbers on gcc are so much larger? Any help is most appreciated. Cheers.

iliyapolak wrote:

Sorry, but I am confused. Did you use inline AVX assembly in your code or SIMD AVX intrinsics?

James S. wrote:

My apologies for not being more specific. I used SIMD AVX intrinsics, more specifically the functions _mm256_loadu_ps, _mm256_mul_ps, _mm256_add_ps, and _mm256_storeu_ps.
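
For reference, the inner loop looks roughly like the sketch below. This is illustrative only, not my actual code; the names are made up, and _mm256_setzero_ps is an extra intrinsic beyond the four listed above.

#include <immintrin.h>

/* Hypothetical sketch of one output sample of the convolution:
   accumulate 8 partial products at a time, spill the partial sums
   with an unaligned store, then finish the reduction in scalar code. */
static float dot_avx(const float *x, const float *coeff, int klen)
{
    float partial[8];
    __m256 acc = _mm256_setzero_ps();
    int k = 0;
    for (; k + 8 <= klen; k += 8) {
        __m256 a = _mm256_loadu_ps(&x[k]);              /* 8 input samples     */
        __m256 b = _mm256_loadu_ps(&coeff[k]);          /* 8 coefficients      */
        acc = _mm256_add_ps(acc, _mm256_mul_ps(a, b));  /* multiply-accumulate */
    }
    _mm256_storeu_ps(partial, acc);                     /* spill partial sums  */
    float sum = 0.0f;
    for (int j = 0; j < 8; ++j) sum += partial[j];      /* horizontal reduction */
    for (; k < klen; ++k) sum += x[k] * coeff[k];       /* scalar tail          */
    return sum;
}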

iliyapolak wrote:

First question: I see that you are comparing compiled code on two different processor generations. How do you measure your code's performance?

James S. wrote:

I am measuring performance by timing the operation (the operation being the convolution on the data). So I am using native libraries to grab a timestamp and determine the elapsed time in milliseconds. Yes, they are different generations, but I would presume the newer generation would give better numbers on AVX than the older one. This is why I think there is something wrong with the gcc version or with how I have set its optimization flags.
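
For what it is worth, the measurement itself is nothing exotic. On the Linux side it is roughly the following sketch (not my exact code):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... run the convolution under test here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
    printf("Elapsed: %.3f ms\n", ms);
    return 0;
}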

iliyapolak wrote:

Have you looked at the disassembled code generated by the two compilers? Some of the intrinsics are not translated directly into a single machine instruction, but I presume you are doing convolution on digital data, so the intrinsics used should mainly be load, store, add, and mul. Moreover, there are additional factors such as memory and cache performance and the overall load of the system at the time of measurement.

iliyapolak wrote:

There are also additional uncertainties, such as the thread being swapped out in the middle of the code being measured. When the thread's execution is resumed, that wait time can end up included in the measurement.

James S. wrote:

iliyapolak,

   Thank you very much for your responses. I retrieved the assembly code from gcc and Visual Studio for both the AVX and C implementations of what I am doing. The Visual Studio comparison was fairly clear: the AVX implementation showed the following assembly where my AVX calls were made:

; Line 190
    vmovups    ymm3, YMMWORD PTR [eax]
; Line 192
    vmulps    ymm3, ymm3, YMMWORD PTR [ecx]
    add    eax, edi
    add    ecx, 32                    ; 00000020H
; Line 194
    vaddps    ymm0, ymm3, ymm0

The C implementation was much larger by comparison (I will not post it) and contained a plethora of moves, adds, and multiplies. Thus, it was clear that the Visual Studio compiler utilized the AVX intrinsics and reduced my code size considerably. The gcc assembly, however, was not as clear. The AVX version contains what I believe to be the AVX assembly, but it differs from what Visual Studio produced:

vmulps    %ymm1, %ymm6, %ymm1

vmulps    %ymm1, %ymm5, %ymm1

etc., as this occurs 5 times over. I do notice that in Visual Studio the vmulps call referenced a memory location with "YMMWORD PTR [ecx]", whereas gcc operates on registers directly. The C implementation from gcc did not contain any AVX assembly; it was, however, shorter in overall length than the AVX version.

In regard to your second question, the code running on Linux with gcc has its affinity set to avoid context switching, if that is what you were referring to. Thanks again for all of your help.
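
The affinity pinning on the Linux side is just the usual sched_setaffinity call, roughly the following sketch (not my exact code):

#define _GNU_SOURCE
#include <sched.h>

/* Sketch: pin the process to core 0 so the timed region is not
   migrated between cores while it runs. */
static void pin_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling process */
}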

iliyapolak wrote:

The VS implementation, as seen in that assembly snippet, loads (or dereferences a pointer to) the array at line 190, probably an input to your convolution function. At line 192 there is a multiplication by the convolution coefficients, which is part of a loop not visible in the snippet, and two lines below there is pointer arithmetic. At line 194 there is a summation into ymm0, whose load is also not shown in the snippet. The GCC implementation probably preloads the ymm registers and does the multiplications on registers directly.

James S. wrote:

Do you think that this ("the GCC implementation probably preloads the ymm registers and does the multiplications on registers directly") is the reason gcc is performing so much slower than its Visual Studio counterpart?

iliyapolak wrote:

Hi James,

I cannot answer that, because you did not upload the full disassembly of the GCC-generated code. But I suppose that the ymm register(s) must have been loaded with either the convolution function's input or its coefficients. On Haswell, two loads can be performed in parallel. In the VS code you have a load of one data stream and a mul of that stream with another stream loaded from memory or cache; I think those two operations can be performed in parallel by using the physical registers of the register file. The last operation depends on the previous two.

Sergey Kostrov wrote:

>>In Visual Studio:
>>
>>C Implementation: 30ms
>>
>>AVX Implementation: 5ms
>>
>>In GCC:
>>
>>C Implementation: 9ms
>>
>>AVX Implementation: 57ms

In essence, your results are very different from mine, which are based on performance evaluation of some linear algebra algorithms.

I would rate the three most widely used C++ compilers as follows:

1. Intel C++ compiler ( versions 12.x and 13.x )
2. GCC-like MinGW ( version 4.8.1 )
3. Microsoft C++ compiler ( VS 2010 )

Take into account that the core parts of these linear algebra algorithms are individually optimized for every C++ compiler in order to get the best possible performance, because every compiler uses different techniques to optimize code, do vectorization, and so on. Another factor is compiler options, and I have also tuned these as well as possible.

iliyapolak wrote:

Did you rate the compilers according to their code optimization techniques?

Sergey Kostrov wrote:

Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.

Tim Prince wrote:

When comparing the performance of AVX intrinsics against the compiler's choice of AVX instructions, you must observe the recommendation that _mm256_loadu_ps be used only on aligned data on Sandy Bridge.  Even on the newer generations, splitting unaligned loads, as the AVX compilation options do, will frequently run faster.  _mm256_storeu_ps requires aligned data for satisfactory performance on both Sandy Bridge and Ivy Bridge CPUs, so compilers will peel for alignment or split those stores into AVX-128 when permitted to do so.
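
In practice that means keeping the buffers 32-byte aligned so that even the unaligned-load/store intrinsics end up touching aligned addresses. A minimal sketch (names are illustrative):

#include <immintrin.h>

/* Sketch: allocate the convolution buffers 32-byte aligned so that the
   *_loadu_ps / *_storeu_ps calls always see aligned addresses. */
static void run_aligned(int n)
{
    float *in  = (float *)_mm_malloc(n * sizeof(float), 32);
    float *out = (float *)_mm_malloc(n * sizeof(float), 32);
    /* ... fill in[], run the convolution, consume out[] ... */
    _mm_free(in);
    _mm_free(out);
}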

The CPU architects were aware of the tendency of VS2010 coders to use _mm256_loadu_ps, and so put a fix into Ivy Bridge to alleviate the penalty for unaligned data.

VS2012 introduced a limited degree of auto-vectorization as an alternative to vectorization by intrinsics.  gcc 4.6 is likewise a bit too old for use in evaluating AVX auto-vectorization.

We never found out why so much emphasis was placed on reduced numbers of instructions with AVX when it was well known that this would produce little performance gain in many situations.
 

iliyapolak wrote:

Quote:

Sergey Kostrov wrote:

Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.

Was the performance of Intel C++ version 12.x better than that of the MS VC++ compiler?

I bet that the Intel compiler writers' expertise lets them outperform competing compilers mainly in the areas of code optimization for a specific microarchitecture and of code parallelization and vectorization.

Tim Prince wrote:

You could make up a benchmark entirely within the range of situations where MSVC++ (VS2012 or 2013) auto-vectorizes, and find that compiler performing fully as well as the others.

You could set ground rules, as many people do, where you enable aggressive optimizations on one compiler and not another.

Any percentage performance rankings are highly dependent on benchmark content.

You might perhaps set up a table of which compilers perform selected categories of optimizations, according to compilation flags.

iliyapolak wrote:

When I receive my Parallel Studio licence file, I plan to test the Intel, MSVC++ and MinGW compilers.

Thanks for the interesting advice on how to perform such a test.

Sergey Kostrov wrote:

>>...Was the performance of Intel C++ version 12.x better than that of the MS VC++ compiler?

Yes.

Tim Prince wrote:

Quote:

Sergey Kostrov wrote:

>>...Was the performance of Intel C++ version 12.x better than that of the MS VC++ compiler?

Yes.

MSVC++ sometimes optimizes loop-carried data dependency recurrences and switch statements better than ICL.

On the other side, in auto-vectorization (first implemented in VS2012, whereas ICL has had it for well over a decade), the following optimizations seem to be missing in MSVC++:

taking advantage of __RESTRICT to enable vectorization (see the sketch at the end of this post)

simd optimization of sum and inner_product reductions

simd optimization based on assertions to overcome "protects exception"

simd optimization of OpenMP for loops (some of these not introduced in ICL or gcc until this year)

simd optimization of non-unit strides

vectorizable math functions

simd optimization of STL transform()

optimizations depending on non-overlapping array sections (for which ICL requires assertions, but gcc optimizes without assertion)

simd optimizations depending on in-lining

optimization based on "node splitting"

optimization of std::max and min (g++ doesn't optimize these, although it seemingly could use gfortran machinery to do so)

       g++ can optimize fmax/fmin when -ffinite-math-only is set (so why not std::max/min?)

optimization based on data alignment assertion

Of course, most of these optimizations are more relevant to floating point and parallelizable applications than to those for which MSVC++ is more directly targeted.  Even in the floating point applications, MSVC++ is likely to optimize at least 50% of vectorizable loops.
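
To illustrate the first item in the list, here is a minimal sketch (not from any particular codebase) of the kind of loop where the restrict qualifier is what permits vectorization:

/* Sketch: with the __restrict qualifiers the compiler may assume dst, a
   and b do not overlap, which allows it to vectorize the loop; remove
   them and auto-vectorization is usually blocked by possible aliasing. */
void axpy(float * __restrict dst,
          const float * __restrict a,
          const float * __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a[i] + 2.0f * b[i];
}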

iliyapolak wrote:

Quote:

iliyapolak wrote:

Quote:

Sergey Kostrov wrote:

Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.

Was the performance of Intel C++ version 12.x better than that of the MS VC++ compiler?

I bet that the Intel compiler writers' expertise lets them outperform competing compilers mainly in the areas of code optimization for a specific microarchitecture and of code parallelization and vectorization.

I should have asked how much faster the Intel compiler was than its Microsoft counterpart.

Sergey Kostrov wrote (Best Reply):

>>...2. Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE

I recommend upgrading GCC to version 4.8.1 Release 4.

AVX Performance Tests

[ Microsoft C++ compiler VS 2010 ( AVX ) ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 58.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 58.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 66.25000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 62.25000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 62.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 62.50000 ticks
Sub - 1D-based - Passed
...

Sergey Kostrov wrote:

[ MinGW C++ compiler version 4.8.1 Release 4 ( AVX ) ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 50.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 46.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 50.75000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 50.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 50.75000 ticks
Sub - 1D-based - Passed
...

Sergey Kostrov wrote:

[ Intel C++ Compiler XE 13.1.0.149 ( AVX ) ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 54.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 50.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 50.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 46.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 50.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 50.75000 ticks
Sub - 1D-based - Passed
...

Note: Take into account that the quality of code generation in the latest versions of GCC, especially for legacy instruction sets like SSE2 and SSE4, and of course for AVX, has improved compared to version 3.4.2. In several of my test cases it already outperforms the Intel C++ compiler.

Tim Prince wrote:

OK, situations where I see gcc out-performing icc:

Unrolled source, requiring re-roll optimization, where the compiler replaces the source-code unrolling with its own optimization. icc dropped re-rolling back around version 10.0.  We can argue that source-code unrolling is an undesirable practice, given that modern software and hardware techniques eliminate the need for it in trivial cases.  Unroll-and-jam is another story.

Variable stride (even when written with CEAN so as to get AVX-128 in icc), e.g.

      for (i__ = *n1; i__ <= i__2; i__ += i__3)
          a[i__] += b[*n - (k += j) + 1];

Some cases where intrinsics are used to dictate code generation, or vectorization isn't possible, and the superior automatic unrolling facilities of gcc come in, if you are willing to use them (and adjust the unrolling limit to your CPU) e.g.

CFLAGS = -O3 -std=c99 -funroll-loops --param max-unroll-times=2 -ffast-math -fno-cx-limited-range -march=corei7-avx -fopenmp -g3 -gdwarf-2

Those debug options are recommended for using Windows gcc with Intel(r) VTune(tm), in case you missed the hint.

gcc -ffast-math -fno-cx-limited-range is roughly equivalent to icl -fp:fast=1, the latter being a default.

If you don't study the options, you won't get the best out of gcc.

Intel corei7-4 CPUs want less software unrolling than their predecessors, while gcc's aggressive unlimited unrolling was better suited to the Harpertown generation.  I'm not getting consistently satisfactory results with AVX2 from either compiler; AVX2 seems to expose bugs in gcc with OpenMP, while icc drops some i7-2 optimizations which remain better on i7-4.  I'll wait to optimize for corei7-5 when it arrives, if my retirement permits.

It's not usually difficult to discover and correct situations where icc doesn't match gcc performance, while there are many situations where it's easier to get full performance with icc.

iliyapolak wrote:

Aggressively unrolling by more than two, besides increasing register pressure, will still only utilize two execution ports (depending on the instructions) in parallel per cycle. When coupled with prefetching, the outstanding decoded instructions (micro-ops) that correspond to the prefetched data can be pipelined through the SIMD execution stack. Here the instruction cache can speed up execution by caching frequently used decoded machine instructions.

iliyapolak wrote:

@Sergey

Thanks for posting the results of the compiler comparison.

James S. wrote:

Sergey,

   I recompiled my software on a machine that has gcc 4.8.2 and updated my compiler flags to the following:

-O3 -march=core-avx-i -mtune=core-avx-i

I am, however, getting on average exactly the same timing numbers as before... which to me is very odd. I can't help but think I am missing something trivial...

Thanks again for your help in this matter.

iliyapolak wrote:

Can you post the full disassembly of the GCC-generated code? Moreover, I would advise you to profile the FIR code with the help of VTune.

Sergey Kostrov wrote:

>>-O3 -march=core-avx-i -mtune=core-avx-i
>>
>>I am, however, getting on average exactly the same timing numbers as before... which to me is very odd. I can't help
>>but think I am missing something trivial...

Your set of options is very simple and, I would say, basic. So you need to use more GCC compiler options; please review as many as possible (try to tune up your application, although this is a very time-consuming procedure).

If you use lots of for-loops in your code, take a look at how the __builtin_assume_aligned built-in function needs to be used. There is some "magic" related to it, and it really speeds up processing. Take into account that in almost all cases when memory is allocated dynamically I use the _mm_malloc and _mm_free intrinsic functions (however, there are some exceptions...).
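
The pattern I mean is roughly the following sketch (details differ from project to project, and the function name here is only illustrative):

#include <immintrin.h>

/* Sketch: tell GCC that the dynamically allocated buffers are 32-byte
   aligned so its vectorizer can drop runtime alignment checks/peeling.
   The buffers themselves come from _mm_malloc(size, 32) and are
   released with _mm_free(). */
void add_arrays(float *a, float *b, float *c, int n)
{
    float *pa = (float *)__builtin_assume_aligned(a, 32);
    float *pb = (float *)__builtin_assume_aligned(b, 32);
    float *pc = (float *)__builtin_assume_aligned(c, 32);
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];
}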

I'll post more performance results later.

This week I have spent a significant amount of time on combining auto-vectorization and manual software pipelining. The results are positive: there is a ~1.5 percent improvement in performance.

Sergey Kostrov wrote:

>>...I can't help but think I am missing something trivial...

James, I will follow up on that and explain in a generic way how I tune up algorithms.

Thanks for the update related to GCC version 4.8.2; I'll update as well.

Sergey Kostrov wrote:

>>>>...I can't help but think I am missing something trivial...
>>
>>James, I will follow up on that and explain in a generic way how I tune up algorithms.

Please consider fine-tuning the optimization of your convolution algorithm. I use 5 different C++ compilers and, in most cases, the core parts of some algorithms in the project I work on are fine-tuned for each C++ compiler. Here is an example ( performance results for a classic matrix multiplication algorithm ):
...
#if ( defined ( _WIN32_MGW ) )
#define MatrixMulProcessingCTv1 MatrixMulProcessingCTUnRvA1
#define MatrixMulProcessingCv1 MatrixMulProcessingCUnRvA1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvB1
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvB1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvC1
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvC1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvD1
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvD1
// #define MatrixMulProcessingCTv1 MatrixMulProcessingCTvE1 // ***
// #define MatrixMulProcessingCv1 MatrixMulProcessingCvE1 // ***
#endif
...

[ Performance Results ]

Matrix Size : 1024 x 1024
Matrix Size Threshold: N/A
Matrix Partitions : N/A
ResultSets Reflection: N/A
Calculating...
...

Test-Case 1 - Version MatrixMulProcessingCTUnRvA1 used
...
Classic A - Pass 01 - Completed: 3.17200 secs
Classic A - Pass 02 - Completed: 3.17200 secs
Classic A - Pass 03 - Completed: 3.17200 secs
Classic A - Pass 04 - Completed: 3.17200 secs
Classic A - Pass 05 - Completed: 3.17100 secs
...

Note: Worst Performance

Test-Case 2 - Version MatrixMulProcessingCTvB1 used
...
Classic A - Pass 01 - Completed: 2.73500 secs
Classic A - Pass 02 - Completed: 2.73400 secs
Classic A - Pass 03 - Completed: 2.73400 secs
Classic A - Pass 04 - Completed: 2.73400 secs
Classic A - Pass 05 - Completed: 2.73500 secs
...

Test-Case 3 - Version MatrixMulProcessingCTvC1 used
...
Classic A - Pass 01 - Completed: 2.73500 secs
Classic A - Pass 02 - Completed: 2.73400 secs
Classic A - Pass 03 - Completed: 2.73400 secs
Classic A - Pass 04 - Completed: 2.73400 secs
Classic A - Pass 05 - Completed: 2.73500 secs
...

Test-Case 4 - Version MatrixMulProcessingCTvD1 used
...
Classic A - Pass 01 - Completed: 2.71900 secs
Classic A - Pass 02 - Completed: 2.71900 secs
Classic A - Pass 03 - Completed: 2.71800 secs
Classic A - Pass 04 - Completed: 2.71900 secs
Classic A - Pass 05 - Completed: 2.70300 secs
...

Note: Best Performance

Test-Case 5 - Version MatrixMulProcessingCTvE1 used
...
Classic A - Pass 01 - Completed: 2.71800 secs
Classic A - Pass 02 - Completed: 2.71900 secs
Classic A - Pass 03 - Completed: 2.71900 secs
Classic A - Pass 04 - Completed: 2.71900 secs
Classic A - Pass 05 - Completed: 2.71800 secs
...

Note: Best Performance

Sergey Kostrov wrote:

Just for comparison, here are the results for the Microsoft C++ compiler:

Test-Case 1 - Version MatrixMulProcessingCTUnRvA1 used
...
Classic A - Pass 01 - Completed: 3.34400 secs
Classic A - Pass 02 - Completed: 3.32800 secs
Classic A - Pass 03 - Completed: 3.32800 secs
Classic A - Pass 04 - Completed: 3.32800 secs
Classic A - Pass 05 - Completed: 3.31300 secs
...

Note: Best Performance ( however, it is slower by ~18 percent compared to MinGW )

Sergey Kostrov wrote:

Here is a set of follow-ups...

Sergey Kostrov wrote:

Performance Tests

[ MinGW C++ compiler version 4.8.1 Release 4 ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 511.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 511.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 511.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 511.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 507.75000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 511.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 511.75000 ticks
Sub - 1D-based - Passed
...

[ Intel C++ compiler XE 12.1.7.371 ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 519.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 519.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 519.75000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 519.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 519.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 519.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 519.50000 ticks
Sub - 1D-based - Passed
...

[ Microsoft C++ compiler VS 2005 ]
...
Matrix Size: 8192 x 8192
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 562.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 562.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 558.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 558.75000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 558.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 558.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 558.75000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 558.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 558.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 558.75000 ticks
Sub - 1D-based - Passed
...

Note: The MinGW C++ compiler outperforms the Intel C++ compiler by ~2.3 percent and
the Microsoft C++ compiler by ~9 percent.

Sergey Kostrov wrote:

Here is an example where the MinGW C++ compiler outperforms the Microsoft C++ compiler:

[ MinGW C++ compiler version 4.8.1 Release 4 ]
...
Strassen HBI
Matrix Size : 2048 x 2048
Matrix Size Threshold: 1024 x 1024
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 20.62500 secs
Strassen HBI - Pass 02 - Completed: 20.45300 secs
Strassen HBI - Pass 03 - Completed: 20.25000 secs
Strassen HBI - Pass 04 - Completed: 20.25000 secs
Strassen HBI - Pass 05 - Completed: 20.25000 secs
ALGORITHM_STRASSENHBI - Passed
Strassen HBC
Matrix Size : 2048 x 2048
Matrix Size Threshold: 256 x 256
Matrix Partitions : 400
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 235.23501 secs
Strassen HBC - Pass 02 - Completed: 20.43800 secs
Strassen HBC - Pass 03 - Completed: 20.35900 secs
Strassen HBC - Pass 04 - Completed: 20.35900 secs
Strassen HBC - Pass 05 - Completed: 20.45300 secs
ALGORITHM_STRASSENHBC - 1 - Passed
...

[ Microsoft C++ compiler VS 2008 ]
...
Strassen HBI
Matrix Size : 2048 x 2048
Matrix Size Threshold: 1024 x 1024
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 22.04600 secs
Strassen HBI - Pass 02 - Completed: 21.96900 secs
Strassen HBI - Pass 03 - Completed: 21.98500 secs
Strassen HBI - Pass 04 - Completed: 22.31200 secs
Strassen HBI - Pass 05 - Completed: 22.09400 secs
ALGORITHM_STRASSENHBI - Passed
Strassen HBC
Matrix Size : 2048 x 2048
Matrix Size Threshold: 256 x 256
Matrix Partitions : 400
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 01 - Completed: 261.70301 secs
Strassen HBC - Pass 02 - Completed: 23.73500 secs
Strassen HBC - Pass 03 - Completed: 23.68700 secs
Strassen HBC - Pass 04 - Completed: 23.68800 secs
Strassen HBC - Pass 05 - Completed: 23.64000 secs
ALGORITHM_STRASSENHBC - 1 - Passed
...

Sergey Kostrov wrote:

Here is an example where vectorization combined with software pipelining improves performance
by ~1.5 percent:

[ MinGW C++ compiler version 4.8.1 Release 4 - Vectorized ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 785.25000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 781.25000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 781.25000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 781.25000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 781.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 781.25000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 781.25000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 781.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 781.25000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 781.25000 ticks
Sub - 1D-based - Passed
...

[ MinGW C++ compiler version 4.8.1 Release 4 - Vectorized and Software Pipelined ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 777.25000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 777.50000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 773.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 777.25000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 773.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 769.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 769.50000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 769.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 773.50000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 769.50000 ticks
Sub - 1D-based - Passed
...

Sergey Kostrov wrote:

Here is an example where software pipelining improves the performance of the legacy Borland C++ compiler version 5.5.1 by ~7.2 percent:

[ Borland C++ compiler version 5.5.1 - Unrolled Loops 8-in-1 ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 976.50000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 976.75000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 980.50000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 976.50000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 976.50000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 976.50000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 976.50000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 976.50000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 976.75000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 976.50000 ticks
Sub - 1D-based - Passed
...

[ Borland C++ compiler version 5.5.1 - Software Pipelined and Rolled Loops ]
...
Matrix Size: 10240 x 10240
Processing... ( Add - 1D-based )
_TMatrixSetF::Add - Pass 01 - Completed: 910.25000 ticks
_TMatrixSetF::Add - Pass 02 - Completed: 910.25000 ticks
_TMatrixSetF::Add - Pass 03 - Completed: 910.00000 ticks
_TMatrixSetF::Add - Pass 04 - Completed: 906.25000 ticks
_TMatrixSetF::Add - Pass 05 - Completed: 910.25000 ticks
Add - 1D-based - Passed
Processing... ( Sub - 1D-based )
_TMatrixSetF::Sub - Pass 01 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 02 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 03 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 04 - Completed: 914.00000 ticks
_TMatrixSetF::Sub - Pass 05 - Completed: 918.00000 ticks
Sub - 1D-based - Passed
...

Note: Vectorization is not supported by that version of the compiler because it is too old.

Sergey Kostrov wrote:

Also, a priority boost to High or Realtime will improve performance by ~1.5 percent ( applicable to code compiled with any C++ compiler ).
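
On Windows that is simply a SetPriorityClass call around the timed runs; a minimal sketch (the callback name is illustrative):

#include <windows.h>

/* Sketch: raise the process priority class around the timed runs so
   background activity is less likely to preempt the measurement. */
static void run_with_high_priority(void (*benchmark)(void))
{
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    benchmark();                                         /* the timed work */
    SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);
}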

iliyapolak wrote:

Quote:

Sergey Kostrov wrote:

Also, a priority boost to High or Realtime will improve performance by ~1.5 percent ( applicable to code compiled with any C++ compiler ).

Hi Sergey,

Did you try disabling some hardware, like NICs, and rerunning your tests?

Sergey Kostrov wrote:

>>...Did you try disabling some hardware, like NICs, and rerunning your tests?..

No, I did not disable any hardware and I'm not going to re-run these tests. However, I'm going to post another set of performance results some time later.

emmanuel.attia wrote:

It seems that Visual Studio has enough hints to put your "kernel" (YMMWORD [ecx]) right into the instruction, which means it knows that the [ecx] pointer is aligned. It is hard to say more without the source code. But I guess that on g++ it does an additional vmovups to load the kernel register; even worse, it might be loading from somewhere that is actually not aligned; and even worse still, maybe it does this for every pack of pixels when it could do it once for the whole loop.

Maybe Visual Studio is being more aggressive with the inlining. Have you tried the kind of flags on g++ that force deep inlining (which is critical in a convolution algorithm if you wrote it as multiple functions / functors)? The sketch below shows what I mean.
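
A hypothetical example (the helper name is made up, just to illustrate):

/* Sketch: mark the per-tap helper so g++ must inline it into the
   convolution loop; alternatively, raise the inlining budget with flags
   such as -finline-functions or -finline-limit=<n>. */
static inline __attribute__((always_inline))
float tap(float sample, float coeff)
{
    return sample * coeff;
}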

Is it a good idea to use flags like "-mtune=corei7-avx" that might perform optimizations that counter yours?

iliyapolak wrote:

It seems that ecx contains a pointer to aligned data which is accessed linearly (the array index is incremented linearly), hence probably the usage of the

vmulps ymm3, ymm3, ymmword ptr [ecx]

instruction.
