SSE vs AVX optimized code generation

Hi all,

Have been using templates to write efficient, natural-looking maths code that can be compiled against a variety of instruction sets and word widths. The basic idea is to write template classes that take a datatype & math-operations class as a template argument; that class implements the types and operations for a specific instruction set and layout (whether that be x87 FPU, SSE, AVX, etc.).

I have both SSE and AVX versions working well for single-element processing. However, when I extended the code to support higher-order parallelism, the compiler continues to generate great code for SSE but bogs right down for AVX, generating lots of seemingly unnecessary loads, stores and moves that it's not able to optimize out, which has a huge detrimental impact on performance.

Running ICC 13 beta, VS2012 RC, Win 7. Compiler command-line options are as follows:-

/MP /GS- /Qftz /W3 /QxAVX /Gy /Zc:wchar_t  /Zi /Ox /Ob2  /fp:fast /D "__INTEL_COMPILER=1300"  /Qip /Zc:forScope /GR /arch:AVX /Gd /Oy /Oi /MT  /EHsc /nologo  /FAs /Ot

I've attached the C++ source, plus the SSE & AVX code generated by the compiler. (NB: I've done a search-and-replace on a few symbols for clarity, but it is otherwise unmodified.)

The SSE code looks pretty much as I'd expect, aside from the compiler being smart enough to figure out that myTest2 is a constant expression and doesn't need to be loaded multiple times, and interleaving an add in with the multiplies to help IPC along a bit. Really happy to be able to write natural-looking C++ and have the compiler do the hard work here!

Unfortunately the AVX code (same compiler version, same optimization level, very similar MathOps base classes) is far from optimal: dozens of move instructions, switching between YMM and XMM registers, and so on. I'll attach the relevant snippets of the MathOps template classes below.

C++ source - see attachments for SSE & AVX MathOps base classes

template <class MathOps> class MyFilter : public MathOps
{
public:
      static void TestFunction()
      {
        _ReadWriteBarrier();            // prevent optimizations bleeding later/earlier
        vec_float myTest1;
        myTest1.m[0] = _mm_set1_ps(125.f);    // (switch these for _mm256_set1_ps for the AVX version)
        myTest1.m[1] = _mm_set1_ps(126.f);
        myTest1.m[2] = _mm_set1_ps(127.f);
        myTest1.m[3] = _mm_set1_ps(128.f);
        vec_float myTest2 = 135.f;
        myTest1 += (myTest1 * myTest2);
        static vec_float myTest_Out = myTest1;    // store it in a static so the compiler won't discard the calculations entirely
        _ReadWriteBarrier();
      }
};

int main()
{
   MyFilter<MathOps_SSEx4>::TestFunction();
//   MyFilter<MathOps_AVXx4>::TestFunction();
}

SSE code generated by compiler

;;;         vec_float myTest1;
;;;         myTest1.m[0] = _mm_set1_ps(125.f);
;;;         myTest1.m[1] = _mm_set1_ps(126.f);
;;;         myTest1.m[2] = _mm_set1_ps(127.f);
;;;         myTest1.m[3] = _mm_set1_ps(128.f);
;;;         vec_float myTest2 = 135.f;

        vmovups   xmm0, XMMWORD PTR [_2il0floatpacket.1741]     ;535.13
;;;         myTest1 += (myTest1 * myTest2);

        vmulps    xmm1, xmm0, XMMWORD PTR [_2il0floatpacket.1737] ;536.3
        vmulps    xmm2, xmm0, XMMWORD PTR [_2il0floatpacket.1738] ;536.3
        vmulps    xmm3, xmm0, XMMWORD PTR [_2il0floatpacket.1739] ;536.3
        vaddps    xmm5, xmm1, xmm1                              ;536.3
        vmulps    xmm4, xmm0, XMMWORD PTR [_2il0floatpacket.1740] ;536.3
        vaddps    xmm0, xmm2, xmm2                              ;536.3
        vaddps    xmm1, xmm3, xmm3                              ;536.3
        vaddps    xmm2, xmm4, xmm4                              ;536.3
;;;         static vec_float myTest_Out = myTest1;    // store it in a static so the compiler won't discard the calculations entirely.
        vmovups   XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A+16], xmm0 ;537.20
        vmovups   XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A+32], xmm1 ;537.20
        vmovups   XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A+48], xmm2 ;537.20
        vmovups   XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A], xmm5 ;537.20

AVX code generated by compiler
;;;
;;;         vec_float myTest1;
;;;         myTest1.m[0] = _mm256_set1_ps(125.f);

        vmovups   ymm0, YMMWORD PTR [_2il0floatpacket.1745]     ;531.3
$LN1856:

;;;         myTest1.m[1] = _mm256_set1_ps(126.f);

        vmovups   ymm2, YMMWORD PTR [_2il0floatpacket.1746]     ;532.3
$LN1857:

;;;         myTest1.m[2] = _mm256_set1_ps(127.f);

        vmovups   ymm4, YMMWORD PTR [_2il0floatpacket.1747]     ;533.3
$LN1858:

;;;         myTest1.m[3] = _mm256_set1_ps(128.f);

        vmovups   ymm6, YMMWORD PTR [_2il0floatpacket.1748]     ;534.3
$LN1859:

;;;         vec_float myTest2 = 135.f;

        vmovups   ymm7, YMMWORD PTR [_2il0floatpacket.1749]     ;535.13
$LN1860:
        vmovups   YMMWORD PTR [esp], ymm0                       ;531.3
$LN1861:
        vmovups   YMMWORD PTR [32+esp], ymm2                    ;532.3
$LN1862:
        vmovups   YMMWORD PTR [64+esp], ymm4                    ;533.3
$LN1863:
        vmovups   YMMWORD PTR [96+esp], ymm6                    ;534.3
$LN1864:

;;;         myTest1 += (myTest1 * myTest2);

        vmulps    ymm1, ymm0, ymm7                              ;536.3
$LN1865:
        vmulps    ymm3, ymm2, ymm7                              ;536.3
$LN1866:
        vmulps    ymm5, ymm4, ymm7                              ;536.3
$LN1867:
        vmulps    ymm0, ymm6, ymm7                              ;536.3
$LN1868:
        vmovups   YMMWORD PTR [128+esp], ymm1                   ;536.3
$LN1869:
        vmovups   YMMWORD PTR [160+esp], ymm3                   ;536.3
$LN1870:
        vmovups   YMMWORD PTR [192+esp], ymm5                   ;536.3
$LN1871:
        vmovups   YMMWORD PTR [224+esp], ymm0                   ;536.3
$LN1872:
        vmovups   xmm1, XMMWORD PTR [144+esp]                   ;536.3
$LN1873:
        vmovups   XMMWORD PTR [16+esp], xmm1                    ;536.3
$LN1874:
        vmovups   xmm2, XMMWORD PTR [160+esp]                   ;536.3
$LN1875:
        vmovups   XMMWORD PTR [32+esp], xmm2                    ;536.3
$LN1876:
        vmovups   xmm3, XMMWORD PTR [176+esp]                   ;536.3
$LN1877:
        vmovups   XMMWORD PTR [48+esp], xmm3                    ;536.3
$LN1878:
        vmovups   xmm4, XMMWORD PTR [192+esp]                   ;536.3
$LN1879:
        vmovups   XMMWORD PTR [64+esp], xmm4                    ;536.3
$LN1880:
        vmovups   xmm5, XMMWORD PTR [208+esp]                   ;536.3
$LN1881:
        vmovups   XMMWORD PTR [80+esp], xmm5                    ;536.3
$LN1882:
        vmovups   xmm6, XMMWORD PTR [224+esp]                   ;536.3
$LN1883:
        vmovups   XMMWORD PTR [96+esp], xmm6                    ;536.3
$LN1884:
        vmovups   xmm7, XMMWORD PTR [240+esp]                   ;536.3
$LN1885:
        vmovups   XMMWORD PTR [112+esp], xmm7                   ;536.3
$LN1886:
        vmovups   xmm0, XMMWORD PTR [128+esp]                   ;536.3
$LN1887:
        vmovups   XMMWORD PTR [esp], xmm0                       ;536.3
$LN1888:
                                ; LOE eax edx esi
.B48.3:                         ; Preds .B48.2
$LN1889:
        vmovups   ymm0, YMMWORD PTR [esp]                       ;536.3
$LN1890:
        vmovups   ymm2, YMMWORD PTR [32+esp]                    ;536.3
$LN1891:
        vmovups   ymm4, YMMWORD PTR [64+esp]                    ;536.3
$LN1892:
        vmovups   ymm6, YMMWORD PTR [96+esp]                    ;536.3
$LN1893:
        vaddps    ymm1, ymm0, ymm0                              ;536.3
$LN1894:
        vaddps    ymm3, ymm2, ymm2                              ;536.3
$LN1895:
        vaddps    ymm5, ymm4, ymm4                              ;536.3
$LN1896:
        vaddps    ymm7, ymm6, ymm6                              ;536.3
$LN1897:
        vmovups   YMMWORD PTR [128+esp], ymm1                   ;536.3
$LN1898:
        vmovups   YMMWORD PTR [160+esp], ymm3                   ;536.3
$LN1899:
        vmovups   YMMWORD PTR [192+esp], ymm5                   ;536.3
$LN1900:
        vmovups   YMMWORD PTR [224+esp], ymm7                   ;536.3
$LN1901:
        vmovups   xmm0, XMMWORD PTR [144+esp]                   ;536.3
$LN1902:
        vmovups   XMMWORD PTR [16+esp], xmm0                    ;536.3
$LN1903:
        vmovups   xmm1, XMMWORD PTR [160+esp]                   ;536.3
$LN1904:
        vmovups   XMMWORD PTR [32+esp], xmm1                    ;536.3
$LN1905:
        vmovups   xmm2, XMMWORD PTR [176+esp]                   ;536.3
$LN1906:
        vmovups   XMMWORD PTR [48+esp], xmm2                    ;536.3
$LN1907:
        vmovups   xmm3, XMMWORD PTR [192+esp]                   ;536.3
$LN1908:
        vmovups   XMMWORD PTR [64+esp], xmm3                    ;536.3
$LN1909:
        vmovups   xmm4, XMMWORD PTR [208+esp]                   ;536.3
$LN1910:
        vmovups   XMMWORD PTR [80+esp], xmm4                    ;536.3
$LN1911:
        vmovups   xmm5, XMMWORD PTR [224+esp]                   ;536.3
$LN1912:
        vmovups   XMMWORD PTR [96+esp], xmm5                    ;536.3
$LN1913:
        vmovups   xmm6, XMMWORD PTR [240+esp]                   ;536.3
$LN1914:
        vmovups   XMMWORD PTR [112+esp], xmm6                   ;536.3
$LN1915:
        vmovups   xmm7, XMMWORD PTR [128+esp]                   ;536.3
$LN1916:
        vmovups   XMMWORD PTR [esp], xmm7                       ;536.3
$LN1917:
                                ; LOE eax edx esi
.B48.4:                         ; Preds .B48.3
$LN1918:

;;;         static vec_float myTest_Out = myTest1;    // store it in a static so the compiler won't discard the calculations entirely.

        movzx     ecx, BYTE PTR [??_B?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@3@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@51] ;537.20
$LN1919:
        bts       ecx, 0                                        ;537.20
$LN1920:
        jc        .B48.6        ; Prob 40%                      ;537.20
$LN1921:
                                ; LOE eax edx ecx esi
.B48.5:                         ; Preds .B48.4
$LN1922:
        mov       BYTE PTR [??_B?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@3@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@51], cl ;537.31
$LN1923:
        vmovups   xmm0, XMMWORD PTR [16+esp]                    ;537.20
$LN1924:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+16], xmm0 ;537.20
$LN1925:
        vmovups   xmm1, XMMWORD PTR [32+esp]                    ;537.20
$LN1926:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+32], xmm1 ;537.20
$LN1927:
        vmovups   xmm2, XMMWORD PTR [48+esp]                    ;537.20
$LN1928:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+48], xmm2 ;537.20
$LN1929:
        vmovups   xmm3, XMMWORD PTR [64+esp]                    ;537.20
$LN1930:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+64], xmm3 ;537.20
$LN1931:
        vmovups   xmm4, XMMWORD PTR [80+esp]                    ;537.20
$LN1932:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+80], xmm4 ;537.20
$LN1933:
        vmovups   xmm5, XMMWORD PTR [96+esp]                    ;537.20
$LN1934:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+96], xmm5 ;537.20
$LN1935:
        vmovups   xmm6, XMMWORD PTR [112+esp]                   ;537.20
$LN1936:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+112], xmm6 ;537.20
$LN1937:
        vmovups   xmm7, XMMWORD PTR [esp]                       ;537.20
$LN1938:
        vmovups   XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A], xmm7 ;537.20


Further to this:-

It seems that when a simple, temporary object can be entirely represented by the state of XMM registers, the compiler is able to recognize this and doesn't store the object on the stack unless it needs to (i.e. it runs out of registers). If, however, the object's state can only be represented by the YMM registers, the compiler is rather more conservative - it will store the object to the stack, and reload it the next time it's used.

Moving to the final release of the v13 compiler seems to have fixed the problems; AVX is now performing up to spec. Happy to share the latest version of the project if any compiler people are interested.

Hi,

I'm happy to hear that the last version of the compiler fixed the issue for you :-)

Alex

This looks similar to an issue I experienced with dvec.h, whereby operator functions (e.g. +-*/) lost the ability to inline. I could not ascertain the triggering condition. As a "hack" workaround I produced an inline_dvec.h that expressly added the __forceinline attribute to the operator functions and other pertinent member functions. You might try the same with emmintrin.h and smmintrin.h (i.e. make inline_emmintrin.h, ...).

Note, I did not experience the intermittent failure to inline for the SSE expansions, but did for the AVX expansions (at least in dvec.h).

Attachments:

inline_dvec.h.txt (85.05 KB)

www.quickthreadprogramming.com

It wasn't exactly an inlining failure (I don't see calls being generated); it's more that the compiler seemed to lose the ability to keep members of temporary structs in registers.

Interestingly, I found I got a substantial performance boost by changing the way the template gets invoked. Originally the code was written with the 'interleave' variable in the mathops class set as a compile-time constant, which I'd edit before building each run:-

class Mathops_SSExN
{
public:
static const int interleave = 8;

However, changing it to a template argument produced a moderate speedup for Intel's (already fast) compiler, a drastic speedup for GCC and LLVM-Clang... and none for Microsoft's slow one, which is really a non-starter when it comes to loop unrolling.

template <int intrlv> class Mathops_SSExN
{
public:
static const int interleave = intrlv;

I've been collecting some benchmarks for a whole raft of compilers and CPUs, thought the results might be interesting. The AVX results will look better still on a Haswell, they suffer a bit as AVX1 lacks all but the most basic integer operations. Looking forward to trying this stuff out on a Phi too one of these days.

Anyway - here are the results - happy to share source code with interested parties.

Best,
Angus.

==================
Compiler Tests
==================

With 6th September, 20:00 code base. Mac.
Added Windows results for same codebase.
Haven't checked output for validity.

Data set size: 32smps/channel (= 256 bytes per channel)
All times shown are for 5000x32 iterations

=================================================
Xeon Harpertown Penryn @ 2.8GHz (Mac Pro early 2008)
L1d 32k @ latency 4
L2 6M @ latency 15

=================================================
Mac OS X 10.6
=================================================

========
GCC 4.2 -O3 -msse4.1
Elapsed time with interleave 1: 36.514999 ms.
Elapsed time with interleave 2: 58.163998 ms.
Elapsed time with interleave 4: 62.850998 ms.
Elapsed time with interleave 8: 448.190979 ms.
Elapsed time with interleave 16: 1283.882935 ms.
========
GCC 4.8 -O3 -msse4.1
Elapsed time with interleave 1: 34.779999 ms.
Elapsed time with interleave 2: 39.860996 ms.
Elapsed time with interleave 4: 54.723061 ms.
Elapsed time with interleave 8: 145.048996 ms.
Elapsed time with interleave 16: 767.032043 ms.
========
clang 3.2 -O3 -msse4.1
Elapsed time with interleave 1: 29.199001 ms.
Elapsed time with interleave 2: 42.349998 ms.
Elapsed time with interleave 4: 62.494999 ms.
Elapsed time with interleave 8: 78.683998 ms. <== equivalent to 39ms at 4x - better throughput than many ICC results!
Elapsed time with interleave 16: 889.553955 ms.
========
Intel v12 x86 -O3
Elapsed time with interleave 1: 29.342999 ms.
Elapsed time with interleave 2: 104.820007 ms.
Elapsed time with interleave 4: 185.223007 ms.
Elapsed time with interleave 8: 186.266998 ms.
Elapsed time with interleave 16: 370.201019 ms.
========
Intel v12 x86 -O2
Elapsed time with interleave 1: 29.510002 ms.
Elapsed time with interleave 2: 34.452003 ms.
Elapsed time with interleave 4: 50.379002 ms.
Elapsed time with interleave 8: 108.731003 ms.
Elapsed time with interleave 16: 861.251038 ms.
========
Intel v12 x64 -O3 <== Atypical result at 1x-4x. O3 is usually bad with ICC on Mac
Elapsed time with interleave 1: 25.794003 ms.
Elapsed time with interleave 2: 28.842001 ms.
Elapsed time with interleave 4: 35.980000 ms.
Elapsed time with interleave 8: 142.888000 ms.
Elapsed time with interleave 16: 341.058990 ms.
========
Intel v12 x64 -O2
Elapsed time with interleave 1: 25.813000 ms.
Elapsed time with interleave 2: 32.116997 ms.
Elapsed time with interleave 4: 42.858997 ms.
Elapsed time with interleave 8: 85.279999 ms.
Elapsed time with interleave 16: 907.706787 ms.
========
Intel v13 x86 -O3
Elapsed time with interleave 1: 29.175999 ms.
Elapsed time with interleave 2: 52.771999 ms.
Elapsed time with interleave 4: 96.564003 ms.
Elapsed time with interleave 8: 189.548996 ms.
Elapsed time with interleave 16: 374.730011 ms.
========
Intel v13 x86 -O2
Elapsed time with interleave 1: 29.247997 ms.
Elapsed time with interleave 2: 34.834999 ms.
Elapsed time with interleave 4: 50.506996 ms.
Elapsed time with interleave 8: 110.920998 ms.
Elapsed time with interleave 16: 858.751892 ms.
========
Intel v13 x64 -O3
Elapsed time with interleave 1: 25.579000 ms.
Elapsed time with interleave 2: 42.453999 ms.
Elapsed time with interleave 4: 75.388992 ms.
Elapsed time with interleave 8: 143.520966 ms.
Elapsed time with interleave 16: 336.766022 ms.
========
Intel v13 x64 -O2
Elapsed time with interleave 1: 25.570000 ms.
Elapsed time with interleave 2: 32.951000 ms.
Elapsed time with interleave 4: 43.009998 ms.
Elapsed time with interleave 8: 90.303001 ms.
Elapsed time with interleave 16: 916.098083 ms.

=================================================
Windows 7
=================================================

=================================================
Windows - Intel compiler with O2, SSE, x86
Elapsed time with interleave 1: 23.905266 ms.
Elapsed time with interleave 2: 32.720844 ms.
Elapsed time with interleave 4: 51.977993 ms.
Elapsed time with interleave 8: 115.710083 ms.
Elapsed time with interleave 16: 935.244263 ms.

=================================================
Windows - Intel compiler with O3, SSE, x86 <== O3 good on Windows
Elapsed time with interleave 1: 23.944492 ms.
Elapsed time with interleave 2: 27.581081 ms.
Elapsed time with interleave 4: 36.696926 ms. <-- very quick
Elapsed time with interleave 8: 88.766212 ms. <-- not quite as quick - register spills
Elapsed time with interleave 16: 757.163635 ms.

=================================================
MS compiler with Ox (microsoft, you need to unroll those loops)
Elapsed time with interleave 1: 33.599281 ms.
Elapsed time with interleave 2: 161.689331 ms.
Elapsed time with interleave 4: 226.229095 ms.
Elapsed time with interleave 8: 611.853149 ms.
Elapsed time with interleave 16: 1201.150269 ms.

=================================================
Windows - Intel compiler with O2, SSE, x64
Elapsed time with interleave 1: 24.809738 ms.
Elapsed time with interleave 2: 29.801746 ms.
Elapsed time with interleave 4: 41.776932 ms.
Elapsed time with interleave 8: 86.854980 ms.
Elapsed time with interleave 16: 942.762207 ms.

=================================================
Windows - Intel compiler with O3, SSE, x64
Elapsed time with interleave 1: 22.946529 ms.
Elapsed time with interleave 2: 26.854057 ms.
Elapsed time with interleave 4: 34.143726 ms. <-- very quick
Elapsed time with interleave 8: 67.131470 ms. <-- very very quick.
Elapsed time with interleave 16: 642.396606 ms.

=================================================

=================================================
Ultrabook i5 2557m @ 1.7 turbo 2.7GHz (Asus UX31)
L1d 32k @ latency 3
L2 256k @ latency 8

=================================================
Windows 7
=================================================

Windows - Intel compiler with O2 optimizations, SSE+AVX, x86
=== Running tests with SSE instruction set
Elapsed time with interleave 1: 24.168627 ms.
Elapsed time with interleave 2: 32.396488 ms. (vs 32ms for 2.8GHz Harpertown Xeon)
Elapsed time with interleave 4: 43.116436 ms. (vs 51ms for 2.8GHz Harpertown Xeon)
Elapsed time with interleave 8: 81.263794 ms. (vs 115ms for 2.8GHz Harpertown Xeon)
Elapsed time with interleave 16: 814.660339 ms.

=== Running tests with AVX instruction set (double the effective throughput)
Elapsed time with interleave 1: 26.812241 ms.
Elapsed time with interleave 2: 36.414734 ms.
Elapsed time with interleave 4: 56.105698 ms. <== winner, equiv to 28ms SSE execution time
Elapsed time with interleave 8: 167.520798 ms.
Elapsed time with interleave 16: 1161.347412 ms.

Windows - Intel compiler with O3 optimizations, SSE+AVX, x86
=== Running tests with SSE instruction set
Elapsed time with interleave 1: 23.452637 ms.
Elapsed time with interleave 2: 30.544933 ms.
Elapsed time with interleave 4: 39.956047 ms.
Elapsed time with interleave 8: 73.113815 ms.
Elapsed time with interleave 16: 551.361511 ms.

=== Running tests with AVX instruction set
Elapsed time with interleave 1: 25.353693 ms.
Elapsed time with interleave 2: 32.536549 ms.
Elapsed time with interleave 4: 47.269306 ms. <== winner, equiv. to 23.5ms SSE execution time
Elapsed time with interleave 8: 137.423828 ms.
Elapsed time with interleave 16: 880.159607 ms.
