Comparing scalar, SSE and AVX basics....poor performances ??

Comparing scalar, SSE and AVX basics....poor performances ??

Hi, Please, look at these pieces of code, consisting of three versions to calculate the length of a set of 3-D vectors. Let's assume, vector v, with components x, y, z. The length (l) of vector v, is l = sqrt((x*x) + (y*y) + (z*z)) I implemented three versions based on scalar, SSE and AVX instruction, to compute the length of 90 000 000 vectors. I hope to get much better performance using SSE, and AVX, but no...., here the results: ======================================= TEST 0: l = sqrt((x*x) + (y*y) + (z*z)) ======================================= Scalar time: 0.46051 SSE time : 0.18613 AVX time : 0.19043 Speed-up Scalar vs SSE : 2.47 Speed-up Scalar vs AVX : 2.42 I hope a speed-up of 4 when using SSE, a much more with AVX, but there is no difference between SSE and AVX.

Target architecture:

• Intel Xeon CPU E31245 @ 3.30GHz
• 4 CPU dual-core (but I only use one core)
Command line to compile: gcc -O3 -std=c99 -mavx main.c -o main -lm And the code: Allocating memory for the SSE version: x = (float*)_mm_malloc(len * sizeof(float), 16); y =(float*)_mm_malloc(len * sizeof(float), 16); .... Allocating memory for the AVX version: x = (float*)_mm_malloc(len * sizeof(float), 32); y =(float*)_mm_malloc(len * sizeof(float), 32); ....

//---------------------------------------------------------------------------------------------------------------------- void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) { for (int i = 0; i l[i] = sqrt((x[i]*x[i]) + (y[i]*y[i]) + (z[i]*z[i])); } } //---------------------------------------------------------------------------------------------------------------------- void length_sse(float *x, float *y, float *z, float *l, unsigned int length) { __m128 xmm0, xmm1, xmm2, xmm3; for (int i = 0; i xmm0 = _mm_load_ps(&x[i]); xmm1 = _mm_load_ps(&y[i]); xmm2 = _mm_load_ps(&z[i]); xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1)); xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3); xmm3 = _mm_sqrt_ps(xmm3); _mm_store_ps(&l[i], xmm3); } } //----------------------------------------------------------------------------------------------------------------------
void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {

__m256 ymm0, ymm1, ymm2, ymm3; for (int i = 0; i ymm0 = _mm256_load_ps(&x[i]); ymm1 = _mm256_load_ps(&y[i]); ymm2 = _mm256_load_ps(&z[i]); ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1)); ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3); ymm3 = _mm256_sqrt_ps(ymm3); _mm256_store_ps(&l[i], ymm3); } //---------------------------------------------------------------------------------------------------------------------- Could you, please, give me some hints, suggestions....to explain that? I think it is due to the 4 instructions to move data (memory /register, i.e., the load and store instructions), what do you think? If I ran a example more simple (addition of the 3 components of a vector, for 90 000 000 vectors) and I got worse results: ======================================= TEST 1: l = x + y + z ======================================= Scalar time: 0.61573 SSE time : 0.34304 AVX time : 0.34770 Speed-up Scalar vs SSE : 1.79 Speed-up Scalar vs AVX : 1.77 Any idea? Thanks a lot -- Joaqun

22 posts / 0 new
Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.

if you do that with 90M vectors your application is more a memory bandwidth testthan anything else I'll say

what I'llsuggest you to do totest the speed of the kernel in isolation:
- work with a small working set that fit in the L1D cache, for example 1000 vectors only
- repeat the test a lof of time with an outer loop (for example executed1 million times) to have accurate timings

then the speedupvs the scalar version shouldbe better (provided that the scalar version is really scalar, i.e. not vectorized by the compiler),forAVX vs SSE the botlleneckis clearly the(high latency / low throughput) sqrt whichhas the same throughput on Sandy Bridge with SSE and AVX-256, Ivy Bridge will enjoy a better speedup heresince the throughput is doubled for AVX-256

if youtry a fast sqrt approximation such as _mm256_mul_ps(m,_mm256_rsqrt_ps(m)) you should see a betterAVX-256vs SSE speedupon Sandy Bridge

now this code is heavily load port bound apparently so I'll not expect better than 1.3x speedup from AVX-256 vs. SSE even with a workload that fit at 100% in theL1D cache

just one more thing, you can also help your compiler a bit by using the const keyword when it applies such as :

void length_sse(const float *x, const float *y, const float *z, float *l, unsigned int length)

EDIT : I see that you use in fact AVX-128 (!) so the first thing to do is to switch to AVX-256 to hope for anyspeedup, i.e. use _mm256_load_ps, _mm256_mul_ps etc.

As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256. Given that the SSE parallel sqrt is reasonably efficient, it may seem a backward step to use the iterative method in an attempt to improve throughput on AVX-256.

In practice on Sandy Bridge a 2nd order Newton-Raphson is clearlyfaster for AVX-256 (1 rsqrt + 5 mul + 1 sub instead of sqrt with its 28 clock rcp throughput), so something to consider using if it's in a hotspot

it's even more important when you normalize vectors (a very common use case for 3D applications) to useNewton-Raphson since in this case it's 1 rsqrt + 4 mul + 1 sub instead of 1 sqrt + 1 div (i.e. a chain of very high latency / low throughput instructions)

OP didn't even tell us whether gcc is generating AVX-256 instructions, and we can't see his source code. I'm skeptical whether the _mm_malloc(...,32) would be sufficient to push gcc into that mode. Even if it did so, OP seems to be ignoring the architectural requirement for the splitting of AVX-256 memory accesses as well as sqrt.

note that he has now updated his code with an AVX-256 path, the previous version was with 2x the SSE path probably due to a copy&paste error, isn't it Joaquin?

one more thing Joaquin:

the way you name your variables is very confusing, the code generator automatically choose the register it wants to use so it may well use ymm6 in 32-bit and ymm13 in 64-bit mode for what you call "ymm0" in your code, it may be confusing if you have to do low level debugging and it's not very readable for the maintener of this code

I'll advise to use another notation such as :

"px" for "packed x", other ideas include"ox" for "octo x", etc. your code will then look like my example below, for any complex project I'll advise to use at least operator overloading if your vectorizer isn't up to the task

inline __m256 Sqr(const __m256 &px) {return _mm256_mul_ps(px,px);}
void length_avx(const float *x, const float *y, const float *z, float *l, unsigned int length)

{

for (unsigned int i=0; i
{

}

}

which splitting of memory accesses are you refering to ? it looks alright to me like this for 32B aligned arrays

Thanks for your comments and suggestions, and for the included code (bronxzv) If I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024), I get a speed-up of 4 (aprox.) for, both, SSE and AVX (see results below). TimP (Intel) commented: "As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256". It couldexplain why there's no difference between SSE and AVX in test0 (sqrt), but in test1 (where only three additions are performed) the speed-up is the same for AVX and SSE ! By executing "objdump -S ", I checked the assembly instructions are AVX, vaddps, vmovaps,vmulps, vsqrtps..., for both the SSE and the AVX functions, the difference cames from the inner loop 'step/stride' value, 4 for SSE and 8 for AVX. ======================================= TEST 0: l = sqrt((x*x) + (y*y) + (z*z)) ======================================= Seq time: 3.681436e-01 SSE time: 9.068346e-02 AVX time: 9.062290e-02 Speed-up Seq vs SSE : 4.06 Speed-up Seq vs AVX : 4.06 ======================================= TEST 1: l = x + y + z ======================================= Seq time: 3.898194e-01 SSE time: 1.120391e-01 AVX time: 1.076577e-01 Speed-up Seq vs SSE : 3.48 Speed-up Seq vs AVX : 3.62

it's true, sorry !

If I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024),

it's not perfectly clear reading this that you work with a L1 cache-blocked buffer (i.e. that you do the same computations a lot of times redundantly to measure the timings withhigh L1D hit %), I'll advise to post full source code

sorry, see next comment!

Exactly. What I want to know is why there's no difference between SSE and AVX for two simple functions, the first calculates sqrt((x*x) + (y*y) + (z*z)), and the second calculates x + y + z, in both cases, I get the same speed-up for SSE and AVX, when AVX speed-up should be two times SSE speed-up, right? Playing with the 'chunk' value (see code below), when chunk > 8 * 1024, the speed-up decreases from around 4.19 to around 2.33 len = 90000000; chunk = 1024; // AVX x = (float*)_mm_malloc(chunk * sizeof(float), 32); y = (float*)_mm_malloc(chunk * sizeof(float), 32); z = (float*)_mm_malloc(chunk * sizeof(float), 32); l = (float*)_mm_malloc(chunk * sizeof(float), 32); for(int j = 0; j < chunk; j++) { x[j] = j*1.0; y[j] = j*2.0; z[j] = j*3.0; } partial_t = tic(); for (int i = 0; i < len; i += chunk) { add_avx1(x, y, z, l, chunk); } avx_t += toc(partial_t); _mm_free(x); _mm_free(y); _mm_free(z); _mm_free(l); Thanks again for your comments

for big chunks you are clearly memory bandwidth bound, it's normal thatyou see drastic changes when you overflow the L1D cache (chunk > ~ 2000), then the L2 cache (chunk> ~16000) and eventually the LLC (not sure about your Xeon LLC capacity)

now, for chunk = 1000 you should see better speedups from AVX-256 vs SSE for the cases not using VSQRTPS, I'll expect something like 1.3x speedup

to ensure good timings I'll advise to disable enhanced speedstep and the turbo mode

now that we have agreed on the test procedure I suggest to post an ASM dump of the code of the two inner loops you are comparing, the SSE and AVX-256 version of the simple 3 x add case

That's right, bronxzv. But why I can't get the ideal speed-up of 8 (or close) for AVX, the maximum, I had, was 3.97, too far, even when I run with little chunks (such as 32, 64, 128, 256, 512)? Using SSE, I'm close to the ideal speed-up of 4 with chunks of 512.

Here you are (sorry but I don't know how to put it in a fancy format like you do): 00000000004020a0 : 4020a0: 45 85 c0 test %r8d,%r8d 4020a3: 74 2c je 4020d1 4020a5: 31 c0 xor %eax,%eax 4020a7: 45 31 c9 xor %r9d,%r9d 4020aa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 4020b0: c5 f8 28 04 07 vmovaps (%rdi,%rax,1),%xmm0 4020b5: 41 83 c1 04 add \$0x4,%r9d 4020b9: c5 f8 58 04 06 vaddps (%rsi,%rax,1),%xmm0,%xmm0 4020be: c5 f8 58 04 02 vaddps (%rdx,%rax,1),%xmm0,%xmm0 4020c3: c5 f8 29 04 01 vmovaps %xmm0,(%rcx,%rax,1) 4020c8: 48 83 c0 10 add \$0x10,%rax 4020cc: 45 39 c8 cmp %r9d,%r8d 4020cf: 77 df ja 4020b0 4020d1: f3 c3 repz retq 4020d3: 66 66 66 66 2e 0f 1f data32 data32 data32 nopw %cs:0x0(%rax,%rax,1) 4020da: 84 00 00 00 00 00 00000000004020e0 : 4020e0: 55 push %rbp 4020e1: 48 89 e5 mov %rsp,%rbp 4020e4: 48 83 e4 e0 and \$0xffffffffffffffe0,%rsp 4020e8: 48 83 c4 10 add \$0x10,%rsp 4020ec: 45 85 c0 test %r8d,%r8d 4020ef: 74 30 je 402121 4020f1: 31 c0 xor %eax,%eax 4020f3: 45 31 c9 xor %r9d,%r9d 4020f6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 4020fd: 00 00 00 402100: c5 fc 28 04 07 vmovaps (%rdi,%rax,1),%ymm0 402105: 41 83 c1 08 add \$0x8,%r9d 402109: c5 fc 58 04 06 vaddps (%rsi,%rax,1),%ymm0,%ymm0 40210e: c5 fc 58 04 02 vaddps (%rdx,%rax,1),%ymm0,%ymm0 402113: c5 fc 29 04 01 vmovaps %ymm0,(%rcx,%rax,1) 402118: 48 83 c0 20 add \$0x20,%rax 40211c: 45 39 c8 cmp %r9d,%r8d 40211f: 77 df ja 402100 402121: c9 leaveq 402122: c5 f8 77 vzeroupper 402125: c3 retq 402126: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 40212d: 00 00 00

a big limiter is the fact that the 2 load ports and the store port are only 128-bit wide, it's a perfetct match for SSE and AVX-128 but usually a bottleneck for AVX-256

I note that you are effectively not comparing SSE with AVX but AVX-128 with AVX-256.

I don't see obvious problems with your code but since youhave only 2 fast computation instructionsfor 3 loads+ 1 store I'm afraid the load/store bottleneck I was refering to is the cause of the deceptive speedup

out of curiosity I'll study the timings on my own

if you want more convincing speedup from AVX-256 you can try to do more work on registers within the same loop, for example compute a bounding box for your vectors, you'll have typically instructions like

vminps ymm0,ymm6
vmaxps ymm0,ymm7

in my case it's the use case with the best observed speedup, more than 1.8 x for AVX-256 vs SSE

btw to add code snippets to this forum without going crazy simply write it in your favorite text editor then click on the icon with the orange pen then paste it in the edit box and specifythe syntax, for ex. C++

I have tested yoursimple add example and I find a very good 2x speedup(AVX-256 vs AVX-128) for the workloads fitting in the L1D cache, that's a lot better than I was expecting, when the L1D cache is overflowed timings are sometimes worse for AVX-256 than for AVX-128, though, then for big workloads timings are much the same (as expected) since we are mostly RAM bandwidth bound

my timings are as follows:

Core i7 3770K@ 3.5 GHz, enhanced speedstep disabled, turbo off
woking set size: AVX-128 time AVX-256 time

128: 22.5 ms 12.9 ms

256: 19.3 ms 11.2 ms

512: 19.4 ms 9.63 ms

1024: 19.3 ms 9.84 ms

2048: 19.2 ms 9.72 ms

4096: 20.8 ms 10.1 ms

8192: 20.1 ms 10.7 ms

16384: 19.7 ms 10.1 ms

32768: 19.5 ms 17.6 ms

65536: 24.5 ms 28.8 ms

131072: 23.6 ms 28.3 ms

262144: 28.4 ms 34 ms

524288: 38.8 ms 40.6 ms

1048576: 39 ms 41.5 ms

2097152: 39.1 ms 41.2 ms

4194304: 41 ms 43.1 ms

8388608: 53.6 ms 50.8 ms

16777216: 94.6 ms 85.9 ms

33554432: 110 ms 109 ms

67108864: 115 ms 113 ms

134217728: 118 ms 113 ms

source code:

template

inline T *AAlloc(size_t size)

{

return (T *)_aligned_malloc(sizeof(T)*size,32);

}
inline void AFree(void *p)

{

if (p) _aligned_free(p);

}
void AddTestAVX128(const float *x, const float *y, const float *z, float *l, unsigned int length)

{

for (unsigned int i=0; i
{

}

}
void AddTestAVX256(const float *x, const float *y, const float *z, float *l, unsigned int length)

{

for (unsigned int i=0; i
{

}

}
void JTTest(int chunkSize)

{

const int len = 90000000;
float *x = AAlloc(chunkSize), *y = AAlloc(chunkSize), *z = AAlloc(chunkSize), *l = AAlloc(chunkSize);

for (int j=0; j
Chrono chrono("");

const float start = chrono.getTime();

for (int i=0; i
const float t128 = chrono.getTime()-start;

for (int i=0; i
const float t256 = chrono.getTime()-t128;

(void)DS.width(9); DS.precision(3);

DS << chunkSize*sizeof(float)*4 << ": " << t128 << " ms " << t256 << " msn";

AFree(x); AFree(y); AFree(z); AFree(l);

}
// main call:
for (int chunkSize=8; chunkSize<10000000; chunkSize<<=1) JTTest(chunkSize);

ASM dumps :

.B51.3::                        ; Preds .B51.3 .B51.2
;;;   {

vmovups   xmm0, XMMWORD PTR [rcx+r10*4]                 ;478.35

vaddps    xmm1, xmm0, XMMWORD PTR [rdx+r10*4]           ;479.33

vaddps    xmm2, xmm1, XMMWORD PTR [r8+r10*4]            ;479.22

vmovups   XMMWORD PTR [r9+r10*4], xmm2                  ;479.18

mov       r10d, eax                                     ;476.36

cmp       eax, r11d                                     ;476.28

jb        .B51.3        ; Prob 82%                      ;476.28

.B52.3::                        ; Preds .B52.3 .B52.2
;;;   {

vmovups   ymm0, YMMWORD PTR [rcx+r10*4]                 ;487.38

vaddps    ymm1, ymm0, YMMWORD PTR [rdx+r10*4]           ;488.39

vaddps    ymm2, ymm1, YMMWORD PTR [r8+r10*4]            ;488.25

vmovups   YMMWORD PTR [r9+r10*4], ymm2                  ;488.21

mov       r10d, eax                                     ;485.36

cmp       eax, r11d                                     ;485.28

jb        .B52.3        ; Prob 82%                      ;485.28

I think it's the best speed-up we can get.
Thanks a lot, bronxzv, for all your time and posts,

nothing, actually I learned something from this experiment: loads/stores are far less a bottleneck than I was expecting and AVX-256 can be actually slower than AVX-128 for some working set sizes!

sinceyou basically test the same example you should be able to see the save nice speedup for small chunks (chunk size =~ 1000, working set =~ 16 KB), i.e. nearly a 8x speedup vs. your scalar path

all of this show very well, one more time,how important it is to use cache blocking techniques whenever it's possible

Note: on Sandy Bridge one can recover and improve 256-bit loads performance that are missing L1 vs 128-bit ones (256-bit loads are indeed slower than 2x128-bit loads on Sandy Bridge when missing L1, especially if the data is actually misalgned) by issuing prefetch'es to the cache lines before loads

-Max