Hi,

Please take a look at the code below: three versions of a routine that calculates the lengths of a set of 3-D vectors.

Let v be a vector with components x, y, z. The length l of v is

    l = sqrt((x*x) + (y*y) + (z*z))

I implemented three versions, based on scalar, SSE and AVX instructions, to compute the lengths of 90 000 000 vectors. I expected much better performance from SSE and AVX, but no. Here are the results:

=======================================
TEST 0: l = sqrt((x*x) + (y*y) + (z*z))
=======================================
  Scalar time: 0.46051
  SSE time   : 0.18613
  AVX time   : 0.19043

  Speed-up Scalar vs SSE : 2.47
  Speed-up Scalar vs AVX : 2.42

I expected a speed-up of about 4 with SSE and much more with AVX, but there is no difference between SSE and AVX.

Target architecture:

- Intel Xeon CPU E31245 @ 3.30GHz
- 4 CPU dual-core (but I only use one core)

Command line to compile:

    gcc -O3 -std=c99 -mavx main.c -o main -lm

(With the icc compiler, the SSE and AVX timings are similar too.)

And the code. Allocating memory for the SSE version:

    x = (float*)_mm_malloc(len * sizeof(float), 16);
    y = (float*)_mm_malloc(len * sizeof(float), 16);
    ....
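A side note on the allocation: the 16-byte alignment shown above is what _mm_load_ps needs, while _mm256_load_ps needs 32-byte alignment. A minimal sketch of an allocation helper that covers both cases follows; alloc_floats is a hypothetical name used only for illustration, not a function from my program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free */

/* Hypothetical helper: allocate `len` floats aligned to `alignment` bytes.
   Use 16 for _mm_load_ps / _mm_store_ps and 32 for the _mm256_ variants.
   Returns NULL if the allocation fails. */
float *alloc_floats(size_t len, size_t alignment) {
    float *p = (float *)_mm_malloc(len * sizeof(float), alignment);
    /* Sanity check: aligned loads fault on a misaligned address. */
    assert(p == NULL || ((uintptr_t)p % alignment) == 0);
    return p;
}
```

For the AVX version this would be used as x = alloc_floats(len, 32); and released with _mm_free(x); (memory from _mm_malloc should not be passed to free).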

    //----------------------------------------------------------------------------------------------------------------------
    // Scalar version: one vector per iteration.
    void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {
        for (int i = 0; i < length; i++) {
            l[i] = sqrt((x[i]*x[i]) + (y[i]*y[i]) + (z[i]*z[i]));
        }
    }
    //----------------------------------------------------------------------------------------------------------------------
    // SSE version: four vectors per iteration (arrays must be 16-byte aligned).
    void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {
        __m128 xmm0, xmm1, xmm2, xmm3;
        for (int i = 0; i < length; i += 4) {
            xmm0 = _mm_load_ps(&x[i]);
            xmm1 = _mm_load_ps(&y[i]);
            xmm2 = _mm_load_ps(&z[i]);
            xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));
            xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);
            xmm3 = _mm_sqrt_ps(xmm3);
            _mm_store_ps(&l[i], xmm3);
        }
    }
    //----------------------------------------------------------------------------------------------------------------------
    // AVX version: eight vectors per iteration (arrays must be 32-byte aligned).
    void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {
        __m256 ymm0, ymm1, ymm2, ymm3;
        for (int i = 0; i < length; i += 8) {
            ymm0 = _mm256_load_ps(&x[i]);
            ymm1 = _mm256_load_ps(&y[i]);
            ymm2 = _mm256_load_ps(&z[i]);
            ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));
            ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);
            ymm3 = _mm256_sqrt_ps(ymm3);
            _mm256_store_ps(&l[i], ymm3);
        }
    }
    //----------------------------------------------------------------------------------------------------------------------
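The SSE and AVX loops process 4 and 8 vectors per iteration, so they rely on length being a multiple of 4 and 8 (which holds for 90 000 000). For completeness, here is a rough sketch of an SSE variant that accepts any length by finishing the leftover elements one at a time. The name length_sse_any is hypothetical, and the use of unaligned loads is my assumption to keep the sketch independent of how the arrays were allocated; it is not the code I timed.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical variant of length_sse that works for any length:
   a vector loop over full groups of 4, then a scalar loop for the
   remaining 0 to 3 elements. */
void length_sse_any(const float *x, const float *y, const float *z,
                    float *l, unsigned int length) {
    assert(x && y && z && l);
    unsigned int i = 0;
    unsigned int vec_end = length - (length % 4u);  /* last multiple of 4 */

    for (; i < vec_end; i += 4) {
        /* Unaligned loads/stores: valid for any float pointer. */
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        __m128 vz = _mm_loadu_ps(&z[i]);
        __m128 sum = _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy));
        sum = _mm_add_ps(_mm_mul_ps(vz, vz), sum);
        _mm_storeu_ps(&l[i], _mm_sqrt_ps(sum));
    }

    for (; i < length; i++) {
        /* Scalar tail, using the single-element SSE square root so the
           sketch does not depend on the math library. */
        float s = (x[i] * x[i]) + (y[i] * y[i]) + (z[i] * z[i]);
        l[i] = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(s)));
    }
}
```

With 7 vectors, for example, the first 4 go through the vector loop and the last 3 through the scalar tail.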

Could you please give me some hints or suggestions to explain this? I think it is due to the 4 instructions that move data between memory and registers (i.e., the load and store instructions). What do you think?

I also ran a simpler example (adding the 3 components of each vector, again for 90 000 000 vectors) and got even smaller speed-ups:

=======================================
TEST 1: l = x + y + z
=======================================
  Scalar time: 0.61573
  SSE time   : 0.34304
  AVX time   : 0.34770

  Speed-up Scalar vs SSE : 1.79
  Speed-up Scalar vs AVX : 1.77

Any ideas?

Thanks a lot,
--
Joaquín
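P.S. To put a number on my guess about the loads and stores, here is a small sketch that turns a measured time into an effective memory bandwidth. It assumes the times above are in seconds and counts only three float loads plus one float store per vector (16 bytes), ignoring any extra traffic the cache hierarchy may add. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Bytes touched per vector: loads of x, y, z plus the store of l. */
double bytes_per_vector(void) {
    return 4.0 * (double)sizeof(float);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for n_vectors
   processed in `seconds` (assumed to be wall-clock seconds). */
double effective_gb_per_s(double n_vectors, double seconds) {
    assert(seconds > 0.0);
    return n_vectors * bytes_per_vector() / seconds / 1e9;
}

/* Print one line of the estimate for the 90 000 000 vector runs. */
void print_bandwidth(const char *label, double seconds) {
    printf("%-8s %.2f GB/s\n", label, effective_gb_per_s(90000000.0, seconds));
}
```

Calling print_bandwidth("SSE", 0.18613) and print_bandwidth("AVX", 0.19043) gives roughly 7.7 and 7.6 GB/s for TEST 0. If that is near what a single core can sustain from memory on this machine (which I have not measured), it would fit the idea that both versions are waiting on memory rather than on arithmetic.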