Problems encountered during vectorization of code using SSE intrinsics

I have been struggling to vectorize a particular application for
some time now and have tried everything from auto-vectorization to
hand-coded SSE intrinsics, but I am still unable to obtain any speedup on
my stencil-based application.

The following is a snippet of my current code, which I have vectorized using SSE intrinsics.

//#pragma ivdep
for ( i = STENCIL; i < z - STENCIL; i+=4 )
{
    it = it2 + i;
    __m128 center = _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i);

    u_j4 = _mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]); //Line 180
    u_j3 = _mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]);
    u_j2 = _mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]);
    u_j1 = _mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]);
    u_j8 = _mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k]);
    u_j7 = _mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k]);
    u_j6 = _mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k]);
    u_j5 = _mm_load_ps(&p2[i+j*it_j+it_j +k*it_k]);

    __m128 tmp2i = _mm_mul_ps(_mm_add_ps(u_j4,u_j8),X4_i);
    __m128 tmp3 = _mm_mul_ps(_mm_add_ps(u_j3,u_j7),X3_i);
    __m128 tmp4 = _mm_mul_ps(_mm_add_ps(u_j2,u_j6),X2_i);
    __m128 tmp5 = _mm_mul_ps(_mm_add_ps(u_j1,u_j5),X1_i);

    __m128 tmp6 = _mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5));
    __m128 tmp7 = _mm_add_ps(tmp6,center);

    _mm_store_ps(&tmp2[i],tmp7); //Line 196
}

When I compile the above code with icc and without #pragma ivdep, I get the following messages:

remark: loop was not vectorized: existence of vector dependence.
vector dependence: assumed FLOW dependence between tmp2 line 196 and tmp2 line 196.
vector dependence: assumed ANTI dependence between tmp2 line 196 and tmp2 line 196.

When I compile it with icc and with #pragma ivdep, I get the following message, reported for line 180:

remark: loop was not vectorized: unsupported data type.

Why is there a dependence suggested for Line 196? How can I eliminate the suggested vector dependence?


Thomas Willhalm (Intel)

When you have already vectorized the code with intrinsics, the compiler cannot auto-vectorize it: you have already done the vectorization yourself. If you want to investigate auto-vectorization, you have to go back to the scalar version.
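
For reference, a scalar version of the loop might look like the sketch below. It is only an illustration under several assumptions that the original post does not confirm: that it2 equals j*it_j + k*it_k (passed here as the hypothetical parameter base), that it_j2, it_j3 and it_j4 are 2, 3 and 4 times it_j, that STENCIL is 4, and that the input and output arrays never overlap. The function name and signature are invented for the example. The restrict qualifiers are one common way to tell the compiler that the arrays do not alias, which a scalar loop often needs before the auto-vectorizer will accept it.

```c
#include <stddef.h>

#define STENCIL 4 /* assumed half-width: four neighbour pairs in j */

/* Hypothetical scalar form of the j-direction stencil for one (j, k) row.
 * base is assumed to be j*it_j + k*it_k and must be at least 4*it_j.
 * restrict promises that out and in do not overlap. */
void stencil_row_scalar(float *restrict out, const float *restrict in,
                        size_t base, size_t it_j, size_t z,
                        float c0, float x1, float x2, float x3, float x4)
{
    for (size_t i = STENCIL; i + STENCIL < z; ++i) {
        size_t c = base + i; /* index of the centre point */
        out[i] = c0 * in[c]
               + x1 * (in[c - it_j]     + in[c + it_j])
               + x2 * (in[c - 2 * it_j] + in[c + 2 * it_j])
               + x3 * (in[c - 3 * it_j] + in[c + 3 * it_j])
               + x4 * (in[c - 4 * it_j] + in[c + 4 * it_j]);
    }
}
```

With a loop in this form, the compiler's vectorization report refers to plain float loads and stores rather than to __m128 values, so its remarks should be easier to act on.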

jimdempseyatthecove

Consider writing your SSE intrinsics to use fewer temporaries and to interleave the loads with the multiplies.

Jim Dempsey

www.quickthreadprogramming.com
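
One possible reading of the suggestion to use fewer temporaries and interleave loads with multiplies is sketched below. It is not code from the thread. It assumes the same layout as the question (base standing in for j*it_j + k*it_k, neighbour offsets that are multiples of it_j, STENCIL equal to 4), and the function name and parameters are hypothetical. It uses unaligned loads and stores because the alignment of the arrays is not known from the post, and it adds a scalar tail for rows whose length is not a multiple of four.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

#define STENCIL 4 /* assumed half-width: four neighbour pairs in j */

/* Hypothetical SSE version: each neighbour pair is loaded, summed, scaled
 * and folded into a single accumulator right away, instead of keeping
 * eight loaded vectors and several partial sums alive at once.
 * x[0..3] hold the coefficients for distances 1..4. */
void stencil_row_sse(float *out, const float *in, size_t base, size_t it_j,
                     size_t z, float c0, const float x[4])
{
    const __m128 vc0 = _mm_set1_ps(c0);
    const __m128 vx1 = _mm_set1_ps(x[0]), vx2 = _mm_set1_ps(x[1]);
    const __m128 vx3 = _mm_set1_ps(x[2]), vx4 = _mm_set1_ps(x[3]);
    size_t i = STENCIL;

    /* Vector body: only runs while four full outputs fit before z - STENCIL. */
    for (; i + STENCIL + 4 <= z; i += 4) {
        const float *p = in + base + i;
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(p), vc0);
        acc = _mm_add_ps(acc, _mm_mul_ps(
            _mm_add_ps(_mm_loadu_ps(p - it_j), _mm_loadu_ps(p + it_j)), vx1));
        acc = _mm_add_ps(acc, _mm_mul_ps(
            _mm_add_ps(_mm_loadu_ps(p - 2 * it_j), _mm_loadu_ps(p + 2 * it_j)), vx2));
        acc = _mm_add_ps(acc, _mm_mul_ps(
            _mm_add_ps(_mm_loadu_ps(p - 3 * it_j), _mm_loadu_ps(p + 3 * it_j)), vx3));
        acc = _mm_add_ps(acc, _mm_mul_ps(
            _mm_add_ps(_mm_loadu_ps(p - 4 * it_j), _mm_loadu_ps(p + 4 * it_j)), vx4));
        _mm_storeu_ps(out + i, acc);
    }

    /* Scalar tail for the remaining points. */
    for (; i + STENCIL < z; ++i) {
        const float *p = in + base + i;
        float acc = c0 * p[0];
        for (size_t d = 1; d <= STENCIL; ++d)
            acc += x[d - 1] * (*(p - d * it_j) + *(p + d * it_j));
        out[i] = acc;
    }
}
```

A single accumulator keeps register pressure low but creates one long dependency chain of additions, so whether it is faster than the original tree of partial sums would have to be measured. A stencil like this is also often limited by memory bandwidth rather than arithmetic, in which case neither form may show much speedup.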
