Problems encountered during vectorization of code using SSE intrinsics

I have been struggling to vectorize a particular application for
some time now, and I have tried everything from auto-vectorization to
hand-coded SSE intrinsics. Somehow I am still unable to obtain any
speedup on my stencil-based application.

Following is a snippet of my current code, which I have vectorized using SSE intrinsics.

//#pragma ivdep
for ( i = STENCIL; i < z - STENCIL; i+=4 )
{
    it = it2 + i;
    __m128 center = _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i);

    u_j4 = _mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]); //Line 180
    u_j3 = _mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]);
    u_j2 = _mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]);
    u_j1 = _mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]);
    u_j8 = _mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k]);
    u_j7 = _mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k]);
    u_j6 = _mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k]);
    u_j5 = _mm_load_ps(&p2[i+j*it_j+it_j +k*it_k]);

    __m128 tmp2i = _mm_mul_ps(_mm_add_ps(u_j4,u_j8),X4_i);
    __m128 tmp3 = _mm_mul_ps(_mm_add_ps(u_j3,u_j7),X3_i);
    __m128 tmp4 = _mm_mul_ps(_mm_add_ps(u_j2,u_j6),X2_i);
    __m128 tmp5 = _mm_mul_ps(_mm_add_ps(u_j1,u_j5),X1_i);

    __m128 tmp6 = _mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5));
    __m128 tmp7 = _mm_add_ps(tmp6,center);

    _mm_store_ps(&tmp2[i],tmp7); //Line 196
}

When I compile the above code with icc, without #pragma ivdep, I get the following messages:

remark: loop was not vectorized: existence of vector dependence.
vector dependence: assumed FLOW dependence between tmp2 line 196 and tmp2 line 196.
vector dependence: assumed ANTI dependence between tmp2 line 196 and tmp2 line 196.

When I compile it with icc, with #pragma ivdep enabled, I get the following message for line 180:

remark: loop was not vectorized: unsupported data type.

Why is there a dependence suggested for Line 196? How can I eliminate the suggested vector dependence?


Thomas Willhalm (Intel)

Once you have vectorized the code with intrinsics, the compiler cannot auto-vectorize it: you have already done the vectorization yourself. If you want to investigate auto-vectorization, you have to go back to the scalar version.
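
For illustration, a scalar version of the loop might look roughly like the sketch below. This is not the original scalar code, which was not posted. The function name stencil_row_scalar, the parameter layout, STENCIL being 4, and it_j2, it_j3, it_j4 being 2, 3 and 4 times it_j are all assumptions. Here in stands for &p2[j*it_j + k*it_k] and out for tmp2. The restrict qualifiers (C99) are one way to promise the compiler that the input and output do not overlap, which is the kind of assumed dependence the remark complains about.

```c
#include <assert.h>
#include <stddef.h>

#define STENCIL 4  /* assumed half-width: four neighbours on each side */

/* Hypothetical scalar form of the j-direction stencil for one row.
 * in  : assumed to point at &p2[j*it_j + k*it_k]
 * out : assumed to be the tmp2 row
 * x   : the four weights X1..X4, c00 the centre weight C00 */
void stencil_row_scalar(const float *restrict in, float *restrict out,
                        ptrdiff_t it_j, int z,
                        float c00, const float x[STENCIL])
{
    /* restrict is only valid if the two rows really are distinct */
    assert(in != out);

    for (int i = STENCIL; i < z - STENCIL; ++i) {
        out[i] = c00 * in[i]
               + x[0] * (in[i -     it_j] + in[i +     it_j])
               + x[1] * (in[i - 2 * it_j] + in[i + 2 * it_j])
               + x[2] * (in[i - 3 * it_j] + in[i + 3 * it_j])
               + x[3] * (in[i - 4 * it_j] + in[i + 4 * it_j]);
    }
}
```

A loop in this form contains no __m128 values, so the vectorizer has ordinary float loads and stores to work with. The order of the additions differs from the intrinsic version, so results may differ in the last bits.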

jimdempseyatthecove

Consider writing your SSE intrinsics to use fewer temporaries and to interleave the loads with the multiplies.

Jim Dempsey

www.quickthreadprogramming.com
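
One possible reading of the suggestion above is sketched here: a single accumulator replaces the u_j* and tmp* temporaries, and each pair of loads feeds its multiply straight away. The function name stencil_row_sse and its parameters are hypothetical, with in standing for &p2[j*it_j + k*it_k] and out for tmp2. The sketch uses unaligned loads and stores (_mm_loadu_ps, _mm_storeu_ps) so that it makes no alignment assumption, whereas the original uses _mm_load_ps and _mm_store_ps, which require 16-byte alignment. Whether this is actually faster has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

#define STENCIL 4  /* assumed half-width: four neighbours on each side */

/* Hypothetical SSE form of the same row update, four floats per iteration. */
void stencil_row_sse(const float *in, float *out, ptrdiff_t it_j, int z,
                     float c00, const float x[STENCIL])
{
    /* Sketch only: there is no scalar remainder loop, so the interior
     * width must be a multiple of the vector width. */
    assert((z - 2 * STENCIL) % 4 == 0);

    /* Broadcast the weights once, outside the loop. */
    const __m128 vc = _mm_set1_ps(c00);
    const __m128 x1 = _mm_set1_ps(x[0]);
    const __m128 x2 = _mm_set1_ps(x[1]);
    const __m128 x3 = _mm_set1_ps(x[2]);
    const __m128 x4 = _mm_set1_ps(x[3]);

    for (int i = STENCIL; i < z - STENCIL; i += 4) {
        const float *c = in + i;

        /* Each pair of loads is consumed immediately by an add and a multiply. */
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(c), vc);
        acc = _mm_add_ps(acc, _mm_mul_ps(x1,
              _mm_add_ps(_mm_loadu_ps(c -     it_j), _mm_loadu_ps(c +     it_j))));
        acc = _mm_add_ps(acc, _mm_mul_ps(x2,
              _mm_add_ps(_mm_loadu_ps(c - 2 * it_j), _mm_loadu_ps(c + 2 * it_j))));
        acc = _mm_add_ps(acc, _mm_mul_ps(x3,
              _mm_add_ps(_mm_loadu_ps(c - 3 * it_j), _mm_loadu_ps(c + 3 * it_j))));
        acc = _mm_add_ps(acc, _mm_mul_ps(x4,
              _mm_add_ps(_mm_loadu_ps(c - 4 * it_j), _mm_loadu_ps(c + 4 * it_j))));

        _mm_storeu_ps(out + i, acc);
    }
}
```

A single accumulator keeps register pressure low but serializes the additions. Splitting the sum across two accumulators is a variation worth trying if the add chain turns out to be the bottleneck.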
