help to understand overhead with movaps

I am learning how to use VTune to profile my code and optimize the hotspots. The core computation of my code is an inline function containing SSE instructions. With VTune, I found that the most expensive instruction was a "movaps %xmm10, %xmm9", which belongs to the second statement of the function. The first half of the function looks like this:

inline int havelsse4(float4 *vecN, float4 *pout, float4 *bary, const __m128 o, const __m128 d, const __m128 int_coef){
    const __m128 n = _mm_load_ps(&vecN->x);      // aligned 16-byte load of vecN
    const __m128 det = _mm_dp_ps(n, d, 0x7f);    // dot product over x, y, z, broadcast to all lanes
    float vecalign;
    _mm_store_ss(&vecalign, det);
    if (vecalign < 0.f)
        return 0;
    const __m128 dett = _mm_dp_ps(_mm_mul_ps(int_coef, n), o, 0xff);
    const __m128 oldt = _mm_load_ss(&bary->x);

The VTune hotspot screenshot is attached below. From the assembly, the movaps instruction basically prepares for the dpps between n and d (xmm10 holds n), and then comiss tests for vecalign<0.f.
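For readers less familiar with dpps, here is a rough scalar model of what _mm_dp_ps(a, b, mask) computes. It is only a sketch: dp_ps_model is a hypothetical helper name, not part of any library, and the real instruction may add the products in a different order, so results can differ in the last bits. The upper four mask bits select which lanes are multiplied and summed, and the lower four bits select which output lanes receive the sum.

```c
/* Scalar model of _mm_dp_ps(a, b, mask), i.e. the dpps instruction.
   dp_ps_model is a hypothetical name used only for illustration.
   Bits 4..7 of mask choose which lanes are multiplied and summed;
   bits 0..3 choose which output lanes receive the sum (others get 0). */
static void dp_ps_model(const float a[4], const float b[4],
                        unsigned mask, float out[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (mask & (0x10u << i))
            sum += a[i] * b[i];
    }
    for (int i = 0; i < 4; ++i)
        out[i] = (mask & (1u << i)) ? sum : 0.0f;
}
```

Under this model, mask 0x7f sums the x, y and z products and writes the sum to all four lanes, while 0xff sums all four products.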

I am not very familiar with SSE. Does the timing info make sense to you? Is movaps typically more costly than dpps? Are there any tricks to optimize this statement?

The pointer &vecN->x points to a 16-byte-aligned struct here.
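To make the alignment claim concrete, here is a minimal sketch of how it could be checked. The struct layout is an assumption, since the real definition of float4 is not shown here; float4_sketch and is_aligned16 are hypothetical names. _mm_load_ps requires its address to be a multiple of 16 bytes, which is what the helper tests.

```c
#include <stdint.h>

/* Assumed layout only: the actual float4 definition is not shown above. */
typedef struct {
    _Alignas(16) float x;   /* forces 16-byte alignment of the whole struct */
    float y, z, w;
} float4_sketch;

/* Returns 1 if p is 16-byte aligned, as _mm_load_ps requires, else 0. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A debug build could assert is_aligned16(&vecN->x) before the load to catch a misaligned pointer early.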
