missing compiler prefetches in intrinsics code with linear memory access

Hi all,

The Intel compiler 14.0.3 does not insert software prefetches for the following linear test program:

#include <iostream>
#include <immintrin.h>

int main() {

    const int elements = 1e7;

    const int mem_size = 16 * elements * sizeof(float); // 640 MB per array

    float *vec_a = (float*)_mm_malloc( mem_size, 64 );
    float *vec_b = (float*)_mm_malloc( mem_size, 64 );

    // initialization
    for ( int i = 0; i < 16*elements ; ++i ) {

        vec_a[i] = 0.8f;
        vec_b[i] = 0.6f;
    }

    #pragma omp parallel
    {
        const __m512 mass_ = _mm512_set1_ps( 0.123f );

        __m512 vec_a_, vec_b_;

        // vec_b = mass * vec_a + vec_b, 16 floats (one cache line) per iteration
        #pragma omp for schedule(static)
        for ( int i = 0; i < 16*elements ; i += 16 ) {

            vec_a_ = _mm512_load_ps( vec_a + i );
            vec_b_ = _mm512_load_ps( vec_b + i );

            vec_a_ = _mm512_fmadd_ps( mass_, vec_a_, vec_b_ );

            // streaming (no-read, non-globally-ordered) store
            _mm512_storenrngo_ps( vec_b + i, vec_a_ );
        }
    }

    // prevent dead-code elimination
    float delta = 0.0f;

    for ( int i = 0; i < 16*elements ; ++i ) {

        delta += vec_b[i];
    }

    std::cout << delta << std::endl;

    _mm_free( vec_a );
    _mm_free( vec_b );
}

The compiler generates the following assembly (icpc -O3 -mmic -openmp -S -masm=intel linear.cpp):

             mov       r8, QWORD PTR [r13]                           
             add       rcx, 16                                     
             vmovaps   zmm0, ZMMWORD PTR [r8+rax]                    
             mov       dl, dl                                        
             mov       r9, QWORD PTR [r14]                    
             vfmadd213ps zmm0, zmm1, ZMMWORD PTR [r9+rax]           
             vmovnrngoaps ZMMWORD PTR [r9+rax], zmm0               
             add       rax, 64                               
             cmp       rcx, rdx                                      
             jle       ..B1.36  

So there are no software prefetches. Of course, I could insert prefetch intrinsics myself, but I would expect the compiler to be much better at doing that for linear memory access. I tried #pragma prefetch and -opt-prefetch=4, with no success. It seems to be a compiler problem, because the Intel compiler 15.0b does insert prefetch instructions.

However, the current 15.0b compiler generates code that is 30% slower for my larger program.

So my question is: How can I force the 14.0 compiler to insert software prefetches for linear intrinsics code?





Perhaps the compiler is a bit confused by the use of storenrngo on data that you have just brought into cache.

Did the 15.0 release compiler return to the 14.0 treatment?

Did you compare the performance of various alternatives, including plain C code, with various prefetch, streaming-store, and pragma nontemporal options? I'd like to know the motivation for using intrinsics rather than plain code, but I concede that for some people it's the other way around.

Yes, you are correct. The 14.0 compiler inserts software prefetches when I use _mm512_store_ps. 

Of course, it does not make sense to use intrinsics for such simple programs. This is just a small program to show the problem. I just want to understand how the compiler handles intrinsics code. I am not interested in plain C code.

So I still have the problem with the 14.0 compiler for the program with streaming stores!

I am checking with Development about 14.0. I confirmed your findings with our latest 14.0 Update 4 and the 15.0 release.

Did you get a response from Development?

My apologies. Yes, I did, in the form of a forwarded email that included you and others from Intel. Based on that, I thought you already had the response. Here are the Developer's comments from the thread regarding your test case:

In general, the compiler will not insert prefetches for a cache line that is stored using nrngo inside a loop. This is intended, since prefetching that cache line would defeat the advantage (avoiding the RFO) that you get from using streaming stores on KNC. This should be true for compiler-generated nrngo instructions or a user-inserted intrinsic (for the same) inside a loop. In this particular case, the use of nrngo doesn’t really help, since the loop is actually doing a load-modify-store of the same elements (there appears to be a load with the same address as the store immediately preceding the store).

The compiler relies on identifying linear access patterns in the load/store addresses in order to insert the prefetches. If the compiler is not able to identify the linear access pattern, the prefetches will not be inserted.
With 15.0, the prefetch that is inserted appears to be for another load address, unconnected to the store.
If the nrngo is replaced by a regular store, even the 14.0 compiler does better. But generally, 15.0 is more aggressive in terms of prefetching for intrinsic loads/stores.
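
For readers who want to see what a hand-written prefetch for such a linear access looks like, here is a minimal sketch. It is not from the Developer's email: the helper name fma_with_prefetch and the prefetch distance of 8 cache lines are assumptions for illustration (the distance is untuned), and the inner scalar loop stands in for the 16-wide load/fmadd/store so that the example builds with any x86 compiler rather than only with -mmic. _mm_prefetch and _MM_HINT_T0 come from <xmmintrin.h>.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0 (x86 only)

// Assumed values for illustration; the distance would need tuning on real hardware.
constexpr std::size_t kFloatsPerLine = 16;                  // 64-byte line / 4-byte float
constexpr std::size_t kPrefetchAhead = 8 * kFloatsPerLine;  // 8 cache lines ahead

// Hypothetical helper: b[i] = mass * a[i] + b[i], one cache line per outer iteration.
void fma_with_prefetch(const float* a, float* b, std::size_t n, float mass) {
    assert(n % kFloatsPerLine == 0);
    for (std::size_t i = 0; i < n; i += kFloatsPerLine) {
        // Request the lines that will be needed a few iterations from now.
        if (i + kPrefetchAhead < n) {
            _mm_prefetch(reinterpret_cast<const char*>(a + i + kPrefetchAhead), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + i + kPrefetchAhead), _MM_HINT_T0);
        }
        // Scalar stand-in for the 16-wide load / fmadd / store sequence.
        for (std::size_t j = i; j < i + kFloatsPerLine; ++j) {
            b[j] = mass * a[j] + b[j];
        }
    }
}
```

Because each address is the loop index times a constant stride, the prefetch address is just the current address plus a fixed distance. That is presumably the pattern the compiler has to recognize before it can insert the same prefetches automatically.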

Either way, I would recommend:
a) Use a regular store instead of nrngo.
b) Use 15.0, and if you run into any (other) issues, please file a bug report.
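
Recommendation (a) amounts to a one-line change in the loop. As a rough sketch of the two variants side by side, the code below uses the SSE intrinsics _mm_store_ps and _mm_stream_ps as loose stand-ins for _mm512_store_ps and _mm512_storenrngo_ps, because the KNC intrinsics only build with -mmic. The helper name scale_add and its streaming flag are hypothetical, and the semantics of an SSE non-temporal store differ in detail from nrngo on KNC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_store_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper, not from the thread: b = mass * a + b on 16-byte-aligned data.
// SSE (4 floats per vector) stands in for the 512-bit KNC intrinsics.
void scale_add(const float* a, float* b, std::size_t n, float mass, bool streaming) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    const __m128 mass_ = _mm_set1_ps(mass);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 va = _mm_load_ps(a + i);
        const __m128 vb = _mm_load_ps(b + i);  // b is read here, before it is written
        const __m128 r  = _mm_add_ps(_mm_mul_ps(mass_, va), vb);
        if (streaming) {
            _mm_stream_ps(b + i, r);  // non-temporal store (analogue of the nrngo store)
        } else {
            _mm_store_ps(b + i, r);   // regular store, as in recommendation (a)
        }
    }
    if (streaming) {
        _mm_sfence();  // make the non-temporal stores visible before any later stores
    }
}
```

Both variants compute the same result. The difference is only in how the store interacts with the cache, and since the loop has already loaded b + i, the streaming variant has little to gain here, which matches the Developer's remark about the load-modify-store pattern.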

I saw your comment that the 15.0 Beta produces 30% slower code. If that’s still the case with the official 15.0 release, please let us know so that we can try to obtain reproducers and have Development investigate and address any associated defects.

Again, my apologies.

