internal error: 04010002_1535 with ICC 13 on MIC

I'm probably doing something wrong, since I'm not an expert in inline assembler and I don't know the exact syntax of the vpcmpd instruction, but I get this error when compiling the following test for the MIC architecture:

#include <immintrin.h>
void foo(void * ptr)
{
   __m512i zero = _mm512_setzero_epi32();
   __m512i a = _mm512_load_epi32(ptr);
   __asm {
      vpcmpd k0, zero, a, 4;
   }
}

I compile it with:

icc foo.c -c -mmic -fasm-blocks

and get:

internal error: 04010002_1535
compilation aborted for foo.c (code 4)

ICC version: icc (ICC) 13.0.0 20120731

If the code is OK, it would be nice if someone could provide me with a workaround. Basically, what I'm trying to do is something like:

vpcmpd k0, zero, vector_load(pointer), 4;

Thanks in advance.


This reference: (Moderator edit: added public documentation link for intrinsics) has a list of intrinsics supported on the current Intel Xeon Phi Coprocessor

It does look like a bug if the compiler throws internal error when you try to use intrinsics for a different architecture.
Intrinsics are easier to use than inline asm but they don't give you much more portability.
Note that the compiler you have was superseded by Update 1 today.

I have all the information about KNC intrinsics but not about assembler instructions.
Unfortunately, I cannot use intrinsics directly because ICC optimizes my code "too much". Let's say I wanted to do something like:

volatile int* pointer;

while(_mm512_cmpneq_epi32_mask( _mm512_load_epi32((void *) pointer), _mm512_setzero_epi32()));

Even with the volatile qualifier, icc optimizes out the whole loop, probably because of the cast to void*. It only works if I declare "pointer" as "int* volatile", but then I get an extra load of the address on each iteration. This extra load matters a lot in my case, and I also don't much like the code generated with -O3. That is why I was trying to implement this with inline assembler.

Is there any other possibility or workaround?
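
For reference, the two declarations mean different things in C: "volatile int *" makes the pointed-to data volatile, while "int * volatile" makes the pointer variable itself volatile. A cast to (void *) also discards the qualifier on the data. A minimal scalar sketch of the distinction follows; the function names are hypothetical, and plain int reads stand in for the 512-bit load, so this says nothing certain about what icc does with the intrinsic:

```c
#include <stddef.h>

/* Pointer to volatile data: every p[i] read is a volatile access and must
 * be performed, while p itself is free to stay in a register. */
int any_nonzero_volatile_data(volatile int *p, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc |= p[i];
    return acc != 0;
}

/* Volatile pointer to plain data: the pointer variable is re-read on every
 * use, but the ints it points to are ordinary, non-volatile objects. */
int any_nonzero_volatile_pointer(int *volatile p, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc |= p[i];
    return acc != 0;
}
```

Both functions return the same value; they differ only in which accesses the compiler is obliged to keep.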
If it is a bug, where should I report it?

Thank you!

I don't see how you can expect to use an mm512 intrinsic (vector of 16 32-bit values) in while(). Probably the compiler should warn, regardless of whether it treats it as dead code, but such warnings have been voted down many times over the years.
If you would post enough C++ code to show what you want, possibly you may get suggestions on how to optimize with icpc.
If you are using intrinsics as a stepping stone to assembler, you need to get the code working at each step before taking another.

Agreed that the compiler should not throw an internal error. The bug ID for this internal error is DPD200237792.

Thank you.

_mm512_cmpneq_epi32_mask returns a __mmask16, which is not a vector register but a 2-byte mask (one bit per 32-bit lane), so it should be possible to use it in a while() condition. In fact, I get the expected behavior, just not the expected assembler.
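
To make the mask semantics concrete, here is a scalar model of a 16-lane "not equal" compare. The function name is hypothetical and this is only an illustration of what the mask bits mean, not the intrinsic's implementation:

```c
#include <stdint.h>

/* Hypothetical scalar model: bit i of the result is set when a[i] != b[i].
 * A zero result therefore means all 16 lanes compared equal. */
uint16_t cmpneq_mask16_model(const int32_t a[16], const int32_t b[16])
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        if (a[i] != b[i])
            mask |= (uint16_t)(1u << i);
    }
    return mask;
}
```

Used as a while() condition, such a mask is simply tested against zero, so the loop keeps spinning while any lane differs.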


void foo(volatile int * pointer)
{
    while(_mm512_cmpneq_epi32_mask( _mm512_load_epi32((void *) pointer), _mm512_setzero_epi32()));
}

void foo2(int * volatile pointer)
{
    while(_mm512_cmpneq_epi32_mask( _mm512_load_epi32((void *) pointer), _mm512_setzero_epi32()));
}


Compiling with icc foo.c -S -mmic -O3, "foo" is optimized out and foo2 contains a loop like this:


..B2.3:
        movq      -8(%rsp), %rax
        vpcmpd    $4, (%rax), %zmm0, %k0
        jknzd     ..B2.3, %k0

The volatile qualifier on the pointer prevents the while loop from being optimized away, but it generates the movq inside the loop.
It is a subtle difference, but what I'm looking for, using intrinsics or inline assembler, is something like this, with the movq hoisted out of the loop:

        movq      -8(%rsp), %rax
..B2.3:
        vpcmpd    $4, (%rax), %zmm0, %k0
        jknzd     ..B2.3, %k0
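
One possible way to get a loop of this shape without inline vector assembler is a compiler-only barrier: the pointer stays an ordinary parameter, so it can live in a register, and the barrier forces the data to be re-read on each pass. The sketch below is a scalar stand-in and rests on several assumptions: the compiler accepts GNU-style extended asm, the helper name and the max_polls bound are hypothetical, and it has not been verified against icc -mmic with the 512-bit intrinsics:

```c
#include <stddef.h>

/* Compiler-only barrier (GNU-style extended asm, assumed to be supported):
 * emits no instruction, but tells the compiler memory may have changed. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical helper: polls the n ints at p until all are zero or
 * max_polls passes have been made. Returns the number of passes. */
unsigned poll_until_zero(const int *p, size_t n, unsigned max_polls)
{
    unsigned polls = 0;
    while (polls < max_polls) {
        int acc = 0;
        for (size_t i = 0; i < n; ++i)
            acc |= p[i];
        ++polls;
        if (acc == 0)
            break;
        COMPILER_BARRIER();  /* cached values of p[i] may not be reused */
    }
    return polls;
}
```

The max_polls bound is only there so the sketch terminates; a real spin-wait would loop until another thread or device clears the data.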

Thank you


Is there any news on the bug ID DPD200237792?


Hi Diego,

>>...I cannot use intrinsics directly because ICC optimizes “too much” my code...

Did you try turning off all optimizations (globally), or using the '#pragma optimize' directive to control optimization of specific blocks in a source file?

Best regards,

Hi Sergey,

Thank you for your reply. If I turn off all optimizations, I get code that is far too slow for my application. For example, compiling the "foo" function:

# parameter 1: %rdi
..B1.1: # Preds ..B1.0
..___tag_value_foo.1: #5.1
pushq %rbx #5.1
..___tag_value_foo.3: #
movq %rsp, %rbx #5.1
..___tag_value_foo.4: #
andq $-64, %rsp #5.1
subq $56, %rsp #5.1
pushq %rbp #5.1
movq 8(%rbx), %rbp #5.1
movq %rbp, 8(%rsp) #5.1
movq %rsp, %rbp #5.1
..___tag_value_foo.6: #
subq $256, %rsp #5.1
movq %rdi, -248(%rbp) #5.1
..B1.2: # Preds ..B1.2 ..B1.1
movq -248(%rbp), %rax #7.15
vmovdqa32 (%rax), %zmm0 #7.15
vmovaps %zmm0, -192(%rbp) #7.15
vpxord %zmm0, %zmm0, %zmm0 #7.15
vmovaps %zmm0, -128(%rbp) #7.15
vmovaps -128(%rbp), %zmm0 #7.15
vmovaps %zmm0, -64(%rbp) #7.15
vmovaps -192(%rbp), %zmm0 #7.15
vmovaps -64(%rbp), %zmm1 #7.15
vpcmpd $4, %zmm1, %zmm0, %k0 #7.15
kmov %k0, %eax #7.15
movw %ax, -256(%rbp) #7.15
movzwl -256(%rbp), %eax #7.15
kmov %eax, %k0 #7.15
jknzd ..B1.2, %k0 # Prob 50% #7.15
..B1.3: # Preds ..B1.2
leave #9.1
..___tag_value_foo.7: #
movq %rbx, %rsp #9.1
popq %rbx


This is what I see in DPD200237792:

This is a test error: two memory references are used in one instruction. Of course the compiler should report it in a more appropriate way. I am closing this as a duplicate of CQ179989.

Instead you can write the following:
vmovaps zmm0, zero;
vpcmpd k0, zmm0, a, 4;
