I need some help with the "undefined reference to _mm_tzcnti_64" link error

#include <stdio.h>
#include <immintrin.h>
#pragma offload_attribute(push,target(mic))
class mic_f{
public:

  union{
    __m512 v;
    float f[16];
  };

  __forceinline operator __m512 () const { return v; };
  __forceinline float& operator[](const size_t index) { return f[index]; };
  __forceinline mic_f& operator=(const mic_f& f) { v = f.v; return *this; };
  __forceinline mic_f(const __m512& t) { v = t; };
  __forceinline mic_f() {};

};
#pragma offload_attribute(pop)
class mic_m{
public:
  __mmask16 v;
#pragma offload_attribute(push,target(mic))
  __forceinline operator __mmask () const { return v; };
  __forceinline mic_m(int t ) { v = (__mmask16)t; };
  __forceinline mic_m(unsigned int t ) { v = (__mmask16)t; };
#pragma offload_attribute(pop)
};

#pragma offload_attribute(push,target(mic))
__forceinline size_t bitscan64(const ssize_t index,const size_t v) { 
  return _mm_tzcnti_64(index,v); 
};
#pragma offload_attribute(pop)

int main(){
  mic_f data[3];
  for(int i=0;i<16;i++){
    data[0][i]=i%4+1;
    data[1][i]=4;
    data[2][i]=0;
  }
  float* indata = (float*)data;
#pragma offload target(mic)inout(indata:length(48))
  {
    mic_f* ineedyou = (mic_f*)indata;
    const unsigned long hiti = 0x8888;
    const unsigned long pos_first = 4;

    const unsigned long pos_second = bitscan64(pos_first,hiti);
  }
}

When I compile the code with "icc -o test includetest.cpp", I get this error:

In function `bitscan64(long, unsigned long)':
includetest.cpp:(.text._Z9bitscan64lm[_Z9bitscan64lm]+0xb): undefined reference to `_mm_tzcnti_64'

When I change the intrinsic "_mm_tzcnti_64" to "_mm_tzcnt_64", everything is OK.

How can I solve this problem? Do I perhaps need some compiler options?

This looks like a probable defect. I don't understand why, but in the failing case the compiler generates a jmp _mm_tzcnti_64, which leads to the unresolved symbol. Let me inquire with Development and post again when I know more.

Sorry, I overlooked this earlier. Looking at this more closely, the issue is that the code uses the Intel Xeon Phi™ specific intrinsic _mm_tzcnti_64 without guarding it for use only on the target CPU. This intrinsic is only available for Xeon Phi, so you must protect its use within the offload code with something like the __MIC__ predefined macro, and also provide an equivalent for the host CPU that executes when the code does not offload.

So your bitscan64() function needs a structure like:

#pragma offload_attribute(push,target(mic))
__forceinline size_t bitscan64(const ssize_t index,const size_t v) 
{
#ifdef __MIC__
   return _mm_tzcnti_64(index,v);  
#else
  // insert the host CPU equivalent function/code here
#endif
};
#pragma offload_attribute(pop)

From what I can see, this works with _mm_tzcnt_64 because of the support for _tzcnt_u64 on the host CPU.

I don't know what can be used as the host CPU equivalent of the _mm_tzcnti_64 intrinsic. I can inquire with others and let you know.

In reply to Kevin Davis (Intel):

That really solved my problem. Thank you very much!

And if you have time, could you give me some details about how icc compiles and links the source file in offload mode, with and without the __MIC__ macro, and about the two object files it produces (xx.o and xxMIC.o)?

Great, glad that helped.

Use of the __MIC__ macro is not sufficient to trigger the “offload compilation”, which consists of compiling for both the host CPU and the target (coprocessor) CPU; therefore, if the source file only contains that macro, you will not see the two object files created when compiling to object (i.e. -c). It is the presence of offload language extensions (e.g. #pragma offload) that triggers the offload compilation and the creation of the two object files you noted. The object files are only left on disk when compiling to object using -c. (A side note: The release due out later this year produces a single merged .o containing both the host CPU and target CPU objects.) When you compile to an executable without any intermediate compilation to object, you do not see the two object files; the compiler creates a single final executable file containing both the host CPU and target CPU executables. The run-time knows how to split the merged executable file and load the target CPU executable on the coprocessor for execution.

When compiling to object, you also only need to reference the host CPU .o file. The compiler and other associated tools (xiar, xild; these tools require at least the -qoffload-build option) handle the target CPU .o file invisibly.
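
As a rough illustration of the difference between the __MIC__ macro and the offload pragma, here is a minimal sketch. The function names compiled_for_coprocessor() and ran_on_coprocessor() are hypothetical. The __INTEL_OFFLOAD guard is an assumption: it is taken here to be the macro the Intel compiler predefines for offload-enabled builds, and it is used only so that the same file also builds with a compiler that has no offload support.

```cpp
// Under offload compilation this function is compiled twice: once for the
// host and once for the coprocessor. __MIC__ is only defined in the
// coprocessor pass, so each pass keeps a different branch.
#ifdef __INTEL_OFFLOAD   // assumption: predefined for offload-enabled builds
#pragma offload_attribute(push, target(mic))
#endif
inline bool compiled_for_coprocessor() {
#ifdef __MIC__
    return true;    // coprocessor pass
#else
    return false;   // host pass, or a compiler without offload support
#endif
}
#ifdef __INTEL_OFFLOAD
#pragma offload_attribute(pop)
#endif

// Hypothetical helper: the offload pragma below, not the __MIC__ macro above,
// is the kind of language extension that triggers the offload compilation.
inline bool ran_on_coprocessor() {
    int on_mic = 0;
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) inout(on_mic)
#endif
    {
        on_mic = compiled_for_coprocessor() ? 1 : 0;
    }
    return on_mic != 0;
}
```

Built without offload support, the pragma lines are skipped and both functions report the host. With offload compilation, the block inside ran_on_coprocessor() would be expected to take the __MIC__ branch when it actually executes on the coprocessor.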

Let me know if you have any other questions.

Best Reply

FYI, here is a function that was offered as the host-equivalent to _mm_tzcnti_64.

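__int64 _mm_tzcnti_64_emulate(__int64 dest, unsigned __int64 src)
{
    // A negative start index means "scan from bit 0".
    if (dest < 0) {
        return _mm_tzcnt_64(src);
    }
    // No bits remain above position 63.
    if (dest >= 63) {
        return 64;
    }
    // Clear bits 0..dest; the unsigned 64-bit literal keeps the shift well defined.
    unsigned __int64 mask = (1ULL << (dest + 1)) - 1;
    return _mm_tzcnt_64(src & ~mask);
}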
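
For anyone who wants to try the guarded structure without Intel-specific intrinsics on the host, here is a minimal sketch that plugs a portable version of the emulation above into bitscan64(). The helper names tzcnt64_portable() and tzcnti64_host() are hypothetical, the plain loop is a deliberately simple stand-in for _mm_tzcnt_64, and the __INTEL_OFFLOAD guard is an assumption (taken to be the Intel compiler's predefined macro for offload-enabled builds) so the file also builds with other compilers.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __MIC__
#include <immintrin.h>   // only needed for the coprocessor-only intrinsic
#endif

// Portable trailing-zero count: index of the lowest set bit, or 64 if v == 0.
// A plain loop, so no TZCNT/BMI support is assumed on the host.
inline std::uint64_t tzcnt64_portable(std::uint64_t v) {
    if (v == 0) return 64;
    std::uint64_t n = 0;
    while ((v & 1u) == 0) { v >>= 1; ++n; }
    return n;
}

// Host stand-in for _mm_tzcnti_64, following the emulation posted above:
// index of the first set bit strictly above position `index`, or 64 if none.
inline std::uint64_t tzcnti64_host(std::int64_t index, std::uint64_t v) {
    if (index < 0) return tzcnt64_portable(v);
    if (index >= 63) return 64;
    const std::uint64_t mask = (std::uint64_t(1) << (index + 1)) - 1;  // bits 0..index
    return tzcnt64_portable(v & ~mask);
}

#ifdef __INTEL_OFFLOAD   // assumption: predefined for offload-enabled builds
#pragma offload_attribute(push, target(mic))
#endif
inline std::uint64_t bitscan64(std::int64_t index, std::uint64_t v) {
#ifdef __MIC__
    return _mm_tzcnti_64(index, v);    // coprocessor pass only
#else
    return tzcnti64_host(index, v);    // host pass
#endif
}
#ifdef __INTEL_OFFLOAD
#pragma offload_attribute(pop)
#endif

// Usage example with the values from the question (0x8888 has bits 3, 7, 11, 15 set).
inline void bitscan64_examples() {
    assert(bitscan64(4, 0x8888) == 7u);    // first set bit above position 4
    assert(bitscan64(-1, 0x8888) == 3u);   // negative index scans from bit 0
    assert(bitscan64(15, 0x8888) == 64u);  // nothing set above position 15
}
```

The host branch is only there so that the non-offloaded build links; if speed on the host matters, the loop could be swapped for _mm_tzcnt_64 as in the function above.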

 

In reply to Kevin Davis (Intel):

Uh, a little too long, but thank you, Kevin. I'm very grateful for your help.
