offload error: cannot load library to the device 0 (error code 20)


I'm trying to get a piece of compiler-assisted offload code working. The code uses MKL, though it's just a test code. At the moment, the code generates some random numbers on the host CPU, then offloads a function to the Phi to print "Hello World!". I'm compiling as follows:

icpc -std=c++11 -gxx-name=g++-4.8 -DMKL_ILP64 -I$MKLROOT/include  -offload-attribute-target=mic -offload-option,mic,compiler,"  -Wl,--start-group $MKLROOT/lib/mic/libmkl_intel_ilp64.a $MKLROOT/lib/mic/libmkl_core.a $MKLROOT/lib/mic/libmkl_sequential.a -Wl,--end-group" hellomkl_func.cpp -Wl,--start-group $MKLROOT/lib/intel64/libmkl_intel_ilp64.a $MKLROOT/lib/intel64/libmkl_core.a $MKLROOT/lib/intel64/libmkl_sequential.a -Wl,--end-group -lpthread -lm -o hellomkl_func

where the Compiler Options and MKL Link Line come from the Link Line Advisor. I want ILP64 integers (in the real code). The code compiles just fine. I've tested with both static and dynamic linking, but both give the same error when the code executes:

Hello World from the host!
32 cores present on host
Num CoProcessors: 2
On the sink, dlopen() returned NULL. The result of dlerror() is "/tmp/coi_procs/1/4803/load_lib/icpcout508HQv: undefined symbol: __builtin_signbit"
On the remote process, dlopen() failed. The error message sent back from the sink is /tmp/coi_procs/1/4803/load_lib/icpcout508HQv: undefined symbol: __builtin_signbit
offload error: cannot load library to the device 0 (error code 20)


The same offload works if I comment out the MKL lines and don't link with MKL. What's going wrong?



Attachment: hellomkl_func.cpp (text/x-c++src, 1.45 KB)

I'm attaching the result of running micinfo.



Attachment: micinfo_5.txt (text/plain, 2.61 KB)

Using dynamic libs, the sample built/ran using:

icpc -DMKL_ILP64 -I$MKLROOT/include -offload-option,mic,compiler,"-L$MKLROOT/lib/mic -lmkl_intel_ilp64
-lmkl_intel_thread -lmkl_core" -offload-attribute-target=mic hellomkl_func.cpp -lmkl_intel_ilp64
-lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -o hellomkl_func

With static, I noticed the MKL Link advisor suggested using the “intel64” libs in the -offload-option string
instead of “mic”. After fixing that, with static libs, the sample built/ran using:

icpc -DMKL_ILP64 -I$MKLROOT/include -offload-option,mic,compiler,"$MKLROOT/lib/mic/libmkl_intel_ilp64.a
$MKLROOT/lib/mic/libmkl_intel_thread.a $MKLROOT/lib/mic/libmkl_core.a" -offload-attribute-target=mic
hellomkl_func.cpp  -Wl,--start-group  $MKLROOT/lib/intel64/libmkl_intel_ilp64.a
$MKLROOT/lib/intel64/libmkl_intel_thread.a $MKLROOT/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5
-lpthread -lm -o hellomkl_func

NOTE for both: I do not have g++-4.8 installed, so I left out the -std option; however, I also reproduced your error using your command line without the -std option, so I do not believe it is a factor.


$ icpc -V
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version Build 20131008
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

$ icpc -DMKL_ILP64 -I/opt/intel/composer_xe_2013_sp1.1.106/mkl/include '-offload-option,mic,compiler,
-L/opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/mic -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core'
-offload-attribute-target=mic hellomkl_func.cpp -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
-lpthread -lm -o hellomkl_func_dyn

$ ./hellomkl_func_dyn
Hello World from the host!
32 cores present on host
Num CoProcessors: 2
Nums: 4.20817
Nums: -0.136043
Nums: 4.92799
Nums: 1.57837
Nums: -3.545
Hello world from mic!
228 cores present on mic

$ icpc -DMKL_ILP64 -I/opt/intel/composer_xe_2013_sp1.1.106/mkl/include '-offload-option,mic,compiler,
-offload-attribute-target=mic hellomkl_func.cpp -Wl,--start-group
/opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_core.a -Wl,--end-group
-liomp5 -lpthread -lm -o hellomkl_func_static

$ ./hellomkl_func_static
Hello World from the host!
32 cores present on host
Num CoProcessors: 2
Nums: 4.20817
Nums: -0.136043
Nums: 4.92799
Nums: 1.57837
Nums: -3.545
Hello world from mic!
228 cores present on mic


Attachment: mkl_advisor_u500749.jpg (image/jpeg, 280.17 KB)


      I can build and run using your build line, but I'm confused about what the differences are. Comparing your build line with mine - for the static build, I notice the following major differences. 

1. You use $MKLROOT/lib/mic/libmkl_intel_thread.a and $MKLROOT/lib/mic/libmkl_core.a in both the compiler options and the actual link line. The MKL Link Line Advisor recommends $MKLROOT/lib/mic/libmkl_core.a and $MKLROOT/lib/mic/libmkl_sequential.a (which is what I use). Are you linking with the sequential version of MKL? In my real code, I make several MKL calls from within a large loop (including DFTI calls). I'm parallelizing the loop with OpenMP, so I'm guessing I want my MKL calls to go to the sequential MKL library.

2. You pass an additional linker option: -liomp5.

I can't find the "With static, I noticed the MKL Link advisor suggested using the “intel64” libs in the -offload-option string instead of “mic”. After fixing that, with static libs, the sample also built/ran using:" part. My build line also has mic rather than intel64. Can you please clarify?

I've tried including the -liomp5 flag in my actual build line, but I get an error:

icpc: error #10104: unable to open '-liomp5'

Yet, on the same machine, I was able to build the hellomkl_func code earlier with this linker option without any trouble. What exactly am I linking with -liomp5?

Regarding 1: I selected multithreaded and overlooked the inclusion of sequential in your command line. My understanding is that MKL can be called within an OMP loop, but I have asked those more expert with MKL to review this thread, advise on sequential versus multithreaded for your case, and add any other comments.

Regarding 2:  -liomp5 is the Intel Compatibility OpenMP* run-time library libiomp5. About the "intel64”, see the snap below. The red arrow points to the section where within -offload-option string it references “intel64” libraries which is incorrect. This option and string are directed to the MIC-side link; therefore, the libs provided must be “mic” flavor.

As to the question about g++-4.8, I believe you would need not only a g++-4.8 host installation, but consistent MIC support. icpc includes its own headers and binary support for them at g++-4.7 level, even though your host g++ may be older than that, but this will not permit you to mix in changes which came in g++ 4.8.

On the MKL offload question, Kevin pointed out that the offload facility within MKL doesn't involve you building or linking MIC functions; the facilities to choose between host and MIC execution are built into the intel64 MKL libraries, provided that the MIC MKL is also installed.  MKL sequential doesn't make sense in automatic offload mode; the host could always do a better job than a single MIC thread.  If your objective is simply to test that, you might be able to do it by setting environment variables, but a job big enough to go to MIC under automatic offload could take 50x as long if you restrict MIC to 1 thread.

If you have a reason for using MKL automatic offload in a parallel host region, it's possible conceptually to set a suitable number of threads for each MIC offload and allocate a distinct group of MIC cores to each host thread, but I doubt that convenient machinery exists to accomplish this.  The analogous situation where MKL offload is used in multiple MPI ranks on host does have documented support.  In my opinion, it might perform adequately only in the case where the MKL offload jobs aren't big enough for each one to use the entire coprocessor efficiently, and you might have to set the threshold for automatic offload correspondingly lower. 

If the concept of queuing offload tasks for MIC is suitable, there may be developments to support going in that direction.

So there's quite a bit of opportunity here for "walk before you run."  Find out whether the concept works at all before trying to break new ground.

Hi Tim,

I'm not using the automatic offload mode; I'm planning on using compiler-assisted offloading. Basically I'm calling MKL's statistical library to make sequences of random numbers. This portion of MKL is not parallelized from what I understand. After manipulating these sequences a bit, I perform an inverse FFT using MKL's DFTI (which supports both parallel and sequential execution). The resulting sequence is then further manipulated a little bit (not using MKL) to produce the final desired result. I'm trying to compute the auto-covariance of the resulting sequence using Monte Carlo, so I need to compute many realizations of these sequences. At the moment, I have written a function that produces an array of the desired sequences and then calls MKL's auto-covariance computation routine to compute the desired matrix. I have parallelized the portion of the function that generates the array of realizations using OpenMP: at the moment, the sequences are computed in parallel on the host CPUs from an OpenMP parallel for region. I was hoping to offload this function to the Phi cards (we have two), and run the function on both Phi cards and on the host CPUs at the same time (using non-blocking offload calls). So I'd be making MKL calls from within an OpenMP parallel for region. Am I correct in assuming that I want to link with the sequential MKL libraries in that case?

If I use the -liomp5 linking, do I still want the usual -openmp linking for explicit OpenMP calls made by me rather than the MKL library?
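To make the loop structure described above concrete, here is a minimal sketch of independent realizations generated inside an OpenMP parallel for, where each iteration does purely serial work. It is only an illustration: `make_realization` and `make_realizations` are hypothetical names, and the standard `<random>` generators stand in for the MKL random-number and DFTI calls, which are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for one realization. The real code would call MKL's
// random-number and DFTI routines here; this sketch uses <random> only.
std::vector<double> make_realization(unsigned seed, std::size_t length) {
    assert(length > 0);
    std::mt19937_64 engine(seed);                      // private generator per realization
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> seq(length);
    for (std::size_t i = 0; i < length; ++i) {
        seq[i] = gauss(engine);
    }
    return seq;
}

// Each iteration is independent and serial inside, which is the same shape as
// calling a sequential library routine from within a parallel loop.
std::vector<std::vector<double>> make_realizations(std::size_t count,
                                                   std::size_t length) {
    std::vector<std::vector<double>> all(count);
    #pragma omp parallel for
    for (long r = 0; r < static_cast<long>(count); ++r) {
        // Seed depends only on the realization index, so results do not
        // depend on how many threads run the loop.
        all[r] = make_realization(static_cast<unsigned>(r) + 1u, length);
    }
    return all;
}
```

Because every realization owns its generator and output slot, the threads share no mutable state, so the parallelism comes entirely from the outer loop.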

Vishal, Kevin,

  • The "MKL Link Line Advisor" that Kevin used is out-of-date (v 2.2). We always keep the up-to-date link line advisor here: I plugged in the configurations you guys used and it did show correctly the MKL libs to be linked, either dynamically or statically.
  • Vishal's test code does not offload any MKL function. It only offloads calls to printf. So for this particular case, it does not need linking with MKL MIC libs. It only needs linking with MKL host libs, because the MKL functions are to be executed on host only.
  • If Vishal indeed wants to call MKL functions inside an offloaded OpenMP region, then it is correct that you want to link with the sequential MKL libs on the MIC side. You can use the "MKL Link Line Advisor" to get the proper compiler and linker options by selecting the "sequential" threading layer. Note that the link line advisor only shows MKL-specific options, so it won't say anything about OpenMP when you choose sequential MKL. But if you are using OpenMP outside MKL, you still need -liomp5.

Hope this helps.





Hi Kevin, Tim, & Zhang,

      One last question - if I link with OpenMP using -openmp, and including <omp.h>, I get no errors. If I link with Intel's OpenMP using -liomp5, but include <omp.h>, I get an unrecognised pragma error, but the code still compiles as it probably should (though in sequential mode I bet). What's the proper include file to use with Intel's OpenMP?

After a little googling I'm a bit confused - as per this page: I use -liomp5 when I'm linking Intel's OpenMP with something using gcc/g++, but as per the same page, when using Intel's compiler, I should use -openmp. So is that correct?   

Just use -openmp and drop the use of -liomp5. That came from the older v2.2 MKL link advisor. -openmp enables compile-time recognition of OMP pragmas AND links with the appropriate OpenMP libraries. -openmp appears within the MKL (v4.0) link advisor compiler options only when using multi-threaded MKL. In your case you're selecting sequential; however, for the other OpenMP uses in your code you need to add -openmp.

Sorry for the confusion on that and for using the v2.2 link advisor. We will disable or have the older article I found redirect to the version Zhang cited so others only find the current version.

Let's back up a little, because there are two different things going on here, which interact somewhat and are confusing you.

1) You need to tell the Intel compiler front end to compile OpenMP code. That's achieved by using -openmp flag. If you compile code with OpenMP pragmas without the -openmp flag to the compiler then you'll get the "unrecognised pragma" errors. 

2) At link time you need to include the OpenMP runtime library. You can do that either by taking control yourself and explicitly adding -liomp5 to the link command, or, if you are using the compiler driver to control the linker, passing -openmp will cause the compiler driver to add -liomp5 to the link commands itself.


1) <omp.h> is the correct header for OpenMP interface functions (and I think it's harmless even if you don't enable OpenMP), but to have the compiler understand the OpenMP pragmas you need to tell it to do that, which you achieve with the -openmp flag. (Similarly, with gcc you need the -fopenmp flag.)

2) You need to link OpenMP code against libiomp5, but can achieve that either explicitly or implicitly. 

Using -v to see the link command generated by the compiler driver is often useful to understand what's going on.
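A quick way to check which situation a build is in is to test the standard `_OPENMP` macro, which a compiler defines only when its OpenMP mode is switched on. The sketch below is an illustration with hypothetical function names (`openmp_enabled`, `count_threads`); it guards the `<omp.h>` include only as a precaution.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>   // OpenMP runtime API, pulled in only when OpenMP mode is on
#endif

// True when this translation unit was compiled with OpenMP enabled.
bool openmp_enabled() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Counts the threads that actually execute the parallel region.
// Without OpenMP enabled, the pragmas are ignored and the block runs once.
int count_threads() {
    int n = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        ++n;
    }
    assert(n >= 1);
    return n;
}
```

If `openmp_enabled()` returns false, the pragmas were not honored and the code runs sequentially, which matches the "unrecognised pragma" symptom described above.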


Hi guys,

       You've all been very helpful so far. Now I have a conceptual question. I'm trying to optimize the value of a number using the popular optimization library NLOpt. Computing this number requires an array which is currently computed by a single function written by me - let's call it computeCovarMatrix. This function, computeCovarMatrix, that I'm trying to offload takes three heap allocated arrays, a whole mess of heap allocated workspace arrays (allocated once before calls to a minimization routine - NLOpt), and produces a single output array - the required covariance matrix. At the moment, I have a struct that holds pointers to all these arrays. A function, allocateMem, heap allocates the correct sized arrays and maps the appropriate pointers (member pointers of the struct) to these arrays. After this, I just pass the struct around where required. So my target function, computeCovarMatrix, takes this struct as an input and overwrites the workspace and output arrays leaving the three original input arrays untouched. Since I allocate+setup all the required arrays before the NLOpt call, everything gets allocated just once.

My question is - what's the best way to offload the computation in computeCovarMatrix asynchronously to my two xeon phi cards so I can run the function simultaneously on both the phi cards as well as the host CPUs? Here are a few options that I came up with -

1. Use offload_transfer to send the 3 input arrays to the Phis with alloc_if(1), free_if(0). Then call the function, allocateMem, to heap allocate all the remaining arrays locally on the card. I would then call the NLOpt minimization routine from the host CPU, but then will the previously allocated arrays persist on the cards? In other words, if I call _mm_malloc inside a #pragma offload region, what is the lifetime of heap allocated variables created inside the #pragma offload? Do they persist across offload calls?

2. Use offload_transfer to send all the required arrays with alloc_if(1), free_if(0) before the NLOpt call. Then all the arrays should be persistent and can be used by my computeCovarMatrix routine when it's called from 

Which option sounds more correct? Is there a better method?
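For the asynchronous part of the question, a rough sketch of splitting work between two cards and the host with the `signal` clause and `#pragma offload_wait` might look like the following. Everything here is an assumption for illustration: `chunk_sum` and `sum_on_all_devices` are hypothetical stand-ins for computeCovarMatrix, the even three-way split is arbitrary, and the sketch has not been tested on a coprocessor. The `__INTEL_OFFLOAD` guards only let it build as plain host code where offload is unavailable.

```cpp
#include <cassert>

#ifdef __INTEL_OFFLOAD
#define TARGET_MIC __declspec(target(mic))
#else
#define TARGET_MIC   // builds as ordinary host code without offload support
#endif

// Hypothetical stand-in for the real per-chunk computation.
TARGET_MIC double chunk_sum(long begin, long end) {
    double s = 0.0;
    for (long i = begin; i < end; ++i) {
        s += static_cast<double>(i);
    }
    return s;
}

// Splits [0, n) three ways: card 0, card 1, and the host.
double sum_on_all_devices(long n) {
    assert(n >= 0);
    const long a = n / 3;
    const long b = 2 * (n / 3);
    double s0 = 0.0, s1 = 0.0;
    char sig0 = 0, sig1 = 0;   // only their addresses are used as completion tags

#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) signal(&sig0) in(a) out(s0)
#endif
    { s0 = chunk_sum(0, a); }

#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:1) signal(&sig1) in(a, b) out(s1)
#endif
    { s1 = chunk_sum(a, b); }

    // With offload enabled, the host computes its share while the cards run.
    const double s_host = chunk_sum(b, n);

#ifdef __INTEL_OFFLOAD
    #pragma offload_wait target(mic:0) wait(&sig0)
    #pragma offload_wait target(mic:1) wait(&sig1)
#endif
    return s0 + s1 + s_host;
}
```

The results from the cards should only be read after the matching wait, which is why the final sum comes last.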

The memory allocated on the card persists until you call free. You just need a way to reach the pointer to this memory: declare a global pointer with __declspec(target(mic)) and assign the malloc'ed memory to it.

Also, if this pointer is used in the lexical scope of the offload pragma, make sure to use nocopy so the pointer value on the card does not get clobbered.

I forgot to say: use the 1st option with the method I described.
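A minimal sketch of that method, under stated assumptions: `g_workspace`, `g_workspace_len`, and the three functions are hypothetical names, only card 0 is used, and the code has not been tested on a coprocessor. The `__INTEL_OFFLOAD` guards let it build and run as plain host code where offload is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __INTEL_OFFLOAD
#define TARGET_MIC __declspec(target(mic))
#else
#define TARGET_MIC   // builds as ordinary host code without offload support
#endif

// Globals that also exist on the card; the card-side copies keep their values
// between offload regions.
TARGET_MIC double* g_workspace = nullptr;
TARGET_MIC std::size_t g_workspace_len = 0;

// Allocate once on the card; nocopy keeps the host values from overwriting
// the card-side pointer and length.
void allocate_on_card(std::size_t n) {
    assert(n > 0);
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(n) nocopy(g_workspace, g_workspace_len)
#endif
    {
        g_workspace = static_cast<double*>(std::malloc(n * sizeof(double)));
        g_workspace_len = n;
    }
}

// A later offload reuses the buffer allocated by the earlier one.
double fill_and_sum(double value) {
    double sum = 0.0;
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(value) out(sum) nocopy(g_workspace, g_workspace_len)
#endif
    {
        sum = 0.0;   // out() does not copy the host value in
        for (std::size_t i = 0; i < g_workspace_len; ++i) {
            g_workspace[i] = value;
            sum += g_workspace[i];
        }
    }
    return sum;
}

// Release the card-side buffer when the optimization is finished.
void free_on_card() {
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) nocopy(g_workspace, g_workspace_len)
#endif
    {
        std::free(g_workspace);
        g_workspace = nullptr;
        g_workspace_len = 0;
    }
}
```

In the scenario above, the allocation call would happen once before the NLOpt call, the compute call once per objective evaluation, and the free call after the minimization returns.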

I have met a similar error. I used OpenMP in the offload region. The compiler is mpiicpc. The error occurs when the code runs. I have tried the solutions mentioned above, but the error is still not solved.

Here I give the error. How can I solve it?

On the remote process, dlopen() failed. The error message sent back from the sink is /var/volatile/tmp/coi_procs/1/53421/load_lib/icpcoutpZ6UOg: undefined symbol: _ZN9__gnu_cxx17__normal_iteratorIPlSt6vectorIlSaIlEEEppEi

offload error: cannot load library to the device 0 (error code 20)

On the sink, dlopen() returned NULL. The result of dlerror() is "/var/volatile/tmp/coi_procs/1/53421/load_lib/icpcoutpZ6UOg: undefined symbol: _ZN9__gnu_cxx17__normal_iteratorIPlSt6vectorIlSaIlEEEppEi"
