Cryptic error msg: offload error: cannot offload to MIC - device is not available


Mikalai Kisialiou (Intel)

"offload error: cannot offload to MIC - device is not available"

Could anyone please help me understand what this error message means?

The code attempts to offload execution using "pragma offload". The sample code was posted by Naik Sumedh in the thread below:
http://software.intel.com/en-us/forums/topic/369458

Compilation succeeds:
$ source ${path_to_icc}/compilervars.sh intel64
$ icc test1.cpp -o test1

The KNC card is up and running:
$ /usr/sbin/micctrl -w
mic0: online

SSH to the MIC works all right, and I can run native code on the KNC card. Yet when I attempt to run the executable with offload pragmas on the host I get:
$ ./test1
offload error: cannot offload to MIC - device is not available

Any ideas why the device appears to be unavailable?

Thanks,
Nick

Kevin Davis (Intel)

The test case includes the explicit card expression "mic:0", so the error is produced if that card is unavailable. It indicates that the offload run-time was told the card is not available. Can you show which compiler version you have, using icc -V?

Can you set the environment variable using export OFFLOAD_REPORT=3 and run test1 and post any output? If you have version 2013.1.117 you may not receive output. The latest 2013.2.146 (Update 2) may be needed to see OFFLOAD_REPORT output that shows details other than timing info.

Sumedh Naik (Intel)

Hi Mikalai, 

The code works for me. I used icc version 13.1.0 (Composer_xe_2013.2.146). Just as Kevin suggested, OFFLOAD_REPORT=3 would give you a better idea of what's going on. You could also try restarting the MPSS to see if that makes the problem go away. 

-Sumedh

Mikalai Kisialiou (Intel)

Kevin and Sumedh,

I appreciate your help!

The compiler I used yesterday was an older version that doesn't support OFFLOAD_REPORT:
$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.0.079 Build 20120731
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.

That said, I'm not sure if it's really the compiler or the card that was misbehaving.
Today I have moved to another server that has a newer compiler installed:

> icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.0.146 Build 20130121
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

With this compiler and another KNC card Sumedh's offload example seems to work alright:

> ./test3
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0

I will investigate the KNC card differences later, since I have to share servers with other developers who also need to work. For now, I would appreciate it if you could explain what the [5:5] syntax means in the following pragma from that example:

  #pragma offload target(mic:0) \
    in(my_array[5:5] : into(my_array[5:5]) \
    alloc_if(0) free_if(0))

Since I need to update 5th through 9th value in the array, isn't it supposed to be this?

  #pragma offload target(mic:0) \
    in(my_array[5:9] : into(my_array[5:9]) \
    alloc_if(0) free_if(0))

Is there any reference documentation that documents all these useful pragmas, their meaning, and the underlying COI library? For example, I would like to find out answers to the following questions:

+ How much overhead do I introduce by using "#pragma offload" compared to using COI library calls?

+ Is there any documentation for the COI API? I've found documentation for the SCIF interface, but it is too low-level. I don't think that application developers want to deal explicitly with cache lines and memory layouts. It's like programming in assembly. My understanding is that the COI interface simplifies this quite a bit. Am I right?

+ If only 5% of my array's entries, at random positions, need to be updated and I update them one by one using syntax like my_array[5:5], am I going to be swamped by overhead? My application area is large-scale circuit simulation, where the array stores a huge sparse 2D matrix. Only a small percentage of entries change from one simulation step to the next, but the location of the changes depends on circuit activity and is therefore unpredictable in advance. Would one-by-one pragmas suffice for such a data transfer, or should I program in COI or SCIF?

I really appreciate your expert advice on this.
Thank you!
Nick

Kevin Davis (Intel)

I cannot speak with any authority about COI/SCIF. That’s always been invisible from my compiler/offload perspective.

The syntax you cited uses an array slice, which is discussed in the User Guide (UG) under the offload pragma. The syntax of the array slice is: variable-name [ <starting element> : <element count> ]

To update the 5th to 9th values in the array, you would use my_array[4:5]: start at element 4, since my_array[0] is the 1st value, and transfer 5 elements.

I do not know whether the partial transfers would be swamped by overhead. I believe that retaining the memory and only updating values avoids some of the associated allocation/deallocation overhead. I can ask development about this.

Is it possible to move the code updating the 5% into the offload execution also so you might only need to upload the full array once at the beginning and perhaps download once at the end?

Kevin Davis (Intel)

Here are the comments I received from the compiler developer:

The example shows alloc_if(0)/free_if(0) so no new memory is allocated, so no overhead. Only the “length” amount of data is transferred.

There won’t be a significant difference in data transfer time between [the] compiler and COI.

Transferring small amounts of data, especially single elements, will give poor performance results overall.

(In reply to my asking whether the virtual shared model might be better suited) Yes, he might be better off using _Cilk_shared, if the program is written in C. Fortran is not supported at all, and C++ can lead to complications with separating the shared and non-shared data.
