Debugging COI Applications for Intel® Xeon Phi™ Coprocessors

Introduction

The Coprocessor Offload Infrastructure (COI) library is designed for communication between processes on the host and processes on Intel® Xeon Phi™ coprocessors. In COI terminology, one process is the source and the other is the sink. The communication channel between them is a pipeline initiated from the source to the sink. The source and sink are two binary executables, each compiled and built for its own architecture. The source process is responsible for launching the sink process on the coprocessor through COI API calls.

This blog is written to help developers analyze and debug COI errors encountered while executing applications that use the COI API for offloading to Intel® Xeon Phi™ coprocessors. It explains the methods and tools a developer can use to trace a COI error and get meaningful information about it. However, this blog does not explain how to build and run COI applications. For detailed COI API documentation and for the steps to build and run COI applications, refer to the COI API Reference Manual and the coi_getting_started guide respectively, both of which are included in the Intel® MPSS installation package.

By default, COI is installed in the following locations:

/usr/share/doc/intel-coi-<version>            COI API Reference Manual, COI getting started guide, and the release notes
/usr/include                                  Include files required to build COI applications
/usr/share/doc/intel-coi-<version>/tutorials  Code samples that can be helpful for learning how to write COI applications
/usr/bin                                      COI tools to assist in development
/usr/lib64                                    COI shared libraries needed to build COI applications

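The rest of the blog is organized into the following sections, each of which explains a COI debugging/tracing method in detail:

  1. Getting Error Information Using COIRESULT
  2. Inspecting the Automatically Produced COI Log File
  3. Trace Libraries Loaded Using SINK_LD_TRACE_LOADED_OBJECTS Environment Variable
  4. Using coitrace to assist with debugging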

Getting Error Information Using COIRESULT

COI uses COIRESULT for its error reporting. The form of the error message varies depending on the function that received and checked the COIRESULT value. However, the message usually takes the form:

{function that checked for COIRESULT} with {COIRESULT mnemonic}

There are a couple of ways to check and report a COIRESULT value, as the following examples show (see footnote 1):

Example 1:

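#include <stdio.h>
#include <intel-coi/source/COIProcess_source.h>
#include <intel-coi/source/COIEngine_source.h>

int main(void)
{
    COIRESULT result = COI_ERROR;
    COIENGINE engine;

    /* Get a handle to the first coprocessor (engine index 0) */
    result = COIEngineGetHandle(COI_ISA_MIC, 0, &engine);

    if (result != COI_SUCCESS)
    {
        /* COIResultGetName converts the result value to its mnemonic */
        printf("COIEngineGetHandle result %s\n", COIResultGetName(result));
        return -1;
    }

    return 0;
}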

Example 2: 

#include <stdio.h>
#include <intel-coi/source/COIProcess_source.h>
#include <intel-coi/source/COIEngine_source.h>

/* Evaluate a COI call once; on failure, print the call text and the
   COIRESULT mnemonic, then return -1 from the enclosing function. */
#define CHECK_RESULT(_COIFUNC) \
do { \
    COIRESULT result = _COIFUNC; \
    if (result != COI_SUCCESS) \
    { \
        printf("%s returned %s\n", #_COIFUNC, COIResultGetName(result)); \
        return -1; \
    } \
} while (0)

int main(void)
{
    COIENGINE engine;

    /* Now every call to a COI API function can be wrapped by CHECK_RESULT */
    CHECK_RESULT(COIEngineGetHandle(COI_ISA_MIC, 0, &engine));

    return 0;
}

The names and basic meanings of all possible COIRESULT values are given in the header file COIResult_common.h (default location: /usr/include/intel-coi/common). They are also listed here, with a remark on why each error might occur.

Error code  Offload Error                     Remark
0           COI_SUCCESS                       The function succeeded without error
1           COI_ERROR                         Unspecified error
2           COI_NOT_INITIALIZED               The function was called before the system was initialized
3           COI_ALREADY_INITIALIZED           The function was called after the system was initialized
4           COI_ALREADY_EXISTS                Cannot complete the request due to the existence of a similar object
5           COI_DOES_NOT_EXIST                The specified object was not found
6           COI_INVALID_POINTER               One of the addresses provided was not valid
7           COI_OUT_OF_RANGE                  One of the arguments contains a value that is invalid
8           COI_NOT_SUPPORTED                 This function is not currently supported as used
9           COI_TIME_OUT_REACHED              The specified time out caused the function to abort
10          COI_MEMORY_OVERLAP                The source and destination range specified overlaps for the same buffer
11          COI_ARGUMENT_MISMATCH             The specified arguments are not compatible
12          COI_SIZE_MISMATCH                 The specified size does not match the expected size
13          COI_OUT_OF_MEMORY                 The function was unable to allocate the required memory
14          COI_INVALID_HANDLE                One of the handles provided was not valid
15          COI_RETRY                         This function currently can't complete, but might be able to later
16          COI_RESOURCE_EXHAUSTED            The resource was not large enough
17          COI_ALREADY_LOCKED                The object was expected to be unlocked, but was locked
18          COI_NOT_LOCKED                    The object was expected to be locked, but was unlocked
19          COI_MISSING_DEPENDENCY            One or more dependent components could not be found
20          COI_UNDEFINED_SYMBOL              One or more symbols the component required was not defined in any library
21          COI_PENDING                       Operation is not finished
22          COI_BINARY_AND_HARDWARE_MISMATCH  A specified binary will not run on the specified hardware
23          COI_PROCESS_DIED                  One of the COI processes died
24          COI_INVALID_FILE                  The file is invalid for its intended usage in the function
25          COI_EVENT_CANCELED                Event wait on a user event that was unregistered or is being unregistered returns this error
26          COI_VERSION_MISMATCH              The version of Intel® Coprocessor Offload Infrastructure on the host is not compatible with the version on the device
27          COI_BAD_PORT                      The port that the host is set to connect to is invalid
28          COI_AUTHENTICATION_FAILURE        The daemon was unable to authenticate the user that requested an engine. Only reported if daemon is set up for authorization
29          COI_NUM_RESULTS                   Reserved, do not use
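
When an error code shows up only as a number, for instance in a log or in a debugger, it helps to translate it back to its mnemonic. In a program that includes the COI headers, COIResultGetName already does this. The sketch below is a standalone alternative that simply mirrors the table above, so it builds without the COI headers. The names coi_result_names, coi_code_to_name, and coi_name_to_code are hypothetical helpers and not part of the COI API, and the numbering assumes that the table matches the COIResult_common.h installed on your system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mnemonics in numeric order, copied from the table above (codes 0..28).
 * COI_NUM_RESULTS (29) is reserved and deliberately left out.
 * Assumption: this order matches your installed COIResult_common.h. */
static const char *const coi_result_names[] = {
    "COI_SUCCESS",            "COI_ERROR",
    "COI_NOT_INITIALIZED",    "COI_ALREADY_INITIALIZED",
    "COI_ALREADY_EXISTS",     "COI_DOES_NOT_EXIST",
    "COI_INVALID_POINTER",    "COI_OUT_OF_RANGE",
    "COI_NOT_SUPPORTED",      "COI_TIME_OUT_REACHED",
    "COI_MEMORY_OVERLAP",     "COI_ARGUMENT_MISMATCH",
    "COI_SIZE_MISMATCH",      "COI_OUT_OF_MEMORY",
    "COI_INVALID_HANDLE",     "COI_RETRY",
    "COI_RESOURCE_EXHAUSTED", "COI_ALREADY_LOCKED",
    "COI_NOT_LOCKED",         "COI_MISSING_DEPENDENCY",
    "COI_UNDEFINED_SYMBOL",   "COI_PENDING",
    "COI_BINARY_AND_HARDWARE_MISMATCH",
    "COI_PROCESS_DIED",       "COI_INVALID_FILE",
    "COI_EVENT_CANCELED",     "COI_VERSION_MISMATCH",
    "COI_BAD_PORT",           "COI_AUTHENTICATION_FAILURE"
};

#define COI_RESULT_NAME_COUNT \
    (sizeof(coi_result_names) / sizeof(coi_result_names[0]))

/* Hypothetical helper: numeric code -> mnemonic, or NULL if out of range. */
const char *coi_code_to_name(int code)
{
    assert(COI_RESULT_NAME_COUNT == 29); /* codes 0..28 from the table */
    if (code < 0 || (size_t)code >= COI_RESULT_NAME_COUNT)
        return NULL;
    return coi_result_names[code];
}

/* Hypothetical helper: mnemonic -> numeric code, or -1 if unknown. */
int coi_name_to_code(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < COI_RESULT_NAME_COUNT; i++) {
        if (strcmp(name, coi_result_names[i]) == 0)
            return (int)i;
    }
    return -1;
}
```

For example, coi_code_to_name(19) yields COI_MISSING_DEPENDENCY, and coi_name_to_code("COI_PROCESS_DIED") yields 23, as in the table.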

 


Inspecting the Automatically Produced COI Log File

Sometimes, having an accurate error code doesn’t necessarily make a problem clear. For example, if COIProcessCreateFromFile returns COI_MISSING_DEPENDENCY, this indicates that a dynamic library needed by the executable could not be found in the source or sink file systems, but it does not say which library. If the debug version of the COI library is used, however, more information may be available in the automatically produced log file. This file is named <executable>.coilog, where <executable> is the name of the source executable. It is located in the directory that was current when the application was launched.

In order to use the debug version of the COI library, you will have to extract and compile the COI library from the source provided with your version of Intel® MPSS.

To compile the debug version of the COI library, follow these steps:

  1. If you have not already done so, download the mpss-src-<MPSS-version>.tar file from the Intel® MPSS webpage and extract it
    tar -xf mpss-src-<MPSS-version>.tar
  2. Extract the MPSS COI source
    cd mpss-<MPSS-version>/src
    tar -xjf mpss-coi-<MPSS-version>.tar.bz2
  3. Compiling a debug version of the COI library requires some of the metadata files to be present in the /usr/include directory. If they are not already present, extract the provided mpss-metadata-<MPSS-version>.tar.bz2 source file and copy the required files
    tar -xjf mpss-metadata-<MPSS-version>.tar.bz2
    cp mpss-metadata-<MPSS-version>/mpss-metadata.c /usr/include/.
    cp mpss-metadata-<MPSS-version>/mpss-metadata.mk /usr/include/.
  4. From the extracted mpss-coi-<MPSS-version> directory you can compile and install the COI library as follows:
    make debug                              # Builds the debug COI library in the build directory
    make debug-install-host                 # Installs the debug version of the COI library on the host
    make debug-install-sdk                  # Installs the required SDK files
  5. To install these new binaries and libraries on the coprocessor, you will need to overwrite the card’s COI library (done manually for each coprocessor card)
    scp build/device-linux-debug/libcoi_device.so mic0:/usr/lib64/libcoi_device.so.0
    ssh mic0 "/etc/init.d/coi stop"
    scp build/device-linux-debug/coi_daemon mic0:/usr/bin/coi_daemon
    ssh mic0 "/etc/init.d/coi start"

Once the debug version of the COI library is installed, an <executable>.coilog file will be created whenever the application is launched. In the event of an error, <executable>.coilog will be populated with an entry like the following:

[SOURCE][0xfffffffe][3484974483003000][..\..\mechanism\proxy\uproxy_host.cpp:185][COILOG_LEVEL_ERROR][COIProxy::WorkerThread]: Error: scif_recv failed: 108

where:

[SOURCE] indicates whether the error occurred on the source (host) or the sink (coprocessor)
[0xfffffffe] is the pthread ID in hexadecimal
[3484974483003000] is the timestamp of the event (tick count)
[..\..\mechanism\proxy\uproxy_host.cpp:185] is the source file and line number
Error: scif_recv failed: 108 gives the error number, which corresponds to an entry in the /usr/include/asm-generic/errno*.h header files. For example, 108 corresponds to ESHUTDOWN (Cannot send after transport endpoint shutdown)
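
Looking up the trailing error number by hand gets tedious when a log has many entries. The following is a minimal sketch, assuming that the error number is the last colon-separated field of the entry, as in the sample above; other coilog entries may be laid out differently. The functions coilog_trailing_errno and coilog_errno_text are hypothetical helpers and not part of COI. They rely only on the standard C library, where strerror turns an errno value into its message.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the non-negative number that follows the
 * last ':' in a log entry, or -1 if the entry does not end in a number. */
int coilog_trailing_errno(const char *line)
{
    assert(line != NULL);

    const char *colon = strrchr(line, ':');
    if (colon == NULL)
        return -1;

    char *end = NULL;
    long value = strtol(colon + 1, &end, 10);
    if (end == colon + 1)
        return -1;                      /* no digits after the colon */

    /* Allow trailing whitespace (such as a newline), nothing else. */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0' || value < 0 || value > INT_MAX)
        return -1;

    return (int)value;
}

/* Hypothetical helper: message for the trailing error number of an
 * entry, or NULL if the entry carries none. The text returned by
 * strerror depends on the C library of the machine running this code. */
const char *coilog_errno_text(const char *line)
{
    int err = coilog_trailing_errno(line);
    return (err < 0) ? NULL : strerror(err);
}
```

Applied to the sample entry above, coilog_trailing_errno returns 108. On a Linux host, where 108 is ESHUTDOWN, coilog_errno_text should then return the corresponding message; the exact wording comes from the C library.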

Trace Libraries Loaded Using SINK_LD_TRACE_LOADED_OBJECTS Environment Variable

If the environment variable SINK_LD_TRACE_LOADED_OBJECTS is set to a non-empty value, it changes the behavior of the COIProcessCreate* APIs. Instead of creating the process, coi_daemon prints information about the libraries it is loading to standard output (stdout). If all the dynamic dependencies are found, the API returns COI_NOT_INITIALIZED. The COIProcess is never actually created when this environment variable is set; it is meant solely as a debugging aid.

One scenario where this is useful is when a binary was built on a system that had all the needed libraries but has to run on a completely different system with different environment settings. In this case, SINK_LD_TRACE_LOADED_OBJECTS lets you verify that your environment is configured correctly before you attempt to launch your application.

To use the environment variable SINK_LD_TRACE_LOADED_OBJECTS, follow these steps:

  1. Since the information about which libraries are loaded originates from coi_daemon, its output must be redirected to the console rather than to /dev/null (the default). To do this, restart coi_daemon on the coprocessor as follows:
    [user@host ~] ssh mic0                             # ssh directly into the coprocessor
    [user@host-mic0 ~] /etc/init.d/coi stop           # stop coi_daemon if it is already running
    [user@host-mic0 ~] coi_daemon --= &              # restart coi_daemon with prints redirected to stdout (console)
  2. Now, in a different shell on the host, run your COI application with the environment variable SINK_LD_TRACE_LOADED_OBJECTS set to a non-empty value. For example, the command below sets the variable to 1 and runs the sample COI application on the host. If no dependency is missing, the output looks like this:

    [user@host release] SINK_LD_TRACE_LOADED_OBJECTS=1 ./coi_simple_source_host
    //output
    2 engines available
    Got engine handle
    COIProcessCreateFromFile( engine, SINK_NAME, 0, NULL, false, NULL, false, NULL, 0, NULL, &proc ) returned COI_NOT_INITIALIZED
    
  3. Now check the console on mic0, where coi_daemon prints information about the loaded libraries. In the sample output below, the coi_device library is missing on the device, so coi_daemon reports a dynamic dependency check failure:
    [user@host-mic0 ~]
    COI_DAEMON is trying to create a process 'coi_simple_sink_mic' using the following files:
     
            <SOURCE>:       /home/slgogar/COI_TEST/release/coi_simple_sink_mic
            <SINK>: libstdc++.so.6
            <SINK>: libm.so.6
            <SINK>: libgcc_s.so.1
            <SINK>: libc.so.6
            <FAIL>: libcoi_device.so.0
      dynamic dependency check failed on 1 libraries. COIRESULT= COI_MISSING_DEPENDENCY
            libcoi_device.so.0
      process create ending abnormally
    
  4. Once the environment settings are all verified, restart the coi_daemon on the coprocessor in its default settings as follows:
    [user@host-mic0 ~] /etc/init.d/coi stop
    [user@host-mic0 ~] /etc/init.d/coi start
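
When the sink binary depends on many libraries, the failed entries can be picked out of the coi_daemon output automatically. The following is a minimal sketch, assuming that the output has been captured to a file or stream and that missing libraries are marked with a <FAIL>: prefix exactly as in the sample output of step 3. The functions daemon_trace_failed_library and daemon_trace_report_failures are hypothetical helpers, not part of COI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if `line` is a "<FAIL>: <library>" entry, copy the
 * library name into `out` and return 1; otherwise return 0. */
int daemon_trace_failed_library(const char *line, char *out, size_t out_size)
{
    static const char tag[] = "<FAIL>:";

    assert(line != NULL && out != NULL && out_size > 0);

    const char *p = strstr(line, tag);
    if (p == NULL)
        return 0;

    p += sizeof(tag) - 1;               /* skip the tag itself */
    while (*p == ' ' || *p == '\t')     /* skip padding before the name */
        p++;

    size_t len = strcspn(p, " \t\r\n"); /* name ends at whitespace */
    if (len == 0 || len >= out_size)
        return 0;

    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}

/* Hypothetical helper: read captured coi_daemon output from `in`, write
 * one line per missing library to `report`, and return how many failed. */
int daemon_trace_report_failures(FILE *in, FILE *report)
{
    char line[1024];
    char name[256];
    int failures = 0;

    assert(in != NULL && report != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        if (daemon_trace_failed_library(line, name, sizeof name)) {
            fprintf(report, "missing on sink: %s\n", name);
            failures++;
        }
    }
    return failures;
}
```

Calling daemon_trace_report_failures(stdin, stdout) from a small main function lets you pipe the captured console output through it. For the sample output in step 3 it would report libcoi_device.so.0 as the single missing library.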

     


Using coitrace to assist with debugging

Included in the installation package is a tool called coitrace. This trace utility operates similarly to Unix*-style tools like strace* and shows all of the COI API invocations and their input parameters, which helps you see which COI calls an application executes while you are debugging it. To see a complete list of options, run:

coitrace -h

To use coitrace, simply execute your program through coitrace. For example, without coitrace the hello_world sample executes as follows:

[user@hostname release]# ./hello_world_source_host 
2 engines available
Got engine handle
Sink process created, press enter to destroy it.
Hello from the sink!
 
Sink process returned 0
Sink exit reason SHUTDOWN OK

When the hello_world sample is run through coitrace, the output also includes the function arguments, thread ID, and return values of each COI function call:

[user@hostname release]$ coitrace ./hello_world_source_host 
COIEngineGetCount [ThID:0x7fbc167d5740]
        in_ISA = COI_ISA_MIC
        out_pNumEngines = 0x7fff963b8698 0x00000002 (hex) : 2 (dec)
 
2 engines available
COIEngineGetHandle [ThID:0x7fbc167d5740]
        in_ISA = COI_ISA_MIC
        in_EngineIndex = 0x00000000 (hex) : 0 (dec)
        out_pEngineHandle = 0x7fff963b8680 0x7fbc16a73d60
 
Got engine handle
COIProcessCreateFromMemory [ThID:0x7fbc167d5740]
        in_Engine = 0x7fbc16a73d60
        in_pBinaryName = hello_world_sink_mic
        in_pBinaryBuffer = 0x7fbc167ec000
        in_BinaryBufferLength = 0x000000000000288f (hex) : 10383 (dec)
        in_Argc = 0
        in_ppArgv = 0
        (bool) in_DupEnv = false
        in_ppAdditionalEnv = 0
        (bool) in_ProxyActive = true
        in_Reserved = (nil)
        in_BufferSpace = 0x0000000000000000 (hex) : 0 (dec)
        in_LibrarySearchPath = (nil)
        in_FileOfOrigin = hello_world_sink_mic
        in_FileOfOriginOffset = 0x0000000000000000 (hex) : 0 (dec)
        out_pProcess = 0x7fff963b8688 0x1802a60
 
COIProcessCreateFromFile [ThID:0x7fbc167d5740]
        in_Engine = 0x7fbc16a73d60
        in_pBinaryName = hello_world_sink_mic
        in_Argc = 0
        in_ppArgv = 0
        (bool) in_DupEnv = false
        in_ppAdditionalEnv = 0
        (bool) in_ProxyActive = true
        in_Reserved = (nil)
        in_BufferSpace = 0x0000000000000000 (hex) : 0 (dec)
        in_LibrarySearchPath = (nil)
        out_pProcess = 0x7fff963b8688 0x1802a60

Sink process created, press enter to destroy it.
Hello from the sink!
 
COIProcessDestroy [ThID:0x7fbc167d5740]
        in_Process = 0x1802a60
        in_WaitForMainTimeout = -1
        (bool) in_ForceDestroy = false
        out_pProcessReturn = 0x7fff963b869f 
Sink process returned 0
Sink exit reason SHUTDOWN OK
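
For longer traces, it can be handy to reduce the coitrace output to just the sequence of API calls. The following is a minimal sketch, assuming that the trace has been captured to a file or stream and that every call header has the form shown above, that is, a line starting with the COI function name followed by " [ThID:". The functions coitrace_call_name and coitrace_count_calls are hypothetical helpers, not part of the COI tools.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if `line` is a call header such as
 * "COIEngineGetHandle [ThID:0x...]", copy the function name into `out`
 * and return 1; otherwise return 0. */
int coitrace_call_name(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL && out_size > 0);

    /* Parameter lines are indented and application output does not
     * carry the thread marker, so both are rejected here. */
    if (strncmp(line, "COI", 3) != 0)
        return 0;
    const char *marker = strstr(line, " [ThID:");
    if (marker == NULL)
        return 0;

    size_t len = (size_t)(marker - line);
    if (len >= out_size)
        return 0;

    memcpy(out, line, len);
    out[len] = '\0';
    return 1;
}

/* Hypothetical helper: print each traced call name, one per line, to
 * `report` and return the number of calls seen in `in`. */
int coitrace_count_calls(FILE *in, FILE *report)
{
    char line[1024];
    char name[128];
    int calls = 0;

    assert(in != NULL && report != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        if (coitrace_call_name(line, name, sizeof name)) {
            fprintf(report, "%s\n", name);
            calls++;
        }
    }
    return calls;
}
```

Calling coitrace_count_calls(stdin, stdout) from a small main function and feeding it the captured trace lists the calls in order, which makes it easier to spot the last COI call made before a failure.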

 


Conclusion

At this point, you should have a better understanding of how to analyze and debug COI API errors. Depending on the complexity of your application, you might have to use several methods and tools in combination to track down a COI API error. Moreover, if the COI application is correctly linked with the debug version of the COI library, debuggers (like GDB) can read the debug symbols and provide useful information relevant to the error.

 

1 The code snippets are extracted from the COI tutorials (sample examples) provided with the Intel® MPSS installation. By default, after Intel® MPSS installation, the sample programs are copied to the /usr/share/doc/intel-coi-<version>/tutorials directory.

