Behind the Scenes: Offload Memory Management on the Intel® Xeon Phi™ coprocessor

For a lot of people out there, offload is a mysterious construct that magically switches context to the Intel Xeon Phi coprocessor and runs the code there without the developer getting their hands dirty. Although this black-box approach is an excellent idea for the most part, the developer still needs some understanding of what goes on behind the scenes in an offload. This particular thread on the forum is what caught my eye and prompted this blog. In this blog, I will try to demystify some aspects of offload memory management.

A common question which many of you have is “How does an offload map the host data over to the coprocessor?” The answer is really simple: in an offload, the runtime uses a hash-table-like data structure to map the data from the host to the coprocessor. You can imagine each entry in this hash table as a key-value pair. The base address of the data on the host is used as the key, whereas the corresponding base address on the coprocessor is stored as the value. The table also has a third field indicating the length of the array. Every time you allocate memory on the coprocessor using an offload pragma (either by default or by using the alloc_if modifier), an entry is added to the hash table. In addition, the runtime automatically adds all global variables declared with __attribute__((target(mic))) or __declspec(target(mic)) to the table when the program is loaded. Similarly, when you free data on the coprocessor (by default or by free_if), the corresponding entry is deleted from the table. Because of this mapping, two pointers on the host that point to the same location will also point to the same location on the coprocessor.
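
To make the mapping concrete, here is a toy model of such a table in plain C. It is only a sketch of the behavior described above: the real runtime's data structure is internal, and every name here (map_entry, map_add, map_find, map_remove, demo_aliasing) as well as the simulated coprocessor addresses are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: the real offload runtime's table is internal.
   All names and addresses below are hypothetical. */
#define MAX_ENTRIES 16

typedef struct {
    const void *host_base; /* key: base address on the host */
    uintptr_t   mic_base;  /* value: base address on the coprocessor (simulated) */
    size_t      length;    /* third field: length of the array */
    int         used;      /* 1 if this slot holds a live entry */
} map_entry;

static map_entry table[MAX_ENTRIES];

/* Look up the entry for a host base address; NULL if there is none. */
map_entry *map_find(const void *host_base)
{
    int k;
    for (k = 0; k < MAX_ENTRIES; k++)
        if (table[k].used && table[k].host_base == host_base)
            return &table[k];
    return NULL;
}

/* Models an allocation (alloc_if(1)). Returns 0 for a new entry, 1 if an
   entry with the same key was overwritten, -1 if the toy table is full. */
int map_add(const void *host_base, uintptr_t mic_base, size_t length)
{
    map_entry *e = map_find(host_base);
    int overwritten = (e != NULL);
    int k;
    for (k = 0; e == NULL && k < MAX_ENTRIES; k++)
        if (!table[k].used)
            e = &table[k];
    if (e == NULL)
        return -1;
    e->host_base = host_base;
    e->mic_base = mic_base;
    e->length = length;
    e->used = 1;
    return overwritten;
}

/* Models a free (free_if(1)). Returns 1 if an entry was removed, else 0. */
int map_remove(const void *host_base)
{
    map_entry *e = map_find(host_base);
    if (e == NULL)
        return 0;
    e->used = 0;
    return 1;
}

/* Two host pointers to the same array resolve to the same entry. */
void demo_aliasing(void)
{
    static int a[10];
    int *p = a, *q = a;
    assert(map_add(p, 0x1000, 10) == 0);
    assert(map_find(q) != NULL && map_find(q)->mic_base == 0x1000);
    assert(map_remove(q) == 1);
    assert(map_find(p) == NULL);
}
```

Because the key is the host base address, looking up through q finds the entry that was added through p, which is the aliasing behavior mentioned at the end of the paragraph above.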

Well, so far so good, right? Now, let’s take a look at three common mistakes that can leave you with an error and no clue why you got it.

My Mistake #1: Take a look at the following code.

__attribute__((target(mic)))int *i,*j; 
#pragma offload target(mic:0) nocopy(i:length(10) alloc_if(1) free_if(0)) nocopy(j:length(10) alloc_if(1) free_if(0))

In this case, you would expect the offload pragma to create two persistent (notice the free_if(0)) arrays on the coprocessor. But that is not what happens in reality. If you take a closer look, you will notice that both host pointers are uninitialized. We generally assume that uninitialized pointers contain random values, but according to my observations, the Intel® C++ Compiler initializes these pointers to 0 (for variables declared at global scope, this is also what the C and C++ standards require). Hence, when the code comes across the first nocopy, it creates a first entry in the hash table with the key ‘0’. Next, it allocates the memory and stores the corresponding location as the value for the key ‘0’. The second nocopy comes along, follows the same procedure, and ends up writing over the previous entry for the key ‘0’ in the hash table. In other words, you have a collision in the hash table. Hence, you end up with a single reachable array and a memory leak on the coprocessor, because the first allocation no longer has an entry pointing to it.

Moral: Never use uninitialized pointers to allocate memory on the coprocessor using the offload pragma. 
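
One way to follow this moral is to give each host pointer a distinct, valid address before the offload, so that the two nocopy clauses use two different keys. The following is a minimal sketch under a few assumptions: the MIC_TARGET macro and both function names are made up for this example, __INTEL_OFFLOAD is (as far as I know) the macro the Intel compiler predefines when offload is enabled, and the code has not been run on a coprocessor. Without offload support the pragmas are ignored and the file builds as host-only code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper macro: keep the target attribute only when the
   compiler has offload enabled, so the file also builds host-only. */
#ifdef __INTEL_OFFLOAD
#define MIC_TARGET __attribute__((target(mic)))
#else
#define MIC_TARGET
#endif

MIC_TARGET int *i, *j;

/* Returns 0 on success, -1 if a host allocation failed. */
int allocate_persistent_arrays(void)
{
    /* Initialize the host pointers first: each one now has its own
       non-zero address to serve as a key in the runtime's table. */
    i = malloc(10 * sizeof *i);
    j = malloc(10 * sizeof *j);
    if (i == NULL || j == NULL) {
        free(i);
        free(j);
        i = j = NULL;
        return -1;
    }
    assert(i != j); /* two distinct keys, so no collision */

    /* Same clauses as in the example above, now with valid host pointers. */
    #pragma offload target(mic:0) nocopy(i:length(10) alloc_if(1) free_if(0)) nocopy(j:length(10) alloc_if(1) free_if(0))
    {
    }
    return 0;
}

/* Frees the coprocessor arrays (free_if(1)) and then the host arrays. */
void release_persistent_arrays(void)
{
    #pragma offload target(mic:0) nocopy(i:length(10) alloc_if(0) free_if(1)) nocopy(j:length(10) alloc_if(0) free_if(1))
    {
    }
    free(i);
    free(j);
    i = j = NULL;
}
```

The host arrays are freed only after the offload that releases the coprocessor side, because the runtime needs the host addresses as keys to find the entries.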

My Mistake #2: How about this code?

//Global declaration of i. 
__attribute__((target(mic))) int *i;
//Create a local array on the coprocessor. 
#pragma offload target(mic:0)  nocopy(i:length(0) alloc_if(0) free_if(0))
{
 i=_mm_malloc(sizeof(int)*100,64);
 //Some computation
 …
}
//Allocate arrays only on the host
i=_mm_malloc(sizeof(int)*100,4096);
//Transfer data from the coprocessor-side array to the host-side array
#pragma offload target(mic:0) out(i:length(100) alloc_if(0) free_if(0))
{
}

Again, you would expect this code to work fine, but all you will get out of it is an error. In this example, you are allocating an array directly on the coprocessor without going through the offload runtime (read: the offload pragma), so no entry is ever added to the hash table for it. In the second offload, when you try to transfer the data from the coprocessor to the host, the offload runtime goes back to the hash table to find the source and destination addresses on the host and the coprocessor, but it finds no entry for the array, hence the error.

Moral: When transferring data to and from the coprocessor, the coprocessor-side source or destination arrays should be allocated using the offload runtime (with either the offload or the offload_transfer pragma).
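
A corrected version of the example might look like the sketch below. The idea is to allocate the host array first and let the offload pragma allocate the coprocessor-side array with alloc_if(1), so that the runtime records the mapping before any transfer is attempted. Some assumptions: the MIC_TARGET macro and the function name are made up for this example, the loop is only a stand-in for the real computation, plain malloc replaces _mm_malloc for portability, and the code has not been run on a coprocessor. Without offload support the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper macro: keep the target attribute only when the
   compiler has offload enabled, so the file also builds host-only. */
#ifdef __INTEL_OFFLOAD
#define MIC_TARGET __attribute__((target(mic)))
#else
#define MIC_TARGET
#endif

MIC_TARGET int *i;

/* Returns the host-side array holding the results, or NULL if the host
   allocation failed. The caller frees the array. */
int *compute_on_coprocessor(void)
{
    /* Allocate on the host first so the pointer is a valid, non-zero key. */
    i = malloc(100 * sizeof *i);
    if (i == NULL)
        return NULL;

    /* alloc_if(1): the runtime allocates the coprocessor-side array and
       records the host-to-coprocessor mapping. free_if(0) keeps it alive. */
    #pragma offload target(mic:0) nocopy(i:length(100) alloc_if(1) free_if(0))
    {
        int k;
        for (k = 0; k < 100; k++)
            i[k] = 2 * k; /* stand-in for the real computation */
    }

    /* The mapping exists, so out() can resolve both addresses.
       free_if(1) releases the coprocessor-side array afterwards. */
    #pragma offload target(mic:0) out(i:length(100) alloc_if(0) free_if(1))
    {
    }

    assert(i != NULL);
    return i;
}
```

The only structural change from the failing example is who performs the coprocessor-side allocation: the runtime does it through the pragma, instead of a direct allocation call inside the offloaded block.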

For more complete information about compiler optimizations, see our Optimization Notice.