How to offload computation to Intel(R) Graphics Technology

Introduction

Intel(R) C++ Compiler 15.0 Beta Update 2 provides a feature that enables offloading general purpose compute kernels to processor graphics, putting the processor graphics silicon area to work for general purpose computing. Unlike discrete graphics cards, processor graphics can directly access system RAM. Each processor graphics unit (depending on the processor generation and SKU) hosts a certain number of execution units, and each execution unit has a certain number of hardware threads (each thread can run in SIMD mode). The key idea is to use the compute power of the CPU cores and the GPU execution units in tandem for better utilization of the available compute resources. The Intel(R) C++ Compiler enables compute offload through the Intel(R) Cilk(TM) Plus programming model, which gives C/C++ developers a seamless porting experience. Refer to https://01.org/linuxgraphics/documentation for more information on the hardware specifications of the different processor graphics SKUs.

Types of offload support:

Synchronous Offload

  • This is a #pragma-based offload. Highly data-parallel sections of your application are perfect candidates for offloading to processor graphics. By annotating a code section with #pragma offload target(gfx), you trigger the target compiler just for these offloaded sections. For instance, consider the following parallel host code for a sepia filter:
        cilk_for(int i = 0; i < size_of_image; i++)
        {
            process_image(indata[i], outdata[i]);
        }

        void process_image(rgb &indataset, rgb &outdataset){
            float temp;
            temp = (0.393f * indataset.red) + (0.769f * indataset.green) + (0.189f * indataset.blue);
            outdataset.red = (temp > 255.f) ? 255.f : temp;
            temp = (0.349f * indataset.red) + (0.686f * indataset.green) + (0.168f * indataset.blue);
            outdataset.green = (temp > 255.f) ? 255.f : temp;
            temp = (0.272f * indataset.red) + (0.534f * indataset.green) + (0.131f * indataset.blue);
            outdataset.blue = (temp > 255.f) ? 255.f : temp;
            return;
        }


    This program can be offloaded to processor graphics using synchronous offload as follows:
    
        #pragma offload target(gfx) pin(indata, outdata : length(size_of_image))
        {
            cilk_for(int i = 0; i < size_of_image; i++)
            {
                process_image(indata[i], outdata[i]);
            }
        }

        __declspec(target(gfx))
        void process_image(rgb &indataset, rgb &outdataset){
            float temp;
            temp = (0.393f * indataset.red) + (0.769f * indataset.green) + (0.189f * indataset.blue);
            outdataset.red = (temp > 255.f) ? 255.f : temp;
            temp = (0.349f * indataset.red) + (0.686f * indataset.green) + (0.168f * indataset.blue);
            outdataset.green = (temp > 255.f) ? 255.f : temp;
            temp = (0.272f * indataset.red) + (0.534f * indataset.green) + (0.131f * indataset.blue);
            outdataset.blue = (temp > 255.f) ? 255.f : temp;
            return;
        }
    

  • There are two key things to note in the above porting process. First, #pragma offload target(gfx) is immediately followed by cilk_for (the Intel(R) Cilk(TM) Plus keyword that tells the compiler about potential data parallelism in a loop). Secondly, the loop that is offloaded to processor graphics contains another function call. To make sure process_image() is available on the GPU side, we annotate the function with __declspec(target(gfx)) at its declaration site.
  • The pin clause is used on both the input and output arrays to guarantee that the memory pages holding these arrays are not swapped out by the memory management unit while the compute offload kernels are executing on processor graphics. Currently, operating systems have no way of knowing whether a page is in use by a compute offload kernel on processor graphics; the pin clause lets the developer explicitly pin the memory pages that hold the arrays used by the offloaded kernels. length() specifies the length of the array in number of elements.
  • In a heterogeneous application (one with both CPU and GPU compute kernels), execution starts on the host. Host threads execute on the CPU cores, and when one hits a synchronous offload region, it enqueues the offload code to the processor graphics command queue and goes into suspend mode. The host thread remains suspended until the processor graphics compute kernel finishes execution and returns control. That is why this approach is called synchronous offload (similar to a blocking API call).
  • When the Intel(R) C++ Compiler builds a heterogeneous application, the section of code annotated with the synchronous offload construct is compiled by both the host compiler and the target compiler. The offload region therefore carries both a host binary and a virtual ISA/binary targeting processor graphics. This lets you ship a single binary that uses processor graphics when the target processor has it and otherwise switches to the fallback (host) code path.
  • The end of a synchronous offload section implicitly unpins the memory pages that were pinned at the beginning of the offload region.

Asynchronous Offload

Unlike the synchronous offload model, which is #pragma based, this is an API-based approach. The APIs are declared in the gfx_rt.h header file. For a quick demonstration of how porting to processor graphics is done using asynchronous offload, consider the parallel host code of the sepia filter shown below:


	cilk_for(int i = 0; i < size_of_image; i++)
	{
	    process_image(indata[i], outdata[i]);
	}

	void process_image(rgb &indataset, rgb &outdataset){
	    float temp;
	    temp = (0.393f * indataset.red) + (0.769f * indataset.green) + (0.189f * indataset.blue);
	    outdataset.red = (temp > 255.f) ? 255.f : temp;
	    temp = (0.349f * indataset.red) + (0.686f * indataset.green) + (0.168f * indataset.blue);
	    outdataset.green = (temp > 255.f) ? 255.f : temp;
	    temp = (0.272f * indataset.red) + (0.534f * indataset.green) + (0.131f * indataset.blue);
	    outdataset.blue = (temp > 255.f) ? 255.f : temp;
	    return;
	}


The corresponding asynchronous offload version is shown below:


	#include <gfx/gfx_rt.h>
	...

	__declspec(target(gfx_kernel))
	void offload(rgb *indata, rgb *outdata){
	    cilk_for(int i = 0; i < size_of_image; i++)
	    {
	        process_image(indata[i], outdata[i]);
	    }
	}

	__declspec(target(gfx))
	void process_image(rgb &indataset, rgb &outdataset){
	    float temp;
	    temp = (0.393f * indataset.red) + (0.769f * indataset.green) + (0.189f * indataset.blue);
	    outdataset.red = (temp > 255.f) ? 255.f : temp;
	    temp = (0.349f * indataset.red) + (0.686f * indataset.green) + (0.168f * indataset.blue);
	    outdataset.green = (temp > 255.f) ? 255.f : temp;
	    temp = (0.272f * indataset.red) + (0.534f * indataset.green) + (0.131f * indataset.blue);
	    outdataset.blue = (temp > 255.f) ? 255.f : temp;
	    return;
	}

	...
	int main(){
	    ...
	    __GFX_share(indata, sizeof(rgb)*size_of_image);
	    __GFX_share(outdata, sizeof(rgb)*size_of_image);
	    __GFX_enqueue("offload", indata, outdata);
	    __GFX_wait();
	    __GFX_unshare(indata);
	    __GFX_unshare(outdata);
	    ...
	}

  • As shown above, this approach is API based. The highly data-parallel section of the code is isolated in a separate function, which is then converted into a kernel entry point for processor graphics execution. The offload() function above is created for this reason and annotated with __declspec(target(gfx_kernel)) to turn it into a kernel entry point. The difference between target(gfx_kernel) and target(gfx) is that the former annotates kernel entry points, which are always called from the host side, whereas a target(gfx) annotation makes the function visible to processor graphics.
  • The kernel entry point should contain the cilk_for-annotated loop that expresses the data parallelism in the loop.
  • The __GFX_share() API does the same job as the pin clause in synchronous offload. In the above implementation, __GFX_share() is invoked for each data structure that needs to be pinned in memory.
  • The __GFX_enqueue() API enqueues the offload kernel to the processor graphics command queue. Unlike synchronous offload, in asynchronous offload the host thread does not go into suspend mode; it enqueues the kernel and proceeds with its execution until it encounters the __GFX_wait() API, where it is suspended until the processor graphics kernel completes. __GFX_wait() is a synchronization point between CPU and GPU execution. If multiple kernels are enqueued in quick succession, __GFX_wait() waits for the last enqueued kernel to complete. You can wait for a specific kernel by passing that kernel's handle as an argument: __GFX_wait(handle).
  • Once the kernels have completed their execution, the memory pages explicitly pinned with __GFX_share() can be unpinned with __GFX_unshare(). This is another difference from synchronous offload: in asynchronous offload, the data stays pinned across multiple kernel invocations (data persistence), whereas in synchronous offload the data is implicitly unpinned after each offload section executes.
  • When an asynchronous offload section is compiled with the Intel(R) C++ Compiler, only the target code is generated, not the host code. If a developer needs a host equivalent of the offloaded section, that fallback logic must be handled manually in the code.
     


 

This article applies to:
    Products: Intel® System Studio
    Host OS/Platform: Windows (IA-32 or Intel® 64); Linux (Intel® 64)
    Target OS/platform: Windows (IA-32 or Intel® 64); Ubuntu 12.04 (Intel® 64)

For more complete information about compiler optimizations, see our Optimization Notice.