Data Parallel in Phi for CUDA programmers

     Can someone please clarify what data-parallel programming means in the context of a Phi? The little literature I’ve seen makes much of pragmas that launch code onto one of the 240 available threads, which sounds like task parallelism. But with my blinkered CUDA mindset, I wonder how to do pure data parallelism. You know, spreading work on the data across a large number of available threads. Is the Phi meant to do the same thing, but across those 240 threads?

But then I see Dr. Dobb's saying:
"The good news for CUDA programmers who wish to utilize Phi coprocessors is that CUDA maps very nicely onto vector hardware. When writing CUDA code to utilize the GPU SIMD processors efficiently, developers also create an efficient mapping to x86 SSE and the new Phi coprocessor's wide vector instructions. It is expected that because the 512-bit width of the wide vector instructions matches an integer multiple of the width of the GPU streaming multiprocessors when they process a thread block, performance will be excellent on Phi devices. In short, the SIMD operations for thread block should translate to one or more wide vector operations".

Right… So how do I get to use that vectorization in my ex-CUDA code? I don’t want to do it with intrinsics, and auto-vectorization seems to mean losing all control over performance. All that CUDA code is very much aware of which thread it’s working on: I can't see a Phi (OpenMP) translation of that.
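
For concreteness, here is a made-up kernel showing the pattern I mean; every thread derives its own global index and handles exactly one element:

__global__ void vec_add(const int *b, const int *c, int *a, int n)
{
    /* Each CUDA thread computes its own element from its block and
       thread indices */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = b[i] + c[i];
}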

What am I missing here?

 


I think you should consult the Xeon Phi ISA reference. Please read this link: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide

Phi supports both threaded parallelization and data-parallel SIMD vectorization.  You may wish to read the chapter on vectorization on the Compiler Methodology website.  The Phi programming models are CLOSER to standard CPU-based methods (not identical, obviously) than to special-purpose, architecture-specific ones like CUDA.  I would not expect an easy translation.
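
As a minimal sketch of keeping control over vectorization rather than leaving everything to the auto-vectorizer (the pragma below is the Intel compiler's spelling at the time of writing, and the loop variables are placeholders; see the vectorization chapter for details):

/* Assert to the compiler that this loop is safe to vectorize, instead
   of relying on its own dependence analysis; build with -vec-report2
   to see what it actually did with the loop */
#pragma simd
for (int i = 0; i < n; i++)
    a[i] = b[i] * c[i] + d[i];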

ron

Hi Philip, 

You could try OpenCL, which uses the data-parallel approach. However, only a beta version of OpenCL 1.2 with Intel Xeon Phi coprocessor support is currently available.
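
As a rough sketch (the kernel name and types are my own), an OpenCL kernel looks much like a CUDA one, with get_global_id(0) playing the role of the global thread index:

__kernel void vec_add(__global const int *b,
                      __global const int *c,
                      __global int *a)
{
    /* One work-item per element, analogous to one CUDA thread */
    size_t i = get_global_id(0);
    a[i] = b[i] + c[i];
}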

You can find more at the following links: 

http://software.intel.com/en-us/blogs/2012/11/12/introducing-opencl-12-for-intel-xeon-phi-coprocessor

http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor

-Sumedh

Philip, I'm not quite sure why you say automatic vectorization means losing control over performance. But in any event, the web pages you have already been pointed to should have provided the answers you needed. I would just like to leave a short example for later visitors to this page who want a quick and dirty answer.

If I execute the following code on the Intel Xeon Phi coprocessor, it will execute on one thread and will perform the additions 16 at a time (because an int is 4 bytes long and a vector register is 64 bytes long):

/* Auto-vectorized by the compiler: one thread, 16 int additions per
   512-bit vector operation */
for (int i = 0; i < 1000000; i++)
{
    a[i] = b[i] + c[i];
}

On the other hand, if I execute this next bit of code, it will break the loop into chunks and hand the chunks out to up to 240 threads, each of which performs its own subset of the additions 16 at a time:

/* OpenMP splits the iterations into chunks across up to 240 threads;
   each thread's chunk is still vectorized 16 additions at a time */
#pragma omp parallel for
for (int i = 0; i < 1000000; i++)
{
    a[i] = b[i] + c[i];
}

Usually, I would look for nested loops and use OpenMP to break the outer loop up into chunks, leaving the inner loops running on the vector registers, along the lines of the sketch below.
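
A minimal sketch of that pattern (the rows x cols array shapes are assumed):

/* OpenMP hands the rows out to threads; the inner loop over j is left
   for the compiler to turn into 512-bit vector operations */
#pragma omp parallel for
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        a[i][j] = b[i][j] + c[i][j];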
