Can someone please clarify what data parallel programming means in the context of a Phi? The little literature I’ve seen makes much of using pragmas to launch code on one of the 240 available threads, which sounds task parallel to me. But with my blinkered CUDA mindset, I wonder how to do pure data parallelism: you know, arranging the data so the work spreads across a large number of available threads. Is Phi meant to do the same thing, but across those 240 threads?
But then I see Dr. Dobb's saying:
"The good news for CUDA programmers who wish to utilize Phi coprocessors is that CUDA maps very nicely onto vector hardware. When writing CUDA code to utilize the GPU SIMD processors efficiently, developers also create an efficient mapping to x86 SSE and the new Phi coprocessor's wide vector instructions. It is expected that because the 512-bit width of the wide vector instructions matches an integer multiple of the width of the GPU streaming multiprocessors when they process a thread block, performance will be excellent on Phi devices. In short, the SIMD operations for thread block should translate to one or more wide vector operations".
Right… So how do I actually get to use that vectorization in my ex-CUDA code? I don’t want to do it with intrinsics, and auto-vectorization seems to mean giving up all control over performance. All that CUDA code is very much aware of which thread it’s working on, and I can't see a Phi (OpenMP) translation of that.
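The closest I can picture is the sketch below, and it is only my guess at what that quote implies, not something I have seen recommended. It assumes the compiler supports the OpenMP 4.0 `simd` directive. The names `saxpy_blocks` and `BLOCK` are made up for illustration. The outer loop stands in for `blockIdx.x` and is spread across the hardware threads, and the inner loop stands in for `threadIdx.x` and is meant to become one 512-bit vector operation (16 floats).

```cpp
#include <vector>

// Hypothetical block size: 16 single-precision floats fill one 512-bit vector.
constexpr long BLOCK = 16;

// Sketch of a CUDA-style saxpy kernel, y = a * x + y, rewritten as nested loops.
void saxpy_blocks(float a, const std::vector<float>& x, std::vector<float>& y) {
    const long n = static_cast<long>(x.size());
    const long num_blocks = (n + BLOCK - 1) / BLOCK;  // like the CUDA grid size

    // Outer loop plays the role of blockIdx.x: one "thread block" per iteration,
    // distributed over the OpenMP threads.
    #pragma omp parallel for
    for (long block = 0; block < num_blocks; ++block) {
        // Inner loop plays the role of threadIdx.x: the hope is that the
        // compiler turns these 16 lanes into wide vector instructions.
        #pragma omp simd
        for (long lane = 0; lane < BLOCK; ++lane) {
            const long i = block * BLOCK + lane;  // like the global thread index
            if (i < n) {                          // same bounds guard as in CUDA
                y[i] = a * x[i] + y[i];
            }
        }
    }
}
```

Here each "thread" still knows its index through `block` and `lane`, but I don't know whether this is the intended way to keep that awareness, or whether the compiler will really vectorize the inner loop as written.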
What am I missing here?



