two questions, timing and memory

Hi all,

I'm testing my code on MIC using the pragma offload + OpenMP model, and I have two questions so far.

1. When we use OFFLOAD_REPORT to do the timing, is it true that the data transfer time can be derived by subtracting "MIC time" from "CPU time"? In this context, does "CPU time" refer to the actual total time of the offloaded code?

2. Some SIMT accelerators have high global-memory access latency, and accesses have to be coalesced to prevent over-fetch. I'm wondering whether these two problems exist on MIC, and what issues I should pay attention to for efficient global data access. Specifically, my code is embarrassingly parallel but non-vectorized (threads may execute totally different instructions most of the time), and its global memory accesses follow a very random, scattered pattern. Will this potentially cause a performance penalty?

Thanks for any clarification!

1) Is "CPU time" - "MIC time" = data transfer time? The difference also includes overhead from setting up the process on the coprocessor, or at least that is my understanding. In general, though, I think you are safe using it as an approximation of data transfer time. Set OFFLOAD_REPORT to 2 so that you also see the number of bytes transferred.
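For reference, a minimal sketch of how you might do that from a bash shell before launching the program (the executable name here is just a placeholder, not from the original post):

```shell
# Level 2 adds the number of bytes transferred to/from the coprocessor
# to the per-offload timing lines printed at level 1.
export OFFLOAD_REPORT=2
./my_offload_app    # hypothetical executable name
```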

2) Your program does sound like it wants to be a problem child.
I'm not sure what you mean by coalescing the data. Are you doing a preliminary gather into a new memory location?
And is there any vectorization that takes place within a single thread? It is not a bad thing to have each core executing completely different parts of the code, as long as within each piece of code there is vectorization to be found. If not, you are leaving a lot of processing capability sitting idle.

If you haven't read it yet, you might find that helpful in determining how well the memory can perform in your case. And if you haven't watched any of the training videos yet, you might want to watch some of them for ideas. The one on memory isn't posted yet, but it should be soon.

Concerning #2, I agree with Francis that this sounds like it might be hard to get running well on the MIC architecture. The MIC architecture hides the time it takes to get data from main memory with a traditional cache-coherent memory hierarchy rather than with coalesced data access. With random, non-predictable memory access (I'll assume the pattern is too complex to prefetch) you're going to cause lots of cache misses/flushes and TLB misses, and thus run into performance problems. If you can find some way to keep memory accesses within the same memory page, or ideally a small subset of cache lines, then it should work better on either architecture (substitute "shared memory" and "register file" for the other one).

Following up on Francis' other point, you probably aren't going to find much advantage with MIC if your code is highly parallel, but not vectorizing within a given thread. MIC needs a high degree of vectorization, parallel execution via threads, *and* good use of its memory caches to really shine. For non-vectorized code that is highly parallel you probably are going to get best performance on Xeon (which also has much more robust hardware prefetchers that may be able to better cope with your memory access patterns).

