two questions, timing and memory

Hi all,

I'm testing my code on MIC using the pragma offload + OpenMP model, and I have two questions so far.

1. When we use OFFLOAD_REPORT for timing, is it true that the data transfer time can be derived by subtracting "MIC time" from "CPU time"? In this context, "CPU time" refers to the actual total time of the offloaded code, right?

2. Some SIMT accelerators have high global memory access latency, and accesses have to be coalesced to prevent over-fetch. I'm wondering whether these two problems exist on MIC, and what issues I should pay attention to with regard to efficient global data access. Specifically, my code is non-vectorized (threads may execute totally different instructions most of the time) and embarrassingly parallel, and its global memory accesses follow a very random, scattered pattern. Will this potentially cause a performance penalty?

Thanks for clarifications!

1) Is "CPU time" - "MIC time" = data transfer time? The time includes overhead from setting up the process on the coprocessor as well, or at least that is my understanding. In general, though, I think you are safe using this as an approximation of the data transfer time. Set OFFLOAD_REPORT to 2 so that you also see the number of bytes transferred.
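To make that concrete, here is a minimal sketch of an offload region (the kernel and array names are illustrative, not from the original post). Compiled with the Intel compiler and run with `OFFLOAD_REPORT=2` set in the environment, the offload runtime prints "CPU time", "MIC time", and the bytes transferred for this region:

```c
/* Hypothetical kernel: double each element on the coprocessor.
 * The in()/out() clauses are what generate the host<->MIC transfers
 * that OFFLOAD_REPORT accounts for. A compiler without offload
 * support simply ignores the pragmas and runs the loop on the host. */
void double_on_mic(const double *a, double *b, int n) {
    #pragma offload target(mic) in(a : length(n)) out(b : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            b[i] = 2.0 * a[i];
    }
}
```

Invoke the program as `OFFLOAD_REPORT=2 ./a.out` (or export the variable first); no source change is needed to enable the report.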

2) Your program does sound like it wants to be a problem child.
I'm not sure what you mean by coalescing the data. Are you doing a preliminary gather into a new memory location?
And is there any vectorization that takes place within a single thread? It is not a bad thing to have each core executing completely different parts of the code, if within that piece of code there is vectorization to be found. If not, you are leaving a lot of processing ability sitting idle.
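To illustrate that point with a hypothetical sketch (the routine names and the halving constant are made up for the example): it is fine for different threads to run completely different code, as long as each routine's hot inner loop is a simple unit-stride loop the compiler can vectorize.

```c
/* Two unrelated per-thread tasks; each still contains a
 * vectorizable inner loop. */
double task_sum(const double *x, int n) {      /* vectorizable reduction */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

void task_scale(double *x, int n, double c) {  /* vectorizable elementwise loop */
    for (int i = 0; i < n; i++)
        x[i] *= c;
}

void run_tasks(const double *a, double *b, int n, double *sum_out) {
    #pragma omp parallel sections
    {
        #pragma omp section
        *sum_out = task_sum(a, n);   /* one thread sums a...       */
        #pragma omp section
        task_scale(b, n, 0.5);       /* ...while another scales b. */
    }
}
```

Each section keeps a vector unit busy even though the threads never execute the same instructions; without per-loop vectorization like this, most of each core's compute capability sits idle.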

If you haven't read it yet, you might find it helpful in determining how well the memory can perform in your case. And if you haven't watched any of the training videos yet, you might want to watch some of them for ideas. The one on memory isn't posted yet, but it should be soon.

Concerning #2, I agree with Francis that this sounds like it might be tricky to get running well on the MIC architecture. The MIC architecture hides the time it takes to get data from main memory behind a traditional cache-coherent memory model rather than coalesced data access. With random, non-predictable memory access (I'll assume the pattern is too complex to prefetch) you're going to cause lots of cache misses/flushes and TLB misses, and thus run into performance problems. If you can find some way to keep memory accesses within the same memory page, or ideally a small subset of cache lines, then it should work better on either architecture (substitute "shared memory" and "register file" for the other one).
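One common way to improve that locality, sketched below with made-up names (this is the "preliminary gather" Francis asked about, not something from your code): gather the scattered elements once into a contiguous scratch buffer, then do all further passes over the buffer with unit stride, which is cache-, TLB-, and prefetcher-friendly on either architecture.

```c
/* Hypothetical sketch: one pass of scattered reads to localize the
 * data, then a unit-stride (and vectorizable) processing loop.
 * "Process" here is just an illustrative doubling. */
void gather_then_process(const double *table, const int *idx,
                         double *scratch, double *out, int n) {
    for (int i = 0; i < n; i++)    /* scattered reads happen once */
        scratch[i] = table[idx[i]];
    for (int i = 0; i < n; i++)    /* unit stride: cache-friendly */
        out[i] = 2.0 * scratch[i];
}
```

Whether this wins depends on how many times the gathered data is reused; a single-use gather just moves the random accesses rather than amortizing them.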

Following up on Francis' other point, you probably aren't going to find much advantage with MIC if your code is highly parallel but not vectorized within a given thread. MIC needs a high degree of vectorization, parallel execution via threads, *and* good use of its memory caches to really shine. For non-vectorized code that is highly parallel, you are probably going to get the best performance on Xeon (which also has much more robust hardware prefetchers that may be able to cope better with your memory access patterns).

