The difference in creating threads between Phi and GPU

Dear all,

I heard that "GPU can create threads using hardware, one instruction creating one thread".

Does Phi have similar support? For example, when I use OpenMP, can I use some features of Phi to reduce the overhead of creating threads?

Thanks very much. 


Initial creation of OpenMP threads on MIC is time-consuming.   I haven't tried myself to measure whether restricting the number of threads e.g. by KMP_PLACE_THREADS will save a significant portion.  The default value for KMP_BLOCKTIME (0.200 sec) keeps the OpenMP threads active after leaving a parallel region so as to save creating them again.

Quote:

Initial creation of OpenMP threads on MIC is time-consuming. I haven't tried myself to measure whether restricting the number of threads e.g. by KMP_PLACE_THREADS will save a significant portion.

Reducing the number of threads created will reduce the time, but it won't be linear.

Quote:

The default value for KMP_BLOCKTIME (0.200 sec) keeps the OpenMP threads active after leaving a parallel region so as to save creating them again.

Not quite. The delay here is not about terminating and re-creating threads, but about whether they should be actively spinning waiting for the next parallel region, or suspended inside the kernel. That decision affects how rapidly they can be awoken at the next parallel region, but it's not about destroying and re-creating threads. The threads won't be destroyed until the OpenMP runtime closes down.
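As a concrete illustration (these variable names are from Intel's OpenMP runtime; the binary name is made up), the block time is set in the shell before launching the program:

```shell
# Keep threads spinning indefinitely between parallel regions
# (fastest re-entry into the next region, but burns power while idle):
export KMP_BLOCKTIME=infinite

# Or let threads go to sleep immediately after a parallel region
# (kindest to other processes sharing the machine):
export KMP_BLOCKTIME=0

# The default is 200; the value is in milliseconds:
export KMP_BLOCKTIME=200

./my_openmp_program   # hypothetical binary name
```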

The (typical) programming model on a GPU is quite different from using OpenMP on a Xeon Phi or any other processor supporting OpenMP. Thread creation time in OpenMP is paid at the first creation of a thread pool. This occurs at the first entry to any parallel region, and, if nesting is enabled, upon first entry to each specific nest-level path: thread 0 of the outermost region gets its own nested thread pool, thread 1 of the outermost region gets a different nested thread pool, and so on. If nesting goes further, each respective thread of each respective thread pool will instantiate a new pool upon first entry to that nest level. *** This happens only when nesting is enabled AND your program uses nested parallelism ***

Most OpenMP programs do not use nested levels.

The two comments above about KMP_BLOCKTIME, and how threads wait in a block-timed barrier (waiting for something to do before taking a nap), describe something more comparable to the GPU thread "creation" time. If you were to look closely at the GPU, you would likely see that a thread request node is written to a table or flagged and, subject to availability of a thread resource, the node is taken by an available thread. This is quite similar to what happens at the block-time barrier. Note that in OpenMP, one or more threads (hundreds on Phi) may be signaled to run.

Jim Dempsey

www.quickthreadprogramming.com

GPU "threads" are small kernels in the GPGPU domain, or shaders in 3D graphics programming.

For example, a GPU cannot perform context switching on running kernels.

You can compare GPU threads to a small portion of CPU code which extensively uses the vector SIMD units.

Quote:

James Cownie (Intel) wrote:

Initial creation of OpenMP threads on MIC is time-consuming.   I haven't tried myself to measure whether restricting the number of threads e.g. by KMP_PLACE_THREADS will save a significant portion.  

Reducing the number of threads created will reduce the time, but it won't be linear.

The default value for KMP_BLOCKTIME (0.200 sec) keeps the OpenMP threads active after leaving a parallel region so as to save creating them again.

Not quite. The delay here is not about terminating and re-creating threads, but about whether they should be actively spinning waiting for the next parallel region, or suspended inside the kernel. That decision affects how rapidly they can be awoken at the next parallel region, but it's not about destroying and re-creating threads. The threads won't be destroyed until the OpenMP runtime closes down.

Hi James, thanks for the reply. So what is the overhead of keeping all threads actively waiting for the next parallel region? If I set KMP_BLOCKTIME=200 secs, then the time for the threads to enter and start working in the next parallel region would very likely be reduced. Why is the default value for KMP_BLOCKTIME such a small value, 0.2 sec?

And you say "The threads won't be destroyed until the OpenMP runtime closes down." If I have a program which calls a set of functions in sequence, and all of these functions contain #pragma omp parallel, does Phi need to create and destroy a pool of threads for each function, or will the threads stay alive until the end of the whole program?

Thanks.

The default timeout set by most of the OpenMP libraries we encounter on a variety of platforms works for many people and is fairly easy to remember. A 200-second timeout would be unsatisfactory for applications which want to switch threading models, e.g. to Cilk(tm) Plus, or to engage power saving. Not many people would be satisfied with the performance of an application which frequently needed to spend intervals of up to 200 seconds at full power waiting to reach the next parallel region. I've heard equally often that the default should be much smaller. I've never encountered an organization that made changing this default a condition for continuing a project, unless it had another at least equally compelling reason for killing the project. It seems easier to put a value in your script than to raise objections to something which works for many people.

By the way, in my experience, a more widely quoted reason for increasing block time is to enable better maintenance of affinity in applications subject to work imbalance or to the OS stalling threads.

There's a lot of inertia involved in changing defaults; for example, it was proposed to raise the default shell stack limit for MIC, and it seemed to be generally favored, but it hasn't happened.  

Tim's answer is pretty much accurate. I can add one more argument, though.

If your code is running serially on MIC for 200 seconds, then

  • something is very wrong, since you're clearly not running a highly parallel code, and
  • why are you worrying about a small overhead at the fork when you've just spent such a huge amount of time in serial code? (If the fork took an infeasibly large 500 ms, that's still only 0.25% of the serial time, so you need to fix the serial time before worrying about optimizing the fork overhead!)

It's also just a default. You can easily set KMP_BLOCKTIME=infinite and stop obsessing about it; it's a default, not a fixed value set in stone that you can't change.

Quote:

So what is the overhead of keeping all threads actively waiting for the next parallel region?

The threads are spinning in user-space, consuming energy. If you are sharing the machine with someone else, or are over-subscribing the hardware by having more threads than logical CPUs, then these spinning threads will be occupying logical CPUs that could be doing something useful. However, that's generally not a good way to use the MIC anyway.
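If the spinning is a concern, e.g. on a shared machine, the portable OpenMP way to ask for sleeping rather than spinning waits is the standard OMP_WAIT_POLICY environment variable (a config sketch; on Intel's runtime the effect is similar to a small KMP_BLOCKTIME):

```shell
# Ask the runtime to put waiting threads to sleep promptly,
# freeing logical CPUs for other processes:
export OMP_WAIT_POLICY=passive

# The opposite setting keeps waiting threads spinning for fast wake-up:
# export OMP_WAIT_POLICY=active
```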

Quote:

If I have a program which calls a set of functions in sequence, and all of these functions contain #pragma omp parallel, does Phi need to create and destroy a pool of threads for each function, or will the threads stay alive until the end of the whole program?

The OpenMP runtime is not shutdown in these circumstances, so the threads are not destroyed.
