POSIX Threads on Xeon Phi

Hey all,

 

I have written a program which is fully vectorized with Intel intrinsics and parallelized with OpenMP pragmas. I reach quite good performance (300 GFLOP/s). Now I have tried to implement my intrinsics code with POSIX threads for all 240 HW threads, but my performance is now quite bad (15 GFLOP/s). I used both pthread_setaffinity_np and sched_setaffinity, but neither gives good performance. A code sample is in the attachment (for some reason my code gets blocked by the spam filter if I post it here). I have also added a short VTune analysis, which looks quite strange (the kernel is called dslash, its threads are named p16_run, and it starts at 47 s). The kernel has the same structure as in my OpenMP code; I just replaced all pointers with pointers from the input struct of each POSIX thread.

My question is: did I allocate the CPU masks with the cpu_set_t routines correctly? And why is my performance so bad?
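
The pinning pattern I am trying to follow is roughly this (a minimal sketch of the standard cpu_set_t / pthread_setaffinity_np usage; create_pinned and fn are placeholder names, not the exact code from the attachment):

#define _GNU_SOURCE   // for the CPU_* macros and pthread_attr_setaffinity_np
#include <pthread.h>
#include <sched.h>

// Create a thread pinned to a single logical CPU (sketch, placeholder names).
int create_pinned( pthread_t* thread, int cpu, void* (*fn)(void*), void* arg )
{
    cpu_set_t set;
    CPU_ZERO( &set );       // start from an empty mask
    CPU_SET( cpu, &set );   // allow exactly one logical CPU

    pthread_attr_t attr;
    pthread_attr_init( &attr );
    pthread_attr_setaffinity_np( &attr, sizeof(cpu_set_t), &set );
    int rc = pthread_create( thread, &attr, fn, arg );
    pthread_attr_destroy( &attr );
    return rc;
}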

 

Thanks

Patrick

 

Attachments:
- code.cpp (1.07 KB)
- vtune2.png (367.68 KB)
- vtune1.png (376.38 KB)

As an experiment, conditionalize out your setting of affinities. You are not oversubscribing (except for the MIC daemon), so your threads should not migrate. The experiment is to test whether the setting of the affinities by your code is adverse or not.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,

 

I get exactly the same performance with and without sched_setaffinity. The attachment contains the VTune analysis without setting the affinity. I hope that's what you wanted to see?

 

Thanks

Patrick

Attachments:
- vtune3.png (277.11 KB)

Update:

I noticed that pthreads is creating a lot of threads, far more than 240 p16_run threads (with and without setting the affinity). That might be the problem and the reason for the poor performance. The OpenMP code uses exactly 240 p16_run threads.

I really don't understand why pthreads starts that many threads (I guess more than 1000).

Last update (sorry for the frequent small updates):

The many threads are caused by my performance analysis. In my "main" program I need to call the kernel (called dslash in VTune) many times, so to measure the performance of the quite small kernel I average over many iterations performed by a big loop.

 

I have now tried creating all threads with a detached attribute so that their resources are reclaimed when they finish.

 

    pthread_attr_t thread_attr;
    pthread_attr_init( &thread_attr );
    pthread_attr_setdetachstate( &thread_attr, PTHREAD_CREATE_DETACHED );

    //..........
    //..........

    pthread_create( &threads[i], &thread_attr, kernel, (void*) &thread_ptr[i] );

 

This improves the performance a little (20 GFLOP/s), but it is still very bad compared to OpenMP (300 GFLOP/s). A VTune analysis with one call of the small kernel showed that the 240 threads do not all start at the same moment (cf. vtune4 in the attachment).

Attachments:
- vtune4.png (280.09 KB)

Dear Patrick S,

If your co-processor has exactly 240 HW threads, you should run 239 copies of kernel() in 239 child threads and then run the 240th copy in the main thread. Otherwise you oversubscribe by 1.

You can safely remove CPU_ALLOC because it does not set the maximum number of CPUs; it dynamically allocates one CPU set and returns a pointer to it.

You can also safely remove pthread_exit(NULL) before 'return 0;' -- this call attempts to force termination of the system pthread running main(). It is better to use 'return 0;' to return control from main().
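
For example (a sketch; thread_arg_t, run_all and kernel are placeholder names standing in for the types and functions in your code):

#include <pthread.h>

enum { NUM_THREADS = 240 };

struct thread_arg_t { int id; /* ... pointers to this thread's chunk ... */ };

void* kernel( void* p );   // your existing thread kernel

int run_all( thread_arg_t thread_arg[NUM_THREADS] )
{
    pthread_t threads[NUM_THREADS - 1];

    // 239 copies of kernel() in 239 child threads...
    for( int i = 0; i < NUM_THREADS - 1; ++i )
        pthread_create( &threads[i], NULL, kernel, &thread_arg[i] );

    // ...and the 240th copy in the main thread, so nothing oversubscribes.
    kernel( &thread_arg[NUM_THREADS - 1] );

    for( int i = 0; i < NUM_THREADS - 1; ++i )
        pthread_join( threads[i], NULL );

    return 0;   // plain return; no pthread_exit(NULL)
}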

Thank you.

Evgueni.

In looking at your VTune screenshot, no work is being done in any p16_run... thread other than one until about 46 seconds. The other p16_run... threads do nothing until ~46 seconds into the run, then run for a negligible amount of time.

I suggest you create your threads, have kernel set affinity if you wish, then issue a condition variable wait. When your thread-creation loop is finished, signal all the condition variables. This removes the thread creation time from your timing section. Thread creation may have significant overhead.
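
A minimal sketch of that start gate, using one shared mutex/condition pair (the names here are placeholders, not from the attached code):

#include <pthread.h>

static pthread_mutex_t gate_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cv  = PTHREAD_COND_INITIALIZER;
static int go = 0;

void* kernel( void* arg )
{
    // ... set this thread's affinity here if desired ...
    pthread_mutex_lock( &gate_mtx );
    while( !go )                              // guards against spurious wakeups
        pthread_cond_wait( &gate_cv, &gate_mtx );
    pthread_mutex_unlock( &gate_mtx );
    // ... the timed work starts here ...
    return NULL;
}

void release_all_threads()   // call from main after the creation loop
{
    pthread_mutex_lock( &gate_mtx );
    go = 1;
    pthread_cond_broadcast( &gate_cv );
    pthread_mutex_unlock( &gate_mtx );
    // begin the timing measurement here
}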

Also, in your last code post you were using a &thread_attr... If this thread_attr specifies a higher run-time priority than the main thread's, creating the additional threads may be problematic should a newly created thread be assigned to the same logical processor as the main thread (IOW this would serialize the program: different threads, but running one at a time).

Evgueni's comment is valid when the main thread participates in the work (e.g. calls kernel as the thread with ID 0 or the last ID). However, as coded, your main thread is not performing work (calling kernel).

I notice that threads vmlinux and coi_daemon are eating up a considerable amount of time. I would venture to guess that these are related to running VTune. If this is the case, then you might consider diminishing the total thread count during VTune runs (e.g. use 236 and stay off the last core).

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

I notice that threads vmlinux and coi_daemon are eating up a considerable amount of time.

Thanks for that nice hint. You are totally right. Now the VTune analysis of my other OpenMP code looks quite perfect (compare omp.png).

Quote:

jimdempseyatthecove wrote:

In looking at your VTune screenshot, no work is being done in any p16_run... thread other than one until about 46 seconds. The other p16_run... threads do nothing until ~46 seconds into the run, then run for a negligible amount of time.

I have a big scalar part at the beginning of my program (just for checking my intrinsics kernel), which you can also see in the OpenMP VTune analysis (omp.png) and in my pthreads code.

Quote:

jimdempseyatthecove wrote:

I suggest you create your threads, have kernel set affinity if you wish, then issue a condition variable wait. When your thread-creation loop is finished, signal all the condition variables. This removes the thread creation time from your timing section. Thread creation may have significant overhead.

I also want to measure the overhead caused by creating the threads for one kernel call, similar to my OpenMP code, where I used a big averaging loop around the #pragma omp for region. It seems that OpenMP always reuses the same threads, whereas pthreads creates a new thread ID every time.

Quote:

Evgueni wrote:

If your co-processor has exactly 240 HW threads, you should run 239 copies of kernel() in 239 child threads and then run the 240th copy in the main thread. Otherwise you oversubscribe by 1.

I hope that this is now fixed in my code (see attachment code.cpp).

\\\\\\\\\\\\\\\\\\\\\\\\\\\\

So far I have tried a lot of versions, but none reaches good performance. My latest code is in the attachments. I have chosen hardware thread 239 as my main thread; all other threads are created by thread 239. After all other threads are created, the main thread calculates its own chunk of the kernel.

I noticed that the CPU utilization of each HW thread is about 15% (checked with 'top'). If I use only 60 threads on 15 real cores, while keeping the chunk size per thread constant, the CPU utilization is about 70%. Maybe the threads are blocking each other somehow?

 

Attachments:
- omp_0.png (416.53 KB)
- code_0.cpp (1.87 KB)

>>I also want to measure the overhead caused by creating the threads for one kernel call, similar to my OpenMP code, where I used a big averaging loop around the #pragma omp for region. It seems that OpenMP always reuses the same threads, whereas pthreads creates a new thread ID every time.

If you use pthreads as you outlined in your original code.cpp and code_0.cpp, then you will always have the thread creation/destruction overhead. This is acceptable when you run your parallel "region" once within an application. Should your application run the parallel region multiple times, then you may find it better to run a thread creation loop, with each thread pinning itself at startup and then waiting on a condition variable. Then, each time you wish to enter a "parallel region", you write a context for the thread to use and signal the condition variable. When that piece of work completes, the thread goes back to waiting on the condition variable (it may also write to a completion status variable for the thread). IOW you do not create and destroy the thread pool each time you have parallel work to do. Essentially you are using pthreads in a similar manner to how OpenMP uses its thread pool.

Note, your thread pool can be structured such that you can use 1 to n (available) threads from your pool per parallel region (similar to OpenMP).

In your pthread code where you set affinity, use either 2 or 3 threads per core

2/core: 0,1,4,5,8,9,...
3/core: 0,1,2,4,5,6,8,9,10,...

Do not use 1 thread per core (not efficient on MIC).
Using 2 or 3 running threads per core is usually optimal when data mostly fits in L1/L2 cache.
Using 4 threads per core may be optimal for memory-intensive code on MIC (run performance tests without VTune).
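
For example, the thread-index-to-CPU mapping can be computed like this (a sketch assuming the contiguous numbering in the lists above, with 4 hardware threads per core; threads_per_core would be 2 or 3):

// Map pool thread i to a logical CPU so that only threads_per_core of the
// 4 hardware threads on each core are used (sketch; assumes contiguous
// logical CPU numbering, 4 per physical core).
int cpu_for_thread( int i, int threads_per_core )
{
    int core = i / threads_per_core;   // which physical core
    int ht   = i % threads_per_core;   // which HW thread within that core
    return core * 4 + ht;
}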

Jim Dempsey

www.quickthreadprogramming.com

Dear Patrick S.,

You may want to look at how your OMP code performs when KMP_BLOCKTIME=0.

This forces the OMP runtime to release its pthreads immediately after control leaves an OMP region. By default, the OMP runtime does not release its threads for 200 ms (KMP_BLOCKTIME=200).

If you see a significant performance decrease, then you need to exclude pthread creation from the measurement. Another point is that the OMP runtime uses a so-called tree barrier at the end of OMP regions, which is an order of magnitude faster than sequentially joining 239 threads using pthread_join.

Thank you.

Evgueni.

Quote:

Evgueni Petrov aka espetrov (Intel) wrote:

You may want to look at how your OMP code performs when KMP_BLOCKTIME=0.

Changing KMP_BLOCKTIME with "export KMP_BLOCKTIME=0" has no influence on the performance.

Quote:

jimdempseyatthecove wrote:

In your pthread code where you set affinity, use either 2 or 3 threads per core

My kernel is totally memory bound. The OpenMP code showed that using 4 threads per real core gives the best performance.

Quote:

jimdempseyatthecove wrote:

Should your application run the parallel region multiple times, then you may find it better to run a thread creation loop, with each thread pinning itself at startup and then waiting on a condition variable. Then, each time you wish to enter a "parallel region", you write a context for the thread to use and signal the condition variable. When that piece of work completes, the thread goes back to waiting on the condition variable (it may also write to a completion status variable for the thread). IOW you do not create and destroy the thread pool each time you have parallel work to do. Essentially you are using pthreads in a similar manner to how OpenMP uses its thread pool.

Let's make an example and please check if I understood your post correctly:

I have a program with two kernels A and B, each of which should run in a pthreads "parallel region" with 240 threads, and I want to call those kernels several times. Then, at the beginning of my main program, I would create 240 threads each for kernels A and B, where all A threads wait on a condition variable A (and likewise for B).

If I want to call kernel A, I signal condition variable A and let the main thread wait for all A threads to finish. But how can I call kernel A again? All threads have finished. Do I have to wrap the body of "void kernelA" in a while statement that is always true, and wait on a second condition variable A_restart at the end of the kernel?

You are not quite grasping the concept of a thread pool.

Think of the concept as follows:

Your main thread creates a thread pool, whose "kernel" is a kernelWaitForWork.

All threads in the kernelWaitForWork wait on a condition variable that is conditioned upon a queue of tasks to perform.

Tasks will have a task object of your own design, which may have the same context as a "standard" pthread kernel. Or you may want something more; you design this to your requirements.

After initializing the thread pool, the main thread can enqueue starter tasks into the queue (depending on the design, a lock may be required).

If you want the main thread to participate in the work, then the main thread can call the kernelWaitForWork function after it is done initializing.

Your task queue can have any number of entries in it, and you can use multiple task queues with priorities. Your design gives you the choice of:

more threads than logical processors
less threads than logical processors
same number of threads as logical processors
threads pinned or not pinned
threads of same priority or different priority
threads of a class that are to be compute-only
threads of a class that contain waits for other events (e.g. I/O completion)
...

Start with something simple: the number of threads, including main, equals the number of logical processors:

*** sketch code ***

main() {
  init();
  createThreadPool();
  YourTaskQueue.push(&mainTask);
  kernelWaitForWork(...);
}

kernelWaitForWork() {
  while(true) {
    lock(&YourTaskQueueConditionLock);
    // WaitOnCondition atomically releases the lock while waiting and
    // reacquires it before returning.
    while(YourTaskQueue.empty() && !EndProgramFlag)
      WaitOnCondition(&YourTaskQueueCondition);
    if(EndProgramFlag) {
      unlock(&YourTaskQueueConditionLock);
      signal(&YourTaskQueueCondition); // wake the next thread so it exits too
      return;
    } // if(EndProgramFlag)
    // here if not EndProgramFlag
    YourTaskNode_t* aTask = YourTaskQueue.trypop();
    unlock(&YourTaskQueueConditionLock);
    if(aTask) {
      signal(&YourTaskQueueCondition); // more tasks may remain for others
      aTask->fnFunctor();              // run the task
    } // if(aTask)
  } // while(true)
} // kernelWaitForWork()

***
You have control over your Task object. If you want, the task node can contain a member variable indicating the number of threads to participate as a team. Example:

    YourTaskNode_t* aTask = YourTaskQueue.trypop();
    if(aTask) {
      int iThread = aTask->ThreadsRun++;
      if(iThread < aTask->NumberOfThreads - 1) {
        YourTaskQueue.push(aTask);  // re-queue for the next team member
        unlock(&YourTaskQueueConditionLock);
        signal(&YourTaskQueueCondition);
      } else {
        unlock(&YourTaskQueueConditionLock);
      }
      aTask->fnFunctor(iThread);    // run this member's share of the work
    } // if(aTask)
    else { // null aTask
      unlock(&YourTaskQueueConditionLock);
    }

Personally, unless OpenMP, or TBB, or Cilk Plus is unsuitable, I suggest you not go to the effort of creating your own thread pooling system.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

Personally, unless OpenMP, or TBB, or Cilk Plus is unsuitable, I suggest you not go to the effort of creating your own thread pooling system.

The reason why I switched from OpenMP to pthreads is that I need full control over distributing the chunks of my loop to the cores. For example, it would be optimal for chunks 1, 24, 46 and 50 to run on real core 0, because they share data, so the program can benefit from loading data from the cache instead of from memory. I think that's not possible with OpenMP?

With OpenMP you would pin the threads with one of the OpenMP pinning techniques (KMP_AFFINITY=compact, scatter, or any of the other pinning types).

Then I suggest using __cpuid and __cpuidex to survey the system and build tables relating OpenMP thread number (top level) to core number and HT sibling within the core, as well as cross-reference tables. Also consider using thread-local storage for the current thread's core number and HT sibling number.

Then do not use #pragma omp parallel for. Instead use just #pragma omp parallel and slice up the work by core and/or thread yourself:

#pragma omp parallel
{
  int nBlobsZ = (nz + nHT - 1) / nHT; // nHT determined at init time: number of HTs used per core
  int nBlobsZY = nBlobsZ * ny;
  int nBlobsZYPerCore = (nBlobsZY + nCores - 1) / nCores; // nCores determined at init time
  int BlobBegin = nBlobsZYPerCore * myCore; // myCore held in TLS, determined at init time
  int BlobEnd = BlobBegin + nBlobsZYPerCore; // 1 after last Blob for this core
  if(BlobEnd > nBlobsZY) BlobEnd = nBlobsZY; // last blob may be partially used

  for(int iBlob = BlobBegin; iBlob < BlobEnd; ++iBlob) {
    int z0 = (iBlob / ny) * nHT; // home z for 0'th team member of this Blob
    int z = z0 + myHT;           // z for this team member (myHT in TLS)
    int y = iBlob % ny;
    // last blob along z may be partially filled -- assure we are within z
    if(z < nz) {
      // do work
    }
  } // for iBlob
} // omp parallel
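
One simple way to populate myCore and myHT is to query each thread's pinned position once at startup (a sketch using sched_getcpu() instead of a full __cpuid survey; it assumes the threads are already pinned and that the 4 hardware threads of each core have contiguous logical CPU numbers):

#define _GNU_SOURCE   // for sched_getcpu()
#include <sched.h>

// Thread-local core/HT coordinates, filled in once after pinning (sketch).
static __thread int myCore;
static __thread int myHT;

void init_thread_coords()
{
    #pragma omp parallel
    {
        int cpu = sched_getcpu();   // logical CPU this thread is pinned to
        myCore  = cpu / 4;          // 4 logical CPUs per physical core
        myHT    = cpu % 4;
    }
}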

Jim Dempsey

www.quickthreadprogramming.com

By the way, the above is an edited piece of code (names changed for clarity, and non-essential code lines removed for this example) from a program that processes a 3D volume, where the threads within a core (this may be 1, 2, 3 or 4) work together to "drill" along the X axis in adjacent Z columns. This improves L1 cache hit ratios because the processing by each thread uses data from adjacent columns.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

work together to "drill" along the X axis in adjacent Z columns. This improves L1 cache hit ratios because the processing by each thread uses data from adjacent columns.

 

That's what I needed. You really saved me some time there! I will come back in a few days and post how it affects my code.
