The Chronicles of Phi - part 2 Hyper-Thread Phalanx – tiled_HT1

In the first part I discussed the diffusion problem and the proposed strategy to address the performance issue through use of a Hyper-Thread Phalanx. I left you with the dangling question:

How to determine thread binding to core and HT within core?

Let's begin with an illustration of 2-wide, 3-wide, and 4-wide Hyper-Thread Phalanxes:

View of the y/z plane of a x/y/z space, x going into page. Computation order into page along x, then stepping down on page in the +y direction.

The left image of each pair, illustrates the cold cache penetration of x. Yellow indicates one of the threads incurs a cache miss, while all the adjacent threads accessing the same cell experiences a cache hit. More importantly, moving on to the next drill down of x (right illustration of paired illustrations), we can now estimate the cache hit ratios: 10/14 (71.43%), 16/21(76.19%), and 22/18 (78.57%). These are significant improvements over the single thread layout value of 3/7 (42.86%).

The question then becomes:

How to determine thread binding to core and HT within core

One could specify affinity and core placement through use of environment variables external to the program, but this may not be suitable or reliable. It is better to place the least constrictions and requirements on the environment variables. While one set of affinity bindings may be best for this function, your overall application may benefit from a different arrangement of thread bindings. Therefore, this necessitates having the program determine the affinity bindings applied by the environment.

The following header HyperThreadPhalanx.h  and utility code HyperThreadPhalanx.c  were used for the improved performance test programs added to the sample program folder. The original test programs were written in C. Therefore, this version of the utility code is also written in C. As an exercise for the reader, you may modify the code for use with C++.

The primary goal of the HyperThreadPhalanx.c  utility function is to:

o Determine the number of OpenMP threads in the outer most region of the application
o Compute a logical core number (zero based and contiguous) for each thread
o Compute a logical HT number within the core (zero based and contiguous) for each thread
o Compute the number of logical cores
o Compute number of HTs per core as used in the working set

Notes:

The programmer (operator of program) must specify some form of realistic affinity binding. They are free to choose almost any strategy that is reasonable for the remainder (non- HyperThreadPhalanx’ed part) of the application. KMP_AFFINITY=compact, KMP_AFFINITY=scatter, as well as combining with KMP_PLACE_THREADS=nnC, mmT, oO. The only “reasonable” requirement is for each core used to have the same number of working threads. If they do not, the current code will choose the smallest number (though testing of adverse configurations has not been strenuously performed).

The header file:

// HyperThreadPhalanx.h
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <omp.h>
#include <assert.h>

// types:
struct HyperThreadPhalanxThreadInfo_t
{
  int APICid;
  int PhysicalCore;
  int PhysicalHT;
  int LogicalCore;
  int LogicalHT;
};

struct HyperThreadPhalanx_t
{
  int isIntel;
  union {
  char ProcessorBrand[48];
  unsigned int ProcessorBrand_uint32[12];
  };
  int nHTsPerCore;// hardware
  int nThreads;   // omp_get_num_threads() {in parallel region, no nesting}
  int nCores;     // number of core derived therefrom
  int nHTs;       // smallest number of HT's in mapped cores (logical HTs/core)
  struct HyperThreadPhalanxThreadInfo_t* ThreadInfo; // allocated to nThreads
};

// global variables:
extern struct HyperThreadPhalanx_t HyperThreadPhalanx;


// global thread private variables:
#if defined(__linux)
// logical core (may be subset of physical cores and not necessarily core(0))
extern __thread int myCore; 
// logical Hyper-Thread within core
// (may be subset of hw threads in core and not necessarily hwThread(0) in core)
extern __thread int myHT;
#else
// logical core (may be subset of physical cores and not necessarily core(0))
extern __declspec(thread) int myCore;
// logical Hyper-Thread within core
// (may be subset of hw threads in core and not necessarily hwThread(0) in core)
extern __declspec(thread) int myHT;
#endif

// functions:
int HyperThreadPhalanxInit();

The header introduces into your namespace the HyperThreadPhalanx object and two Thread Local Storage variables myCore and myHT. Other than for the two TLS variables, the user is free to use the post-HyperThreadPhalanxInit() values if they wish to do so.

The current code was kept brief, and only is functional for Intel processors (P4 and later). The code uses the CPUID intrinsic and instruction. Information on the CPUID instruction can be found in Intel® Processor Identification and the CPUID instruction. Application Note 485.

The code now follows:

// HyperThreadPhalanx.c

#include "HyperThreadPhalanx.h"

struct HyperThreadPhalanx_t HyperThreadPhalanx;

#if defined(__linux)
// logical core (may be subset of physical cores and not necessarily core(0))
__thread int myCore = -1;
// logical Hyper-Thread within core
// (may be subset of hw threads in core and not necessarily hwThread(0) in core)
__thread int myHT = -1;
#else
// logical core (may be subset of physical cores and not necessarily core(0))
__declspec(thread) int myCore = -1;
// logical Hyper-Thread within core
// (may be subset of hw threads in core and not necessarily hwThread(0) in core)
__declspec(thread) int myHT = -1;
#endif

void __cpuidEX(int cpuinfo[4], int func_a, int func_c)
{
 int eax, ebx, ecx, edx;
 __asm__ __volatile__ ("cpuid":\
 "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (func_a), "c" (func_c));
 cpuinfo[0] = eax;
 cpuinfo[1] = ebx;
 cpuinfo[2] = ecx;
 cpuinfo[3] = edx;
} // void __cpuidEX(int cpuinfo[4], int func_a, int func_c)

void InitProcessor()
{
  unsigned int CPUinfo[4];
  __cpuid(CPUinfo, 0); // This code requires at least support of CPUID
  HyperThreadPhalanx.ProcessorBrand_uint32[0] = CPUinfo[1];
  HyperThreadPhalanx.ProcessorBrand_uint32[1] = CPUinfo[3]; // note order different
  HyperThreadPhalanx.ProcessorBrand_uint32[2] = CPUinfo[2];
  HyperThreadPhalanx.ProcessorBrand_uint32[3] = 0;
  HyperThreadPhalanx.isIntel =
    (strcmp(HyperThreadPhalanx.ProcessorBrand, "GenuineIntel") == 0);
}

int HyperThreadPhalanxInit()
{
  InitProcessor();
  if(!HyperThreadPhalanx.isIntel)
  {
    printf("Not Intel processor. Add code to handle this processor.\n");
    return 1;
  }
  if(omp_in_parallel())
  {
    printf("HyperThreadPhalanxInit() must be called from outside parallel .\n");
    return 2;
  }

#pragma omp parallel
  {
    // use omp_get_num_threads() NOT omp_get_max_threads()
    int nThreads = omp_get_num_threads();
    int iThread = omp_get_thread_num();
    unsigned int CPUinfo[4];

#pragma omp master
    {
      HyperThreadPhalanx.nThreads = nThreads;
      HyperThreadPhalanx.ThreadInfo =
       malloc(nThreads * sizeof(struct HyperThreadPhalanxThreadInfo_t));
      __cpuidEX(CPUinfo, 4, 0);
      HyperThreadPhalanx.nHTsPerCore = ((CPUinfo[0] >> 14) & 0x3F) + 1;
      // default logical HT's per core to physical (may change later)
      HyperThreadPhalanx.nHTs = HyperThreadPhalanx.nHTsPerCore;

    }
#pragma omp barrier
    // master region finished, see if allocation succeeded
    if(HyperThreadPhalanx.ThreadInfo)
    {
      __cpuidEX(CPUinfo, 1, 0); // get features
      if(CPUinfo[2] & (1 << 21))
      {
        // processor has x2APIC
        __cpuidEX(CPUinfo, 0x0B, 0);
 // get thread's APICid
        HyperThreadPhalanx.ThreadInfo[iThread].APICid = CPUinfo[3];
      }
      else
      {
        // older processor without x2APIC
        __cpuidEX(CPUinfo, 1, 0);
 // get thread's APICid
        HyperThreadPhalanx.ThreadInfo[iThread].APICid = (CPUinfo[1] >> 24) & 0xFF;
      }
      // Use thread's APICid to determine physical core and physical HT number within core
      HyperThreadPhalanx.ThreadInfo[iThread].PhysicalCore =
        HyperThreadPhalanx.ThreadInfo[iThread].APICid
         / HyperThreadPhalanx.nHTsPerCore;
      HyperThreadPhalanx.ThreadInfo[iThread].PhysicalHT =
        HyperThreadPhalanx.ThreadInfo[iThread].APICid
          % HyperThreadPhalanx.nHTsPerCore;
      // for now indicate LogicalCore and LogicalHT not assigned
      HyperThreadPhalanx.ThreadInfo[iThread].LogicalCore = -1;
      HyperThreadPhalanx.ThreadInfo[iThread].LogicalHT = -1;
    }
#pragma omp barrier
    // At this point, all the HyperThreadPhalanx.ThreadInfo[iThread].APICid,
    // PhysicalCore and PhysicalHT have been filled-in.
    // However, the logical core number may differ from physical core number
    // no different than OpenMP thread number differing from logical processor number
    // The logical core numbers are 0-based, without gaps
#pragma omp master
    {
      int NextLogicalCore = 0;
      for(;;)
      {
        int iLowest = -1; // none found
        for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
        {
          // see if unassigned core
          if(HyperThreadPhalanx.ThreadInfo[i].LogicalCore == -1)
          {
            if(iLowest < 0)
            {
              // first unassigned is lowest
              iLowest = i;
            }
            else
            {
              if(HyperThreadPhalanx.ThreadInfo[i].APICid < HyperThreadPhalanx.ThreadInfo[iLowest].APICid)
               iLowest = i; // new lowest
            }
          } // if(HyperThreadPhalanx.ThreadInfo[i].LogicalCore < 0)
        } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
        if(iLowest < 0)
          break;
        if(HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalHT != 0)
        {
          // unable to use core
          for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
          {
            if(HyperThreadPhalanx.ThreadInfo[i].PhysicalCore == HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalCore)
              HyperThreadPhalanx.ThreadInfo[i].LogicalCore = -2; // mark as unavailable
          } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
        }
        else
        {
          // able to use core
          int NextLogicalHT = 0;
          for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
          {
            if(HyperThreadPhalanx.ThreadInfo[i].PhysicalCore == HyperThreadPhalanx.ThreadInfo[iLowest].PhysicalCore)
            {
              HyperThreadPhalanx.ThreadInfo[i].LogicalCore = NextLogicalCore;
              HyperThreadPhalanx.ThreadInfo[i].LogicalHT = NextLogicalHT++;
            }
          } // for(int i = 0; i < HyperThreadPhalanx.nThreads; ++i)
          ++NextLogicalCore;
          if(NextLogicalHT < HyperThreadPhalanx.nHTs)
            HyperThreadPhalanx.nHTs = NextLogicalHT; // reduce
        }
      } // for(;;)
      HyperThreadPhalanx.nCores = NextLogicalCore;
    } // omp master
#pragma omp barrier
    // master is finished
    myCore = HyperThreadPhalanx.ThreadInfo[iThread].LogicalCore;
    myHT = HyperThreadPhalanx.ThreadInfo[iThread].LogicalHT;
  } // omp parallel
  
  for(int i = 1; i < HyperThreadPhalanx.nThreads; ++i)
  {
    for(int j = 0; j < i; ++ j)
    {
      if(HyperThreadPhalanx.ThreadInfo[j].APICid == HyperThreadPhalanx.ThreadInfo[i].APICid)
      {
        printf("Oversubscription of threads\n");
        printf("Multiple SW threads assigned to same HW thread\n"); 
        return 4;
      }
    } // for(int j = 0; j < i; ++ j)
  }
  return 0;
} // void HyperThreadPhalanxInit()

Next we can integrate the above function into the sample code. Which I will cover in the next part of this blog.

Jim Dempsey
Consultant
QuickThread Programming, LLC

 

For more complete information about compiler optimizations, see our Optimization Notice.

Comments