Methods to Utilize Intel's Hyper-Threading Technology with Linux*

by Thomas W. Burger

Abstract

By exploiting Intel's Hyper-Threading Technology, programmers can significantly improve application performance.

This article explains what Hyper-Threading Technology (HT Technology) is, how it works and the benefits obtained by using it. It then goes on to show examples of code that utilizes Hyper-Threading Technology in C++, C and/or FORTRAN. In addition, code to detect the support of Hyper-Threading Technology will be provided.


The Concept of Hyper-Threading

Hyper-Threading Technology has been in development by Intel for more than 4 years. Hyper-Threading is a more efficient way to use the processing power of a CPU, because some processor components are idle at any given point during code execution. In the past, improving processor performance consisted primarily of increasing the clock speed, enlarging the cache, reducing the size of the chip, or building multiprocessor systems and adding more processors. Intel has developed Hyper-Threading Technology to allow application performance gains without further hardware changes or additional processors.

Hyper-Threading, also referred to as simultaneous multithreading (SMT), allows different threads to run simultaneously on different execution units within one physical processor. It extends multi-threading by duplicating the architecture state, so that one physical processor looks like two (or more) logical processors to the operating system and applications. Two architecture states on the chip create a logical dual-processor configuration. Each logical processor maintains a complete copy of the architecture state and shares access to the functional units of the CPU. The architecture state consists of registers, including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. A Hyper-Threading enabled processor can therefore behave as if it were two processors, handling instructions from two threads in parallel rather than serially.

Logical processors share most resources on the physical processor: caches, execution units, branch predictors, control logic and buses. Developers who thread their applications are positioned to take advantage of Hyper-Threading Technology. Threaded applications can immediately pick up the performance benefits of Hyper-Threading based multiprocessing whenever they run on a Hyper-Threading enabled Intel processor.

The Intel C++ and Fortran compilers support Hyper-Threading through parallelization guided by OpenMP directives and pragmas. As an example, consider OpenMP parallel sections of code where section-1 calls an integer-intensive routine and section-2 calls a floating-point-intensive routine. Scheduling section-1 and section-2 onto two different logical processors can achieve higher performance by more fully utilizing processor resources. The OpenMP standard API supports multi-platform, shared-memory parallel programming in C++, C and Fortran 95 on all Intel architectures and on operating systems such as Windows NT, Linux, and UNIX.

Simply stated, Hyper-Threading Technology (HT Technology) is a more efficient way to use processing power; developers now need to learn how to take advantage of it.
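The two-section example described above might look like the following sketch in C. The routine names integer_work and float_work, their loop bodies, and the wrapper run_sections are hypothetical stand-ins, not code from Intel. Whether the two sections actually land on the two logical processors of one physical processor is decided by the OpenMP runtime and the operating system scheduler. The code must be built with the compiler's OpenMP option (for example -fopenmp with GCC); without it the pragmas are ignored and the sections simply run one after the other.

```c
/* Hypothetical integer-intensive routine: sums the integers 0..n-1. */
long integer_work(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Hypothetical floating-point-intensive routine: adds 0.5 n times. */
double float_work(long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += 0.5;
    return sum;
}

/* Runs both routines as independent OpenMP sections.
   Each section writes to a different result, so no locking is needed. */
void run_sections(long n, long *int_result, double *fp_result)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        *int_result = integer_work(n);   /* section-1: integer units */

        #pragma omp section
        *fp_result = float_work(n);      /* section-2: floating-point units */
    }
}
```

The two sections deliberately use different kinds of execution units, which is the situation in which two logical processors compete least for shared resources.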


Programming Hyper-Threaded Technology

Determining Logical Processors

When Hyper-Threading Technology is used, applications must query the operating system for the number of logical processors on the system as well as the number of physical processors. This is important not only for determining how the code should run, but also for licensed software where the number of physical processors determines the cost of the application.

The first step in determining the HT Technology capability of the CPU is to check that the processor supports the cpuid instruction and is a genuine Intel® Pentium® 4 processor or later. The code then retrieves the model-specific feature information by executing the cpuid instruction with the eax register set to one. It does not matter which processor executes the cpuid instruction, because all physical processors present must support the same number of logical processors. After the cpuid instruction executes, if bit 28 in edx is set, the physical processor supports Hyper-Threading Technology. This does not mean that the processor supports more than one logical processor, that the BIOS has enabled the feature, or that the operating system is utilizing the extra logical processors. Additional steps are required to determine the number of logical processors supported by the physical processor and to query the operating system for the logical to physical processor mapping.

Sample Code 1.0: Determine Hyper-Threading support.

    // Microsoft-style syntax (__try/__except, __asm) for 32-bit x86.
    #include <windows.h> // EXCEPTION_EXECUTE_HANDLER

    #define HT_BIT        0x10000000 // Bit 28 indicates Hyper-Threading Technology support
    #define FAMILY_ID     0x00000f00 // Bits 11 through 8 are the family processor id
    #define EXT_FAMILY_ID 0x00f00000 // Bits 23 through 20 are the extended family processor id
    #define PENTIUM4_ID   0x00000f00 // Pentium 4 family processor id

    // Returns non-zero if Hyper-Threading Technology is supported, zero if not.
    // Hyper-Threading Technology still may not be enabled due to BIOS or OS settings.
    unsigned int Is_HT_Supported(void)
    {
        unsigned int reg_eax = 0;
        unsigned int reg_edx = 0;
        unsigned int vendor_id[3] = {0, 0, 0};

        __try { // verify cpuid instruction is supported
            __asm {
                xor eax, eax           // call cpuid with eax = 0 (shorter than mov eax, 0)
                cpuid                  // to get vendor id string
                mov vendor_id, ebx
                mov vendor_id + 4, edx
                mov vendor_id + 8, ecx
                mov eax, 1             // call cpuid with eax = 1
                cpuid                  // to get the CPU family information
                mov reg_eax, eax       // eax contains cpu family type info
                mov reg_edx, edx       // edx has Hyper-Threading info
            }
        }
        __except (EXCEPTION_EXECUTE_HANDLER) {
            return 0; // The CPUID call is not supported
        }

        // Is this a genuine Intel Pentium 4 or later processor?
        if ((((reg_eax & FAMILY_ID) == PENTIUM4_ID) || (reg_eax & EXT_FAMILY_ID))
            && vendor_id[0] == 'uneG'
            && vendor_id[1] == 'Ieni'
            && vendor_id[2] == 'letn')
            return (reg_edx & HT_BIT); // Intel processor: report Hyper-Threading bit

        return 0; // Not an Intel Pentium 4 or later processor.
    }
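Sample Code 1.0 relies on Microsoft-style __try and __asm blocks. On Linux, a rough equivalent can be sketched with the __get_cpuid helper from <cpuid.h>, assuming a GCC- or Clang-compatible compiler that ships that header. The function names is_genuine_intel, ht_bit_is_set and is_ht_supported_gcc are illustrative and do not come from the original sample, and the family check is omitted for brevity.

```c
#include <stdint.h>
#include <string.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>              /* __get_cpuid (assumed GCC/Clang toolchain) */
#endif

#define HT_BIT 0x10000000u      /* CPUID leaf 1, edx bit 28 */

/* Nonzero if the Hyper-Threading feature bit is set in edx. */
int ht_bit_is_set(uint32_t edx)
{
    return (edx & HT_BIT) != 0;
}

/* Nonzero if the CPUID leaf 0 registers spell "GenuineIntel".
   The vendor string is stored in the order ebx, edx, ecx. */
int is_genuine_intel(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return strcmp(vendor, "GenuineIntel") == 0;
}

/* Returns nonzero if this is an Intel processor reporting HT support.
   Returns 0 on non-x86 builds or if a CPUID leaf is unavailable. */
int is_ht_supported_gcc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!is_genuine_intel(ebx, edx, ecx))
        return 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ht_bit_is_set(edx);
#else
    return 0;
#endif
}
```

As with the original sample, a nonzero result only says that the processor reports the feature, not that the BIOS or operating system has enabled it.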

 

If the physical processor supports Hyper-Threading Technology, bits 16-23 in register ebx, obtained by executing the cpuid instruction with eax equal to one, identify the number of logical processors the physical processor supports. Note that even when a processor supports Hyper-Threading Technology and has more than one logical processor, the BIOS may not have enabled Hyper-Threading Technology and the operating system may not be utilizing the extra logical processors. If Hyper-Threading Technology is not supported, the default number of logical processors per physical processor is one. In the next step, each processor recognized by the operating system must be queried to provide the logical to physical processor mapping.

Sample Code 2.0: Determining the number of logical processors per physical processor.

    // Register EBX bits 23 through 16 indicate the number of
    // logical processors per package.
    #define NUM_LOGICAL_BITS 0x00FF0000

    // Return the number of logical processors per physical processor.
    unsigned char LogicalProc_PerCPU(void)
    {
        unsigned int reg_ebx = 0;

        if (Is_HT_Supported()) {
            __asm {
                mov eax, 1       // call cpuid with eax = 1 to get the CPU family information
                cpuid
                mov reg_ebx, ebx // get number of logical processors
            }
            return (unsigned char) ((reg_ebx & NUM_LOGICAL_BITS) >> 16);
        }
        else {
            return (unsigned char) 1;
        }
    }

 

The above code is shown in two separate calls to cpuid for the sake of clarity and can be combined.

Each logical processor has a unique Advanced Programmable Interrupt Controller (APIC) ID. The APIC ID is initially assigned by the hardware at system reset and can later be reprogrammed by the BIOS or the operating system. On a processor that supports HT Technology, the cpuid instruction also provides the initial APIC ID for a logical processor, prior to any changes by the BIOS or operating system. To retrieve the APIC ID for each of the logical processors recognized by the operating system, the code must run on each specific processor in turn. This is done by setting the processor affinity for the executing code using operating-system-specific calls.
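On Linux, setting the affinity of the running code might be sketched as follows. This assumes a kernel and C library that provide sched_getaffinity and sched_setaffinity (glibc with _GNU_SOURCE defined), which may not hold for the older kernels discussed in this article. The helper names first_allowed_cpu and pin_to_cpu are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes cpu_set_t and the CPU_* macros in glibc */
#endif
#include <sched.h>

/* Returns the lowest-numbered OS processor this process may run on,
   or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Restricts the calling thread to a single OS processor.
   Returns 0 on success and -1 on error (errno is set by the call). */
int pin_to_cpu(int cpu)
{
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return sched_setaffinity(0, sizeof(target), &target);
}
```

A caller would save the original mask, call pin_to_cpu for each allowed processor, read the APIC ID while pinned, and then restore the saved mask.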

Sample Code 3.0: Retrieving the Processor APIC ID.

    #define INITIAL_APIC_ID_BITS 0xFF000000 // EBX bits 31:24 hold the unique APIC ID

    // Returns the 8-bit unique Initial APIC ID for the processor this
    // code is running on. The default value returned is 0xFF if
    // Hyper-Threading Technology is not supported.
    unsigned char Get_APIC_ID(void)
    {
        unsigned int reg_ebx = 0;

        if (!Is_HT_Supported()) {
            return (unsigned char) -1;
        }
        else {
            __asm {
                mov eax, 1       // call cpuid with eax = 1 for family information
                cpuid
                mov reg_ebx, ebx // Has APIC ID info
            }
            return (unsigned char) ((reg_ebx & INITIAL_APIC_ID_BITS) >> 24);
        }
    }
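Because the feature bit, the logical processor count and the initial APIC ID all come from the same cpuid call with eax equal to one, the three lookups can be combined. The sketch below separates the bit decoding from the instruction itself so the decoding can be checked on any machine. It assumes a GCC- or Clang-compatible <cpuid.h>; the structure cpuid1_info and the functions decode_cpuid1 and read_cpuid1 are hypothetical names, and the vendor and family checks from Sample Code 1.0 are left out.

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>              /* __get_cpuid (assumed GCC/Clang toolchain) */
#endif

struct cpuid1_info {
    int ht_supported;                 /* edx bit 28 */
    unsigned int logical_per_package; /* ebx bits 23:16, or 1 without HT */
    unsigned int initial_apic_id;     /* ebx bits 31:24, or 0xFF without HT */
};

/* Decodes the ebx and edx values returned by cpuid with eax = 1. */
struct cpuid1_info decode_cpuid1(uint32_t ebx, uint32_t edx)
{
    struct cpuid1_info info;
    info.ht_supported = (edx & 0x10000000u) != 0;
    if (info.ht_supported) {
        info.logical_per_package = (ebx & 0x00FF0000u) >> 16;
        info.initial_apic_id = (ebx & 0xFF000000u) >> 24;
    } else {
        info.logical_per_package = 1;   /* same defaults as the samples above */
        info.initial_apic_id = 0xFF;
    }
    return info;
}

/* Fills *out for the processor this code is running on.
   Returns 1 on success, 0 if cpuid leaf 1 is unavailable. */
int read_cpuid1(struct cpuid1_info *out)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    *out = decode_cpuid1(ebx, edx);
    return 1;
#else
    (void)out;
    return 0;
#endif
}
```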

 

The Initial APIC ID is composed of the physical processor's ID and the logical processor's ID within the physical processor. The least significant bits of the APIC ID are used to identify the logical processors within a single physical processor. The number of logical processors per physical processor package determines the number of least significant bits needed. As an example, one least significant bit is needed to represent the logical IDs for two logical processors but two least significant bits are needed to represent the logical IDs for both 3 and 4 logical processors. The remaining most significant bits identify the physical processor ID. As a result, initial APIC IDs are not necessarily consecutive values starting from zero. In general, the APIC ID is given by the following formula:

    (Physical Package ID << (1 + (int)(log2(max(Logical_Per_Package - 1, 1)))))
        | Logical ID
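As a worked illustration of the formula, the sketch below computes the number of low-order bits reserved for the logical ID and then composes and splits an APIC ID. The function names are hypothetical. Note that the formula as written reserves one bit even when there is a single logical processor per package, whereas the shift loop in Sample Code 4.0 would reserve none in that case.

```c
/* Number of low-order APIC ID bits used for the logical ID:
   1 + floor(log2(max(logical_per_package - 1, 1))). */
unsigned int apic_logical_bits(unsigned int logical_per_package)
{
    unsigned int v = logical_per_package > 1 ? logical_per_package - 1 : 1;
    unsigned int bits = 1;
    while (v >>= 1)             /* each halving adds one to floor(log2(v)) */
        bits++;
    return bits;
}

/* Builds an APIC ID from a package ID and a logical ID. */
unsigned int make_apic_id(unsigned int package_id, unsigned int logical_id,
                          unsigned int logical_per_package)
{
    return (package_id << apic_logical_bits(logical_per_package)) | logical_id;
}

/* Extracts the physical package ID from an APIC ID. */
unsigned int apic_package_id(unsigned int apic_id,
                             unsigned int logical_per_package)
{
    return apic_id >> apic_logical_bits(logical_per_package);
}

/* Extracts the logical ID within the package from an APIC ID. */
unsigned int apic_logical_id(unsigned int apic_id,
                             unsigned int logical_per_package)
{
    unsigned int mask = (1u << apic_logical_bits(logical_per_package)) - 1u;
    return apic_id & mask;
}
```

With two logical processors per package, package 3 therefore holds APIC IDs 6 and 7, while with four logical processors per package it holds 12 through 15.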

 

In addition to non-consecutive initial APIC IDs, operating system processor IDs can be non-consecutive in value. It is also possible for the operating system to limit the processors a process may utilize. Therefore, even if Hyper-Threading Technology is enabled, the application may not be able to run on both logical processors sharing the same physical processor.

To determine which logical processor IDs share the same physical processor, for the purposes of load balancing or of your application's licensing strategy, Sample Code 4.0 below can be used as a guide. It prints a table containing the operating system's processor affinity ID, the APIC ID, the physical processor ID, and the logical processor ID within the physical processor. If two logical processor IDs share the same physical processor ID, then Hyper-Threading Technology is enabled.

Sample Code 4.0: Associating logical to physical processors.

    #include <stdio.h>
    #include <windows.h>
    // Include the previous routines here.

    int main(void)
    {
        // Check to see if Hyper-Threading Technology is available.
        if (Is_HT_Supported()) { // See code sample 1.
            unsigned char HT_Enabled = 0;
            unsigned char Logical_Per_CPU;

            printf("Hyper-Threading Technology is available.\n");
            Logical_Per_CPU = LogicalProc_PerCPU(); // See code sample 2.
            printf("Logical Processors Per CPU: %d\n", Logical_Per_CPU);

            // Logical processors > 1 does not mean that
            // Hyper-Threading Technology is enabled.
            if (Logical_Per_CPU > 1) {
                HANDLE hCurrentProcessHandle;
                DWORD_PTR dwProcessAffinity;
                DWORD_PTR dwSystemAffinity;
                DWORD_PTR dwAffinityMask;

                // Physical processor ID and logical processor IDs are derived
                // from the APIC ID. Calculate the shift and mask values knowing
                // the number of logical processors per physical processor.
                unsigned int i = 1; // wider than unsigned char so i *= 2 cannot wrap to 0
                unsigned char PHY_ID_MASK = 0xFF;
                unsigned char PHY_ID_SHIFT = 0;
                while (i < Logical_Per_CPU) {
                    i *= 2;
                    PHY_ID_MASK <<= 1;
                    PHY_ID_SHIFT++;
                }

                // The OS may limit the processors that this process may run on.
                hCurrentProcessHandle = GetCurrentProcess();
                GetProcessAffinityMask(hCurrentProcessHandle,
                                       &dwProcessAffinity,
                                       &dwSystemAffinity);

                // If our available process affinity mask does not equal the
                // available system affinity mask, then we may not be able to
                // determine if Hyper-Threading Technology is enabled.
                if (dwProcessAffinity != dwSystemAffinity) {
                    printf("This process can not utilize all processors.\n");
                }

                dwAffinityMask = 1;
                while (dwAffinityMask != 0 && dwAffinityMask <= dwProcessAffinity) {
                    // Check to make sure we can utilize this processor first.
                    if (dwAffinityMask & dwProcessAffinity) {
                        if (SetProcessAffinityMask(hCurrentProcessHandle, dwAffinityMask)) {
                            unsigned char APIC_ID;
                            unsigned char LOG_ID;
                            unsigned char PHY_ID;

                            // This process may not be running on the cpu that the
                            // affinity is now set to. Sleep gives the OS a chance
                            // to switch to the desired cpu.
                            Sleep(0);

                            // Get the logical and physical ID of the CPU.
                            APIC_ID = Get_APIC_ID(); // See code sample 3.
                            LOG_ID = APIC_ID & ~PHY_ID_MASK;
                            PHY_ID = APIC_ID >> PHY_ID_SHIFT;

                            // Print out table of processor IDs.
                            printf("OS Affinity ID: 0x%.8lx, APIC ID: %d PHY ID: %d, LOG ID: %d\n",
                                   (unsigned long) dwAffinityMask, APIC_ID, PHY_ID, LOG_ID);
                            if (LOG_ID != 0)
                                HT_Enabled = 1;
                        }
                        else {
                            // This should not happen. A check was made to ensure we
                            // can utilize this processor before trying to set affinity mask.
                            printf("Set Affinity Mask Error Code: %lu\n", GetLastError());
                        }
                    }
                    dwAffinityMask = dwAffinityMask << 1;
                }

                // Reset the processor affinity if this code is used in an application.
                SetProcessAffinityMask(hCurrentProcessHandle, dwProcessAffinity);

                if (HT_Enabled) {
                    printf("Processors with Hyper-Threading enabled were detected.\n");
                }
                else {
                    printf("Processors with Hyper-Threading enabled were not detected.\n");
                }
            }
            else {
                printf("Processors with Hyper-Threading are not enabled.\n");
            }
        }
        else {
            printf("Hyper-Threading Processors are not detected.\n");
        }
        return 0;
    }

 

OpenMP

The Open Multi-Processing (OpenMP*) standard API provides library routines, environment variables, directives and pragmas for coding parallelism. OpenMP supports multi-platform, shared-memory parallel programming in C++, C and Fortran 95 on all Intel architectures and popular operating systems. It is a de facto rather than official standard, but it has wide industry support for portable multithreaded application development and is advisable to adopt for both HT Technology and conventional threaded multiprocessing. OpenMP is effective at both fine-grain (loop-level) and large-grain (function-level) threading.
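Fine-grain, loop-level threading can be illustrated with a single pragma on a loop. The sketch below is a minimal example, assuming a compiler with OpenMP enabled (for example -fopenmp with GCC); the function name sum_array is hypothetical. Without OpenMP enabled, the pragma is ignored and the loop runs serially with the same result.

```c
/* Sums an array, letting OpenMP split the iterations across threads.
   The reduction clause gives each thread a private partial total and
   adds the partial totals together when the loop finishes. */
double sum_array(const double *values, int count)
{
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < count; i++)
        total += values[i];

    return total;
}
```

The same loop without the reduction clause would have every thread updating total at once, which is the kind of resource competition the guidelines in the next section warn against.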

Program Design

Developers should carefully observe the following guidelines in terms of their effects on Hyper-Threading:

1. Threads should not compete for the same resources. This causes logical processors to stall while waiting for the same event. Avoid this problem by having threads do very different things. If threads must perform the same work, make them operate on different data and different events and/or assign the task to a different physical processor, if possible.

2. Where two threads perform different tasks, developers should use processor affinity to assign these threads to the same physical processor. Linux is moving towards processor affinity*.

3. When spinning in a loop while waiting for a resource to become free or an event to occur, place a pause instruction in the loop. On older processors pause is a no-op and does nothing. On HT Technology enabled processors, however, it helps the processor reduce loop-exit overhead and frees resources for the second thread while the first is waiting.

4. Many of the standard practices of multiprocessor-oriented programming apply to hyper-threading. These include:

  • Using thread pools (in which a pool of threads is reused) rather than repeatedly creating new threads
  • Minimizing thread synchronization
  • Minimizing thread use where there is little or no advantage

 

5. The Intel Xeon processor, due to its unusually deep execution pipeline, benefits if programs make code jumps consistent with Intel's branch-prediction algorithm. This algorithm, which is heavily used in out-of-order execution, assumes that backwards jumps are always taken (as they would be in a loop) and that forward jumps are never taken. As a result, when testing a value, the most frequent path should either fall through or branch backwards, while the least-probable decision should jump forward.
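Guideline 3 can be sketched as the spin-wait loop below. This is a minimal illustration, assuming a C11 compiler with <stdatomic.h> and, on x86, the _mm_pause intrinsic from <immintrin.h>, which emits the pause instruction. The names CPU_RELAX and spin_wait and the max_spins bound are hypothetical; a production loop would normally fall back to a blocking wait once the bound is reached.

```c
#include <stdatomic.h>
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>                  /* _mm_pause */
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)           /* assumption: no pause equivalent used here */
#endif

/* Spins until *flag becomes nonzero or max_spins checks have been made.
   Returns 1 if the flag was observed set, 0 if the bound was reached. */
int spin_wait(atomic_int *flag, long max_spins)
{
    for (long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return 1;
        CPU_RELAX();                    /* hint that this is a spin-wait loop */
    }
    return 0;
}
```

Another thread would release the waiter with atomic_store_explicit(flag, 1, memory_order_release).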


Linux* Issues with Hyper-Threaded Technology

For Linux to make use of HT Technology, an SMP kernel must be installed. This can be confirmed from the output of uname -a, which includes SMP in the kernel version string. The output of cat /proc/cpuinfo (see below) shows the number of CPUs and their properties; HT Technology is indicated by the presence of the ht flag. If the computer's BIOS supports SMP (or an analogous option is activated in the BIOS), the machine can be used as an SMP system with Linux. When the SMP kernel is installed, the system will appear to have twice as many processors as are physically present. At present, the errata kernels 2.4.9-21 and 2.4.9-31 have HT Technology support, but you have to pass "acpismp=force" on the kernel command line to enable it.

    $ cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 1
    model name : Intel® Pentium® 4 CPU 1.70GHz
    stepping : 2
    cpu MHz : 1694.907
    cache size : 256 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 2
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
    bogomips : 3381.65
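A program can perform the same check by scanning the flags line of /proc/cpuinfo for the ht token. The sketch below is one way to do that; the function names line_has_flag and cpuinfo_has_ht are hypothetical, and the 4096-byte line buffer is an assumption that may need enlarging on systems with very long flags lines. As noted above, the flag only reports that the processor supports HT Technology.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `flag` appears in `line` as a whole word, 0 otherwise. */
int line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\t' || p[len] == '\n';
        if (starts && ends)
            return 1;
        p += len;               /* skip this partial match and keep looking */
    }
    return 0;
}

/* Scans a cpuinfo-style file. Returns 1 if a "flags" line lists ht,
   0 if none does, and -1 if the file cannot be opened. */
int cpuinfo_has_ht(const char *path)
{
    char line[4096];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "flags", 5) == 0 && line_has_flag(line, "ht")) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

A typical call would be cpuinfo_has_ht("/proc/cpuinfo").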

 

The Linux 2.4 kernel will not be developed to take active advantage of HT Technology, and patches may be required to fix some possible problems. Rusty Russell (a Linux developer) mentions: "The hyperthreading issue... is likely to throw a new set of complications into the mix. A processor which does hyperthreading looks like two independent CPUs, but it [processes] should not be scheduled [by the scheduler] as such - it is better to divide process across real (hardware) processors first."

Work is underway in the Linux community to add HT Technology patches to the 2.4 release of the Linux kernel. HT Technology changes to Linux are being exploited by the Carrier Grade Linux project under the OSDL. HT Technology offers the potential for performance enhancements that are a prime requirement of the carrier grade environment. Active utilization of HT Technology will likely appear in the standard 2.5 Linux kernel, although it has not yet been placed into the development queue because prerequisite work is still being done.

Linux HT Technology Capable Compilers

The Intel C++ Compiler 6.0 for Linux has increased levels of Linux and industry standards support that provide improved compatibility with GNU C, broader support of Linux distributions, support for the C++ ABI object model, and GNU inline ASM for IA-32. The compiler supports OpenMP API version 1.0 for C, with new support for OpenMP 1.0 for C++, and performs code transformation for shared-memory parallel programming. It also supports multi-threaded application development and debugging. This makes it fully capable of building HT Technology enabled applications for HT Technology enabled hardware.


Conclusion

The release of Intel Hyper-Threading processors is being met with great enthusiasm and effort by the Linux development community. Carrier Grade Linux efforts are under way to exploit HT Technology to achieve greater availability and response time in the demanding carrier grade environment. Additionally, HT Technology is a great step in encouraging the improvement of software performance through better programming techniques to take advantage of new hardware technology without having to, yet again, reinvent the world.

It is likely that Intel will continue to develop the HT Technology of several logical processors running on each physical processor. With HT Technology, Intel will drag software development into a standard methodology of using threaded code, an effort that will benefit everyone.


About the Author

Thomas Wolfgang Burger is the owner of Thomas Wolfgang Burger Consulting. He has been a consultant, instructor, analyst and applications developer since 1978.

