Intel® 64 Architecture Processor Topology Enumeration

 

Processor topology information is important for a number of processor-resource management practices, ranging from task/thread scheduling and licensing policy enforcement to affinity control and migration. Topology information of the cache hierarchy can also be important to optimizing software performance. This white paper covers topology enumeration algorithms for single-socket to multiple-socket platforms using Intel 64 and IA-32 processors. The topology enumeration algorithms (both processor and cache) using the initial APIC ID have been extended to use the x2APIC ID; the latter mechanism is required for future platforms supporting more than 256 logical processors in a coherent domain.


Download "Intel® 64 Architecture Processor Topology Enumeration" [PDF 153KB]

Download Code Package: topo-09272010.tar

 

 


Introduction

Hardware multithreading in microprocessors has proliferated in recent years. The majority of Intel® architecture processors shipping today provide one or more forms of hardware multithreading support (multi-core and/or simultaneous multithreading (SMT), the latter introduced as Hyper-Threading Technology in 2002). From a processor hardware perspective, the physical package of an Intel 64 processor can support SMT and multiple cores. Consequently, a physical processor is effectively a hierarchically ordered collection of logical processors with some forms of shared system resources (for example, memory, bus/system links, caches). From a platform hardware perspective, hardware multithreading in a multi-processor system may consist of two or more physical processors organized in either a uniform or non-uniform configuration with respect to the memory subsystem.

Application programming using hardware multithreading features must follow the programming models and software constructs provided by the underlying operating system. For example, an OS scheduler generally assigns a software task from a queue to hardware resources at the granularity of a logical processor; an OS may define its own data structures and provide services that allow multithreaded applications to customize the assignment between tasks and logical processors via an affinity construct. The OS and the software stack underneath an application (the BIOS, the OS loader) also play significant roles in bringing up the hardware multithreading features and configuring the software constructs defined by the OS.

The CPUID instruction in Intel 64 architecture defines a rich set of information to assist the BIOS, OS, and applications in querying the processor topology information needed for efficient operation by each member of the software stack. Generally, the BIOS needs to gather topology information of a physical processor, determine how many physical processors are present in the system, prepare the necessary software constructs related to system topology, and pass along the system topology information to the next layer of the software stack that takes over control of the system. The OS and the application layers have a wide range of uses for topology information. This document covers several common usages by the OS and applications of CPUID to analyze processor topology in a single-processor or multi-processor system.

The primary software usage of processor topology enumeration deals with querying and identifying the hierarchical relationships of logical processors, processor cores, and physical packages in a single-processor or multi-processor system. We’ll refer to this usage as system topology enumeration. System topology enumeration may be needed by the OS or certain applications to implement licensing policies based on physical processors. It is used by the OS to implement efficient task scheduling, minimize thread migration, configure application thread management interfaces, and configure memory allocation services appropriate to the processor/memory topology. Multithreaded applications need system topology information to determine optimal thread binding, manage memory allocation for optimal locality, and improve performance scaling in multi-processor systems.

Intel 64 processors fulfill the requirements of system topology enumeration as follows:

  • Each logical processor in an Intel 64 or IA-32 platform supporting coherent memory is assigned a unique ID (APIC ID) within the coherent domain. A multi-node cluster installation may employ vendor-specific BIOS that preserves the APIC IDs assigned (during processor reset) within each coherent domain and extends them with node IDs to form a superset of unique IDs within the clustered system. This document will only cover the CPUID interfaces providing unique IDs within a coherent domain.
  • The values of unique IDs assigned within a coherent Intel 64 or IA-32 platform conform to an algorithm based on bit-field decomposition of the APIC ID into three sub-fields. The three sub-fields correspond to three hierarchical levels defined as “SMT”, “processor core” (or “core”), and “physical package” (or “package”). This allows each hierarchical level to be mapped to a sub-field (a sub ID) within the APIC ID.

 

Conceptually, a topology enumeration algorithm simply extracts the sub ID corresponding to a given hierarchical level from the APIC ID, based on deriving two parameters that define the subset of bits within the APIC ID: (a) the width of a mask that can be used to mask off unneeded bits in the APIC ID, and (b) an offset relative to bit 0 of the APIC ID.

The “SMT” level corresponds to the innermost constituent of the processor topology, so it is located in the least significant portion of the APIC ID. If the corresponding width for “SMT” is 0, it implies there is only one logical processor within the next outer level of the hierarchy. For example, Intel® Core™2 Duo processors generally report an “SMT_Mask_Width” of 0. If the corresponding width is 1 bit wide, there can be two logical processors within the next outer level of the hierarchy.

If the corresponding width for “core” is 0, it implies there is only one processor core within a physical processor. If the corresponding width for “core” is 1 bit wide, there can be two processor cores within a physical processor.

Note that the values of the APIC IDs assigned across all logical processors in the system need not be contiguous, but the subsets of bit fields corresponding to the three hierarchical levels are contiguous at bit boundaries. Due to this requirement, the bit offset of the mask to extract a given sub ID can be derived from the “mask widths” of the inner hierarchical levels.
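The extraction step described above can be sketched in C. This is a minimal illustration (the function name is ours, not taken from the reference code), assuming the mask width and bit offset have already been derived and are each less than 32:

```c
/* Extract a sub ID from an APIC ID, given the two derived parameters:
 * the width of the bit field and its offset relative to bit 0.
 * Assumes mask_width and bit_offset are each less than 32. */
static unsigned extract_sub_id(unsigned apic_id,
                               unsigned mask_width,
                               unsigned bit_offset)
{
    unsigned mask = ~(0xFFFFFFFFu << mask_width); /* mask_width low bits set */
    return (apic_id >> bit_offset) & mask;
}
```

For example, for an APIC ID of 0x2D with an SMT field 2 bits wide at offset 0, the SMT sub ID is 0x2D & 0x3 = 1.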


Unique APIC ID in a Multi-Processor System

Although legacy IA-32 multi-processor systems assign unique APIC IDs to each logical processor in the system, the programming interfaces have evolved several times. For Intel Pentium Pro processors and Pentium III Xeon processors, APIC IDs are accessible only from local APIC registers (local APIC registers use memory-mapped I/O interfaces and are managed by the OS). In the first generation of Intel Pentium 4 and Intel Xeon processors (2000, 2001), the CPUID instruction provided information on the initial APIC ID assigned during processor reset. The CPUID instruction in the first generation of Intel Xeon MP processors and Intel Pentium 4 processors supporting Hyper-Threading Technology (2002) provided additional information that allows software to decompose initial APIC IDs into a two-level topology enumeration. With the introduction of dual-core Intel 64 processors in 2005, system topology enumeration using CPUID evolved into a three-level algorithm on the 8-bit initial APIC ID. Future Intel 64 platforms may be capable of supporting a number of logical processors that exceeds the capacity of the 8-bit initial APIC ID field. The x2APIC extension in Intel 64 architecture defines a 32-bit x2APIC ID; the CPUID instruction in future Intel 64 processors will allow software to enumerate system topology using x2APIC IDs. The extended topology enumeration leaf of CPUID (leaf 11) is the preferred interface for system topology enumeration on future Intel 64 processors.

The CPUID instruction in future Intel 64 processors may support leaf 11 independent of x2APIC hardware. For many future Intel 64 platforms, system topology enumeration may be performed using either CPUID leaf 11 or legacy initial APIC ID (via CPUID leaf 1 and leaf 4). Figure 1 shows an example of how software can choose which CPUID leaf information to use for system topology enumeration.

Figure 1 Example of Choosing CPUID Leaf Information for System Topology Enumeration

 

The maximum supported CPUID leaf can be determined by setting EAX = 0, executing CPUID, and examining the value returned in EAX, i.e. CPUID.0:EAX. If CPUID.0:EAX >= 11, software can determine whether CPUID leaf 11 exists by setting EAX = 11 and ECX = 0, executing CPUID, and examining whether a non-zero value is returned in EBX, i.e. CPUID.(EAX=11, ECX=0):EBX != 0.
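The decision flow of Figure 1 can be sketched as follows (function and parameter names are illustrative, not from the reference code; the inputs are the values software has already read via CPUID):

```c
/* Decide which CPUID leaf to use for topology enumeration, given:
 *   max_leaf  = CPUID.0:EAX
 *   leafB_ebx = CPUID.(EAX=11, ECX=0):EBX (0 if leaf 11 was not executed)
 * Returns 11 if the extended topology leaf should be used,
 * 4 if the legacy leaf 1 + leaf 4 method applies, 1 otherwise. */
static int select_topology_leaf(unsigned max_leaf, unsigned leafB_ebx)
{
    if (max_leaf >= 11 && leafB_ebx != 0)
        return 11;   /* extended topology leaf is implemented */
    if (max_leaf >= 4)
        return 4;    /* fall back to the leaf 1 + leaf 4 algorithm */
    return 1;        /* leaf 1 only: limited topology information */
}
```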

Fully functional hardware multithreading requires full reporting of CPUID leaves.

If software observes that CPUID.0:EAX < 4 on a newer Intel 64 or IA-32 processor (introduced after 2004), it should examine MSR IA32_MISC_ENABLE[bit 22].

If IA32_MISC_ENABLE[bit 22] was set to 1 (by the BIOS or other means), the user can restore full reporting of CPUID leaf functions by clearing IA32_MISC_ENABLE[bit 22] to 0 (modify the BIOS CMOS setting or use WRMSR).

The three-level system topology enumeration algorithm (using CPUID leaf 1 and leaf 4) is fully compatible with older IA-32 processors that support only a two-level topology (SMT and physical package). For processors that report CPUID.1:EBX[23:16] as reserved (i.e. 0), the processor supports only one level of topology.

Table A-1 shows a code example of dealing with CPUID leaf functions across three categories of processor hardware.


System Topology Enumeration Using CPUID Extended Topology Leaf

The algorithm of system topology enumeration can be summarized as three phases of operation:

  • Derive the “mask width” constants that will be used to extract each sub ID.
  • Gather the unique APIC IDs of each logical processor in the system, and extract/decompose each APIC ID into three sets of Sub IDs.
  • Analyze the hierarchical relationships of the sub IDs to establish mapping tables between the OS’s thread management services and the three hierarchical levels of the processor topology.

 

Table A-2 shows an example of the basic structure of the three phases of system-wide topology enumeration as applied to processor topology and cache topology.

Figure 2 outlines the procedures of querying CPUID leaf 11 for the x2APIC ID and extracting sub IDs corresponding to the “SMT”, “Core”, “physical package” levels of the hierarchy.

Figure 2 Procedures to Extract Sub IDs from the x2APIC ID of Each Logical Processor

Table A-3 lists a data structure that holds the APIC ID, various sub IDs, and a hierarchical set of ordinal numbering scheme to enumerate each entity in the processor topology and/or cache topology of a system.

System topology enumeration at the application level using CPUID involves executing the CPUID instruction on each logical processor in the system. This implies context switching using services provided by the OS. On-demand context switching by user code generally relies on a thread affinity management API provided by the OS. The capabilities and limitations of the thread affinity APIs of different OSes vary. For example, in some OSes the thread affinity API has a limit of 32 or 64 logical processors. It is expected that enhancements to thread affinity APIs to manage larger numbers of logical processors will be available in future versions.


System Topology Enumeration Using CPUID Leaf 1 and Leaf 4

Figure 3 outlines the procedures of querying initial APIC ID via CPUID leaf 1 and extracting sub IDs corresponding to the SMT, Core, physical package levels of the hierarchy using CPUID leaf 1 and leaf 4. Extraction of sub ID from initial APIC ID is based on querying CPUID leaf 1 and 4 to derive the bit widths of three select masks (SMT mask, Core mask, Pkg mask) that make up the 8-bit initial APIC ID field. The select masks allow software to extract the sub IDs corresponding to “SMT”, “core”, “package” from the initial APIC ID of each logical processor.

Figure 3 Procedures to Extract Sub IDs from the Initial APIC ID of Each Logical Processor

Table A-4 shows an example of querying the APIC ID of each logical processor in the system and parsing each APIC ID into its respective sub IDs for later analysis of the topological makeup.

Table A-5 lists support routines to extract the various sub IDs from the APIC ID of the logical processor to which the current execution context has been bound.

Table A-6 shows OS-specific wrapper functions.


Cache Topology Enumeration

The physical package of an Intel 64 processor has a hierarchy of caches. A given level of the cache hierarchy may be shared by one or more logical processors. Some software may wish to optimize performance by taking advantage of a shared cache at a particular level of the cache hierarchy. Performance tuning using cache topology can be accomplished by combining the system topology information with cache topology information. Figure 4 outlines the procedures for decomposing sub IDs to enumerate the logical processors sharing the target cache level and for enumerating the target-level caches visible in the system. The Cache_ID can be extracted from the x2APIC ID on processors that report a 32-bit x2APIC ID, or from the initial APIC ID on processors that do not. The array of “Cache_ID” values can be used to enumerate different caches in conjunction with the other sub IDs derived from the processor topology to implement code-tuning techniques.

Figure 4 Procedures to Extract Cache_ID of the Target Cache Level of Each Logical Processor

The three-level sub IDs, SMT_ID[k], Core_ID[k], Pkg_ID[k], k = 0, .., N-1 can be used by software in a number of application-specific ways. Some of the more common usages include:

  • Determine the number of physical processors to implement a per-package licensing policy. Each unique value in the Pkg_ID[] array represents a physical processor.
  • A thread-binding strategy may choose to favor binding each new task to a separate core in the system. This may require the software to know the relationships between the affinity mask of each logical processor relative to each distinct processor core.
  • An MP-scaling optimization strategy may wish to partition its data working set according to the size of the large last-level cache and allow multiple threads to process the data tile residing in each last level cache. This will require software to manage the affinity masks and thread binding relative to each Cache_ID and APIC ID in the system.
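As a minimal sketch of the first usage above, the number of physical packages can be counted from the Pkg_ID[] array (the function and array names are illustrative, not from the reference code):

```c
/* Count distinct values in pkg_id[0..n-1]; each unique value
 * corresponds to one software-visible physical package. */
static int count_unique_pkgs(const unsigned pkg_id[], int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (pkg_id[j] == pkg_id[i]) { seen = 1; break; }
        if (!seen) count++;  /* first occurrence of this Pkg_ID */
    }
    return count;
}
```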

 


Data Processing of Sub IDs of the Topology

Each hierarchy of the sub IDs represents a subset of the APIC ID (either x2APIC ID or initial APIC ID). It allows software to address each distinct entity within the parent hierarchical level. For processor topology enumeration:

  • SMT_ID: each unique SMT_ID allows software to distinguish different logical processors within a processor core,
  • Core_ID: each unique Core_ID allows software to distinguish different processor cores within a physical package,
  • Pkg_ID: each unique Pkg_ID allows software to distinguish different physical packages in a multi-processor system.

For cache topology enumeration:

  • CacheSMT_ID: each unique CacheSMT_ID allows software to distinguish different logical processors sharing the same target cache level.
  • Cache_ID: each unique Cache_ID allows software to distinguish different caches of the target level in the system.

 

The extraction of sub IDs from the APIC ID makes use of constant parameters derived from the CPUID instruction. From a platform hardware perspective, Intel 64 and IA-32 multi-processor systems require each physical processor to support the same hardware multithreading capabilities. Therefore, system topology enumeration can execute the relevant CPUID leaf functions on one logical processor to derive the system-wide sub-ID extraction parameters. But the APIC IDs must be queried by executing the CPUID instruction on each logical processor in the system.


Sub ID Extraction Parameters for x2APIC ID

Extraction of a sub ID from the x2APIC ID is based on querying the value of CPUID.(EAX=11, ECX=n):EAX[4:0] for a valid sub-leaf index n to obtain the bit width parameter used to derive an extraction mask, while the x2APIC ID itself is queried via CPUID.(EAX=11, ECX=0):EDX[31:0]. The extraction mask allows software to extract a subset of bits from the x2APIC ID as the sub ID for the respective level of the hierarchy. To enumerate the sub-leaves, increase the sub-leaf index (ECX=n) by 1 until CPUID.(EAX=11, ECX=n):EBX[15:0] == 0.

  • SMT_ID: CPUID.(EAX=11, ECX=0):EAX[4:0] provides the width parameter used to derive an SMT select mask to extract the SMT_IDs of logical processors within the same processor core. The sub-leaf index (ECX=0) is architecturally defined and associated with the “SMT” level type (CPUID.(EAX=11, ECX=0):ECX[15:8] == 1).
    • SMT_Mask_Width = CPUID.(EAX=11, ECX=0):EAX[4:0] if CPUID.(EAX=11, ECX=0):ECX[15:8] is 1
    • SMT_Select_Mask = ~((-1) << SMT_Mask_Width)
    • SMT_ID = x2APIC_ID & SMT_Select_Mask
  • Core_ID: The level type associated with sub-leaf index (ECX=1) may vary across processors with different hardware multithreading capabilities. If CPUID.(EAX=11, ECX=1):ECX[15:8] is 2, it is associated with the “processor core” level type, and CPUID.(EAX=11, ECX=1):EAX[4:0] provides the width parameter used to derive a select mask covering all logical processors within the same physical package. The “processor core” level includes “SMT” in this case, so enumerating different cores in the package can be done by zeroing out the SMT portion of the inclusive mask derived from this width.
    • CorePlus_Mask_Width = CPUID.(EAX=11, ECX=1):EAX[4:0] if CPUID.(EAX=11, ECX=1):ECX[15:8] is 2
    • CoreOnly_Select_Mask = (~((-1) << CorePlus_Mask_Width)) ^ SMT_Select_Mask
    • Core_ID = (x2APIC_ID & CoreOnly_Select_Mask) >> SMT_Mask_Width
  • Pkg_ID: Within a coherent domain of the three-level topology, the upper bits of the APIC ID (excluding the lower “CorePlus_Mask_Width” bits) enumerate the different physical packages in the system. In a clustered installation, software may need to consult vendor-specific documentation to determine how many physical packages are organized within a given node.
    • Pkg_Select_Mask = (-1) << CorePlus_Mask_Width
    • Pkg_ID = (x2APIC_ID & Pkg_Select_Mask) >> CorePlus_Mask_Width
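Putting the three sets of formulas above together, a sketch in C (assuming the two width values have already been read from CPUID leaf 11, and that each width is less than 32; the names are ours, not the reference code's):

```c
typedef struct {
    unsigned smt_id, core_id, pkg_id;
} sub_ids_t;

/* smt_width      corresponds to CPUID.(EAX=11, ECX=0):EAX[4:0]
 * coreplus_width corresponds to CPUID.(EAX=11, ECX=1):EAX[4:0] */
static sub_ids_t decompose_x2apic(unsigned x2apic_id,
                                  unsigned smt_width,
                                  unsigned coreplus_width)
{
    unsigned smt_mask      = ~((unsigned)-1 << smt_width);
    unsigned coreonly_mask = (~((unsigned)-1 << coreplus_width)) ^ smt_mask;
    unsigned pkg_mask      = (unsigned)-1 << coreplus_width;
    sub_ids_t ids;
    ids.smt_id  = x2apic_id & smt_mask;
    ids.core_id = (x2apic_id & coreonly_mask) >> smt_width;
    ids.pkg_id  = (x2apic_id & pkg_mask) >> coreplus_width;
    return ids;
}
```

For example, with smt_width = 1 and coreplus_width = 3, an x2APIC ID of 21 (binary 10101) decomposes into SMT_ID = 1, Core_ID = 2, Pkg_ID = 2.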

 

An example of deriving the extraction parameters for x2APIC ID can be found in the support function “CPUTopologyLeafBConstants()” in the Appendix.

Table A-7 lists the support function to derive bitmask extraction parameters from CPUID leaf 0BH to extract sub IDs from x2APIC ID.


Sub ID Extraction Parameters for Initial APIC ID

Topological sub ID extraction from the INITIAL_APIC_ID (CPUID.1:EBX[31:24]) uses parameters derived from CPUID.1:EBX[23:16] and CPUID.(EAX=04H, ECX=0):EAX[31:26]. CPUID.1:EBX[23:16] represents the maximum number of addressable IDs (initial APIC IDs) that can be assigned to logical processors in a physical package. This value may not be the same as the number of logical processors present in the hardware of a physical package. The value of (1 + CPUID.(EAX=4, ECX=0):EAX[31:26]) represents the maximum number of addressable IDs (Core_IDs) that can be used to enumerate different processor cores in a physical package. This value may also differ from the actual number of processor cores present in the hardware of a physical package.

  • SMT_ID: The equivalent “SMT_Mask_Width” can be derived by dividing the maximum number of addressable initial APIC IDs by the maximum number of addressable Core_IDs:
    • SMT_Mask_Width = Log2( RoundToNearestPof2(CPUID.1:EBX[23:16]) / (CPUID.(EAX=4, ECX=0):EAX[31:26] + 1) ), where Log2 is the base-2 logarithm and RoundToNearestPof2() rounds the input integer up to the nearest power-of-two integer that is not less than the input value.
    • SMT_Select_Mask = ~((-1) << SMT_Mask_Width )
    • SMT_ID = INITIAL_APIC_ID & SMT_Select_Mask
  • Core_ID: The value of (1 + CPUID.(EAX=04H, ECX=0):EAX[31:26]) can also be used to derive an equivalent “CoreOnly_Mask_Width”.
    • CoreOnly_Mask_Width = Log2(1 + (CPUID.(EAX=4, ECX=0):EAX[31:26] ))
    • CoreOnly_Select_Mask = (~((-1) << (CoreOnly_Mask_Width + SMT_Mask_Width) ) ) ^ SMT_Select_Mask.
    • Core_ID = (INITIAL_APIC_ID & CoreOnly_Select_Mask) >> SMT_Mask_Width
  • Pkg_ID: Pkg_Select_Mask can be derived as follows:
    • CorePlus_Mask_Width = CoreOnly_Mask_Width + SMT_Mask_Width
    • Pkg_Select_Mask = ((-1) << CorePlus_Mask_Width)
    • Pkg_ID = (INITIAL_APIC_ID & Pkg_Select_Mask) >> CorePlus_Mask_Width
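The helper operations named in the formulas above can be sketched as follows. This is a simplified illustration, not the paper's reference code; as an assumption, the divisor is also rounded to a power of two here, which matches the contiguous-bit-field layout of the APIC ID:

```c
/* Smallest power of two not less than v (v > 0). */
static unsigned round_to_pow2(unsigned v)
{
    unsigned p = 1;
    while (p < v) p <<= 1;
    return p;
}

/* Base-2 logarithm of a power of two. */
static unsigned log2_pow2(unsigned v)
{
    unsigned w = 0;
    while ((1u << w) < v) w++;
    return w;
}

/* max_lp_ids   corresponds to CPUID.1:EBX[23:16]
 * max_core_ids corresponds to CPUID.(EAX=4, ECX=0):EAX[31:26] + 1 */
static unsigned smt_mask_width(unsigned max_lp_ids, unsigned max_core_ids)
{
    return log2_pow2(round_to_pow2(max_lp_ids) / round_to_pow2(max_core_ids));
}
```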

 

Table A-8 lists the support function to derive bitmask extraction parameters from CPUID leaf 01H and Leaf 04H to extract sub IDs from initial APIC ID. Table A-9 shows the support function to derive mask widths from the system-wide extraction parameters.


Cache ID Extraction Parameters

Cache IDs are specific to a target cache level of the cache hierarchy. Software must determine, a priori, the target cache level (the sub-leaf index n associated with CPUID leaf 4) it wishes to optimize with respect to the processor topology. After it has chosen the sub-leaf index ECX=n, then Log2(RoundToNearestPof2(1 + CPUID.(EAX=4, ECX=n):EAX[25:14])) is the equivalent “Cache_Mask_Width” parameter. The “Cache_Mask_Width” parameter forms the basis to construct either a select mask to extract the sub IDs of logical processors sharing the target cache level, or a complementary mask to select the upper bits of the APIC ID to identify the different caches of the specified target level in the system. The mask to extract the sub IDs of the different logical processors sharing a cache is simply ~((-1) << Cache_Mask_Width).
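A sketch of the Cache_Mask_Width derivation above (the raw field value would come from CPUID.(EAX=4, ECX=n):EAX[25:14]; the function name is ours):

```c
/* max_sharing_minus1 corresponds to CPUID.(EAX=4, ECX=n):EAX[25:14]. */
static unsigned cache_mask_width(unsigned max_sharing_minus1)
{
    unsigned count = max_sharing_minus1 + 1; /* max LPs sharing this cache */
    unsigned width = 0;
    while ((1u << width) < count)            /* Log2(RoundToNearestPof2(count)) */
        width++;
    return width;
}
```

The select mask for logical processors sharing the cache is then ~((-1) << cache_mask_width(...)).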

The derivation of bitmask extraction parameters for cache topology is analogous to those shown in Table A-8. Software may choose to focus on one specific cache level in the cache hierarchy. In the companion full source code package that is released separately with this paper, the reader can find code examples of derivation for the bitmask extraction parameters for each cache level, and corresponding cache topology sorting examples. For space consideration, the full source of the cache topology code is not listed in the Appendix.


Analyzing Topology Enumeration Result and Customization

How can software make use of topological information (in the form of hierarchical sub IDs)? This really depends on the needs and situation specific to each application. It may need adaptation due to differences in the APIs provided by different OSes. For the purpose of illustration, we consider some examples of using sub IDs to establish and manage affinity masks hierarchically.

The knowledge of the sub IDs of each topological hierarchy may be useful in several ways. For example:

  • Count the number of entities in a given hierarchical level across the system;
  • Use OS threading management services (e.g. affinity masks) while adding topological insights (per-core, per-package, per-target-level-cache) to optimize application performance.

 

An affinity mask is a data structure defined within a specific OS; different OSes may use the same concept but provide different application programming interfaces. For example, Microsoft Windows* provides the affinity mask as a data type whose bit fields can be directly manipulated by applications for affinity control. Linux implements a similar data structure internally but abstracts it, so applications manipulate affinity through an iterative interface that assigns a zero-based number to each logical processor.

The affinity mask or the equivalent numbering scheme provided by the OS does not carry attributes that can store the hierarchical attributes of the system topology. We will use the “affinity mask” terminology generically in this section, as the technique can easily be generalized to the numbered interface of affinity control.

In the reference code example, we use the sub IDs to create an ordinal numbering scheme (zero-based) for each hierarchical level. Different entities in the system topology (packages, cores) can be referenced by applications using a set of hierarchical ordinal numbers. Using the hierarchical ordinal number scheme and a look-up table to the corresponding affinity masks, software can easily control thread binding, optimize cache usage, etc.
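As a minimal illustration of such a look-up table (all names, array dimensions, and the 64-bit mask type are assumptions for this sketch, not the reference code):

```c
#define MAX_PKG    4   /* assumed upper bounds for this illustration */
#define MAX_CORE   8
#define MAX_THREAD 2

typedef unsigned long long affinity_mask_t;

/* Maps a hierarchical ordinal triplet (package, core, thread) to the
 * affinity mask of the corresponding logical processor. */
static affinity_mask_t affinity_lut[MAX_PKG][MAX_CORE][MAX_THREAD];

/* Record the OS logical-processor number 'os_cpu' under its
 * (package, core, thread) ordinal triplet. */
static void record_lp(unsigned pkgORD, unsigned coreORD,
                      unsigned threadORD, unsigned os_cpu)
{
    affinity_lut[pkgORD][coreORD][threadORD] = 1ULL << os_cpu;
}
```

To bind a thread to, say, the second core of the first package, software would look up affinity_lut[0][1][0] and pass the resulting mask to the OS affinity API.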

Figure 4 depicts a basic example of data processing of the Pkg_IDs and Core_IDs in the system to acquire information on the number of software-visible physical packages and processor cores in the system. This basic technique can also be adapted to acquire affinity mappings, hierarchical breakdowns, and asymmetry information in the system.

Table A-10 parts a, b, and c list an algorithm to analyze the sub IDs of all logical processors in the system and derive a triplet of zero-based numbering schemes to index the unique entities within each topological level.

Table A-11 lists a data structure that organizes miscellaneous global variables, arrays, and workspace items used throughout the rest of the code example. The full set of source code is provided in a separate package. The full source code can be compiled under 32-bit and 64-bit Windows and Linux operating systems. A limited set of OS and compiler tools have been tested.


Dynamic Software Visibility of Topology Enumeration

When application software examines or uses topology information, it must keep in mind the dynamic nature of software visibility. The hardware capability present at the platform level may be presented differently through BIOS settings, OS boot options, or OS-supported user interfaces. For example, Intel 64 and IA-32 multi-processor systems require each physical processor to support the same hardware multithreading capabilities. This hardware symmetry that exists at the platform hardware level may be presented differently at the application level. System topology enumeration can uncover dynamic software-visible asymmetry, irrespective of whether such asymmetry is caused by a BIOS setting, an OS boot option, or a UI configuration.

The appendix lists the bulk of the supporting functions used in enumerating the processor topology of the system as visible to the current software process. The complete source code is provided separately for download. The reference code can be compiled in either a 32-bit or 64-bit Windows* environment. In a 64-bit environment, the cpuid64.asm file is needed to provide an enhanced intrinsic function for querying CPUID sub-leaves. An equivalent reference code implementation for 32-bit and 64-bit Linux environments will also be available.


Glossary

Physical Processor: The physical package of a microprocessor capable of executing one or more threads of software at the same time. Each physical package plugs into a physical socket. Each physical package may contain one or more processor cores. A physical processor is also referred to as a physical package.

Processor Core: The circuitry that provides dedicated functionalities to decode, execute instructions, and transfer data between certain sub-systems in a physical package. A processor core may contain one or more logical processors.

Logical Processor: The basic unit of processor hardware resources that allows the software executive (OS) to dispatch a task or execute a thread context. Each logical processor can execute only one thread context at a time.

Hyper-Threading Technology: A feature within the IA-32 family of processors, where each processor core provides the functionality of more than one logical processor.

SMT: Abbreviation for Simultaneous Multi-Threading. An efficient means in silicon to provide the functionality of multiple logical processors within the same processor core by sharing execution resources and the cache hierarchy between logical processors.

Multi-core Processor: A physical processor that contains more than one processor core.

Multi-processor Platform: A computer system made of two or more physical sockets.

Hardware Multi-threading: Refers to any combination of hardware support to allow a system to run multi-threaded software. The forms of hardware support for multi-threading are: SMT, multi-core, and multi-processor.

Processor Topology: The hierarchical relationships of processor entities (logical processors, processor cores) within a physical package, relative to the sharing hierarchy of hardware resources within the physical processor.

Cache Hierarchy: Physical arrangement of cache levels that buffers data transport between a processor entity and the physical memory subsystem.

Cache Topology: Hierarchical relationships of a cache level relative to the logical processors in a physical processor.


Appendix

Table A-1 Determination of System-wide CPU Topology Constant

 

// Derive parameters used to extract/decompose APIC ID for CPU topology
// The algorithm assumes CPUID feature symmetry across all physical packages.
// Since CPUID reporting by each logical processor in a physical package is
// identical, we only execute CPUID on one logical processor to derive these
// system-wide parameters

// return 0 if successful, non-zero if error occurred
static int CPUTopologyParams()
{
    DWORD maxCPUID;  // highest CPUID leaf index this processor supports
    CPUIDinfo info;  // data structure to store register data reported by CPUID

    _CPUID(&info, 0, 0);
    maxCPUID = info.EAX;

    // cpuid leaf B detection
    if (maxCPUID >= 0xB)
    {
        CPUIDinfo CPUInfoB;
        _CPUID(&CPUInfoB, 0xB, 0);
        // glbl_ptr points to assortment of global data, workspace, etc
        glbl_ptr->hasLeafB = (CPUInfoB.EBX != 0);
    }

    _CPUID(&info, 1, 0);
    // Use HWMT feature flag CPUID.01:EDX[28] to treat three configurations:
    if (getBitsFromDWORD(info.EDX, 28, 28))
    {
        // #1, Processors that support CPUID leaf 0BH
        if (glbl_ptr->hasLeafB)
        {
            // use CPUID leaf B to derive extraction parameters
            CPUTopologyLeafBConstants();
        }
        else
        // #2, Processors that support legacy parameters
        //     using CPUID leaf 1 and leaf 4
        {
            CPUTopologyLegacyConstants(&info, maxCPUID);
        }
    }
    else
    // #3, Prior to HT, there is only one logical
    //     processor in a physical package
    {
        glbl_ptr->CoreSelectMask = 0;
        glbl_ptr->SMTMaskWidth = 0;
        glbl_ptr->PkgSelectMask = (-1);
        glbl_ptr->PkgSelectMaskShift = 0;
        glbl_ptr->SMTSelectMask = 0;
    }

    if (glbl_ptr->error) return -1;
    else return 0;
}

 

Table A-2 Modular Structure of Deriving System Topology Enumeration Information

 

/*
 * BuildSystemTopologyTables
 *
 * Construct the processor topology tables and values necessary to
 * support the external functions that display CPU topology and/or
 * cache topology derived from system topology enumeration.
 * Arguments: None
 * Return: None, sets glbl_ptr->error if tables or values can not be calculated.
 */
static void
BuildSystemTopologyTables()
{
    unsigned lcl_OSProcessorCount, subleaf;
    int numMappings = 0;

    // call OS-specific service to find out how many logical processors
    // are supported by the OS
    glbl_ptr->OSProcessorCount = lcl_OSProcessorCount = GetMaxCPUSupportedByOS();

    // allocate the memory buffers within the global pointer
    AllocArrays(lcl_OSProcessorCount);

    // Gather all the system-wide constant parameters needed to
    // derive topology information
    if (CPUTopologyParams()) return;
    if (CacheTopologyParams()) return;

    // For each logical processor, collect APIC ID and
    // parse sub IDs for each APIC ID
    numMappings = QueryParseSubIDs();
    if (numMappings < 0) return;

    // Derive separate numbering schemes for each level of the cpu topology
    if (AnalyzeCPUHierarchy(numMappings) < 0) {
        glbl_ptr->error |= _MSGTYP_TOPOLOGY_NOTANALYZED;
    }

    // an example of building cache topology info for each cache level
    if (glbl_ptr->maxCacheSubleaf != -1) {
        for (subleaf = 0; subleaf <= glbl_ptr->maxCacheSubleaf; subleaf++) {
            if (glbl_ptr->EachCacheMaskWidth[subleaf] != 0xffffffff) {
                // ensure there is at least one core in the target level cache
                if (AnalyzeEachCHierarchy(subleaf, numMappings) < 0) {
                    glbl_ptr->error |= _MSGTYP_TOPOLOGY_NOTANALYZED;
                }
            }
        }
    }
}

 

Table A-3 Data Structure of APIC ID, Sub IDs, and Mapping of Ordinal Based Numbering Schemes

typedef struct {
unsigned __int32 APICID;
// the full x2APIC ID or initial APIC ID of a logical
//processor assigned by HW
unsigned __int32 OrdIndexOAMsk;
// An ordinal index (zero-based) for each logical
// processor in the system, 1:1 with "APICID"
// Next three members are the sub IDs for processor topology enumeration
unsigned __int32 pkg_IDAPIC;
// Pkg_ID field, subset of APICID bits
// to distinguish different packages
unsigned __int32 Core_IDAPIC;
// Core_ID field, subset of APICID bits to
// distinguish different cores in a package
unsigned __int32 SMT_IDAPIC;

// SMT_ID field, subset of APICID bits to
// distinguish different logical processors in a core
// the next three members store a numbering scheme of ordinal indexes
// for each level of the processor topology.
unsigned __int32 packageORD;
// a zero-based numbering scheme for each physical
// package in the system
unsigned __int32 coreORD;
// a zero-based numbering scheme for each core in the
// same package
unsigned __int32 threadORD;
// a zero-based numbering scheme for each thread in
// the same core
// Next two members are the sub IDs for cache topology enumeration
unsigned __int32 EaCacheSMTIDAPIC[MAX_CACHE_SUBLEAFS];
// SMT_ID field, subset of
// APICID bits to distinguish different logical processors
// sharing the same cache level
unsigned __int32 EaCacheIDAPIC[MAX_CACHE_SUBLEAFS];
// sub ID to enumerate
// different cache entities of the cache level corresponding
// to the array index/cpuid leaf 4 subleaf index
// the next two members store a numbering scheme of ordinal indexes
// for enumerating different cache entities of a cache level, and for
// enumerating logical processors sharing the same cache entity.
unsigned __int32 EachCacheORD[MAX_CACHE_SUBLEAFS];
// a zero-based numbering
// scheme for each cache entity of the specified cache level in the system
unsigned __int32 threadPerEaCacheORD[MAX_CACHE_SUBLEAFS];
// a zero-based
// numbering scheme for each logical processor sharing the same cache of the
// specified cache level

} IdAffMskOrdMapping;


/* Alternate technique for ring 3 code to infer the effect of a BIOS CMOS setting
 * that restricts the CPUID instruction to report a highest leaf index of 2,
 * i.e. MSR IA32_MISC_ENABLES[22] was set to 1. This situation
 * prevents software from using CPUID to conduct topology enumeration.
 * The RDMSR instruction is privileged; this alternate routine can run in ring 3.
 */
int InferBIOSCPUIDLimitSetting()
{ DWORD maxleaf, max8xleaf;
CPUIDinfo info; // data structure to store register data reported by CPUID
// check CPUID leaf reporting capability is intact
CPUID(&info, 0);
maxleaf = info.EAX;
CPUID(&info, 0x80000000);
max8xleaf = info.EAX;
// Earlier Pentium 4 and Intel Xeon processors (prior to the 90nm Intel Pentium 4
// processor) support a maximum extended leaf index of 0x80000004;
// the 90nm Intel Pentium 4 processor and later processors support extended
// leaf indices greater than 0x80000004.
if ( maxleaf <= 4 && max8xleaf > 0x80000004) return 1;
else return 0;
}

 

 

Table A-4 Query APIC ID and Parsing APIC ID into Sub IDs

/*
* QueryParseSubIDs
*
* Use OS specific service to find out how many logical processors can be accessed
* by this application.
* Querying CPUID on each logical processor requires using OS-specific API to
* bind current context to each logical processor first.
* After gathering the APIC ID's for each logical processor,
* we can parse each APIC ID into sub IDs for each topological level.
* The thread affinity API used to bind the current context is subject
* to limits specific to each OS.
* The loop iterating over each logical processor managed by the OS can be done
* in a manner that abstracts the OS-specific affinity mask data structure.
* Here, we construct a generic affinity mask that can handle an arbitrary number
* of logical processors.
* Return: the number of logical processors enumerated, or -1 on error
*/
long QueryParseSubIDs(void)
{ unsigned i;
//DWORD_PTR processAffinity;
//DWORD_PTR systemAffinity;
unsigned long numMappings = 0, lcl_OSProcessorCount;
unsigned long APICID;
// we already queried OS how many logical processor it sees.
lcl_OSProcessorCount = glbl_ptr->OSProcessorCount;
// we will use our generic affinity bitmap that can be generalized from
// OS specific affinity mask constructs or the bitmap representation of an OS
AllocateGenericAffinityMask(&glbl_ptr->cpuid_values_processAffinity, lcl_OSProcessorCount);
AllocateGenericAffinityMask(&glbl_ptr->cpuid_values_systemAffinity, lcl_OSProcessorCount);
// Set the affinity bits of our generic affinity bitmap according to
// the system affinity mask and process affinity mask
SetChkProcessAffinityConsistency(lcl_OSProcessorCount);
if (glbl_ptr->error) return -1;

for (i=0; i < glbl_ptr->OSProcessorCount;i++) {
// can't assume the OS affinity bit mask is contiguous,
// but we are using our generic bitmap representation for affinity
if(TestGenericAffinityBit(&glbl_ptr->cpuid_values_processAffinity, i) == 1) {
// bind the execution context to the ith logical processor
// using an OS-specific API

if( BindContext(i, glbl_ptr->cpuid_values_OSProcessorCount) ) {
glbl_ptr->error |= _MSGTYP_UNKNOWNERR_OS;
break;
}
// now the execution context is on the i'th cpu, call the parsing routine
ParseIDS4EachThread(i, numMappings);
numMappings++;
}
}
glbl_ptr->EnumeratedThreadCount = numMappings;
if( glbl_ptr->error)return -1;
else return numMappings;
}

 

 

Table A-5 Support Routine for Parsing APIC ID into Sub IDs

/*
* ParseIDS4EachThread
* after execution context has already bound to the target logical processor
* Query the 32-bit x2APIC ID if the processor supports it, or
* Query the 8-bit initial APIC ID for older processors. Apply various
* system-wide topology constant to parse the APIC ID into various sub IDs
* Arguments:
* i : the ordinal index referencing a logical processor in the system
* numMappings : running count of how many processors we've parsed
* Return: 0 if no error
*/
unsigned ParseIDS4EachThread(unsigned i, unsigned numMappings)
{ unsigned APICID;
unsigned subleaf;

APICID = glbl_ptr->PApicAffOrdMapping[numMappings].APICID = GetApicID(i);
glbl_ptr->PApicAffOrdMapping[numMappings].OrdIndexOAMsk = i;
// this is an ordinal number that relates to the generic affinity mask
glbl_ptr->PApicAffOrdMapping[numMappings].pkg_IDAPIC = ((APICID & glbl_ptr->PkgSelectMask)
>> glbl_ptr->PkgSelectMaskShift);
glbl_ptr->PApicAffOrdMapping[numMappings].Core_IDAPIC = ((APICID & glbl_ptr->CoreSelectMask)
>> glbl_ptr->SMTMaskWidth);
glbl_ptr->PApicAffOrdMapping[numMappings].SMT_IDAPIC = (APICID & glbl_ptr->SMTSelectMask);
if(glbl_ptr->maxCacheSubleaf != -1) {
for(subleaf=0; subleaf <= glbl_ptr->maxCacheSubleaf; subleaf++) {
glbl_ptr->PApicAffOrdMapping[numMappings].EaCacheSMTIDAPIC[subleaf]
= (APICID & glbl_ptr->EachCacheSelectMask[subleaf]);

 


Comments


I met the same problem reported by CQ Tang.
Our platform consists of dual Woodcrest Xeon 5345 processors.
From /proc/cpuinfo, #0 #1 #4 and #5 are in package 0.
And with the reference code, the result is:
#0 and #1 share one L2 cache
#4 and #5 share another L2 cache
But the result from ping-pong test shows that:
Communication between #0 and #4 has the top performance.
Communication between #0 and #1 has the same performance as communication between #0 and #5, and that performance is much lower than communication between #0 and #4.
Communication between #0 and #2, #0 and #3, #0 and #6, and #0 and #7 is the same in performance.
It shows that #0 and #4 share the same L2 cache.

I have repeated the test many times, and I think maybe the reference code has some bugs.

P.S. Your work is great!

ananthnarayan_s:

The build script bug reported by Sam doesn't appear to have been fixed. :)

--Ananth
Shih Kuo (Intel):

I had a few off-line email exchanges to root cause Mr. lio's observations. The correlation of expected performance characteristics relative to the output example of topology enumeration may be of interest to other folks.

To recap the background, the situation under test involved ping-pong data transmission between a pair of affinitized transmit/receive agents.

If the transmitter/receiver pair are affinitized to logical processors sharing the same L2 cache, one can expect the latency of such ping-pong data transmission to be lower than if the data had to go out on the bus.

The ping-pong test was run on a Linux system and affinity is controlled using a mask that is equivalent to the CPU_SET API. The hardware is a dual-socket quad core system.

So, how does one go about mapping a pair of ordinal numbers sent to the CPU_SET API and be certain those ordinal numbers control two processor cores sharing the same L2?
In this case there are three sets of numbers that are relevant:

1. The composite affinity mask for the target L2. For a cache level shared by two logical processors, the composite mask will have two non-zero bits.

The tutorial portion of the code example can print out a small box diagram that shows the L2 mapping to affinity masks; the diagram below depicts the L2 cache level in one of the two quadcore processors:

+----+----+----+----+
Cache | L2 | L2 |
Size | 4M | 4M |
CmbMsk | 11 | 22 |
+----+----+----+----+

2. Since CPU_SET uses an ordinal index to specify an individual logical processor, we need a listing of each logical processor's affinity mask relative to the ordinal numbers that software can use to iterate over individual logical processors. The example code also produces the following listing that connects the three sets of numbers (the combined mask of each L2, and the affinity masks of each logical processor, listed in numerical order of the ordinal numbers):

Affinity mask 00000001 - apic id 0
Affinity mask 00000002 - apic id 2
Affinity mask 00000004 - apic id 4
Affinity mask 00000008 - apic id 6
Affinity mask 00000010 - apic id 1
Affinity mask 00000020 - apic id 3
Affinity mask 00000040 - apic id 5
Affinity mask 00000080 - apic id 7

3. The composite mask value of 11H corresponds to the affinity mask values 01H (ordinal number 0 in the listing) and 10H (ordinal number 4). The sequence of ordinal number pairs that software can use to affinitize the ping/pong transmitter/receiver test to take advantage of the L2 is (0, 4), (1, 5), (2, 6), or (3, 7). The first pair matches what Mr. Lio observed.

It is important to realize that the numerical order of an ordinal-numbered index (for CPU_SET in Linux or SetThreadAffinity) does not imply a direct mapping relationship to any cache level (L1 data cache, L1 instruction cache, L2, etc.).

In a single-socket quad-core system, you may see the following:
+----+----+----+----+
Cache | L2 | L2 |
Size | 4M | 4M |
CmbMsk | 3 | c |
+----+----+----+----+

and an ordinal numbered listing of

Affinity mask 00000001 - apic id 0
Affinity mask 00000002 - apic id 1
Affinity mask 00000004 - apic id 2
Affinity mask 00000008 - apic id 3

One interesting point here is that the optimal pairing for affinity control (via an affinity mask or equivalent) is highly workload specific.

As Mr. Lio observed, pairing logical processors (0, 1) in a dual-socket, quad-core topology did not produce the best latency in the transmitter/receiver test. On a single-socket quadcore system, that pairing would.

On the other hand, one factor in scheduling heuristics is minimizing cache evictions due to context switches. More than 90% of wall-clock cycles often land in the system idle loop, implying it is more common to have several unsubscribed logical processor resources available when a runnable task is added to the queue.

On the dual-socket, quadcore system shown above, using a consecutive sequence of ordinal numbers to schedule a new task can take advantage of the capacity of the L2 without causing evictions for other tasks that are already running.

But applying the same consecutive-ordinal heuristic to a single-socket quadcore system could result in two tasks competing for L2 capacity. Cache evictions caused by starting another task may then have a visible performance impact, depending on the size of the respective working sets.


A couple of minor points:

1. The Unix shell files should have Unix line endings.

2. When run, the following warning was generated:

cpu_topo.c: In function ‘DumpCPUIDArray’:
cpu_topo.c:1857: warning: comparison is always false due to limited range of data type

Looks like the 'cres' variable should perhaps be unsigned...


How do I find the number of physical processors and the number of cores per physical processor on an Intel Xeon E5540 processor?
I have tried using CPUID, but was not able to find the exact bit locations. My code works fine on a single-processor (Intel Xeon) system.

Thanks
Kumaravel


I have been using this code for some time - thanks!
But I am also aware of, and have started using the code in Khang Nguyen's article and the HTMultiCore library.
Now I am trying to reconcile the two sets of data, and need help please.
Here is output on my Lenovo ThinkPad, using the distributed sample programs.

HTMultiCore tells me:
Hyper-Threading Technology: Not Capable
System: Multi-core
Number of available logical processors per physical processor: 1
Number of available cores per physical processor: 2
Number of physical processors: 1

cpu_topo tells me:
Number of logical processors visible to the OS: 2
Number of logical processors visible to this process: 2
Number of processor cores visible to this process: 2
Number of physical packages visible to this process: 1

And, for reference, the systeminfo command tells me:
System type: X86-based PC
Processor(s): 2 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 10 GenuineIntel ~1995 Mhz
[02]: x86 Family 6 Model 15 Stepping 10 GenuineIntel ~1994 Mhz

It is unclear to me whether the terms "physical processor" (HTMultiCore) and "physical package" (cpu_topo) are synonymous or not. If they are meant to be, then it seems to me that I am getting different answers, since in one case I see 1 logical processor per physical processor, and in the other case, 2 logical processors "visible" - which I presume means per physical package, since there is only one of them.

Can you help me understand how to interpret these results please?

Thanks.
Geoffrey

Shih Kuo (Intel):

Kumaravel,

The number of physical processors is system-level information; the CPUID instruction only reports raw data within its own logical processor's domain. If you look at the reference source code, you can find the utility routine GetSysProcessorPackageCount(). There is another utility routine that reports the total number of cores in the system across all physical processors; you can derive the core count per physical processor from that information.

Geoffrey

My colleague Khang is aware that his library code needs to be updated to correct this inconsistency when it tries to report the number of logical processors per core; the text heading was misleading.

As a separate note, Windows 7 and Server 2008 Service Pack 2 have added support for GROUP_AFFINITY with new APIs to support up to 4 groups of affinity masks. I expect to provide updated reference code that includes support for GROUP_AFFINITY in the near future.


I see this error when running cpu_topology64.out:

got error. maxPackageDetetcted= 1 numMappings= 2 maxPackageDetetcted* numMapping

I tracked this one down in the code, and the problem is that cpuid(&info, 1, 0) results in info.EBX having a bogus value. There is nothing I can do about this in the code, as it is the register values that are incorrect. Any ideas? The output of cat /proc/cpuinfo is this:

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.60GHz
stepping : 3
cpu MHz : 3600.298
cache size : 2048 KB
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips : 7110.65
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.60GHz
stepping : 3
cpu MHz : 3600.298
cache size : 2048 KB
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips : 7192.57
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

Shih Kuo (Intel):

Hi Tom,
When you encounter the error with cpu_topology64.out, can you execute it with the parameter 'c' on the command line? That will print the data reported by CPUID for each leaf function to stdout. Please send it to me at shihjong.kuo@intel.com.

It is possible that a different BIOS setting may be at play, or that an obscure mechanism forcing CPUID not to report leaf 4 data may have tricked the software stack. Topology enumeration cannot be done correctly until that system configuration error is corrected. The raw CPUID data produced by running the 'c' option can help diagnose this issue.

Shih Kuo (Intel):

A new release of the topology enumeration reference code has been posted. URL link is at the beginning of this page: topo-12212009.tar.

In previous reference code implementations, the OS-independent wrapper utility routines for thread context management support APIs in both the Windows and Linux environments. In the Linux environment, the limit on how many logical processors can be managed by the OS is generally set by each Linux distro through kernel configuration at compile time. Hence, our wrapper utility, employing a generic affinity mask whose width is dynamically allocated at runtime, can accommodate the different size limits imposed by different distros.

In the Windows environment, prior to Platform SDK 7.0a, the width of the affinity mask (hence the maximum number of logical processors the OS can manage in the system) is limited to 64 in 64-bit mode (and 32 in 32-bit mode). With the release of Platform SDK 7.0a, new APIs and data structures are defined to support systems with more than 64 logical processors. The latest topology enumeration reference code release adds support for these new APIs, which are only available in Platform SDK 7.0a. The Windows binary compiled with the new SDK can run on the latest Windows 7 and its siblings using the new APIs. When running the same binary under prior versions of Windows where PROCESSOR_GROUP is not supported, it will use the legacy thread management APIs and report topology information as previous releases of our reference code did.

A readme file is added in the download package. Detailed code changes are listed there.
