CPUID for x64 Platforms and Microsoft Visual Studio* .NET 2005

When targeting x64 platforms in Visual Studio .NET* 2005, programmers can no longer use inline assembly code as they did for 32-bit code. They must either rely on C/C++ code that uses intrinsics or write a separate 64-bit MASM (.asm) version of the function by hand. Unfortunately, the VS .Net 2005 implementation of the CPUID intrinsic (__cpuid) accepts an input value only in EAX. It does not let the caller set ECX, which the more recently defined CPUID functions use as an additional input for queries about cache parameters and certain multi-core characteristics. Thus, a 64-bit .asm listing is required for full use of the CPUID instruction.

The following code samples demonstrate how to use the CPUID and RDTSC instructions with VS .Net 2005 for 64-bit (x64) platforms. The CPUID instruction is commonly used to obtain detailed information about the system’s CPU(s), and RDTSC is used to read the CPU’s internal time-stamp counter for timing and performance-measurement purposes. The RDTSC intrinsic (__rdtsc) does work as expected and can be used to replace inline assembly.

To build the 64-bit .asm file, create a custom build step that calls the 64-bit MASM assembler, ml64.exe, as shown in the screenshot below. The cpuid64.asm file should not be built in the 32-bit configuration, so for the Win32 platform set General -> Excluded From Build to Yes.


The header file below (cpuid_32_64.h) creates a single definition of the functions _CPUID and _RDTSC that can be used in both 32-bit and 64-bit builds. For 64-bit builds, _CPUID uses the .asm function cpuid64, and _RDTSC uses the intrinsic __rdtsc. For 32-bit builds, _CPUID uses the inline-assembly function cpuid32, and _RDTSC uses the inline-assembly function _inl_rdtsc32.

There are two examples in the C file below (cpuid_32_64.c). The first, GetCoresPerPackage(), calls _CPUID with eax=4 and ecx=0 to read the first set of deterministic cache parameters reported by the CPU, and extracts the field that indicates the number of processor cores per processor package. (For example, this function would return 1 for a single-core Intel® Pentium® 4 processor and 2 for a dual-core Intel® Pentium® D processor.) If this function used the __cpuid intrinsic instead of cpuid64 on an x64 platform, the input value of ecx would be nondeterministic, and the output would be unreliable. The second example, timeSomethingExample(), calls _RDTSC before and after a loop and subtracts the two readings to get the number of time-stamp counter ticks spent in the loop. The _CPUID example shows how to use one definition to invoke either 64-bit .asm code or 32-bit inline assembly, and the _RDTSC example shows how to use one definition to invoke either a 64-bit intrinsic or 32-bit inline assembly.

Both the _CPUID and _RDTSC examples show how to create utility functions that port transparently from Win32 to x64 platforms when each platform requires different underlying code. Furthermore, the cpuid64 function works around a deficiency in the __cpuid intrinsic, allowing both 32-bit and 64-bit applications to use the full capability of the CPUID instruction.

Header file (cpuid_32_64.h):

#pragma once

#include <windows.h> // for DWORD

typedef struct cpuid_args_s {
    DWORD eax;
    DWORD ebx;
    DWORD ecx;
    DWORD edx;
} CPUID_ARGS;

#ifdef __cplusplus
extern "C" {
#endif

#ifdef _M_X64 // For 64-bit apps

unsigned __int64 __rdtsc(void);
#pragma intrinsic(__rdtsc)
#define _RDTSC __rdtsc

void cpuid64(CPUID_ARGS* p); // implemented in cpuid64.asm
#define _CPUID cpuid64

#else // For 32-bit apps

// Execute RDTSC and store EDX:EAX into the 64-bit variable ts
#define _RDTSC_STACK(ts) \
    __asm rdtsc \
    __asm mov DWORD PTR [ts], eax \
    __asm mov DWORD PTR [ts+4], edx

__inline unsigned __int64 _inl_rdtsc32() {
    unsigned __int64 t;
    _RDTSC_STACK(t);
    return t;
}
#define _RDTSC _inl_rdtsc32

void cpuid32(CPUID_ARGS* p); // inline-assembly version in cpuid_32_64.c
#define _CPUID cpuid32

#endif

// Our 32/64-bit example function
int GetCoresPerPackage();

#ifdef __cplusplus
}
#endif


32/64-bit .c file (cpuid_32_64.c):

#include <windows.h>
#include "cpuid_32_64.h"

#ifndef _M_X64
// 32-bit only: run CPUID with the EAX and ECX inputs taken from *p,
// then store all four output registers back into *p
void cpuid32(CPUID_ARGS* p) {
    __asm {
        mov edi, p
        mov eax, [edi].eax
        mov ecx, [edi].ecx // for functions such as eax=4
        cpuid
        mov [edi].eax, eax
        mov [edi].ebx, ebx
        mov [edi].ecx, ecx
        mov [edi].edx, edx
    }
}
#endif

// Assumptions prior to calling:
// - CPUID instruction is available
// - We have already used CPUID to verify that this is an Intel® processor
int GetCoresPerPackage()
{
    // Is explicit cache info available?
    int nCaches = 0;
    int coresPerPackage = 1; // Assume 1 core per package if info not available
    DWORD t;
    int cacheIndex;
    CPUID_ARGS ca;

    ca.eax = 0;
    _CPUID(&ca);
    t = ca.eax; // highest standard CPUID function supported
    if ((t > 3) && (t < 0x80000000)) {
        // Count the caches reported by function 4 (cache type 0 = no more caches)
        for (cacheIndex = 0; ; cacheIndex++) {
            ca.eax = 4;
            ca.ecx = cacheIndex;
            _CPUID(&ca);
            t = ca.eax;
            if ((t & 0x1F) == 0)
                break;
            nCaches++;
        }
    }

    if (nCaches > 0) {
        ca.eax = 4;
        ca.ecx = 0; // first explicit cache
        _CPUID(&ca);
        coresPerPackage = ((ca.eax >> 26) & 0x3F) + 1; // 31:26
    }
    return coresPerPackage;
}

void timeSomethingExample()
{
    ULONGLONG tStart, tElapsed;
    int i;

    tStart = _RDTSC();
    for (i = 0; i < 1000; i++)
    {
        // Do something here 1000 times
    }
    tElapsed = _RDTSC() - tStart; // CPU timer ticks taken to do something 1000 times
}

64-bit .asm file (cpuid64.asm):

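; call cpuid with args in eax, ecx
; store eax, ebx, ecx, edx to p

PUBLIC cpuid64
.CODE
    ALIGN     8
cpuid64 PROC FRAME
    ; void cpuid64(CPUID_ARGS* p);
    ; rcx <= p
    push    rbx                     ; rbx is nonvolatile in the x64 ABI and CPUID overwrites it
    .pushreg rbx
    sub     rsp, 32
    .allocstack 32
    .endprolog

    mov     r8, rcx                 ; keep p in r8, because CPUID overwrites ecx
    mov     eax, DWORD PTR [r8+0]   ; p->eax (function)
    mov     ecx, DWORD PTR [r8+8]   ; p->ecx (sub-function, e.g. cache index for eax=4)
    cpuid
    mov     DWORD PTR [r8+0], eax
    mov     DWORD PTR [r8+4], ebx
    mov     DWORD PTR [r8+8], ecx
    mov     DWORD PTR [r8+12], edx

    add     rsp, 32                 ; epilog undoes the prolog in reverse order
    pop     rbx
    ret
    ALIGN     8
cpuid64 ENDP
END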





This cpuCount() is quite interesting.

I tested it on an OpenVZ server and on an i920 CPU.
In both cases, it returned "1"...

But I found no sysconf.h for Linux; maybe this is the problem?

My version:

#include <sysconf.h>
#include <windows.h>

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

unsigned int cpuCount( void )
{
    unsigned int count = 1; // Always assume 1.

#if defined( LINUX )
    count = sysconf( _SC_NPROCESSORS_CONF );
#elif defined( WINDOWS )
    SYSTEM_INFO si;
    GetSystemInfo( &si );
    count = si.dwNumberOfProcessors;
#endif
    return( count );
}

unsigned int cpuCount( void );

int main ()
{
    unsigned int d = cpuCount();
    printf("%u", d);
    return 0;
}

Did I do something wrong, then?


This is a very informative article. Is a pure-Java solution available? We have a Java-based product, and we want to transition to licensing by number of CPU cores. Although we could use JNI, we would prefer to avoid that path if possible.


For CPU count determination, use the following snippet, something I found whilst researching some asm copy routines for MMX- and SSE-enabled processors.

unsigned int cpuCount( void )
{
    unsigned int count = 1; // Always assume 1.

#if defined( LINUX )
    count = sysconf( _SC_NPROCESSORS_CONF );
#elif defined( WINDOWS )
    SYSTEM_INFO si;
    GetSystemInfo( &si );
    count = si.dwNumberOfProcessors;
#endif
    return( count );
}

Make sure you include the right headers: windows.h for Windows (like duh!) and unistd.h, which declares sysconf(), for Linux.

Hope it helps, have fun and may The Source be with you :)


You cannot detect the number of physical processors without some use of OS APIs. When you issue the CPUID instruction, it runs on the processor on which the OS has scheduled your thread. The enumeration algorithms in the samples linked below work by forcing the current thread onto each of the processors in the system in turn and then reading the APIC-related fields from the CPUID instruction. This requires that the OS provide the total number of logical processors and a way to force the current thread to run on each one (set affinity). See http://software.intel.com/en-us/articles/detecting-multi-core-processor-topology-in-an-ia-32-platform/ and/or http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/. The first has the most up-to-date detection code.

Is it possible to detect the number of physical processors using this technique? The sample works perfectly, but I need a method that queries the processor(s) directly to count the physical processors in the system without using the Windows API.
Many thanks.