Intel Xeon Phi Reading MSR

Hi All,

I want to read specific MSRs on an Intel Xeon Phi 7210. I have never written code to do this before. Can anyone please answer:

1) Which software guide should I read to understand which Xeon Phi MSR does what?
2) Any sample code I can refer?
3) I have used turbostat and am looking to read all the MSRs that this tool reads. The source code of this tool is a bit large, so any pointers will help.

I tried msr-tools, but without knowing which MSR does what and what address to use, it's getting difficult to get hold of it. 

Please advise. Thanks.

Chetan Arvind Patil

1) Which software guide should I read to understand which Xeon Phi MSR does what?

MSRs are documented in the Intel Software Developer Manuals, specifically in Volume 4: Model-Specific Registers (the Xeon Phi 72xx documentation starts on PDF page 298 of the linked version of the document).

2) Any sample code I can refer?

The home page for MSR tools which you gave includes a link to the source code there. That seems a reasonable place to start.

Hi James,


Can you share details on the difference of reading counters using MSR and RAPL? I come from development boards, where most of the system details are read using sysfs. 

Do you think I can get the power, temperature, etc. data directly using the RAPL sysfs interface? (I did investigate the system I have, but still need to validate the data I am getting.) If not, is reading MSRs a good approach?

As a first step, I would like to get sensor data as read by turbostat, but using my own code.

Any advice will be helpful, thanks.

Chetan Arvind Patil

The interface to the RAPL functionality uses MSRs.   If you look at the kernel driver for the sysfs interface, you will see that it eventually resolves its way to an MSR read.   The same is true for any sysfs function that interfaces with hardware functionality that is implemented via MSRs.   (Most hardware configuration is done through MSRs.  Some is done through PCI configuration space, but this mostly applies to the "uncore" devices.)
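To make the RAPL point concrete, here is a minimal Python sketch of the arithmetic behind a package power reading taken from raw MSR values. It assumes the layout the SDM gives for most recent Intel processors: MSR_RAPL_POWER_UNIT (0x606) carries the energy unit exponent in bits 12:8, and MSR_PKG_ENERGY_STATUS (0x611) carries a wrapping 32-bit energy counter in bits 31:0. The function names are made up for illustration, and the addresses and fields should be checked against the Xeon Phi x200 tables in Volume 4 before trusting the numbers.

```python
MSR_RAPL_POWER_UNIT = 0x606    # assumed address; verify in Volume 4
MSR_PKG_ENERGY_STATUS = 0x611  # assumed address; verify in Volume 4


def energy_unit_joules(power_unit_raw):
    """Energy of one counter tick: 1 / 2**ESU joules, with ESU in bits 12:8."""
    esu = (power_unit_raw >> 8) & 0x1F
    return 1.0 / (1 << esu)


def average_power_watts(energy_before, energy_after, unit_joules, seconds):
    """Average power between two raw MSR_PKG_ENERGY_STATUS samples."""
    # The counter is 32 bits wide and wraps, so take the difference modulo 2**32.
    ticks = ((energy_after & 0xFFFFFFFF) - (energy_before & 0xFFFFFFFF)) % (1 << 32)
    return ticks * unit_joules / seconds
```

The two samples have to be taken close enough together that the counter wraps at most once, otherwise the modulo difference undercounts.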

Some aspects of the documentation can be frustrating. Many sections of the documentation (particularly in Volume 3 of the Intel Architectures Software Developers Manual) refer to MSRs by name only, and you have to look up these names (in Volume 4 of the Intel Architectures Software Developers Manual) to obtain the MSR numbers.  It can be tricky to search PDFs for the longer MSR names, since PDF searches don't match on strings that include line breaks (e.g., caused by narrow columns in tables).   The names of the MSRs are not always consistent -- some MSRs are referred to with the "MSR_" prefix in some cases and with the "IA32_" prefix in other places.   Sometimes a specific MSR number will have different names on different processors -- even when the functionality appears to be identical.

It helps to have at least one very large monitor so you can have both Volume 3 and Volume 4 open at the same time.

The msr-tools "rdmsr" and "wrmsr" programs are definitely the most reasonable way to access the MSRs for occasional use.  (The alternative is writing loadable kernel modules, which has a much steeper learning curve.)
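For the "own code" part of the question, the following is a minimal Python sketch of the kind of read that rdmsr performs through the Linux msr driver: open /dev/cpu/N/msr and read 8 bytes at a file offset equal to the MSR number. It assumes the msr kernel module is loaded and the process has root privileges. The function name and the path_template parameter are hypothetical, added here only so the function can be pointed at a different file.

```python
import os
import struct


def read_msr(cpu, reg, path_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR on one logical CPU via the Linux msr driver.

    path_template is a hypothetical parameter, not part of any existing tool.
    """
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The driver treats the file offset as the MSR number and returns 8 bytes.
        data = os.pread(fd, 8, reg)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError("short read from MSR 0x%x on cpu %d" % (reg, cpu))
    # x86 is little-endian: unpack as an unsigned little-endian 64-bit integer.
    return struct.unpack("<Q", data)[0]
```

Calling read_msr(63, 0x19C) as root should correspond to "rdmsr -p 63 0x19c", with the result returned as an integer rather than printed.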

"Dr. Bandwidth"

Hi John,

As you said, the nomenclature is where I am getting confused and frustrated. As of now, I am using msr-tools. However, even with this, interpreting the raw data values and their units is a problem.

For example, for IA32_THERM_STATUS, I get the following raw data:

sudo ./rdmsr -p 63 -d 412

What I don't know (and the document does not specify) is the unit of this output, and whether any post-processing is required. I guess things will get complicated when I move on to other sensors related to package power etc.


Chetan Arvind Patil

IA32_THERM_STATUS (MSR 0x19c = 412d) is composed of a whole bunch of bit flags, so you should look at it in hex.

      rdmsr -p 63 -x -0 0x19c

The "-x" says output in hex.  The "-0" says include all of the bits (including zeros on the left).   I always specify the MSR numbers in hex, but it does not change the behavior of the tool.

On one of my Haswell systems, the decimal output is similar to yours, and is not very enlightening.  The hex output clearly shows more structure.  In the last example I pipe the 64-bit hex output through a little tool that I wrote to print out the bit positions and bits.

# rdmsr -p 0 -d 0x19c

# rdmsr -p 0 -x -0 0x19c

# rdmsr -p 0 -x -0 0x19c | ~mccalpin/bin/bits64
 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
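The bits64 tool itself is not shown in the thread. As a rough stand-in, here is a small Python sketch that produces the same three-row layout from an integer. The function name is made up, and the value 0x88420000 is simply what the bit row above spells out when read back as hex.

```python
def bits64_rows(value):
    """Return three text rows: tens digit of each bit position, ones digit, bit value."""
    positions = range(63, -1, -1)  # most significant bit first
    tens = " ".join(str(b // 10) for b in positions)
    ones = " ".join(str(b % 10) for b in positions)
    bits = " ".join(str((value >> b) & 1) for b in positions)
    return [" " + tens, " " + ones, " " + bits]


# 0x88420000 is the value implied by the bit dump above.
for row in bits64_rows(0x88420000):
    print(row)
```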

Looking up the bit fields in Volume 4 of the SWDM, we see:

  • bit 31: Reading Valid -- this is set, so the value is good
  • bits 30:27:  0x1 -- resolution is 1 degree Celsius
  • bits 22:16: 0x42 = 66 decimal
  • Note that this is the number of degrees *below* the thermal throttling temperature, which can be obtained from bits 23:16 of MSR_TEMPERATURE_TARGET (MSR 0x1a2).
  • The rdmsr tool can read these bits directly using the "-f" option:

rdmsr -p 0 -x -f23:16  0x1a2

  • 0x5b is 91 decimal

  • 91 degrees - 66 degrees = 25 degrees C is the actual temperature reading for core 0.
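The decoding steps in the list above can be sketched in Python as follows. The field positions are the ones quoted in the list (bit 31, bits 22:16 of 0x19c, bits 23:16 of 0x1a2). The function names are made up, and the second input value is a hypothetical raw MSR_TEMPERATURE_TARGET with only the 23:16 field filled in, since the full register contents are not shown here.

```python
def field(value, hi, lo):
    """Extract bits hi:lo (inclusive) as an unsigned integer, like rdmsr -f hi:lo."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def core_temperature_c(therm_status, temperature_target):
    """Degrees C from raw IA32_THERM_STATUS and MSR_TEMPERATURE_TARGET values.

    Returns None when the Reading Valid bit (bit 31) is clear.
    """
    if field(therm_status, 31, 31) == 0:
        return None
    below_target = field(therm_status, 22, 16)         # degrees below throttling point
    throttle_temp = field(temperature_target, 23, 16)  # thermal throttling temperature
    return throttle_temp - below_target


# Inputs chosen to match the worked example: readout 0x42, target 0x5b.
print(core_temperature_c(0x88420000, 0x005B0000))  # prints 25
```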

The bit fields can also be read separately in 0x19c using the "-f" option:

rdmsr -p 0 -x -f31:31 0x19c

rdmsr -p 0 -x -f30:27 0x19c

rdmsr -p 0 -x -f22:16 0x19c

The last case is a number, not a bit field, so using the decimal output for these temperature values might make interpretation easier.  (Note that I am using the "-u" descriptor because these bit fields are interpreted as unsigned integers.)

rdmsr -p 0 -u -f23:16  0x1a2

rdmsr -p 0 -u -f22:16 0x19c

This gives the same 91 C thermal activation temperature, and the same 66 degrees below that temperature as the current temperature of core 0.  Again 91-66 = 25 degrees is the absolute temperature.


"Dr. Bandwidth"

Hi John,

Appreciate your detailed answer (your answers are always helpful!).

Quick question: If I take the difference to get the absolute temperature, how can that be considered accurate? Running "rdmsr -p 0 -u -f23:16  0x1a2", then "rdmsr -p 0 -u -f22:16 0x19c", and then taking the difference will give me the readings, but it won't be real time. Am I correct?

On a side note: Intel has very stable MSR and Linux support compared to other architectures. Why not expose everything via sysfs for the benefit of end users?


Chetan Arvind Patil

The "Thermal Activation Temperature" in MSR_TEMPERATURE_TARGET (0x1A2) should never change for a particular processor.  It is different across processor models, but for each model it looks like it is programmed at the factory to a fixed value.

There are several places that you can get temperature data.  IA32_THERM_STATUS (0x19C) has a scope of "core" on most Intel processors, but it has a scope of "module" on Xeon Phi x200.  "Module" is a potentially confusing label to use here, but at the beginning of Section 2.17 of Volume 4 of the SWDM the text explains that "module" is the same as "tile" (i.e., a processor pair) for the Xeon Phi x200.   You can also get the temperature using IA32_PACKAGE_THERM_STATUS (0x1B1), which provides a single temperature for the entire chip.  It looks like the value provided by IA32_PACKAGE_THERM_STATUS is the maximum of the values of the various sensors on the package.  Not all of the sensors are in cores, so it is possible to get "package" temperatures that are higher than any of the core temperatures.

I find it easier to use the MSR interface than sysfs interfaces, since the Linux kernel developers always seem to want to "simplify" or abstract these interfaces.  It is painfully difficult to work through the kernel source code to find out what a sysfs device is actually doing at the lowest level -- which is the only level that is documented.  Using the MSR interface directly allows me to skip a lot of irritating Linux kernel detective work....

There was an attempt to build an "MSR-safe" kernel extension that would allow user access to "safe" MSRs -- e.g., programming the performance counters and reading most of the MSRs -- but I think that project was dropped due to a lack of ongoing funding.

My biggest gripe about the current MSR interface is that it requires crossing into the kernel for every MSR read and requires an interprocessor interrupt for every MSR read on any core that is not the currently executing core.   This makes reading lots of MSRs very expensive.  One of my performance monitoring codes takes about 2 milliseconds to read a subset of the available counters on all cores of a Xeon Phi x200, and most of the time is spent in cross-processor MSR reads.  The /dev/cpu/*/msr interface does allow a "length" parameter that can be bigger than 8 Bytes, but instead of reading multiple MSRs, it simply reads the same MSR multiple times and returns only the final result.   I am guessing that Intel's "sep" driver is able to batch the performance counter reads into a smaller number of kernel crossings, because I have seen VTune results with 2 millisecond sampling granularity -- definitely not practical with the 2 millisecond overhead of my code.
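To get a feel for this cost on a given machine, a sketch along the following lines reads a set of MSRs on a set of CPUs through /dev/cpu/N/msr and reports the elapsed time. It is only an illustration, not the monitoring code mentioned above: the function name and the path_template parameter are hypothetical, and it still pays one kernel crossing per MSR read. All it saves is reopening the device file for each register.

```python
import os
import struct
import time


def sample_msrs(cpus, regs, path_template="/dev/cpu/{cpu}/msr"):
    """Read every MSR in regs on every CPU in cpus.

    Returns (values, elapsed_seconds), where values[cpu][reg] is the raw
    64-bit value. Each os.pread is still one kernel crossing per MSR; only
    the open/close cost is shared across the reads on a CPU.
    """
    values = {}
    start = time.perf_counter()
    for cpu in cpus:
        fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
        try:
            values[cpu] = {
                reg: struct.unpack("<Q", os.pread(fd, 8, reg))[0] for reg in regs
            }
        finally:
            os.close(fd)
    return values, time.perf_counter() - start
```

Dividing the elapsed time by len(cpus) * len(regs) gives an average per-read cost, which can then be compared between reads on the local core and reads on remote cores.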

It would not be too hard to write a kernel module that would dump all the performance counters in a single call.   I don't know how hard it would be to "batch" the transactions so that one InterProcessor Interrupt would read all the target MSRs on a core, but that does not seem too scary.  The hard part (based on my experience) is figuring out how to write all the testing code to make sure that the kernel copy back to user space does not accidentally use kernel privileges to overwrite memory that it should not overwrite.  In a previous project, a kernel device driver would happily kill the system if the user passed the kernel a bad pointer.  In that case no one else had access to the device driver, so the security flaw was not a problem, but it feels like it would take a fair amount of study to figure out how to make this sort of interface safe to deploy in an environment with thousands of users.

"Dr. Bandwidth"

Excellent thread for understanding thermal polling right from the processor die.
