HowTo MSR for Turbo Ratios ?

Hello,

My source code ZFreq.c displays the frequencies of the i7 cores.

I read the core ratios from the MSRs and multiply them by the current external clock reported by SMBIOS.

However, whatever the system load is, the MSR IA32_PERF_STATUS never returns values in the turbo zone given by MSR_TURBO_RATIO_LIMIT.

In short, IA32_PERF_STATUS never goes above MSR_PLATFORM_INFO.MaxNonTurboRatio.

Please help me program these MSRs correctly.

 

Thank You

CyrIng

Fr

Source code also available at code.cyring.fr/FTS/Source/C/zfreq.c

-;)

The clock ratio that you obtain depends on the model number of the part, the number of processors active, the processor temperature, the processor power consumption, the current draw, and some other factors that Intel does not describe in a lot of detail.

To look into this further, it would really help to have the exact processor model number. If you look at the Wikipedia page on Core i7 processors, you will see that there are 9 different implementations that are referred to as "Desktop Core i7" processors. This includes Nehalem, Westmere, Sandy Bridge, Ivy Bridge, and Haswell cores -- and four of these five have two different options for the uncore. In addition, there are 10 different implementations that are referred to as "Mobile Core i7" processors. This also includes Nehalem, Westmere, Sandy Bridge, Ivy Bridge, and Haswell cores, with three of the five cores being associated with different uncore or packaging options.

One reason that this matters is that, even though the processor MSRs to control the frequency ratio request may be the same, different processors have different support for other features that may be helpful in understanding why the processor is behaving as it does.  For example, the Core i7 processors based on the "Sandy Bridge E" (and probably "Ivy Bridge E") should use the same interface to the uncore "Performance Control Unit" as the "Sandy Bridge EP" (Xeon E5-1600/2400/2600/4600) server chips.  For those processors you can query registers in the Performance Control Unit to find out why the requested frequency ratio has not been granted.

 

John D. McCalpin, PhD
"Dr. Bandwidth"

Hi, Thanks for helping.

So I have progressed by implementing the performance counter MSRs. A lot of fun!

This second release of the source code is tested on my Bloomfield i7-920 (Nehalem architecture) with BCLK overclocked to 160 MHz, and it may run on its successors.

The frequency based on unhalted cycles shows some very interesting values: sometimes below the minimum ratio and, rarely, above the maximum one.

But it does happen, which perhaps means that turbo is "furtive". Thus, to catch it, I'm displaying turbo bumps on a quarter scale.

However, I still don't get a display like the one the Intel Widget for Windows shows.

I guess the key is to optimize the counter readings, for instance with a tiny thread loop that produces no output.

CyrIng

Hello Cyring,

Sorry to delay responding. I've been busy doing end-of-year, start-of-year work.

In your program, you are only setting the 'count OS cycles' bit for the fixed counters (unless I'm mistaken). So you are only going to count unhalted reference cycles and unhalted core cycles if your measurement program is running in ring0... and I kind of doubt that you are running at ring0.
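To make the point about the 'count OS cycles' bit concrete, here is a minimal sketch of the values one might write so that the unhalted core and unhalted reference counters count in both ring 0 and ring 3. The bit layout (a 4-bit field per fixed counter in IA32_FIXED_CTR_CTRL, with bit 0 enabling OS counting and bit 1 enabling user counting, plus enable bits 32 to 34 in IA32_PERF_GLOBAL_CTRL) is my reading of the SDM; the helper names are hypothetical and are not taken from ZFreq.c.

```c
#include <stdint.h>

#define IA32_FIXED_CTR_CTRL   0x38d
#define IA32_PERF_GLOBAL_CTRL 0x38f

/* Enable bits inside each 4-bit field of IA32_FIXED_CTR_CTRL. */
#define FIXED_EN_OS  0x1u   /* count while running in ring 0 */
#define FIXED_EN_USR 0x2u   /* count while running in rings 1..3 */

/* Hypothetical helper: control field for fixed counter idx (0..2). */
static uint64_t fixed_ctr_ctrl_field(unsigned idx, unsigned enable_bits)
{
    return (uint64_t)(enable_bits & 0xfu) << (4 * idx);
}

/* Hypothetical helper: global enable bit for fixed counter idx (0..2). */
static uint64_t fixed_ctr_global_enable(unsigned idx)
{
    return 1ULL << (32 + idx);
}

/* Value for IA32_FIXED_CTR_CTRL: CTR1 (unhalted core) and CTR2
 * (unhalted reference) counting in all rings. */
static uint64_t fixed_ctrl_all_rings(void)
{
    return fixed_ctr_ctrl_field(1, FIXED_EN_OS | FIXED_EN_USR)
         | fixed_ctr_ctrl_field(2, FIXED_EN_OS | FIXED_EN_USR);
}

/* Value for IA32_PERF_GLOBAL_CTRL: enable CTR1 and CTR2. */
static uint64_t global_ctrl_ctr1_ctr2(void)
{
    return fixed_ctr_global_enable(1) | fixed_ctr_global_enable(2);
}
```

A real driver would probably read both registers first and OR these values in, so that counters programmed by other software keep their settings.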

But it is nice code though...

Pat

Thanks a lot for your advice -;)

Meanwhile I have made progress with the C3 and C6 states, which I have already implemented in a bigger project.
You may check my blog or SourceForge for screenshots and the source code of the Xlib widgets.

However, turbo ratios still don't show up with relative frequencies based on C-states and TSC.

As you said, working in ring 0 should be the key.

Best Regards

CyrIng

 

Hello,

I have blogged my formula to compute the Turbo Ratio:

Ratio = OR × { d(URC) ÷ d(TSC) } + TR

It gives some good results, even in Ring3

Please let me know if you find it correct.

 

CyrIng

Hello Cyring,

I'm not sure what this is really calculating. It looks like some kind of add on to the regular turbo ratio. TR is defined as unhalted_core_clks/unhalted_ref_clks so if you are running at TSC freq then TR=1.

Your 'Ratio = OR × { d(URC) ÷ d(TSC) } + TR' is then basically Ratio = operating_freq_ratio * %unhalted + TR. So if you are running at TSC freq, say 2.0 GHz with no halting, then Ratio = 2.0*(1) + 1 = 3 ... unless I'm doing something wrong.

But a more fundamental issue with this approach is using the current frequency from (IA32_PERF_STATUS) and trying to say that the instantaneous IA32_PERF_STATUS tells you something about average frequency.

On Haswell for instance, going into or out of halting can take .32 useconds (0.32e-6 secs). Going into turbo can take 0.1usecs. See http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014 . So if we took 0.1usecs as the smallest 'window' then we could have maybe 10,000,000 changes in frequency per second and you are looking at maybe 1 of those changes (assuming you are just reading IA32_PERF_STATUS once per second). Do you see what I'm trying to say?

Pat

Hello Pat,

Thank you for your reply.

Reading MSR_TURBO_RATIO_LIMIT (0x1ad) returns the following Turbo Ratio Values

MaxRatio_1C=22 ; MaxRatio_2C=21 ; MaxRatio_3C=21 ; MaxRatio_4C=21

Reading MSR_PLATFORM_INFO (0xce) returns a MinimumRatio of 12 and a MaxNonTurboRatio of 20
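For readers following along, the fields quoted above can be pulled out of the raw 64-bit MSR values as sketched below. The positions are assumptions based on my reading of the SDM for Nehalem: one 8-bit ratio per active-core count in MSR_TURBO_RATIO_LIMIT starting at bit 0, MaxNonTurboRatio in MSR_PLATFORM_INFO[15:8] and the minimum ratio in [47:40]. The function names are hypothetical.

```c
#include <stdint.h>

/* Extract an 8-bit field whose least significant bit is at position lsb. */
static unsigned msr_field8(uint64_t msr, unsigned lsb)
{
    return (unsigned)((msr >> lsb) & 0xff);
}

/* MSR_PLATFORM_INFO (0xce), bits 15:8. */
static unsigned max_non_turbo_ratio(uint64_t platform_info)
{
    return msr_field8(platform_info, 8);
}

/* MSR_PLATFORM_INFO (0xce), bits 47:40 (assumed position of the minimum ratio). */
static unsigned min_ratio(uint64_t platform_info)
{
    return msr_field8(platform_info, 40);
}

/* MSR_TURBO_RATIO_LIMIT (0x1ad): max ratio with active_cores cores busy.
 * Returns 0 for an out-of-range core count. */
static unsigned turbo_ratio_limit(uint64_t turbo_limit, unsigned active_cores)
{
    if (active_cores < 1 || active_cores > 8)
        return 0;
    return msr_field8(turbo_limit, 8 * (active_cores - 1));
}
```

With the values reported above, a raw MSR_TURBO_RATIO_LIMIT whose low 32 bits are 0x15151516 decodes to 22 for one core and 21 for two, three or four cores.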

My issue with the Core i7-920 is that it should show 2 bumps when one core (and only one core) is loaded, meaning a ratio of 22.

Looping every 1sec to get current ratio from IA32_PERF_STATUS returns a ratio with only two possible values : 12 or 20

As you noticed : OR x { d(URC) ÷ d(TSC) } remains OR .

Thus never above 20, except if I add the remaining States { d(UCC) ÷ d(URC) } which gives a ratio up to [ 21.0 - 22.0 ]

I share your point of view that a smaller sample must be taken into account : this is reserved for a future ring0 driver.

Meanwhile I would like to be sure of the correct formula and the associated registers.

CyrIng

Hello

nmi_watchdog is a bad boy. It uses the counters as soon as Linux boots. Thank you Pat for this info.
The kernel modules which enable it were blacklisted in /etc/modprobe.d/modprobe.conf

blacklist iTCO_vendor_support
blacklist iTCO_wdt

and verified in  /proc/sys/kernel/nmi_watchdog with a 0 value

Another thing I have observed when tracing Unhalted Core Clocks is that the counter can go "backward", even if I take care of the 64-bit overflow.
Meanwhile, I found some answers in "The accuracy of the performance counter statistics".

To my understanding, those variations of UCC are explained by events such as interrupts, throttling and instruction serialization.
However, can this also show some kind of Turbo activity?
 

Best regards

CyrIng

Hello Cyring,

The counter should only 'go backward' if the counter has overflowed or some piece of code has reset the counter.

The counter isn't 64 bits wide but probably 48 bits. You can get the counter widths from cpuid, input 0xa: eax[23:16] gives the general-purpose counter width and edx[12:5] gives the fixed counter width.
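A minimal sketch of a wrap-safe delta for a counter narrower than 64 bits follows. As I read the SDM, CPUID leaf 0xA reports the general-purpose counter width in EAX[23:16] and the fixed-function counter width in EDX[12:5]; both helper names are hypothetical.

```c
#include <stdint.h>

/* Fixed-function counter width from CPUID.0AH:EDX, bits 12:5. */
static unsigned fixed_counter_width(uint32_t cpuid_0a_edx)
{
    return (cpuid_0a_edx >> 5) & 0xff;
}

/* Difference between two readings of a width_bits-wide counter.
 * Correct as long as the counter wrapped at most once between reads. */
static uint64_t counter_delta(uint64_t prev, uint64_t curr, unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64) ? UINT64_MAX
                                       : ((1ULL << width_bits) - 1);
    return (curr - prev) & mask;
}
```

Masking the unsigned difference handles the wrap without a branch: a 48-bit counter that goes from 0xFFFFFFFFFFF0 to 0x10 yields a delta of 0x20 instead of a huge or "backward" value.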

What are you trying to measure? It should be easy to measure turbo mode frequency. If you are still not showing that you are getting into turbo mode then there are several possibilities. Your chip may have turbo mode disabled, perhaps in the bios, or I think I've seen some low power chips where turbo is hardwired off (but the chip spec sheet will say this). Or the OS may be disabling turbo mode, usually due to using a 'favor power savings over performance' power plan. On some Windows versions, it seems like the 'Balanced' power plan disabled turbo.

Requirements for turbo are that

1) the frequency be allowed to go to max non-turbo frequency (see /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq)... the requirement might actually be that the freq be allowed to go to > than max non-turbo freq (see /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies). Setting scaling_max_freq and scaling_min_freq allows you to control the cpu frequency (to one of the allowed frequencies).

2) that cpuid.input(0x6).output(eax[1]) == 1,

3) MSR 0x1a0 IA32_MISC_ENABLE bit 38 be == 0 (this bit is usually controlled by bios)

4) MSR 0x199 IA32_PERF_CTL bit 32 be == 0. This bit may be changed by the OS. If the OS doesn't want to allow turbo then it can set this bit.

You can see what the max turbo freq is for 1 core by looking at MSR_TURBO_RATIO_LIMIT[0:7].

Now, if all of the above permits turbo, you can still not get into turbo mode if the power limit is exceeded or the thermals don't permit it (chip too hot).
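Conditions 2) to 4) above can be checked from three raw values, as in the sketch below. It assumes the caller has already obtained CPUID.06H:EAX, IA32_MISC_ENABLE (0x1a0) and IA32_PERF_CTL (0x199) by whatever means is available; the function name is hypothetical, and power or thermal limits are deliberately not covered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the static turbo prerequisites listed above hold:
 *   CPUID.06H:EAX[1]     == 1  (turbo available)
 *   IA32_MISC_ENABLE[38] == 0  (not disabled, usually by the BIOS)
 *   IA32_PERF_CTL[32]    == 0  (not disengaged by the OS)
 */
static bool turbo_permitted(uint32_t cpuid6_eax,
                            uint64_t misc_enable,
                            uint64_t perf_ctl)
{
    bool available     = (cpuid6_eax >> 1) & 1;
    bool bios_disabled = (misc_enable >> 38) & 1;
    bool os_disengaged = (perf_ctl >> 32) & 1;

    return available && !bios_disabled && !os_disengaged;
}
```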

But let's say that everything is permitting turbo. Then you should be able to write a simple, single-threaded spinner program (just spin for x seconds) and pin it to 1 cpu, with the rest of the system idle, and the turbo ratio should show that you are hitting MSR_TURBO_RATIO_LIMIT[0:7].

Pat

Thanks Pat for these instructions.

I have all the requirements for Turbo gathered. Some Windows tools show Turbo is working fine (such as T-Monitor)

My code is made for Linux, and I have blacklisted the cpufreq module; however, the Turbo feature is enabled in cpuid and activated in MISC_PROC_FEATURES[38], as shown below.

Measuring the cycle deltas at idle and then under high load on 1 core gives me the following values
(columns are, in order, UCC:URC C3 C6 / TSC):

IDLE: [screenshot not preserved]

HIGH: [screenshot not preserved]

I don't understand which MSR registers can give me a ratio hitting MSR_TURBO_RATIO_LIMIT[0:7] which is btw 0010110 in my screenshot.

So the max turbo freq for 1 core is 2.2 GHz and you seem to be running at 2.6 GHz. I guess you are overclocking the CPU?

I'm not sure how turbo mode behaves when you overclock. I'm guessing that the cpu sees the freq is already > 2.2 GHz and doesn't try to turbo boost.

I would not advise blacklisting kernel modules unless you really, really know what you are doing. I've always found that messing with the nmi_watchdog file controlled the watchdog.

Pat

CPU overclocking is not enabled (besides the 3 Corsair DDR memory modules pushed to 1600 MHz).

In the BIOS, BCLK is set to 133 and Ratio to auto (so between 12 and 20), so the system is running @ 20 x 133 MHz.

How did you compute a max freq of 2.2 GHz for 1 core ?

Sorry, I assumed that the bus freq (bclk) was 100 MHz.

Pat

Indeed, the monitoring counters are 48 bits wide.

Thanks for this.

CyrIng

Hello,

Is this formula correct to display, per logical core, its non-halted activity including turbo?

            DisplayRatio = TurboRatio x State(C0) x MaxNonTurboRatio
              where
                  TurboRatio = Delta(UCC) / Delta(URC)
              and State(C0) = Delta(URC) / Delta(TSC)
              and MaxNonTurboRatio = MSR_PLATFORM_INFO[15:8]
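Substituting the two definitions into the formula above shows that the URC terms cancel, so the displayed value is the unhalted core cycle count relative to the TSC, scaled by the maximum non-turbo ratio. This is only a restatement of the formula as written, not a judgment on whether it is the right quantity to display:

```latex
\begin{aligned}
\text{DisplayRatio}
  &= \frac{\Delta\text{UCC}}{\Delta\text{URC}}
     \times \frac{\Delta\text{URC}}{\Delta\text{TSC}}
     \times \text{MaxNonTurboRatio} \\
  &= \frac{\Delta\text{UCC}}{\Delta\text{TSC}} \times \text{MaxNonTurboRatio}
\end{aligned}
```

For example, with a TurboRatio of 1.1, a C0 residency of 0.5 and a MaxNonTurboRatio of 20, the displayed value would be 1.1 x 0.5 x 20 = 11.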

 

Best Reply

Hello Cyring,

It depends on what you mean by 'per logical core its non halted activity'.

Usually I look at the 2 fields separately.

1) average non-halted frequency over the interval = TSC_frequency * delta(CPU_CLK_UNHALTED.THREAD) / delta(CPU_CLK_UNHALTED.REF)

2) %of time cpu is unhalted = 100 * delta(CPU_CLK_UNHALTED.REF)/delta(TSC)

Item 1) tells me "when the cpu was running (not halted), what was the average frequency". Item 2) tells me "what % of time was the cpu running".
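The two fields can be computed from counter deltas as in the sketch below. It assumes the deltas have already been corrected for counter wrap and that the TSC frequency is known in MHz; the function names are hypothetical.

```c
#include <stdint.h>

/* Item 1: average frequency while not halted, in MHz.
 * d_thread = delta(CPU_CLK_UNHALTED.THREAD), d_ref = delta(CPU_CLK_UNHALTED.REF). */
static double avg_unhalted_mhz(double tsc_mhz, uint64_t d_thread, uint64_t d_ref)
{
    if (d_ref == 0)
        return 0.0;               /* cpu was halted for the whole interval */
    return tsc_mhz * (double)d_thread / (double)d_ref;
}

/* Item 2: percentage of the interval during which the cpu was not halted. */
static double pct_unhalted(uint64_t d_ref, uint64_t d_tsc)
{
    if (d_tsc == 0)
        return 0.0;
    return 100.0 * (double)d_ref / (double)d_tsc;
}
```

Keeping the two values separate avoids mixing "how fast" with "how busy": a core in turbo for 1% of the interval would otherwise look like a core running at a very low frequency.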

There is an article http://software.intel.com/en-us/articles/measuring-the-average-unhalted-frequency.

Pat

Hello,

Thanks a lot for your help.

Now it works as I wish: Turbo gives 2 bumps.

To test it, I have made a demo Linux live CD, including the source code and the developer packages (Code::Blocks IDE)

 

CyrIng
 

 

Good day,

I'm making my program backward compatible with the 64-bit Core 2 architectures. It is split into 3 algorithms:

  1. Nehalem and later architectures, based on the fixed performance counters
    step a- Initialize counters, write the MSR IA32_PERF_GLOBAL_CTRL(0x38f) and IA32_FIXED_CTR_CTRL(0x38d)
    step b- Read the MSR IA32_FIXED_CTR1(0x30a)  , IA32_FIXED_CTR2(0x30b) , IA32_TIME_STAMP_COUNTER(0x10) , MSR_CORE_C3_RESIDENCY(0x3fc) and MSR_CORE_C6_RESIDENCY(0x3fd)
    step c- Computes, displays C0, C3, C6 states
    step d- Loop to step b
     
  2. Core 2 algorithm, similar to the previous one, except that there are no MSR_CORE_C3_RESIDENCY and MSR_CORE_C6_RESIDENCY registers.
    step a- Initialize counters
    step b- Read the IA32_FIXED_CTR MSRs
    step c- only C0 states are taken into account.
    step d- Loop to step b
     
  3. A fallback algorithm for Genuine architectures:
    step a- Read the MSR IA32_APERF(0xe8) , IA32_MPERF(0xe7) and IA32_TIME_STAMP_COUNTER(0x10)
    step b- Computes, displays C0 states only from values read in step a
    step c- Loop to step a
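As an illustration of the fallback algorithm's step b, the sketch below derives a ratio and a C0 fraction from IA32_APERF, IA32_MPERF and TSC deltas. It assumes that MPERF advances at a fixed reference rate and APERF at the actual core rate while the core is in C0, which is how I understand these counters; the function names are hypothetical and the code is not taken from the author's program.

```c
#include <stdint.h>

/* Effective ratio while in C0: the reference ratio scaled by APERF/MPERF. */
static double aperf_mperf_ratio(unsigned max_non_turbo_ratio,
                                uint64_t d_aperf, uint64_t d_mperf)
{
    if (d_mperf == 0)
        return 0.0;               /* no C0 time in this interval */
    return (double)max_non_turbo_ratio * (double)d_aperf / (double)d_mperf;
}

/* Fraction of the interval spent in C0, assuming MPERF ticks at the TSC rate. */
static double c0_fraction(uint64_t d_mperf, uint64_t d_tsc)
{
    if (d_tsc == 0)
        return 0.0;
    return (double)d_mperf / (double)d_tsc;
}
```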
     

When the program starts and the processor signature has been detected from CPUID, one of the 3 algorithms is selected and then launched.

So far, the test results are as follows:

 * +------------------------+--------------------------+--------+-----------+
 * | Intel Processor        | System [Desktop/Laptop]  | Status | Algorithm |
 * +------------------------+--------------------------+--------+-----------+
 * | Core i7-920            | Asus Rampage II Gene [D] |   OK   | Nehalem   |
 * +------------------------+--------------------------+--------+-----------+
 * | Core 2 Duo T5500       | Acer Aspire 5633 [L]     |   OK   | Core 2    |
 * +------------------------+--------------------------+--------+-----------+
 * | Core 2 Quad Q8200      | Unknown [L]              |   OK   | Genuine   |
 * +------------------------+--------------------------+--------+-----------+
 * | Pentium Dual Core 5700 | Acer Desktop [D]         |   KO   | Core 2    |
 * +------------------------+--------------------------+--------+-----------+

The Pentium Dual Core 5700 is detected with a CPUID 'Core2 Yorkfield' signature, but the MSRs IA32_FIXED_CTR1 (0x30a) and IA32_FIXED_CTR2 (0x30b) return a zero value.

Are there really no such fixed counters in this processor ?

Regards

CyrIng

Dear all

I really tried to build a ring 0 driver to port XFreq to 64-bit Windows 7 Pro, BUT because of certificate issues, my template driver runs only if Windows boots with signature verification disabled.

I've explored the solution provided in the Intel pcm source code, but WinRing0x64.dll fails to load whatever the PATH is.

I would appreciate any contribution of a C language Windows driver built with the mingw-w64 gcc compiler, so that I can focus on programming the Intel processors.

CyrIng

Hello Cyring

The PCM instructions for using the WinRing0 driver work for me. You'll need to download the signed driver from the RealTemp website (or whatever the PCM instructions say). Usually I just put the WinRing0*.dll/sys files in the same dir as the .exe file and that works for me.

Pat

And don't forget that on x64 Windows you need to start the program with elevated privileges (right click on the program and do 'run as administrator', or run the program from a cmd.exe window that you started with 'run as admin' privilege).

Hello Patrick,

Looking at the source code, pcm seems to start in winring0/OlsApiInit.h, an OpenLibSys.org driver from Hiyohiyo.

The entry point is InitOpenLibSys(), followed by a series of GetProcAddress() calls to map the addresses of the selected DLL functions. These return zero in my case: this is where I'm stuck.

I assume that WinRing0x64.dll is in the same dir as your .exe file?

You can put in debug statements and see where the load process is failing... maybe the LoadLibrary() is failing, maybe one of the GetProcAddress() calls is failing. Maybe you have another (conflicting) version of WinRing0*.h/dll/sys somewhere else? If you have _PHYSICAL_MEMORY_SUPPORT defined in OlsApiInit.h I don't think the memory routines are actually defined in driver/dll.

Pat

With debug statements, the load process fails in LoadLibrary(_T("WinRing0x64.dll"));

Fyi, below the wrapper code:

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <tchar.h>
#include "OlsApiInit.h"

HMODULE hOpenLibSys = NULL;

BOOL initWinRing0Lib()
{
    const BOOL result = InitOpenLibSys(&hOpenLibSys);
    if (result == FALSE) hOpenLibSys = NULL;
    return result == TRUE;
}

int main()
{
    printf("winring0() = %d\n", initWinRing0Lib());
    return 0;
}

 

and the Linux build command line:

i686-w64-mingw32-gcc -Wall -D_M_X64 -O2 -march=corei7  -c ~/src/Windows/winring0/main.c -o obj/Release/main.o
i686-w64-mingw32-g++  -o bin/Release/winring0.exe obj/Release/main.o  -s  -lkernel32

 

The Windows working directory :

Regards

CyrIng

Can you change the string _T("WinRing0x64.dll") to "WinRing0x64.dll" and see if it works?

Pat

Got it !

LoadLibrary() followed by GetLastError() reported error 193, which HRESULT_FROM_WIN32() translates into 0x800700c1.

My compiler toolchain was 32-bit; it has to be as follows to build a 64-bit Windows app:

x86_64-w64-mingw32-gcc -Wall -O2 -march=corei7  -c ~/src/Windows/winring0/main.c -o obj/Release/main.o
x86_64-w64-mingw32-g++  -o bin/Release/winring0.exe obj/Release/main.o  -s  -lkernel32

The final initialization test code works fine on 64-bit Windows 7 and 2012:

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <tchar.h>
#include "OlsApiInit.h"

HMODULE hOpenLibSys = NULL;

BOOL initWinRing0Lib()
{
    const BOOL result = InitOpenLibSys(&hOpenLibSys);
    if (result == FALSE) hOpenLibSys = NULL;
    return result == TRUE;
}

int main()
{
    if(initWinRing0Lib())
        	DeinitOpenLibSys(&hOpenLibSys);
    return 0;
}
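For reference, error 193 is ERROR_BAD_EXE_FORMAT, which is what a 32-bit process gets when it tries to load a 64-bit DLL. The mapping to 0x800700c1 can be reproduced without the Windows headers; the sketch below mirrors the documented HRESULT_FROM_WIN32 logic, with a hypothetical function name and a locally defined facility constant.

```c
#include <stdint.h>

#define SKETCH_FACILITY_WIN32 7u   /* value of FACILITY_WIN32 */

/* Portable re-statement of HRESULT_FROM_WIN32: zero and values that are
 * already failure HRESULTs (top bit set) pass through unchanged. */
static uint32_t hresult_from_win32(uint32_t err)
{
    if (err == 0 || (err & 0x80000000u))
        return err;
    return (err & 0x0000FFFFu) | (SKETCH_FACILITY_WIN32 << 16) | 0x80000000u;
}
```

Here 193 is 0xc1, the facility contributes 0x00070000 and the severity bit contributes 0x80000000, which together give 0x800700c1.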

Thanks Patrick for your help.

CyrIng

One problem that I have encountered in using MSR IA32_PERF_STATUS to look at the frequency is that you cannot read this counter while in the user-space context.  MSRs can only be read by the kernel, so you have to call the MSR device driver, which may have to set up an inter-processor interrupt to the target processor, and finally the target processor reads its IA32_PERF_STATUS MSR.   In the time between the normal execution of the user code and the execution of the RDMSR command the frequency can change, and I have observed that on our Haswell (Xeon E5 v3) systems it often does change (depending on BIOS options & power-limiting).
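On Linux, the MSR device driver mentioned above is usually exposed by the msr kernel module as /dev/cpu/N/msr, where the file offset selects the register. A minimal sketch of the read path follows; it normally needs root privileges, the function name is hypothetical, and every call goes through the kernel as described, so the value may already be stale when it returns.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR on the given logical cpu through /dev/cpu/N/msr.
 * Returns 0 on success, -1 if the device cannot be opened or read. */
static int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The MSR address is passed as the file offset. */
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);

    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

For example, read_msr(0, 0x198, &v) would fetch IA32_PERF_STATUS on cpu 0, provided the msr module is loaded.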

This is probably one of the reasons why Intel recommends that you obtain the average frequency over intervals from cycle counters, rather than reading the instantaneous multiplier. If CR4.PCE is set, then you can execute the RDPMC instruction in user space and avoid any perturbations to the system (like a transition into the kernel) that might increase the probability of a frequency change.   Either the fixed-function counters or the programmable counters are useful for this purpose.  I prefer the fixed-function counters since these can be enabled once and require no further kernel driver calls to program.

An irritation that I ran across this week is that (at least on my RHEL 6.5 & 6.6 systems), the "perf stat" command disables the fixed-function performance counters after it uses them, rather than checking their previous configuration and returning them to that state after use.   Stupid *&^!@#%^ software.

John D. McCalpin, PhD
"Dr. Bandwidth"

Hello John,

True, I also came to the same conclusion that IA32_PERF_STATUS is not the recommended MSR to measure the frequency.

As you can read in my algorithm, I'm also reading the PMU fixed counters over a sample period; the results are multiplied by the relative frequency ratio read from MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT.

You may get more detail in the source code of the XFreq server and its uCycle() function, where the frequency computation happens.

After more than a year of programming those counters, I find them pretty precise. Programmable counters are next on my schedule, especially to measure the QPI bandwidth, DRAM, RAPL ...

perf stat, like other Linux kernel drivers (such as the watchdog), doesn't bother with the state of the counters when exiting or sharing. That's why I reserved a save area for the registers. See the function Init_MSR_Nehalem().

I would be glad if you could tell me how my program runs on your Xeon E5 v3.

Best regards

CyrIng

64-bit Windows 7 is so frustrating.
There is nothing difficult about programming a ring 0 driver from scratch, but running it without the certification stuff is a no go.
winring0x64.sys is also a painful solution (and a source of virus issues).
I won't ask my users to deactivate the signature check ...
So for now I give up on porting XFreq to Windows.

C7 issue with an i5-3450S Ivy Bridge:

Reading MSR_CORE_C7_RESIDENCY (0x3fe) always returns a zero value, and so does MSR_PKG_C7_RESIDENCY (0x3fa), while C3 and C6 are OK.
Should I drive C7 differently? Does this counter need some kind of additional initialization?

Thanks for any help
CyrIng

Hello cyring,

I don't know why C7 is showing up as zero but it shows as zero on my ivybridge laptop as well. Perhaps there isn't enough difference between C6 and C7 on ivybridge so it isn't used. I do see that 'MSR_PKG_CST_CONFIG_CONTROL: Package C-State Limit' reports C7.

On a more personal note, I will soon be leaving Intel and this is probably my last posting. It has been a pleasure sharing with and learning from the users of this forum. Good luck in the future folks,

Pat

OK, Thank you for the tips.
Wish you good luck and if you can transfer me your black belt -;)

CyrIng

FYI, CoreFreq, a light version of XFreq, is available on GitHub.

CoreFreq is based on a Linux kernel driver which spawns one thread per core.

Each thread's loop reads the MSRs more precisely than is possible in user land.

Regards

CyrIng

Hello,

This is CoreFreq , a Linux Kernel driver which handles the performance monitoring counters, and displays the Core frequencies, c-states, temps & Instructions per second or cycle.

I have a question :

The Base Clock is estimated from an invariant TSC.

How can this frequency be computed with a variant TSC, such as on a Core 2 processor?
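For the invariant-TSC case, the estimate can be sketched as below. It assumes the TSC ticks at MaxNonTurboRatio x BCLK, which is my understanding for Nehalem and later parts and is exactly the assumption that does not carry over to a variant TSC. The function names are hypothetical and this is not the CoreFreq code.

```c
#include <stdint.h>

/* TSC frequency in MHz from a tick delta measured over a known interval. */
static double tsc_mhz(uint64_t d_tsc, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)d_tsc / seconds / 1e6;
}

/* Base clock in MHz, assuming TSC frequency = ratio x BCLK. */
static double bclk_mhz(double tsc_frequency_mhz, unsigned max_non_turbo_ratio)
{
    if (max_non_turbo_ratio == 0)
        return 0.0;
    return tsc_frequency_mhz / (double)max_non_turbo_ratio;
}
```

With 2,660,000,000 ticks in one second and a MaxNonTurboRatio of 20, this gives 2660 MHz for the TSC and 133 MHz for BCLK.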

Regards

CyrIng

This is the CoreFreq algorithm:

CoreFreq, the true processor frequencies through a Linux kernel driver.

 

I want to make it as precise as possible.

- Are the {l,s,m}fence instructions required around atomic bit operations such as LOCK REX BTS, LOCK REX BTR, LOCK REX AND? 

- Both the kernel and user-space threads are bound to the same cpu, and are serialized by a 64-bit integer which is stored at the first address of the shared memory page. 
--> Does this guarantee the atomicity and the ordering of the synchronization test? 
--> Without a lock prefix, I have noticed that the threads seem synchronized; is this expected or a false positive? 

- The clock estimation is based on the RDTSCP instruction, through a ten-iteration loop bound to the BP. There is a small gap, but the quotient is constant; the remainder, however, drifts by a few decimals. 
--> How can I make it more accurate? 

- The kernel threads, one per cpu, sleep 1000 ms before reading the fixed counter MSRs. 
--> Should the operations that follow, such as computing the deltas and reading the MSRs and temperatures, be calibrated by estimating their elapsed time? 
--> Should this last value be subtracted from the 1000 ms waiting period? 
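One common way to approach the last two questions, offered here only as a user-space sketch and not as the CoreFreq design, is to stop assuming that the interval is exactly 1000 ms. The deadline of each sample is advanced by a fixed period from the previous deadline, so processing time does not accumulate as drift, and each counter delta is scaled by the elapsed time actually measured between two reads. Both helper names are hypothetical.

```c
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/* Advance an absolute deadline by one sampling period. Sleeping until this
 * deadline, rather than for a fixed duration, absorbs the processing time. */
static struct timespec deadline_add_ns(struct timespec t, long period_ns)
{
    t.tv_sec  += period_ns / SKETCH_NSEC_PER_SEC;
    t.tv_nsec += period_ns % SKETCH_NSEC_PER_SEC;
    if (t.tv_nsec >= SKETCH_NSEC_PER_SEC) {
        t.tv_nsec -= SKETCH_NSEC_PER_SEC;
        t.tv_sec  += 1;
    }
    return t;
}

/* Scale a counter delta by the measured interval instead of the nominal one. */
static double events_per_second(uint64_t delta, uint64_t elapsed_ns)
{
    if (elapsed_ns == 0)
        return 0.0;
    return (double)delta * 1e9 / (double)elapsed_ns;
}
```

Since the TSC is read alongside the other counters anyway, its delta can serve as the measured interval, in which case the length of the sleep matters much less.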

Thanks for any help.

CyrIng

Attachment: CoreFreq-algorithm.png (238.51 KB)
