Latency Issues when altering C-state of a Sandy-Bridge Processor


Hi All,

I am studying how to control the idle states (C0 to C6/C7) of a Sandy Bridge processor, specifically an i7-2677M.

I have observed some issues in the reported application latency, and thus power consumption, when reading the built-in hardware counters with PAPI-C and manually setting C-state preferences by writing to /dev/cpu_dma_latency.

Based on the Intel documentation and the C-state definitions, placing cores in deeper sleep states should incur higher wake-up latency.

However, after manually setting the preference to keep all cores in C0 (active state) and running a simple micro-benchmark (DGEMM), I get higher run times in this configuration than when all cores start in the C6/C7 deep sleep state, or when the preference is left at its default (ondemand).

Experimental Setup: 1 thread, DGEMM, Ubuntu 12.04

1. Preference set to C6/C7 for all cores: runtime = 4.84 sec

2. Preference set to C0 (active mode) for all cores: runtime = 7.70 sec

It doesn't make sense to me that having all cores initially at C0 performs significantly worse than having them all at C6/C7. Could anyone explain why this may be the case?

Best Regards,

Neil


Hello Neil,

Sounds like something is not accomplishing what you are trying to accomplish.

It seems like, if you are keeping the cores at C0, then you should get the same performance as booting with the 'idle=poll' boot option.

You can monitor how much time the cores and the whole processor spend in each C-state with the C-state residency MSRs 0x3f8 - 0x3fe (such as MSR_PKG_C3_RESIDENCY).

I guess it is not clear to me what you are trying to do (nor why).

Pat

I suppose that Neil is trying to understand why cores at C0 state performed worse than cores at C6/7 state.

And I would say that he is probably not accomplishing what he thinks he is accomplishing.

And I provided some simple checks that he can do to see if what he thinks he is doing is actually happening.

The question of why he is distracting folks from trying to accomplish real work is still unanswered. If he is finding issues with the way things work when they are properly set up, that is one thing. If he is just messing with parameters and then wondering why things don't work as well, this is a study that could suck up everyone's time and accomplish nothing. Sorry... not a lot of patience today, too much work to do.

Thank you both for your prompt responses.

I am trying to capture runtime versus energy usage for a given task while alternating between different performance states (P0 to Pn) and idle states (C0 to C6/C7).

My first experiment is to investigate the wake-up overhead when launching an application under two cases:

1) having a core/processor initially in C0 (should incur no wake-up overhead)

2) having them initially in C6/C7, then having the core (process) go to C0 when the application is launched.

I will try reading the MSRs you pointed out, Pat.

In the meantime, I suppose I can provide my experimental setup.

I am running in a Linux environment and using the "cpupower" tools to monitor C-state residencies. With these, I was able to verify the residencies for both of the cases above. When I set the preference for all cores to C0, I get an average of 99% in the "POLL" state. When I set the preference to C6/C7, I get 98% residency.

Then when running the tests above, I am getting those runtime values.

Thanks,

Neil

As another note, I found code that can manually set the idle state (in privileged mode). As in my previous post, I was able to verify the C-state residencies.

The code is as follows:

-------- BEGIN

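#include <fcntl.h>   /* open */
#include <stdint.h>  /* int32_t */
#include <stdio.h>   /* fprintf, printf, perror */
#include <stdlib.h>  /* atoi */
#include <unistd.h>  /* write, pause */

int main(int argc, char **argv) {
  int32_t l;
  int fd;
  if (argc != 2) {
    fprintf(stderr, "Usage: %s <latency in us>\n", argv[0]);
    return 2;
  }
  l = atoi(argv[1]);
  printf("setting latency to %d us\n", l);
  /* The latency request stays active only while this descriptor is open. */
  fd = open("/dev/cpu_dma_latency", O_WRONLY);
  if (fd < 0) {
    perror("open /dev/cpu_dma_latency");
    return 1;
  }
  /* The device expects the target latency as a raw 32-bit integer. */
  if (write(fd, &l, sizeof(l)) != sizeof(l)) {
    perror("write to /dev/cpu_dma_latency");
    return 1;
  }
  /* Block forever so the descriptor, and with it the request, is kept. */
  while (1) pause();
}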

------ END

After compiling it, I run it in privileged mode:

$./cpu_latency 109

The above command forces all cores to be at C6/C7, since 109 us is the wake-up latency of those idle states.

When I run:

$./cpu_latency 0

This forces all cores to be in C0.

For each case above, I launch my micro-benchmark and record the runtime (using PAPI-C). I observe longer runtimes when all cores are initially in C0.

Hopefully this clarifies things. Thanks in advance.

Best,

Neil

Hello Neil,

Have you actually run with 'idle=poll', or are you just taking the article's word that setting cpu_dma_latency = 0 accomplishes the same thing?

Pat

Hello Pat,

I have not yet tried booting with the "idle=poll" option. Where can this be set?

Best,

Neil

On Linux systems, the "idle=poll" option is typically added to a command line in /boot/grub/menu.lst.

If I recall correctly, setting the /dev/cpu_dma_latency to zero may not be enough to disable the C1 state -- you should check the p-state occupancy MSRs to see what you are actually getting.

The single-threaded test case you are running is not the right kind of workload to look for C-state transition latencies.  The O/S will run your one thread on one core and leave the other core idle.   O/S services will either interrupt your process for a short period or will wake up the other core, but the overhead will be negligible in either case.

To look for C-state transition latency penalties, you need either a workload that is receiving bursts of external interrupts that must be processed (e.g., a web server), or a workload that creates and destroys threads with very short lifetimes -- but with enough time between thread destruction and thread creation that the hardware will put the idle cores into deep C-states.

I would guess that you are seeing the combined effects of C-states and P-states in your tests.  The Intel i7-2677M has a base frequency of 1.8 GHz and can run one core at up to 2.9 GHz, or both cores at up to 2.6 GHz.   The actual turbo boost depends on the temperature, current, and power consumption of the system under test.   For a compute-intensive single-threaded task like DGEMM, one would expect a higher turbo boost for the active core if the other core is in a deep C-state. Your performance ratio of 7.70 seconds / 4.84 seconds (=1.591) is very close to the maximum turbo boost ratio of 2.9 GHz / 1.8 GHz (=1.611).

If you run your DGEMM test under "perf stat", it should report the average frequency during the run.

John D. McCalpin, PhD
"Dr. Bandwidth"

You can use the fixed-function counters to monitor the "average" frequency of your system.  It's very hard to talk about IPC when you're boosting, because what's a clock worth? It varies depending on the frequency.  When boosting I think IPC is somewhat worthless; you need to focus on IPC when you're not boosting: lessen power and then, from the saved cac, boost as much as you can.  Monitoring execution time while you boost is prone to headaches.  So I recommend you use the fixed-function counters and focus on the "reference clock", which doesn't change at any boost speed. My 2 cents.

Perfwise

Will IPC vary according to the boosted speed and depend on total heat dissipation?

Hi John,

Can a processor working at a boosted frequency (presumably at the maximum clock rate) sustain the same IPC as a processor running at a lower speed? I mean, is there a direct dependency on total heat dissipation which could in turn lower the processor's IPC?

Thanks in advance

I have not heard of any throttling mechanisms that would be applied when running at a boosted frequency in Turbo mode (see note).  If the temperature or power or current gets too high, the power management unit will drop the processor to the non-boosted frequency.   If the temperature is still too high there are a number of throttling options available.  The most common is to reduce frequency (even if you have requested that the processor stay in the maximum p-state), but there are also throttles related to instruction fetch/issue/execution/completion that can be used to reduce the dynamic power consumption of an operating core.  As far as I know these throttling mechanisms are all "emergency" measures and are only applied if dropping the frequency is not enough to control the temperature/power/current. 

Since memory latency is only weakly dependent on CPU frequency, the IPC will typically be lower when running at higher (including Turbo) frequencies -- the stalls make up a higher fraction of the total time when the CPU is running faster.

This material tends to be fairly weakly documented in public material, but the programmable Running Average Power Limit (RAPL) functionality in the Sandy Bridge and newer processors provides some visibility into the behavior and some control over the policies.    The Xeon E5-2600 Uncore Guide shows how to use the counters in the uncore Power Management Unit to see how the processor is responding to power and temperature issues.  The algorithms used for managing power are not published, but these counters make the behavior visible.  Of course you might need a fairly exotic setup to create controlled experiments.   When I worked at that other company that makes x86-64 processors, we had labs with equipment that allowed precise control over processor voltage and temperature that we used to test & validate the power-throttling features.

Note: Some of Intel's documentation seems (to me) to suggest that "opportunistic performance boost" and "turbo" frequency boost might be different things.  The "opportunistic" and "turbo" boosts look to be controlled in different places, so there might be mechanisms other than CPU frequency at work in "opportunistic" boosting.   This is just speculation, but there are certainly many places in the microarchitecture where one might trade performance for power in a dynamic way without changing the frequency.  Has Intel done any of these?  Dunno...

John D. McCalpin, PhD
"Dr. Bandwidth"

John, thank you very much for the detailed explanation.

>>>but there are also throttles related to instruction fetch/issue/execution/completion that can be used to reduce the dynamic power consumption of an operating core.  >>>

I wonder if processor designers are able to measure the energy consumption and heat dissipation of a single instruction's execution, so that it could be used directly by a throttling mechanism at very high (boosted) frequencies. I am thinking of energy/heat per instruction stored in tables.
