Power Management: So what is this policy thing?

Unlike a lot of previous recent blogs, this series is about power management in general. At the very end of the series, I’ll write specifically about the Intel® Xeon Phi™ coprocessor.

I have talked incessantly over the years about power states (e.g. P-states and C-states), and how the processor transitions from one state to another. For a list of previous blogs in this series, and well as other related blogs on power and power management, see the article at [List0]. But I have left out an important component of power management, namely the policy.

A policy is a collection of rules used for guidance, for example, a security policy. A power management policy contains the rules / logic that guide power management state transitions. The implementation of that policy is done by the power management (PM) manager or module.

One way to divide power management functions is between 5 domains: hardware, BIOS or nearly BIOS level drivers, kernel level drivers (ring 0), system power management controls (ring 3), and user power management controls (ring 3). This arrangement can differ depending upon the OS and technology being used (e.g. mobile vs. server). See Figure PWRMNGR.

Latencies drive this distribution of power management functionality. Power Management can only work if its impact on executing applications is trivial. Latency is not so important for transitions into an idle state – the processor is not doing anything or it would not be transitioning into the idle state in the first place. In contrast, transitions out of an idle state and into the run state must take place as quickly as possible. So the designers of the power management infrastructure distribute its functionality across the OS, hardware, and user levels. The lowest layers must be simple and react as quickly as possible when transitioning from the idle state to the run state (e.g. from C1 to C0). As an example, transitions from C1 to C0 are less than a microsecond for the Intel® Xeon Phi™ coprocessor. As we look at higher layers of the power management stack, the transitions they govern are more latency tolerant and can involve more complex decision making logic.

As an interesting aside, the entire power management stack does not have to be running on the system being managed. The current generation (as of 2014) Intel® Xeon Phi™ coprocessor necessarily has part of the power management logic implemented on the host. I will discuss this further below. (This will likely change in future generations of the coprocessor.)

Power Management Stack


Figure PWRMNGR. The power management module and the power management policy


In the Hardware and BIOS: At these very lowest levels, power management is limited to mapping power management instructions to the underlying hardware, such as calls to invoke different P and C-states. See Alex Hung’s power management blogs for a good description of the BIOS mapping of HW power management functionality to ACPI definitions in reference section below[. Given its simplicity, this level introduces no perceptible latency to an executing user application.

In the Kernel (ring 0): Ultimately, power management decisions involve transitions between run and different idle states, and such decisions introduce latencies. For example, if a processor is in C3 and an interrupt occurs, it must transition from C3 to C0; run the interrupt routine, and then transition back to C3. But as in all things, it is not this simple. These transitions also involve software logic and decision making, such as determining whether the processor should instead use a higher idle state with less latency such as C1. It does not make any sense to have this decision making logic at the BIOS level as many repeated transitions can result in non-trivial cumulative latency (as well as violating good programming practice).

Typical kernel level power management involves functionality where latency is critical but involves some computation and decision making. This decision making takes place in ring 0 (kernel) which can avoid the latencies inherent in ring 3 context switches and other OS overhead. At this level, statistics are also collected to help the power management software better predict transitions, such as when future interrupts will occur.

In the OS (ring 3): Power management functionality at this level takes more time and becomes involved only when necessary or when minimizing latency is not as critical. An example might be adjusting policy based upon collected interrupt frequency and duration statistics. Another example might be the decisions involving P-state transitions. Such transitions do not involve any state saving and restoration. As such, its decision making can take place at a higher level and at a more leisurely pace in the power management stack.

In User Space (ring 3): This is where policy is set and initialized. At this high level, latency is much less of an issue with some rare exceptions.

One such rare exception is seen in the Intel® Xeon Phi™ coprocessor where the host necessarily becomes involved in some power state transitions. This is because when the coprocessor is in a package C-state, it is all but powered down; no power management software can run on the coprocessor when it is in a package idle state (PC-3 and PC-6). The host must wake the coprocessor up, essentially performing a fast boot up. This means that part of the coprocessor’s power management stack is executing on the host (i.e. remotely). As such, transitions from the deepest package idle state (PC6) to C0 can get close to 500 milliseconds+. See my article on power states referenced below.


In the next blog, we will look briefly at different power management policies.



NOTE: As previously in my blogs, any illustrations can be blamed solely on me as no copyright has been infringed or artistic ability shown.

[List0] Kidd, Taylor (10/23/2013), List of Useful Power and Power Management Articles, Blogs and References, http://software.intel.com/en-us/articles/list-of-useful-power-and-power-management-articles-blogs-and-references. Retrieved February 21st, 2014.

+ There are state diagrams that detail these changes and the conditions for them. Introducing these diagrams, as well as the kernel level power management APIs, is at a level of depth that is inappropriate for this article. If you have an unquenchable desire to know, they can often be found in processor data sheets or software developer’s guides.

For more complete information about compiler optimizations, see our Optimization Notice.