By Taylor Kidd, Intel Corporation
This article is essentially a collection of blogs I wrote on the same subject. The differences are simply a degree of formalism.
TABLE OF CONTENT:
Here is a rough outline for this article:
- Introduction and motivation
- Power management infrastructure
- Types of power management policies
I am only writing about processor (including the Intel® Xeon Phi™ coprocessor) power management, meaning P, C and T states. D (device) and S (standby & hibernate) states are a different topic –though no less important in more general purpose systems. See the APCI specification in the reference below [LIST0].
This series is about power management policy. The policy defines how a system, e.g. the OS, implements power management. Hopefully, this is going to be a relatively short series. The truth is that I can never tell until I have written at least half of it.
The primary reason I am writing this series is because I made a minor error about the power management policy implemented by the Intel® Manycore Platform Software Stack (Intel® MPSS). This in turn pointed out that I had not yet written about the difference between power management policy and power management implementation. This is not simply an abstract discussion as the coprocessor has two policies available to it. I will discuss both later.
The secondary reason is that it completes the power management picture as policies are not monolithic and have changed over the years. The policy implemented by a system depends upon both the capabilities of the hardware (e.g. support of C-states) and the needs of the environment (e.g. HPC versus social media servers).
The next chapter in this series looks at power management from a system’s perspective.
I have talked incessantly over the years about power states (e.g. P-states and C-states), and how the processor transitions from one state to another. For a list of previous blogs in this series, and well as other related blogs on power and power management, see the article at [List0]. In all this writing, I have left out an important component of power management, namely the policy itself.
A policy is a collection of rules used for guidance, for example, a security policy. A power management policy contains the rules / logic that guide power management state transitions. The implementation of that policy is done by the power management (PM) manager or module.
One way to divide power management functions is between 5 domains: hardware, BIOS or nearly BIOS level drivers, kernel level drivers (ring 0), system power management controls (ring 3), and user power management controls (ring 3). This arrangement can differ depending upon the OS and technology being used (e.g. mobile vs. server). See Figure PWRMNGR.
Latencies drive this distribution of power management functionality. Power management can only work if its impact on executing applications is trivial. Latency is not so important for transitions into an idle state – the processor is not doing anything or it would not be transitioning into the idle state in the first place. In contrast, transitions out of an idle state and into the run state must take place as quickly as possible. So the designers of the power management infrastructure distribute its functionality across the OS, hardware, and user levels. The lowest layers must be simple and react as quickly as possible when transitioning from the idle state to the run state (e.g. from C1 to C0). As an example, transitions from C1 to C0 are less than a microsecond for the Intel Xeon Phi coprocessor. As we look at higher layers of the power management stack, the transitions they govern are more latency tolerant and can involve more complex decision-making logic.
As an interesting aside, the entire power management stack does not have to be running on the system being managed. The current generation (as of 2014) Intel® Xeon Phi™ coprocessor necessarily has part of the power management logic implemented on the host. I will discuss this further below. (This will likely change in future generations of the coprocessor.)
Figure PWRMNGR. The power management module and the power management policy
In the hardware and BIOS: At these very lowest levels, power management is limited to mapping power management instructions to the underlying hardware, such as calls to invoke different P and C-states. See Alex Hung’s power management blogs for a good description of the BIOS mapping of HW power management functionality to ACPI definitions in reference section below+. Given its simplicity, this level introduces no perceptible latency to an executing user application.
In the Kernel (ring 0): Ultimately, power management decisions involve transitions between run and different idle states, and such decisions introduce latencies. For example, if a processor is in C3 and an interrupt occurs, it must transition from C3 to C0; run the interrupt routine, and then transition back to C3. But as in all things, it is not this simple. These transitions also involve software logic and decision making, such as determining whether the processor should instead use a higher idle state with less latency such as C1. It does not make any sense to have this decision-making logic at the BIOS level as many repeated transitions can result in non-trivial cumulative latency (as well as violating good programming practice).
Typical kernel level power management involves functionality where latency is critical but involves some computation and decision making. This decision making takes place in ring 0 (kernel) which can avoid the latencies inherent in ring 3 context switches and other OS overhead. At this level, statistics are also collected to help the power management software better predict transitions, such as when future interrupts will occur.
In the OS (ring 3): Power management functionality at this level takes more time and becomes involved only when necessary or when minimizing latency is not as critical. An example might be adjusting policy based upon collected interrupt frequency and duration statistics. Another example might be the decisions involving P-state transitions. Such transitions do not involve any state saving and restoration. As such, its decision making can take place at a higher level and at a more leisurely pace in the power management stack.
In User Space (ring 3): This is where policy is set and initialized. At this high level, latency is much less of an issue with some rare exceptions.
One such rare exception is seen in the Intel® Xeon Phi™ coprocessor where the host necessarily becomes involved in some power state transitions. This is because when the coprocessor is in a package C-state, it is all but powered down; no power management software can run on the coprocessor when it is in a package idle state (PC-3 and PC-6). The host must wake the coprocessor up, essentially performing a fast boot up. This means that part of the coprocessor’s power management stack is executing on the host (i.e. remotely). As such, transitions from the deepest package idle state (PC6) to C0 can get close to 500 milliseconds+. See my article on power states referenced below.
In the next section, we look briefly at different power management policies.
Power management policy has evolved over the years. The earliest policies consisted of little more than some critical temperature sensors and an interrupt routine that attempted (often unsuccessfully) to cleanly shut down the system before something really bad happened. Today’s sophisticated power management policies do such things as progressively shutting down parts of processor circuitry during idle with almost no impact upon performance, rapidly alternating between idle and active states, reducing processor frequencies, exploiting thermal lows to temporarily overclock the processor, and a host of other things.
EXAMPLE POLICY #1: This is one of the simplest policies. It was used in a real-time system I worked on so long ago that its existence has faded from human memory. A few well-placed temperature sensors and some hardware logic were placed on the processor’s boards. When the sensors reached certain thresholds, the hardware logic generated a high priority hardware interrupt. The interrupt routine did its best to save system state and shut down the power before anything really unpleasant occurred. To say it a different way, the policy was to save system state and cleanly shut down the system if the temperature of the hardware exceeded a certain preset threshold. I recall that it was successful only 50% of the time.
EXAMPLE POLICY #2: I wrote briefly about this policy in my previous blog on T-states; see reference [TSTATES] below. This policy uses a technique that is a precursor to P-states to give the processor a chance to cool while not interfering with the execution of most applications. When the temperature of the processor exceeds a certain threshold, the processor’s clock will start and stop with a certain duty cycle. The periods where the clock stops (i.e. is gated) allow the processor to cool. Though this slows down a running application, it ceases running when the clock is stopped, the impact for most applications is minimal outside of taking longer to execute. The exception is when the application depends upon time sensitive external events, such as externally triggered interrupts.
EXAMPLE POLICY #3: P-states. I’ve written about this quite a bit. See Power Management States: P-States, C-States, and Package C-States, reference [CPSTATES] below. Like T-states, it allows the processor to cool by slowing down applications. Unlike T-states, it is far less disruptive as the chip temporarily operates as if it has a slower oscillator, something that the design of most general purpose digital devices can accommodate. Check out the Intel® Xeon Phi™ Coprocessor System Software Developers Guide [SDG], June 2013, Figure 3-2 “Power Reduction Flow”, for an example of P-state power transition logic. As the processor is always running, slowing down the clock doesn’t affect the processing of most external events/interrupts.
EXAMPLE POLICY #4: C-States. I’ve talked about this so much that even I’m a little bored. Saying anything else will serve no purpose except to put the reader to sleep. See reference [CPSTATES] below.
EXAMPLE POLICY #5: Remote power management. In the case of the Intel® Xeon Phi™ coprocessor, part of the power management is remote. See my discussion in Power Management States: P-States, C-States, and Package C-States, reference [REMOTE]. The processor shuts down to such an extent that it is no longer capable of responding to waking events. Shutting down provides you with the ultimate in power savings as your usage is, for all intents and purposes, 0 Watts. Unfortunately, the disadvantage is significant; once you remove all the power, you can no longer respond to waking events, say from the PCIe bus. To say it another way, you cannot leave the “off” state without some external intervention, a.k.a., the host.
You can see the advantage of this in that power usage can theoretically be zero Watts. This is quite a power savings. Unfortunately, it comes at a cost, namely that this deepest power state will last for a very long time, actually forever, unless someone flips the power switch of the processor back on.
What’s up next? We’ll wrap up the general part of this discussion with a summary.
How about the future? Have we reached the pinnacle of power management?
Hardware and software are still evolving to be even more energy efficient. An example is the “tickless” OS. In the old days, OSs had to periodically wake up the processor (i.e., perform an interrupt) around a hundred times a second and check to see if anything needed to be done, such as task switching or handling incoming data from some device. OSs haven’t needed to do this for decades, but this legacy periodic “tick” has been part of every OS until the last few years. Every wake-up meant the processor was entering a runtime state, which can potentially prevent it from dropping into the lowest power C-states. The impact is that energy is unnecessarily wasted due to a requirement that no longer exists. Thankfully, most common OSs are now tickless to one extent or another.
As devices and application domains evolve, the pressure to conserve even more energy is very strong, not only for mobile devices but for huge data centers. Mobile devices have the effrontery to get smaller and smaller; data centers need to service more and more people with more and more data; applications keep putting greater demands on processing power; and consumers demand longer battery life.
These trends have resulted in a nearly 3000-fold increase in the performance / power ratio++ over the last 30 to 35 years+++, and the evolution of power management hasn’t stopped. Given the strong driving forces of data center and hand-held devices, I can imagine that tomorrow’s power management will eke out even more savings as well as minimize some of the negative situations that can prevent the effective adoption of power management in certain corner cases, e.g., cases where OS jitter can’t be tolerated and precise periodic interrupts are needed.
Can you think of anything that the processor and software can do to save even more energy (using existing hardware)? Does the processor or OS do something that isn’t really necessary anymore? Does technology have a new, more power-efficient feature that existing software still doesn’t exploit? Are there power hotspots that should be looked at? Are there areas where the processor could save energy, but the cost trade-off (e.g., latency or reliability) is too great? Can the cost trade-off be mitigated allowing the processor to save more energy? These are some of the questions that very creative architects and engineers are asking in their pursuit of improving the performance / power ratio even further.
+ There are state diagrams that detail these changes and the conditions for them. Introducing these diagrams, as well as the kernel level power management APIs, is at a level of depth that is inappropriate for this article. If you have an unquenchable desire to know, they can often be found in processor data sheets or software developer’s guides.
++You cannot simply look at energy usage as it is a moving target: scales get smaller, silicon area gets bigger, new materials and gate technologies appear, etc.
+++ This estimate is derived for Intel general-purpose processors only starting with the 80286. It is a very rough ballpark estimate obtained from general Internet sources.
[LIST0] Kidd, Taylor (10/23/2013), List of Useful Power and Power Management Articles, Blogs and References, http://software.intel.com/en-us/articles/list-of-useful-power-and-power-management-articles-blogs-and-references. Retrieved February 21st, 2014.
[TSTATES] Kidd, Taylor (2013) “C-States, P-States, where the heck are those T-States?” https://software.intel.com/en-us/blogs/2013/10/15/c-states-p-states-where-the-heck-are-those-t-states. (Downloaded May 14th, 2014)
[CPSTATES] Kidd, Taylor (2014) “Power Management States: P-States, C-States, and Package C-States” https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. (Downloaded May 14th, 2014)
[SDG] “Intel® Xeon Phi™ Coprocessor System Software Developers Guide [SDG], June 2013,” https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide. (Downloaded May 14th, 2014)
[REMOTE] Kidd, Taylor (2014), Power Management States: P-States, C-States, and Package C-States, https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states. (Downloaded May 14th, 2014)
NOTE: As previously in my blogs, any illustrations can be blamed solely on me as no copyright has been infringed or artistic ability shown.