This is an aggregate of a series of blogs I wrote on power management states. That series is part of an even larger collection of blogs addressing all sorts of power management topics including power management states (this one), turbo or hyper-threading power states, power management configuration, and power management policy. With one exception, the discussion in these blogs should be generally useful to everyone, even though they concentrate on the Intel® Xeon Phi™ coprocessor. The exception is the series on configuration, which by its nature, is a more platform dependent topic and focused on the Intel® Xeon Phi™ coprocessor and the Intel® Manycore Platform Software Stack (Intel® MPSS). In addition to this collection of power management blogs, there are two other companion collections: a series on different techniques for measuring power performance1, and another loose set of my earlier blogs that cover a variety of topics such as where C*V2*f comes from.
The article you are currently reading was originally published as a series of five blogs:
In addition to these, there is a bonus blog in an appendix:
At the Intel® Developer Zone, you can find the individual blogs listed in yet another article, List of Useful Power and Power Management Articles, Blogs and References.
So exactly which power states exist on the Intel Xeon Phi coprocessor? What happens in each of the power states?
This is not going to be a high powered, in depth, technical treatise on power management (PM). If you want that, I suggest you read the Intel Xeon Phi Coprocessor Software Developer's Guide (SDG)2. As a word of warning, when the Power Management section of the SDG refers to writers of SW (i.e. programmers), whether explicitly or implicitly, they do not refer to you or me. Its target audience consists of those poor lost souls that design operating systems (OSs) and drivers. One of the objectives of the series of blogs you are now reading is to look at PM from the perspective of an application developer, i.e. you or I, and not a writer of operating systems or drivers.
I am also not going to talk to what C, P and PC-states are. If you want an introduction to these concepts before digging into this blog series, I recommend an earlier blog series I wrote on that topic listed in the Endnote section.
Briefly, the coprocessor has package-based P-states, core-based C-states (which are sometimes referred to as CC-states) and package-based C-states (PC-states). It also has the capability to operate in Turbo mode3. There are no per-core based P-states.
The host and coprocessor share responsibility for power management on the coprocessor. For some PM activities, the coprocessor operates alone. For others, the host component acts as a gate keeper, sometimes controlling PM, and at other times overriding actions taken by the coprocessor's PM.
In the forthcoming series, I discuss package-based P-states, including Turbo mode4, core-based C-states and package-based PC-states. I will also discuss what control you have as an application developer over coprocessor PM.
I cannot guarantee that all the Intel Xeon Phi coprocessor SKUs (i.e. coprocessor types) expose these power management features.
Right up front, I am going to tell you that P-states are irrelevant, meaning they will not impact the performance of your HPC application. Nevertheless, they are important to your application in a more roundabout way. Since most of you belong to a group of untrusting and always questioning skeptics (i.e. engineers and scientists), I am going to go through the unnecessary exercise of justifying my claim.
The P-states are voltage-frequency pairs that set the speed and power consumption of the coprocessor. When the operating voltage of the processor is lower, so is the power consumption. (One of my earlier blogs explains why at a high though technical level.) Since the frequency is lowered in tandem with the voltage, the lower frequency results in slower computation. I can see the thought bubble appearing over your head: "Under what situations would I ever want to enable P-states and, so, possibly reduce the performance of my HPC app?" Having P-states is less important in the HPC domain than in less computationally intense environments, such as client-based computing and data-bound servers. But even in the coprocessor's HPC environment, longer quiescent states are common between large computational tasks. For example, if you are using the offload model, the coprocessor is likely unused during the periods between offloads. Also, a native application executing on the coprocessor will often be in a quiescent phase for many reasons, such as their having to wait for the next chunk of data to process.
The P-states the coprocessor Power Management (PM) supports is P0 through Pn. The number of P-states a given coprocessor SKU (i.e. type) supports will vary, but there are always at least two. Similarly, some SKUs will support turbo P-states. The coprocessor's PM SW handles the transition from one P-state to another. The host's PM SW has little if any involvement.
A natural thing to wonder is how much this is going to impact performance; we do happen to be working in an HPC environment. The simple answer is that there is, in any practical sense, no impact on HPC performance. I am sure that, at this point, you are asking yourself a series of important questions:
Let us first consider question two. I understand that power is not a direct priority for you as an application writer. Even so, it does impact your application's performance indirectly. More power consumption has to do with periferals such as higher facility costs in terms of electricity usage due to greater air-conditioning demands, facility space needs, etc.
This also impacts your application in a very important way, though how is not initially obvious. If the facility can get power consumption down while not losing performance, this means that it can pack more processors in the same amount of space all while using the same power budget. And this is a good thing for you as a programmer or scientist. When you get right down to it, lower power requirements mean that you can have more processors running in a smaller space, which means that you, as an application designer or scientist, can run not only bigger problems (more cores) but run them faster (less communication latency between those more cores).
Let us go back to P-states. P-states will impact performance in a theoretical sense, but not in a way that is relevant to an HPC application. How is this possible? It is because of how transitions between the P-states happen. It all has to do with processor utilization. The PM SW periodically monitors the processor utilization. If that utilization is less than a certain threshold, it increases the P-state, that is, it enters the next higher power efficiency state. The word term in the previous sentence is "utilization". When executing your computationally intensive HPC task on the coprocessor, what do you want your utilization to be? Ideally, you want it to be as close to 100% as you can get. Given this maximal utilization, what do you imagine is the P-state your app is executing in? It is P0, the fastest P-state (ignoring turbo mode). Ergo, the more energy efficient P-states are irrelevant to your application since there is almost never a situation where a processor supporting your well-tuned HPC app will enter one of them.
In summary, the "HPC" portions of an HPC application will run near 100% utilization. Near 100% utilization usually guarantees always using the fastest (non-turbo) P-state, P0. Ergo, P-states have essentially no impact upon the performance of an HPC application.
How do I get my application to run in one of those turbo modes? You cannot as it is just too dangerous. It is too easy to make a minor mistake resulting in overheating and damaging the coprocessor. If your processor supports turbo, leave its management to the OS.
C-states are idle power saving states, in contrast to P-states, which are execution power saving states. During a P-state, the processor is still executing instructions, whereas during a C-state (other than C0), the processor is idle, meaning that nothing is executing. To make a quick analogy, a processor lying idle is like a house with all the lights on when no one is at home. If no one is at home, meaning the house is idle, why leave the lights on? The same applies to a processor. If no one is using it, why keep the unused circuits powered up and consuming energy? Shut them down and save energy.
C0 is the "null" idle power state, meaning it is the non-idle state when the core is actually executing and not idle.
Core Versus Package Idle States
The coprocessor has up to 60 plus cores in one package. Core idle power states (C-states) are per core, meaning that one of those 60 plus cores can be in C0, i.e. it is executing and not idle, while the one right next door is in a deep power conservation state of C6. In contrast, PC-states are Package idle states which are idle power conservation states for the entire package, meaning all 60 plus cores and supporting circuitry on the silicon. As you can guess, to drop the package into a PC-6 state, all the cores must also be in a C6 state. Why? Since the package has functionality that supports all the cores, to "turn off" some package circuitry impacts all of them.
Figure 1. Dropping a core into a core-C1 state
Each core has three idle states, C0, C1 and C6.
C0 to Core-C1 Transition: Look at figure one. C1 happens when all four hardware (HW) threads supported by a core have executed a HALT instruction. At this point, let us now think of each of the four HW threads as the Operating System (OS) perceives them, namely as four separate CPUs (CPU 0 through 3). Step one: the first three CPUs belonging to that core execute their HALT instruction. Step two: that last CPU, CPU-0 in the illustration, attempts to execute its HALT instruction. Step three: it interrupts to an idle residency data collection routine. This routine collects, you guessed it, idle residency data and stores that data in a data structure accessible to the OS. CPU 0 then HALTs. Step four: at this point, all the CPUs are halted and the core enters a core-C1 state. In the core-C1 state, the core (and its "CPUs") is clock gated.5
Figure 2. Is it worth entering C6: Is the next interrupted far enough out?
Figure 3. Is it worth entering C6: Is the estimated idle time high enough?
After entering core-C1: Now that the core is in C1, the coprocessor's Power Management routine comes into play. It needs to figure out whether it is worthwhile to shut the core down further and drop it into a core-C6 state. In a core-C6 state, further parts of the core are shut down and power gated. Remember that the coprocessor's Power Management SW executes on the OS core, typically core 0, and is not affected by the shutdown of other cores.
What type of decisions does the coprocessor's Power Management have to make? There are two primary ones as we discussed in the last chapter: Question one: Will there be a net power savings? Question two: Will any restart latency adversely affect the performance of the processor or of applications executing on the processor? Those decisions correspond to two major scenarios and are shown in figures two and three above. Scenario one is where the coprocessor PM looks at how far away is the next scheduled or expected interrupt. If that interrupt is soon enough, it may not be worth shutting down the core further and suffering the added latency caused by bringing the core back up to C0. As is the case in life, the processor can never get anything for free. The price of dropping into a deeper C state is an added latency resulting from bringing the core or package back up to the non-idle state. Scenario two is where the coprocessor's Power Management looks at the history of core activity (meaning its HW threads) and figures out whether the execution (C0) and idle (C1) patterns of the core make core-C6 power savings worthwhile.
If the answers to both of these questions are "yes", then the core drops down into a core-C6 state.
After entering core-C6: The processor next decides if it can drop into the package idle states. I will cover that discussion in my next blog in this series.
Upon reading the SDG (Intel Xeon Phi Coprocessor Software Developer's Guide), you'll find a variety of confusing names and acronyms. Here's my decoder ring:
Package Auto C36: also referred to as Auto-C3, AutoC3, PC3, C3, Auto-PC3 and Package C3
Package Deep-C3: also referred to as PC3, DeepC3, DeeperC3, Deep PC3, and Package C3
Package C6: Also referred to as PC6 and C6 and Package C6.
Before we dig deep into package C-states, I want to give you some background about circuitry on a modern Intel® processor. A natural way of dividing up the circuitry of a processor is that composing the cores —basically that supporting the pipeline, ALUs, registers, cache, etc.— and everything else (supporting circuitry). It turns out that "everything else" can be further divided into that support circuitry not directly related to performance (e.g. PCI Express* interfacing), and that which is (e.g. the bus connecting cores). Intel calls support circuitry that directly impacts the performance of an optimized application the "Uncore".
Figure 4. Circuitry types on the coprocessor
Since that is out of the way, let us get back to package C-states.
After gating the clocks of every one of the cores, what other techniques can you use to get even more power savings? Here's a trivial and admittedly flippant example of what you could do: unplug the processor. You'd be using no power, though the disadvantages of pulling the power plug are pretty obvious. A better idea is to selectively shutdown the more global components of the processor in such a way that you can bring the processor back up to a fully functional state (i.e. C0) relatively quickly.
Package C-States are just that - the progressive shutdown of additional circuitry to get even more savings. Since we have already shut down the entire package's circuitry associated with the cores, the remaining circuitry is necessarily common to all the cores, thus the name "package" C-states.
Idle state packages
there are three package C states: Auto-C3, Deep-C3, and (package) C6. As a reminder, all these are package C-states, meaning that all the threads/CPUs in all the cores are in a HALT state. If all the cores in the coprocessor are in a HALT state, how can the Power Management (PM) software (SW) run? The answer is obvious, if the PM SW can't run on the coprocessor, it runs on the host.
Figure 5. Coprocessor and host power management responsibilities and control
There are two parts to controlling power management on the Intel® Xeon Phi™ coprocessor, the PM SW that runs on the coprocessor, and the PM component of the MPSS Coprocessor Driver that runs on the host. See figure one. The coprocessor part controls transitions into and out of the various core C-states. Naturally, when it is not possible for the PM SW to run on the coprocessor, such as for package Deep-C3 and package C6, the host takes over. Package Auto-C3 is shared by both.
What is shut down in Package C-States
See table 3-2 of the Intel® Xeon Phi™ Coprocessor Software Developer's Guide (SDG).
|Package Idle State||Core State||Uncore State||TSC/LAPIC||C3WakeupTimer||PCI Express* Traffic|
On expiration, package exits PC3
Package exits PC3
Package Auto-C3: Ring and Uncore clock gated
Package Deep-C3: VccP reduced
Package C6: VccP is off (I.e. Cores, Ring and Uncore are powered down)
TSC and LAPIC are clocks which stop when the Uncore is shutdown. They have to be set appropriately when the package is reactivated. "PC3" is the same as the package Auto-C3 state.
Into Package Auto-C3: You can think of the first package state, Auto-C3, as a transition state. The coprocessor PM SW can initiate a transition into this state. The MPSS PM SW can override this request under certain conditions, such as when the host knows that the Uncore part of the coprocessor is still busy.
We will also see that the package Auto-C3 state is the only package state that can be initiated by the coprocessor's power management. Though this seems a little unfair at first, upon further thinking the reason is obvious. At the start of a transition into package Auto-C3, the coprocessor SW PM routine is running and can initiate the transition into the first package state. (To be technically accurate, the core executing the PM SW can transition quickly out of a core C-state into C0 quickly)
Beneath Auto-C3, the coprocessor isn't executing and transitions to deeper package C-states are best controlled by the host PM SW. Not only is this due to the coprocessor's own PM SW is essentially suspended, but because the host can see what is happening in a more global sense, such as Uncore activity after all the cores are gated, and traffic across the PCI Express bus.
Into Package Deep-C3: The host's coprocessor PM SW looks at idle residency history, interrupts (such as PCI* Express traffic), and the cost of waking the coprocessor up from package Deep-C3 to decide whether to transition the coprocessor from a package Auto-C3 state into a package Deep-C3 state.
Into Package C6: Same as the Package Deep-C3 transition but only more so.
This is the fourth installment of a series of blogs on Power Management for the Intel Xeon Phi coprocessor.
For those of you who have read my blog presenting an intuitive introduction to the Intel Xeon Phi coprocessor, The Intel Xeon Phi coprocessor: What is it and why should I care? PART 3: Splitting Hares and Tortoises too, Let's take this description a little further. In Figure A, we have one such diligent high tech worker. He is analogous to one coprocessor CPU/HW thread.
Figure 6. Diligent high tech worker, i.e. an Intel® Xeon Phi™ HW thread
There are 4 HW threads to a core. See Figure B. It's pretty obvious so I'm not going to bother with a multipage boring description of what it means. There is also that mysterious light bulb. The light bulb represents the infrastructure that supports the core, such as timing and power circuits.
Figure 7. Diligent high tech workers in a room, i.e. an Intel® Xeon Phi™ coprocessor core
So what does all this have to do with power management? I ask you to visualize that on every one of those desks is a computer and a desk light.
The Core in C0: When at least one of the high tech workers is diligently working at their task. (I.e. At least one of the core's CPUs/HW threads is executing instructions.)
CPU Executing a HALT: When one of those diligent workers finishes his task, he turns out his desk lamp, shuts down his computer, and leaves. (I.e. one of the HW threads executes a HALT instruction.)
Entering Core-C1: When all four diligent workers finish their tasks, they all execute HALT instructions. The last one finishing turns off the lights. (I.e. The core is clock gated.)
Figure 8. A building full of diligent high tech workers, i.e. an Intel® Xeon Phi™ coprocessor
Let's expand this analogy. Imagine if you will, a building with many rooms, 60 plus in point of fact. See Figure C.
Entering Package Auto-C3: Everyone has left the floor, so the movement sensor automatically shut off the floor lights. (I.e. the coprocessor power management software clock gates the Uncore and other support circuitry on the silicon).
Entering Package Deep-C3: It's the weekend so facilities (i.e. the MPSS Coprocessor Driver Power Management module) shuts down the air condition and phone services. (I.e. the host reduces the coprocessor's VccP and has it ignore interrupts.)
Entering Package C6: It's Christmas week shutdown and forced vacation time, so facilities turns off all electricity, air condition, phones, servers, elevators, toilets, etc. (I.e. the host turns off the coprocessor's Vccp and shuts down its monitoring of PCI Express* traffic.)
We discussed the different types of power management states. Though the concepts are general, we concentrated on a specific platform, the Intel® Xeon Phi™ coprocessor. Most modern processors, have such states with some variation.
Intel® processors have two types of power management states, P-states (runtime) and C-states (idle). The C-states are then divided into two more categories, core and package. P-states are runtime (C0) states and reduce power by slowing the processor down and reducing its voltage. C-states are idle states meaning that they shutdown parts of the processor when the cores are unused. There are two types of C-states. Core C-states shutdown parts of individual cores or CPUs. Since modern processors have multiple cores, package C-states shutdown the circuitry that supports those cores.
The net effect of these power states is to substantially reduce the power and energy usage of modern Intel® processors. This energy reduction can be considerable, in some cases by over an order of magnitude.
The impact of these power savings cannot be over stated for all platforms from smart phones to HPC clusters. For example, by reducing the power and energy consumption of the individual processors in an HPC cluster, the same facility can support more processors. This increases processor density, reduces communication times between nodes, and makes possible a much more powerful machine that can address larger and more complex problems. At the opposite end in portable passively cooled devices such as smart phones and tablets, reduced power and energy usage lengthens battery life and reduces cooling issues. This allows more powerful processors which in turn increases the capability of such devices.
A T-state was once known as a Throttling state. Back in the days before C and P states, T-states existed to save processors from burning themselves up when things went very badly, such as when the cooling fan failed while the processor was running as fast as she could. If a simple well placed temperature sensor registered that the junction temperature was reaching a level that could cause damage to the package or its contents, the HW power manager would place the processor in different T-States depending upon temperature; the higher the temperature, the higher the T-State.
The normal run state of the processor was T0. When the processor entered a higher T-state, the manager would clock gate the cores to slowdown execution and allow the processor to cool. For example, in T1 the HW power manager might clock gate 12% of the cycles. In rough terms, this means that the core will run for 78% of the time and sleep for the rest. T2 might clock gate 25% of the cycles, etc. In the very highest T-state, over 90% of the cycles might be clock gated. (See the figure below.)
Figure 9. Running Time for T0/P0, P1, and T1 States
Note that in contrast to P-states, the voltage and frequency are not changed. Also, using T-states the application runs slower not because the processor is running slower, but because it is suspended for some percent of the time. In some ways, you can think of a T-state as being like a clock gated C1 state with the processor not being idle, i.e. it is still doing something useful.
In the figure above, the top most area shows the runtime of a compute intensive workload if no thermal overload occurs. The bottom shows the situation with T states (i.e. before P states), where the processor begins to toggle between running and stopped states to cool down the processor. The middle is what happens in current processors, where the frequency/voltage pair is reduced allowing the processor to cool.
There are a few more practical reasons you should be at least aware of T-states.
Taylor Kidd, List of Useful Power and Power Management Articles, Blogs and ReferencesMay 19, 2014
For those of you with a passion for power management, see the Intel Xeon Phi Coprocessor Software Developer's Guide. It has state diagrams and other goodies. I recommend sections 2.1.13, "Power Management", and all of section 3.1, "Power Management (PM)".
Taylor Kidd, Intel® Xeon Phi™ Coprocessor System Software Developers Guide April 14, 2014, updated January 14, 2015
Here is a list of earlier blogs
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserverd for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804