I hate moving. Nothing ever goes as it should. It takes ten times longer than you expect. And the last box is finally unpacked just before you end up moving again.
There's got to be a catch
There are 5 CC-states and, depending on how you count, 6 PC-states in the Penryn line of Intel processors. And in Microsoft Windows XP, there are 4 OS C-states. So are there 5 C-states, 6 C-states, 4 C-states, or 15? Pick whichever number makes you least uncomfortable. Personally, I first imagine a three-set Venn diagram with overlapping elements and transition annotations. Then I get confused and give up.
Given what we've talked about above, it seems as if we should always drop a core into the lowest permissible CC-state, right?
There are a few reasons not to. First, the OS's Power Management (PM) policy, not the hardware, determines when a core enters a CC-state. From our standpoint as a hardware manufacturer, we have little say in this; I'll explain why that matters later. Second, there is always a cost for dropping into a lower C-state: the time required for the core to transition from an idle state, e.g. CC5, back to C0. As you start using deeper CC-states, this latency becomes significant. For example, the latency to go from CC3 back to C0 is around 20 us, which is ages when we're talking about 3 GHz processors.
This latency penalty is even worse once you realize that the phrase "in a given C-state" is misleading. As I mentioned above, it's easy to picture a core descending like a waterfall from C0 into C1, into C2, into C3. (See Figure A.) If that were the case, you'd pay the 20 usec penalty only once. In reality, the core oscillates between C0 and C3 hundreds, if not thousands, of times a second until the OS's PM code decides that the percentage residency merits moving to the next C-state (e.g. CC3 to CC2). In Windows, the C-state a core transitions to is based on its percent idle over a given interval. Each of those transitions exacts the 20 usec penalty, and there are hundreds of them. Doing the math, a 20 usec delay suffered 100 times a second is a whopping 2 msec of added latency per second. (See Figure B.)
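The arithmetic behind that number is worth spelling out:

```python
# Back-of-the-envelope check of the penalty above: a 20 usec exit latency
# paid on each of 100 C0 <-> C3 round trips per second.
exit_latency_us = 20
transitions_per_sec = 100

penalty_ms_per_sec = exit_latency_us * transitions_per_sec / 1000
print(penalty_ms_per_sec)  # 2.0 msec of added latency per second
```

And at thousands of transitions per second, the same math puts the penalty in the tens of milliseconds.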
Even when it is possible to drop a core into a deeper CC-state, the OS has to ask itself questions such as: how likely is it that processes will need to do more work very soon, so that dropping into a deeper CC-state would exact an unacceptable penalty? Similarly, the processor has to ask whether dropping a core into a lower CC-state would cause incorrect operation, say, whether the delay in processing an interrupt would cause an event to be lost.
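As a toy sketch of those two questions, consider a decision rule that rejects any state whose exit latency would miss an interrupt deadline or eat up the predicted idle period. Everything here is illustrative: the latencies (other than the 20 usec CC3 figure quoted earlier) are placeholders, and real governors, such as Linux's cpuidle menu governor, use far richer heuristics.

```python
# Hypothetical sketch, not any OS's actual policy: deeper states save more
# power but wake more slowly, so reject states that are too slow to wake.

def choose_c_state(predicted_idle_us, interrupt_deadline_us):
    """Return the deepest C-state whose exit latency both meets the
    interrupt deadline and is small relative to the predicted idle time."""
    # (state, exit_latency_us), shallowest to deepest. Only the 20 us
    # CC3 number comes from the text; the rest are made up.
    states = [("C1", 1), ("C2", 10), ("C3", 20)]
    chosen = "C0"  # default: stay awake
    for name, latency in states:
        # Waking must not delay interrupt handling past its deadline,
        # and should cost no more than half the idle period we expect.
        if latency <= interrupt_deadline_us and latency * 2 <= predicted_idle_us:
            chosen = name
    return chosen

# A long idle window with a tight interrupt deadline still forbids C3:
print(choose_c_state(predicted_idle_us=1000, interrupt_deadline_us=5))   # C1
# Relax the deadline and the deeper state becomes acceptable:
print(choose_c_state(predicted_idle_us=1000, interrupt_deadline_us=100)) # C3
```

The point of the sketch is that the depth decision is gated by two independent constraints, which is why the OS can't simply always pick the deepest state.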
Figure A. The waterfall misconception.
Figure B. What actually happens "in a given C-state".