- 1. Introduction
- 2. Logging and Profiling
- 3. Does my Application have an Issue?
- 4. Identifying Root Cause
Finding where an application consumes power can be very challenging. The purpose of this guide is to give developers easy-to-follow, step-by-step guidance on how to find and troubleshoot power-efficiency issues in their applications. We recommend reading the guide in its entirety at least once to get a feel for how to find and fix power-related issues. You can then use it as a quick reference while troubleshooting your application.
The scope is limited to Windows 7* and Windows 8*. We have tried to cover as many of the common power-related issues as possible, and we will continue to add to and refine the guide. Your suggestions are certainly welcome!
Note that the guide does not cover how to measure power consumption. Though power consumption information is certainly useful for detailed analysis, it is not necessary for finding and fixing the most common power-related issues in an application.
Profiling and troubleshooting an application for power issues involves several distinct phases. Ideally, developers should treat power efficiency as an integral requirement for their applications and make every effort to choose power-efficient algorithms and language constructs from the very start. More information on how to optimize an application for power efficiency can be found in the references at the end of this guide. Once an application has been written, we must profile it to see whether any power issues crept in during development. Section 2 (Logging and Profiling) introduces the tools and methodology we will use to collect data about an application. Section 3 (Does my Application have an Issue?) discusses a few metrics that can be used to quickly determine whether the application may have an issue. Section 4 (Identifying Root Cause) provides step-by-step guidelines on how to identify the root cause of an issue and suggestions on how to fix it.
We begin by collecting and logging data about the application being profiled. The data we are interested in includes processor C-state residencies (more on this later), interrupt rates, context switch rates, call stacks, and so on. There are several tools you can use to collect this data, but we will use these three:
- Performance Monitor (PerfMon): a Windows tool used to view performance data. Although the name implies that it is only for performance, it also provides information that is useful for power analysis: CPU utilization (% processor time), interrupt rate, context switch rate, and system call rate. In Windows 8, it also provides information such as the clock interrupt rate and the wake-up rate. You can find more information on PerfMon in the references at the end of this guide.
- Battery Life Analyzer (BLA): a software tool that monitors various software and hardware activities that affect battery life. We will use BLA to quickly determine whether our application has a power-efficiency issue that needs further analysis. Please see the references at the end of this guide for more information on how to use BLA.
- Microsoft Windows Performance Toolkit (Xperf): consists of the Windows Performance Recorder (WPR), which collects data, and the Windows Performance Analyzer (WPA), which analyzes it. WPR is based on Event Tracing for Windows (ETW) and captures detailed system and application behavior and resource usage. WPA is used to review the event trace log files created by WPR. It displays the performance data in graphs and tables, making it easy to investigate potential issues. With this toolkit you can root cause an issue by drilling down to the process, thread, and API level to find the power-hungry calls in the application. An introductory tutorial on how to use this toolkit can be found in the references at the end of this guide.
Profiling and analysis must always be done on a “clean” system. A clean system here means that it only runs applications or services that are needed to operate the system and to run the target application. Applications like antivirus software and disk utilities should be disabled or removed.
An application must be profiled under two scenarios:
- Idle Mode: An application is in the idle mode when it is running without doing anything. For example, a media player is in the idle mode when it is running but is not playing a video clip. It is very important to do this analysis because many user applications spend much of their time in this mode. An idle application should only consume minimal resources.
- Active Mode: The application is in the active mode when it is running a workload. For example, a media player is in the active mode when it is playing a video clip. In this mode we are generally interested in performance issues that affect power, as well as wasted work that increases the overall energy consumption of the system.
Data logging steps are as follows:
- Start PerfMon. The counters of interest are: Deepest C-State, Interrupt Rate, Clock Interrupt Rate, Wake-up Rate, Context Switch Rate, and System Call Rate.
- Start your application and wait for fifteen minutes for the system to settle down before collecting data.
- Collect PerfMon data for at least 3 minutes. Save the results and close PerfMon.
- Start BLA. Set it up to collect data for the following modules: “Software Activity Analysis” and “CPU C-States”
- Wait for the system to settle before logging. Collect data for at least 3 minutes.
- Review PerfMon and BLA data to see if your application needs further analysis. Please refer to section “Does my Application have an Issue?” for details. If no issues, you are done.
- Close BLA and start WPR.
- Wait for fifteen minutes for the system to settle.
- Begin WPR logging. Collect data for at least 3 minutes.
- Refer to section “Identifying Root Cause” to find what may be causing issues in your application.
Once we have collected data on our applications we must be able to quickly decide if we even have a power efficiency problem. This section covers some common metrics that we can use to decide whether our application is power-efficient or not.
Modern processors implement various power-saving states called C-states. A processor is in the C0 state when it is active. The operating system can put individual cores in the processor into deeper C-states as needed. As a core enters deeper C-states (for example, from C0 to C3 to C7) it becomes more power-efficient by turning off more and more functions. The flip side is that a core takes longer to exit deeper C-states, so there is a latency penalty. The “Package C-state” is the shallowest C-state amongst the cores in a multi-core processor. In general, an idle application should allow the processor to enter its deepest C-state and stay there as long as possible. Hence “C-State Residencies” (the percentage of time the processor spends in the various C-states) are a very important metric of application power efficiency. It is also valuable to observe how frequently a processor enters and exits the various C-states; generally, less power is consumed when the frequency of transitions is reduced. More information about C-states can be found in the references at the end of this guide.
It is recommended that you use BLA to collect C-state residencies. PerfMon and Xperf report residencies only up to C3, while BLA can report residencies as deep as C7. Further, BLA reports the actual C-state residencies (called “hardware C-states”), while the other tools report only estimates (called “software C-states”).
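To make the residency metric concrete, the following is a minimal Python sketch that computes C-state residencies and the number of state transitions from an ordered list of samples. The function name, the input format, and the sample trace are hypothetical; real residency data would come from a tool such as BLA.

```python
from collections import defaultdict

def cstate_residency(intervals):
    """Return (residency_percent_by_state, transition_count).

    intervals: ordered list of (state_name, duration_ms) tuples
    (a hypothetical format chosen for this illustration).
    """
    total = sum(duration for _, duration in intervals)
    if total <= 0:
        raise ValueError("trace must cover a positive amount of time")
    time_in_state = defaultdict(float)
    for state, duration in intervals:
        time_in_state[state] += duration
    residency = {s: 100.0 * t / total for s, t in time_in_state.items()}
    # A transition is any change of state between consecutive intervals.
    transitions = sum(
        1 for (a, _), (b, _) in zip(intervals, intervals[1:]) if a != b
    )
    return residency, transitions

# Hypothetical 100 ms trace of one core.
trace = [("C0", 2), ("C7", 48), ("C0", 1), ("C3", 9), ("C7", 40)]
residency, transitions = cstate_residency(trace)
```

In this made-up trace the core spends most of its time in C7. The same residency with fewer, longer stays in C7 would generally be preferable, since each exit from a deep C-state costs latency and energy.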
For Idle Application
There are no recommended values for an active application. It really depends on the category of application – if it is really heavily CPU-bound, such as an unlocked encoder, then it may show 100% C0 residency.
How to Detect
1. High Platform Timer resolution
2. The use of PeekMessage without WaitMessage to process Windows messages
3. Non-concurrent CPU and GPU Activities
4. Lack of timer coalescing in multithreaded applications
5. Excessive I/O
The operating system uses the Platform Timer to wake up the CPU and schedule processes. On Windows 7 this timer is usually triggered every 15.6ms while on Windows 8 the Operating System dynamically sets the timer resolution based on scheduling needs. Both Operating Systems allow applications to change the platform timer resolution so that they can better handle any time-critical tasks. For example, a video conferencing application may change the timer resolution to 1ms so that it can receive and decode frames as soon as possible. In general, a higher platform timer resolution will consume more power because the CPU is woken up more frequently. Use BLA to look at the platform timer resolution and to see if your application is changing it while idle and/or active.
An idle application rarely needs a timer resolution finer than 15.6ms. If your application changes the timer resolution during active processing, ensure that it restores the resolution to 15.6ms when it goes idle.
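A rough back-of-the-envelope calculation shows why the timer resolution matters. The sketch below assumes one timer tick per period and ignores all other interrupt sources as well as the dynamic tick behavior of Windows 8, so treat the numbers as an upper bound on tick-driven wake-ups rather than a measurement. The function name is hypothetical.

```python
def ticks_per_second(period_ms):
    """Number of platform timer ticks per second for a given period."""
    return 1000.0 / period_ms

default_rate = ticks_per_second(15.6)  # about 64 ticks per second
fast_rate = ticks_per_second(1.0)      # 1000 ticks per second
ratio = fast_rate / default_rate       # how many times more often the CPU may be woken
```

Under these assumptions, a 1ms resolution can wake the CPU more than fifteen times as often as the 15.6ms default, which leaves far less opportunity to stay in deep C-states.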
How to Detect
1. An application increased the timer resolution while in the active mode but did not restore it when going into idle
Interrupt Rate is the average rate, in incidents per second, at which the processor receives and services hardware interrupts. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards, and other peripheral devices. We get this value from PerfMon.
In general, the interrupt rate needs further investigation if it is greater than 8000-9000 interrupts/sec in active mode. Take a look at device activity (graphics, USB, network, etc.).
How to Detect
1. Excessive I/O activity (Most likely cause)
2. Short timer resolution
3. Longer DPC call.
Context Switching Rate is the combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service. A high rate of context switching means that the processor is being shared repeatedly. High context switching causes more overhead and prevents the system from going idle thus increasing power consumption. Try to reduce this value by decreasing the number of active threads. We get this value from PerfMon.
Do further investigation if the Context Switch Rate is greater than 7K per active thread in the application.
How to Detect
1. Platform timer resolution is high
2. Busy Wait
3. Use of the Sleep API with a low timeout value (e.g. Sleep(0)), or WaitForSingleObject with a short timer period
System Call Rate is the combined rate of calls to operating system service routines by all processes running on the computer. These routines perform all of the basic scheduling and synchronization of activities on the computer, and provide access to non-graphic devices, memory management, and name space management. In general, if this value in the active mode is at least three times the value in the idle mode, it is unusually high. We get this value from PerfMon.
Do further investigation if the System Call Rate is greater than 45K per active thread in the application. See the references at the end of this guide for more details.
How to Detect
1. Platform timer resolution is high
2. Use of WaitForSingleObject API with short timer period
3. Excessive I/O activity
4. Frequent calls to device drivers
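The thresholds quoted in this section can be combined into a quick screening check. The sketch below is only an illustration: the function name and argument names are hypothetical, the lower bound of the 8000-9000 range is used for the interrupt rate, and the per-thread limits are assumed to be per-second rates as reported by PerfMon.

```python
INTERRUPT_LIMIT = 8000          # interrupts/sec, lower end of the quoted range
CONTEXT_SWITCH_LIMIT = 7000     # per active thread (assumed per second)
SYSTEM_CALL_LIMIT = 45000       # per active thread (assumed per second)

def screen_metrics(interrupts_per_sec, context_switches_per_sec,
                   system_calls_per_sec, active_threads):
    """Return the names of the metrics that warrant further investigation."""
    if active_threads < 1:
        raise ValueError("active_threads must be at least 1")
    findings = []
    if interrupts_per_sec > INTERRUPT_LIMIT:
        findings.append("interrupt rate")
    if context_switches_per_sec / active_threads > CONTEXT_SWITCH_LIMIT:
        findings.append("context switch rate")
    if system_calls_per_sec / active_threads > SYSTEM_CALL_LIMIT:
        findings.append("system call rate")
    return findings

# Hypothetical PerfMon averages for an application with two active threads.
busy = screen_metrics(9500, 20000, 100000, active_threads=2)
quiet = screen_metrics(1000, 5000, 10000, active_threads=2)
```

An empty result does not prove the application is power-efficient; it only means none of these coarse thresholds was crossed.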
In this section we will show how to identify the root cause of an issue. For each issue we will also make some recommendations on how to fix it.
Some applications purposely change the platform timer resolution to 1ms to ensure smoothness of operation. This will cause the system to wake up more often, thus consuming more power. BLA can be used to identify the process that changes the platform timer resolution.
1. In BLA select and run the module “Software Activity Analysis” if you haven’t already done so.
2. Select the “Active Analysis” tab to display the timer tick. In the example below we highlight one sample application where the platform timer period was changed to something other than the default.
- Do not change the platform timer resolution unless it is absolutely necessary. If changing the timer is unavoidable, make sure to change it back to 15.6ms when no longer required.
- If the application needs a high-resolution time stamp, use the QueryPerformanceCounter API or the RDTSC instruction instead of the GetXXXTime APIs.
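As an illustration of taking high-resolution time stamps without touching the platform timer resolution, here is a minimal Python sketch. It relies on `time.perf_counter_ns()`, which on Windows is backed by QueryPerformanceCounter; the `timed` helper and the workload are hypothetical. A native application would call QueryPerformanceCounter directly.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed_nanoseconds)."""
    start = time.perf_counter_ns()       # high-resolution, monotonic counter
    result = func(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns

# Hypothetical workload: sum the first 100,000 non-negative integers.
result, elapsed_ns = timed(sum, range(100_000))
```

Reading a counter in this way does not change how often the system wakes up, whereas raising the timer resolution affects the whole platform.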
Similar to the above problem, periodic activities with short intervals will cause the system to wake up more often. WPA can be used to identify periodic activities.
- Using WPA, open the data (ETL file) you logged using WPR in Section 2. Select the option “Computation”.
- Then select “CPU Usage (precise) By Process, Thread”
- Select option “Timeline by Process, Thread”.
- See if the chart shows what appears to be periodic activity. For example, in the figure below we see a periodic activity at approximately every 11ms.
- Look at the data table to see which specific thread is causing the periodic activity. For example, in the figure below, column TimeSinceLast shows that thread 3644 is activated every 10,905 microseconds.
- Minimize the use of periodic activities. If periodic activities are indeed needed, use the largest time period your application can tolerate.
- Reduce frame refreshing or render activities in Idle Mode since it can impact the display power.
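The check performed by eye on the TimeSinceLast column can also be sketched in code. The example below assumes you have exported a thread's switch-in time stamps (in microseconds) from the trace; the function names, the 5% tolerance, and the sample data are hypothetical.

```python
from statistics import mean, pstdev

def time_since_last(timestamps_us):
    """Gaps between consecutive activations, like the TimeSinceLast column."""
    return [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]

def looks_periodic(timestamps_us, tolerance=0.05):
    """Return (is_periodic, average_gap_us).

    The activity is treated as periodic when the spread of the gaps is
    small relative to their average (an assumed, simple heuristic).
    """
    gaps = time_since_last(timestamps_us)
    if len(gaps) < 2:
        return False, None
    avg = mean(gaps)
    return pstdev(gaps) <= tolerance * avg, avg

# Hypothetical thread that wakes every 10,905 microseconds.
regular = [0, 10905, 21810, 32715, 43620]
# Hypothetical thread with irregular activity.
irregular = [0, 1000, 50000, 52000, 200000]

regular_result = looks_periodic(regular)
irregular_result = looks_periodic(irregular)
```

A short, stable gap such as the roughly 11ms one above is the pattern worth investigating, because it forces the CPU out of its idle states at a fixed high rate.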
The Sleep API is used to suspend the current thread for a period of time. A Sleep call with zero duration is commonly used to switch the calling thread out if another thread of the same or higher priority is ready. However, if this call is made very often, it will increase the context switch rate and cause power consumption to increase.
Use WPA to detect calls to Sleep().
- If Sleep(0) is being used as a method to synchronize multiple threads in an application, consider using mutexes or semaphores instead.
- If Sleep(0) is being used as a method to balance load amongst worker threads, consider using it in conjunction with the pause instruction. More details can be found in the references at the end of this guide.
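The first recommendation can be illustrated with a small sketch. This is a Python analogue rather than Win32 code: `threading.Semaphore` stands in for a Win32 semaphore, and all names are hypothetical. The point is that the worker blocks inside the operating system until work is available, instead of repeatedly yielding with a zero-length sleep and checking a flag.

```python
import threading

items_available = threading.Semaphore(0)  # counts work items ready for the worker
work_items = []
processed = []

def worker(expected):
    for _ in range(expected):
        # Blocks until the producer releases the semaphore; the thread
        # causes no context switches while it waits.
        items_available.acquire()
        processed.append(work_items.pop(0) * 2)

thread = threading.Thread(target=worker, args=(3,))
thread.start()
for value in (1, 2, 3):
    work_items.append(value)     # publish the item first...
    items_available.release()    # ...then signal that it is ready
thread.join()
```

Compared with a loop that calls a zero-length sleep until data shows up, the waiting thread here is scheduled only when there is something to do.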
The WaitForSingleObject API, as its name implies, waits until the specified object is in the signaled state or the time-out interval elapses. If this API is called with a very short time-out interval, it will impact battery life. WPA can detect where this API is called in an application.
- In WPA, select the option “Computation”
- Then select “CPU Usage (precise) By Process, Thread”
- Select option “Timeline by Process, Thread”.
- See if the chart shows what appears to be periodic activity.
- Double-click on the thread that causes periodic activities.
- Make sure to select columns NewProcess, NewThreadId, NewThreadStack, Separator (orange line), Count and TimeSinceLast (ms)
- Look for calls to WaitForSingleObject in the call stack.
- See if you can use TryEnterCriticalSection with a spin count instead (using it with no spin count may not be as efficient). Please see the references at the end of this guide for details.
This API is used to dispatch incoming sent messages, check the thread message queue for a posted message, and retrieve the messages if they exist. If there is no message waiting, it will immediately return. Using PeekMessage in a loop will keep the system in the active mode and will consume a lot of power.
Use WPA to look at the call stack and see whether PeekMessage appears frequently in the area where the application should have been idle (follow steps similar to those in Section 4.4).
Use the WaitMessage API in combination with PeekMessage to save power. See the references at the end of this guide for details.
Polling, generally, refers to the situation where a device is repeatedly checked for readiness, and if it is not, the computer returns to a different task. Polling is not good for power since it will keep the CPU active, thus consuming unnecessary power.
Detection depends on which polling APIs the application calls. In WPA, use the call stacks to see whether polling is done excessively.
Replace polling logic with event-driven logic whenever possible. For example, use an event to signal that data has arrived at an I/O port instead of constantly checking that port for data.
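The following minimal sketch shows the event-driven shape in Python. It is an analogue, not device code: a `queue.Queue` stands in for the I/O source, `None` is used as a hypothetical end-of-data sentinel, and all names are illustrative. The reader thread blocks until data arrives rather than checking for it in a loop.

```python
import queue
import threading

def reader(inbox, received):
    while True:
        item = inbox.get()       # blocks; the thread sleeps until data arrives
        if item is None:         # sentinel (assumed convention): no more data
            break
        received.append(item)

inbox = queue.Queue()
received = []
thread = threading.Thread(target=reader, args=(inbox, received))
thread.start()
for packet in ("a", "b", "c"):
    inbox.put(packet)            # arrival of data wakes the reader
inbox.put(None)
thread.join()
```

A polling version of the same reader would wake up on every check whether or not data was present, keeping the CPU out of its idle states.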
Spin-wait, or busy-wait, is very similar to polling. Unlike polling, a spin-wait checks whether a device is ready and, if it is not, the computer does not return to a different task. A spin-wait that does not use the pause instruction will waste power.
- Busy wait often shows a high context switch rate relative to CPU utilization.
- If debugging ring0 applications, you will see NTDelayExecution in the call stack.
Insert the pause statement into the spin-wait loop.
This is detected using GPUView. The figure below shows an example of non-concurrent CPU and GPU activities.
Consider using SetMaximumFrameLatency()
This feature was introduced in Windows 7 and allows scheduled timers to specify a tolerable delay for timer expiration. This allows the OS to group multiple software timer expirations into a single period of processing. This way the CPU gets interrupted less often, thus saving power. More information about this feature can be found in the references at the end of this guide.
In WPA, view thread activity over time and see whether different threads are active at different times.
Allow timer coalescing with a time latency that the application can tolerate.
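To show why a tolerable delay reduces wake-ups, here is a toy model in Python. It is not how the Windows kernel implements coalescing; the function name, the greedy grouping rule, and the timer values are all assumptions made for illustration. Each timer is described by its due time and the delay it can tolerate, and every wake-up fires all timers that are already due.

```python
def count_wakeups(timers):
    """Count wake-ups needed to service timers given as (due_ms, tolerance_ms).

    Greedy toy model: wake at the earliest "latest allowed" time and fire
    every timer that is already due at that moment.
    """
    pending = sorted(timers, key=lambda t: t[0] + t[1])
    wakeups = 0
    while pending:
        wake_at = pending[0][0] + pending[0][1]
        # Timers with due time <= wake_at are serviced in this wake-up.
        pending = [t for t in pending if t[0] > wake_at]
        wakeups += 1
    return wakeups

due_times = [10, 12, 15, 31]                          # hypothetical timers (ms)
strict = count_wakeups([(d, 0) for d in due_times])   # no tolerance
relaxed = count_wakeups([(d, 10) for d in due_times]) # 10 ms tolerable delay
```

In this model the four strict timers cost four wake-ups, while a 10ms tolerance lets the first three share one wake-up, so only two are needed.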
High network I/O, USB, or storage activity is indicated by high package C2 residencies (as observed by BLA). You can then use WPA to see which specific I/O device is causing the issue.
The HDD Spin-down module in BLA can be used to see if the disks stay spun-up most of the time.
If the HDD Spin-down analysis is showing that the HDD is spending a lot of time spun-up, then the Disk Activity modules’ analysis can be used to identify the software that is causing the disk spin-up.
Reduce the application's I/O activity when feasible. Read or write data in a few large chunks rather than in many small ones.
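The effect of batching can be sketched as follows. The `CountingSink` class is a hypothetical stand-in for a file or device that simply counts how many write calls it receives; the record size and the 64 KB batch size are arbitrary values chosen for the illustration.

```python
class CountingSink:
    """Stand-in for a file or device; counts write calls and bytes."""
    def __init__(self):
        self.write_calls = 0
        self.total_bytes = 0

    def write(self, data):
        self.write_calls += 1
        self.total_bytes += len(data)

def write_unbatched(sink, records):
    for record in records:
        sink.write(record)                # one I/O request per small record

def write_batched(sink, records, batch_bytes=64 * 1024):
    buffer = bytearray()
    for record in records:
        buffer += record
        if len(buffer) >= batch_bytes:    # flush only when a large chunk is ready
            sink.write(bytes(buffer))
            buffer.clear()
    if buffer:
        sink.write(bytes(buffer))         # flush whatever remains

records = [b"x" * 16] * 1000              # 1000 hypothetical 16-byte records
small_writes, large_writes = CountingSink(), CountingSink()
write_unbatched(small_writes, records)
write_batched(large_writes, records)
```

Both sinks receive the same 16,000 bytes, but the batched version issues a single write instead of a thousand, giving the storage device and the CPU longer uninterrupted idle periods.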
- Power Efficiency: Developing Power Aware Apps, http://software.intel.com/en-us/articles/energy-efficient-software-developing-power-aware-apps/
- Khang Nguyen, “Using Battery Life Analyzer for Studying Application Power Consumption”, available online http://software.intel.com/en-us/blogs/2012/08/29/using-battery-life-analyzer-for-studying-application-power-consumption-2/
- Khang Nguyen, “Using Windows Performance Toolkit in Analyzing Application Power Consumption”, available online http://software.intel.com/en-us/blogs/2012/09/06/using-windows-performance-toolkit-in-analyzing-application-power-consumption
- Khang Nguyen, “Using Performance Monitor in Analyzing Application Power Consumption”, available online http://software.intel.com/node/327993
- Alon Naveh, Efraim Rotem, Avi Mendelson, Simcha Gochman, Rajshree Chabukswar, Karthik Krishnan, Arun Kumar, “Power and thermal management in the Intel® Core™ Duo processor”, available online http://www.intel.com/technology/itj/2006/volume10issue02/art03_Power_and_Thermal_Management/p01_abstract.htm
- Michael Chynoweth, “Implementing Scalable Atomic Locks for Multi-Core Intel® EM64T and IA32 Architectures”, available online http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
- Dale Taylor, “PeekMessage: Optimizing Applications for Extended Battery Life”, available online http://software.intel.com/en-us/articles/peekmessage-optimizing-applications-for-extended-battery-life
- Joe Olivas, Mike Chynoweth, “Benefitting Power and Performance Sleep Loops”, available online http://software.intel.com/en-us/comment/1134767
- Windows Timer Coalescing, available online http://msdn.microsoft.com/en-us/windows/hardware/gg463269.aspx
- System Level Bottlenecks, available online http://msdn.microsoft.com/en-us/library/cc558658(v=bts.10).aspx
About the Authors
The authors are Application Engineers working within the Scale Engineering organization of the Developer Relations Division at Intel, each focusing on software enabling for the developer.