Creating Energy-Efficient Software
This paper examines software methodologies, designs, and software development tools that can be used to improve the energy efficiency of application software and extend mobile platform battery time. Computational efficiency, data efficiency, and context-aware methods can all contribute to creating applications that are power-aware. Many additional resources are available in the form of white papers, developer kits, and analysis tools; these are referenced throughout the paper and listed in the References section.
For years, mobile platform vendors have sought means to extend the battery life of mobile platforms. Battery technologies have gradually improved, processors have new low-power states, and displays have dramatically reduced their power consumption. There is still room for improvement. Software can play an important role in reducing the power used on mobile platforms and extending battery time.
The purpose of this paper is to explain the software methodologies and designs that can be used today to save energy and extend mobile platform battery time as well as describe various tools that support the development of energy-efficient software.
The methodologies described here have been researched and tested by Intel software Application Engineers. In each case, we document the results of the experiments and provide a reference to more detailed information. For a look into the typical test environment and the mechanisms for measuring total platform power, see Appendix A.
The remainder of the paper covers the following topics:
Joule – the international standard unit of energy measurement
Energy – the conventional definition of energy is the “capacity to do work”. A device that is energy-efficient requires less energy for its “work” or task than its energy-inefficient counterpart. For this paper, we use the term to mean the number of joules required to carry out a specific task. For example, the energy required to lift a 100 gram object 1 meter against the pull of earth’s gravity is about 1 Joule.
Power – the amount of energy consumed per unit of time, typically measured in Watts, where a Watt equals 1 Joule per second. For example, a light bulb rated at 60 Watts consumes 60 Joules in one second. Notebook computers running at their highest energy state are rated between 40 and 60 Watts, but on average consume far less.
Heat – more specifically resistive heat, is a natural by-product of running current through a conductor; engineers strive to minimize this in computer design; too much heat means more cooling (typically by a fan) which requires more energy
While we strive to use the terms energy and power appropriately, there may be instances where they are used synonymously.
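The two worked examples in the definitions above reduce to small formulas. The sketch below restates them in Python; the function names are illustrative and not from the paper, and 9.81 m/s² is assumed as the usual approximation of Earth's gravitational acceleration.

```python
G = 9.81  # m/s^2, assumed approximation of Earth's gravitational acceleration


def lift_energy_joules(mass_kg, height_m):
    """Energy needed to lift a mass against gravity: E = m * g * h."""
    return mass_kg * G * height_m


def energy_joules(power_watts, seconds):
    """Energy consumed at constant power: E = P * t (1 W = 1 J/s)."""
    return power_watts * seconds


print(lift_energy_joules(0.1, 1.0))  # 100 g lifted 1 m: about 1 J
print(energy_joules(60, 1))          # 60 W bulb for one second: 60 J
```

The same `energy_joules` relation is what ties a power reading to the energy a task actually costs: halving the time a task takes at the same power halves its energy.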
The power profile of various components on the mobile platform depends on the usage model. For example, the relative contribution of processor power to the overall platform power will be significant in a CPU-intensive workload, but it will not be a dominant factor while the platform is idling. Furthermore, it may also vary depending on whether both the cores were utilized or not (i.e. single-threaded vs. multithreaded). The following provides an idea of how the profile varies during various usage models. The CPU, memory, and file system tests were run using SiSandra benchmarks (http://www.sisoftware.co.uk). Note that the platform power below does not include LCD, since we have excluded it from our analyses. (Others include WLAN, HD-Audio, mini-card, ICH, and other peripherals.)
As seen above, mobile developers need to have an idea of power drainage depending on the usage model, and target specific components for extending battery life and conserving power.
The goal of computational efficiency is to complete a task more quickly. Intuition tells us that if the CPU can accomplish the task in fewer instructions or by doing work in parallel on multiple cores, and then drop to a low-power state, the overall energy required to complete the task will be lower. One approach to achieve this is to use the best algorithms and data structures for the particular problem. Another method, for which we have research results below, is to exploit the performance-per-watt advantages of Intel multi-core processors by using multithreading to increase application performance and save energy.
Algorithms and data structures are a long-standing area of research in computer science. Considerable effort has gone into research to find more efficient means to solve problems and to investigate and document the corresponding time and space tradeoffs. While optimizing specific algorithms is not an area of interest for our team per se, we can conclude from computer science theory that the choice of algorithms and data structures can make a vast difference in the performance of an application. All other things being equal, using an algorithm that computes a solution in Θ(n log n) time is going to perform better than one that does the job in Θ(n²) time. For a particular problem, a stack may be better than a queue and a B-tree may be better than a binary tree or a hash function. The best algorithm or data structure to use depends on many factors, which indicates that a study of the problem and a careful consideration of the architecture, design, algorithms, and data structures can lead to an application that performs better and consumes less energy. For a detailed study of the analysis of algorithms see [7,8,9].
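To make the gap between quadratic and n log n growth concrete, the following sketch counts element comparisons for two textbook sorts on the same input. It is an illustration only: comparison counts are used as a rough proxy for CPU work, no energy is measured, and the function names and the 1024-element input are assumptions chosen for the example.

```python
def selection_sort_comparisons(values):
    """Sort a copy with selection sort; return (sorted_list, comparisons)."""
    data = list(values)
    comparisons = 0
    for i in range(len(data)):
        smallest = i
        for j in range(i + 1, len(data)):
            comparisons += 1
            if data[j] < data[smallest]:
                smallest = j
        data[i], data[smallest] = data[smallest], data[i]
    return data, comparisons


def merge_sort_comparisons(values):
    """Sort with top-down merge sort; return (sorted_list, comparisons)."""
    if len(values) <= 1:
        return list(values), 0
    mid = len(values) // 2
    left, left_count = merge_sort_comparisons(values[:mid])
    right, right_count = merge_sort_comparisons(values[mid:])
    merged = []
    count = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count


data = list(range(1024, 0, -1))  # reverse-sorted input, 1024 elements
_, quadratic = selection_sort_comparisons(data)
_, linearithmic = merge_sort_comparisons(data)
print(quadratic, linearithmic)
```

Selection sort always performs n(n-1)/2 comparisons, while merge sort stays within n log2 n, so the ratio between the two counts widens as the input grows.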
For mobile platforms, power consumption has always been one of the major areas of importance. A multithreaded application may be able to finish the job at hand faster than a single-threaded one. The resulting boost in performance may yield power savings, because system resources are used for less time than with a single-threaded version.
There are other considerations introduced with multithreading an application, such as the effects on power/performance when the threads in the application are imbalanced (when one thread does significantly more work than the other threads), differences in CPU utilization of the threads (for example, one thread might consume 100 percent CPU, while the other threads might consume 10–20 percent of the CPU), and when the threads are affinitized to a single core rather than running on separate cores. This research investigated such issues with a wide variety of multithreaded applications and multitasking scenarios, and proposes recommendations that should be considered when multithreading an application.
All the tests described here were conducted on dual-core Intel Core Duo engineering sample systems with the (code-name) Napa platform. Power measurements were accomplished with Fluke NetDAQ®
A variety of applications (single-threaded and multithreaded implementations), along with test kernels developed in-house, are characterized here for power/performance measurements. These applications include a variety of content creation applications, kernels from the gaming space, kernels using Intel® Integrated Performance Primitives (IPP), and office productivity applications.
Single-threaded and multithreaded implementations of the applications discussed here were tested with two power schemes that Microsoft Windows* XP provides when coupled with Intel SpeedStep® technology: MaxPerf or Always-On (AO) mode and Adaptive or Portable-Laptop (PL) mode. AO mode provides maximum available frequency while PL mode adjusts the frequency to conserve energy.
In the results reported below, the following threading models were tested:
The graphs in this section discuss power/performance results on an Intel Core Duo engineering sample system running Windows* XP. Time is expressed in seconds. Power measurements were done with Fluke NetDAQ, which reports average power in watts (W). Average power is then converted to total energy in milliwatt-hours (mWHr), referred to below as total power, using the application run-time data.
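The conversion from average power and run time to mWHr is simple arithmetic, sketched below. The function name is illustrative. The 10.8 W input is not a measured value from the paper; it is the average power that the ~150 mWHr over ~50 seconds quoted for the single-threaded cryptography run would imply.

```python
def average_power_to_mwh(avg_power_watts, run_time_seconds):
    """Convert average power over a run into energy in milliwatt-hours."""
    joules = avg_power_watts * run_time_seconds  # 1 W = 1 J/s
    watt_hours = joules / 3600.0                 # 1 Wh = 3600 J
    return watt_hours * 1000.0                   # 1 Wh = 1000 mWh


# Assumed reading: 10.8 W average over a 50-second run.
print(average_power_to_mwh(10.8, 50))  # approximately 150 mWh
```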
The graph in Figure 1 indicates performance data for running single-threaded (ST) and multi-threaded (MT) versions of several CPU-intensive applications. Cryptography and video encoding applications have two MT implementations and results are indicated as MT-1 and MT-2. For content creation applications, multithreading is done with only one implementation, indicated as MT-1. The multithreaded applications clearly show significant performance improvements over running single-threaded versions. For example, the ST version of cryptography takes ~50 seconds to complete, while both the MT-1 and MT-2 versions take only ~25 seconds.
Figure 1: Balanced Threading Performance
Figure 2: Balanced Threading - CPU Power (Adaptive)
Figure 3: Balanced Threading - Platform Power (Adaptive)
Figures 2 and 3 indicate CPU power and total platform power for adaptive (portable/laptop) mode, respectively. Adaptive mode is chosen because it favors power savings by dynamically changing CPU frequency on demand. For each application run (ST and MT), power data-gathering is normalized to the longest run-time. For example, as indicated in Figure 1, the cryptography workload runs for ~50 seconds in ST mode and ~25 seconds in MT modes. The power data is measured for 50 seconds in both ST and MT cases. The intent here is to determine if power can be saved by finishing the CPU-intensive task faster (as in the case of MT) and going to idle state for the remaining duration.
As indicated in Figure 2, power saving is achieved by finishing the job faster (MT) and idling for the remainder of the time taken by the ST version. This indicates that multithreading done correctly not only shows performance improvements but also saves power. For example, the cryptography ST version running for ~50 seconds consumes ~150 mWHr of total power, while running the cryptography MT version for ~25 seconds and idling the system for the remaining 25 seconds consumes ~110 mWHr of total power. Hence, multithreading helps save power. The graph in Figure 3 indicates the implication of multithreading on total platform power. As indicated, running a multithreaded version of an application consumes lower platform power compared to running a single-threaded version.
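The normalization described above (measure both versions over the same window, with the faster one idling for the remainder) can be modeled as a short calculation. This is a sketch under stated assumptions: the function name is illustrative, and the active and idle power draws are hypothetical values, not measurements from the paper.

```python
def window_energy_mwh(active_watts, idle_watts, active_seconds, window_seconds):
    """Energy over a fixed window: run at active power, then idle until it ends."""
    if active_seconds > window_seconds:
        raise ValueError("active time cannot exceed the measurement window")
    idle_seconds = window_seconds - active_seconds
    joules = active_watts * active_seconds + idle_watts * idle_seconds
    return joules / 3.6  # 1 mWh = 3.6 J


# Hypothetical power draws: one busy core vs. two busy cores, same idle floor.
st = window_energy_mwh(active_watts=11.0, idle_watts=3.0,
                       active_seconds=50, window_seconds=50)
mt = window_energy_mwh(active_watts=14.0, idle_watts=3.0,
                       active_seconds=25, window_seconds=50)
saving = (st - mt) / st
print(round(st, 1), round(mt, 1), round(saving, 2))
```

In this model the multithreaded run wins whenever its extra power above idle, multiplied by its shorter active time, is less than the same product for the single-threaded run. For comparison, the figures quoted in the text (~150 mWHr versus ~110 mWHr) correspond to a saving of roughly 27 percent.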
In this section, we will examine power/performance implications on an application with an imbalanced threading model. For this study, a sample game physics engine was created (using Microsoft DirectX*). The sample application has two parts: 1) Physics Computation (collision detection and resolution for graphics objects) and 2) Rendering (updated positions are drawn onto screen). The design of the application was deliberate so that balanced and imbalanced threading could be studied for a CMP (chip multi-processing) processor. Briefly:
The intent behind creating two multithreaded implementations here is to evaluate power/performance impact when an imbalanced threading model is used, as compared to a balanced threading model.
With the two implementations, performance data in different power schemes (MaxPerf, Adaptive, and Adaptive with GV3 fix) are shown in Figure 4.
As indicated in the first two sets in the graph (Figure 4), performance of the imbalanced multithreaded (Imbalanced-MT) implementation degrades from ~64 seconds in MaxPerf mode to ~120 seconds in Adaptive mode. However, with the GV3 fix from Microsoft, the Imbalanced-MT implementation completes in less time than ST, as it should. In all cases, Balanced-MT has better performance than Imbalanced-MT.
Figure 4: Imbalanced Threading Performance
Figure 5 shows the results for the measurements of platform power consumption. The power measurements discussed here were normalized using a technique mentioned in the Balanced Threading Model section. As expected, platform power consumption increases with a reduction in the performance in Adaptive (PL)-Default mode. Since the Imbalanced-MT workload now takes much longer to finish, the performance degradation causes an increase in power consumption. With the GV3 fix in place, the improvements in performance yield corresponding improvement in total platform power consumption.
The third set in Figure 4 indicates data with a kernel hotfix. In this case, imbalanced-MT implementation in Adaptive (PL) mode shows similar power/performance data as that of MaxPerf (AO) mode. With this fix, processors run at optimum frequency, not causing degradation in Adaptive (PL) mode. Platform power data with the fix is shown in Figure 5 in the Adaptive (PL)-with GV3 Fix column.
Figure 5: Imbalanced Threading - Platform Power
These results indicate that an imbalanced threading model or under-utilized CPU may cause degradation in performance, which in turn increases power consumption. We recommend using a balanced threading model when multithreading applications. Thread imbalance can be identified by tracking individual thread run-time and processor utilization counters with tools such as Microsoft Perfmon*, the timeline view offered by Intel® VTune™, and the Intel® Threading Tools (such as Intel® Thread Checker and Intel® Thread Profiler).
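Once per-thread run-times have been collected with a profiler, imbalance can be summarized with a single ratio, and work can be repartitioned evenly. The sketch below is illustrative only: the function names, the ratio definition, and the sample busy times are assumptions, not part of the study.

```python
def imbalance_ratio(busy_seconds):
    """Busiest thread's time divided by the mean; 1.0 means perfectly balanced."""
    if not busy_seconds or min(busy_seconds) < 0:
        raise ValueError("need at least one non-negative busy time")
    mean = sum(busy_seconds) / len(busy_seconds)
    if mean == 0:
        return 1.0
    return max(busy_seconds) / mean


def split_evenly(total_items, workers):
    """Partition item counts so no worker gets more than one extra item."""
    base, extra = divmod(total_items, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


# Hypothetical per-thread busy times in seconds, as a profiler might report.
print(imbalance_ratio([40.0, 8.0]))   # one thread does most of the work
print(imbalance_ratio([24.0, 24.0]))  # balanced: 1.0
print(split_evenly(1001, 2))          # [501, 500]
```

A ratio well above 1.0 suggests that one thread is on the critical path while the others wait, which is the situation the imbalanced implementation above was built to reproduce.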
Multitasking Scenarios with one Application Affinitized to Single Core
One of the common usage scenarios for PC users is multitasking: running multiple applications simultaneously. To understand the performance and power impact of running two applications simultaneously, an experiment was conducted with two office productivity applications running concurrently. Since scheduling the two applications using different techniques is likely to show power/performance impacts, the following scenarios were examined:
These scheduling configurations were chosen to identify if a certain scheduling mechanism favors both power and performance as compared to the others.
Performance data for these configurations is shown in Figure 6. As indicated in the graph, there is no significant performance difference observed with different scheduling configurations.
Figure 6: Multitasking Scenario Performance
As shown in Figure 7, the second scenario, in which both applications are affinitized to individual cores, demonstrates slightly higher CPU power consumption as compared to Windows XP scheduling.
A similar impact is seen (Figure 8) on platform power consumption, where hard-affinitizing both the applications to each core shows slightly higher platform power consumption as compared to letting Windows XP do the scheduling.
Figure 7: Multitasking Scenario - CPU Power
Figure 8: Multitasking Scenario - Platform Power
The following can be deduced from the threading tests described in this section:
Another way to achieve better computational efficiency and performance is to use an optimizing compiler, performance libraries, and/or make use of advanced instruction sets.
Many compilers now support OpenMP* directives, which let the developer mark regions of code for parallel execution. OpenMP is a portable, scalable model for developing parallel applications. For example, the Intel® Professional Edition Compilers offer support for creating multi-threaded applications. Features include advanced optimization, multi-threading, and processor support that includes processor dispatch, vectorization, auto-parallelization, OpenMP, data prefetching, and loop unrolling, along with highly optimized libraries. Microsoft Visual Studio 2005 also includes the OpenMP portable threading library.
Performance libraries such as Intel® Integrated Performance Primitives (IPP) contain highly optimized algorithms and code for common functions such as video/audio encode/decode, cryptography, speech encoding and recognition, computer vision, and signal processing. The Intel® Math Kernel Library provides highly efficient algorithms for many math and scientific functions, including fast Fourier transforms, linear algebra, and vector math.
Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored to certain applications. For example, the Intel Streaming SIMD Extensions (SSE) 4.1 is a set of new instructions tailored for applications that involve graphics, video encoding and processing, 3D imaging, gaming, web servers, and application servers. Optimizing with these instructions will deliver increased performance and energy efficiency to a broad range of 32-bit and 64-bit applications. For more detail on SSE 4.1 and Application Targeted Accelerators, see [Ref14].
 It is interesting to note that this is not always true. Due to the quadratic relationship between processor states and voltage, it can be demonstrated that a process running for a longer time at a lower P-state may actually use less total energy than running the same process at a high P-state for less time. This is an area of future research.
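The effect described in this note can be illustrated with the common CMOS approximation for dynamic power, P = C * V² * f. The sketch below is a simplified model, not the paper's analysis: it ignores leakage, idle power, and the rest of the platform, and the capacitance, voltages, and frequencies are hypothetical values chosen for illustration.

```python
def dynamic_energy_joules(capacitance_farads, voltage, frequency_hz, cycles):
    """Dynamic switching energy for a fixed amount of work.

    Assumes P = C * V^2 * f and ignores leakage, idle power,
    and every platform component other than the processor.
    """
    power_watts = capacitance_farads * voltage ** 2 * frequency_hz
    run_time_seconds = cycles / frequency_hz
    return power_watts * run_time_seconds


# Hypothetical P-states; voltages and frequencies are illustrative only.
cycles = 2e9
high = dynamic_energy_joules(1e-9, voltage=1.3, frequency_hz=2.0e9, cycles=cycles)
low = dynamic_energy_joules(1e-9, voltage=1.0, frequency_hz=1.0e9, cycles=cycles)
print(high, low)
```

In this model the frequency cancels out and energy for a fixed cycle count depends only on V², so the lower P-state takes twice as long yet uses less dynamic energy. Whether that holds on a real platform depends on the idle and platform power that the model leaves out.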
 GV3 is a Microsoft hotfix (KB896256) that changes the kernel power manager to track CPU utilization across the entire package instead of individual cores. It resolves an issue in which the power manager incorrectly calculated the optimal target performance state for the processor when one core was much less busy than the others: the performance state was set too low and performance suffered in adaptive mode.
 More detail on this study can be obtained from: DVD Playback Power Consumption Analysis
 More details from this analysis are available at: Power Analysis of Disk I/O Methodologies
 More detailed coverage of this topic can be found at: Data Transfer over Wireless LAN Power Consumption Analysis
 For complete details of this study, please see: Enabling Games for Power
 For details on Extech Power Analyzers, see: http://www.extech.com/instrument/products/310_399/380803Power.html