An updated version of this white paper can be found at: /en-us/articles/developing-green-software
Creating Energy-Efficient Software [PDF 2MB]
This paper examines software methodologies, designs, and development tools that can be used to improve the energy efficiency of application software and extend mobile platform battery time. Computational efficiency, data efficiency, and context-aware methods can all contribute to creating power-aware applications. Many additional resources are available in the form of white papers, developer kits, and analysis tools; these are referenced throughout the paper and collected in the References section.
For years, mobile platform vendors have sought ways to extend battery life. Battery technologies have gradually improved, processors have gained new low-power states, and displays have dramatically reduced their power consumption, but there is still room for improvement. Software can play an important role in reducing the power used on mobile platforms and extending battery time.
The purpose of this paper is to explain the software methodologies and designs that can be used today to save energy and extend mobile platform battery time as well as describe various tools that support the development of energy-efficient software.
The methodologies described here have been researched and tested by Intel Software Application Engineers. In each case, we document the results of the experiments and provide a reference to more detailed information. For a look into the typical test environment and the mechanisms for measuring total platform power, see Appendix A.
The remainder of the paper covers the following topics:
- Background – some fundamentals about power, energy, and the platform power profile
- Computational Efficiency – methods to reduce energy costs by improving application performance
- Data Efficiency – methods to reduce energy costs by minimizing data movement and using the memory hierarchy effectively
- Context Awareness – enabling applications to make intelligent decisions at runtime based on the current state of the platform
- Operating Systems – how to take advantage of the resources offered by the OS to save energy
- Tools and Technologies – support for creating energy-efficient applications
Joule – the international standard unit of energy measurement
Energy – The conventional definition of energy is the “capacity to do work”. A device that is energy-efficient requires less energy for its “work” or task than its energy-inefficient counterpart. For this paper, we use the term to mean the number of Joules required to carry out a specific task. For example, the energy required to lift a 100 gram object 1 meter against the pull of earth’s gravity is about 1 Joule.
Power – the amount of energy consumed per unit of time, typically measured in Watts, where one Watt equals 1 Joule per second. For example, a light bulb rated at 60 Watts consumes 60 Joules in one second. Notebook computers running at their highest energy state are rated at 40 to 60 Watts, but on average consume far less.
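These definitions can be made concrete with a short calculation. The sketch below (ours, not from the paper) computes energy in Joules from power and time, and converts Joules to milliwatt-hours (mWHr), the unit used for the measurements later in this paper; the 60-Watt bulb is the example from above.

```python
# Energy (Joules) = power (Watts) x time (seconds); 1 mWHr = 3.6 Joules.

def energy_joules(power_watts, seconds):
    """Energy consumed by a constant load."""
    return power_watts * seconds

def joules_to_mwhr(joules):
    """Convert Joules to milliwatt-hours (1 Wh = 3600 J, so 1 mWh = 3.6 J)."""
    return joules / 3.6

# A 60 W bulb consumes 60 Joules in one second.
print(energy_joules(60, 1))                      # 60
# One hour of that bulb: 216,000 J, i.e. 60,000 mWHr (60 Wh).
print(joules_to_mwhr(energy_joules(60, 3600)))   # 60000.0
```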
Heat – more specifically, resistive heat – is a natural by-product of running current through a conductor. Engineers strive to minimize heat in computer design; too much heat requires more cooling (typically by a fan), which in turn requires more energy.
While we strive to use the terms energy and power appropriately, there may be instances where they are used synonymously.
Platform Power Profile
The power profile of various components on the mobile platform depends on the usage model. For example, the relative contribution of processor power to overall platform power will be significant in a CPU-intensive workload, but it will not be a dominant factor while the platform is idling. Furthermore, it may also vary depending on whether both cores were utilized (i.e., single-threaded vs. multithreaded). The following provides an idea of how the profile varies across usage models. The CPU, memory, and file system tests were run using SiSandra benchmarks (http://www.sisoftware.co.uk). Note that the platform power below does not include the LCD, since we have excluded it from our analyses. (Other platform components include WLAN, HD-Audio, mini-card, ICH, and other peripherals.)
As seen above, mobile developers need an understanding of how power draw varies with the usage model so they can target specific components for extending battery life and conserving power.
The goal of computational efficiency is to complete a task more quickly. Intuition tells us that if the CPU can accomplish the task in fewer instructions or by working in parallel on multiple cores, and then drop to a low-power state, the overall energy required to complete the task will be lower. One approach is to use the best algorithms and data structures for the particular problem. Another, for which we present research results below, is to take advantage of the performance-per-watt advantages of Intel multi-core processors and use multithreading fully to increase application performance and save energy.
Algorithms and data structures are a long-standing area of research in computer science. Considerable effort has gone into finding more efficient means to solve problems and into investigating and documenting the corresponding time and space tradeoffs. While optimizing specific algorithms is not an area of interest for our team per se, we can conclude from computer science theory that the choice of algorithms and data structures can make a vast difference in the performance of an application. All other things being equal, an algorithm that computes a solution in O(n log n) time is going to perform better than one that does the job in O(n²) time. For a particular problem, a stack may be better than a queue, and a B-tree may be better than a binary tree or a hash table. The best algorithm or data structure to use depends on many factors, which indicates that a study of the problem and careful consideration of the architecture, design, algorithms, and data structures can lead to an application that performs better and consumes less energy. For a detailed study of the analysis of algorithms see [7,8,9].
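As an illustrative sketch (ours, not from the cited references), the two routines below solve the same problem, detecting a duplicate in a list, in O(n²) and O(n log n) time. On large inputs the asymptotic difference dominates, and the reduction in CPU time translates directly into reduced energy.

```python
def has_duplicate_quadratic(items):
    # O(n^2): compare every pair of elements.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_nlogn(items):
    # O(n log n): sort once, then scan adjacent elements.
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

# Both agree on the answer; only the work required differs.
print(has_duplicate_quadratic([3, 1, 4, 1]))  # True
print(has_duplicate_nlogn([3, 1, 4, 1]))      # True
```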
For mobile platforms, power consumption has always been a major concern. A multithreaded application may be able to finish the job at hand faster than a single-threaded one. The resulting boost in performance may yield power savings, since system resources are used for less time than with the single-threaded version.
Multithreading an application introduces other considerations, such as the effects on power/performance when the threads are imbalanced (one thread does significantly more work than the others), when the threads differ in CPU utilization (for example, one thread might consume 100 percent of a CPU while the others consume 10–20 percent), and when the threads are affinitized to a single core rather than running on separate cores. This research investigated such issues with a wide variety of multithreaded applications and multitasking scenarios, and proposes recommendations that should be considered when multithreading an application.
All the tests described here were conducted on dual-core Intel Core Duo engineering sample systems with the (code-name) Napa platform. Power measurements were made with a Fluke NetDAQ®.
A variety of applications (single-threaded and multithreaded implementations), along with test kernels developed in-house, are characterized here for power/performance measurements. These applications include a variety of content creation applications, kernels from the gaming space, kernels using Intel® Integrated Performance Primitives (IPP), and office productivity applications.
Single-threaded and multithreaded implementations of the applications discussed here were tested with two power schemes that Microsoft Windows* XP provides when coupled with Intel SpeedStep® technology: MaxPerf or Always-On (AO) mode and Adaptive or Portable-Laptop (PL) mode. AO mode provides maximum available frequency while PL mode adjusts the frequency to conserve energy.
In the results reported below, the following threading models were tested:
- Data Domain Decomposition: the available data set is divided into separate parts, and each thread works on its individual portion.
- Functional Domain Decomposition: each thread works on separate functionality pieces/sections of code within an application.
- Balanced Threading: each thread has an equal amount of work as other active threads of the application.
- Imbalanced Threading: there is a significant difference in the amount of work done by each thread within an application.
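A minimal sketch of balanced data domain decomposition follows. Python's threading module stands in here for the native threads used in the experiments, and the function names are ours:

```python
import threading

def balanced_sum(data, n_threads=2):
    """Data domain decomposition: split the data set into equal
    chunks and give each thread its own portion (balanced threading)."""
    chunk = (len(data) + n_threads - 1) // n_threads
    partial = [0] * n_threads

    def worker(idx):
        start = idx * chunk
        partial[idx] = sum(data[start:start + chunk])

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A single synchronization point at the end; the threads
    # otherwise run independently on equal amounts of work.
    return sum(partial)

print(balanced_sum(list(range(100))))  # 4950
```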
The graphs in this section present power/performance results on an Intel Core Duo engineering sample system running Windows* XP. Time is expressed in seconds. Power measurements were made with a Fluke NetDAQ, which reports average power in Watts (W); this is then converted to total energy (mWHr) using application run-time data.
The graph in Figure 1 indicates performance data for running single-threaded (ST) and multi-threaded (MT) versions of several CPU-intensive applications. Cryptography and video encoding applications have two MT implementations and results are indicated as MT-1 and MT-2. For content creation applications, multithreading is done with only one implementation, indicated as MT-1. The multithreaded applications clearly show significant performance improvements over running single-threaded versions. For example, the ST version of cryptography takes ~50 seconds to complete, while both the MT-1 and MT-2 versions take only ~25 seconds.
Figure 1: Balanced Threading Performance
|Figure 2: Balanced Threading - CPU Power (Adaptive)||Figure 3: Balanced Threading - Platform Power (Adaptive)|
Figures 2 and 3 indicate CPU power and total platform power for Adaptive (Portable/Laptop) mode, respectively. Adaptive mode is chosen because it favors power conservation by dynamically changing CPU frequency on demand. For each application run (ST and MT), power data-gathering is normalized to the longest run time. For example, as indicated in Figure 1, the cryptography workload runs for ~50 seconds in ST mode and ~25 seconds in MT mode; the power data is measured over 50 seconds in both cases. The intent is to determine whether power can be saved by finishing the CPU-intensive task faster (as with MT) and going to an idle state for the remaining duration.
As indicated in Figure 2, power savings are achieved by finishing the job faster (MT) and idling for the remainder of the ST run time. This indicates that multithreading done correctly not only improves performance but also saves power. For example, the cryptography ST version running for ~50 seconds consumes ~150 mWHr of energy, while running the cryptography MT version for ~25 seconds and idling the system for the remaining 25 seconds consumes ~110 mWHr. Hence, multithreading helps save power. The graph in Figure 3 shows the effect of multithreading on total platform power: running a multithreaded version of an application consumes less platform power than running the single-threaded version.
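The normalization arithmetic can be sketched as follows. The active and idle wattages below are illustrative assumptions chosen to reproduce the approximate figures quoted above; they are not measured values from the experiments:

```python
def total_energy_mwhr(active_watts, active_s, idle_watts, idle_s):
    """Energy over the normalized measurement window: an active
    phase plus an idle remainder. 1 mWHr = 3.6 Joules."""
    joules = active_watts * active_s + idle_watts * idle_s
    return joules / 3.6

# ST: busy for the full 50 s window (assumed ~10.8 W average) -> 150 mWHr.
st = total_energy_mwhr(10.8, 50, 0.0, 0)
# MT: busy 25 s (assumed ~13 W), then idle 25 s (assumed ~3 W) -> ~111 mWHr.
mt = total_energy_mwhr(13.0, 25, 3.0, 25)
print(st, mt)  # 150.0 ~111.1
assert mt < st  # finishing faster and idling saves energy overall
```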
In this section, we examine the power/performance implications of an imbalanced threading model. For this study, a sample game physics engine was created (using Microsoft DirectX*). The sample application has two parts: 1) physics computation (collision detection and resolution for graphics objects) and 2) rendering (updated positions are drawn onto the screen). The application was deliberately designed so that balanced and imbalanced threading could be studied on a CMP (chip multi-processing) processor. Briefly:
- Balanced: For this implementation, graphical objects (and background imagery) were divided into two parts, and each thread takes care of the collision detection and resolution of its own set of objects.
- Imbalanced: In this implementation, one thread was tasked with performing collision detection and resolution for the colliding objects, while the other thread calculated the updated positions. The result was the desired goal of the first thread being more CPU-intensive than the second thread.
The intent behind creating two multithreaded implementations here is to evaluate power/performance impact when an imbalanced threading model is used, as compared to a balanced threading model.
With the two implementations, performance data in different power schemes (MaxPerf, Adaptive, and Adaptive with GV3 fix) are shown in Figure 4.
As indicated by the first two sets in the graph above (Figure 4), performance of the imbalanced multithreaded (Imbalanced-MT) implementation degrades from ~64 seconds in MaxPerf mode to ~120 seconds in Adaptive mode. However, with the GV3 fix from Microsoft, the Imbalanced-MT implementation completes in less time than ST, as it should. In all cases, Balanced-MT has better performance than Imbalanced-MT.
Figure 4: Imbalanced Threading Performance
Figure 5 shows the results for the measurements of platform power consumption. The power measurements discussed here were normalized using a technique mentioned in the Balanced Threading Model section. As expected, platform power consumption increases with a reduction in the performance in Adaptive (PL)-Default mode. Since the Imbalanced-MT workload now takes much longer to finish, the performance degradation causes an increase in power consumption. With the GV3 fix in place, the improvements in performance yield corresponding improvement in total platform power consumption.
The third set in Figure 4 indicates data with a kernel hotfix. In this case, imbalanced-MT implementation in Adaptive (PL) mode shows similar power/performance data as that of MaxPerf (AO) mode. With this fix, processors run at optimum frequency, not causing degradation in Adaptive (PL) mode. Platform power data with the fix is shown in Figure 5 in the Adaptive (PL)-with GV3 Fix column.
Figure 5: Imbalanced Threading - Platform Power
These results indicate that an imbalanced threading model with an under-utilized CPU may cause performance degradation, and thereby increased power consumption. We recommend using a balanced threading model when multithreading applications. Thread imbalance can be identified with tools like Microsoft Perfmon*, the timeline view offered by Intel® VTune™, and the Intel® Threading Tools (such as Intel® Thread Checker and Intel® Thread Profiler), which track individual thread run times and processor utilization counters.
Multitasking Scenarios with one Application Affinitized to Single Core
One of the common usage scenarios for PC users is running multiple applications simultaneously: multitasking. To understand the performance and power impact of running two applications at the same time, an experiment was conducted with two office productivity applications running concurrently. Since scheduling the two applications using different techniques is likely to show power/performance impacts, the following scenarios were examined:
- Microsoft Windows XP scheduling both applications using its scheduling algorithm (no affinitization).
- Each application hard-affinitized to its own core: application 1 runs on core 0 and application 2 runs on core 1.
- One of the applications hard-affinitized to Core 0 while Windows XP schedules the other application.
These scheduling configurations were chosen to identify if a certain scheduling mechanism favors both power and performance as compared to the others.
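For reference, hard affinitization looks like this at the API level. The experiments above used Windows XP, where the comparable call is SetProcessAffinityMask; the sketch below uses the Linux equivalent exposed by Python, since that can be shown self-contained:

```python
import os

def pin_to_core(core_id):
    # Hard-affinitize the calling process (pid 0) to a single core.
    # Linux analogue of the Win32 SetProcessAffinityMask call.
    os.sched_setaffinity(0, {core_id})

def release_affinity(cores):
    # Hand scheduling back to the OS across the given cores.
    os.sched_setaffinity(0, set(cores))

pin_to_core(0)
print(os.sched_getaffinity(0))  # {0}
release_affinity(range(os.cpu_count()))
```

As the results below show, this kind of hard pinning generally performed no better than simply letting the OS scheduler place the work.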
Performance data for these configurations is shown in Figure 6. As indicated in the graph, there is no significant performance difference observed with different scheduling configurations.
Figure 6: Multitasking Scenario Performance
As shown in Figure 7, the second scenario (affinitizing each application to its own core) demonstrates slightly higher CPU power consumption than Windows XP scheduling.
A similar impact is seen (Figure 8) on platform power consumption, where hard-affinitizing both the applications to each core shows slightly higher platform power consumption as compared to letting Windows XP do the scheduling.
|Figure 7: Multitasking Scenario - CPU Power||Figure 8: Multitasking Scenario - Platform Power|
The following can be deduced from the threading tests described in this section:
- Threading done right provides performance boosts as well as power savings - As indicated by the data in the Balanced Threading Model section, a properly multithreaded application demonstrates performance improvement as well as power savings. This includes multithreaded implementations with a minimum of imbalance and synchronization points. While multithreading an application, we recommend using a threading model in which all the threads perform an equal amount of work independently. Minimizing synchronization points between threads leads to more time spent in the parallel section, which translates to good performance improvements and power savings.
- Thread imbalance may cause performance degradation and may not provide power/performance benefits as compared to balanced threading - As discussed in the Imbalanced Threading Model section, applications with an imbalanced threading model may show less performance improvement than those with a balanced threading model, and thus consume more power than the balanced implementation. Thread imbalances may cause frequent thread migration across cores, which may result in incorrect reporting of processor activity states. This may lead processors to drop to lower frequency states in adaptive schemes (if the hotfix provided by Microsoft is not enabled) even when one of the threads is utilizing full processor resources. This issue may also occur when running single-threaded applications on a dual-core system in Adaptive mode.
- Utilize the GV3 hotfix (KB896256) from Microsoft - If a multithreaded application shows performance degradation or increased power consumption in Adaptive (PL) mode, install the GV3 hotfix from Microsoft; the issue might be that the OS is getting incorrect information about processor performance while in Adaptive mode. The GV3 hotfix tracks CPU utilization across the entire package rather than individual cores, enabling the OS to operate at the optimum frequency.
- Use OS scheduling vs. hard affinitizing - In general, for Intel Core Duo systems, it is advisable to use OS scheduling as opposed to affinitizing threads or applications. The OS scheduler will utilize any underutilized core, while hard affinitizing may potentially degrade performance since an application may need to wait for the availability of a specific processor even though other processors in the system are idle.
Compilers, Performance Libraries, and Instruction Sets
Another way to achieve better computational efficiency and performance is to use an optimizing compiler, performance libraries, and/or make use of advanced instruction sets.
OpenMP* is a portable, scalable model for developing parallel applications, and many compilers now support its directives for expressing parallelism. For example, the Intel® Professional Edition Compilers offer support for creating multithreaded applications, with features including advanced optimization, multithreading, and processor-specific support for processor dispatch, vectorization, auto-parallelization, OpenMP*, data prefetching, and loop unrolling, along with highly optimized libraries. Microsoft Visual Studio 2005 also includes the OpenMP portable threading library.
Performance libraries such as Intel® Integrated Performance Primitives (IPP) contain highly optimized algorithms and code for common functions such as video/audio encode/decode, cryptography, speech encoding and recognition, computer vision, and signal processing. The Intel® Math Kernel Library provides highly efficient algorithms for many math and scientific functions, including fast Fourier transforms, linear algebra, and vector math.
Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored to certain applications. For example, Intel Streaming SIMD Extensions (SSE) 4.1 is a set of new instructions tailored for applications that involve graphics, video encoding and processing, 3D imaging, gaming, web servers, and application servers. Optimizing with these instructions can deliver increased performance and energy efficiency to a broad range of 32-bit and 64-bit applications. For more detail on SSE 4.1 and Application Targeted Accelerators, see [14].
It is interesting to note that this is not always true. Due to the quadratic relationship between power and voltage, it can be demonstrated that a process running for a longer time at a lower P-state may actually use less total energy than the same process running at a higher P-state for less time. This is an area of future research.
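A back-of-the-envelope sketch using the standard dynamic-power model P ≈ C·V²·f, with purely illustrative voltage/frequency pairs (not measured values), shows how the slower run can win. Note that this model ignores static/leakage power, which in practice favors racing to idle:

```python
def dynamic_energy(c, volts, freq_hz, cycles):
    """Energy for a fixed amount of work under P = C * V^2 * f.
    Run time is cycles / freq_hz, so E = C * V^2 * cycles:
    frequency cancels, and voltage enters quadratically."""
    power = c * volts ** 2 * freq_hz
    seconds = cycles / freq_hz
    return power * seconds

WORK = 2e9  # cycles needed for the task (illustrative)
# High P-state: 2.0 GHz at 1.2 V -- finishes in 1 s.
e_high = dynamic_energy(1.0, 1.2, 2.0e9, WORK)
# Low P-state: 1.0 GHz at 1.0 V -- takes 2 s, yet uses less total energy.
e_low = dynamic_energy(1.0, 1.0, 1.0e9, WORK)
assert e_low < e_high
```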
GV3 is a Microsoft hotfix (KB896256) that changes the kernel power manager to track CPU utilization across the entire package instead of individual cores. It resolves an issue where the power manager incorrectly calculated the optimal target performance state for the processor when one core was much less busy than the others; the performance state was set too low and performance suffered in Adaptive mode.
 More detailed coverage of this topic can be found at: Data Transfer over Wireless LAN Power Consumption Analysis
 For details on Extech Power Analyzers, see: http://www.extech.com/instrument/products/310_399/380803Power.html