Tuning Algorithm Performance and Energy Utilization using the Intel® Power Gadget API on Mac OS X*

Application developers striving to deliver the fastest and most efficient applications have a new tool in their optimization toolkit, the Intel Power Gadget 2.5. The Intel Power Gadget is a software-based estimation tool for applications running on 2nd Generation Intel® Core™ processors. It’s available on the Microsoft Windows* and Mac* OS X operating systems. It provides real-time data on processor frequency and estimated processor package power, and an application can be profiled simply by running the Intel Power Gadget and the application concurrently. In addition, the Intel Power Gadget library exposes a C API which optimization engineers can integrate into their application for more detailed power and energy monitoring.

This paper discusses how to use the API to compare the power and energy usage of three different implementations of the same RGBA to YPbPr color space conversion algorithm (based on ITU-R BT.601). The powerGadgetExperiment sample code attempts to follow the approach that a typical optimization engineer would use to tune an application. A scalar implementation is written first, then threading is introduced, and finally threading is combined with SIMD (vector) instructions.


Setup and Testing

Prerequisites

Concurrent with the publication of this paper, Patrick Konsor has written an excellent introduction to the Intel Power Gadget C API that includes a description of each function. Follow the steps in that paper to install the kernel extension and framework on a computer running Mac OS X. After installation, the framework and header files are located in the /Library/Frameworks/IntelPowerGadget.framework folder.

If you want to integrate the Intel Power Gadget API into your application, then follow Apple Xcode procedures to add the framework to the application’s build, and include the EnergyLib.h header file in source files that use the C API.

Testing

The powerGadgetExperiment sample application was tested on an Apple MacBook* Pro with these specifications:

  • 2.7 GHz 3rd Generation Intel Core i7 processor
  • 8 GB 1600 MHz DDR3
  • Mac OS X 10.8.2

The Intel Power Gadget reports power and energy for the whole system rather than for a single process, so it is important to exit any applications that are not being monitored. The powerGadgetExperiment application accepts a single parameter that specifies the number of pixels to convert. Tests were run for 16384 (default), 32768, and 65536 pixels. For each implementation of the conversion routine, the power and energy usage of 100,000 runs are measured.

The powerGadgetExperiment Sample Application

The Power Gadget API is simple, with just a few usage guidelines to remember. Initialize the library, collect a sample before and after the algorithm of interest, and then query for power and energy data. Because the Power Gadget library and device driver consume power as well, collect samples no more often than every 50 milliseconds to minimize the contribution of Power Gadget energy use to the query results.

Calling IntelEnergyLibInitialize initializes the library: it establishes a connection to the EnergyDriver.kext kernel extension and loads Model-Specific Register (MSR) information from msrConfig.txt. Please refer to section 9.4 and Appendix B of the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide for more information on MSRs. Briefly, most IA-32 and Intel 64 processors contain model-specific registers to simplify software programming. Table B-10 of the same manual shows that the MSR_PKG_ENERGY_STATUS MSR was introduced with 2nd Generation Intel Core processors; earlier processors do not support it. The sample code uses the MSR_FUNC_POWER #define (see EnergyLib.h) in calls to GetPowerData.

The profiled workload is based on the ITU-R BT.601 conversion from gamma-corrected RGB to analog YPbPr. The input is RGBA and the output is YPbPrA. The alpha value, or opacity, is copied unmodified. The sample code has three versions that convert the same pixel data, and their results are compared to confirm correctness. The threaded versions use Apple’s Grand Central Dispatch (GCD) to manage threads on multicore systems. The GCD-threaded vectorized version uses Intel® Advanced Vector Extensions (Intel® AVX) to process multiple data operands in a single instruction.

The single threaded scalar version is implemented in scalarConvertRGB_YPbPr and is shown here:

    for (unsigned int i = 0, index = 0; i < pixelCount; i++, index += 4)
    {
        //Calculate Y
        pResults[index] = 0.299*pInputData[index] + 0.587*pInputData[index+1] + 0.114*pInputData[index+2];
        //Calculate Pb
        pResults[index+1] = -0.169*pInputData[index] - 0.331*pInputData[index+1] + 0.500*pInputData[index+2];
        //Calculate Pr
        pResults[index+2] = 0.500*pInputData[index] - 0.419*pInputData[index+1] - 0.081*pInputData[index+2];
        //Copy Alpha
        pResults[index+3] = pInputData[index+3];
    }

The GCD-threaded scalar version is implemented in gcdConvertRGB_YPbPr. Separate threads doing the same work on different pixels provide data-level parallelism. Each thread converts a different section of the input.
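The idea of giving each thread its own section of the input can be sketched as follows. This is not the gcdConvertRGB_YPbPr source: the names convertRange, convertChunk, and chunkedConvert are hypothetical, and the sketch assumes interleaved float RGBA data. On Mac OS X with blocks support it hands the chunks to GCD through dispatch_apply; elsewhere it falls back to a serial loop over the same chunks so the partitioning logic can still be exercised.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__APPLE__) && defined(__BLOCKS__)
#include <dispatch/dispatch.h>
#endif

/* Convert pixels [first, last) of an interleaved RGBA float buffer
   (assumed layout) using the same coefficients as the scalar loop. */
static void convertRange(const float *in, float *out, size_t first, size_t last)
{
    for (size_t i = first; i < last; i++) {
        const float *p = in + 4 * i;
        float *q = out + 4 * i;
        q[0] =  0.299f * p[0] + 0.587f * p[1] + 0.114f * p[2]; /* Y  */
        q[1] = -0.169f * p[0] - 0.331f * p[1] + 0.500f * p[2]; /* Pb */
        q[2] =  0.500f * p[0] - 0.419f * p[1] - 0.081f * p[2]; /* Pr */
        q[3] = p[3];                                           /* alpha copied */
    }
}

/* Convert chunk number c, clamping its bounds to the pixel count. */
static void convertChunk(const float *in, float *out, size_t pixelCount,
                         size_t chunkSize, size_t c)
{
    size_t first = c * chunkSize;
    size_t last = first + chunkSize;
    if (first > pixelCount) first = pixelCount;
    if (last > pixelCount) last = pixelCount;
    convertRange(in, out, first, last);
}

/* Hypothetical driver: split the input into chunkCount sections and
   convert each one independently. Chunks never overlap, so no locking
   is needed on the output buffer. */
void chunkedConvert(const float *in, float *out, size_t pixelCount,
                    size_t chunkCount)
{
    assert(chunkCount > 0);
    size_t chunkSize = (pixelCount + chunkCount - 1) / chunkCount; /* round up */

#if defined(__APPLE__) && defined(__BLOCKS__)
    /* GCD runs the block once per chunk index, possibly in parallel,
       and returns after all iterations have finished. */
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(chunkCount, queue, ^(size_t c) {
        convertChunk(in, out, pixelCount, chunkSize, c);
    });
#else
    /* Serial fallback with identical partitioning. */
    for (size_t c = 0; c < chunkCount; c++)
        convertChunk(in, out, pixelCount, chunkSize, c);
#endif
}
```

A reasonable starting point for chunkCount is a small multiple of the number of logical cores; chunks that are too small let dispatch overhead dominate, which matters when the goal is to reduce energy as well as run time.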

The gcdVectorConvertRGB_YPbPr function has the GCD-threaded Intel® AVX implementation.

Single Instruction Multiple Data (SIMD) Vectorization Primer

A vector, or SIMD-enabled, processor can execute an operation on multiple data operands in a single instruction. An operation performed on a single number by another single number to produce a single result is a scalar process. An operation performed simultaneously on N numbers to produce N results is a vector process (N > 1). This technology is available on Intel processors and on compatible, non-Intel processors that support SIMD or AVX instructions. The process of converting an algorithm from a scalar to a vector implementation is called vectorization.

The gcdVectorConvertRGB_YPbPr function extends the parallelism introduced in gcdConvertRGB_YPbPr by adding SIMD parallelism through vectorized C intrinsics. Each iteration of the for-loop performs a single read of four RGBA pixels and produces four YPbPrA results. The combination of loop unrolling and Intel AVX produces results more efficiently.
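As an illustration of how the per-pixel arithmetic maps onto Intel AVX intrinsics, here is a minimal sketch. It is not the gcdVectorConvertRGB_YPbPr source: the function name vectorConvert is hypothetical, the data is assumed to be interleaved float RGBA, and the sketch handles two pixels per 256-bit register (unrolling the loop body once more would give four pixels per iteration, as described above). When the compiler is not targeting AVX, only the scalar tail loop runs.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Scalar conversion of one RGBA pixel; used for the tail and as fallback. */
static void convertPixelScalar(const float *p, float *q)
{
    q[0] =  0.299f * p[0] + 0.587f * p[1] + 0.114f * p[2]; /* Y  */
    q[1] = -0.169f * p[0] - 0.331f * p[1] + 0.500f * p[2]; /* Pb */
    q[2] =  0.500f * p[0] - 0.419f * p[1] - 0.081f * p[2]; /* Pr */
    q[3] = p[3];                                           /* alpha */
}

/* Hypothetical vectorized converter for interleaved float RGBA (assumed). */
void vectorConvert(const float *in, float *out, size_t pixelCount)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;

#if defined(__AVX__)
    /* Per-channel coefficients laid out as [Y, Pb, Pr, unused] and
       repeated in both 128-bit lanes (one pixel per lane). */
    const __m256 cR = _mm256_setr_ps(0.299f, -0.169f,  0.500f, 0.0f,
                                     0.299f, -0.169f,  0.500f, 0.0f);
    const __m256 cG = _mm256_setr_ps(0.587f, -0.331f, -0.419f, 0.0f,
                                     0.587f, -0.331f, -0.419f, 0.0f);
    const __m256 cB = _mm256_setr_ps(0.114f,  0.500f, -0.081f, 0.0f,
                                     0.114f,  0.500f, -0.081f, 0.0f);

    for (; i + 2 <= pixelCount; i += 2) {
        /* One load brings in two RGBA pixels (8 floats). */
        __m256 px = _mm256_loadu_ps(in + 4 * i);

        /* Broadcast R, G and B across each pixel's own lane. */
        __m256 r = _mm256_permute_ps(px, 0x00);
        __m256 g = _mm256_permute_ps(px, 0x55);
        __m256 b = _mm256_permute_ps(px, 0xAA);

        /* acc = R*cR + G*cG + B*cB gives Y, Pb, Pr in elements 0..2. */
        __m256 acc = _mm256_add_ps(
            _mm256_add_ps(_mm256_mul_ps(r, cR), _mm256_mul_ps(g, cG)),
            _mm256_mul_ps(b, cB));

        /* Take element 3 of each lane (alpha) unchanged from the input. */
        acc = _mm256_blend_ps(acc, px, 0x88);

        _mm256_storeu_ps(out + 4 * i, acc);
    }
#endif

    /* Remaining pixels (or all pixels when AVX is unavailable). */
    for (; i < pixelCount; i++)
        convertPixelScalar(in + 4 * i, out + 4 * i);
}
```

Building with an AVX-enabling compiler flag (for example -mavx with Clang or GCC) selects the intrinsic path. Unaligned loads and stores are used here so that the caller is not required to provide 32-byte-aligned buffers.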

Test Results

Results are shown for 100,000 runs of each conversion function for 16384, 32768, and 65536 pixels. The test results show that the GCD-threaded vectorized version performs the color format conversion faster while using less energy than the two scalar functions. The run time results are shown in Table 1.

Table 1 - Conversion Run Time (Seconds, 100,000 runs)

| Pixel Count | Scalar Runtime | Threaded Scalar Runtime | Threaded Vector Runtime |
|---|---|---|---|
| 16384 | 42.2 | 12.6 | 2.4 |
| 32768 | 84.8 | 24.0 | 3.1 |
| 65536 | 168.6 | 46.5 | 4.9 |

The GCD-threaded vector version completes the conversion in the shortest amount of time, with speedups over the scalar version ranging from about 17x to 34x. Each speedup is calculated by dividing the run time of the slower (baseline) version by the run time of the faster (optimized) version. The speedup results are shown in Table 2.

Table 2 - Speedup Results

| Pixel Count | Threaded scalar to scalar | Threaded vector to scalar | Threaded vector to threaded scalar |
|---|---|---|---|
| 16384 | 3.35 | 17.53 | 5.24 |
| 32768 | 3.53 | 27.18 | 7.70 |
| 65536 | 3.65 | 34.42 | 9.46 |
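The speedup calculation can be written as a small helper. The function name speedup is hypothetical and is not part of the sample code.

```c
#include <assert.h>

/* Speedup = baseline run time / optimized run time.
   A value above 1 means the optimized version is faster. */
double speedup(double baselineSeconds, double optimizedSeconds)
{
    assert(baselineSeconds > 0.0 && optimizedSeconds > 0.0);
    return baselineSeconds / optimizedSeconds;
}
```

For example, speedup(168.6, 4.9) applied to the 65536-pixel row of Table 1 gives roughly 34.4. Recomputing from the rounded run times in Table 1 does not reproduce every entry of Table 2 exactly (42.2 / 2.4 is about 17.58, while the table lists 17.53), presumably because the published speedups were derived from unrounded measurements.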

The data shows that the threaded vectorized version performs the fastest conversion. Optimization engineers are also interested in power and energy use. The average power and cumulative energy results are shown in Table 3.

Table 3 - Average Power and Energy Use (100,000 runs)

| Pixel Count | Scalar Power (W) | Threaded Scalar Power (W) | Threaded Vector Power (W) | Scalar Energy (mWh) | Threaded Scalar Energy (mWh) | Threaded Vector Energy (mWh) |
|---|---|---|---|---|---|---|
| 16384 | 17.98 | 36.70 | 46.29 | 211.37 | 128.84 | 31.04 |
| 32768 | 18.53 | 36.74 | 44.23 | 437.45 | 245.64 | 38.42 |
| 65536 | 18.24 | 36.28 | 43.52 | 855.88 | 468.12 | 59.35 |

The Intel Power Gadget C API reports average power in watts, and cumulative energy consumption in milliwatt-hours and joules. The results show that the scalar version draws the least average power, but because it runs the longest it consumes the most energy. The cumulative energy is the more important figure because that energy comes from the battery or power source. Longer battery life means that a mobile worker can work longer between battery charges. Figure 1 shows a graphical representation of the cumulative energy consumed to convert pixels.

Figure 1 - Cumulative Energy Consumption

Figure 2 compares the energy cost to convert a pixel for each implementation of the algorithm. Smaller is better.

Figure 2 - Energy Consumption per Pixel Conversion
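The relationship between the power, run time, and energy columns, and the per-pixel cost plotted in Figure 2, can be sketched with a few helpers. The function names are hypothetical and the per-pixel unit (nanojoules) is an assumption chosen for readability; the figure's own unit is not stated here. The only constants used are definitional: 1 W for 1 s is 1 J, and 1 mWh is 3.6 J.

```c
#include <assert.h>

/* Energy in joules from average power (W) and elapsed time (s). */
double energyJoules(double avgPowerWatts, double seconds)
{
    return avgPowerWatts * seconds;
}

/* 1 mWh = 0.001 W * 3600 s = 3.6 J. */
double joulesToMilliwattHours(double joules)
{
    return joules / 3.6;
}

double milliwattHoursToJoules(double milliwattHours)
{
    return milliwattHours * 3.6;
}

/* Energy cost of one pixel conversion, in nanojoules, given the
   cumulative energy of `runs` conversions of `pixelCount` pixels. */
double nanojoulesPerPixel(double energyMilliwattHours,
                          double pixelCount, double runs)
{
    assert(pixelCount > 0.0 && runs > 0.0);
    double joules = milliwattHoursToJoules(energyMilliwattHours);
    return joules / (pixelCount * runs) * 1e9;
}
```

As a consistency check on the 16384-pixel scalar row, 17.98 W over 42.2 s is about 759 J, or roughly 211 mWh, which agrees with the 211.37 mWh in Table 3 to within the rounding of the published values. Using the same row, the threaded vector version's 31.04 mWh over 100,000 runs works out to roughly 68 nJ per pixel conversion.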

Conclusions

The Intel Power Gadget and C API are two powerful tools that optimization engineers can use to understand and optimize their software. Use the Intel Power Gadget to study applications and integrate the C API into applications to gather more finely tuned power and energy data. These tools help software engineers create applications that consume less energy, extend battery life, and enable mobile users to work longer and be more productive.

References

  1. Intel Power Gadget 2.5
  2. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide
  3. ITU-R Recommendation BT.601 “Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios”
  4. Grand Central Dispatch (GCD) Reference