<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Fri, 25 May 2012 08:23:50 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/parallel/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/parallel/type/technical-article/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Conserving Active Power</title>
      <description><![CDATA[ <h2 class="sectionHeading">Abstract</h2>
This article presents techniques for optimizing applications to save power during active use. These techniques include multi-threading, batching of I/O accesses, and reducing memory bandwidth. Some of the techniques apply to programs in general, while others are specific to a certain type of application or platform.<br /><br />
<h2 class="sectionHeading">Introduction</h2>
Battery life has become an increasing focus for today’s electronic devices. Many people report frustration when the battery of a small device (such as an Ultrabook™ device, laptop, cell phone, or tablet) cannot survive a typical day, leaving them searching for a charging station at inopportune times. A major contributor to these battery woes is applications that are not optimized to conserve energy. This part of the problem can be solved by reducing application energy consumption, both while the program is idle and while it is active. This paper focuses on ways for an application to optimize operations during active use.<br /><br />
<h2 class="sectionHeading">Active Power</h2>
The active period of an application is any time the program is doing useful work. “Useful work” is work that is necessary to complete an activity requested directly by the user; work done in the background is not considered useful work. Examples of an active period are a spreadsheet program calculating results or a search engine looking for matches to the user’s input. Both of these are initially requested by the user, but the program then completes the work without further user interaction. Other examples include streaming a video or running a virus scan.<br /><br />It is easy to see why conserving battery power during application idle periods makes sense, but an often overlooked area is the power that can be saved while useful work is being performed. Taking your application to the next level in low-power performance can increase the responsiveness of your program, enable a more pleasing user experience (by extending battery life), and keep the user engaged for as long as possible. It’s a win-win result!<br /><br />
<h2 class="sectionHeading">General Recommendations</h2>
Here are some suggestions to consider when looking for ways to save power during the active periods of your program. You might notice that some of these are the same as, or very similar to, those recommended for idle power savings. That makes sense, since the top energy savers are likely to apply to the entire program and not just to one particular section. The basic idea behind all of these suggestions is to look for the energy bottlenecks and then see if there is a way to reduce or eliminate them. References on each technique are collected in the References section at the end.<br /><br />
<h2 class="sectionHeading">Multi-Threading and Concurrency Maximizations</h2>
With multi-core processors becoming more and more prevalent, using all available cores improves performance and can also reduce energy use. Why work one core for a long time at or near 100% utilization if you can get the job done faster and with less energy by using all available cores? If you can maximize threading concurrency, your program will run faster and can use less energy in the process. The graph below shows the energy consumption results for three different applications when tested on a 2nd Generation Intel® Core™ processor with a Windows* 7 operating system. The media app shows the greatest improvement in its multi-threaded version over its single-threaded version, but all three programs consumed less energy when multi-threaded.<br /><br />
<p ><img src="http://software.intel.com/file/43592" /></p>
<br />
<div ><b>Energy savings of three different single-threaded vs. multi-threaded applications. Results collected on a 2nd Generation Intel® Core™ processor with a Windows 7 OS.</b><br /></div>
<br /><b><i>User to Kernel Transactions</i></b><br /><br />A leading consumer of power during an application’s active state is a high frequency of system calls. Once an application has been threaded, its threads may contend with one another when they interact with the system kernel. To reduce this overhead, synchronize the threads of a process with the Windows API call “EnterCriticalSection()”, which stays in user space when the lock is uncontended, rather than with “WaitForSingleObject()” on a kernel object such as a mutex, which transitions into kernel space on every call. When a 4-thread test application was tested under low lock contention, it showed up to 60% energy savings on a 2nd Generation Intel® Core™ processor platform running Windows 7.<br /><br />
<p ><img src="http://software.intel.com/file/43593" /></p>
<br />
<div ><b>Comparison between WaitForSingleObject() and EnterCriticalSection() in a 4-thread application. Data gathered on a 2nd Generation Intel® Core™ processor on a Windows 7 platform.</b><br /></div>
<br />One common source of these kernel-level synchronization calls is the transition between active and idle states. By batching periodic activities and avoiding frequent transitions between active and idle periods, you maximize the concurrency of your threads. Together with multi-threading, these changes will help your application do more work with less power.<br /><br />
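As a rough illustration of why batching periodic activities helps, the sketch below counts how many distinct wake-up instants a set of periodic tasks produces. The task periods, the offsets, and the helper name wakeup_instants are hypothetical and are not part of the measurements above; the point is only that aligning tasks onto a shared tick reduces the number of idle-to-active transitions.<br /><br />

```python
def wakeup_instants(tasks, horizon_ms):
    """Return the sorted, distinct instants (in ms) at which any task fires.

    tasks is a list of (period_ms, offset_ms) pairs; both values are
    illustrative assumptions, not measured data.
    """
    instants = set()
    for period_ms, offset_ms in tasks:
        # Each task fires at offset, offset + period, ... before the horizon.
        instants.update(range(offset_ms, horizon_ms, period_ms))
    return sorted(instants)


# Three 200 ms tasks that each run on their own schedule.
unaligned = wakeup_instants([(200, 0), (200, 50), (200, 120)], 1000)

# The same three tasks batched onto one shared 200 ms tick.
aligned = wakeup_instants([(200, 0), (200, 0), (200, 0)], 1000)

print(len(unaligned), "wake-ups without batching")
print(len(aligned), "wake-ups with batching")
```

In this toy model the batched schedule wakes the system a third as often over the same one-second window while performing the same number of task runs.<br /><br />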
<h2 class="sectionHeading">Minimize Interruptions and Avoid Frequent Polling</h2>
You have likely noticed how much less work you get done when you are constantly stopping to resolve one issue after another. The system that your program is running on reacts the same way. So, if you leave the system timer at a coarse resolution (a long tick interval) and avoid frequent periodic polling, you can reduce the energy lost to needless wake-ups.<br /><br />
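One way to avoid polling is to block on a notification instead of repeatedly checking a flag. The sketch below is a minimal example using Python's standard threading.Event; the function names and the shared dictionary are hypothetical, and a native Windows application would use its own event or condition primitives instead.<br /><br />

```python
import threading


def wait_for_result(ready, box):
    """Block until the producer signals, then return the stored value."""
    ready.wait()  # the thread sleeps here instead of waking up to poll
    return box["value"]


def produce(ready, box, value):
    """Store the value and wake any waiting thread exactly once."""
    box["value"] = value
    ready.set()


def demo():
    ready = threading.Event()
    box = {}
    results = []
    consumer = threading.Thread(
        target=lambda: results.append(wait_for_result(ready, box))
    )
    consumer.start()
    produce(ready, box, 42)
    consumer.join()
    return results[0]
```

The consumer thread is woken once, when the data is ready, rather than on every iteration of a sleep-and-check loop.<br /><br />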
<h2 class="sectionHeading">Optimize Frequently Used Code</h2>
This is a pretty fundamental concept. The high-use areas of code will gain the most from optimization, so start with these sections. One way to optimize this code is to use the latest vector (SIMD) instructions, such as SSE4.x, AVX, or AVX2. These instructions perform frequently used operations faster and with lower power requirements.<br /><br />
<h2 class="sectionHeading">Bundle and Save</h2>
How many times have you heard that from retail and service companies? It is as true for conserving power as it is for saving money. For power, bundling refers to batching disk I/O accesses. As shown in the table below, power savings can be achieved by increasing the disk idle time between accesses. The same applies to any action that requires the application to change its state. If you can bundle calls to resources, then the application can save power.<br /><br />
<p ><img src="http://software.intel.com/file/43594" /></p>
<br />
<div ><b>Sample Storage Power Optimization Savings </b><br /></div>
<br />
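The bundling idea can be sketched as a small buffering wrapper that collects records in memory and hands them to the storage layer in one call. The class name BatchedWriter, the batch size, and the record format are hypothetical; this is an illustration of the technique and not the code used to produce the table above.<br /><br />

```python
import io


class BatchedWriter:
    """Collect records in memory and write them to the sink in batches."""

    def __init__(self, sink, batch_size):
        self.sink = sink
        self.batch_size = batch_size
        self.pending = []
        self.flush_count = 0  # number of write calls issued to the sink

    def write(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.write("".join(self.pending))
            self.flush_count += 1
            self.pending = []


def demo_batching():
    sink = io.StringIO()  # stands in for a file on disk
    writer = BatchedWriter(sink, batch_size=100)
    for i in range(250):
        writer.write("record %d\n" % i)
    writer.flush()  # write out the final partial batch
    return writer.flush_count, sink.getvalue().count("\n")
```

Here 250 records reach the sink in three write calls instead of 250, which leaves the device idle for longer stretches between accesses.<br /><br />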
<h2 class="sectionHeading">Reducing Memory Bandwidth</h2>
Achieving memory bandwidth optimization is another way to reduce the overall power usage of your application during its active state. A few suggestions are:<br /><br />
<ul>
<li>Avoid unneeded format conversions between different graphics formats (such as YUV and RGB) for accesses by the GPU and the CPU</li>
<li>Cache frequently accessed data structures</li>
<li>Limit kernel/user space data moves</li>
</ul>
Using less power-hungry methods, and being aware of which methods reduce energy requirements, can be of benefit as well. The table below shows several Microsoft DirectX* 9 Present methods and their power usage during video playback. These figures were collected on a 2nd Generation Intel® Core™ processor platform running Windows* 7. In this example, each pixel in an HD video frame occupies 2 bytes in memory in the YUY2 format and 4 bytes in the RGB format. Hence, the memory footprint of the YUY2 format is half as large and more efficient.<br /><br />
<p ><img src="http://software.intel.com/file/43595" /></p>
<br />
<div ><b>Power Usage per DX9 Present Methods from a 2nd Generation Intel® Core™ processor platform with Windows* 7</b><br /></div>
<br />
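The footprint difference can be worked out directly. The sketch below assumes a 1920x1080 frame and 30 frames per second; both values are assumptions chosen for illustration, since the article only states that the video is HD and gives the bytes per pixel.<br /><br />

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Memory occupied by one uncompressed video frame."""
    return width * height * bytes_per_pixel


def stream_bytes_per_second(width, height, bytes_per_pixel, frames_per_second):
    """Bytes that must be moved each second to touch every frame once."""
    return frame_bytes(width, height, bytes_per_pixel) * frames_per_second


# Assumed frame geometry and rate (not stated in the article).
WIDTH, HEIGHT, FPS = 1920, 1080, 30

yuy2_frame = frame_bytes(WIDTH, HEIGHT, 2)  # YUY2: 2 bytes per pixel
rgb_frame = frame_bytes(WIDTH, HEIGHT, 4)   # RGB: 4 bytes per pixel

print("YUY2 frame:", yuy2_frame, "bytes")
print("RGB frame: ", rgb_frame, "bytes")
print("YUY2 stream:", stream_bytes_per_second(WIDTH, HEIGHT, 2, FPS), "bytes/s")
print("RGB stream: ", stream_bytes_per_second(WIDTH, HEIGHT, 4, FPS), "bytes/s")
```

Under these assumptions every pass over the video data moves twice as many bytes in RGB as in YUY2.<br /><br />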
<h2 class="sectionHeading">Additional Suggestions</h2>
In addition to the recommendations listed above, try reducing the time spent in privileged mode, and replace Sleep(0) wait-loops with the PAUSE instruction (available, for example, through the _mm_pause() intrinsic). There are also recommendations aimed at particular types of workloads. For instance, there are hardware and software suggestions for improving video conferencing power consumption, noted in the References section.<br /><br />
<h2 class="sectionHeading">Power Aware Applications Rock</h2>
While you are working on saving energy during the active state, keep in mind that applications which are aware of environment changes while running can extend the battery life of a device as well. A variety of ideas have been suggested for making your program power aware. The following list of examples is by no means exhaustive. Feel free to innovate in order to reduce energy needs. Sometimes it will take several iterations of trying different configurations to achieve an application model that balances energy and performance.<br /><br />
<ul>
<li><b>3D Games:</b> When a system indicates it is running in battery mode, cap the rendering frame rate at a lower value</li>
<li><b>Video Player:</b> When the battery level is getting low, reduce or disable filters for image and color enhancement during video playback</li>
<li><b>Any Program:</b> 
<ul>
<li>Don’t update the display when the application is minimized</li>
<li>Choose good power aware defaults</li>
<li>Delay non-critical tasks when in battery mode</li>
<li>Suggest additional ways to reduce power consumption to the user in the form of optional settings</li>
</ul>
</li>
</ul>
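The frame-rate suggestion for games can be expressed as a small policy function. The rates and the battery threshold below are hypothetical defaults, and the function name choose_frame_cap is invented for this sketch; how the power state is obtained is left to the platform (on Windows, for example, GetSystemPowerStatus reports the AC line status and battery percentage).<br /><br />

```python
def choose_frame_cap(on_battery, battery_percent,
                     full_rate=60, saver_rate=30, low_rate=20,
                     low_threshold=20):
    """Pick a rendering frame-rate cap from the current power state.

    All rates and the threshold are illustrative assumptions.
    """
    if not on_battery:
        return full_rate   # plugged in: no need to throttle
    if battery_percent > low_threshold:
        return saver_rate  # on battery: render fewer frames
    return low_rate        # battery nearly empty: throttle further


print(choose_frame_cap(on_battery=False, battery_percent=80))
print(choose_frame_cap(on_battery=True, battery_percent=80))
print(choose_frame_cap(on_battery=True, battery_percent=10))
```

Keeping the policy in one function makes it easy to expose the thresholds as optional user settings.<br /><br />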
Using any or all of the suggestions and including your own will help create a better experience for the user. And in the end, isn’t that a key reason to create the software in the first place?<br /><br />
<h2 class="sectionHeading">Conclusion</h2>
As we continue to produce devices that are capable of running all day without a recharge while performing more and more work, the demand for power-savvy applications will grow. Developers hold a pivotal role in transforming our old, energy-wasteful programs into sleek, green versions. Attention paid now to maximizing the ratio of useful work to power consumption will no doubt pay off in multiple ways. Now is the time to shake loose the old conventions and forge good power usage habits. And remember: if you don’t innovate to improve your application’s active power performance, you know full well that your competition will.<br /><br />
<h2 class="sectionHeading">About the Author</h2>
<p><img src="http://software.intel.com/file/43596"  /> Judy Hartley is a Software Applications Engineer who has been working in the Software and Services Group since 2005. She has contributed to many software products and written about her experiences through blogs and whitepapers. Recently Judy has been working on Graphics and Power tools and training for future Intel processors.</p>
<br  />
<h2 class="sectionHeading">References</h2>
Optimizing Active Power – William Cheung – Presentation: <a href="http://software.intel.com/en-us/videos/channel/power/optimizing-active-power/1532764101001">http://software.intel.com/en-us/videos/channel/power/optimizing-active-power/1532764101001</a><br /><br />Whitepaper on Energy Efficient Platforms – Considerations for Application Software &amp; Services: <a href="http://software.intel.com/file/38273">http://software.intel.com/file/38273</a><br /><br /><b><i>Effective Multi-Threading:</i></b><br />
<ul>
<li><a href="http://software.intel.com/en-us/parallel/?wapkw=parallel+programming">http://software.intel.com/en-us/parallel/?wapkw=parallel+programming</a></li>
<li><a href="http://software.intel.com/en-us/articles/getting-started-with-parallel-programming-for-multi-core/">http://software.intel.com/en-us/articles/getting-started-with-parallel-programming-for-multi-core/</a></li>
</ul>
<b><i>Kernel Transitions:</i></b><br />
<ul>
<li><a target="_blank" href="http://www.cs.arizona.edu/solar/papers/multi-call.pdf">System Call Clustering: A Profile-Directed Optimization Technique</a></li>
</ul>
<b><i>Interruptions and Polling:</i></b><br />
<ul>
<li><a href="http://software.intel.com/en-us/articles/creating-energy-efficient-software-part-2/">http://software.intel.com/en-us/articles/creating-energy-efficient-software-part-2/</a></li>
<li><a href="http://software.intel.com/en-us/articles/creating-energy-efficient-software-part-4/">http://software.intel.com/en-us/articles/creating-energy-efficient-software-part-4/</a></li>
</ul>
<b><i>Bundle and Save:</i></b><br />
<ul>
<li><a href="http://software.intel.com/en-us/articles/power-analysis-of-disk-io-methodologies/">Power Awareness of Disk I/O Methodologies</a></li>
</ul>
<b><i>Power Aware Applications:</i></b><br />
<ul>
<li><a href="http://software.intel.com/en-us/articles/power-aware-mobilized-windows-applications/?wapkw=power+aware+applications">Power-Aware Mobilized Windows Applications</a></li>
<li><a href="http://software.intel.com/en-us/articles/creating-power-aware-applications-on-linux-using-qt4/?wapkw=power+aware+applications">Creating Power Aware Applications on Linux using Qt4</a></li>
<li><a href="http://software.intel.com/en-us/articles/energy-efficient-software-developing-power-aware-apps/?wapkw=power+aware+applications">Developing Power Aware Apps</a></li>
</ul>
<b><i>Reducing Memory Bandwidth:</i></b><br />
<ul>
<li><a href="http://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses/?wapkw=reducing+memory+bandwidth">Reducing the Impact of Misaligned Memory Accesses</a></li>
<li><a href="http://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications/?wapkw=memory+bandwidth">Detecting Memory Bandwidth Saturation in Threaded Applications</a></li>
</ul>
Intel® Media SDK 2012: <a href="http://software.intel.com/en-us/articles/vcsource-tools-media-sdk/">http://software.intel.com/en-us/articles/vcsource-tools-media-sdk/</a><br /><br />CPU Power Utilization on Intel Architectures: <a href="http://software.intel.com/en-us/articles/cpu-power-utilization-on-intel-architectures">http://software.intel.com/en-us/articles/cpu-power-utilization-on-intel-architectures</a> ]]></description>
      <link>http://software.intel.com/en-us/articles/conserving-active-power/</link>
      <pubDate>Fri, 27 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/conserving-active-power/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/conserving-active-power/</guid>
      <category>Parallel Programming</category>
      <category>Power Efficiency</category>
      <category>Ultrabook</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Using Intel® Power Gadget 2.0 to measure the energy performance of a compute-intensive application </title>
      <description><![CDATA[ <p>Intel® Power Gadget 2.0 is a software-based power estimation tool enabled for Second Generation Intel® Core™ processors. It includes a Microsoft Windows* sidebar gadget, driver, and libraries to monitor and estimate real-time processor package power information in watts using the energy counters in the processor. The purpose of the gadget is to help end users, ISVs, OEMs, developers, and others estimate power more precisely at the software level without any hardware instrumentation (such as an external power meter). Additional functionality includes estimation of power on multi-socket systems as well as externally callable APIs to extract power information within sections of code. Intel Power Gadget 2.0 supports measurement both on battery and with the system plugged into an external AC power source.</p>
<p>For this article, I took a very compute-intensive parallel application that I wrote to solve instances of the logic puzzle Akari. The code uses a backtracking algorithm to explore how to place light bulbs onto a grid under constraints dictated by the rules of the puzzle and the layout of the puzzle instance. Potentially millions of independent tasks can be generated by the code as the solution space is searched by threads executing those tasks. This solution method is eminently scalable to a large number of threads and is able to keep many cores running at peak speed for a sustained amount of time.</p>
<h2 class="sectionHeading">Usage description</h2>
<p>The most common usage of the Intel Power Gadget is through the Windows® 7 Sidebar gadget component. After installation (see gadget documentation for how to do this), you simply bring up the gadget from the Windows* Gadget Gallery.</p>
<p><b>Power Gadget Interface </b></p>
<p>Once the gadget is up, you are able to monitor processor power usage when running a workload or with the system idle. The two panes displayed by the gadget are the Power Pane, which shows a running graph of the power consumed by the CPU, and the Frequency Pane, which shows a running graph of the CPU frequency. Both panes display the last 110 seconds of data collected, to give the user some recent history of power consumption and CPU frequency. You can easily tell when an application is active or sitting idle by the ups and downs of the graph line. Just above the graph portion, each pane has a text result showing the current “instantaneous” measurement of power (Watts) or frequency (GHz). Additionally, the Frequency Pane graph has a line (and text) denoting the maximum frequency of the processor on which the gadget is currently running.</p>
<p><img src="http://software.intel.com/file/43399" /></p>
<p>The size of the gadget on your screen can be altered by clicking on the Resize Button located to the right of the graph display. There are only two sizes available: the larger one can make the data easier to read, and the smaller one is useful because the gadget can’t actually be minimized.</p>
<p>You can customize the Sampling Resolution (msec) and Max Power (Watts) to be displayed in the gadget. Click on the Options Button to bring up a customization selection window. Both of these settings are modified through a slider interface. The Sampling Resolution setting affects how often Power Gadget takes a measurement of the CPU power and frequency. The update speed of the graphs in the gadget is unaffected. The Max Power setting determines the range of values shown along the vertical axis in the Power Pane. Since the maximum TDP (Thermal Design Power) rating on my workstation is 35W, I set this down to 30W to show more detail in the graphical tracking of power usage.</p>
<p><b>Measuring the application</b></p>
<p>At the top of the customization window you can designate the file location of the comma-separated values (CSV) data generated by the gadget. When you are ready to record data from an application run, click the “Start Log” button. When you are done recording data, click the “Stop Log” button. The CSV data file will be in the location you have specified.</p>
<p>After getting things set up in the Power Gadget interface, there is nothing else special that you need to do to measure the energy consumption of your application. If there is a specific portion of the application to be measured, you can start the application running and start logging data from Power Gadget when the relevant point of the execution begins; then turn off logging when the interesting portion has ended. For my Akari solver application, the entire run was of interest. I first started logging with Power Gadget, waited about 5 seconds, and then launched my application. Once the application had completed, I waited 5 seconds and then stopped logging. The extra logging time before and after the application running was to help better distinguish when the application was actually running from the log data collected. It turned out that this wasn’t necessary.</p>
<p>Upon examination of the log data in the CSV file, I found that it was quite easy to determine when the application began running. The frequency of the processor tended to run around 800 MHz when the application wasn’t running (system at idle), but was consistently 2600 MHz (or 3100 MHz when running on AC power source) during the execution of my compute-intensive application. Thus, I could quickly identify the relevant data for my application runs within the CSV file.</p>
<h2 class="sectionHeading">What data is presented</h2>
<p>As described above, the Intel Power Gadget GUI provides a graphic record of processor power (Watts) and frequency (GHz) in real-time. The last 110 seconds of data are shown in the display.</p>
<p>Within the CSV log file you will find the first three columns hold data for System Time (at each measurement point), RDTSC (ReaD Time Stamp Counter, the number of clock cycles since the CPU was powered up or reset), and Elapsed Time (seconds) from the time Power Gadget started logging to the end of logging. The next columns hold CPU Frequency (MHz), Package Power (Watts), Package Energy (Joules) and the cumulative Package Energy (milliwatt hours, mWh). These four columns will be replicated for each socket within the system. The column labels will indicate the socket from which the data was measured. For most systems there will be only one socket and CPU package involved. The column titles in that case will be suffixed with a zero (‘0’).</p>
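<p>A log like this can be post-processed with a few lines of code. The sketch below picks out the samples where the CPU frequency shows that the application was running and integrates package power over them. The header strings, the frequency threshold, and the sample rows are hypothetical; check the first line of your own log file for the exact column names.</p>

```python
import csv
import io

# Hypothetical header names; check the first line of your own log file.
TIME_COL = "Elapsed Time (sec)"
FREQ_COL = "CPU Frequency_0 (MHz)"
POWER_COL = "Package Power_0 (Watt)"


def active_energy_mwh(csv_text, active_mhz):
    """Integrate package power over samples at or above active_mhz."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    joules = 0.0
    for prev, cur in zip(rows, rows[1:]):
        if float(cur[FREQ_COL]) >= active_mhz:
            interval = float(cur[TIME_COL]) - float(prev[TIME_COL])
            joules += float(cur[POWER_COL]) * interval  # watts x seconds
    return joules * 1000.0 / 3600.0  # joules to milliwatt hours


# Made-up sample data: idle, two active samples, idle again.
SAMPLE_LOG = "\n".join([
    "Elapsed Time (sec),CPU Frequency_0 (MHz),Package Power_0 (Watt)",
    "0.0,800,3.0",
    "1.0,2600,18.0",
    "2.0,2600,18.0",
    "3.0,800,3.0",
])

print(active_energy_mwh(SAMPLE_LOG, active_mhz=2000.0), "mWh while active")
```

<p>Summing power over the sampling intervals is only an approximation; where the log already provides a cumulative energy column, subtracting its first active value from its last is the simpler route.</p>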
<h2 class="sectionHeading">Some Results</h2>
<p>The purpose of this article is not to determine the most energy-efficient scenario for running my Akari solver application. You will want to do this for your own application, though, and this article has given you enough background on Intel Power Gadget to decide whether this tool can help you quantify the current power consumption of your application. Also, as you make modifications to the application, you will be able to determine whether those changes improve the energy efficiency or cause your application to draw more power than before.</p>
<p>I did experiments when the platform was running on battery (DC) power and when plugged into the wall socket (AC) power. There was an obvious difference in execution time for the application between these two scenarios. The processor ran at a faster frequency when running on AC power and the execution times were shorter than corresponding runs on battery power. I also made measurement runs with and without Hyper-Threading Technology (HT) turned on and a number of different threads running.</p>
<p>While the absolute total energy consumed was different for the application running on DC or AC power, the relative amount of energy used as the number of threads varied was consistent between runs on battery and on the power cord. For example, when running with HT enabled and four threads on four logical cores, the speedup was just under 2.2X and the mWh measurement was 27% less than for the serial execution of the same workload. There were similarly correlated measurements between runs where HT was disabled. In all cases, when using twice as many threads as logical cores, the speedup was slightly lower and the measured mWh values were slightly higher than when using exactly as many threads as logical cores.</p>
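<p>The speedup and energy figures can be combined to see what happened to average power. Since energy is average power multiplied by time, the ratio of parallel to serial average power is the energy ratio multiplied by the speedup. The sketch below applies this to the rounded figures quoted above (a 2.2X speedup and 27% less energy); the function name is hypothetical and the result is an estimate, not a measured value.</p>

```python
def average_power_ratio(speedup, energy_ratio):
    """Return parallel average power divided by serial average power.

    energy = average power x time, so
    P_par / P_ser = (E_par / E_ser) * (T_ser / T_par).
    """
    return energy_ratio * speedup


# Rounded figures from the HT-enabled, four-thread run described above.
ratio = average_power_ratio(speedup=2.2, energy_ratio=1.0 - 0.27)
print("Average power while running, relative to serial: %.2fx" % ratio)
```

<p>In other words, the threaded run drew roughly 1.6 times the average power of the serial run, but finished so much sooner that the total energy was still lower.</p>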
<h2 class="sectionHeading">System Requirements</h2>
<p>In order to use Intel Power Gadget, your test platform must be equipped with one (single socket) or more (multi-socket) 2nd Generation Intel® Core™ processors. Previous processors are not supported. The supported operating systems include Microsoft Windows* 7, 32-bit and 64-bit versions, Microsoft Windows* Server 2008, and Microsoft Windows* Server 2008 R2 (64-bit server platforms). In addition, you will need to have installed the Microsoft* .NET Framework 4 and the Microsoft Visual C++ 2010 SP1 Redistributable package (x86 or x64 depending on OS). The presence of these last two components will be checked at installation time, and they will be downloaded as needed.</p>
<p>A functionally equivalent version of Intel® Power Gadget 2.0 is now available for Mac OS X, with some differences. Specifically, the Power Gadget GUI is an application as opposed to a desktop gadget, and it does not support multi-socket configurations. Additionally, the EzPwrLibrary API is written in Objective-C. You will need to be running Mac OS X 10.6 or later on a 2nd Generation Intel® Core™ processor or later.</p>
<h2 class="sectionHeading">Download link</h2>
<p>You can find download links within the Intel Software Network (ISN) article “Intel® Power Gadget 2.0” (<a href="http://software.intel.com/en-us/articles/intel-power-gadget/">http://software.intel.com/en-us/articles/intel-power-gadget/</a>).</p>
<h2 class="sectionHeading">Other supporting links</h2>
<p>An Intel Software Network blog on “<a href="http://software.intel.com/en-us/blogs/2012/01/21/accessing-intel-power-gadget-20-library-in-c/">Accessing Intel® Power Gadget 2.0 library in C++</a>” describes how to use the Intel Power Gadget API and libraries from C++ code.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-power-gadget-20-to-measure-the-energy-performance-of-a-compute-intensive-application/</link>
      <pubDate>Mon, 16 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-power-gadget-20-to-measure-the-energy-performance-of-a-compute-intensive-application/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-power-gadget-20-to-measure-the-energy-performance-of-a-compute-intensive-application/</guid>
      <category>Parallel Programming</category>
      <category>Power Efficiency</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>How to build GCC* cilkplus branch in 64bit Ubuntu* 12.04</title>
      <description><![CDATA[ <div id="art_pre_template"><b>Introduction</b>:</div>
<div id="art_pre_template">Intel® Cilk™ Plus is now an open source project. It is supported in GCC* 4.7 but has not yet been merged into the released GCC* 4.7.0 version. We can build the 'cilkplus' branch of GCC* to get support for the Intel® Cilk™ Plus extensions. The steps are exactly the same as for building the upstream GCC* code, so this article is aimed at those who are not familiar with building GCC and want to use Cilk Plus with GCC.<br /></div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(1) <b>Preparation</b></div>
<div id="art_pre_template">OS: Ubuntu* 12.04 LTS beta2 64bit.</div>
<div id="art_pre_template">Install the following packages before building GCC with Cilk Plus:</div>
<div id="art_pre_template">sudo apt-get install binutils<br />sudo apt-get install build-essential<br />sudo apt-get install m4<br />sudo apt-get install autogen<br />sudo apt-get install bison</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(2) <b>Install gmp</b></div>
<div id="art_pre_template">1. download gmp (gmp-4.3.2.tar.bz2) from: <a href="ftp://gcc.gnu.org/pub/gcc/infrastructure/">ftp://gcc.gnu.org/pub/gcc/infrastructure/</a></div>
<div id="art_pre_template">2. compile and install gmp:</div>
<div id="art_pre_template">
<pre name="code" class="shell">sudo mkdir -p /opt/gmp-4.3.2
tar -jxvf gmp-4.3.2.tar.bz2
cd gmp-4.3.2
./configure --prefix=/opt/gmp-4.3.2
make &amp;&amp; make check &amp;&amp; sudo make install</pre>
Notes: It is recommended to run 'make check'.</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(3) <b>Install mpfr</b></div>
<div id="art_pre_template">1. download mpfr (mpfr-2.4.2.tar.bz2) from the same location as above.</div>
<div id="art_pre_template">2. compile and install mpfr:</div>
<div id="art_pre_template">
<pre name="code" class="shell">sudo mkdir -p /opt/mpfr-2.4.2
tar -jxvf mpfr-2.4.2.tar.bz2
cd mpfr-2.4.2
./configure --prefix=/opt/mpfr-2.4.2 --with-gmp=/opt/gmp-4.3.2
make &amp;&amp; make check &amp;&amp; sudo make install</pre>
Notes: You must install gmp before installing 'mpfr' as 'mpfr' depends on 'gmp'.</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(4) <b>install mpc</b></div>
<div id="art_pre_template">1. download mpc (mpc-0.8.1.tar.gz) from the same location as above.</div>
<div id="art_pre_template">2. compile and install mpc:</div>
<div id="art_pre_template">
<pre name="code" class="shell">sudo mkdir -p /opt/mpc-0.8.1
tar -zxvf mpc-0.8.1.tar.gz
cd mpc-0.8.1
./configure --prefix=/opt/mpc-0.8.1 --with-gmp=/opt/gmp-4.3.2 --with-mpfr=/opt/mpfr-2.4.2
make &amp;&amp; make check &amp;&amp; sudo make install</pre>
Notes: 'mpc' depends on 'gmp' and 'mpfr'.</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(5) <b>build and install GCC with Cilk Plus</b></div>
<div id="art_pre_template">1. download the source code of the 'cilkplus' branch. There are several ways to download it.</div>
<div id="art_pre_template">SVN:</div>
<div id="art_pre_template">The cilkplus-4_7-branch: <a href="http://gcc.gnu.org/svn/gcc/branches/cilkplus-4_7-branch/">http://gcc.gnu.org/svn/gcc/branches/cilkplus-4_7-branch/</a><br /></div>
<div id="art_pre_template">The cilkplus branch: <a href="http://gcc.gnu.org/svn/gcc/branches/cilkplus/">http://gcc.gnu.org/svn/gcc/branches/cilkplus/</a></div>
<div id="art_pre_template">GIT:</div>
<div id="art_pre_template">The GIT mirror: <a href="http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/cilkplus">http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/cilkplus</a></div>
<div id="art_pre_template">Here, I downloaded a snapshot of the cilkplus branch from the git mirror directly through the browser. (The snapshot is 'gcc-0dfa790.tar.gz')</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">2. compile and install gcc cilkplus branch:</div>
<div id="art_pre_template">
<pre name="code" class="shell">export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/gmp-4.3.2/lib:/opt/mpfr-2.4.2/lib:/opt/mpc-0.8.1/lib
export C_INCLUDE_PATH=/usr/include/x86_64-linux-gnu &amp;&amp; export CPLUS_INCLUDE_PATH=$C_INCLUDE_PATH &amp;&amp; export OBJC_INCLUDE_PATH=$C_INCLUDE_PATH
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu

tar -zxvf gcc-0dfa790.tar.gz
sudo mkdir -p /opt/gcc-4.7-cilkplus
mkdir gcc_cilkplu_build &amp;&amp; cd gcc_cilkplu_build
../gcc-0dfa790/configure --prefix=/opt/gcc-4.7-cilkplus --with-gmp=/opt/gmp-4.3.2 --with-mpfr=/opt/mpfr-2.4.2 --with-mpc=/opt/mpc-0.8.1 --disable-multilib --enable-languages=c,c++
make -j8
sudo make install</pre>
Notes:</div>
<div id="art_pre_template">(1) It is suggested to run 'configure' in a standalone folder instead of running it in the source tree of gcc cilkplus.</div>
<div id="art_pre_template">(2) The 'make -j8' step builds the source code and takes quite a long time. You may change the argument of '-j' to suit your system, for example to the number of CPU cores.</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(6) <b>Create the 'gccvars.sh' script</b></div>
<div id="art_pre_template">This step is optional: you may create a script that sets up the environment for the newly built gcc. The following 'gccvars.sh' can serve as a reference:</div>
<div id="art_pre_template">
<pre name="code" class="shell"># filename: gccvars.sh
# 'source gccvars.sh' to set the environment of gcc
export C_INCLUDE_PATH=/usr/include/x86_64-linux-gnu:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=$C_INCLUDE_PATH
export OBJC_INCLUDE_PATH=$C_INCLUDE_PATH
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LIBRARY_PATH

export GCCDIR=/opt/gcc-4.7-cilkplus
export PATH=$GCCDIR/bin:$PATH
export LD_LIBRARY_PATH=$GCCDIR/lib:$GCCDIR/lib64:/opt/gmp-4.3.2/lib:/opt/mpfr-2.4.2/lib:/opt/mpc-0.8.1/lib:$LD_LIBRARY_PATH
export MANPATH=$GCCDIR/share/man:$MANPATH</pre>
Note: GCCDIR is the installation path of the built gcc.</div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template"><br /></div>
<div id="art_pre_template">(7) <b>Test Cilk Plus with GCC</b></div>
<div id="art_pre_template">To use Cilk Plus with gcc, add the '-lcilkrts' flag to the command line. The test case below shows how to compile the program and the output it produces:</div>
<div id="art_pre_template">
<pre name="code" class="cpp">// filename: test_cilkplus.cpp
// compile: g++ test_cilkplus.cpp -lcilkrts

#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

#include &lt;cilk/cilk.h&gt;
#include &lt;cilk/cilk_api.h&gt;

void task(int i) {
    printf("task: %d, worker id: %d\n", i, __cilkrts_get_worker_number());
    sleep(1);
}

int main() {
    for(int i=0;i&lt;10;i++)
        cilk_spawn task(i);
    cilk_spawn task(-1);
    cilk_sync;
    return 0;
}

/* compile and result:
#g++ test_cilkplus.cpp -lcilkrts
#./a.out 
task: 0, worker id: 0
task: 1, worker id: 1
task: 2, worker id: 2
task: 3, worker id: 0
task: 4, worker id: 1
task: 5, worker id: 2
task: 6, worker id: 0
task: 7, worker id: 1
task: 8, worker id: 2
task: 9, worker id: 0
task: -1, worker id: 1
#
notes: you can use 'export CILK_NWORKERS=N' to set the maximum number of workers of the Cilk Plus runtime.
*/</pre>
</div>
<div id="art_pre_template">(8) <b>Resources for the Cilk Plus Open Source Project</b></div>
<div id="art_pre_template"><a href="http://software.intel.com/en-us/articles/intel-cilk-plus/">http://software.intel.com/en-us/articles/intel-cilk-plus/</a><br /><a href="http://software.intel.com/en-us/articles/intel-cilk-plus-open-source/">http://software.intel.com/en-us/articles/intel-cilk-plus-open-source/</a><br /></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-to-build-gcc-cilkplus-brance-in-64bit-ubuntu-1204/</link>
      <pubDate>Sat, 14 Apr 2012 09:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-build-gcc-cilkplus-brance-in-64bit-ubuntu-1204/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-to-build-gcc-cilkplus-brance-in-64bit-ubuntu-1204/</guid>
      <category>Parallel Programming</category>
      <category>Open Source</category>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
    </item>
    <item>
      <title>Case Study: Parallelizing a Recursive Problem with Intel® Threading Building Blocks</title>
      <description><![CDATA[ <h2>by Louis Feng</h2>
<h2></h2>
<h2>Download Article</h2>
Download <a target="_blank" href="http://software.intel.com/file/43272">Case Study: Parallelizing a Recursive Problem with Intel® Threading Building Blocks</a> [PDF 1.1MB]<br /><br />Recently I have been working closely with DreamWorks Animation engineers to improve the performance of an existing key library in their rendering system. Using a combination of algorithmic improvements and parallelization, we achieved an overall performance improvement of 35X in some cases. One of the requirements for this project was to minimize changes to the library structure to control development cost. In this article, I will share some of the techniques I used to parallelize a recursive problem in the DreamWorks Animation rendering system.<br /><br />Before I dive into the details, let me give a quick overview of the application that I was trying to analyze and speed up. Artists use digital content creation tools to create scene assets, such as virtual cameras, lights, and 3D models. To render an image using these scene asset files, the files must first be parsed and then converted to rendering data structures before the renderer executes. This conversion step is referred to as data conditioning. Scene assets are represented by a graph. The graph has nodes representing objects like cameras, lights, and models. A node could also reference another node as an instance, for example, a forest of trees. The data conditioning step recursively transforms in-memory scene objects into the representation needed for rendering, copying all the necessary data. Data conditioning often involves a large amount of data and a large number of data objects. In practice the data conditioning cost varies widely depending on the number of objects and the complexity of objects in the scene. It's essential that this conversion operation is done quickly because it happens at the start of each rendering process. 
Hypothetically, if it takes an hour to render a frame, we don't want to spend 15 minutes in data conditioning. That would be a 25% overhead. More importantly, part of the interactive workflow, where a few seconds could make a big difference, also has to go through the data conditioning step. The main computation of the data conditioning library involved recursive graph traversal. My task was to figure out how to speed up this step as much as possible.<br /><br />
<p ><img src="http://software.intel.com/file/43240" /></p>
<b>Figure 1.</b> <i>Intel® VTune Amplifier XE Lightweight Hotspot shows the CPU utilization of the data conditioning library. The top area is showing the timeline and two green bars (each representing a thread). The active thread (second green bar) is showing CPU activities. The bottom half of the picture shows the overall CPU utilization.</i><br /><br />A few of the Intel tools and technologies, such as Intel® C++ Compiler, Intel® VTune Amplifier XE, Intel® Inspector XE, and Intel® Threading Building Blocks, were essential in achieving the performance improvements. I used Intel® VTune Amplifier XE's Lightweight Hotspot analysis to profile the library, see Figure 1. This picture is only showing the execution time inside the library, excluding all the system calls using the Module filter in VTune Amplifier XE. As shown in the time line, about 100 seconds are spent on data conditioning and overall CPU utilization is low. After some performance analysis, I found that the data conditioning library is a good candidate for parallel execution because in most cases, data objects can be processed independently.<br /><br />
<p ><img src="http://software.intel.com/file/43241" /></p>
<b>Figure 2.</b> <i>The conditioning library sequential execution call stack and corresponding computation cost. The middle column shows Self CPU time spent in a particular function. The right column shows Total CPU time, which includes Self time and the Self time of all functions that were called from that function. Here it is shown as a percentage of the total execution time of the run.</i><br /><br />Let's take a look at the call stack from the Hotspot analysis shown in Figure 2. <code>conditionObject()</code> is the main entry point for the recursive traversal. The call stack goes much deeper than what is shown in this picture. This type of recursion is called mutual recursion or indirect recursion. The <code>conditionObject()</code> method is called from many locations in the library. Almost 90% of the data transformation time is spent in <code>conditionObject()</code>, so it's a great target for parallelization.<br /><br />Intel® VTune Amplifier XE analysis provides valuable information on where the performance issues are. For example, I discovered that during a function call, an object was automatically cast into another object when passed in as a function parameter. Constructing a new instance of that class is fairly expensive, and that function was called frequently. It would be difficult to detect such issues without a tool like VTune Amplifier XE. Over the course of the project, DreamWorks Animation engineers made many algorithmic and implementation improvements (such as fixing the object casting issue) which sped up the single-threaded conditioning library performance by over 4.5X. Using TBB for parallelization, I was able to obtain an average of 6.25X additional speed up on an 8-core Xeon® system.<br /><br />Intel has many technologies for enabling parallelization: TBB, OpenMP*, Cilk Plus*, and many others. I chose the TBB library because the problem we are trying to solve is complex and TBB has already solved most of the issues involved. 
Also, DreamWorks has standardized on TBB as their primary programming model for data and task parallelism in their graphics code. The computation kernel (in this case, <code>conditionObject()</code>) can be considered as a TBB task. TBB already has thread-safe data structures and algorithms. We can leverage some of the important features of TBB, such as the high performance memory allocator, work-stealing based task scheduler, and synchronization primitives. One thing to note is that data conditioning involves allocating a large number of small objects. This is important to consider because the memory allocator could be a performance bottleneck in multithreaded applications due to synchronization. As I will show later, the TBB high performance memory allocator can help solve this problem (1).<br /><br />One great way to learn TBB is to study how it's used with design patterns (2). When I started working on this project, I considered whether I could simply apply some of the existing TBB design patterns to solve this problem. The recursion is similar to the Agglomeration and Divide and Conquer patterns, but in this case we are dealing with indirect recursion, which cannot be cleanly converted to direct recursion without changing the implementation. There are also data dependencies and thread-safety issues that need to be resolved.<br /><br />To figure out a solution, let's step back and look at the problem in a more abstract way. From the call stack, you can see that the recursive computation is called from many different locations. The number of nodes we need to transform is unknown ahead of time because nodes can create instances of other nodes. There are data dependencies between the nodes, which limit parallelism and increase complexity. Additionally, although one of the design goals of the data conditioning library is to be thread-safe, some parts of it are not. For example, the node data are accessed through a cache data structure called Context. 
This cache data structure is restricted to a single thread. Ideally, we want to enable multithreading while trying to minimize the changes to the library. More changes to the library mean increased risk and complexity for the project.<br /><br />
<p ><img src="http://software.intel.com/file/43242" /></p>
<b>Figure 3.</b> <i>(a) Shows the control flow diagram of the example recursive program. We start by visiting the root node and doing some work. For example, the work might involve allocating new objects (e.g. cameras, lights, and 3D models), and then processing and computing object data. If this node has children, we visit each child node and do some more work there. This is done recursively. (b) Shows the result of refactoring the code to prepare for parallelization.</i><br /><br />We can actually solve each of the problems independently. To remove the dependencies between the nodes, we restructure the computation in two ways. One is to do the bookkeeping of data objects up front so that a child node can be processed without blocking on its parent, see Figure 3. For example, we can keep track of all the nodes we have already visited and create an instance of the corresponding data object without actually filling in the data (which is the expensive part). This allows object instances to still reference each other. The other is to extract the dependent operations into a post-processing step. For example, when one node object requests data from another node object, this type of operation has to be moved into the post-processing stage after task synchronization.<br /><br />
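The two restructurings just described can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not the DreamWorks library code: the <code>Node</code>, <code>Object</code>, and <code>State</code> types and all function names are hypothetical, and the compute kernel is called directly at the point where a task would be spawned.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical scene node: a payload, child links, and an optional
// reference to another node whose data it needs (a data dependency).
struct Node {
    int id;
    int payload;
    std::vector<int> children;  // indices into the graph
    int dependsOn;              // index of the node we borrow data from, or -1
};

// Hypothetical conditioned object produced for each node.
struct Object {
    int value = 0;     // filled in by the compute kernel
    int borrowed = 0;  // filled in by post-processing
};

struct State {
    std::unordered_map<int, Object> objects;  // bookkeeping: node index -> object
    std::vector<int> pendingPost;             // nodes that have a data dependency
};

// Bookkeeping: create an empty placeholder object so that other nodes can
// reference it before its data exists. Returns false if already visited.
bool bookkeep(const std::vector<Node>& g, int n, State& s) {
    if (!s.objects.emplace(n, Object{}).second) return false;
    if (g[n].dependsOn >= 0) s.pendingPost.push_back(n);
    return true;
}

// Compute kernel: touches only this node's own object, so kernels for
// different nodes are independent of each other.
void computeKernel(const std::vector<Node>& g, int n, State& s) {
    s.objects.at(n).value = g[n].payload * 2;  // stand-in for the expensive work
}

// Post-processing: resolves dependencies once every kernel has finished.
void postProcess(const std::vector<Node>& g, int n, State& s) {
    s.objects.at(n).borrowed = s.objects.at(g[n].dependsOn).value;
}

void processNode(const std::vector<Node>& g, int n, State& s) {
    if (!bookkeep(g, n, s)) return;
    computeKernel(g, n, s);  // in a parallel version, a task would be spawned here
    for (int c : g[n].children) processNode(g, c, s);
}

State conditionGraph(const std::vector<Node>& g, int root) {
    State s;
    processNode(g, root, s);
    // A parallel version would wait for all tasks here (task synchronization).
    for (int n : s.pendingPost) postProcess(g, n, s);
    return s;
}
```

Because the placeholder objects exist before any data is computed, a node can hold a reference to another node's object at any time, and only the read of that object's data is deferred to the post-processing loop.<br /><br />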
<p ><img src="http://software.intel.com/file/43244" /></p>
<b>Figure 4.</b> <i>Adding TBB into the mix. Instead of executing the compute kernel directly, a TBB task is created and spawned.</i><br /><br />While we don't know the number of tasks ahead of time, using TBB we can create them recursively on the fly. This may not be the optimal way of using TBB, but the flexibility is important to us, see Figure 4. The independent part of the computation can then run in parallel. To work around the thread-safety issues of the cache, we can create an instance of the data structure for each thread. Fortunately, the cache only uses a small amount of memory. If it were infeasible to create separate cache instances for each thread, I would look into changing the cache data structure to ensure thread-safety.<br /><br /><img src="http://software.intel.com/file/43245" /><br /><b>Figure 5.</b> <i>An example of a recursive program that traverses a graph and does some computation at each node of the graph.</i><br /><br />Let's look at the source code of a simplified example program, see Figure 5. This example has a structure similar to that of the DreamWorks data conditioning library, with many details of the original library safely ignored for the purpose of this discussion. We have a simple program which builds a graph, and for each of the nodes in the graph we want to do some work through the <code>processNode()</code> function. A few parameters are used by <code>processNode()</code>: the graph node, a context, and the state. Context has a cache that's not thread-safe. The state object has everything else we need to carry around for the computation. Now we are going to make this program run in parallel.<br /><br /><img src="http://software.intel.com/file/43246" /><br /><b>Figure 6.</b> <i>Code refactoring to separate object data dependencies.</i><br /><br />If you recall, our solution for removing object data dependencies is to separate the work into multiple parts:<br /><br />
<ul>
<li>Bookkeeping to manage new objects and inter-object references.</li>
<li>Main computation kernel that processes and computes independent object data.</li>
<li>Post-processing on object data that have dependencies.</li>
</ul>
Everything else remains the same. Figure 6 shows the new structure of the code. The key is that, with these changes, all the interfaces remain intact. Any external calls to <code>processNode()</code> need not be changed.<br /><br />Now we are ready to add TBB into the mix. Since the <code>computationKernel()</code> function can be run in parallel safely, I will create a TBB task for it. Figure 7 shows the actual code to do just that. The boldface lines are new code I have added to the example program. <code>TASK_ROOT</code> is going to be the parent task of the tasks we are going to create later.<br /><br /><img src="http://software.intel.com/file/43247" /><br /><b>Figure 7.</b> <i>Added TBB code to spawn tasks for the compute kernel.</i><br /><br /><code>TASK_ROOT</code> is an <code>empty_task</code> because it doesn't actually do anything. It's important to set its reference count to 1 immediately. This lets TBB know that I am going to call <code>wait_for_all()</code> in a blocking style. Otherwise, <code>wait_for_all()</code> might return before all the tasks complete. I also have to destroy the <code>empty_task</code> explicitly when it's no longer needed. Now look at the <code>processNode()</code> function. Instead of calling <code>computeKernel()</code>, I create a <code>ComputeTask</code> as a child of our <code>TASK_ROOT</code>. The <code>allocate_additional_child_of()</code> call increases the reference count of the parent task. Then the child task is spawned.<br /><br /><img src="http://software.intel.com/file/43248" /><br /><b>Figure 8.</b> <i>The </i><code>ComputeTask</code> <i>that is run by the TBB scheduler.</i><br /><br />Figure 8 shows the implementation of <code>ComputeTask</code>, which inherits from the <code>tbb::task</code> base class. It keeps a copy of all the function parameters we passed to it so that it can continue to run when TBB schedules an instance of this task and runs <code>execute()</code>. 
In this case, the <code>computeKernel()</code> function is called with all the parameters and the proper values. While this code will run in parallel, we still have one remaining problem. Recall that <code>Context</code> is not thread-safe. So far we have one context that's shared by all the tasks and threads. We need to fix this so that we don't have race conditions. What we need to do is have an instance of <code>Context</code> for each thread. We don't want to create a context for each task because that would be too expensive.<br /><br />There are two ways to get per-thread data. One way is for each thread to find out its own unique thread ID at run time. Another way is to use TBB thread local storage (TLS). TBB TLS is essentially a container that stores per-thread data. In any given thread that's running, you can ask for your local instance of the data from this container. Each instance of the data is created only the first time a thread asks for it. For example, if your machine has 16 threads and you allocate only 5 threads to run TBB tasks, a maximum of 5 instances of the data will be created for these threads.<br /><br />The Intel® TBB team recommends using TLS rather than thread IDs for a number of reasons (3). TBB advocates task-based parallelism and wants us to stay away from exposing the underlying threads. If you know the thread ID, then you can do things that TBB was not intended for. TBB allows you to get the thread ID, but use it only if you have very good reasons. Another benefit of using thread local storage is that you don't have to worry about how many threads you are working with or what type of system you are running on.<br /><br /><img src="http://software.intel.com/file/43249" /><br /><b>Figure 9.</b> <i>Use TBB thread local storage to access data that cannot be shared between threads.</i><br /><br />Figure 9 shows how I used thread local storage. 
Instead of passing in the context when the task was created, I used the thread local context instance. I declared a thread local storage type using the TBB <code>enumerable_thread_specific</code> class. I also created an object called <code>THREAD_CONTEXT</code>, initialized with an exemplar context object, which will hold our thread local data. When <code>THREAD_CONTEXT.local()</code> is called, <code>THREAD_CONTEXT</code> first checks whether a context object has already been allocated for this thread. If it has been allocated, <code>local()</code> simply returns a reference to it. Otherwise, a new instance of the context object is created (through the copy constructor) and then returned.<br /><br />
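To make the lazy, per-thread behavior of <code>local()</code> concrete, here is a minimal standard-library stand-in. It is a sketch of the behavior described above, not TBB's implementation: the <code>PerThread</code> class, the <code>Context</code> struct, and <code>runTasks()</code> are hypothetical names, and a mutex-protected map replaces whatever TBB does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the article's Context: state that must not be
// shared between threads.
struct Context {
    int lookups = 0;
};

// Minimal per-thread container that mimics the described local() behavior.
template <typename T>
class PerThread {
public:
    explicit PerThread(const T& exemplar) : exemplar_(exemplar) {}

    // Returns the calling thread's instance, copy-constructing it from the
    // exemplar the first time this thread asks for it.
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = instances_.find(std::this_thread::get_id());
        if (it == instances_.end())
            it = instances_.emplace(std::this_thread::get_id(), exemplar_).first;
        return it->second;  // std::map never relocates elements, so this stays valid
    }

    // Number of instances created so far (one per thread that called local()).
    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return instances_.size();
    }

private:
    T exemplar_;
    std::mutex mutex_;
    std::map<std::thread::id, T> instances_;
};

// Runs the same work on several threads; each thread mutates only its own
// Context. Returns the number of Context instances that were created.
std::size_t runTasks(int threads, int tasksPerThread, PerThread<Context>& ctx) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&ctx, tasksPerThread] {
            for (int i = 0; i < tasksPerThread; ++i)
                ++ctx.local().lookups;
        });
    for (auto& th : pool) th.join();
    return ctx.size();
}
```

With TBB itself, the declaration would instead be along the lines of <code>tbb::enumerable_thread_specific&lt;Context&gt; THREAD_CONTEXT(exemplar)</code>, with each task calling <code>THREAD_CONTEXT.local()</code>; the number of instances is bounded by the number of threads that run tasks, not by the number of tasks.<br /><br />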
<p ><img src="http://software.intel.com/file/43250" /></p>
<b>Figure 10.</b> <i>RDL library performance comparison. Shot 2 is a medium size scene and our improvements resulted in 35X speed up, from the original 98.56 seconds run time down to 2.79 seconds. Legend: ST = single thread, MT-TBB = multithreaded using TBB, MT-TBB-Malloc = multithreaded using TBB and TBB malloc.</i><br /><br />In summary, I have divided the computation kernel at each node into three parts:<br /><br />
<ul>
<li>Bookkeeping keeps track of the objects and allows node objects to reference each other if needed.</li>
<li>Data dependencies are moved into the post-processing step so that the computation kernel can run independently in parallel.</li>
<li>Thread local storage is used for non-thread-safe data to avoid race conditions.</li>
</ul>
Using these techniques, I was able to speed up the data conditioning library by 6.25X on average on an 8-core Xeon system without making major changes to the library interface. One of the test data sets, shot 2, was sped up by 35X compared to the original ~100 seconds of run time, see Figure 10. Notice the speed improvements when TBB malloc is used to replace the standard malloc. I have done additional tests on shots of various sizes and the performance improvements are consistent. Of course, there is still room for improvement. This library uses mutexes in a couple of locations to avoid thread-safety issues. I believe we can improve the performance further by reducing the number of synchronizations. Some data sets which didn't require locking achieved over 8X speed up on an 8-core Xeon. TBB 4.0 introduces the flow graph to solve graph-related problems such as the one we discussed here. The TBB flow graph could help simplify and solve some of the object data dependency problems.<br /><br />1. <b>O'Neill, John, Wells, Alex and Walsh, Matt</b>. Optimizing Without Breaking a Sweat. <i>Intel® Software Network</i>. [Online] [Cited: 11 29, 2011.] <a href="http://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat/">http://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat/</a>.<br /><br />2. TBB Documentation. [Online] <a target="_blank" href="http://threadingbuildingblocks.org/documentation.php">http://threadingbuildingblocks.org/documentation.php</a>*.<br /><br />3. <b>Robison, Arch.</b> Abstracting Thread Local Storage. <i>Intel® Software Network</i>. [Online] <a href="http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/">http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/</a>.<br />
<div id="vc-meta" >
<div id="vc-meta-author"></div>
<div id="vc-meta-pubdate">04-13-2012</div>
<div id="vc-meta-modificationdate">04-13-2012</div>
<div id="vc-meta-taxonomy">Case Studies</div>
<div id="vc-meta-category-product">
<div class="tbb">Intel® TBB</div>
<div></div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43240</div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Intel worked closely with DreamWorks Animation engineers to improve the performance of a key rendering system library by up to 35X in some cases.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/case-study-parallelizing-a-recursive-problem-with-intel-threading-building-blocks/</link>
      <pubDate>Fri, 13 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/case-study-parallelizing-a-recursive-problem-with-intel-threading-building-blocks/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/case-study-parallelizing-a-recursive-problem-with-intel-threading-building-blocks/</guid>
      <category>Parallel Programming</category>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Designing Application Software for Energy-efficient Performance</title>
      <description><![CDATA[ <b>By Nancy Nicolaisen</b><br /><br />Personal computers are designed to be in processor idle 75% of the time but in fact might more realistically be estimated to be idle in excess of 90% of the time because of the effects of imposed waits for user input, server response, and resource availability. An idle processor is available to sleep and, while in a sleep state, can save most of the energy it would otherwise consume from actively executing. At least on the client side, if all of the theoretical energy-saving potential of processor sleep states were realized, end user energy use could shrink by fantastical amounts with no apparent sacrifice of functionality or productivity.<br /><br />This, however, is not today’s status quo. For various reasons, users sometimes intentionally configure client systems not to sleep, and it is not uncommon for application software to inadvertently (or intentionally) prevent CPUs from entering sleep states. Application developers can’t do anything about the former. However, there is a lot they can do to make sure the sophisticated laptop and tablet solutions they design, code, and deploy are energy efficient. In addition, if targeting a thin client, developers have to be aware of the back-end servers and how they could affect the operation and power envelope for that thin client.<br /><br />
<h2 class="sectionHeading">Follow Best Practices for Creating Energy-efficient Client Device Applications</h2>
From an application developer’s point of view, a key tactic for achieving energy-efficient software performance is effective handling of sleep state transitions. A few general rules can go a long way toward accomplishing this goal—for example:<br /><br />
<ul>
<li>Design applications that allow screens to darken and disks to idle by avoiding behaviors that unnecessarily prevent systems from remaining in a sleep state. Moving from sleep states to full activity states requires some energy; thus, develop algorithms that do not wake idle processors unnecessarily.</li>
<li>Where possible, eliminate code that keeps processors from transitioning to sleep states.</li>
<li>Employ development frameworks that allow an app to be respectful of sleep status and resilient in handling nonessential workloads.</li>
<li>To prevent users from disabling sleep, become more context aware, and take steps to ensure that systems don’t enter sleep states when users are passively interacting with them (e.g., watching or listening).</li>
<li>Develop power-aware strategies for handling timers and looping. Investigate the use of compiler switches that unroll deterministic loops, and make other adjustments that reduce the overall number of instructions executed (e.g., remove polling).</li>
<li>Use energy-aware tools to identify patterns of processor use in your apps.</li>
</ul>
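As one illustration of the "remove polling" advice in the list above, the sketch below contrasts a polling wait with a blocking wait using only the C++ standard library. The <code>Mailbox</code> type and the function names are hypothetical; the point is only that the blocking version lets the thread sleep until it is notified, instead of waking the processor on a timer just to check a flag.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Shared state between a producer and a waiting consumer.
struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int value = 0;
};

// Polling version (the pattern to avoid): wakes up every millisecond just to
// check a flag. Returns how many times the thread woke up before it was set.
int waitByPolling(Mailbox& box) {
    int wakeups = 0;
    for (;;) {
        ++wakeups;
        {
            std::lock_guard<std::mutex> lock(box.m);
            if (box.ready) return wakeups;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Blocking version: the thread sleeps until the producer notifies it.
// Returns the posted value.
int waitByBlocking(Mailbox& box) {
    std::unique_lock<std::mutex> lock(box.m);
    box.cv.wait(lock, [&box] { return box.ready; });
    return box.value;
}

// Producer side: publish a value and wake one waiting thread.
void post(Mailbox& box, int v) {
    {
        std::lock_guard<std::mutex> lock(box.m);
        box.value = v;
        box.ready = true;
    }
    box.cv.notify_one();
}
```

The number of wakeups in the polling version grows with the time spent waiting, while the blocking version wakes once per posted value, which gives an otherwise idle processor a better chance of staying in a sleep state.<br /><br />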
A well-designed app should have little impact on overall energy consumption when it is open but idle, as Figure 1 shows.<br /><br />
<p ><img src="http://software.intel.com/file/43251" /></p>
<br /><b>Figure 1.</b> A key energy management principle: Idle apps should have negligible impact on power use.<br /><br />
<h2 class="sectionHeading">Tools And Techniques for Evaluating and Optimizing Application Energy Consumption Performance</h2>
Unlike many types of optimization, developers can’t see or infer symptoms of poor application energy performance. To make real progress toward improved client-side application energy efficiency, you need to employ power performance optimization tools and techniques. Figure 2 shows the results for 15 applications in a study. The chart shows two things: the average power over baseline (in Watts) and the percentage impact of that power draw over baseline. For example, Instant Messenger-4 running at idle caused the platform power draw to increase by 1.7 Watts, or 21 percent over system idle without the application running. This idle power draw affected battery life by approximately 4 hours. The conclusion from this study is that applications within the same category can exhibit different idle power behaviors.<br /><br />
<p ><img src="http://software.intel.com/file/43252" /></p>
<br /><b>Figure 2.</b> Analyzing app power performance behaviors “in the wild” can be complex.<br /><br />Imbuing client-side applications with power awareness isn’t difficult, but it is something that must be done with deliberate intention. For app developers, this is a matter of finding and using frameworks and instrumentation that help validate good designs and discover the flaws in program logic that need remediation.<br /><br /><b><i>Intel® Energy Checker</i></b><br /><br />The Intel® Energy Checker software development kit (SDK) provides developers with a way to analyze how applications consume power. This information is key to optimization, because gross power usage is far from being the whole story. Real efficiency demands an understanding of exactly how an app’s power consumption relates to its work output. For example, power sinks can be the result of poorly integrated legacy code, duplication of effort in libraries and components, frivolous output activities, and the like.<br /><br />Finding app behaviors that waste energy can be as challenging as finding memory leaks and other sublethal application flaws. Symptoms can be so subtle that it’s impossible to diagnose problems without instrumented code and a controlled, self-documenting test environment. Fortunately, this is precisely what Intel® Energy Checker provides. This SDK allows developers to:<br /><br />
<ul>
<li>Evaluate app productivity versus power consumption</li>
<li>Instrument code to report specific metrics about operations performed, timings, and collateral conditions</li>
<li>Generate large performance data sets using a variety of execution regimes</li>
<li>Evaluate the power consumption impacts of alternative libraries, drivers, and frameworks</li>
<li>Validate optimizations and remediation</li>
<li>Instrument apps in ways that allow customers and third-party testers to certify apps as energy efficient</li>
</ul>
<i><b>What Intel® Energy Checker Offers Client App Developers</b></i><br /><br />The Intel® Energy Checker SDK is a full-featured testing and validation facility. Its fundamental layer comprises a counter application programming interface (API) that allows direct measurement of app productivity. The ability to export and import counters provides a mechanism for analyzing how efficiently apps work with one another and the system overall.<br /><br />Intel® Energy Checker’s companion build and scripting tools allow a means of analyzing code for which source is not available or can’t practically be built with inline instrumentation. Command-line utilities allow Intel® Energy Checker tools and data streams to interoperate with native Windows* and Linux* counters and utilities, making Oracle* Solaris 10–, Mac OS X*–, and Linux* MeeGo-based apps susceptible to evaluation by Intel® Energy Checker testing and validation.<br /><br />One of the biggest advantages the Intel® Energy Checker toolset offers is its support for a broad variety of application development regimes. To help developers get up to speed with their projects, the SDK shipped with sample applets demonstrating how to employ it in the following situations:<br /><br />
<ul>
<li>With threading</li>
<li>Called from Java*</li>
<li>Called from C#</li>
<li>Called from Objective-C</li>
<li>With Linux system information utilities</li>
<li>CPU use histogram generator tools</li>
<li>Cluster energy efficiency</li>
<li>PL sampling measurements</li>
</ul>
The suite supports a majority of the common application programming languages in use today, including C, C++, C#, Objective-C, Java*, PHP, and Perl.<br /><br /><i><b>Using Microsoft Joulemeter to Analyze Energy Efficiency Performance</b></i><br /><br />Joulemeter from Microsoft* Research is focused on creating modeling and optimization tools to assist system architects, administrators, and developers in improving the energy efficiency of computing infrastructures. The Joulemeter Research Program concentrates on modeling and optimizing power use by computational infrastructure of all types and scales. This information is critical, because to achieve real energy savings, systems have to be optimized from end to end. Even lightweight mobile clients have to be aware of the impacts of their behavior on back-end servers, such as whether they will affect the operation’s overall power performance.<br /><br />The Joulemeter Research Project has published the lightweight stand-alone Joulemeter application* for Windows* 7 laptops and desktops. The app estimates the power consumption of a single computer by tracking resource usage (CPU saturation, screen backlighting, antenna power use, and the like); from these measurements, Joulemeter forecasts system power consumption.<br /><br /><i><b>Intel® Battery Life Analyzer</b></i><br /><br />The Intel® Battery Life Analyzer (BLA) is a lightweight tool that monitors battery life on computers running the Windows* operating system. Empirically evaluating energy-related application performance on battery-powered systems can sometimes yield impressive gains with relatively minor changes in application code. BLA helps developers identify opportunities to make “application idle” states converge on platform idle states. In particular, BLA gets around a problem from which most power management and accounting application programming interfaces (APIs) suffer. 
Inherently, accounting APIs have to work with sampled data, recorded at timer tick intervals (on the order of 15.6 msec). Therefore, if a software operation starts on a timer tick but ends before the next tick, it can’t be detected by metrics that use full tick granularity.<br /><br />Although this sounds like a negligible shortcoming, it isn’t. Many isochronous operations (think media handling) fall into this category, and such operations can easily become huge fractions of a platform’s overall workload. In contrast, BLA uses fine-grained process information based on microsecond scale time stamps. BLA records both a given activity’s starts and stops. This precision provides not only a more accurate picture of power utilization; it is also a far more complete one. (For a rigorous treatment of this topic, you can find a link to the Intel white paper, “Energy Efficient Platforms—Considerations for Application Software and Services,” in the Helpful Links section.)<br /><br />
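The granularity problem described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the accounting model (charge a whole tick whenever work is running at the instant of a tick), the function names, and the workload numbers are all hypothetical and are not taken from BLA or from any Windows API. Times are kept in integer microseconds so the arithmetic is exact.<br /><br />
<pre>
```python
TICK_US = 15_600  # timer tick of about 15.6 ms, expressed in microseconds


def busy_by_tick_sampling(intervals, total_us, tick_us=TICK_US):
    """Tick-style accounting: charge a whole tick if work is running at the tick."""
    busy = 0
    for t in range(0, total_us, tick_us):
        if any(t >= start and end > t for start, end in intervals):
            busy += tick_us
    return busy


def busy_by_timestamps(intervals):
    """Timestamp-style accounting: sum the recorded start/stop pairs."""
    return sum(end - start for start, end in intervals)


# Hypothetical isochronous workload: 5 ms of work beginning 1 ms after every tick,
# so each burst of work starts and ends between two consecutive ticks.
frames = [(k * TICK_US + 1_000, k * TICK_US + 6_000) for k in range(64)]
total_us = 64 * TICK_US

sampled = busy_by_tick_sampling(frames, total_us)
actual = busy_by_timestamps(frames)
print(f"tick sampling sees {sampled / 1000:.1f} ms busy")
print(f"timestamps see {actual / 1000:.1f} ms busy out of {total_us / 1000:.1f} ms")
```
</pre>
In this toy model the tick-sampled view never observes the periodic work at all, while the start/stop records account for every burst. Real accounting mechanisms are more involved, but the sketch shows why sub-tick activity can disappear from tick-granularity metrics.<br /><br />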
<h2 class="sectionHeading">Mobile Device Battery Life Conservation</h2>
More and more, batteries are a key source of power for computing platforms. In early 2011, smart phones outsold PCs 4 to 1 worldwide. Given this, expect to see the energy efficiency of mobile apps become a key concern for all types of software consumers. Fortunately, mobile developers are generally pretty savvy about energy efficiency, as battery-operated devices have always demanded that discipline of them.<br /><br />All mobile development frameworks include methods for detecting power states (connected to AC wall current or running on DC battery power), testing battery levels, and scaling system and application behaviors in response to energy regimes. Apple*, Symbian*, Microsoft*, RIM*, and other mobile device vendors have worked over the years to establish general guidelines that help app developers be good power-management citizens on small devices. Many of these rules translate easily to laptop and desktop apps that are being reworked to improve power performance:<br /><br />
<ul>
<li>Replace timer-based designs with event-driven or interrupt-driven logic.</li>
<li>Avoid using timers as a high-resolution time source. If there is no workable alternative, ensure that timer resolution is reset to the system default when it is not actively engaged in its specific task.</li>
<li>Apps designed to provide passive display of content should explicitly increase display dimming timeout to accommodate playback using power request or availability APIs. The requests should be explicitly rescinded when the app is minimized or inactive.</li>
<li>Screen savers and the like should not alter dimming timeouts. Unless there is an aesthetic reason for them, screen savers do nothing to maintain the health of LCD monitors and are simply wasting energy. Let screens dim, if practical.</li>
</ul>
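The first guideline in the list above, replacing timer-based designs with event-driven logic, can be sketched as follows. This is a minimal illustration using Python's standard threading module rather than any platform power API; the function names, the 10 ms polling interval, and the 0.2 second delay are assumptions chosen only for the demonstration.<br /><br />
<pre>
```python
import threading
import time


def polling_consumer(flag, poll_interval_s, wakeups):
    """Timer-based design: wake up periodically just to check a flag."""
    while not flag["ready"]:
        wakeups.append(time.monotonic())  # every loop iteration is a wake-up
        time.sleep(poll_interval_s)


def event_consumer(event, wakeups):
    """Event-driven design: block until signalled, then wake exactly once."""
    event.wait()
    wakeups.append(time.monotonic())


def run_demo(work_delay_s=0.2, poll_interval_s=0.01):
    flag = {"ready": False}
    event = threading.Event()
    poll_wakeups, event_wakeups = [], []
    threads = [
        threading.Thread(target=polling_consumer,
                         args=(flag, poll_interval_s, poll_wakeups)),
        threading.Thread(target=event_consumer, args=(event, event_wakeups)),
    ]
    for t in threads:
        t.start()
    time.sleep(work_delay_s)  # stand-in for waiting on I/O or user input
    flag["ready"] = True
    event.set()
    for t in threads:
        t.join()
    return len(poll_wakeups), len(event_wakeups)


poll_count, event_count = run_demo()
print(f"polling wake-ups: {poll_count}, event-driven wake-ups: {event_count}")
```
</pre>
The polling thread wakes repeatedly while nothing is happening, whereas the event-driven thread stays blocked until there is work to do. Fewer wake-ups give the platform more opportunity to remain in low-power states.<br /><br />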
Ineffective management of sleep states can dramatically multiply an app’s power consumption. Effective use of parallelization, coalescing tasks that are difficult to parallelize into a single thread, and avoiding excessive synchronization among threads are all strategies that can help reduce the number of sleep state transitions an app triggers (see Figure 3).<br /><br />
<p ><img src="http://software.intel.com/file/43253" /></p>
<br /><b>Figure 3.</b> Effective management of sleep states is key to good app energy performance.<br /><br />
<h2 class="sectionHeading">Conclusion</h2>
Managing the energy performance of application software may reasonably be expected to become a core competency for developers in the fairly near term, as economic and environmental considerations shape thinking on software engineering best practices. Many good tools exist for this purpose, and the Intel® Energy Checker SDK can help to validate and refine energy-optimization efforts of client software developers targeting both the desktop and mobile platforms.<br /><br />
<h2 class="sectionHeading">Helpful Links and Additional information on Power Management Tools and Resources</h2>
<ul>
<li><a target="_blank" href="http://www.climatesaverscomputing.org/resources/information/software-development">Software development information from Climate Savers Computing</a>*</li>
<li><a href="http://software.intel.com/en-us/articles/intel-energy-checker-sdk/#FAQ">Intel® Energy Checker SDK and user guide</a></li>
<li><a target="_blank" href="http://www.thegreengrid.org/about-the-green-grid.aspx">Learn more about The Green Grid</a>*</li>
<li><a target="_blank" href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa373163(v=vs.85).aspx">Microsoft Power Management Functions* reference</a></li>
<li><a target="_blank" href="http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Power_Management_Guide/index.html">Red Hat Linux 6 Power Management Guide</a>*</li>
<li><a target="_blank" href="http://www.elinux.org/Power_Management">Power Management for Linux</a>*</li>
<li>Fine-Grained Energy Profiling for Power-Aware Application Design: <a target="_blank" href="http://research.microsoft.com/apps/pubs/default.aspx?id=73662">http://research.microsoft.com/apps/pubs/default.aspx?id=73662</a>*</li>
<li>Intel white paper: “Energy Efficient Platforms—Considerations for Application Software and Services” (<a href="http://www.intel.com/content/www/us/en/green-it/energy-efficiency/energy-efficient-platforms-2011-white-paper.html?wapkw=considerations+for+application+software+and+services">http://www.intel.com/content/www/us/en/green-it/energy-efficiency/energy-efficient-platforms-2011-white-paper.html?wapkw=considerations+for+application+software+and+services</a>)</li>
<li>BLA requests, questions, and feedback: <a href="mailto:BatteryLifeAnalyzer@intel.com">BatteryLifeAnalyzer@intel.com</a></li>
</ul>
<h2 class="sectionHeading">About the Author</h2>
Nancy Nicolaisen is an author, researcher, and veteran software developer specializing in mobile and embedded device technologies. Her feature articles, columns, and analyses have been internationally circulated in publications such as <i>BYTE, PC Magazine, Windows Sources, Computer Shopper, Dr. Dobbs Journal of Software Engineering, and Microsoft Systems Journal</i>. She is the author of three books—<i>Making Windows Portable: Porting Win32 to Win CE</i> (2002, John Wiley &amp; Sons); <i>The Practical Guide to Debugging 32 Bit Windows Applications</i> (1996, McGraw Hill); and <i>The Visual Guide to Visual C++</i> (1994, Ventana Press)—available in five foreign-language editions. In 2007, she served as technical advisor for the development of the Microsoft Professional Education course “Designing, Building and Managing Wireless Networks.” Ms. Nicolaisen is currently active in exploring open source technologies and trends for mobile, embedded, and wireless devices.<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/designing-application-software-for-energy-efficient-performance/</link>
      <pubDate>Mon, 09 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/designing-application-software-for-energy-efficient-performance/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/designing-application-software-for-energy-efficient-performance/</guid>
      <category>Parallel Programming</category>
      <category>Tools</category>
      <category>Intel® AppUp(SM) Developer Community</category>
      <category>Intel SW Partner program</category>
      <category>Intel Software Network communities</category>
      <category>Power Efficiency</category>
      <category>Ultrabook</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Unhandled Exceptions when Debugging OpenMP applications</title>
      <description><![CDATA[ <strong>Problem:</strong> <br />When debugging OpenMP* applications built with the Intel® C++ or Fortran Compiler, an unhandled exception dialog like the following<br /><br /><em>First-chance exception at 0x74f9b9bc (KernelBase.dll) in &lt;appname&gt;.exe: 0xA1A01DB1: <br />Intel Parallel Debugger Extension Exception 1<br /><br /></em>may appear. Normally these exceptions are handled by the Intel® Parallel Debugger Extension in the background, but in the environment specified below the exception dialog may pop up whenever an OpenMP application is started for debugging and when it is terminated.<br /><br /><br /><strong>Environment:</strong> <br />Windows* 7 Enterprise, SP1, 64-bit<br />Microsoft Visual Studio* 2010 SP1<br />Intel® Parallel Studio XE 2011 SP1 Update 2<br />Intel® Inspector XE 2011 Update 9<br /><br /><br /><strong>Resolution:</strong> <br />When the unhandled exception dialog appears, click 'Continue'. The root cause of the problem has been identified and will be fixed in a future version. ]]></description>
      <link>http://software.intel.com/en-us/articles/unhandled-exceptions-when-debugging-openmp-applications/</link>
      <pubDate>Tue, 27 Mar 2012 15:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/unhandled-exceptions-when-debugging-openmp-applications/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/unhandled-exceptions-when-debugging-openmp-applications/</guid>
      <category>Parallel Programming</category>
      <category>Intel Software Network communities</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
      <category>Intel® Inspector XE Knowledge Base</category>
    </item>
    <item>
      <title>Using Intel® Power Checker to measure the energy performance of a compute-intensive application </title>
      <description><![CDATA[ <p>Intel® Power Checker provides developers with a quick and easy way to evaluate the idle power efficiency of their applications on mobile platforms with Intel® Core™ processor or Intel® Atom™ technology running the Microsoft Windows* operating system. Any compiled-language application, especially one designed to run on technology based on Intel® products, as well as Java Framework applications, can be analyzed by Intel Power Checker. The checker can be used with or without a supported external power meter.</p>
<p>Intel Power Checker 2.0 now supports measurement both on battery and with the system plugged into an external AC power source. External power measurement is supported only on Intel® Second Generation Core processors, and only if the Intel® Power Gadget software has been installed.</p>
<p>For this article, I took a very compute-intensive parallel application that I wrote to solve instances of the logic puzzle Akari. The code uses a backtracking algorithm to explore how to place light bulbs onto a grid under constraints dictated by the rules of the puzzle and the layout of the puzzle instance. Potentially millions of independent tasks can be generated by the code as the solution space is searched by threads executing those tasks. This solution method is eminently scalable to a large number of threads and is able to keep many cores running at peak speed for a sustained amount of time.</p>
<h2>How to Use Intel Power Checker</h2>
<p>The Intel Power Checker provides a GUI wizard that leads you through the four steps of power analysis. These four steps in the checker are described below. Before starting the assessment, be sure to know which section of your application (a workload) you want to be measured, as the Power Checker will only measure a 30 second execution interval. (If you want to measure the entire execution workload, you should try some other tool, like Intel Power Gadget.) Your workload could be a compute-intensive portion or an I/O-intense section or just some point in execution that typifies the majority of expected usage.</p>
<h3 >Step 1: Specifying the Power Meter device</h3>
<p>If you have an external power meter attached to your test system, you can select the model being used on the first screen of the wizard. The default is that no external device is being used. For this default case, Intel Power Checker will determine if the system is capable of providing power consumption data and if the correct power driver, EzPwr.sys, is installed. (The driver is part of the default installation of <a href="http://software.intel.com/en-us/articles/intel-power-gadget/">Intel Power Gadget</a>.)</p>
<h3 >Step 2: Measure System Baseline</h3>
<p>The first measurement that the Intel Power Checker will perform is on the next screen within the wizard. This is to measure the baseline power consumption of the hardware without your application running. Prior to this measurement phase any unnecessary processes such as operating system updates, Windows Indexing Service, virus scans, media players, and internet browsers should have been shut down. In other words, to get the most accurate results you should make your test system as idle as possible and ensure that nothing will become a foreground process during your measurement runs.</p>
<p>Once you have a quiescent system, click the “Start” button to begin this phase of the testing. The Intel Power Checker waits 15 seconds to allow the system to come to an idle state before starting the measurements. You need to be sure to position your mouse and the keyboard out of reach, or keep your hands away from them, to avoid any stray contact that might trigger some response from the platform. After the pause, the checker will observe the system for 30 seconds in this idle state. A progress bar will show the time remaining in each part of this phase. Once the baseline data collection is complete, click the “Next” button to proceed to the next phase.</p>
<h3 >Step 3: Measure Active Application</h3>
<p>Before you are taken to the next screen in the wizard, you are instructed to start the application you are interested in measuring. Start up your application and click the “OK” button to advance the GUI to the next screen. Once you have reached the Step 3 screen, use the scroll bar to locate your application in the process list and click on that line to select it. If your application is not listed, click the “Refresh List” button so that your application’s process will be available to select. In addition, you can use the “Apply Filter” button to narrow down the list in order to find your application’s process quickly. After selecting your application from the list, click “Next” to move on to the data collection for this phase. Before starting the assessment, be sure your application has reached the desired point of measurement. If there are some initial setup computations that are not of interest, you will need to get past this point before letting Intel Power Checker begin measurement. For my Akari application, there is very little setup time. It was typically in the thick of computation by the time I had gotten to the point of selecting the process from the list.</p>
<p>As soon as I could, I clicked the “Start” button to begin capturing measurement data. Since this is one of the crucial power measurements for your application, always begin capturing data <b>after</b> the workload or critical section has begun and make sure this active execution will run longer than the 30 seconds needed to complete the measurement time.</p>
<h3 >Step 4: Measure Idle Application</h3>
<p>The final phase is to measure your application’s idle power consumption. This is another important phase of energy efficiency measurement of an application since your application must not only do efficient computation, but also not waste energy when sitting idle.</p>
<p>This step doesn’t make much sense for my compute-intensive application since the application has no idle state. Once you start the application on a given puzzle instance, it simply computes all legal solutions in parallel and then ends. As solutions are found, each one is printed by the thread that found it. If there are no solutions, a message is printed just before the application terminates. This latter case describes the workload I used for my tests. Because you must have your application running in “idle” mode for this step, I left the application running at full speed and simply allowed Power Checker to take its measurements.</p>
<p>If your application does have an idle state, perhaps waiting for interaction from the user, the checker will give the system 15 seconds to calm down fully before taking a final 30 second measurement.</p>
<p>Upon completion of this last data collection phase, you will be able to proceed to the results screen within the Intel Power Checker wizard. After all three measurement phases have been completed, a Tool Report File will be generated containing all of the results for later analysis.</p>
<h3 >What data is presented</h3>
<p>The View Results screen of the Intel Power Checker wizard provides basic information about the software assessment. The type of processor in your system and the type and model of the power source that was used are given. Four numerical values for each of the three measurement phases are presented. These values are:</p>
<ul >
<li><b>Elapsed Time:</b> The exact number of seconds that each of the phases lasted.</li>
<li><b>Energy Consumption:</b> The rate at which the battery was discharged during each of the three phases.</li>
<li><b>Average C3 State Residency:</b> The percentage of time that the system was in the C3 state during the data collection period.</li>
<li><b>Platform Timer Period:</b> The platform timer period, in milliseconds, recorded during data collection.</li>
</ul>
<p><img src="http://software.intel.com/file/42410" /></p>
<p>Typical results would hopefully show a larger percentage of time spent in the C3 State Residency for the application idle time measurement (the middle of the three columns on the View Results screen). As my puzzle solving application was still computing as much as it did in the active execution measurement step, this was not the case for my results. This is atypical for the intended type of applications Intel Power Checker assumes will be measured. Thus, the C3 State Residency values provided by the tool for the idle application were not valid for my particular application.</p>
<p>The name of the report file and the directory in which it can be found are listed on the View Results screen.</p>
<h2>Some Caveats</h2>
<p>Below are some things you should consider before and during a measurement run using Intel Power Checker.</p>
<ul >
<li>Before you start using Intel Power Checker, be sure your chosen workload will run for at least 30 seconds from the point you wish to measure power consumption. In my case, I required a data set that would force the application to run for at least 75 seconds (30 for active measurement, 15 for idle setup, and 30 for idle measurement) plus the time I needed to click boxes and find my application in the process list. Since I ran the application on several different numbers of threads, I needed to be sure that the fastest execution time was still long enough to get all the timing steps completed during an Intel Power Checker run.</li>
<li>Upon starting Intel Power Checker, the checker may first report that the platform timer period is invalid. In this case, some currently running (background) process has changed the default and it will be up to the user to determine which currently running application has changed the value. Once you have identified the culprit you must stop this process or service before restarting Intel Power Checker. If you are unsure about which active process is preventing Intel Power Checker from starting, you will need to turn off processes one at a time and try Intel Power Checker until the error message doesn’t come up. </li>
<li>Instructions on the Step 3 screen ask you not to touch the keyboard or mouse. If you are measuring an interactive application or you must interact with the application to generate activity for the full 30 seconds, you will need to touch the keyboard and/or mouse. If possible, a workload that can forego interactivity and still compute for the 30 seconds of measurement time would be best. However, if interaction by the user is part of how the application is utilized, interfacing through peripherals will give you a more accurate measure of the overall energy consumption for typical application usage.</li>
<li>A data file is created during each phase of the Intel Power Checker assessment to hold the current information. If you cancel the assessment in any of the three phases then a data file will not be created for that phase. After all three phases have been completed, a Tool Report File, in XML format, will be generated containing all of the results. You can find the name of the report file and where it is located on the View Results screen.</li>
<li>The “Submit Results” button on the View Results screen is optional and only intended for members of the <a href="http://software.intel.com/partner/overview">Intel® Software Partner Program</a> to submit their measurement results to the program. If you are not a member, do not submit your results. Simply click on the “Close” button after you have examined the results compiled by Intel Power Checker.</li>
</ul>
<h2>Some Results</h2>
<p>The purpose of this article is not to determine the best scenario for running my Akari solver application in the most energy efficient way. You will want to do this for your application, though, and this article has given you the background on Intel Power Checker to determine if this checker can help you quantify the current power consumption of your application. Also, as you make modifications to the application you will be able to determine if those changes improve the energy efficiency or cause your application to suck more power than before.</p>
<p>In addition to the average C3 State Residency percentage, the checker delivers the total number of Joules expended during the 30 seconds of execution time measured. From this I can compute the average Watts for execution parts of the application. I have found that a better metric for comparing different applications or different runs of the same application is milliwatt hours (mWh). You need the total execution time of the execution portion of the application to compute this value. Since Intel Power Checker only measures activity in 30 second segments, you will need to have some timing data available, which I happened to have for the different runs I made of my Akari application.</p>
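<p>The conversion described above can be sketched in a few lines. The numbers below are hypothetical and are not my measured results, and the helper names are only illustrative. The sketch also assumes that the average power observed during the 30 second window is representative of the whole execution, which may not hold for every workload.</p>
<pre>
```python
JOULES_PER_MWH = 3.6  # 1 mWh = 0.001 W * 3600 s = 3.6 J


def average_watts(joules, seconds):
    """Average power over a measurement window (1 W = 1 J/s)."""
    return joules / seconds


def energy_mwh(avg_watts, total_seconds):
    """Energy for the full run, assuming the window average holds throughout."""
    return avg_watts * total_seconds / JOULES_PER_MWH


# Hypothetical values, for illustration only.
measured_joules = 900.0   # energy reported for the 30 second window
window_s = 30.0           # length of the measurement window
total_runtime_s = 120.0   # total execution time from separate timing data

watts = average_watts(measured_joules, window_s)
mwh = energy_mwh(watts, total_runtime_s)
print(f"average power: {watts:.1f} W, estimated energy: {mwh:.1f} mWh")
```
</pre>
<p>Because mWh folds execution time into the figure, a run that draws more power but finishes sooner can still come out ahead of a slower, lower-power run.</p>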
<p>I found significant differences when running with and without Hyper-Threading Technology (HT) turned on. Also, a difference in execution time and power usage was evident depending on whether the platform was running on battery (DC) power or on wall socket (AC) power. For example, when running with HT on and a full complement of four threads on the 4 logical cores in my system, I saw the AC power run 1.19X faster than when running the same workload on DC power. However, the former run took 1.15X more power.</p>
<p>Comparing results between runs on DC power and runs on AC power is not a good comparison, especially in this case. The power source is detected by the system and the processor is allowed to run with Intel® Turbo Boost Technology at a higher frequency if the platform is using external power. Even so, you may need to be concerned about power consumption of your application in both power source circumstances and you will need to run measurement experiments within each setup to gauge how well your application modifications affect overall power consumption.</p>
<h3 >System Requirements</h3>
<p>You can use Intel Power Checker on a laptop or netbook based on Intel® Core™ processor or Intel® Atom™ processor technology. A desktop with an external power meter or a desktop that is capable of providing the power consumption information can also be analyzed. A Java* Runtime Environment (JRE) (version 6 update 11 or higher) is also required to run the checker. Supported operating systems are Microsoft Windows* XP (Service Pack 3), Microsoft Windows Vista* (Service Pack 2), Microsoft Windows* 7 (Service Pack 1 [32-bit and 64-bit]), and Microsoft Windows* Server 2008 R2.</p>
<h3 >Download link</h3>
<p>To download the Intel Power Checker installation package, go to the following link:</p>
<p><a href="http://software.intel.com/partner/app/software-assessment">http://software.intel.com/partner/app/software-assessment/</a>. Click on the Intel Power Checker tab to move down to the download link.</p>
<h3 >Other supporting links</h3>
<p>There is a video demonstration of using Intel Power Checker, “A Look at Intel Power Checker,” at the link: <a href="http://software.intel.com/en-us/videos/channel/intel-software-partner-program/a-look-at-the-intel-power-checker/1127786023001">http://software.intel.com/en-us/videos/channel/intel-software-partner-program/a-look-at-the-intel-power-checker/1127786023001</a>. Dave Valdovinos and Taylor Kidd, both from Intel, show off the GUI wizard as it measures the power performance of a game-like application.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/</link>
      <pubDate>Mon, 12 Mar 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/</guid>
      <category>Mobility</category>
      <category>Parallel Programming</category>
      <category>Intel® AppUp(SM) Developer Community</category>
      <category>Intel Software Network communities</category>
      <category>Intel SW Partner program</category>
      <category>Game Development</category>
      <category>Power Efficiency</category>
      <category>Intel® vPro™ Developer Community</category>
      <category>Resources For Software Developers</category>
      <category>Ultrabook</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Using Intel Cluster Checker to check that MPI applications will properly run over InfiniBand</title>
      <description><![CDATA[ <p class="MsoNormal">One of the benefits of Intel Cluster Checker is that it acts as an application proxy. If the tool passes, then there is a high probability that an MPI application will run properly.</p>
<p class="MsoNormal">To ensure this, the following exhaustive steps are enforced by Intel Cluster Checker test modules:</p>
<ol>
<li>Check the base libraries and their uniformity (<b>base_libraries</b>)</li>
<li>Check that MPI tools have consistent paths (<b>mpi_consistency</b>)</li>
<li>Check that per-node MPI jobs can do Hello World independently (<b>intel_mpi_rt</b>)</li>
<li>Check that a global Hello World is successfully executed across compute nodes (<b>intel_mpi_rt_internode</b>)</li>
<li>Run Intel MPI Benchmarks such as Ping Pong to check available latency and bandwidth (<b>imb_pingpong_intel_mpi</b>)</li>
<li>Stress the communication system by running the HPCC benchmark (<b>hpcc</b>)</li>
</ol>
<p class="MsoNormal">If the tool reports a problem, then an MPI application might have trouble completing its work.</p>
<p class="MsoNormal">These steps will even catch potential timeouts due to a wrong configuration of the network stack and, most important, bad cabling or hardware interfaces that are down. However, if the cluster uses InfiniBand adapters then there is a known issue to be aware of. The global MPI check can hang, as any other MPI application would, if InfiniBand is not correctly configured and online.</p>
<blockquote>
<p class="MsoNormal"><span >Intel(R) MPI Library Runtime Environment (All nodes), (intel_mpi_rt_internode, 1.8.....................................................</span><span >^C</span></p>
<p class="MsoNormal"><span >Caught signal INT, cleaning before termination.<o:p></o:p></span></p>
</blockquote>
<p class="MsoNormal">With InfiniBand setups, the configuration of Intel Cluster Checker must define openib and dat_conf as dependencies of intel_mpi_rt_internode. This action will ensure that the InfiniBand devices are properly detected and healthy. The openib module checks the hardware devices, and dat_conf checks the DAPL software interface.</p>
<blockquote>
<p class="MsoNormal">&lt;intel_mpi_rt_internode&gt;<o:p></o:p></p>
<p class="MsoNormal">&lt;add_dependency&gt;dat_conf&lt;/add_dependency&gt;<o:p></o:p></p>
<p class="MsoNormal">&lt;add_dependency&gt;openib&lt;/add_dependency&gt;<o:p></o:p></p>
<p class="MsoNormal">&lt;/intel_mpi_rt_internode&gt;<o:p></o:p></p>
</blockquote>
<p class="MsoNormal">This decision cannot be made automatically, because choosing whether or not to use the low-latency, high-bandwidth capabilities of InfiniBand during the check is at the discretion of the user. For instance, the administrator may want to double-check that an Ethernet fabric can be properly used to run MPI applications.</p>
<p class="MsoNormal">Be aware that this manual requirement may be lifted in the near future.<o:p></o:p></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-cluster-checker-to-check-that-mpi-applications-will-properly-run-over-infiniband/</link>
      <pubDate>Tue, 07 Feb 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-cluster-checker-to-check-that-mpi-applications-will-properly-run-over-infiniband/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-cluster-checker-to-check-that-mpi-applications-will-properly-run-over-infiniband/</guid>
      <category>Parallel Programming</category>
      <category>Intel® Cluster Ready</category>
      <category>Tools</category>
      <category>Intel Software Network communities</category>
      <category>Resources For Software Developers</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>OpenCL™ - Programming for CPU Performance</title>
      <description><![CDATA[ This white paper is the third in a series of white papers on OpenCL™ describing how to best utilize underlying Intel hardware architecture using OpenCL. This white paper will go over programming considerations for host-side device orchestration, as well as OpenCL kernels for CPU.<br /> <br /> <i>Disclaimer: This article is based on personal experience as well as on conversations with the OpenCL team at Intel.  It will provide you with insights into performance with the current Intel® OpenCL SDK.  Intel may support OpenCL on future devices to bring you more performance on the platform, but no announcement has been made on specific platforms and release dates. Nevertheless, you can use today’s guidelines to scale to the next generation of Intel platforms.</i><br /> <br /> The Intel® OpenCL SDK 1.1 implementation for CPU (Intel® Core™2 Duo or later CPUs) can be retrieved from <a href="http://software.intel.com/en-us/articles/opencl-sdk">http://software.intel.com/en-us/articles/opencl-sdk</a>. It is still evolving alongside the OpenCL specification, so feel free to try it and provide feedback to us at the <a href="http://software.intel.com/en-us/forums/intel-opencl-sdk/">Intel OpenCL SDK Support Forum</a>. At present, Intel OpenCL SDK 1.1 runs on 64-bit Linux*, Microsoft Windows 7* (with SP1) and Microsoft Windows Vista* operating systems (32-bit and 64-bit).<br /> <br /> The inherently heterogeneous nature of OpenCL allows developers to target various devices that might have very different architectures. CPUs are traditionally great for large complex kernels, as they have large out-of-order cores and large caches.<br /> <br />
<h2 class="sectionHeading">Performance Considerations for Devices</h2>
OpenCL programming requires explicit host-side management of queues, contexts, and devices. Thus, to be efficient, the host-side logic needs to incorporate certain architecture knowledge to utilize any given target device in the best way. There are also different strategies available to divide work among multiple devices that also involve work coordination using events and asynchronous callbacks.<br /> <br /> Let us first go over a top-level view of an OpenCL program.<br /> <br />
<p ><img src="http://software.intel.com/file/40588" /></p>
<br /> OpenCL kernels and the host program both have to make sure that the underlying hardware is utilized efficiently. Figure 1 shows that communication with discrete graphics devices may go over a PCI-E link, which is about 10x slower than communicating with the CPU through the memory/cache hierarchy. If results are needed back in the main program, programmers need to take the cost of data transfer into consideration when evaluating the performance of algorithms.<br /> <br />
<h2 class="sectionHeading">High-Level Device Selection Considerations</h2>
<b>Algorithms and Alternatives (Intel® IPP, Intel® MKL, Intel® Media SDK)</b><br /> If an algorithm has a large memory footprint, has a lot of branches, or relies on table lookups (as in dynamic programming), then pick the CPU as the target for it. OpenCL does not allow recursion, function pointers, variable length arrays, etc., so check the specification to be sure that the algorithm is supported under the OpenCL framework. Algorithms such as constraint solvers usually perform better on CPUs, as these algorithms need conditional statements.<br /> <br /> Intel continues to offer developers a choice of proven, innovative parallel programming models. Examples are libraries such as Intel® Threading Building Blocks (Intel® TBB), Intel IPP or Intel MKL (for more details, visit <a href="http://software.intel.com/en-us/articles/intel-tbb/">http://software.intel.com/en-us/articles/intel-tbb/</a>, <a href="http://software.intel.com/en-us/articles/intel-ipp/">http://software.intel.com/en-us/articles/intel-ipp/</a> or <a href="http://software.intel.com/en-us/articles/intel-mkl/">http://software.intel.com/en-us/articles/intel-mkl/</a>). OpenCL augments these tools with low-level standard API support on Intel platforms.<br /> <br /> The OpenCL compiler is a new technology and is not yet at the same level of maturity as Intel tools such as the Intel C/C++ Compilers or the image, signal and crypto processing routines in Intel IPP and Intel MKL. As a result, applications using OpenCL on the CPU will not always perform as well as applications using those highly optimized library functions.<br /> <br /> For better optimization of media applications on Intel hardware, Intel has also released the Intel® Media SDK for Windows*, which is optimized for video processing (scaling, color correction, de-interlacing, cropping and sharpening, etc.) and media encoding/decoding (AVC, MP4, H.264, etc.). 
Intel Media SDK utilizes Intel® Processor Graphics hardware to expedite video processing and encoding. These capabilities of Intel processor graphics are at present only exposed through the Intel Media SDK. This may be a better alternative if your requirements fall in this category and your machine already has Intel processor graphics. For more details, visit <a href="http://software.intel.com/en-us/articles/media/">http://software.intel.com/en-us/articles/media/</a>.<br /> <br /> <b>Turbo Frequencies</b><br /> Most modern devices work at turbo frequencies only when there is enough work to be performed. When measuring performance, take measurements at both normal and turbo frequencies so that you have a better idea of performance/power ratios. Turbo mode usually kicks in when there is a lot of work queued.<br /> <br /> Devices deliver higher performance when precision requirements are lower. If your algorithm can work at lower precision, use the -cl-fast-relaxed-math build option when compiling your kernels. For more information, see the Optimization guide at <a href="http://software.intel.com/en-us/articles/opencl-sdk/">http://software.intel.com/en-us/articles/opencl-sdk/</a>. Multiply-and-add options may not work on all device targets (some devices may not support them), so make performance decisions while keeping an eye on portability.<br /> <br /> <b>Avoid Naïve Selection Criteria</b><br /> Device selection decisions should not be based on the number of cores available, as not all cores are equal in capabilities. The amount and type of work at hand may be better selection criteria. Programmers should also consider the data destination (for an audio filter, say, performing the FFT on a remote device may hurt performance because of the cost of data transfer over PCI-E) and the algorithm at hand to select the most appropriate device.<br /> <br />
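The point about data destination and transfer costs can be made concrete with a little arithmetic. The following is only a sketch of a cost model, not part of any SDK: the structure, the function names and all numbers are hypothetical, and real bandwidth and kernel times have to be measured on the target system.<br /> <br />

```c
#include <stdbool.h>

/* Hypothetical cost model: every name and number here is illustrative. */
typedef struct {
    double bytes_to_device;    /* input data copied to the device */
    double bytes_from_device;  /* results copied back to the host */
    double link_bytes_per_sec; /* sustained bandwidth of the link, e.g. PCI-E */
    double kernel_seconds;     /* measured kernel time on that device */
} offload_estimate;

/* Total time when the data must travel to the device and back. */
double offload_total_seconds(const offload_estimate *e)
{
    double transfer = (e->bytes_to_device + e->bytes_from_device)
                      / e->link_bytes_per_sec;
    return transfer + e->kernel_seconds;
}

/* True when the remote device still wins after paying for the transfers. */
bool offload_is_faster(const offload_estimate *e, double cpu_kernel_seconds)
{
    return offload_total_seconds(e) < cpu_kernel_seconds;
}
```

With 1 GB moved in each direction over a 4 GB/s link and a 0.1 s kernel, the remote device needs about 0.6 s in total, so a CPU that finishes the same work in 0.4 s would be the better choice even though its kernel alone is slower.<br /> <br />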
<h2 class="sectionHeading">Writing Host Program for Performance</h2>
<b>Command Queues</b><br /> A host program makes several critical design decisions for the devices at hand. For CPUs, it is better to use out-of-order queues to enable running multiple kernels simultaneously. If it makes sense from a software design perspective, you can use multiple command queues within the same application. If work needs to be synchronized between command queues, use events and callbacks. To synchronize command queue execution with pre-existing C code, use the clEnqueueNativeKernel API or user events. Reading and writing data is a lot faster on CPUs. Use mapped buffers backed by properly aligned host pointers to get the best performance. Profile your code using OpenCL profiling capabilities, and measure your performance at every level using Intel Performance Debugging BKMs.<br /> <br /> Intel® OpenCL SDK 1.1 allows you to create out-of-order queues. Utilizing out-of-order queues may result in better CPU core utilization for algorithms that involve different concurrent steps.<br /> <br /> <b>Memory Objects</b><br /> CPUs do not have specialized hardware to handle images. If you are writing simple convolutions and image data types are most natural for your problem, use image objects along with samplers. But make sure to use the simplest interpolation mode that suffices for your needs, e.g. many (interpolating) kernels work fine with nearest-neighbor filtering. For more considerations on using images, please refer to Chapter 4.3, “Image Support”, of the <a href="http://software.intel.com/file/37171/">Writing Optimal OpenCL code With Intel OpenCL SDK document</a>[1].<br /> <br /> The Intel OpenCL Implicit Vectorization Module tries to create optimal code for such cases. 
As the Intel® OpenCL SDK Compiler matures, performance of such kernels will improve significantly.<br /> <br /> <b>Event callbacks and Multiple Threads</b><br /> Using events and callbacks based on event completion helps to produce code that is easy to understand but hard to debug. One specific reason is that callbacks are asynchronous, so ensuring data safety across threads during callbacks is non-trivial. When multiple threads are issuing commands and commands have completion callbacks, test on single-core and multicore machines under various load conditions (e.g., 90% busy with other work, or with just a single core running the OpenCL application) to weed out hard-to-diagnose thread data safety issues.<br /> <br /> <b>Kernel Objects</b><br /> If you are using threads, multiple command queues and asynchronous event-based callbacks, it is better not to share kernel objects, as then you can set kernel arguments as needed in advance without having to worry about synchronization. Note that clSetKernelArg is not thread safe when called on the same kernel object. This strategy is very helpful when you are doing similar work on multiple datasets (say, processing 300 pictures, or preparing multiple frames of a video in advance).<br /> <br /> <b>Profiling</b><br /> Profiling code should be used as a tool to calibrate data transfer costs and execution costs, along with other fixed costs such as the difference between submit and start times for a given command queue. Profiling code should be stripped out when releasing code, as profiling comes at a cost. Read more at <a href="http://software.intel.com/en-us/articles/performance-debugging-intro/">Intel Performance Debugging Intro</a>.<br /> <br />
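The recommendation under Command Queues to back mapped buffers with properly aligned host pointers can be sketched in plain C11. This is a minimal sketch under assumptions: the helper names are hypothetical, and the required alignment is not stated in this article, so it should be taken from the SDK documentation or queried from the device (for example via CL_DEVICE_MEM_BASE_ADDR_ALIGN) rather than hard-coded.<br /> <br />

```c
#include <stddef.h>
#include <stdlib.h>

/* Round `bytes` up to the next multiple of `alignment` (a power of two). */
size_t round_up_to_alignment(size_t bytes, size_t alignment)
{
    return (bytes + alignment - 1) & ~(alignment - 1);
}

/*
 * Allocate host memory whose address is a multiple of `alignment`.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the size is rounded up first. Release the memory with free().
 */
void *alloc_aligned_host_buffer(size_t bytes, size_t alignment)
{
    if (bytes == 0)
        bytes = 1; /* avoid a zero-sized request */
    return aligned_alloc(alignment, round_up_to_alignment(bytes, alignment));
}
```

A pointer obtained this way could then be handed to clCreateBuffer together with the CL_MEM_USE_HOST_PTR flag, so that the buffer is backed by the aligned host allocation.<br /> <br />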
<h2 class="sectionHeading">Writing OpenCL™ Kernel for Performance</h2>
Please refer to the <a href="http://software.intel.com/file/37171/">Writing Optimal OpenCL code With Intel OpenCL SDK document</a>[1] for detailed recommendations related to developing OpenCL kernels targeted for CPUs.<br /> <br /> Intel® OpenCL SDK 1.1 now includes an Implicit Vectorization Module. This vectorizer works best with 32-bit (float, int) data types; refer to chapter 2.6 of the <a href="http://software.intel.com/file/37171/">Writing Optimal OpenCL code With Intel OpenCL SDK document</a>[1].  In turn, the runtime takes care to execute your job in an efficient and balanced way. This means that for simple image processing kernels, programmers do not need to set the global work size to the number of cores and then loop over the image inside the kernel. Instead, providing a sufficient number of work groups is preferable; refer to chapter 2.7 of the <a href="http://software.intel.com/file/37171/">Writing Optimal OpenCL code With Intel OpenCL SDK document</a>[1]. Programmers should just program in a natural way (i.e., set the global work size to the image width/height, etc.) and then, in the kernel, write code as if they were writing simple scalar code, treating vector data types as if they were scalar types.  The auto-vectorizer often performs AOS (array of structures) to SOA (structure of arrays) translations under the hood to get the best performance from SIMD units.<br /> <br /> Kernels utilizing the vector data types float4, float8 and float16 perform better than kernels written using just scalar floats for the same task. If your algorithm naturally fits vector data types, use these types. If an image has only RGB channels, use float3 data types (a new OpenCL 1.1 feature).  float3 data types do utilize SIMD units, but they are not as efficient as float4, which fits naturally into SSE registers.<br /> <br /> <b>General recommendations for any device</b><br /> These are general recommendations which typically help all kernels regardless of target device.   
These include using vector data types explicitly (though we generally advise using scalar types and relying on the vectorizer), using built-in functions, avoiding computations in kernels that can be done once, avoiding branching, avoiding handling of edge conditions in kernels, and using the preprocessor for constants.<br /> <br /> <b>Device Fission</b><br /> This is a preview feature available in Intel® OpenCL SDK 1.1. Using this feature (enabled using cl_ext_device_fission), programmers can create subdevices and then create command queues for those subdevices to queue kernels. This way, not all resources are allocated to a single command queue. There are several modes for device fission, such as divide equally, based on counts, or based on affinity domain (create a subdevice for every NUMA node). Please refer to the Intel OpenCL SDK User’s guide for more details.<br /> <br /> To use device fission, always create subdevices before creating contexts, and measure whether the performance of a subdevice with the given number of compute units is acceptable. We will cover this feature in more detail in our next white paper.<br /> <br /> <b>Profiling and Debugging</b><br /> Profile your kernel execution times and submit-to-start times using OpenCL event profiling. This is very helpful when using out-of-order queues to see how well multiple kernels are getting executed. Integration with Intel GPA is also useful for out-of-order queue debugging, as it shows a visual representation of the same information. Debugging is still pretty much print-based, and debugging kernels is still a lot easier on CPUs than it is on GPUs. See the Tools introduction at <a href="http://software.intel.com/en-us/articles/introduction-to-intel-opencl-tools/">http://software.intel.com/en-us/articles/introduction-to-intel-opencl-tools/</a>.<br /> <br />
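The general recommendation above to use the preprocessor for constants usually comes down to passing -D definitions in the program build options. The following sketch only formats such an option string; the function name and the IMG_WIDTH and IMG_HEIGHT macro names are hypothetical and would have to match whatever the kernel source actually uses.<br /> <br />

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format kernel constants as preprocessor definitions for the OpenCL
 * program build options. IMG_WIDTH and IMG_HEIGHT are hypothetical macro
 * names chosen for this sketch.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_kernel_defines(char *out, size_t capacity, int width, int height)
{
    int n = snprintf(out, capacity, "-D IMG_WIDTH=%d -D IMG_HEIGHT=%d",
                     width, height);
    return (n < 0 || (size_t)n >= capacity) ? -1 : 0;
}
```

The resulting string would be passed as the options argument of clBuildProgram, so the kernel compiler sees the image dimensions as compile-time constants instead of reading them from kernel arguments.<br /> <br />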
<h2 class="sectionHeading">References</h2>
[1] “Writing Optimal OpenCL™ Code with Intel® OpenCL SDK”, located at <a href="http://software.intel.com/file/37171/">http://software.intel.com/file/37171/</a> and at &lt;install-dir&gt;\docs\<br /> <br />
<h2 class="sectionHeading">About the Author</h2>
<img src="http://software.intel.com/file/40587"  /> Vinay Awasthi works as an Application Engineer for the Apple* Enabling Team at Intel at Santa Clara. Vinay has a Master’s Degree in Chemical Engineering from Indian Institute of Technology, Kanpur. Vinay enjoys mountain biking and scuba diving in his free time.<br  /> <br />
<div id="vc-meta" >
<div id="vc-meta-pubdate">12-21-2011</div>
<div id="vc-meta-modificationdate">12-21-2011</div>
<div id="vc-meta-taxonomy"></div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-abstract">This white paper is the third in a series of whitepapers on OpenCL™ describing how to best utilize underlying Intel hardware architecture using OpenCL. This white paper will go over programming considerations for host-side device orchestration, as well as OpenCL kernels for CPU.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/opencl-programming-for-cpu-performance/</link>
      <pubDate>Wed, 21 Dec 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/opencl-programming-for-cpu-performance/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/opencl-programming-for-cpu-performance/</guid>
      <category>Parallel Programming</category>
    </item>
    <item>
      <title>OpenCL™ – Using Events</title>
      <description><![CDATA[ <h2 class="sectionHeading">Introduction</h2>
This white paper is the fourth in a series of white papers on OpenCL describing how to set up and use events in multithreaded design. This white paper will go over various design choices using OpenCL™ user and command queue-related events for kernels running on CPUs.<br /> <br /> The Intel® OpenCL 1.1 specification Beta implementation for CPU (Intel® Core™2 Duo or later CPUs) can be retrieved from <a href="http://software.intel.com/en-us/articles/opencl-sdk">http://software.intel.com/en-us/articles/opencl-sdk</a>. It is still evolving into a mature product, so feel free to try it and provide feedback to us in the <a href="http://software.intel.com/en-us/forums/intel-opencl-sdk/">Intel® OpenCL SDK Support Forum</a>. At present, Intel OpenCL 1.1 only runs on Linux* 64 bit, Microsoft Windows 7* (with SP1) and Microsoft Windows Vista* operating systems (32-bit and 64-bit).<br /> <br /> Intel OpenCL 1.1 events are used primarily to synchronize commands in a context.  Event objects can be used to track which one of the four states CL_QUEUED, CL_SUBMITTED, CL_RUNNING and CL_COMPLETE a given command is in for a given command queue. User events are used to trigger processing when host threads detect that certain conditions are met. Since user events can be triggered as and when needed, and commands in the command queue can wait on user events as needed, this makes user events the best way to organize command executions when commands are submitted to multiple command queues or to out-of-order command queues.<br /> <br /> Non-user events start with initial state CL_QUEUED, and user events start with CL_SUBMITTED as their initial state. 
Since the OpenCL specification does not call out specifically what should happen when commands are terminated (behavior is implementation-specific), programmers need to utilize the context creation callback function to handle command termination errors effectively.<br /> <br /> <b>For single in-order command queue configuration</b>, events are usually used to synchronize host thread memory management (e.g. managing buffer ownership in CL/GL or CL/D3D10 sharing, or clearing/recycling buffers) with kernel executions. Since all commands are executed by the command queue in order, there is no need to synchronize commands within the command queue. The host thread may either call clFinish() (wait for all submitted commands to finish), or use clWaitForEvents()/clGetEventInfo()/clEnqueueBarrier() (ensures all previous commands are finished), to synchronize memory management with kernel execution. clFinish() is a heavy-handed, brute-force way to make sure all work is done before proceeding further, as it does not return until all submitted work is done. clWaitForEvents() also blocks the host thread, but only until the commands listed in the event list complete.<br /> <br /> A better way is to set up an event callback at CL_COMPLETE (only available in OpenCL 1.1) which sets up buffers in the callback function as needed when event completion occurs (this may happen asynchronously, so make sure it is thread safe). This will not block the host thread, and the host thread can freely do other tasks at hand.<br /> <br /> <img src="http://software.intel.com/file/40589" /><br /> <br /> <b>Fig 1.0 Using Event and Event Callbacks in single in-order Command Queue</b><br /> <br /> <b>For single out-of-order command queue configuration</b>, events are used to synchronize host thread memory management and kernel executions, as well as command execution order as required by algorithms within the command queue. 
The host thread may use clFinish() (wait for all submitted commands to finish), or use clWaitForEvents()/clGetEventInfo()/clEnqueueBarrier() (ensures all previous commands are finished) to synchronize memory management with kernel execution or to explicitly synchronize various sets of commands.<br /> <br /> clFinish() will block the host thread, and it will not allow execution of other commands which could otherwise be executed while waiting for previous commands.<br /> <br /> clWaitForEvents() is a little better, as it blocks the host thread only for the commands listed in the event list. Still, this explicit way of managing commands is not the best way to fully utilize the device.<br /> <br /> Developers should submit commands with event wait lists configured only as the algorithm really requires. This way, the command queue has a lot more flexibility in deciding which commands can be executed while others are in the pipeline. Here is a simple example of this approach.<br /> <br /> <img src="http://software.intel.com/file/40590" /><br /> <br /> <b>Fig 2.0 Managing various kernels using Out-Of-Order queues and Event/Event-wait-lists. Use non-blocking reads and writes.</b><br /> <br /> Another even more efficient way is to run commands in separate threads, using event callbacks and user events as shown below.<br /> <br /> <img src="http://software.intel.com/file/40591" /><br /> <br /> <b>Fig 3.0 Managing various kernels using Out-Of-Order queues and User Events/Event/Event-wait-lists. Use non-blocking reads/writes.</b><br /> <br /> <b>In multi-device/device fission context with multiple command queues</b>, the scheme in Fig 3.0 can be extended to use multiple user events (one per command queue). OpenCL events provide a design paradigm similar to graphical user interface (GUI) design based on events generated by the user. This way, the program performs tasks related only to the event at hand. 
The main thread can simply set up the work related to each event in the corresponding event callback, and continue to do other work without blocking or waiting.<br /> <br /> User events provide a way for the host program to trigger events which are outside the framework of OpenCL commands. OpenCL commands can wait for user event completion before moving forward. This way, fine-grained control over the execution order of various commands can be achieved while managing code complexity.<br /> <br />
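Because completion callbacks may run asynchronously to the host thread, any state they touch has to be thread safe. The sketch below shows one way to keep such state safe with C11 atomics. It is a stand-in rather than real OpenCL code: a callback registered with clSetEventCallback also receives the cl_event, which is left out here so the example compiles without OpenCL headers, and the structure and function names are hypothetical.<br /> <br />

```c
#include <stdatomic.h>

/* Hypothetical state shared between the host thread and callbacks. */
typedef struct {
    atomic_int completed; /* commands that finished successfully */
    atomic_int failed;    /* commands that ended with a negative status */
} completion_stats;

void completion_stats_init(completion_stats *stats)
{
    atomic_init(&stats->completed, 0);
    atomic_init(&stats->failed, 0);
}

/*
 * Body of a completion callback. `status` is 0 (the value of CL_COMPLETE)
 * on success and a negative error code if the command terminated
 * abnormally. Only atomic updates are used, so it can run on any thread.
 */
void on_command_complete(int status, void *user_data)
{
    completion_stats *stats = user_data;
    if (status < 0)
        atomic_fetch_add(&stats->failed, 1);
    else
        atomic_fetch_add(&stats->completed, 1);
}
```

The host thread can read the counters with atomic_load at any time without blocking, which fits the non-blocking style described above.<br /> <br />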
<h2 class="sectionHeading">Profiling Using Events and Event Callbacks</h2>
Profiling with events provides a fine-grained, portable way to collect timing data, in nanoseconds, for almost all commands submitted to the command queue.<br /> <br /> Unfortunately, callback functions must not block (no clFinish, clBuildProgram, clWaitForEvents, or any other blocking calls), so a developer cannot simply wait inside an event callback and then call clGetEventProfilingInfo for other non-blocking commands, because the data provided by clGetEventProfilingInfo is only useful once a command is complete.<br /> <br /> Profiling data for the event that triggered a callback at completion can be taken without any issue.<br /> <br />
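Once a command has completed, its profiling data is just four nanosecond timestamps, and the interesting numbers are the differences between them. The sketch below shows that arithmetic in plain C; the structure and function names are hypothetical, and in a real program the four values would come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with CL_QUEUE_PROFILING_ENABLE.<br /> <br />

```c
#include <stdint.h>

/* The four device timestamps of one completed command, in nanoseconds. */
typedef struct {
    uint64_t queued;    /* command was enqueued by the host */
    uint64_t submitted; /* command was submitted to the device */
    uint64_t started;   /* command started executing */
    uint64_t ended;     /* command finished executing */
} command_times;

/* Time spent in each phase, in nanoseconds. */
typedef struct {
    uint64_t queue_ns;   /* queued to submitted */
    uint64_t launch_ns;  /* submitted to started */
    uint64_t execute_ns; /* started to ended */
} command_breakdown;

command_breakdown break_down(command_times t)
{
    command_breakdown b = {
        t.submitted - t.queued,
        t.started - t.submitted,
        t.ended - t.started
    };
    return b;
}

/* Convert nanoseconds to milliseconds for reporting. */
double ns_to_ms(uint64_t ns)
{
    return (double)ns / 1.0e6;
}
```

A large launch_ns relative to execute_ns suggests that the command sat waiting on the device, which is the kind of gap that other commands might be able to fill.<br /> <br />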
<h2 class="sectionHeading">Using Markers</h2>
Markers provide points of synchronization based on marker events. Programmers can use marker-based approaches if kernels need to execute in a certain order. Using markers and events, programmers can ensure that kernel1 and kernel2 finish and all data is copied to the host before kernel3 starts executing.<br /> <br /> Events provide easy ways to identify commands and provide execution status and profiling information, and can also be used to synchronize commands. Events give developers control at the granularity of individual commands.<br /> <br /> Event wait lists enforce the order in which commands need to execute in a command queue. Developers should always profile all commands to see where time gaps exist, and see if they can be filled with other commands. Profiling usually only gives consistent data when used in large loops, so run often and run on various systems to ensure optimal design choices are made.<br /> <br />
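The ordering rule behind event wait lists is simple: a command may start only when every event in its wait list has completed. The following toy model illustrates that rule in plain C. It does not call OpenCL; the event ids, the status array and the function name are hypothetical and only mirror the kernel1/kernel2/kernel3 example above.<br /> <br />

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy event states; EV_COMPLETE mirrors the value 0 of CL_COMPLETE. */
enum { EV_COMPLETE = 0, EV_PENDING = 1 };

/*
 * `event_status` holds the state of each event, indexed by a hypothetical
 * event id. A command may start only when every event named in its wait
 * list is complete; an empty wait list imposes no constraint.
 */
bool wait_list_satisfied(const int *event_status,
                         const size_t *wait_list, size_t wait_count)
{
    for (size_t i = 0; i < wait_count; ++i) {
        if (event_status[wait_list[i]] != EV_COMPLETE)
            return false;
    }
    return true;
}
```

If events 0, 1 and 2 stand for kernel1, kernel2 and the read-back to the host, then kernel3 with the wait list {0, 1, 2} stays blocked until all three are complete, while a command with a shorter wait list is free to run earlier.<br /> <br />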
<h2 class="sectionHeading">About the Author</h2>
<img src="http://software.intel.com/file/40587"  /> Vinay Awasthi works as an Application Engineer for the Apple* Enabling Team at Intel at Santa Clara. Vinay has a Master’s Degree in Chemical Engineering from Indian Institute of Technology, Kanpur. Vinay enjoys mountain biking and scuba diving in his free time.<br  /> <br />
<div id="vc-meta" >
<div id="vc-meta-pubdate">12-21-2011</div>
<div id="vc-meta-modificationdate">12-21-2011</div>
<div id="vc-meta-taxonomy"></div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-abstract">This white paper is the fourth in a series of white papers on OpenCL describing how to set up and use events in multithreaded design. This white paper will go over various design choices using OpenCL™ user and command queue-related events for kernels running on CPUs.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/opencl-using-events/</link>
      <pubDate>Wed, 21 Dec 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/opencl-using-events/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/opencl-using-events/</guid>
      <category>Parallel Programming</category>
    </item>
  </channel></rss>
