<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Wed, 23 May 2012 18:31:27 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/home/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/home/type/technical-article/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Unhandled Exceptions when Debugging OpenMP applications</title>
      <description><![CDATA[ <strong>Problem :</strong> <br />When debugging OpenMP* applications built with the Intel® C++ or Fortran Compiler, an unhandled exception dialog like the following<br /><br /><em>First-chance exception at 0x74f9b9bc (KernelBase.dll) in &lt;appname&gt;.exe: 0xA1A01DB1: <br />Intel Parallel Debugger Extension Exception 1<br /><br /></em>may appear. Normally these exceptions are handled by the Intel® Parallel Debugger Extension in the background, but in the environment specified below the exception dialog may pop up whenever an OpenMP application is started for debugging and again when it is terminated.<br /><br /><br /><strong>Environment :</strong> <br />Windows* 7 Enterprise, SP1, 64-bit<br />Microsoft Visual Studio* 2010 SP1<br />Intel® Parallel Studio XE 2011 SP1 Update 2<br />Intel® Inspector XE 2011 Update 9<br /><br /><br /><strong>Resolution :</strong> <br />When the unhandled exception dialog appears, click 'Continue'. The root cause of the problem has been identified and will be fixed in a future version. ]]></description>
      <link>http://software.intel.com/en-us/articles/unhandled-exceptions-when-debugging-openmp-applications/</link>
      <pubDate>Tue, 27 Mar 2012 15:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/unhandled-exceptions-when-debugging-openmp-applications/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/unhandled-exceptions-when-debugging-openmp-applications/</guid>
      <category>Parallel Programming</category>
      <category>Intel Software Network communities</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
      <category>Intel® Inspector XE Knowledge Base</category>
    </item>
    <item>
      <title>Using Intel® Power Checker to measure the energy performance of a compute-intensive application </title>
      <description><![CDATA[ <p>Intel® Power Checker provides developers with a quick and easy way to evaluate the idle power efficiency of their applications on mobile platforms with Intel® Core™ processor or Intel® Atom™ technology running the Microsoft Windows* operating system. Any compiled-language application, especially one designed to run on technology based on Intel® products, as well as Java Framework applications, can be analyzed by Intel Power Checker. The checker can be used with or without a supported external power meter.</p>
<p>Intel Power Checker 2.0 now supports measurement both on battery and with the system plugged into an external AC power source. Measurement on external power is supported only on Intel® Second Generation Core processors, and only if the Intel® Power Gadget software has been installed.</p>
<p>For this article, I took a very compute-intensive parallel application that I wrote to solve instances of the logic puzzle Akari. The code uses a backtracking algorithm to explore how to place light bulbs onto a grid under constraints dictated by the rules of the puzzle and the layout of the puzzle instance. Potentially millions of independent tasks can be generated by the code as the solution space is searched by threads executing those tasks. This solution method is eminently scalable to a large number of threads and is able to keep many cores running at peak speed for a sustained amount of time.</p>
<h2>How to Use Intel Power Checker</h2>
<p>The Intel Power Checker provides a GUI wizard that leads you through the four steps of power analysis. These four steps in the checker are described below. Before starting the assessment, be sure to know which section of your application (a workload) you want to be measured, as the Power Checker will only measure a 30 second execution interval. (If you want to measure the entire execution workload, you should try some other tool, like Intel Power Gadget.) Your workload could be a compute-intensive portion or an I/O-intense section or just some point in execution that typifies the majority of expected usage.</p>
<h3 >Step 1: Specifying the Power Meter device</h3>
<p>If you have an external power meter attached to your test system, you can select the model being used on the first screen of the wizard. The default is that no external device is being used. For this default case, Intel Power Checker will determine if the system is capable of providing power consumption data and if the correct power driver, EzPwr.sys, is installed. (The driver is part of the default installation of <a href="http://software.intel.com/en-us/articles/intel-power-gadget/">Intel Power Gadget</a>.)</p>
<h3 >Step 2: Measure System Baseline</h3>
<p>The first measurement that the Intel Power Checker will perform is on the next screen within the wizard. This is to measure the baseline power consumption of the hardware without your application running. Prior to this measurement phase any unnecessary processes such as operating system updates, Windows Indexing Service, virus scans, media players, and internet browsers should have been shut down. In other words, to get the most accurate results you should make your test system as idle as possible and ensure that nothing will become a foreground process during your measurement runs.</p>
<p>Once you have a quiescent system, click the “Start” button to begin this phase of the testing. The Intel Power Checker waits 15 seconds to allow the system to come to an idle state before starting the measurements. You need to be sure to position your mouse and the keyboard out of reach, or keep your hands away from them, to avoid any stray contact that might trigger some response from the platform. After the pause, the checker will observe the system for 30 seconds in this idle state. A progress bar will show the time remaining in each part of this phase. Once the baseline data collection is complete, click the “Next” button to proceed to the next phase.</p>
<h3 >Step 3: Measure Active Application</h3>
<p>Before you are taken to the next screen in the wizard, you are instructed to start the application you are interested in measuring. Start up your application and click the “OK” button to advance the GUI to the next screen. Once you have reached the Step 3 screen, use the scroll bar to locate your application in the process list and click on that line to select it. If your application is not listed, click the “Refresh List” button so that your application’s process will be available to select. In addition, you can use the “Apply Filter” button to narrow down the list in order to find your application’s process quickly. After selecting your application from the list, click “Next” to move on to the data collection for this phase. Before starting the measurement, be sure your application has reached the desired point of measurement. If there are some initial setup computations that are not of interest, you will need to get past them before letting Intel Power Checker begin measurement. For my Akari application, there is very little setup time. It was typically in the thick of computation by the time I had gotten to the point of selecting the process from the list.</p>
<p>As soon as I could, I clicked the “Start” button to begin capturing measurement data. Since this is one of the crucial power measurements for your application, always begin capturing data <b>after</b> the workload or critical section has begun and make sure this active execution will run longer than the 30 seconds needed to complete the measurement time.</p>
<h3 >Step 4: Measure Idle Application</h3>
<p>The final phase is to measure your application’s idle power consumption. This is another important phase of energy efficiency measurement of an application since your application must not only do efficient computation, but also not waste energy when sitting idle.</p>
<p>This step doesn’t make much sense within my compute-intensive application since there is no idle state of the application. Once you start the application on a given puzzle instance, it simply computes all legal solutions in parallel and then ends. As (multiple) solutions are found, they are printed out by the thread that found it. If there are no solutions, a message is printed just before the application terminates. This latter case describes the workload I used for my tests. Because you must have your application running in “idle” mode for this step, I left the application running at full speed and simply allowed Power Checker to take its measurements.</p>
<p>If your application does have an idle state, perhaps waiting for interaction from the user, the checker will give the system 15 seconds to calm down fully before taking a final 30 second measurement.</p>
<p>Upon completion of this last data collection phase, you will be able to proceed to the results screen within the Intel Power Checker wizard. After all three measurement phases have been completed, a Tool Report File will be generated containing all of the results for later analysis.</p>
<h3 >What data is presented</h3>
<p>The View Results screen of the Intel Power Checker wizard provides basic information about the software assessment. The type of processor in your system and the type and model of the power source that was used are given. Four numerical values for each of the three measurement phases are presented. These values are:</p>
<ul >
<li><b>Elapsed Time:</b> The exact number of seconds that each of the phases lasted.</li>
<li><b>Energy Consumption:</b> The rate at which the battery was discharged during each of the three phases.</li>
<li><b>Average C3 State Residency:</b> The percentage of time that the system was in the C3 state during the data collection period.</li>
<li><b>Platform Timer Period:</b> The number of milliseconds that the platform timer collected.</li>
</ul>
<p><img src="http://software.intel.com/file/42410" /></p>
<p>Typical results should show a larger C3 state residency percentage in the idle application measurement (the middle of the three columns on the View Results screen) than in the active measurement. Because my puzzle-solving application was computing just as hard during the idle step as it did in the active execution measurement step, this was not the case for my results. This is atypical for the kind of application Intel Power Checker assumes will be measured. Thus, the C3 State Residency values provided by the tool for the idle application were not valid for my particular application.</p>
<p>The name of the report file and the directory in which it can be found are listed on the View Results screen.</p>
<h2>Some Caveats</h2>
<p>Below are some things you should consider before and during a measurement run using Intel Power Checker.</p>
<ul >
<li>Before you start using Intel Power Checker, be sure your chosen workload will run for at least 30 seconds from the point at which you wish to measure power consumption. In my case, I required a data set that would force the application to run for at least 75 seconds (30 for active measurement, 15 for idle setup, and 30 for idle measurement) plus the time I needed to click boxes and find my application in the process list. Since I ran the application on several different numbers of threads, I needed to be sure that the fastest execution time was still long enough to get all the timing steps completed during an Intel Power Checker run.</li>
<li>Upon starting Intel Power Checker, the checker may first report that the platform timer period is invalid. In this case, some currently running (background) process has changed the default and it will be up to the user to determine which currently running application has changed the value. Once you have identified the culprit you must stop this process or service before restarting Intel Power Checker. If you are unsure about which active process is preventing Intel Power Checker from starting, you will need to turn off processes one at a time and try Intel Power Checker until the error message doesn’t come up. </li>
<li>Instructions on the Step 3 screen ask you not to touch the keyboard or mouse. If you are measuring an interactive application or you must interact with the application to generate activity for the full 30 seconds, you will need to touch the keyboard and/or mouse. If possible, a workload that can forego interactivity and still compute for the 30 seconds of measurement time would be best. However, if interaction by the user is part of how the application is utilized, interfacing through peripherals will give you a more accurate measure of the overall energy consumption for typical application usage.</li>
<li>A data file is created during each phase of the Intel Power Checker assessment to hold the current information. If you cancel the assessment in any of the three phases then a data file will not be created for that phase. After all three phases have been completed, a Tool Report File, in XML format, will be generated containing all of the results. You can find the name of the report file and where it is located on the View Results screen.</li>
<li>The “Submit Results” button on the View Results screen is optional and only intended for members of the <a href="http://software.intel.com/partner/overview">Intel® Software Partner Program</a> to submit their measurement results to the program. If you are not a member, do not submit your results. Simply click on the “Close” button after you have examined the results compiled by Intel Power Checker.</li>
</ul>
<h2>Some Results</h2>
<p>The purpose of this article is not to determine the best scenario for running my Akari solver application in the most energy efficient way. You will want to do this for your application, though, and this article has given you the background on Intel Power Checker to determine if this checker can help you quantify the current power consumption of your application. Also, as you make modifications to the application you will be able to determine if those changes improve the energy efficiency or cause your application to suck more power than before.</p>
<p>In addition to the average C3 State Residency percentage, the checker delivers the total number of Joules expended during the 30 seconds of execution time measured. From this I can compute the average Watts for execution parts of the application. I have found that a better metric for comparing different applications or different runs of the same application is milliwatt hours (mWh). You need the total execution time of the execution portion of the application to compute this value. Since Intel Power Checker only measures activity in 30 second segments, you will need to have some timing data available, which I happened to have for the different runs I made of my Akari application.</p>
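<p>The arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers are made up for the example, and the calculation assumes that the average power seen during the 30 second measurement window is representative of the whole run.</p>

```python
def average_watts(joules, seconds=30.0):
    """Average power over the measurement window: watts = joules / seconds."""
    return joules / seconds


def milliwatt_hours(avg_watts, total_seconds):
    """Energy for the whole run in mWh (watts * seconds = joules, 1 mWh = 3.6 J)."""
    return avg_watts * total_seconds / 3.6


# Hypothetical values, not measurements from this article.
window_joules = 600.0     # joules reported for the 30 second window
run_seconds = 90.0        # total execution time from separate timing data

watts = average_watts(window_joules)
print("average power (W):", round(watts, 2))
print("energy for run (mWh):", round(milliwatt_hours(watts, run_seconds), 2))
```

<p>Two runs of the same workload can then be compared on their mWh values even when their execution times differ.</p>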
<p>I found significant differences when running with and without Hyper-Threading Technology (HT) turned on. A difference in execution time and power usage was also evident depending on whether the platform was running on battery (DC) power or on wall socket (AC) power. For example, when running with HT on and a full complement of four threads on the 4 logical cores in my system, the run on AC power was 1.19X faster than the same workload on DC power. However, the AC run took 1.15X more power.</p>
<p>Comparing results between runs on DC power and runs on AC power is not a good comparison, especially in this case. The power source is detected by the system and the processor is allowed to run with Intel® Turbo Boost Technology at a higher frequency if the platform is using external power. Even so, you may need to be concerned about the power consumption of your application under both power sources, and you will need to run measurement experiments within each setup to gauge how your application modifications affect overall power consumption.</p>
<h3 >System Requirements</h3>
<p>You can use Intel Power Checker on a laptop or netbook based on Intel® Core™ processor or Intel® Atom™ processor technology. A desktop with an external power meter or a desktop that is capable of providing the power consumption information can also be analyzed. A Java* Runtime Environment (JRE) (version 6 update 11 or higher) is also required to run the checker. Supported operating systems are Microsoft Windows* XP (Service Pack 3), Microsoft Windows Vista* (Service Pack 2), Microsoft Windows* 7 (Service Pack 1 [32-bit and 64-bit]), and Microsoft Windows* Server 2008 R2.</p>
<h3 >Download link</h3>
<p>To download the Intel Power Checker installation package, go to the following link:</p>
<p><a href="http://software.intel.com/partner/app/software-assessment">http://software.intel.com/partner/app/software-assessment/</a>. Click on the Intel Power Checker tab to move down to the download link.</p>
<h3 >Other supporting links</h3>
<p>There is a video demonstration of using Intel Power Checker, “A Look at Intel Power Checker,” at the link: <a href="http://software.intel.com/en-us/videos/channel/intel-software-partner-program/a-look-at-the-intel-power-checker/1127786023001">http://software.intel.com/en-us/videos/channel/intel-software-partner-program/a-look-at-the-intel-power-checker/1127786023001</a>. Dave Valdovinos and Taylor Kidd, both from Intel, show off the GUI wizard as it measures the power performance of a game-like application.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/</link>
      <pubDate>Mon, 12 Mar 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/</guid>
      <category>Mobility</category>
      <category>Parallel Programming</category>
      <category>Intel® AppUp(SM) Developer Community</category>
      <category>Intel Software Network communities</category>
      <category>Intel SW Partner program</category>
      <category>Game Development</category>
      <category>Power Efficiency</category>
      <category>Intel® vPro™ Developer Community</category>
      <category>Resources For Software Developers</category>
      <category>Ultrabook</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Using Intel Cluster Checker to check that MPI applications will properly run over InfiniBand</title>
      <description><![CDATA[ <p class="MsoNormal">One of the benefits of Intel Cluster Checker is that it acts as an application proxy. If the tool passes, then there is a high probability that an MPI application will run properly.</p>
<p class="MsoNormal">To ensure this, Intel Cluster Checker test modules carry out the following exhaustive steps:</p>
<ol>
<li>Check the base libraries and their uniformity (<b>base_libraries</b>)</li>
<li>Check that MPI tools have consistent paths (<b>mpi_consistency</b>)</li>
<li>Check that per-node MPI jobs can run Hello World independently (<b>intel_mpi_rt</b>)</li>
<li>Check that a global Hello World executes successfully across compute nodes (<b>intel_mpi_rt_internode</b>)</li>
<li>Run Intel MPI Benchmarks such as Ping Pong to check available latency and bandwidth (<b>imb_pingpong_intel_mpi</b>)</li>
<li>Stress the communication system by running the HPCC benchmark (<b>hpcc</b>)</li>
</ol>
<p class="MsoNormal">If the tool reports a problem, then an MPI application may have trouble completing its work.</p>
<p class="MsoNormal">These steps will even catch potential timeouts due to a wrong configuration of the network stack and, most importantly, bad cabling or hardware interfaces that are down. However, if the cluster uses InfiniBand adapters, there is a known issue to be aware of. The global MPI check can hang, as any other MPI application would, if InfiniBand is not correctly configured and online.</p>
<blockquote>
<p class="MsoNormal">Intel(R) MPI Library Runtime Environment (All nodes), (intel_mpi_rt_internode, 1.8.....................................................^C</p>
<p class="MsoNormal">Caught signal INT, cleaning before termination.</p>
</blockquote>
<p class="MsoNormal">With InfiniBand setups, the configuration of Intel Cluster Checker must define openib and dat_conf as dependencies of intel_mpi_rt_internode. This ensures that the InfiniBand devices are properly detected and healthy: openib checks the hardware devices, and dat_conf checks the DAPL software interface.</p>
<blockquote>
<p class="MsoNormal">&lt;intel_mpi_rt_internode&gt;</p>
<p class="MsoNormal">&lt;add_dependency&gt;dat_conf&lt;/add_dependency&gt;</p>
<p class="MsoNormal">&lt;add_dependency&gt;openib&lt;/add_dependency&gt;</p>
<p class="MsoNormal">&lt;/intel_mpi_rt_internode&gt;</p>
</blockquote>
<p class="MsoNormal">This decision cannot be made automatically, because whether to use the low-latency, high-bandwidth capabilities of InfiniBand during the check is at the discretion of the user. For instance, the administrator may want to double-check that an Ethernet fabric can be properly used to run MPI applications.</p>
<p class="MsoNormal">Be aware that this manual requirement may be lifted in the near future.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-cluster-checker-to-check-that-mpi-applications-will-properly-run-over-infiniband/</link>
      <pubDate>Tue, 07 Feb 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-cluster-checker-to-check-that-mpi-applications-will-properly-run-over-infiniband/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-cluster-checker-to-check-that-mpi-applications-will-properly-run-over-infiniband/</guid>
      <category>Parallel Programming</category>
      <category>Intel® Cluster Ready</category>
      <category>Tools</category>
      <category>Intel Software Network communities</category>
      <category>Resources For Software Developers</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Ultrabook™ and the Intel® Energy Checker SDK</title>
      <description><![CDATA[ <h2 class="sectionHeading">Abstract</h2>
With the advent of the Ultrabook™<sup>1</sup>, the demand for applications that are power misers continues to rise. The Intel® Energy Checker SDK can be used to instrument an application and collect data to help a developer pinpoint power hungry features that can be optimized for power. This article gives an overview of the Intel Energy Checker SDK and discusses how it can be used to advantage when improving energy usage on an Ultrabook.<br /><br />
<h2 class="sectionHeading">More Work, Less Power</h2>
An Ultrabook™ needs to budget its power consumption very carefully to extend its usefulness while running on battery. Therefore, applications that use less energy are preferred. Often, application developers create their program on a desktop system where power and energy consumption are less important than raw performance. Not only should applications be developed to conserve power when active, they should also be developed to minimize energy usage during program idle periods. Idle behavior is often overlooked, and improving it can greatly extend battery life. If power issues are ignored, running a program on an Ultrabook will result in unpleasant surprises for the user. If developers test their application on an Ultrabook system during development, they will gain insight into how well the program runs in a power-limited environment. An analysis tool such as the <a href="http://software.intel.com/en-us/articles/intel-energy-checker-sdk/">Intel® Energy Checker SDK</a> can be a powerful companion during the optimization phase for software designed for an Ultrabook.<br /><br />
<h2 class="sectionHeading">Energy Efficiency</h2>
Before explaining what Intel Energy Checker SDK contains, a discussion on Energy Efficiency (EE) is in order. This is a term that is used extensively in the Intel Energy Checker SDK. There is no universally accepted definition of EE, so for the purposes of this tool it is defined as:<br />
<p ><em>EE=Work/Energy</em></p>
<em>Work</em> is defined as the amount of “<em>useful work</em>” done by a software application. There is no concise, easy definition of the term <em>useful work</em> either, as what is considered <em>useful work</em> in one program may be quite different in another application. The developer is required to make that determination. For example, one might consider the areas of a movie player program where it provides the customer value (such as decoding the movie) as useful work whereas areas of the program that are accessing resources, waiting on input, or performing synchronization would not.<br /><br />
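<p>As a small illustration of the definition above, the Python sketch below computes EE for a movie player in which decoded frames are counted as useful work. The function name and all of the numbers are hypothetical; they are not taken from the SDK or from any measurement.</p>

```python
def energy_efficiency(work_units, energy_joules):
    """EE = Work / Energy, expressed as units of useful work per joule."""
    if not energy_joules > 0:
        raise ValueError("energy must be a positive number of joules")
    return work_units / energy_joules


# Hypothetical movie player: each decoded frame counts as one unit of work.
frames_decoded = 7200
ee_before = energy_efficiency(frames_decoded, 900.0)   # before optimization
ee_after = energy_efficiency(frames_decoded, 750.0)    # after optimization

print("frames per joule before:", ee_before)
print("frames per joule after:", ee_after)
```

<p>Because the amount of work is the same in both cases, the higher EE value simply reflects that less energy was spent to decode the same frames.</p>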
<h2 class="sectionHeading">Code Instrumentation</h2>
The first step in using Intel Energy Checker SDK to help determine an application’s EE is to create and use “counters” in the software to determine quantities of “useful work”. A counter is defined as a 64-bit (8 byte) variable that keeps a running total of how many times a particular event occurs. In the C language, this corresponds to the unsigned long long data type. A developer can create one or more counters during the initialization portion of the software. Next, a container for the counters can be created, called a “Productivity Link” (PL)<sup>2</sup>. Each PL holds up to 512 counters, and up to 10 different PLs can be open at one time, but most software will require far fewer counters and PLs.<br /><br />During the application runtime, values can be written to any counter in the PL, based on the developer’s requirements. Intel Energy Checker SDK can collect the information from the PLs in order to determine how much work was done.<br /><br />
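<p>To make the counter and Productivity Link concepts concrete, here is a toy in-memory model written in Python. It is not the Intel Energy Checker API: the class, its methods, and the counter names are invented for this illustration, and only the limits (64-bit counters, 512 counters per PL, 10 open PLs) come from the description above.</p>

```python
MAX_COUNTERS_PER_PL = 512    # limit described in this article
MAX_OPEN_PLS = 10            # limit described in this article
UINT64_MODULUS = 2 ** 64     # counters behave like 64-bit unsigned integers


class ToyProductivityLink:
    """In-memory stand-in for a PL (hypothetical, not the SDK interface)."""

    open_count = 0

    def __init__(self, counter_names):
        if len(counter_names) > MAX_COUNTERS_PER_PL:
            raise ValueError("too many counters for one PL")
        if ToyProductivityLink.open_count >= MAX_OPEN_PLS:
            raise RuntimeError("too many PLs open at one time")
        ToyProductivityLink.open_count += 1
        self.counters = dict.fromkeys(counter_names, 0)

    def add(self, name, amount=1):
        # Keep a running total, wrapping around like an unsigned 64-bit value.
        self.counters[name] = (self.counters[name] + amount) % UINT64_MODULUS

    def close(self):
        ToyProductivityLink.open_count -= 1


# Create the counters at initialization, then update them as work is done.
pl = ToyProductivityLink(["frames_decoded", "bytes_read"])
for _ in range(3):
    pl.add("frames_decoded")
pl.add("bytes_read", 4096)
print(pl.counters)
pl.close()
```

<p>In a real application the counters would be created and written through the SDK itself, so that the SDK tools can read them while the program runs.</p>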
<h2 class="sectionHeading">Energy Consumed</h2>
The second part of finding the EE of a software application is to measure how much energy was consumed while the program was running. To do this, Intel Energy Checker SDK uses two tools which are included in the SDK download: Energy Server (ESRV) and Temperature Server (TSRV). ESRV is used to monitor energy and power consumption as reported by external power tools while TSRV monitors temperature related information as reported by environmental probes. ESRV and TSRV counters can be accessed by any program using the Intel Energy Checker API. In addition to the counters created by the developer to determine quantities of work, the developer will want to add counters to collect information from ESRV and possibly TSRV. There are three different ways to set up ESRV:<br /><br /><ol>
<li>Use a power meter to collect actual “platform energy and power” information.<br /><br />There are several different power meters that work with the Intel Energy Checker SDK. Please consult the <em>Intel® Energy Checker SDK User Guide</em> included in the download or found on the <a href="http://software.intel.com/en-us/articles/intel-energy-checker-sdk/">Intel® Energy Checker SDK page</a> to determine which power meters will work and how they should be attached to the test system.<br /></li>
<li>Use <a href="http://software.intel.com/en-us/articles/intel-power-gadget/">Intel® Power Gadget</a> to collect “processor energy and power” usage information on the 2nd Generation Intel Core™ processor family. An external power meter that reports platform power can also be used together with Intel Power Gadget, which provides processor power. The blog Accessing Intel® Power Gadget From Intel® Energy Checker SDK by Intel engineer Jun De Vega discusses how to enable Intel® Power Gadget with Intel® Energy Checker.<br /></li>
<li>Use the simulation method, which relies on the CPU utilization percentage returned by the OS. This method does not require a hardware probe. The Intel Energy Checker SDK offers this method as an option for all processors (rather than just the 2nd Generation Intel Core processor family, as with Intel Power Gadget) to accommodate users who do not have a power meter. Included in the SDK is a support library for accessing this metric.</li>
</ol>
<p ><img src="http://software.intel.com/file/41168" /><br /><br /><strong>Figure 1:</strong> Conceptualized drawing of Intel Energy Checker setup with Instrumented Application, Power Meter and Environmental probes attached</p>
<h2 class="sectionHeading">Intel Energy Checker Extras</h2>
There are two companion tools that are bundled with the Intel Energy Checker SDK in addition to those already mentioned. The PL GUI Monitor is a user interface that displays Productivity Link (PL) counters in a running program that has already been instrumented with the Intel Energy Checker API. The PL CSV Logger<sup>3</sup> is an application that can collect and write PL counters to a CSV file for later analysis in a variety of spreadsheet applications.<br /><br />Included with the Intel Energy Checker SDK is the <em>Intel® Energy Checker SDK Companion Application User Guide</em> that discusses the features and capabilities of both of these tools.<br /><br />
<p ><img src="http://software.intel.com/file/41169" /><br /><br /><strong>Figure 2:</strong> PL GUI Monitor running while a picture is being rendered</p>
The entire Intel Energy Checker SDK includes other build, scripting, interoperability, and monitoring tools to help developers instrument code and collect energy metrics.<br /><br />A white paper entitled “<em>How Green Is Your Software?</em>” is available for download from the SDK site. This paper discusses approaches for making software power efficient. Look for it in the “Code, Resources and Documentation” section of the <a href="http://software.intel.com/en-us/articles/intel-energy-checker-sdk/">Intel Energy Checker SDK page</a>. Several blogs about Intel Energy Checker that were written by Intel Engineer Jamel Tayeb will also be helpful:<br /><br /><a href="http://software.intel.com/en-us/blogs/2010/04/15/using-the-intel-energy-checker-sdk-at-home/?wapkw=(Energy+Checker)">Using the Intel® Energy Checker SDK at Home</a><br /><br /><a href="http://software.intel.com/en-us/blogs/2010/02/19/creating-a-simple-device-library-for-intel-energy-checker-sdk/?wapkw=(Energy+Checker)">Creating a Simple Device Library for Intel® Energy Checker SDK</a><br /><br /><a href="http://software.intel.com/en-us/blogs/2010/03/30/measuring-the-energy-consumed-by-a-command-using-the-intel-energy-checker-sdk/?wapkw=(Energy+Checker)">Measuring the energy consumed by a command using the Intel® Energy Checker SDK</a><br /><br />All of these resources allow a developer to get started in gathering helpful information.<br /><br />
<h2 class="sectionHeading">Optimizing Applications for Ultrabooks</h2>
Once a program has been instrumented to collect counter information and an energy collection plan is in place (either simulation or power meter), the setup is complete. The developer will then be able to gather information about the application’s energy usage profile and to incorporate optimizations to improve results.<br /><br />There are several areas of optimization the Ultrabook developer can select for improvements:<br /><br />
<div >Consider modifying the application to be aware of the power status and changing usage to reduce energy consumption when the system is on battery.<br /><br />Check the hardware and software system power management possibilities to choose a balanced power setting. This could be a recommended setting suggested in application documentation.<br /><br />Reduce power usage while the application is actively running or doing work. Compute intensive parts of the program will likely benefit from multi-threading and vectorization techniques.<br /><br />Reduce power usage while the application is idle. Being able to minimize the timer tick rate or setting up periodic actions to happen within the same wakeup period are examples of how to reduce idle application power usage.</div>
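<br /><br />The last point, making periodic actions happen within the same wakeup period, can be illustrated with a minimal sketch. The helper name align_to_wakeup and the millisecond units are assumptions made for illustration and are not part of the Intel Energy Checker SDK. Each requested deadline is rounded up to the next multiple of a shared period, so that independent timers expire together:<br /><br />
<pre name="code" class="cpp">#include &lt;cstdint&gt;

// Hypothetical helper: round a requested deadline up to the next multiple of
// a shared wakeup period so that independent periodic actions fire together.
// Both values are in milliseconds; period_ms must be greater than zero.
std::uint64_t align_to_wakeup(std::uint64_t deadline_ms, std::uint64_t period_ms)
{
    std::uint64_t remainder = deadline_ms % period_ms;
    if (remainder == 0)
        return deadline_ms;              // already on a shared wakeup boundary
    return deadline_ms + (period_ms - remainder);
}
</pre>
With a period of 100 ms, actions requested for 130 ms and 180 ms are both deferred to 200 ms and share one wakeup. The price is up to one period of added latency, so this approach is better suited to background work than to interactive work.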
<br /><br />
<h2 class="sectionHeading">Summary</h2>
With the growth of Ultrabook devices, it will benefit program designers and developers to take a look at ways to save energy while providing a great user experience on an Ultrabook. Intel Energy Checker SDK can provide the means to identify the key areas of focus and confirm the positive results achieved after optimization. Long live Ultrabook!<br /><br />
<h2 class="sectionHeading">About the Author</h2>
<img src="http://software.intel.com/file/41170"  /> Judy Hartley is a Software Applications Engineer who has been working in the Software and Services Group since 2005. She has contributed to many software products and written about her experiences through blogs and whitepapers. Recently Judy has been working on Graphics and Power tools and training for future Intel processors.<br /><br  />
<hr />
<br /><sup>1</sup> Ultrabook is a trademark of Intel Corporation in the U.S. and/or other countries.<br /><br /><sup>2</sup> A Productivity Link is a term used by Intel Energy Checker to represent an arbitrary or logical collection of counters.<br /><br /><sup>3</sup> CSV is the acronym for Comma Separated Values.<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/ultrabook-and-the-intel-energy-checker-sdk/</link>
      <pubDate>Tue, 24 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/ultrabook-and-the-intel-energy-checker-sdk/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/ultrabook-and-the-intel-energy-checker-sdk/</guid>
      <category>Mobility</category>
      <category>What If Experimental Software</category>
      <category>Tools</category>
      <category>Intel Software Network communities</category>
      <category>Intel SW Partner program</category>
      <category>Code &amp; Downloads</category>
      <category>Power Efficiency</category>
      <category>Resources For Software Developers</category>
      <category>Ultrabook</category>
    </item>
    <item>
      <title>How to Automate Static Security Analysis with Intel(R) C++ Compiler for Linux*</title>
      <description><![CDATA[ <p>Automate the static security analysis check done by the Intel(R) C++ Compiler for Linux. Static security analysis is the process of finding errors and security weaknesses in software through detailed analysis of source code.<br /><br />An automated quality gate like this one can notably reduce code review effort and decreases the likelihood of bugs and security threats being found once the product is in production. <br /><br />To automate the static security analysis as a quality gate in any project, run the check without the graphical user interface, which requires human interaction.</p>
<p> </p>
<p>In the case of legacy projects, ask the developers to submit new code only if it reduces the number of findings.<br />In the case of coding from scratch, allow no findings before new code is uploaded to your repository.<br /><br />When the check is enabled (<strong>-diag-enable sc3</strong>) and the code is compiled, a new folder is created in which the findings are stored in a custom XML format.</p>
<blockquote>
<p>$ file rXsc/data.X/rXsc.pdr<br />rXsc/data.X/rXsc.pdr: XML document text</p>
</blockquote>
<br />The xmlstar* package, which provides a command line tool for processing XML documents, can be used to list the findings and the associated location information (file, line and function).<br /><br /><a href="http://xmlstar.sourceforge.net/">http://xmlstar.sourceforge.net</a><br /><br />The following command lists the findings; use it to verify that none remain before proceeding with the usual development cycle. <br /><br />
<blockquote>
<p>$ xml sel -t -m /diags/diag -v "concat(message/thread/stacktrace/loc/file, ':', message/thread/stacktrace/loc/line, ':', sc_verbose)" -n rXsc/data.0/rXsc.pdr <br />/home/$USER/work/$PROD/src/pool.c:157:pool.c(157): warning #12178: this value of "ret" isn't used in the program<br />/home/$USER/work/$PROD/src/pool.c:186:pool.c(186): error #12192: unreachable statement<br />/home/$USER/work/$PROD/src/pool.c:216:pool.c(216): warning #12135: procedure "pool_done" is never called</p>
</blockquote>
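<p>To turn the listing into a quality gate, the output of the command above has to be counted and compared against a policy. The following is a minimal sketch of that step and not part of the compiler or of xmlstar: the names findings_t, count_findings and gate_passes are hypothetical, and the sketch assumes one finding per output line, each containing ": error #" or ": warning #" as in the sample output.</p>
<pre name="code" class="cpp">#include &lt;istream&gt;
#include &lt;string&gt;

// Hypothetical summary of one analysis run.
struct findings_t {
    int errors;
    int warnings;
    int total() const { return errors + warnings; }
};

// Count findings in the text printed by the "xml sel" command.
// Assumes one finding per line, each containing ": error #" or ": warning #".
findings_t count_findings(std::istream&amp; in)
{
    findings_t f = {0, 0};
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(": error #") != std::string::npos)
            ++f.errors;
        else if (line.find(": warning #") != std::string::npos)
            ++f.warnings;
    }
    return f;
}

// Gate policy as read from the text: a new project allows no findings;
// a legacy project must end up with fewer findings than its previous run.
bool gate_passes(const findings_t&amp; now, int previous_total, bool legacy)
{
    if (!legacy)
        return now.total() == 0;
    return now.total() &lt; previous_total;
}
</pre>
<p>A build script could feed the command output to count_findings and fail the build when gate_passes returns false. Whether a legacy project must strictly reduce the number of findings on every submission is a policy choice; the sketch follows the stricter reading.</p>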
<p> </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-to-automate-static-security-analysis-with-intelr-c-compiler-for-linux/</link>
      <pubDate>Fri, 13 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-automate-static-security-analysis-with-intelr-c-compiler-for-linux/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-to-automate-static-security-analysis-with-intelr-c-compiler-for-linux/</guid>
      <category>Tools</category>
      <category>Intel Software Network communities</category>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Resources For Software Developers</category>
    </item>
    <item>
      <title>Intel Cluster Checker 1.8 Execution Time</title>
      <description><![CDATA[ <p>This article shows reference times for a full execution on different node counts, similar to the one required for Intel Cluster Ready architecture compliance. It is expected that a simple interpolation of the provided values will help to roughly estimate execution time during troubleshooting.<br /><br />The executed command line is shown below; it uses an almost empty configuration file, only having the node list file location. This configuration selects default values for all checks. MPI-related tests and benchmarks are executed over the best available messaging fabric.<br /><br />$ cluster-check config.xml --certification<br /><br />On a reference system the wall time was:<br /><br />64 nodes: 2097 seconds (about 0.58 hours)<br />128 nodes: 2825 seconds (about 0.78 hours)<br />256 nodes: 5655 seconds (about 1.57 hours)<br />320 nodes: 6915 seconds (about 1.92 hours)<br /><br /><img height="284" width="306" src="http://software.intel.com/file/39168" alt="walltime.png" title="walltime.png" /><br /><br />This complete check covers different tests: hardware and software uniformity, health and functional wellness behavior, individual node and cluster wide performance, etc. 
Check the <a href="http://software.intel.com/en-us/articles/intel-cluster-ready-document-library/">product documentation</a> to find out how to run a different set of test modules for lighter or deeper coverage, with correspondingly reduced or increased execution time.<br /><br />If your system has hundreds of nodes, check this <a href="http://software.intel.com/en-us/articles/running-intel-cluster-checker-in-big-clustered-systems/">article</a> for more details.<br />The details of the system used to gather reference data can be found <a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-use-of-intel-mkl-in-hpcc-benchmark/?wapkw=(mkl+hpcc)">here</a>.<br /><br />Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-cluster-checker-18-execution-time/</link>
      <pubDate>Sat, 08 Oct 2011 20:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-cluster-checker-18-execution-time/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-cluster-checker-18-execution-time/</guid>
      <category>Software Products General</category>
      <category>Tools</category>
      <category>Intel Software Network communities</category>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Using Intel® TBB in network applications: Network Router emulator</title>
      <description><![CDATA[ <p><b>Introduction</b></p>
<p>Intel® Threading Building Blocks (TBB) is used in a wide range of applications. When performance matters and a multi-core platform is available, TBB is a good addition to a C++ program. Network applications are usually heavily loaded: they process huge amounts of traffic under tight processing-time constraints. This article shows how TBB can be used in network packet processing software to improve its throughput and processing time.</p>
<p>As a sample project I created a simplified Network Router emulator. A network router is a device that routes and transmits IP (Internet Protocol) packets in a local area network (LAN). It connects several PCs and gives them access to the Internet and to the internal network. The device has several internal network interfaces and one external interface.</p>
<p>The sample project emulates the logic of a network router. It provides the following functionality:</p>
<ul>
<li>Input packets from file - the application is just a model, so there is no need for a real connection to a network interface. Reading from a file emulates reading from a network interface.</li>
<li>NAT - Network Address Translation. The router has only one external IP address, but packets must be delivered to several internal devices behind the router. NAT maps ports and IP addresses from external to internal and vice versa.</li>
<li>IP routing - delivering packets to the appropriate router NIC (Network Interface Controller) according to the destination IP.</li>
<li>Bandwidth management - some traffic is real-time, and it is critical to deliver these packets as quickly as possible (e.g. voice over IP). VoIP protocols carry telephone conversations, and delays would degrade quality. The router can prioritize these critical packets so that they are processed sooner.</li>
</ul>
<p>I created two versions of the Network Router: serial and parallel. The latter uses Intel® Threading Building Blocks. I will describe how TBB was used in the project and present performance results for the parallel version.</p>
<p><b>Network Router implementation</b></p>
<p>The Network Router emulator reads packets from a file and processes them. Packet processing includes bandwidth management, NAT translation and IP routing. Packets are processed by several program modules arranged sequentially, as on an assembly line. This is a common structure for packet processing applications. The input file is a text file in which each line represents one IP packet. A separate thread reads the packets in big chunks.</p>
<p>Intel® TBB has a tbb::pipeline class that provides a high-level framework for this kind of program structure. A pipeline consists of filters, one per processing stage. Each packet goes through the pipeline and is processed step by step by its filters: first by the first filter, then by the second, then by the third, and so on. However, the processing of one packet is independent of the others, so the filters can operate in parallel.</p>
<p ><br />Network Router scheme<br /><img height="256" width="531" src="http://software.intel.com/file/36534"  /></p>
<p><br /><br />Main function:</p>
<pre name="code" class="cpp">#include &lt;iostream&gt; 
#include &lt;sstream&gt;
#include &lt;fstream&gt;
#include &lt;vector&gt;
#include &lt;algorithm&gt;
#include &lt;ittnotify.h&gt;
#include &lt;tbb/pipeline.h&gt;
#include &lt;tbb/concurrent_hash_map.h&gt;
#include &lt;tbb/atomic.h&gt;
#include &lt;tbb/concurrent_queue.h&gt;
#include &lt;tbb/compat/thread&gt;
// Redirects calls to "new" and "delete" to TBB thread safe allocators
#include &lt;tbb/tbbmalloc_proxy.h&gt;

using namespace tbb;
using namespace std;

class bandwidth_manager_t;
class network_adress_translator_t;
class ip_router_t;
class compute_t;
typedef vector&lt;packet_trace_t&gt; packet_chunk_t;

int chunk_size = 1600;
concurrent_queue&lt;packet_chunk_t&gt; chunk_queue;
atomic&lt;bool&gt; stop_flag;

int main(int argc, char* argv[])
{
	ip_addr_t external_ip;
	nic_t external_nic;	
	nat_table_t nat_table;	// NAT table   
	ip_config_t ip_config;	// Router network configuration 					
	int ntokens = 24;	
	
	get_args (argc, argv);	
    ifstream config_file (config_file_name);

    if (!config_file) {
        cerr &lt;&lt; "Cannot open config file " &lt;&lt; config_file_name &lt;&lt; "\n";
        exit (1);
    }		
	if (! initialize_router (external_ip, external_nic, 
                            ip_config, config_file)) exit (1);	
	
	thread input_thread(input_function);

	// packet processing objects
	bandwidth_manager_t bwm;	
	network_adress_translator_t nat(external_ip, external_nic, nat_table);
	ip_router_t ip_router(external_ip, external_nic, ip_config);		

__itt_resume();
	bool stop_pipeline = false;	
	
	parallel_pipeline(ntokens,		
		make_filter&lt;void, packet_chunk_t*&gt;(		// Input filter
			filter::parallel,
			[&amp;](flow_control&amp; fc)-&gt; packet_chunk_t*{				
				
				if (stop_pipeline){
					fc.stop();
					return NULL;	// the value returned after fc.stop() is ignored
				}
				packet_chunk_t* packet_chunk = new packet_chunk_t(chunk_size);
					
				if(!chunk_queue.try_pop(*packet_chunk)){				
					if (stop_flag) {
						stop_pipeline = true;
					}
				}				
				return packet_chunk;
			}
		)&amp;	// Bandwidth manager filter
		make_filter&lt;packet_chunk_t*, packet_chunk_t*&gt;(		
			filter::parallel,
			[&amp;](packet_chunk_t* packet_chunk)-&gt; packet_chunk_t*{								
				
				for(int i=0; i&lt;packet_chunk-&gt;size(); i++){
					packet_trace_t packet;
					packet = (*packet_chunk)[i];				
					
					if (packet.nic == empty){
						break;
					}
					else{
						bwm.prioritize(packet);									
						compute_t compute;
						compute.work();						
					}										
				}
				std::sort(packet_chunk-&gt;begin(), packet_chunk-&gt;end(),
							packet_comparator);
				return packet_chunk;	
			}
		)&amp;	// NAT filter
		make_filter&lt;packet_chunk_t*, packet_chunk_t*&gt;(	
			filter::parallel,
			[&amp;](packet_chunk_t* packet_chunk)-&gt; packet_chunk_t*{

				for(int i=0; i&lt;packet_chunk-&gt;size(); i++){	
					packet_trace_t packet;

					packet = (*packet_chunk)[i];					
					if (packet.nic == empty)
						break;
					else{				
						nat.map(packet);
						compute_t compute;
						compute.work();	
					}
				}				
				return packet_chunk;
			}
		)&amp;	// IP routing filter
		make_filter&lt;packet_chunk_t*, packet_chunk_t*&gt;(		
			filter::parallel,
			[&amp;](packet_chunk_t* packet_chunk)-&gt; packet_chunk_t*{			

				for(int i=0; i&lt;packet_chunk-&gt;size(); i++){						
					packet_trace_t packet;
					packet = (*packet_chunk)[i];
					
					if (packet.nic == empty)
						break;
					else{				
						ip_router.route(packet);
						compute_t compute;
						compute.work();	
					}
				}				
				return packet_chunk;
			}
		)&amp;	// Output filter
		make_filter&lt;packet_chunk_t*, void&gt;(	
			filter::parallel,
			[&amp;](packet_chunk_t* packet_chunk){														
				
				for(int i=0; i&lt;packet_chunk-&gt;size(); i++){						
					packet_trace_t packet;
					packet = (*packet_chunk)[i];	
					compute_t compute;
					compute.work();	

					if (packet.nic == empty)
						break;
				}	
				// No output is required , just drop packets
				delete packet_chunk; 
			}
		)
	);	
__itt_pause();

	cout &lt;&lt; "\nAll packets are processed\n\n";		
	return 0;
}</pre>
<br />
<p>The first part is preparation: creating objects, reading the command line, opening files and initializing. The configuration file contains information about the router interfaces. The objects bwm, nat and ip_router are the packet processing objects. They use the containers nat_table and ip_config to store the NAT and IP tables.</p>
<p>The core component of the Network Router is the pipeline. It is implemented with the tbb::parallel_pipeline() function, which takes the number of tokens and a list of filters as arguments. The unit of work passed through the pipeline is a pointer to packet_chunk_t. The ntokens parameter controls the maximum number of elements processed concurrently. It is set to 24 because the project was tested on a 24-core machine, and a larger value would have no additional effect.</p>
<p>Pipeline filters do the actual work, which in this application is packet processing. A filter can be serial or parallel. The mode is set by the filter's first parameter, which is filter::parallel for all filters here. This means that each filter can process several elements at the same time.</p>
<p>The first filter extracts a packet chunk from chunk_queue and passes it to the second filter. The second filter performs bandwidth management on each packet in the chunk: the bwm module assigns priorities to packets according to their protocol, and the packets in the chunk are then sorted by priority. This allows critical traffic to be processed as early as possible. The subsequent filters perform NAT mapping and IP routing. The last filter is the output filter, but for simplicity no real output is done and the packets are just dropped.</p>
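<p>The bandwidth management step can be sketched in plain C++ as follows. This is a minimal illustration and not the code of the sample project: the reduced packet_t structure, the priority_for policy and the higher_priority_first comparator are assumptions, and std::stable_sort is used in place of std::sort so that packets of equal priority keep their arrival order.</p>
<pre name="code" class="cpp">#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Reduced stand-in for packet_trace_t: only the fields this sketch needs.
struct packet_t {
    std::string protocol;
    int priority;
};

// Hypothetical policy: real-time protocols (rtp, sip) get the highest priority.
int priority_for(const std::string&amp; protocol)
{
    if (protocol == "rtp" || protocol == "sip")
        return 2;
    if (protocol == "http")
        return 1;
    return 0;   // bulk traffic such as ftp
}

// Orders packets so that a higher priority comes first.
bool higher_priority_first(const packet_t&amp; a, const packet_t&amp; b)
{
    return a.priority &gt; b.priority;
}

// Assign a priority to every packet in the chunk, then sort the chunk.
// stable_sort keeps the arrival order among packets of equal priority.
void prioritize_chunk(std::vector&lt;packet_t&gt;&amp; chunk)
{
    for (std::size_t i = 0; i &lt; chunk.size(); ++i)
        chunk[i].priority = priority_for(chunk[i].protocol);
    std::stable_sort(chunk.begin(), chunk.end(), higher_priority_first);
}
</pre>
<p>In this sketch the priority is written to the element stored in the chunk rather than to a local copy, so the sort that follows sees the assigned values.</p>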
<p>A packet chunk is used as the pipeline token because it is big enough to amortize the cost of passing work between filters. If single packets were passed through the pipeline, there would be too many transitions between threads, and the overhead would outweigh the benefit.</p>
<p>The __itt_resume() and __itt_pause() functions belong to the API of Intel® VTune<sup>TM</sup> Amplifier XE, which was used for performance measurements. These API functions mark the beginning and the end of the region of interest.</p>
<p>The compute object of type compute_t creates a workload for the CPU. It performs additional computations to simulate the computing done in real systems. The application does not do all the work needed to process and route packets in real-life network equipment; it is only a model of a real application, so on its own it would not use enough CPU. The compute_t::work() method runs an "N Queens" computation.</p>
<p>Opening and reading the input file is the job of a separate thread. It is created with the std::thread class, which is part of the upcoming C++11 standard and is provided here by the tbb/compat/thread header.</p>
<p><b>Serial implementation</b></p>
<p>To measure the effect of parallelization, a serial version was also created. It has a similar structure; the only difference is that parallel_pipeline is replaced with a simple while loop.</p>
<p >Network router serial scheme<br /><br /><img height="248" width="459" src="http://software.intel.com/file/36533" /></p>
<p>While loop (replacing parallel_pipeline):</p>
<pre name="code" class="cpp">__itt_resume();
	bool stop = false;

	while (!stop){
		packet_chunk_t packet_chunk(chunk_size);
		
		if(!chunk_queue.try_pop(packet_chunk)){				
			if (stop_flag) {
				stop = true;
			}
		}		
		
		for(int i=0; i &lt; packet_chunk.size(); i++){
			packet_trace_t packet = packet_chunk[i];
			bwm.prioritize(packet);	
			compute_t compute;
			compute.work();									
		}
		std::sort(packet_chunk.begin(), packet_chunk.end(), packet_comparator);
		for(int i=0; i &lt; packet_chunk.size(); i++){
			packet_trace_t packet = packet_chunk[i];				
			nat.map(packet);
			compute_t compute;
			compute.work();		
			ip_router.route(packet);				
			compute.work();							
			compute.work();								
		}
	}
__itt_pause();</pre>
<p><br />There are four calls to compute.work() per packet, the same number as in the TBB version. This is by far the most CPU-time-consuming function, so it is fair to call it the same number of times in both versions.</p>
<p><b>Data structures</b></p>
<p>Input file has the following format:</p>
<p class="code">eth3 104.44.44.10 10.230.30.03 4003 5003 ftp<br />eth3 104.44.44.10 10.230.30.03 4003 5003 rtp<br />eth0 134.77.77.30 104.44.44.10 2004 4003 sip<br />eth3 104.44.44.10 10.230.30.03 4003 5003 http</p>
<p>Each line represents one packet. It contains the network interface, the source and destination IP addresses and ports, and the protocol. A packet is stored in the packet_trace_t structure:</p>
<pre name="code" class="cpp">typedef struct {
	nic_t nic;			// network interface where packet arrived
	ip_addr_t destIp;		// destination IP
	ip_addr_t srcIp;		// source IP
	port_t destPort;		// destination port
	port_t srcPort;		// source port 
	protocol_t protocol;	// protocol type (rtp, ftp, http, sip, etc)
	int priority;			// packet priority
} packet_trace_t;
</pre>
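<p>The input thread reads packets with in_file &gt;&gt; packet, which requires a stream extraction operator for the packet type. That operator is not listed in this article, so the following is only a sketch of what it could look like. The packet_line_t stand-in, its string and int field types and the neutral field names are assumptions; the order of the two addresses and the two ports on a line is deliberately left uninterpreted.</p>
<pre name="code" class="cpp">#include &lt;istream&gt;
#include &lt;string&gt;

// Stand-in types: the article does not show how nic_t, ip_addr_t, port_t and
// protocol_t are defined, so plain strings and ints are assumed here.
struct packet_line_t {
    std::string nic;
    std::string ip_first;    // first IP address on the line
    std::string ip_second;   // second IP address on the line
    int port_first;          // first port on the line
    int port_second;         // second port on the line
    std::string protocol;
};

// Read one whitespace-separated packet line, for example
//   eth3 104.44.44.10 10.230.30.03 4003 5003 ftp
// On failure the stream's fail bit is set, as with any formatted extraction.
std::istream&amp; operator&gt;&gt;(std::istream&amp; in, packet_line_t&amp; p)
{
    return in &gt;&gt; p.nic &gt;&gt; p.ip_first &gt;&gt; p.ip_second
              &gt;&gt; p.port_first &gt;&gt; p.port_second &gt;&gt; p.protocol;
}
</pre>
<p>Because the operator returns the stream, the caller can test the result of the extraction and stop at the first line that does not parse.</p>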
<br />The NAT table and the IP configuration table are stored in tbb::concurrent_hash_map containers. A packet chunk is stored in a std::vector, and the chunk queue is of type tbb::concurrent_queue:<br /><br />
<pre name="code" class="cpp">typedef concurrent_hash_map&lt;port_t, address*, string_comparator&gt; nat_table_t; 
typedef concurrent_hash_map&lt;ip_addr_t, nic_t, string_comparator&gt; ip_config_t; 
typedef vector&lt;packet_trace_t&gt; packet_chunk_t;
concurrent_queue&lt;packet_chunk_t&gt; chunk_queue;
</pre>
<br />The input file is read by a separate thread that executes input_function. The function opens the file and reads it in chunks, which are pushed to the chunk queue. TBB containers are thread-safe, so the main thread can read from the chunk queue at the same time without additional manual synchronization. Input thread function:<br /><br />
<pre name="code" class="cpp">void input_function(){	
    ifstream in_file (in_file_name);
    if (!in_file) {
        cerr &lt;&lt; "Cannot open input file " &lt;&lt; in_file_name &lt;&lt; "\n";
        exit (1);
    }
	stop_flag = false;	
	
	while(in_file.good()){			
		packet_chunk_t packet_chunk(chunk_size);
								
		for(int i=0; i&lt;chunk_size; i++){
			packet_trace_t packet;
			in_file &gt;&gt; packet;					
			packet_chunk[i] = packet;			
		}
		chunk_queue.push(packet_chunk);			
	}
	stop_flag = true;
}</pre>
<br />
<p><b>Performance measurements</b></p>
<p>The goals of this project were to achieve good performance and scalability by using TBB. The following setup was used for the measurements:</p>
<p>CPU: 4 Intel® Xeon X7460 processors, 2.66 GHz, 24 physical cores total <br />RAM: 16 GB <br />OS: Microsoft Windows Server® Enterprise 2008 SP2 <br />Workload: input file: 113405 packets (5.1 MB) <br />Measurement tool: Intel® VTune<sup>TM</sup> Amplifier XE 2011 <br />Analysis type: Concurrency with default settings</p>
Two tests were performed: one for the serial version and one for the parallel version. Below are the summaries of the two analyses, with the serial version on the left and the TBB version on the right:<br /><br />
<p ><img height="326" width="599" src="http://software.intel.com/file/36538" /></p>
<br />
<p>The CPU time, which is the sum of the CPU times of all cores in the system, is similar in both versions. The elapsed time, which is the wall-clock time the application takes for processing, is very different. In the serial version it is close to the overall CPU time. In the TBB version it is 19 times smaller, so the application ran 19 times faster.</p>
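<p>Two standard measures help to put a 19 times speedup on 24 cores into perspective. The sketch below is not part of the sample project; the function names are chosen for illustration. Parallel efficiency is the speedup divided by the number of cores, and the Karp-Flatt metric estimates the serial fraction implied by a measured speedup.</p>
<pre name="code" class="cpp">// Parallel efficiency: the fraction of ideal linear speedup actually achieved.
double parallel_efficiency(double speedup, int cores)
{
    return speedup / cores;
}

// Karp-Flatt metric: the serial fraction implied by a measured speedup,
// e = (1/S - 1/N) / (1 - 1/N). The number of cores must be greater than 1.
double serial_fraction(double speedup, int cores)
{
    double inv_n = 1.0 / cores;
    return (1.0 / speedup - inv_n) / (1.0 - inv_n);
}
</pre>
<p>For the measured speedup of 19 on 24 cores this gives an efficiency of about 0.79 and an implied serial fraction of roughly 1.1%.</p>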
CPU usage for serial version:<br /><br />
<p ><img height="265" width="766" src="http://software.intel.com/file/36535" /></p>
<br />CPU usage for TBB version:<br /><br />
<p ><img height="258" width="770" src="http://software.intel.com/file/36536" /></p>
<br /><br />
<p>The average number of utilized cores for the TBB version is 20.5, and for most of the processing time all 24 cores were in use. This demonstrates that the application scales well and can use almost all cores of a multi-core system.</p>
The bottom-up view of the serial application shows that almost all the time is spent in the computing module that simulates a real workload:<br /><br />
<p ><img height="298" width="856" src="http://software.intel.com/file/36537" /></p>
<br /><br />In the TBB version the picture is very similar: the main hotspot is the same compute_t::do_work method. However, it is mostly marked green, which indicates good CPU utilization. There are also more functions in the list because of the TBB constructs:<br /><br />
<p ><img height="424" width="770" src="http://software.intel.com/file/36540" /></p>
<br /><br />
<p>These results show good performance for the TBB-based application. However, keep the following conditions in mind:</p>
<p>1) The Amplifier XE API functions __itt_resume() and __itt_pause() were used to bound the measured region. The results show the performance of tbb::parallel_pipeline for the TBB version and of the while loop for the serial version. Measuring the whole application will give slightly different results.</p>
<p>2) A simulated job was used to load the CPU. The compute_t class solves the "N Queens" problem; real processing is different. If there were not enough work for the CPU, file input would take relatively more time. So in a real application the scalability and performance gain can be lower.</p>
<p><strong>Conclusion</strong></p>
This sample project shows that TBB can be used to build network packet processing applications and that tbb::pipeline fits them well. The same approach can be applied in IP routing switches, telecommunication servers (VoIP telephony, video conferencing), various gateways and proxies, etc. Like any heavily loaded application, network software can benefit from multi-threading, and Intel® Threading Building Blocks is a simple and effective way to manage parallelism in your application.
<div><br /></div>
<div>The full project source code:</div>
<div><a target="_blank" href="http://software.intel.com/file/36623">NetworkRouter.cpp</a></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/network-router-emulator/</link>
      <pubDate>Mon, 23 May 2011 13:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/network-router-emulator/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/network-router-emulator/</guid>
      <category>Parallel Programming</category>
      <category>Tools</category>
      <category>Intel Software Network communities</category>
    </item>
    <item>
      <title>Estimating FLOPS using Event Based Sampling (EBS)</title>
      <description><![CDATA[ <p>The FLOPS (or flops or flop/s) is an acronym for <b>fl</b>oating point <b>op</b>erations per <b>s</b>econd and is a measure heavily used in high performance computing. The FLOPS is a common way of measuring the performance and computational capabilities of a given microprocessor.</p>
<p>In this article, you will find out how hardware-based Event Based Sampling (EBS) technology can help developers estimate the floating point operations per second executed by their applications. Here FLOPS refers to 32-bit and 64-bit floating point operations, and only computational operations (addition or multiplication) are counted.</p>
<p>The Intel® VTune<sup>TM</sup> Amplifier XE is a performance analysis tool that helps software developers analyze their applications to identify algorithmic and microarchitectural performance issues. The VTune<sup>TM</sup> Amplifier XE uses the processor's Performance Monitoring Unit (PMU) to sample processor events, and some of these events can be used to statistically sample the number of computational floating point operations at execution.</p>
<p ><img title="fig1.png" alt="fig1.png" src="http://origin-software.intel.com/file/34526" /><br />Figure 1: Scalar processing vs. SIMD (<span >S</span>ingle <span >I</span>nstruction <span >M</span>ultiple <span >D</span>ata) processing</p>
<p > </p>
<p ><b><span ><img title="fig2.png" alt="fig2.png" src="http://origin-software.intel.com/file/34527" /> </span></b></p>
<p >Figure 2: Intel® Architecture integer, floating point, MMX and SSE (Streaming SIMD Extensions) registers.</p>
<p >Note: The figure doesn't show the latest AVX extension and registers.</p>
<p>As Figures 1 and 2 show, floating point operations can be performed on legacy x87 registers or on SSE registers, depending on how the compiler generates the code. Floating point instructions executed on SSE registers can be either scalar or packed operations. Table 1 (below) gives the PMU event names that can be used to statistically estimate the computational floating point operations executed by the hardware. Keep in mind that, because of the speculative nature of the architecture, not all instructions that are executed, and hence counted by these events, are actually retired. These events can therefore overcount.</p>
<p> </p>
<table border="1" cellpadding="0" cellspacing="0">
<tbody >
<tr >
<td rowspan="2"  valign="top" width="150">
<p ><b><span >Processor Generation</span></b></p>
</td>
<td colspan="3"  valign="top" width="549">
<p ><b><span >Processor Event Names</span></b></p>
</td>
</tr>
<tr >
<td  valign="top" width="153">
<p ><b>FP  operations using legacy x87 </b></p>
</td>
<td colspan="2"  valign="top" width="396">
<p ><b>FP operations using SIMD</b></p>
</td>
</tr>
<tr >
<td rowspan="4"  valign="top" width="150">
<p>Intel® Core<sup>TM</sup> 2 processor family (Intel®  Core<sup>TM</sup> 2 Duo/Quad, etc)</p>
<p> </p>
</td>
<td rowspan="4"  valign="top" width="153">
<p >X87_OPS_RETIRED.ANY</p>
</td>
<td  valign="top" width="123">
<p>Packed 64bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.PACKED_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Packed 32bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.PACKED_SINGLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 64bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 32bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.SCALAR_SINGLE</p>
</td>
</tr>
<tr >
<td rowspan="4"  valign="top" width="150">
<p>Intel® Core<sup>TM</sup> architecture (Intel® Core<sup>TM</sup> i7, i5, i3; a.k.a Nehalem)</p>
</td>
<td rowspan="4"  valign="top" width="153">
<p >FP_COMP_OPS_EXE.x87</p>
</td>
<td  valign="top" width="123">
<p>Packed 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Packed 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_FP_SCALAR</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_FP_SCALAR<b></b></p>
</td>
</tr>
<tr >
<td rowspan="4"  valign="top" width="150">
<p>2<sup>nd</sup> Generation Intel® Core<sup>TM</sup> architecture (a.k.a. Sandy Bridge)</p>
</td>
<td rowspan="4"  valign="top" width="153">
<p><b><span ></span></b></p>
<div >FP_COMP_OPS_EXE.X87<br /></div>
</td>
<td  valign="top" width="123">
<p>Packed 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Packed 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_PACKED_SINGLE<b></b></p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE<b></b></p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE<b></b></p>
</td>
</tr>
</tbody>
</table>
<p ><b><span >Table 1:</span></b> PMU events used to count computational floating-point operations at execution.</p>
<p ><b><span >Note:</span></b> This table does not include the event names used to sample AVX FP operations.</p>
<p>VTune<sup>TM</sup> Amplifier XE can collect any one of these events, or all of them at the same time, to estimate the FLOPS achieved by an application. To measure the elapsed time, the CPU_CLK_UNHALTED (a.k.a. clockticks) event can be used: if the processor frequency is constant during the measurement period, the clocktick count can be converted to elapsed wall-clock time. Keep in mind that the exact name of the CPU_CLK_UNHALTED event varies by processor architecture.</p>
<p>Alternatively, CPU_CLK_UNHALTED.REF, which counts the number of reference cycles and is not affected by thread frequency changes, can be used. The difference between the reference clocktick event and clocktick event is that even if a thread enters the halt state (by running the HLT instruction), the reference clocktick event continues to count as if the thread is continuously running at the maximum frequency.</p>
<p><b><span >Estimating FLOPS </span></b></p>
<p>The FLOPS formula can be given as follows:</p>
<blockquote >
<p><b>FLOPS </b>= ((number of FP ops / clock) * number of total computational FP ops) / Elapsed Time</p>
<p><b>Elapsed Time</b> = CPU_CLK_UNHALTED / Processor-Frequency / Number-of-Cores<br />Note: Only the cores with a nonzero CPU_CLK_UNHALTED event count should be counted in Number-of-Cores.</p>
</blockquote>
<p>To demonstrate how EBS technology can be used to estimate the FLOPS, a simple multi-threaded matrix multiplication will be used. Each thread in the thread pool executes the following code.</p>
<p> </p>
<blockquote >
<p>double a[NUM][NUM];</p>
<p>double b[NUM][NUM];</p>
<p>double c[NUM][NUM];</p>
<p>...</p>
<p>slice = (unsigned int) tid;</p>
<p>from  = (slice * NUM) / NUM_THREADS;</p>
<p>to    = ((slice + 1) * NUM) / NUM_THREADS;</p>
<p> </p>
<p>for(i = from; i &lt; to; i++) {</p>
<p >for(j = 0; j&lt; NUM; j++) {</p>
<p >for(k = 0; k &lt; NUM; k++) {</p>
<p >// 2 fp ops / iteration: 1 add, 1 multiply<br />c[i][j] += a[i][k] * b[k][j];</p>
<p >}</p>
<p >}</p>
<p>}</p>
<p>...</p>
</blockquote>
<p> </p>
<p>The application also reports its own FLOPS measurement, obtained by dividing the total number of FP operations (2 per iteration * NUM * NUM * NUM) by the elapsed time. The elapsed time covers only the matrix multiplication and excludes initialization and thread-creation overhead.</p>
<p>To collect samples only for the relevant code section, the <i>__itt_pause()</i> (pauses the collection) and <i>__itt_resume()</i> (resumes the collection) APIs are used. Please refer to the VTune<sup>TM</sup> Amplifier XE documentation on how to use the user APIs.</p>
<p>VTune<sup>TM</sup> Amplifier XE can be configured as follows on Intel® Core<sup>TM</sup> i7 (x980) based system (3.33GHz, 6 core + Hyper Threading enabled):</p>
<p> </p>
<p ><img title="fig3.png" alt="fig3.png" src="http://origin-software.intel.com/file/34636" /></p>
<p > </p>
<p><br /><b><span >Using x87 Registers</span></b></p>
<p>The sample application is compiled in release mode (optimization level set to Ox) on a Windows* system using Visual Studio.</p>
<p>The application reports the following when analyzed under VTune<sup>TM</sup> Amplifier XE.</p>
<p><img title="fig4.png" alt="fig4.png" src="http://origin-software.intel.com/file/34529" /></p>
<p><br />The results below give us insight into how the compiler generated the code. In this run, we can clearly see that samples were collected only on FP operations using x87.</p>
<p ><img title="fig5.png" alt="fig5.png" src="http://origin-software.intel.com/file/34530" /></p>
<p>If we plug the numbers into the formula:</p>
<blockquote>
<p><b>MFLOPS Formula</b> = FP_COMP_OPS_EXE.FP<b> </b>/ 1x10<sup>6</sup> / Elapsed Time</p>
<p><b>Elapsed time</b> = CPU_CLK_UNHALTED.THREAD / Processor-Frequency / Number-of-Cores</p>
</blockquote>
<p> </p>
<blockquote>
<p>Elapsed Time = 607,652,000,000 / 3.33 x 10<sup>9</sup> / 12 = 15.206 secs</p>
<p>MFLOPS = 18,470,000,000 / 1x10<sup>6</sup> / 15.206 secs = <b><span >1,214.652 MFLOPS</span></b></p>
</blockquote>
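<p>As a cross-check of the arithmetic above, the two formulas can be written as a pair of small C helpers. This is only a sketch: the function names are made up for illustration and are not part of VTune<sup>TM</sup> Amplifier XE, and the number of cores is assumed to be the count of logical cores that reported a nonzero clocktick count.</p>

```c
/* Elapsed time in seconds, estimated from the unhalted clockticks summed
   over all active logical cores (hypothetical helper, for illustration). */
double elapsed_seconds(double clockticks, double freq_hz, int active_cores)
{
    return clockticks / freq_hz / (double)active_cores;
}

/* MFLOPS from a floating-point operation count and an elapsed time. */
double mflops(double fp_ops, double seconds)
{
    return fp_ops / 1.0e6 / seconds;
}
```

<p>Calling elapsed_seconds(607652000000.0, 3.33e9, 12) and passing the result to mflops() together with the x87 event count of 18470000000.0 gives approximately the figures quoted above; small differences come from rounding the elapsed time.</p>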
<p> </p>
<p><b><span >Using SSE registers</span></b></p>
<p>Now, let's look at the same application when SSE registers are used.  If we compile the application using Intel® compiler version 12.0, we see the following results under the VTune<sup>TM</sup> Amplifier XE.</p>
<p><img title="fig6.png" alt="fig6.png" src="http://origin-software.intel.com/file/34531" /></p>
<p ><img title="fig7.png" alt="fig7.png" src="http://origin-software.intel.com/file/34532" /></p>
<p><br /><br />One thing you will notice right away in the new result is that the samples land in a different function. In the earlier example the samples were in the matrixMultiply function, but now they are in the threadPool function. This is due to inlining (for more information: <a href="http://en.wikipedia.org/wiki/Inline_expansion">http://en.wikipedia.org/wiki/Inline_expansion</a>). Drilling down into threadPool makes this clear.</p>
<p ><img title="fig8.png" alt="fig8.png" src="http://origin-software.intel.com/file/34533" /></p>
<p> </p>
<p>We multiply the FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION event count by <b>2</b> because <b>each packed double-precision operation works on two 64-bit values</b> held in a 128-bit XMM register. For packed single-precision operations, which work on four 32-bit values per register, the event count needs to be multiplied by 4.</p>
<blockquote>
<p><b>MFLOPS Formula</b> = 2 * FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION / 1x10<sup>6 </sup>/ Elapsed Time</p>
<p><b>Elapsed time</b> = CPU_CLK_UNHALTED.THREAD / Processor-Frequency / Number-of-Cores</p>
</blockquote>
<blockquote>
<p>Elapsed time = (66,178,000,000 / 3.33 x10<sup>9</sup> / 12 ) =  1.656 secs</p>
<p>MFLOPS = 2 * FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION / 1 x 10<sup>6 </sup>/ 1.656 secs =  <b><span >11,053.140 MFLOPS</span></b></p>
</blockquote>
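<p>The weighting of packed and scalar events can be sketched as a single helper that turns raw event counts into a floating-point operation count. The struct and its field names are hypothetical and chosen only for illustration; the actual event names depend on the processor generation (see Table 1), and AVX events are not covered.</p>

```c
/* Raw event counts as reported by the profiler.
   Field names are hypothetical, not actual PMU event names. */
typedef struct {
    double x87;            /* legacy x87 operations */
    double packed_double;  /* packed 64-bit SIMD operations */
    double packed_single;  /* packed 32-bit SIMD operations */
    double scalar_double;  /* scalar 64-bit SIMD operations */
    double scalar_single;  /* scalar 32-bit SIMD operations */
} fp_event_counts;

/* Total floating-point operations: a packed event on a 128-bit register
   covers 2 doubles or 4 singles, a scalar or x87 event covers 1 value. */
double total_fp_ops(const fp_event_counts *c)
{
    return c->x87
         + 2.0 * c->packed_double
         + 4.0 * c->packed_single
         + c->scalar_double
         + c->scalar_single;
}
```

<p>Dividing the returned value by 1x10<sup>6</sup> and by the elapsed time gives the MFLOPS figure in the same way as in the formulas above.</p>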
<p> </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/</link>
      <pubDate>Fri, 04 Feb 2011 15:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/</guid>
      <category>Parallel Programming</category>
      <category>Intel Software Network communities</category>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Overview of Summary Statistics (SS) in Intel® MKL V10.3</title>
      <description><![CDATA[ <br />
<p><b>Introduction : <br /></b><br />The data volumes that researchers and engineers analyze in their daily tasks are huge and continue to grow. The datasets generated each day to solve research or engineering problems can reach hundreds of gigabytes, while whole data arrays have crossed the terabyte threshold. To analyze the data in a timely fashion and to enable faster and more accurate decision making, one needs a better tool stack. It includes not only the most powerful computer, but also the best software tools to effectively exploit all capacities of the computer.<br /><br />The Intel® Math Kernel Library (MKL) team identified a set of computationally intensive statistical algorithms that are used at different stages of data analysis, and optimized those algorithms for x86 multi-core platforms. These statistical functions are available in Intel® MKL 10.3 in the Vector Statistical Library (VSL) chapter and are referred to as Summary Statistics. VSL Summary Statistics provides the instruments for fast, accurate, and reliable data analysis on multi-core CPUs. <br /><br /><br /><b>Functionality : <br /></b><br />VSL Summary Statistics contains a set of computational blocks that help to get insight into the structure of a dataset from different perspectives. The dataset is stored in matrix form: it is a sequence of n observations (vectors), and each observation holds the values of p variables (p is the dimension of the task). Using VSL Summary Statistics algorithms, you can compute basic statistical estimates like moments or quantiles for your observation matrix, estimate dependencies in the dataset, or even process incomplete or noisy data that result from an imperfect measurement process. 
<br />Based on the type of the estimates and the structure of the dataset, we classify the Summary Statistics algorithms in the following way: <br /><br />       Computation of the basic statistical estimates for the observation matrix:<br />           - Moments, skewness, kurtosis, variation coefficient, quantiles and order statistics<br />       Analysis of datasets that have missing points<br />           - Restoring the basic statistical characteristics in the presence of missing observations <br />       Analysis of noisy datasets that contain artifacts (outliers)<br />           - Detection of outliers, robust (to noise) estimates of the covariance matrix and mean<br />       Analysis of the dependencies <br />           - Variance-covariance/correlation matrix, partial variance-covariance/correlation matrix, <br />              pooled/group variance-covariance/correlation matrix<br />           - Parameterization (stabilization) of the correlation matrix<br /><br />Typically, the algorithms for support of missing values and detection of outliers are used for so-called data cleaning, that is, the initial analysis of raw data that became available as the result of a research experiment, a measurement of an object, or a sociological survey.<br /><br />Once you have cleaned the raw data, completed its initial analysis, and applied other steps such as data transformation using other MKL routines, you can use the rich set of Summary Statistics estimators to learn the structure of the data array. Basic statistical estimates like the mean or standard deviation give you a first impression of your data. 
The advanced analysis would involve computation of quantiles for understanding the distribution properties of the data, variance-covariance matrices for revealing the dependencies between the components of the observed random vector, or computation of robust versions of the basic estimates.<br /><br />The algorithms for estimation of dependencies are important building blocks in algorithms for signal processing, gene analysis, real-time analysis of financial data, and portfolio management problems. In addition to “classical” covariance, the library offers further covariance algorithms such as pooled/group covariance, which is used in linear discriminant analysis for face recognition or gene expression level estimation, and partial covariance, which underlies filtering problems. In real-life problems that involve intensive computation of the covariance matrix from real data, you may run into a situation where the matrix loses the property of positive semi-definiteness (PSD) due to spurious observations. In some problems you may also want to modify the entries of the matrix to integrate additional information about the object, which can also break the PSD property. To help you quickly restore the PSD property of the correlation matrix, the library provides a tool for parameterization of the correlation matrix.<br /><br />While more and more advanced methodologies for data collection (like the microarray approach for gene expression) are designed and become available, the process of object measurement is not free of spurious data and artifacts. To minimize the impact of noise on the final statistical conclusions, you may use the robust methods available in the library. <br /><br />There are a number of problems in which the raw data is huge and cannot fit into the memory of the computer, or the observation matrix becomes available in blocks only (as in astronomy or in real-time trading systems). 
Some Summary Statistics routines are designed to make progressive, block-based analysis easy. In particular, you may provide the data block by block to the algorithm for computation of covariance and, after processing the last data block, you will have the covariance estimate for the whole dataset. For another estimate, the quantile, the library provides a version for streaming data: for the price of a pre-defined estimation error and one pass over the data, you obtain the distribution structure of the whole dataset without needing to return to the processed chunks of the data.<br /><br />Another important feature of VSL Summary Statistics is the ability to store the observation matrix and the results of the computations in one of the predefined formats: all algorithms of the library process datasets held in row-major or column-major format, and, at the user's request, the covariance estimators return the result in full or packed storage format.<br /><br /><br /><b>Usage Model : <br /></b><br />The API of VSL Summary Statistics is similar to that of other MKL features (Convolution / Correlation, FFT, RNGs). A minimal set of entry points into the library, scalability, and support of object-oriented programming are the major drivers behind the library API. <br />The key element of the Summary Statistics API is the task object, a data structure that holds the parameters related to the computation of Summary Statistics. All Summary Statistics operations are done in the context of this task. 
The typical Summary Statistics usage model consists of the four steps shown in the scheme below:<br /><br />          Step 1: Task construction<br />                     status = vsldSSNewTask(&amp;task, &amp;p, &amp;n, &amp;x_storage, x, w, indices);<br />          Step 2: Editing the task parameters<br />                     status = vsldSSEditTask(task, VSL_SS_ED_MEAN, mean );<br />          Step 3: Computation of statistical estimates<br />                     status = vsldSSCompute(task, VSL_SS_MEAN, VSL_SS_METHOD_1PASS);<br />          Step 4: Destroying the task<br />                     status = vslSSDeleteTask(&amp;task );<br /><br />Here are three examples of how to use the VSL Summary Statistics functionality in MKL:<br /><br /><br /><b>Example 1 : In-memory datasets </b><br /><br />This example, which estimates the mean using the one-pass method, shows the typical four-stage approach to using the VSL Summary Statistics functionality. <br /><br />First, it creates a new task by passing the task descriptor, problem size (dimension of the random vector p and number of observed vectors n), observation matrix, and its matrix storage format. The Summary Statistics usage model allows assigning a non-negative weight to each observation in the dataset; in the example below, a null pointer is passed to the constructor instead of the array of weights, so the default weights equal to 1 are assigned to the observations and used in the computations. Second, it initializes the task parameter necessary for the specific problem of sample mean estimation by registering the result array with VSL_SS_ED_MEAN. Third, it computes the mean using the one-pass method VSL_SS_METHOD_1PASS. 
Finally, it calls the task destructor to release the memory resources.<br /><br /><br />#include "mkl.h"<br /><br />#define DIM 1 /* Task dimension */<br />#define N 1000 /* Number of observations */<br />int main()<br />{<br />    VSLSSTaskPtr task; /* SS task descriptor */<br />    double x[N]; /* Array for dataset; fill it with observations before computing */<br />    double mean; /* Mean estimate (a single value because DIM is 1) */ <br />    double* w = 0; /* Null pointer to array of weights, default weight equal to one will be used in the computation */ <br />    MKL_INT p, n, xstorage;<br />    int status;<br /><br />    /* Initialize variables used in the computation of mean */<br />    p = DIM;<br />    n = N;<br />    xstorage = VSL_SS_MATRIX_STORAGE_ROWS;<br />    mean = 0.0;<br />    <br />    /* Step 1 - Create task */<br />    status = vsldSSNewTask( &amp;task, &amp;p, &amp;n, &amp;xstorage, x, w, 0 );<br /><br />    /* Step 2 - Initialize task parameters */<br />    status = vsldSSEditTask( task, VSL_SS_ED_MEAN, &amp;mean );<br /><br />    /* Step 3 - Compute the mean estimate using SS one-pass method */<br />    status = vsldSSCompute(task, VSL_SS_MEAN, VSL_SS_METHOD_1PASS );<br /><br />    /* Step 4 - Deallocate task resources */<br />    status = vslSSDeleteTask( &amp;task );<br /><br />    return 0;<br />}<br /><br /><b>Example 2 : Out-of-memory datasets </b><br /><br />The example below extends the previous scheme to the case of out-of-memory datasets. While it follows the same four-stage scheme, a few modifications to the previous example are needed to support processing of an out-of-memory dataset:</p>
<ul>
<li>First, the example registers in the library an auxiliary array intended for holding the accumulated weight and the accumulated squared weight, which are simply the sum of the observation weights and the sum of the squares of the observation weights processed so far. This is necessary for the correct processing of the next available data block.</li>
<li>Second, the example contains a call to the function responsible for obtaining the next data block.</li>
</ul>
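<p>Why the accumulated weights are needed can be illustrated in plain C, independently of MKL. The sketch below is not the library's implementation: the running_mean type and the update_mean function are hypothetical names, and the update rule is one common way to fold a new block into a weighted mean using only the weights accumulated so far.</p>

```c
#include <stddef.h>

/* Running state for a weighted mean computed block by block.
   Hypothetical helper type, not part of the MKL API. */
typedef struct {
    double mean;      /* mean of all observations processed so far */
    double accum_w;   /* sum of the weights processed so far */
    double accum_w2;  /* sum of the squared weights processed so far */
} running_mean;

/* Fold one block of n observations into the running state.
   w may be NULL, in which case every observation has weight 1. */
void update_mean(running_mean *s, const double *x, const double *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double wi = w ? w[i] : 1.0;
        s->accum_w  += wi;
        s->accum_w2 += wi * wi;
        if (s->accum_w > 0.0)
            s->mean += (wi / s->accum_w) * (x[i] - s->mean);
    }
}
```

<p>Starting from a zero-initialized state and calling update_mean once per block yields the mean of the whole dataset after the last block, without keeping earlier blocks in memory.</p>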
<p>The rest of the computation remains the same. As soon as the last data block is processed, you will have the mean estimate for the whole dataset.<br /><br />#define DIM 1 /* Task dimension */<br />#define N 1000 /* Number of observations per block */<br />#define NBLOCK 100 /* Number of blocks */<br />int main()<br />{<br />    VSLSSTaskPtr task; /* SS task descriptor */<br />    double x[N]; /* Array for the current data block */<br />    double mean; /* Mean estimate */<br />    double W[2]; /* Accumulated weight and accumulated squared weight */ <br />    MKL_INT p, n, xstorage;<br />    int status, block_idx;<br /><br />    /* Initialize variables used in the computation of mean */<br />    p = DIM; n = N;<br />    mean = 0.0;<br />    W[0] = 0.0; W[1] = 0.0;<br />    xstorage = VSL_SS_MATRIX_STORAGE_ROWS;<br /><br />    /* Create task */<br />    status = vsldSSNewTask( &amp;task, &amp;p, &amp;n, &amp;xstorage, x, 0, 0 );<br /><br />    /* Initialize task parameters */<br />    status = vsldSSEditTask( task, VSL_SS_ED_MEAN, &amp;mean );<br />    status = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W );<br /><br />    /* Compute the mean estimate for block-based data using SS one-pass method */<br />    for( block_idx = 0; block_idx &lt; NBLOCK; block_idx++ )<br />    {<br />          /* GetNextDataBlock is a user-supplied function that loads the next N observations into x */<br />          GetNextDataBlock( x, N );<br />          status = vsldSSCompute( task, VSL_SS_MEAN, VSL_SS_METHOD_1PASS );<br />    }<br /><br />    /* De-allocate task resources */<br />    status = vslSSDeleteTask( &amp;task );<br />    return 0;<br />}<br /><br /><b>Example 3 : Do several estimations simultaneously </b><br /><br />This example shows how to compute several statistical estimates for the same dataset. Let us assume that we need the mean, variance, variation coefficient, and covariance in the statistical analysis. The computational scheme is the same as in the previous examples. 
We just need to modify those parameters in the task which are associated with the estimates we are interested in; that is, we need to register the pointers to the memory that will hold the results, and the storage format for the covariance matrix. In the code below, we use the editor for moments, vsldSSEditMoments, the editor for covariance, vsldSSEditCovCor, and the generic editor vsldSSEditTask for the variation coefficient to specify the task parameters. We also build a bit mask that combines the four estimates to be computed in the compute stage.<br /><br /><br />#define DIM 2 /* Task dimension */<br />#define N 1000 /* Number of observations */<br />int main()<br />{<br />    VSLSSTaskPtr task; /* SS task descriptor */<br />    double x[DIM*N]; /* Dataset array */<br />    double mean[DIM],variation[DIM], r2m[DIM], c2m[DIM]; <br />    /* mean/variation/2nd raw/2nd central moments */<br /><br />    double cov[DIM*DIM];<br />    double* w = 0;<br />    MKL_INT p, n, xstorage, covstorage;<br />    unsigned MKL_INT64 estimates; /* Bit mask of the requested estimates */<br />    int status;<br /><br />    /* Initialize variables */<br />    p = DIM; n = N;<br />    xstorage = VSL_SS_MATRIX_STORAGE_ROWS;<br />    covstorage = VSL_SS_MATRIX_STORAGE_FULL;<br /><br />    /* Create task */<br />    status = vsldSSNewTask( &amp;task, &amp;p, &amp;n, &amp;xstorage, x, w, 0 );<br /><br />    /* Edit task parameters */<br />    status = vsldSSEditTask( task, VSL_SS_ED_VARIATION, variation );<br />    status = vsldSSEditMoments( task, mean, r2m, 0, 0, c2m, 0, 0 );<br />    status = vsldSSEditCovCor( task, mean, cov, &amp;covstorage, 0, 0 );<br /><br />    /* Computation of several estimates using 1PASS method */<br />    estimates = VSL_SS_MEAN|VSL_SS_2C_MOM|VSL_SS_COV| VSL_SS_VARIATION;<br />    status = vsldSSCompute( task, estimates, VSL_SS_METHOD_1PASS );<br /><br />    /* De-allocate task resources */<br />    status = vslSSDeleteTask( &amp;task );<br />    return 0;<br />}<br /><br />Note that some statistical estimators compute additional estimates even if they are not explicitly 
requested by the user; e.g., the algorithm for covariance estimation also computes the mean estimate. To ensure correct processing of such estimates, you need to allocate memory for those additional estimates and register pointers to this memory in the library. See [SSApp] for more details.<br /><br /><br /><b>Summary : </b><br /><br />In this article, we described the functionality, usage model, and examples of Summary Statistics, part of VSL in Intel® MKL. The examples provided in the article demonstrate the ease of use of VSL Summary Statistics in different cases, e.g., processing of out-of-memory data and simultaneous computation of several statistical estimates. Additional information about the API and usage models is available in [MKLMan] and [SSApp].<br /><br /><br /><b>Bibliography : </b><br /><br />[MKLMan] Intel® MKL Manual, http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/ <br /><br />[SSApp] Summary Statistics Application Notes, <br /><a href="http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/">http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/</a></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/overview-of-summary-statistics-ss-in-intel-mkl-v103/</link>
      <pubDate>Tue, 14 Dec 2010 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/overview-of-summary-statistics-ss-in-intel-mkl-v103/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/overview-of-summary-statistics-ss-in-intel-mkl-v103/</guid>
      <category>Financial Services Industry</category>
      <category>Intel Software Network communities</category>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Linux* Knowledge Base</category>
      <category>Intel® Cluster Toolkit for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    </item>
    <item>
      <title>Difference of code analysis approaches in compilers and specialized tools</title>
      <description><![CDATA[ <p><img align="left" src="http://www.viva64.com/external-pictures/habr51-01.png"/>Compilers and third-party static code analyzers have one common task: to detect dangerous code fragments. However, there is a great difference in the types of analysis performed by each kind of these tools. I will try to show you the differences between these two approaches (and explain their source) by the example of the Intel C++ compiler and PVS-Studio analyzer.</p>
<p>This time, it is the Notepad++ 5.8.2 project that we chose for the test.</p>
<h2>Notepad++</h2>
<p>First, a couple of words about the project we have chosen. <a href="http://notepad-plus-plus.org/">Notepad++</a> is a free, open-source source code editor that supports many languages and serves as a substitute for the standard Notepad. It works in the Microsoft Windows environment and is released under the GPL license. What I liked about this project is that it is written in C++ and is small: just 73000 lines of code. Most importantly, it is a rather tidy project: it is compiled with the /W4 switch enabled in the project's settings, together with the /WX switch, which makes the compiler treat each warning as an error.</p>
<h2>Static analysis by compiler</h2>
<p>Now let's study the analysis procedure from the viewpoints of a compiler and a separate specialized tool. The compiler is always inclined to generate warnings after processing only very small local code fragments. This preference is a consequence of the very strict performance requirements imposed on the compiler. It is no coincidence that tools for distributed project builds exist. The time needed to compile medium and large projects is a significant factor influencing the choice of development methodology. So if developers can get a 5% performance gain out of the compiler, they will do it.</p>
<p>Such optimization makes the compiler more monolithic, and steps such as preprocessing, building the <a href="http://www.viva64.com/terminology/Abstract_syntactical_tree.html">AST</a>, and code generation are in fact not so distinct. For instance, judging by some indirect signs, Visual C++ uses different preprocessor algorithms when compiling projects and when generating preprocessed "*.i" files. The compiler also does not need to store the whole AST (doing so would even be harmful for it). Once the code for some particular nodes is generated and they are no longer needed, they are destroyed right away. During the compilation process, the AST may never exist in its full form. There is simply no need for that: we parse a small code fragment, generate the code, and go further. This saves memory and cache and therefore increases speed.</p>
<p>The result of this approach is "locality" of warnings. The compiler consciously saves on various structures that could help it detect higher-level errors. Let's see in practice what local warnings Intel C++ will generate for the Notepad++ project. Let me remind you that the Notepad++ project is built with the Visual C++ compiler without any warnings with the /W4 switch enabled. But the Intel C++ compiler certainly has a different set of warnings and I also set a specific switch /W5 [Intel C++]. Moreover, I would like to have a look at what the Intel C++ compiler calls "remark".</p>
<p>Let's see what kinds of messages we get from Intel C++. Here it found four similar errors where the CharUpper function is being handled (Look at the note at the end of text). Note the "locality" of the diagnosis - the compiler found just a very dangerous type conversion. Let's study the corresponding code fragment:</p>
<pre name="code" class="cpp">wchar_t *destStr = new wchar_t[len+1];
...
for (int j = 0 ; j &lt; nbChar ; j++)
{
  if (Case == UPPERCASE)
    destStr[j] =
      (wchar_t)::CharUpperW((LPWSTR)destStr[j]);
  else
    destStr[j] =
      (wchar_t)::CharLowerW((LPWSTR)destStr[j]);
}</pre>
<p>Here we see strange type conversions. The Intel C++ compiler warns us: "#810: conversion from "LPWSTR={WCHAR={__wchar_t} *}" to "__wchar_t" may lose significant bits". Let's look at the CharUpper function's prototype.</p>
<pre name="code" class="cpp">LPTSTR WINAPI CharUpper(
  __inout  LPTSTR lpsz
);</pre>
<p>The function handles a string and not separate characters at all. But here a character is cast to a pointer and some memory area is modified by this pointer. How horrible. </p>
<p>Well, actually this is the only horrible issue detected by Intel C++. All the rest are much more boring and point to sloppy code rather than error-prone code. But let's study some other warnings too.</p>
<p>The compiler generated a lot of #1125 warnings:</p>
<p>"#1125: function "Window::init(HINSTANCE, HWND)" is hidden by "TabBarPlus::init" -- virtual function override intended?"</p>
<p>These are not errors but just poor naming of functions. We are interested in this message for a different reason: although the check seems to involve several classes, the compiler does not need to keep any special data for it. It must store information about base classes anyway, which is why this diagnosis is implemented.</p>
<p>The next sample. The message "#186: pointless comparison of unsigned integer with zero" is generated for the meaningless comparisons:</p>
<pre name="code" class="cpp">static LRESULT CALLBACK hookProcMouse(
  UINT nCode, WPARAM wParam, LPARAM lParam)
{
  if(nCode &lt; 0)
  {
    ...
    return 0;
  }
...
}</pre>
<p>The "nCode &lt; 0" condition is always false because nCode is unsigned. This is a good example of a useful local diagnosis: you can easily find an error this way.</p>
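<p>The effect behind warning #186 can be reproduced in a few lines of C. The two functions below are hypothetical stand-ins for the hook procedure, written only to contrast an unsigned parameter with a signed one.</p>

```c
/* Unsigned parameter: a "negative" value wraps around to a large positive
   one, so the comparison with zero can never be true. */
int negative_branch_taken_unsigned(unsigned int code)
{
    if (code < 0)   /* always false */
        return 1;
    return 0;
}

/* Signed parameter: the same check works as intended. */
int negative_branch_taken_signed(int code)
{
    if (code < 0)
        return 1;
    return 0;
}
```

<p>Passing -1 to the unsigned version converts it to the largest unsigned value, so the branch is skipped, while the signed version takes it.</p>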
<p>Let's consider the last warning by Intel C++ and get finished with it. I think you have understood the concept of "locality".</p>
<pre name="code" class="cpp">void ScintillaKeyMap::showCurrentSettings() {
  int i = ::SendDlgItemMessage(...);
  ...
  for (size_t i = 0 ; i &lt; nrKeys ; i++)
  {
    ...
  }
}</pre>
<p>Again we have no error here. It is just poor naming of variables. The "i" variable has the "int" type at first. Then a new "i" variable of the "size_t" type is defined in the "for()" statement and is used for a different purpose. At the moment when "size_t i" is defined, the compiler knows that a variable with the same name already exists and generates the warning. Again, this did not require the compiler to store any additional data: it must remember anyway that the "int i" variable is available until the end of the function's body.</p>
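<p>A minimal illustration of this kind of name hiding, with a hypothetical function written for this purpose, shows that the outer variable is untouched by the loop even though both share the same name:</p>

```c
/* The loop declares its own "i", which hides the outer "i" for the
   duration of the loop; the outer variable keeps its value. */
int shadow_demo(void)
{
    int i = 42;                   /* outer variable */
    int sum = 0;
    for (int i = 0; i < 3; i++)   /* inner variable hides the outer one */
        sum += i;                 /* adds 0, 1 and 2 */
    return i * 100 + sum;         /* outer i is still 42 here */
}
```

<p>The code is valid, but a reader can easily mistake one variable for the other, which is what the compiler warning points out.</p>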
<h2>Third-party static code analyzers</h2>
<p>Now let's consider specialized static code analyzers. They do not have such severe speed restrictions since they are launched far less frequently than compilers. They may work tens of times slower than code compilation, but this is not crucial: for instance, the programmer may work with the compiler during the day and launch a static code analyzer at night to get a report about suspicious fragments in the morning. It is quite a reasonable approach.</p>
<p>At the cost of slower operation, static code analyzers can store the whole code tree, traverse it several times, and store a lot of additional information. This lets them find "spread-out" and high-level errors.</p>
<p>Let's see what the <a href="http://www.viva64.com/pvs-studio/">PVS-Studio</a> static analyzer can find in Notepad++. Note that I am using a pilot version that is not available for download yet. We will present the new free general-purpose rule set in 1-2 months within the scope of PVS-Studio 4.00.</p>
<p>Of course, the PVS-Studio analyzer also finds errors that can be called "local", just as Intel C++ does. This is the first sample:</p>
<pre name="code" class="cpp">bool _isPointXValid;
bool _isPointYValid;
bool isPointValid() {
  return _isPointXValid &amp;&amp; _isPointXValid;
};</pre>
<p>The PVS-Studio analyzer informs us: "V501: There are identical sub-expressions to the left and to the right of the '&amp;&amp;' operator: _isPointXValid &amp;&amp; _isPointXValid".</p>
<p>I think the error is clear to you and we will not dwell upon it. The diagnosis is "local" because it is enough to analyze one expression to perform the check.</p>
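<p>For completeness, here is a sketch of the consequence. I assume the intended check combines the X and Y flags; the two functions are hypothetical stand-ins that take the flags as parameters.</p>
<pre name="code" class="cpp">// Hypothetical stand-ins for the fragment above.
bool isPointValidBuggy(bool xValid, bool yValid)
{
  (void)yValid;              // never consulted: this is the bug
  return xValid &amp;&amp; xValid;
}

// Presumably intended version (an assumption).
bool isPointValidFixed(bool xValid, bool yValid)
{
  return xValid &amp;&amp; yValid;
}</pre>
<p>With a valid X and an invalid Y, the buggy version reports the point as valid, while the fixed one does not.</p>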
<p>Here is one more local error causing incomplete clearing of the _iContMap array:</p>
<pre name="code" class="cpp">#define CONT_MAP_MAX 50
int _iContMap[CONT_MAP_MAX];
...
DockingManager::DockingManager()
{
  ...
  memset(_iContMap, -1, CONT_MAP_MAX);
  ...
}</pre>
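<p>Here we have the warning "V512: A call of the memset function will lead to a buffer overflow or underflow". The third argument of memset is a size in bytes, not a number of elements, so only the first CONT_MAP_MAX bytes of the array are filled. This is the correct code:</p>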
<pre name="code" class="cpp">memset(_iContMap, -1, CONT_MAP_MAX * sizeof(int));</pre>
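<p>The effect of the wrong size argument can be measured with a small sketch. The kMapSize constant and the countFilled function are hypothetical names of mine; the function zeroes a local array, calls memset with the given byte count and counts the elements that ended up equal to -1.</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;cstring&gt;

const std::size_t kMapSize = 50; // mirrors CONT_MAP_MAX

// Zeroes a local array, calls memset with the given byte count
// and returns how many elements ended up equal to -1.
std::size_t countFilled(std::size_t byteCount)
{
  int map[kMapSize] = {0};
  std::memset(map, -1, byteCount);

  std::size_t filled = 0;
  for (std::size_t i = 0; i &lt; kMapSize; i++)
  {
    if (map[i] == -1)
      filled++;
  }
  return filled;
}</pre>
<p>On a platform where int takes 4 bytes, countFilled(kMapSize) reports only 12 filled elements out of 50, while countFilled(kMapSize * sizeof(int)) reports all 50.</p>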
<p>And now let's go over to more interesting issues. This is the code where we must analyze two branches simultaneously to see that there is something wrong:</p>
<pre name="code" class="cpp">void TabBarPlus::drawItem(
  DRAWITEMSTRUCT *pDrawItemStruct)
{
  ...
  if (!_isVertical)
    Flags |= DT_BOTTOM;
  else
    Flags |= DT_BOTTOM;
  ...
}</pre>
<p>PVS-Studio generates the message "V523: The 'then' statement is equivalent to the 'else' statement". If we review the code nearby, we may conclude that the author intended to write this code:</p>
<pre name="code" class="cpp">if (!_isVertical)
  Flags |= DT_VCENTER;
else
  Flags |= DT_BOTTOM;</pre>
<p>And now brace yourself for a trial in the form of the following code fragment:</p>
<pre name="code" class="cpp">void KeyWordsStyleDialog::updateDlg() 
{
  ...
  Style &amp; w1Style =
    _pUserLang-&gt;_styleArray.getStyler(STYLE_WORD1_INDEX);
  styleUpdate(w1Style, _pFgColour[0], _pBgColour[0],
    IDC_KEYWORD1_FONT_COMBO, IDC_KEYWORD1_FONTSIZE_COMBO,
    IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
    IDC_KEYWORD1_UNDERLINE_CHECK);

  Style &amp; w2Style =
    _pUserLang-&gt;_styleArray.getStyler(STYLE_WORD2_INDEX);
  styleUpdate(w2Style, _pFgColour[1], _pBgColour[1],
    IDC_KEYWORD2_FONT_COMBO, IDC_KEYWORD2_FONTSIZE_COMBO,
    IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
    IDC_KEYWORD2_UNDERLINE_CHECK);

  Style &amp; w3Style =
    _pUserLang-&gt;_styleArray.getStyler(STYLE_WORD3_INDEX);
  styleUpdate(w3Style, _pFgColour[2], _pBgColour[2],
    IDC_KEYWORD3_FONT_COMBO, IDC_KEYWORD3_FONTSIZE_COMBO,
    IDC_KEYWORD3_BOLD_CHECK, IDC_KEYWORD3_BOLD_CHECK,
    IDC_KEYWORD3_UNDERLINE_CHECK);

  Style &amp; w4Style =
    _pUserLang-&gt;_styleArray.getStyler(STYLE_WORD4_INDEX);
  styleUpdate(w4Style, _pFgColour[3], _pBgColour[3],
    IDC_KEYWORD4_FONT_COMBO, IDC_KEYWORD4_FONTSIZE_COMBO,
    IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
    IDC_KEYWORD4_UNDERLINE_CHECK);
  ...
}</pre>
<p>I am proud that our analyzer, PVS-Studio, managed to find an error here. I suspect you have not noticed it, or have simply skipped the whole fragment to get to the explanation. Code review is almost helpless against code like this. But the static analyzer is patient and pedantic: "V525: The code containing the collection of similar blocks. Check items '7', '7', '6', '7' in lines 576, 580, 584, 588".</p>
<p>I will abridge the text to point out the most interesting fragment:</p>
<pre name="code" class="cpp">styleUpdate(...
  IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
  ...);
styleUpdate(...
  IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
  ...);
styleUpdate(...
  IDC_KEYWORD3_BOLD_CHECK, !!! IDC_KEYWORD3_BOLD_CHECK !!!,
  ...);
styleUpdate(...
  IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
  ...);</pre>
<p>This code was most likely written by copy-paste. As a result, IDC_KEYWORD3_BOLD_CHECK is used where IDC_KEYWORD3_ITALIC_CHECK was intended. The warning looks a bit strange with its numbers '7', '7', '6', '7'. Unfortunately, the analyzer cannot generate a clearer message. These numbers come from macros like these:</p>
<pre name="code" class="cpp">#define IDC_KEYWORD1_ITALIC_CHECK (IDC_KEYWORD1 + 7)
#define IDC_KEYWORD3_BOLD_CHECK (IDC_KEYWORD3 + 6)</pre>
<p>The last sample is especially significant because it shows that the PVS-Studio analyzer processed a whole large code fragment at once, detected repetitive structures in it and, relying on a heuristic method, managed to suspect that something was wrong. This is a very significant difference in the level of information processing performed by compilers and by static analyzers.</p>
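<p>To give a feeling for this kind of heuristic, here is a toy sketch. It is not PVS-Studio's algorithm, only my illustration of the idea: given one number per similar block (such as '7', '7', '6', '7'), it looks for the single value that differs from all the others. The function name findOddOneOut is hypothetical.</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Toy heuristic, not PVS-Studio's algorithm: returns the index of
// the single value that differs from all the others, or -1 if
// there is no such single outlier.
int findOddOneOut(const std::vector&lt;int&gt;&amp; values)
{
  if (values.size() &lt; 3)
    return -1;

  // The majority value must occur at least twice
  // among the first three items.
  int majority = (values[0] == values[1] || values[0] == values[2])
                   ? values[0] : values[1];

  int outlier = -1;
  for (std::size_t i = 0; i &lt; values.size(); i++)
  {
    if (values[i] != majority)
    {
      if (outlier != -1)
        return -1; // more than one mismatch: no clear outlier
      outlier = static_cast&lt;int&gt;(i);
    }
  }
  return outlier;
}</pre>
<p>For the items 7, 7, 6, 7 the function returns index 2, which corresponds to the third styleUpdate() call. A real analyzer has to recognize the similar blocks first, which is the hard part and is not shown here.</p>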
<h2>Some figures</h2>
<p>Let's touch upon one more consequence of the "local" analysis performed by compilers and the more global analysis performed by specialized tools. With "local" analysis, it is difficult to tell whether an issue is really dangerous or not. As a result, there are ten times more false alarms. Let me explain this by example.</p>
<p>When we analyzed the Notepad++ project, PVS-Studio generated only 10 warnings. 4 messages out of them indicated real errors. The result is modest, but general-purpose analysis in PVS-Studio is only beginning to develop. It will become one of the best in time.</p>
<p>When the Notepad++ project was analyzed with the Intel C++ compiler, the compiler generated 439 warnings and 3139 remarks. I do not know how many of them point to real errors, but I found the strength to review part of these warnings and saw only 4 real issues related to CharUpper (see the above description).</p>
<p>3578 messages are too many to investigate each of them closely. In effect, the compiler asks me to examine every 20th line of the program (73000 / 3578 ≈ 20). Come on, that is not serious. When you are dealing with a general-purpose analyzer, you must cut off as much unnecessary stuff as possible.</p>
<p>Those who tried the <a href="http://www.viva64.com/viva64-tool/">Viva64</a> rule set (included in PVS-Studio) may notice that it produces a similarly huge number of false alarms. But we have a different case there: we must detect all the suspicious type conversions. It is more important not to miss an error than to avoid a false alarm. Besides, the tool's settings provide flexible filtering of false alarms.</p>
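<p>The arithmetic behind this estimate can be spelled out in a few lines. The constants are the figures quoted in the text; the names are mine.</p>
<pre name="code" class="cpp">// Figures quoted in the text.
const int kWarnings = 439;
const int kRemarks = 3139;
const int kSourceLines = 73000;

int totalMessages()
{
  return kWarnings + kRemarks;
}

// Average number of source lines per compiler message
// (integer division, so the result is rounded down).
int linesPerMessage()
{
  return kSourceLines / totalMessages();
}</pre>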
<h2>UPDATE: Note</h2>
<p>It turned out that I had made a mistake here. There is no error
in the sample with CharUpperW, but nobody corrected me. I noticed it
myself when I decided to implement a similar rule in PVS-Studio.</p>

<p>The point is that <a
href="http://msdn.microsoft.com/en-us/library/ms647474(VS.85).aspx">CharUpperW</a>
can handle both strings and individual characters. If the high-order
word of the pointer is zero, the argument is treated as a single
character and no longer as a pointer. Of course, the Win32 API design
here disappointed me, but the code in Notepad++ is correct.</p>
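<p>The convention can be imitated in portable code. The sketch below is not the Windows implementation and does not call the Windows API; the two function names are hypothetical, and the 16-bit boundary is my assumption based on the description above.</p>
<pre name="code" class="cpp">#include &lt;cstdint&gt;

// Portable imitation of the convention, not the Windows
// implementation: an argument whose bits above the low-order
// 16 bits are all zero is treated as a single character
// instead of an address.
bool isSingleCharArgument(const wchar_t* arg)
{
  return (reinterpret_cast&lt;std::uintptr_t&gt;(arg) &gt;&gt; 16) == 0;
}

// Packs a character code into a pointer-sized argument, the way
// a caller would when passing one character to such a function.
const wchar_t* makeCharArgument(std::uint16_t ch)
{
  return reinterpret_cast&lt;const wchar_t*&gt;(
    static_cast&lt;std::uintptr_t&gt;(ch));
}</pre>
<p>A value produced by makeCharArgument is recognized as a character, while the address of an ordinary string is not, because on common platforms real addresses do not fit into 16 bits. This is why a check that flags every non-pointer argument of such a function would be a false alarm.</p>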

<p>By the way, it turns out now that Intel C++ has not found any errors
at all.</p>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/difference-of-code-analysis-approaches-in-compilers-and-specialized-tools/</link>
      <pubDate>Sun, 31 Oct 2010 10:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/difference-of-code-analysis-approaches-in-compilers-and-specialized-tools/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/difference-of-code-analysis-approaches-in-compilers-and-specialized-tools/</guid>
      <category>Intel Software Network communities</category>
    </item>
  </channel></rss>
