<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Wed, 23 May 2012 11:53:15 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/game-development/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/game-development/type/technical-article/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Using Intel® Power Checker to measure the energy performance of a compute-intensive application </title>
      <description><![CDATA[ <p>Intel® Power Checker provides developers with a quick and easy way to evaluate the idle power efficiency of their applications on mobile platforms with Intel® Core™ processor or Intel® Atom™ technology running the Microsoft Windows* operating system. Intel Power Checker can analyze any compiled-language application, especially those designed to run on technology based on Intel® products, as well as Java Framework applications. The checker can be used with or without a supported external power meter.</p>
<p>The Intel Power Checker 2.0 now supports measurement both on battery and with the system plugged into an external AC power source. External power measurement is only supported on Intel® Second Generation Core processors and if the Intel® Power Gadget software has been installed.</p>
<p>For this article, I took a very compute-intensive parallel application that I wrote to solve instances of the logic puzzle Akari. The code uses a backtracking algorithm to explore how to place light bulbs onto a grid under constraints dictated by the rules of the puzzle and the layout of the puzzle instance. Potentially millions of independent tasks can be generated by the code as the solution space is searched by threads executing those tasks. This solution method is eminently scalable to a large number of threads and is able to keep many cores running at peak speed for a sustained amount of time.</p>
<h2>How to Use Intel Power Checker</h2>
<p>The Intel Power Checker provides a GUI wizard that leads you through the four steps of power analysis. These four steps in the checker are described below. Before starting the assessment, be sure to know which section of your application (a workload) you want to be measured, as the Power Checker will only measure a 30 second execution interval. (If you want to measure the entire execution workload, you should try some other tool, like Intel Power Gadget.) Your workload could be a compute-intensive portion or an I/O-intense section or just some point in execution that typifies the majority of expected usage.</p>
<h3 >Step 1: Specifying the Power Meter device</h3>
<p>If you have an external power meter attached to your test system, you can select the model being used on the first screen of the wizard. The default is that no external device is being used. For this default case, Intel Power Checker will determine if the system is capable of providing power consumption data and if the correct power driver, EzPwr.sys, is installed. (The driver is part of the default installation of <a href="http://software.intel.com/en-us/articles/intel-power-gadget/">Intel Power Gadget</a>.)</p>
<h3 >Step 2: Measure System Baseline</h3>
<p>The first measurement that the Intel Power Checker will perform is on the next screen within the wizard. This is to measure the baseline power consumption of the hardware without your application running. Prior to this measurement phase any unnecessary processes such as operating system updates, Windows Indexing Service, virus scans, media players, and internet browsers should have been shut down. In other words, to get the most accurate results you should make your test system as idle as possible and ensure that nothing will become a foreground process during your measurement runs.</p>
<p>Once you have a quiescent system, click the “Start” button to begin this phase of the testing. The Intel Power Checker waits 15 seconds to allow the system to come to an idle state before starting the measurements. You need to be sure to position your mouse and the keyboard out of reach, or keep your hands away from them, to avoid any stray contact that might trigger some response from the platform. After the pause, the checker will observe the system for 30 seconds in this idle state. A progress bar will show the time remaining in each part of this phase. Once the baseline data collection is complete, click the “Next” button to proceed to the next phase.</p>
<h3 >Step 3: Measure Active Application</h3>
<p>Before you are taken to the next screen in the wizard, you are instructed to start the application you are interested in measuring. Start up your application and click the “OK” button to advance the GUI to the next screen. Once you have reached the Step 3 screen, use the scroll bar to locate your application in the process list and click on that line to select it. If your application is not listed, click the “Refresh List” button so that your application’s process will be available to select. In addition, you can use the “Apply Filter” button to narrow down the list in order to find your application’s process quickly. After selecting your application from the list, click “Next” to move on to the data collection for this phase. Before starting the assessment, be sure your application has reached the desired point of measurement. If there are some initial setup computations that are not of interest, you will need to get past this point before letting Intel Power Checker begin measurement. For my Akari application, there is very little setup time. It was typically in the thick of computation by the time I had gotten to the point of selecting the process from the list.</p>
<p>As soon as I could, I clicked the “Start” button to begin capturing measurement data. Since this is one of the crucial power measurements for your application, always begin capturing data <b>after</b> the workload or critical section has begun and make sure this active execution will run longer than the 30 seconds needed to complete the measurement time.</p>
<h3 >Step 4: Measure Idle Application</h3>
<p>The final phase is to measure your application’s idle power consumption. This is another important phase of energy efficiency measurement of an application since your application must not only do efficient computation, but also not waste energy when sitting idle.</p>
<p>This step doesn’t make much sense within my compute-intensive application since there is no idle state of the application. Once you start the application on a given puzzle instance, it simply computes all legal solutions in parallel and then ends. As (multiple) solutions are found, each is printed out by the thread that found it. If there are no solutions, a message is printed just before the application terminates. This latter case describes the workload I used for my tests. Because you must have your application running in “idle” mode for this step, I left the application running at full speed and simply allowed Power Checker to take its measurements.</p>
<p>If your application does have an idle state, perhaps waiting for interaction from the user, the checker will give the system 15 seconds to calm down fully before taking a final 30 second measurement.</p>
<p>Upon completion of this last data collection phase, you will be able to proceed to the results screen within the Intel Power Checker wizard. After all three measurement phases have been completed, a Tool Report File will be generated containing all of the results for later analysis.</p>
<h3 >What data is presented</h3>
<p>The View Results screen of the Intel Power Checker wizard provides basic information about the software assessment. The type of processor in your system and the type and model of the power source that was used are given. Four numerical values for each of the three measurement phases are presented. These values are:</p>
<ul >
<li><b>Elapsed Time:</b> The exact number of seconds that each of the phases lasted.</li>
<li><b>Energy Consumption:</b> The rate that the battery was discharged during each of the three phases.</li>
<li><b>Average C3 State Residency:</b> The percentage of time that the system was in the C3 state during the data collection period.</li>
<li><b>Platform Timer Period:</b> The number of milliseconds that the platform timer collected</li>
</ul>
<p><img src="http://software.intel.com/file/42410" /></p>
<p>Typical results would hopefully show a larger percentage of time spent in the C3 State Residency for the application idle time measurement (the middle of the three columns on the View Results screen). As my puzzle solving application was still computing as much as it did in the active execution measurement step, this was not the case for my results. This is atypical for the intended type of applications Intel Power Checker assumes will be measured. Thus, the C3 State Residency values provided by the tool for the idle application were not valid for my particular application.</p>
<p>The name of the report file and the directory in which it can be found are listed on the View Results screen.</p>
<h2>Some Caveats</h2>
<p>Below are some things you should consider before and during a measurement run using Intel Power Checker.</p>
<ul >
<li>Before you start using Intel Power Checker, be sure your chosen workload will run for at least 30 seconds from the point you wish to measure power consumption. In my case, I required a data set that would force the application to run for at least 75 seconds (30 for active measurement, 15 for idle setup, and 30 for idle measurement) plus the time I needed to click boxes and find my application in the process list. Since I ran the application on several different numbers of threads, I needed to be sure that the fastest execution time was still large enough to get all the timing steps completed during an Intel Power Checker run.</li>
<li>Upon starting Intel Power Checker, the checker may first report that the platform timer period is invalid. In this case, some currently running (background) process has changed the default and it will be up to the user to determine which currently running application has changed the value. Once you have identified the culprit you must stop this process or service before restarting Intel Power Checker. If you are unsure about which active process is preventing Intel Power Checker from starting, you will need to turn off processes one at a time and try Intel Power Checker until the error message doesn’t come up. </li>
<li>Instructions on the Step 3 screen ask you not to touch the keyboard or mouse. If you are measuring an interactive application or you must interact with the application to generate activity for the full 30 seconds, you will need to touch the keyboard and/or mouse. If possible, a workload that can forego interactivity and still compute for the 30 seconds of measurement time would be best. However, if interaction by the user is part of how the application is utilized, interfacing through peripherals will give you a more accurate measure of the overall energy consumption for typical application usage.</li>
<li>A data file is created during each phase of the Intel Power Checker assessment to hold the current information. If you cancel the assessment in any of the three phases then a data file will not be created for that phase. After all three phases have been completed, a Tool Report File, in XML format, will be generated containing all of the results. You can find the name of the report file and where it is located on the View Results screen.</li>
<li>The “Submit Results” button on the View Results screen is optional and only intended for members of the <a href="http://software.intel.com/partner/overview">Intel® Software Partner Program</a> to submit their measurement results to the program. If you are not a member, do not submit your results. Simply click on the “Close” button after you have examined the results compiled by Intel Power Checker.</li>
</ul>
<h2>Some Results</h2>
<p>The purpose of this article is not to determine the best scenario for running my Akari solver application in the most energy efficient way. You will want to do this for your application, though, and this article has given you the background on Intel Power Checker to determine if this checker can help you quantify the current power consumption of your application. Also, as you make modifications to the application you will be able to determine if those changes improve the energy efficiency or cause your application to suck more power than before.</p>
<p>In addition to the average C3 State Residency percentage, the checker delivers the total number of Joules expended during the 30 seconds of execution time measured. From this I can compute the average Watts for execution parts of the application. I have found that a better metric for comparing different applications or different runs of the same application is milliwatt hours (mWh). You need the total execution time of the execution portion of the application to compute this value. Since Intel Power Checker only measures activity in 30 second segments, you will need to have some timing data available, which I happened to have for the different runs I made of my Akari application.</p>
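<p>The conversion described above can be sketched in a few lines of C++. The function names and the input values are illustrative assumptions, not measurements from the Akari runs, and the sketch assumes that the average power seen in the 30-second window holds for the whole execution.</p>
<pre name="code" class="cpp">#include &lt;cstdio&gt;

// Average power (watts) over a measurement window: joules divided by seconds.
double average_watts(double joules, double window_seconds) {
    return joules / window_seconds;
}

// Energy for the whole run in milliwatt hours, assuming the average power
// measured in the window holds for the full execution time.
// 1 Wh = 3600 J, so 1 mWh = 3.6 J.
double energy_mwh(double avg_watts, double total_seconds) {
    return avg_watts * total_seconds / 3.6;
}

int main() {
    // Hypothetical inputs, not measurements from the article.
    const double joules_in_window = 900.0;  // reported for the 30-second window
    const double window_seconds   = 30.0;
    const double total_seconds    = 120.0;  // separately timed full run

    const double watts = average_watts(joules_in_window, window_seconds);
    const double mwh   = energy_mwh(watts, total_seconds);
    std::printf("average power: %.2f W, energy: %.2f mWh\n", watts, mwh);
    return 0;
}
</pre>
<p>With these made-up inputs, 900 J over 30 seconds is 30 W, and 30 W sustained for 120 seconds is 3600 J, or 1000 mWh. Comparing runs by mWh rather than by watts accounts for a faster run drawing more power but finishing sooner.</p>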
<p>I found significant differences when running with and without Hyper-Threading Technology (HT) turned on. Also, if the platform was running on battery (DC) power or from the wall socket (AC) power, a difference in execution time and power usage was evident. For example, when running with HT on and a full complement of four threads on the 4 logical cores in my system, I saw the AC power run 1.19X faster than when running the same workload on DC power. However, the former run took 1.15X more power.</p>
<p>Comparing results between runs on DC power versus AC power is not a good comparison, especially in this case. The power source is detected by the system and the processor is allowed to run with Intel® Turbo Boost Technology at a higher frequency if the platform is using external power. Even so, you may need to be concerned about power consumption of your application in both power source circumstances and you will need to run measurement experiments within each setup to gauge how well your application modifications affect overall power consumption.</p>
<h3 >System Requirements</h3>
<p>You can use Intel Power Checker on a laptop or netbook based on Intel® Core™ processor or Intel® Atom™ processor technology. A desktop with an external power meter or a desktop that is capable of providing the power consumption information can also be analyzed. A Java* Runtime Environment (JRE) (version 6 update 11 or higher) is also required to run the checker. Supported operating systems are Microsoft Windows* XP (Service Pack 3), Microsoft Windows Vista* (Service Pack 2), Microsoft Windows* 7 (Service Pack 1 [32-bit and 64-bit]), and Microsoft Windows* Server 2008 R2.</p>
<h3 >Download link</h3>
<p>To download the Intel Power Checker installation package, go to the following link:</p>
<p><a href="http://software.intel.com/partner/app/software-assessment">http://software.intel.com/partner/app/software-assessment/</a>. Click on the Intel Power Checker tab to move down to the download link.</p>
<h3 >Other supporting links</h3>
<p>There is a video demonstration of using Intel Power Checker, “A Look at Intel Power Checker,” at the link: <a href="http://software.intel.com/en-us/videos/channel/intel-software-partner-program/a-look-at-the-intel-power-checker/1127786023001">http://software.intel.com/en-us/videos/channel/intel-software-partner-program/a-look-at-the-intel-power-checker/1127786023001</a>. Dave Valdovinos and Taylor Kidd, both from Intel, show off the GUI wizard as it measures the power performance of a game-like application.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/</link>
      <pubDate>Mon, 12 Mar 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-power-checker-to-measure-the-energy-performance-of-a-compute-intensive-application/</guid>
      <category>Mobility</category>
      <category>Parallel Programming</category>
      <category>Intel® AppUp(SM) Developer Community</category>
      <category>Intel Software Network communities</category>
      <category>Intel SW Partner program</category>
      <category>Game Development</category>
      <category>Power Efficiency</category>
      <category>Intel® vPro™ Developer Community</category>
      <category>Resources For Software Developers</category>
      <category>Ultrabook</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Debugging OpenCL™ Kernels Using Intel® OpenCL SDK Debugger</title>
      <description><![CDATA[ <h2 class="sectionHeading"><a name="0"></a>Table of Contents</h2>
<span><a  href="http://software.intel.com#1">1. What is Intel® OpenCL SDK Debugger</a><br /><a  href="http://software.intel.com#2">2. Debugging Your OpenCL Kernel with Intel® OpenCL SDK Debugger</a><br /><a  href="http://software.intel.com#3">3. Configuring Intel® OpenCL SDK Debugger</a><br /><a  href="http://software.intel.com#4">4. Troubleshooting the Intel® OpenCL SDK Kernel Debugger</a><br /><br /><a name="1"></a>
<h2 class="sectionHeading">What is Intel® OpenCL SDK Debugger</h2>
<p><b><a href="http://software.intel.com#0">Back to top</a></b></p>
<p>The Intel® OpenCL SDK Debugger is a Microsoft* Visual Studio* 2008 plug-in which enables you to debug into OpenCL kernels using the familiar graphical interface of the Microsoft* Visual Studio* 2008 debugger.</p>
<p><b><i>NOTE:</i></b> The Intel® OpenCL SDK Debugger provides a seamless debugging experience across host and OpenCL code, by supporting host code debugging and OpenCL kernel debugging in a single Microsoft* Visual Studio* debug session. The Intel® OpenCL SDK Debugger works with Microsoft* Visual Studio* 2008 only. Other versions of Microsoft* Visual Studio* are not supported. You must acquire Microsoft* Visual Studio* 2008 separately. For more information, see the Visual Studio* 2008 page at <a href="http://www.microsoft.com/visualstudio/en-us/products/2008-editions/">http://www.microsoft.com/visualstudio/en-us/products/2008-editions/</a>.</p>
<a name="2"></a>
<h2 class="sectionHeading">Debugging Your OpenCL Kernel with Intel® OpenCL SDK Debugger</h2>
<p><b><a href="http://software.intel.com#0">Back to top</a></b></p>
<p><b><i>NOTE:</i></b> To work with the Intel® OpenCL SDK Debugger plug-in, the OpenCL kernel code must exist in a text file separate from the code of the host. Debugging OpenCL code which only appears in a string embedded in the host application is not supported.</p>
<p>To debug an OpenCL kernel, follow these steps:</p>
<ol>
<li>Enable debugging mode in the Intel® OpenCL runtime for compiling the OpenCL code: add the <code>-g</code> flag to the <b>build options</b> string parameter of the <code>clBuildProgram</code> function.</li>
<li>Specify the full path of the OpenCL source file in the <b>build options</b> string parameter of the <code>clBuildProgram</code> function, using the following option:</li>
</ol>
<p><code>-s &lt;full path to the OpenCL source file&gt;</code></p>
<p>If there are spaces in the path you should enclose the entire path with double quotes (“”).</p>
<p>For example:</p>
<pre name="code" class="cpp">err = clBuildProgram(
          g_program, 
          0, 
          NULL, 
          "-g -s &lt;full path to the OpenCL source file&gt;", 
          NULL, 
          NULL);
</pre>
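<p>When the path contains spaces, the double quotes have to be escaped inside the C string. The sketch below shows one way to assemble the options; the helper name <code>make_debug_build_options</code> and the sample path are hypothetical, and always quoting the path (even when it has no spaces) is an assumption made here for simplicity.</p>
<pre name="code" class="cpp">#include &lt;cstdio&gt;
#include &lt;string&gt;

// Builds the options string for clBuildProgram: "-g" enables debugging and
// "-s" names the kernel source file. The path is wrapped in double quotes
// so that spaces inside it survive.
std::string make_debug_build_options(const std::string&amp; kernel_path) {
    return "-g -s \"" + kernel_path + "\"";
}

int main() {
    // Hypothetical path, used only for illustration.
    const std::string options =
        make_debug_build_options("C:\\My Kernels\\solver.cl");
    std::printf("%s\n", options.c_str());
    // The result would then be passed as the fourth argument:
    // err = clBuildProgram(g_program, 0, NULL, options.c_str(), NULL, NULL);
    return 0;
}
</pre>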
<p>According to the OpenCL standard, many work items execute the OpenCL kernels simultaneously. The Intel® OpenCL SDK Debugger requires you to set, before the debugging session starts, the global ID of the work item that you want to debug. The debugger stops on breakpoints in OpenCL code only when the pre-set work item reaches them.</p>
<a name="3"></a>
<h2 class="sectionHeading">Configuring Intel® OpenCL SDK Debugger</h2>
<p><b><a href="http://software.intel.com#0">Back to top</a></b></p>
<p>To configure the Intel® OpenCL SDK Debugger, open the Debugging Configuration window:</p>
<ol>
<li>Run Microsoft* Visual Studio* 2008.</li>
<li>Select <b>Tools</b> &gt; <b>Intel OpenCL SDK Debugger</b>.</li>
</ol><br /><img src="http://software.intel.com/file/38625" /><br /><br />
<p>In the <b>Basic Settings</b> group box:</p>
<ul>
<li>Check the <b>Enable OpenCL Kernel Debugging</b> check box to switch the Intel® OpenCL SDK Kernel Debugger on or off.</li>
<li>Enter the appropriate values in the <b>Select Work Items</b> field to select work items.</li>
</ul>
<p>You can select only one work item. The values specify its 3D coordinates.</p>
<p>If an NDRange has fewer than three dimensions (that is, 1D or 2D), you must leave the unused dimensions at <code>0</code>.</p>
<a name="4"></a>
<h2 class="sectionHeading">Troubleshooting the Intel® OpenCL SDK Kernel Debugger</h2>
<p><b><a href="http://software.intel.com#0">Back to top</a></b></p>
<p><b><i>NOTE: </i></b>The Intel® OpenCL SDK Debugger needs a local TCP/IP port to work correctly. On some occasions, the debugger may be unable to use this port because of a collision with another application or your firewall program.</p>
<p><b><i>NOTE:</i></b> If you receive the <i>“Protocol error. If the problem continues, try changing Intel OpenCL kernel debugger port”</i> message, you may need to change the debugging port number and/or change your firewall settings.</p>
<p>To change the debugging port number, do the following:</p>
<ol>
<li>Open the <b>OpenCL Debugging Configuration</b> window.</li>
<li>Switch to the <b>Advanced Settings</b> group box.</li>
<li>Check the <b>Use Custom Debugging port</b> check box.</li>
<li>In the <b>Debugging Port Number</b> field, enter the port you need.</li>
</ol>
<p>Intel® OpenCL SDK Kernel Debugger uses port <code>56203</code> by default.</p>
<br /><img src="http://software.intel.com/file/38624" /> <br /><br /><br /><br /><br /><br />
<p>OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.</p>
</span>
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Vadim Kartoshkin</div>
</div>
<div id="vc-meta-pubdate">09-08-2011</div>
<div id="vc-meta-modificationdate">09-08-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div class="oclsdk">Intel® SDK for OpenCL* Applications</div>
</div>
<div id="vc-meta-category"></div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Contents include a description of the Intel® OpenCL SDK Debugger; debugging your OpenCL™ kernel with Intel® OpenCL SDK Debugger; configuring Intel® OpenCL SDK Debugger; and troubleshooting Intel® OpenCL SDK Kernel Debugger.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/debugging-opencl-kernels-using-intel-opencl-sdk-debugger/</link>
      <pubDate>Thu, 08 Sep 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/debugging-opencl-kernels-using-intel-opencl-sdk-debugger/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/debugging-opencl-kernels-using-intel-opencl-sdk-debugger/</guid>
      <category>Parallel Programming</category>
      <category>Visual Computing</category>
      <category>Game Development</category>
      <category>Media</category>
    </item>
    <item>
      <title>Experience and Key Strategies for Load Balancing Games</title>
      <description><![CDATA[ <br />

<h2>By Julien Hamaide</h2>
<h2 class="sectionHeading">Download Article<br /></h2>
<p>Download <a href="http://software.intel.com/file/37857">Load Balancing Games</a> (PDF 800KB)</p>
<p>With the increase in processing unit count, load balancing has become a challenge. Constantly loading all units and using all the available processing power requires specifically designed code. Many resources tell you how to organize or optimize your code and data, but few talk about scheduling and load balancing. This article explains how to effectively dispatch units of work to the available CPUs and adjust the game content to the available computing power.</p>
<h2 class="sectionHeading">SCHEDULING<br /></h2>
<p>Scheduling threads is the operating system’s responsibility. But operating system threads are heavy, and spawning and deleting a thread for each task is overkill. To limit this overhead, games use job systems, which consist of several worker threads that process jobs. The threads are created at the start of the system and are used during the entire game. Jobs can be made of C++ classes or C functions with arguments. Because you don’t use the operating system’s scheduler, however, you must implement a custom scheduler whose role is to dispatch the jobs to all available worker threads. The advantage is that you can set up custom algorithms and adapt them to the game’s needs. The following discussion assumes that the job system implements a dependency manager.</p>
<b>Constraints of a Game Environment</b><br />
<p>Games are real-time software. For each frame, every system must be updated. Every task must be completed on time or the frame rate will decrease. But not all tasks have the same importance. Tasks can be divided into two categories:</p>
<ul >
<li><b>Required tasks in a frame.</b> These tasks are required to complete the current frame. Physics, animation, rendering, and input processing tasks are necessary for correctly displaying the current frame. The constraint on this type of task is a hard constraint: Any delay directly introduces a delay in the frame.</li>
<li><b>Frame-delayed tasks.</b> These tasks are important for the game, but their results are not needed immediately. Pathfinding is a perfect example. If this type of task is delayed, it may not be a problem immediately, because the following frames could absorb the delay. Nonetheless, you must ensure that the task is processed sufficiently early.</li>
</ul>
<b>FIFO Scheduling</b><br />
<p>The simplest type of scheduler, a first-in, first-out (FIFO) scheduler dispatches jobs in the order they were added. Although not sufficient for an entire game, FIFO scheduling can act as a first implementation and reference. This algorithm introduces little overhead: It just picks the oldest job in the list. That said, it doesn’t try to optimize a poor scheduling order, either. Nor does it distinguish between frame-required jobs and frame-delayed jobs.</p>
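<p>As a point of reference, a FIFO job queue with a fixed pool of worker threads can be sketched as follows. This is a minimal illustration rather than the job system the article has in mind: the class name <code>FifoJobQueue</code> is hypothetical, jobs are plain <code>std::function</code> objects, and there is no dependency manager.</p>
<pre name="code" class="cpp">#include &lt;condition_variable&gt;
#include &lt;cstdio&gt;
#include &lt;functional&gt;
#include &lt;mutex&gt;
#include &lt;queue&gt;
#include &lt;thread&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

// Minimal FIFO job queue: workers always take the oldest pending job.
class FifoJobQueue {
public:
    explicit FifoJobQueue(unsigned worker_count) {
        // Worker threads are created once and reused for every job.
        for (unsigned i = 0; i &lt; worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~FifoJobQueue() {
        {
            std::lock_guard&lt;std::mutex&gt; lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread&amp; w : workers_) w.join();
    }
    void push(std::function&lt;void()&gt; job) {
        {
            std::lock_guard&lt;std::mutex&gt; lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function&lt;void()&gt; job;
            {
                std::unique_lock&lt;std::mutex&gt; lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue&lt;std::function&lt;void()&gt;&gt; jobs_;
    bool stopping_ = false;
    std::vector&lt;std::thread&gt; workers_;
};

int main() {
    std::vector&lt;int&gt; order;
    {
        FifoJobQueue queue(1);  // one worker makes the FIFO order observable
        for (int i = 0; i &lt; 5; ++i)
            queue.push([&amp;order, i] { order.push_back(i); });
    }  // destructor drains the queue and joins the worker
    for (int v : order) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
</pre>
<p>With a single worker the jobs complete in exactly the order they were pushed. With several workers they are still dequeued in that order, but they may finish in a different one.</p>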
<p>One important aspect of a scheduler is its behavior over variable processor count. On gaming consoles, the number of available CPUs is known and doesn’t change over time. On a computer, the story is completely different. The number of CPUs is determined at game startup, and different architectures are available.</p>
<p>If there is only one thread, the scheduler executes tasks one after the other, as if the code were a single thread. Because the overhead is low, performance is similar to a single-threaded version. Because the number of CPUs in computers and consoles is growing, however, the scheduler will have random execution times, depending on the order in which the jobs are pushed and how the critical path is processed.</p>
<p>When the number of processors increases, the scheduler tends to reach the optimal time if it does not introduce latency. Jobs will be processed as soon as they are pushed and their dependencies are resolved.</p>
<b>Priority-based Scheduling</b><br />
<p>This scheduler type uses the priority you define to order the jobs. Although based on user input, priority-based scheduling can also be updated to increase performance. User input is kept at minimum granularity, so there is no need to give thousands of priority values. A good set contains these priorities:</p>
<ul >
<li><b>Immediate.</b> Run as soon as a worker thread is ready, or preempt a low-priority thread. This priority is useful if other jobs need this job’s output. For example, an animation could be needed to update the position of a grabbed object.</li>
<li><b>Needed for the end of the frame.</b> This priority is set on jobs that are required but not used before the end of the frame—typically particle system or sound system updates.</li>
<li><b>Low-priority game updates. </b>This priority is for the kind of task you want to execute as quickly as possible but without limiting higher-priority tasks. Examples include pathfinding and saving data to disk.</li>
<li><b>Idle jobs. </b>Those jobs will use idle CPU time. Such jobs can be used to perform background tasks, such as sending statistics to a server or cleaning unused resources.</li>
</ul>
<p>Unfortunately, some problems will appear. Low-priority and idle jobs might never be triggered if the game overwhelms the CPUs. If the target is 60 frames per second (FPS) and the current CPUs are capable of handling 50 FPS, a new frame will start just after the required jobs are handled. New immediate jobs are directly pushed as the new frame begins, preventing lower-priority jobs from being triggered. To avoid this situation, use a dynamic priority system. To supplement user-level priorities, a second priority value is introduced, the value of which is handled by the scheduler itself. For the low-priority scheduling problem, this value can be increased every time a job misses a launch opportunity. Its priority will then eventually exceed the priority of immediate jobs. The same problem might appear if an important job depends on a low-priority job. In this case, the priority is transferred to the job on which it depends.</p>
<p>This alternate priority value allows you to sort jobs within a priority category and boost some jobs into the upper category. You can compute the value using different parameters, and you can adapt the scheduler to your needs by tweaking that value. Here are a few of the features that you can use to modify job priority:</p>
<ul >
<li><b>The number of dependent jobs.</b> If a job blocks a lot of other jobs, boosting its priority can be a huge win.</li>
<li><b>The length of the job.</b> The scheduler can keep statistics about job duration. Longer jobs can be scheduled sooner. If they are scheduled too late, the game might wait for them to finish with no other jobs to execute.</li>
<li><b>Is the job part of the critical path?</b> The critical path is the longest sequence of jobs in the dependency graph. Those jobs must be scheduled with a high priority.</li>
</ul>
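<p>The dynamic priority idea can be sketched with a scheduler-maintained counter that grows each time a job misses a launch opportunity. The sketch is single-threaded and makes several assumptions that are not in the article: the names <code>Job</code>, <code>score</code>, and <code>pop_next</code>, the numeric values of the priority levels, and the rule that one missed launch is worth one priority step. It does not cover transferring priority through dependencies.</p>
<pre name="code" class="cpp">#include &lt;cstddef&gt;
#include &lt;cstdio&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// User-level priorities, most urgent first (numeric values are illustrative).
enum class Priority { Immediate = 0, EndOfFrame = 1, Low = 2, Idle = 3 };

struct Job {
    std::string name;
    Priority user_priority;  // set by the programmer
    int missed_launches;     // maintained by the scheduler
};

// Lower score runs first. Each missed launch opportunity lowers the score,
// so a starved job eventually outranks even Immediate jobs.
int score(const Job&amp; job) {
    return static_cast&lt;int&gt;(job.user_priority) - job.missed_launches;
}

// Removes and returns the job with the lowest score (ties go to the oldest).
// Every job left behind has missed a launch opportunity and is aged by one.
Job pop_next(std::vector&lt;Job&gt;&amp; pending) {
    std::size_t best = 0;
    for (std::size_t i = 1; i &lt; pending.size(); ++i)
        if (score(pending[i]) &lt; score(pending[best])) best = i;
    Job next = pending[best];
    pending.erase(pending.begin() + static_cast&lt;std::ptrdiff_t&gt;(best));
    for (Job&amp; job : pending) ++job.missed_launches;
    return next;
}

int main() {
    std::vector&lt;Job&gt; pending;
    pending.push_back(Job{"pathfinding", Priority::Low, 0});
    // An overloaded machine: every frame adds one Immediate job, but there
    // is only time to run a single job per frame.
    for (int frame = 0; frame &lt; 4; ++frame) {
        pending.push_back(Job{"animation", Priority::Immediate, 0});
        const Job next = pop_next(pending);
        std::printf("frame %d runs %s\n", frame, next.name.c_str());
    }
    return 0;
}
</pre>
<p>In this run the pathfinding job loses to animation in frames 0 and 1, then wins frame 2 once its two missed launches have brought its score level with a fresh Immediate job. With static priorities alone it would never run. The other features listed above (number of dependent jobs, job length, critical path) could be folded into <code>score</code> in the same way.</p>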
<h2 class="sectionHeading">ARCHITECTURE-SPECIFIC PROBLEMS<br /></h2>
<p>Architectures can vary among computers. The architecture type, the number of CPUs, the number of cores inside each CPU—everything can affect the low-level behavior.</p>
<p>Typical architectures are symmetric multiprocessor (SMP) and nonuniform memory access (NUMA). In SMP, each CPU has equal access to memory. The memory bus is then most likely the bottleneck. For NUMA, each CPU has its own memory attached. Therefore, it accesses its own memory quickly with its own bus, but accessing other CPUs’ memory can be slow, and latency can be high. Each operating system has its own application programming interface (API) to query system setup. Nothing specific can be done for SMP architectures, but things are different in a NUMA architecture: The scheduler must try to assign jobs on processors that have the data in their local memory. In Windows* operating systems, you can query information about your system using functions such as QueryWorkingSetEx to discover which NUMA node memory is allocated to. Figure 1 shows a four-processor NUMA and SMP architecture.</p>
<p ><a href="http://software.intel.com/file/37856"><img  src="http://software.intel.com/file/37856" /></a><br /><b>Figure 1.</b> <i>Typical topology of NUMA and SMP architecture</i></p>
<b>Cores of a Single CPU</b><br />
<p>When a CPU is multicore, each core uses a unique arithmetic logic unit (ALU) and Single Instruction, Multiple Data (SIMD) unit. If two computation-expensive jobs are scheduled on each core of a single CPU, the performance will degrade, so the scheduler should avoid scheduling such jobs on the same CPU. Users can tag jobs to be computation expensive, but those jobs are usually easy to spot. Animations or physics jobs are good examples of computation-expensive jobs.</p>
<b>Improve Code Cache Usage</b><br />
<p>Code is just data, and as such must be cached by the CPU. For better performance, the scheduler must schedule jobs using the same code on the same processor.</p>
<p> </p>
<h2 class="sectionHeading">BALANCING THE SYSTEM: ADAPTING TO VARIABLE CPU COUNTS<br /></h2>
<p>If your game targets a wide range of machines, prepare it for multiple resolutions. For example, particle systems can be up-scaled, spawning more particles. Pathfinding can output a smoother path. Animation can use more bones for skeletons. Your game can increase its quality, using all the extra power it can. However, it should also be able to work on low-end machines. Players are used to tweaking the 3D settings—resolution, texture size, shadow definition, and so on—but it would be awkward for a player to tweak the artificial intelligence (AI) update rate or the pathfinder precision. The game must be able to adapt itself to the available power. But how do you prevent the game from running out of resources? What do you sacrifice first?</p>
<b>The AI System</b><br />
<p>Think of the AI system in three parts: seeing, thinking, and acting.</p>
<p>For an AI-driven character, seeing is acquiring all data needed to make a decision, thinking is combining all this data and making decisions, and acting is applying the decision. These three parts might not be updated at the same frequency. The acting layer is frame-required: The character must do something every frame, the new animation pose must be computed, and so on. But for the other two layers, the requirement is lower. There’s no need to update the sensory data each frame, nor does the current decision need to be challenged with the current data. But how do you determine the system’s update frequency?</p>
<p>One solution is to let the scheduler decide. The sensory system creates a first job with a low priority. If processing power is available, the scheduler will proceed with the job in the current frame. If not, the priority will be increased over time, and the job will finally be executed some frames later. The trick is to schedule the next update within the current job (with some delay to prevent double-updating in the same frame). Using this technique, the system is updated as often as possible but no more. If the update rate is too low, the system can use extrapolation, as is done in network play.</p>
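<p>The priority increase over time can be sketched as simple aging. The example below is an assumption about how such a queue might look: <code>PendingJob</code>, <code>EffectivePriority</code>, <code>TakeNext</code>, and the per-frame boost are hypothetical names and values, and the queue runs one job per call to keep the example short.</p>

```cpp
#include <cstddef>
#include <vector>

struct PendingJob {
    int basePriority;   // priority given when the job was created
    int framesWaiting;  // number of frames the job has been passed over
};

// The longer a job waits, the higher its effective priority becomes.
inline int EffectivePriority(const PendingJob& job, int boostPerFrame) {
    return job.basePriority + job.framesWaiting * boostPerFrame;
}

// Removes and returns the job with the highest effective priority, then ages
// every job left in the queue by one frame. The queue must not be empty.
inline PendingJob TakeNext(std::vector<PendingJob>& queue, int boostPerFrame) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < queue.size(); ++i) {
        if (EffectivePriority(queue[i], boostPerFrame) >
            EffectivePriority(queue[best], boostPerFrame)) {
            best = i;
        }
    }
    PendingJob job = queue[best];
    queue.erase(queue.begin() + static_cast<std::ptrdiff_t>(best));
    for (PendingJob& other : queue) {
        ++other.framesWaiting;
    }
    return job;
}
```

<p>With a boost of 3 per frame, a priority-1 sensory job competing against a fresh priority-10 job every frame gets to run on the fourth frame, which bounds how stale the sensory data can become.</p>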
<p>To implement such a system, two techniques are available:</p>
<ul >
<li><b>Blackboards.</b> Variables such as current decision and first visible enemy are continuous values. They could be updated at each frame. A blackboard allows you to keep the latest value of such variables. When an update job is executed, it writes its result to the blackboard but not instantaneously. The blackboard accumulates all changes during the frame and applies them at the end. The other jobs then never read inconsistent values, as variables don’t change during a frame.</li>
<li><b>Future objects.</b> Requests such as pathfinding are less frequent and usually take more time to process. The use of future objects can help in such cases. The future object allows you to send a request and wait for the result. If the result is not yet available, the code falls back to another behavior. Future objects can contain intermediate results, such as first path segments.</li>
</ul>
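<p>A blackboard with deferred writes can be sketched in a few lines. This is a simplified, single-threaded illustration with hypothetical names; a real implementation would need to protect <code>Write</code> with a lock or use per-thread staging buffers, and would likely store more than integers.</p>

```cpp
#include <string>
#include <unordered_map>

class Blackboard {
public:
    // Jobs call Write during the frame; the value is only staged.
    void Write(const std::string& key, int value) { pending_[key] = value; }

    // Readers see the values committed at the end of an earlier frame.
    // Returns fallback when the key has never been committed.
    int Read(const std::string& key, int fallback) const {
        auto it = committed_.find(key);
        return it == committed_.end() ? fallback : it->second;
    }

    // Called once per frame, after all jobs have finished.
    void EndFrame() {
        for (const auto& entry : pending_) {
            committed_[entry.first] = entry.second;
        }
        pending_.clear();
    }

private:
    std::unordered_map<std::string, int> committed_;
    std::unordered_map<std::string, int> pending_;
};
```

<p>Because reads never touch the staged values, a job that runs late in the frame sees exactly the same data as a job that ran early.</p>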
<b>The Component System</b><br />
<p>The main advantage of the component system is its modularity. Each entity is made of components, each with its own data and executing its own task. Cutting the entities’ update code into pieces makes it easier to call them at different frequencies. If your component system is data-oriented, different tables store each type of component, which are then updated in a row. For a given component type, only half of the table can be updated each frame, thus dividing the frequency by two.</p>
<p>Take particular care with the time step: It should be accumulated over the skipped frames. To compute this step, the system keeps the last time steps, and each component stores the frame index at which it was last updated. Using the current frame index, the system can add up the time steps from the skipped frames.</p>
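<p>The accumulation can be sketched as follows. <code>FrameClock</code> and its methods are hypothetical names, and this version keeps the whole history for simplicity where a real system would keep only a short window of recent time steps.</p>

```cpp
#include <cstddef>
#include <vector>

class FrameClock {
public:
    // Records the time step of a new frame and returns that frame's index.
    std::size_t PushFrame(double dtSeconds) {
        steps_.push_back(dtSeconds);
        return steps_.size() - 1;
    }

    // Sums the time steps of the frames in (lastUpdated, current], that is,
    // every frame that elapsed since the component was last updated.
    double ElapsedSince(std::size_t lastUpdated, std::size_t current) const {
        double total = 0.0;
        for (std::size_t i = lastUpdated + 1; i <= current && i < steps_.size(); ++i) {
            total += steps_[i];
        }
        return total;
    }

private:
    std::vector<double> steps_;
};
```

<p>A component last updated at frame 0 and updated again at frame 2 receives the sum of the time steps of frames 1 and 2.</p>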
<b>The Sound System</b><br />
<p>For the sound system, the story is completely different. The update is strict: The game must send its audio data on time or a glitch will be heard. You cannot balance by changing the update frequency. Instead, the idea here is to limit the number of sounds that can be played simultaneously. It’s easy for the scheduler to know what percentage of CPU the sound jobs are using, so the user can decide to limit this percentage to a fixed value. The sound system can then adapt its simultaneous play count using this value. If you use Digital Signal Processing (DSP) effects, the system can limit its simultaneous instances.</p>
<b>The Physics System</b><br />
<p>As most physics systems’ algorithms are based on an integration technique, the time step must be sufficiently small and constant. Some systems require you to break the time step into smaller pieces, and the update runs several times a frame. Users can't decrease the update rate, so the only solution is to decrease the need for CPU time. Several solutions exist:</p>
<ul >
<li>If your gameplay does not depend on the physics system’s behavior, you can change your collision resolution algorithm. A discrete collision algorithm is less precise than continuous collision detection, but it is also less expensive. Changing this algorithm at runtime can be tricky depending on your framework. Generally, this is a start time decision.</li>
<li>Lower the distance at which objects are removed from the physical world.</li>
<li>Lower the distance at which objects stop interacting with other objects and interact only with static geometry. Most physics systems assign two 32-bit integers to each object, and each bit is treated as a layer. The first integer, the collision group, defines which layers the object belongs to; the second, the collision mask, defines which layers the object can collide with. Use this system to limit collisions between objects when they are far from the camera. The test is fast and easy to set up, and you can update the bits depending on the object’s distance from the camera.</li>
</ul>
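<p>The group and mask test from the last item reduces to two bitwise ANDs. The sketch below uses hypothetical layer constants and function names; physics engines expose the same idea under their own names.</p>

```cpp
#include <cstdint>

struct CollisionFilter {
    std::uint32_t group;  // layers this object belongs to
    std::uint32_t mask;   // layers this object is allowed to collide with
};

// Hypothetical layer assignment for this example.
constexpr std::uint32_t kLayerStatic  = 1u << 0;
constexpr std::uint32_t kLayerDynamic = 1u << 1;

// Two objects collide only if each one accepts the other's group.
inline bool ShouldCollide(const CollisionFilter& a, const CollisionFilter& b) {
    return (a.group & b.mask) != 0 && (b.group & a.mask) != 0;
}

// Beyond the cutoff distance, an object keeps colliding with static
// geometry only; closer than that, it also collides with dynamic objects.
inline void ApplyDistanceLod(CollisionFilter& filter, float distance, float cutoff) {
    filter.mask = (distance > cutoff) ? kLayerStatic
                                      : (kLayerStatic | kLayerDynamic);
}
```

<p>Calling <code>ApplyDistanceLod</code> when an object crosses the cutoff distance is enough to remove it from object-to-object collision tests while keeping it from falling through the floor.</p>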
<p>Because the first technique is only available at start time, use a heuristic to determine whether to activate it. The other solutions can be enabled at runtime.</p>
<b>The Streaming System</b><br />
<p>The streaming system is responsible for loading data from disk. If the data is compressed, the system will consume some CPU cycles, so you must pay attention to the impact this system can have on the frame rate. A good solution is to use a low priority for the streaming thread. But as with any priority-based scheduling, starvation may occur.</p>
<b>Preparing for the Unexpected</b><br />
<p>A huge difference between game consoles and computers is the existence of foreign processes. When users launch a game, they rarely close all running applications or shut down the operating system’s services. Some processes might wake up and perform some background work. When load balancing your game, you should always keep a safety margin in your timing measurements. On Windows* machines, GetProcessTimes and GetSystemTimes return the times (idle, user, and kernel) of the game process and the whole system. The application can sample those values and extract the time spent in other processes. The mean value of the recent peaks is used as the safety margin. On a six-processor machine with a game running at 60 FPS, the total available CPU time during a frame is 6 CPUs × 1/60 s = 100 ms. If other processes have peaks around 20 ms per frame, the game should limit itself to 80 ms, even if in 90 percent of frames the other processes use less than 5 ms. As explained later, the load balancer integrates this constraint in its design.</p>
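<p>The budget arithmetic is easy to express in code. The helper names below are hypothetical, and the foreign-process peak is passed in as a parameter rather than measured, so the sketch stays platform independent.</p>

```cpp
// Total CPU time available during one frame, summed over all CPUs, in ms.
inline double FrameCpuBudgetMs(int cpuCount, double framesPerSecond) {
    return cpuCount * 1000.0 / framesPerSecond;
}

// CPU time the game should limit itself to once the recent peak of other
// processes (in ms per frame) is set aside as a safety margin.
inline double GameCpuBudgetMs(int cpuCount, double framesPerSecond,
                              double foreignPeakMs) {
    const double budget = FrameCpuBudgetMs(cpuCount, framesPerSecond) - foreignPeakMs;
    return budget > 0.0 ? budget : 0.0;
}
```

<p>For six CPUs at 60 FPS with a 20 ms foreign peak, the first function gives 100 ms and the second gives 80 ms, matching the figures in the text.</p>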
<p> </p>
<h2 class="sectionHeading">THE LOAD BALANCER<br /></h2>
<p>You’ve seen different techniques for lowering the required CPU power. But when should the game enable them and in which order? The load balancer is responsible for activating each system in turn. The idea is to degrade or progressively improve each system in turn. Unfortunately, there is no unique solution: The answer really depends on the importance of each system in your game experience. A mood game will surely put more importance on music than a first-person shooter will. So instead of explaining a finished and tuned system, let’s explore the concept and how to tune your game to your needs.</p>
<p>The main load balancing criterion is the frame length (equivalently, the frame rate). The goal is to keep the highest constant frame rate. Even if the game can run at 40 FPS 90 percent of the time and 30 FPS for the last 10 percent, it might be more interesting to keep the game at 30 FPS. Changes in frame rate are more noticeable than a slightly lower frame rate. And keeping a slightly lower frame rate frees some CPU time for streaming and the unexpected.</p>
<p>At the end of each frame, the load balancer receives information about the last frame length and job lengths, then compares this time with previous frame lengths. Table 1 contains the frame length for a given scenario. In this example, the load balancer requires that the update rate be around 50 Hz. The load balancer maintains the list of levels of detail (LODs) it can enable or disable (see Table 2). The time each LOD takes is computed at the start of the application and is used as an approximation of the time the tasks require to execute. Every time an LOD is enabled, its time is updated (using a moving mean). When the load balancer detects that the frame length has decreased several frames in a row, it inspects its lists and activates the next LODs. You provide this list, as it depends on what constitutes the more important details you want to keep. The load balancer will always activate those details in order. But when should the game start decreasing the frame rate?</p>
<table  border="1">
<caption><b>Table 1.</b> <i>Example of Load Balancer Behavior</i></caption>
<tbody>
<tr >
<th>Frame index</th><th>Frame length</th><th>Instantaneous frame rate</th><th>Time waiting</th><th>Load balancer order</th>
</tr>
<tr>
<td>1</td>
<td>20 ms</td>
<td>50 Hz</td>
<td>0 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>2</td>
<td>21 ms</td>
<td>47.6 Hz</td>
<td>0 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>3</td>
<td>19 ms</td>
<td>52.6 Hz</td>
<td>1 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>4</td>
<td>20 ms</td>
<td>50 Hz</td>
<td>0 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>5</td>
<td>17 ms</td>
<td>58.8 Hz</td>
<td>3 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>6</td>
<td>17 ms</td>
<td>58.8 Hz</td>
<td>3 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>7</td>
<td>14 ms</td>
<td>71.4 Hz</td>
<td>6 ms</td>
<td>Increase quality</td>
</tr>
<tr>
<td>8</td>
<td>18 ms</td>
<td>55.6 Hz</td>
<td>2 ms</td>
<td>Don't change</td>
</tr>
<tr>
<td>9</td>
<td>18 ms</td>
<td>55.6 Hz</td>
<td>2 ms</td>
<td>Increase quality</td>
</tr>
<tr>
<td>10</td>
<td>20 ms</td>
<td>50 Hz</td>
<td>0 ms</td>
<td>Don't change</td>
</tr>
</tbody>
</table>
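<p>The derived columns of Table 1 follow directly from the frame length and the 20 ms target (50 Hz). The sketch below computes them and adds one possible decision rule. The rule is an assumption made for illustration: it reacts after a fixed number of consecutive fast or slow frames and does not attempt to reproduce every order shown in the table. All names are hypothetical.</p>

```cpp
enum class Order { DontChange, IncreaseQuality, DecreaseQuality };

// Instantaneous frame rate in Hz for a frame length in milliseconds.
inline double InstantFrameRateHz(double frameLengthMs) {
    return 1000.0 / frameLengthMs;
}

// Time spent waiting for the end of the frame when it finishes early.
inline double TimeWaitingMs(double frameLengthMs, double targetMs) {
    return frameLengthMs < targetMs ? targetMs - frameLengthMs : 0.0;
}

class FrameMonitor {
public:
    FrameMonitor(double targetMs, double marginMs, int framesRequired)
        : targetMs_(targetMs), marginMs_(marginMs), framesRequired_(framesRequired) {}

    // Call at the end of each frame with the measured frame length.
    Order OnFrameEnd(double frameLengthMs) {
        const double spare = targetMs_ - frameLengthMs;
        if (spare >= marginMs_)       { ++fastStreak_; slowStreak_ = 0; }
        else if (spare <= -marginMs_) { ++slowStreak_; fastStreak_ = 0; }
        else                          { fastStreak_ = 0; slowStreak_ = 0; }

        if (fastStreak_ >= framesRequired_) { fastStreak_ = 0; return Order::IncreaseQuality; }
        if (slowStreak_ >= framesRequired_) { slowStreak_ = 0; return Order::DecreaseQuality; }
        return Order::DontChange;
    }

private:
    double targetMs_;
    double marginMs_;
    int framesRequired_;
    int fastStreak_ = 0;
    int slowStreak_ = 0;
};
```

<p>Requiring several consecutive frames before acting keeps a single short or long frame from toggling a level of detail, which would itself cause visible frame rate changes.</p>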
<p>The load balancer can access other information in the list, as well: the minimum frame rate required to disable a feature. In this example, the developer decided that it’s more important to have precise pathfinding than to have a game at 50 Hz. But if the frame rate were to decrease below 30 FPS, the feature could be disabled. At startup, the load balancer targets a high frame rate but runs at the lowest LOD. The system adds a new LOD each frame until the system finds its equilibrium, preventing the system from stalling the first few frames.</p>
<table  border="1">
<caption><b>Table 2.</b> <i>Example of an LOD List</i></caption> 
<tbody>
<tr >
<th>LOD Description</th><th>Estimated time if activated</th><th>Frame rate required</th>
</tr>
<tr>
<td>Divide sensory update rate by 2</td>
<td>1 ms</td>
<td>50 FPS</td>
</tr>
<tr>
<td>Divide face emotion component update rate by 2</td>
<td>1 ms</td>
<td>50 FPS</td>
</tr>
<tr>
<td>Decrease available sound channels by 5</td>
<td>0.5 ms</td>
<td>50 FPS</td>
</tr>
</tbody>
</table>
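<p>A list such as Table 2 maps naturally onto a small array of records. The sketch below is one possible layout, with hypothetical names. The moving mean is implemented here as an exponential moving mean, which is a choice made for this example and not necessarily the one used in the original system.</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct LodEntry {
    std::string description;
    double estimatedMs;  // time estimate associated with this LOD
    double requiredFps;  // frame rate below which the LOD may be activated
    bool   active;
};

// Blends a new measurement into the estimate (exponential moving mean).
inline void UpdateEstimate(LodEntry& lod, double measuredMs, double alpha = 0.25) {
    lod.estimatedMs += alpha * (measuredMs - lod.estimatedMs);
}

// Activates, in list order, the first inactive LOD whose frame rate
// requirement is not met. Returns its index, or -1 if nothing changed.
inline int ActivateNext(std::vector<LodEntry>& lods, double currentFps) {
    for (std::size_t i = 0; i < lods.size(); ++i) {
        if (!lods[i].active && currentFps < lods[i].requiredFps) {
            lods[i].active = true;
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

<p>Because the list is walked in order, the order of the entries is the order of degradation, which is exactly the tuning knob the developer controls.</p>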
<p>This technique allows a precise definition of the degradation of the game. Once the system is tuned, it is fully automatic, maintaining the influence of each LOD on the frame length. The frame rate is also linked to LOD selection. Although the design and the code are simple, the results are good. To tweak the system, an edit-and-continue rule system can come in handy. The ability to dynamically simulate lower-end machines (by limiting the number of CPUs) is also important. With those tools in hand, the rule list can be created and tuned in hours. This system was used in an unreleased game and is currently used in a game that will be released in 2012.</p>
<h2 class="sectionHeading">CONCLUSION<br /></h2>
<p>The scheduler is a complex system. As the conductor, it should be adapted to the score your game is playing. There is no unique solution, but you should select the best algorithm based on the code and data statistics. The algorithm should also be able to adapt to different architecture types. The key to success is analyzing, adapting, and profiling. But a scheduler alone won’t solve the problem. The load balancer is there to finish the work. It serves as an LOD coordinator, linking definition and quality to frame rate. You then have everything you need to shape the best experience for a wide range of machines.</p>
<p> </p>
<h2 class="sectionHeading">About the Author<br /></h2>
<p>Julien Hamaide graduated as a multimedia electrical engineer at the Faculté Polytechnique de Mons (Belgium) at the age of 21. After two years of working on speech and image processing at TCTS/Multitel and three years leading a team on next-generation consoles at Elsewhere Entertainment, Julien started his own studio, called Fishing Cactus (www.fishingcactus.com). He has published several articles and spoken about such topics as 10Tacle's movement and interaction system and multithreading applied to AI. He now leads development at Fishing Cactus.</p>
<p> </p>
<h2 class="sectionHeading">RESOURCES<br /></h2>
<p>Amdahl, G. (1967). “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities.” AFIPS Conference Proceedings (30):483–485.</p>
<p>Alexandrescu, A. (2001). Modern C++ design: generic programming and design patterns applied.</p>
<p>Hamaide, J. (2008). “Multithread Job and Dependency System.” Game Programming Gems 7.</p>
<p>Leiserson, C. What the $#@! is Parallelism, Anyhow? http://software.intel.com/en-us/articles/what-the-is-parallelism-anyhow-1.</p>
<p>Leiserson, C. Are Determinacy-Race Bugs Lurking in YOUR Multicore Application? http://software.intel.com/en-us/articles/are-determinacy-race-bugs-lurking-in-your-multicore-application.</p>
<p>Multiple Processors. http://msdn.microsoft.com/en-us/library/ms684251.aspx.</p>
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Julien Hamaide</div>
</div>
<div id="vc-meta-pubdate">07-22-2011</div>
<div id="vc-meta-modificationdate">07-22-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Performance Analysis</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">This article details how to effectively dispatch units of work to the CPUs and adjust the game content to the available computing power. With increases in processing unit count, load balancing has become a challenge. While many resources discuss how to optimize your code and data to solve this glaring issue, this article delves into how to properly schedule and load balance your work.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/load-balancing-games/</link>
      <pubDate>Fri, 22 Jul 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/load-balancing-games/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/load-balancing-games/</guid>
      <category>Visual Computing</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>MLAA: Efficiently Moving Antialiasing from the GPU to the CPU</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article and Source Code<br /></h2>
Download <a href="http://software.intel.com/file/37411">MLAA: Efficiently Moving Antialiasing from the GPU to the CPU</a> (PDF 1.2MB)<br />Visit <a href="http://software.intel.com/en-us/articles/mlaa/">MLAA Sample Page</a> to download source.<br /><br /><br />
<h2 class="sectionHeading">Introduction</h2>
<p>Efficient antialiasing techniques are an important tool of high-quality, real-time rendering. MSAA (Multisample Antialiasing) is the standard technique in use today, but comes with some serious disadvantages:</p>
<ul>
<li>Incompatibility with deferred lighting, which is used more and more in real-time rendering;</li>
<li>High memory and processing cost, which makes its use prohibitive on some widely available platforms (such as the Sony Playstation* PS3* [Perthuis 2010]). This cost is also directly linked to the complexity of the scene rendered;</li>
<li>Inability to smooth non-geometric edges unless used in conjunction with alpha to coverage.</li>
</ul>
<p>A new technique developed by Intel Labs called Morphological Antialiasing (MLAA) [Reshetov 2009] addresses these limitations. MLAA is an image-based, post-process filtering technique which identifies discontinuity patterns and blends colors in the neighborhood of these patterns to perform effective antialiasing. It is the precursor of a new generation of real-time antialiasing techniques that rival MSAA [Jimenez et al., 2011] [SIGGRAPH 2011].</p>
<p>This sample is based on the original, CPU-based MLAA implementation provided by Reshetov, with improvements to greatly increase performance. The improvements are:</p>
<ul>
<li>Integration of a new, efficient, easy-to-use tasking system implemented on top of Intel® Threading Building Blocks (Intel® TBB).</li>
<li>Integration of a new, efficient, easy-to-use pipelining system for CPU onloading of graphics tasks.</li>
<li>Improvement of data access patterns through a new transposing pass.</li>
<li>Increased use of Intel® SSE instructions to optimize discontinuity detection and color blending.</li>
</ul>
<br />
<h2 class="sectionHeading">The MLAA algorithm</h2>
<p>In this section we present an overview of how the MLAA algorithm works; cf. [Reshetov 2009] for the fully detailed explanation. Conceptually, MLAA processes the image buffer in three steps:</p>
<ol>
<li>Find discontinuities between pixels in a given image.</li>
<li>Identify U-shaped, Z-shaped, and L-shaped patterns.</li>
<li>Blend colors in the neighborhood of these patterns.</li>
</ol>
<p>The first step (find discontinuities) is implemented by comparing each pixel to its neighbors. Horizontal discontinuities are identified by comparing a pixel to its bottom neighbor, and vertical discontinuities by comparing with the neighbor on the right. In our implementation we compare color values, but any other approach that suits the application’s specificities is perfectly valid.</p>
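<p>The first step can be sketched as follows. This is a simplified illustration and not the sample’s code: it works on a single-channel (grayscale) image instead of full color values, uses a plain scalar loop instead of Intel® SSE instructions, and the flag constants, function name, and threshold parameter are hypothetical.</p>

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr std::uint8_t kHorizontalFlag = 1;  // differs from the bottom neighbor
constexpr std::uint8_t kVerticalFlag   = 2;  // differs from the right neighbor

// pixels is a row-major grayscale image of width * height values.
// Returns one flag byte per pixel.
inline std::vector<std::uint8_t> FindDiscontinuities(
        const std::vector<std::uint8_t>& pixels, int width, int height,
        int threshold) {
    std::vector<std::uint8_t> flags(pixels.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int i = y * width + x;
            // Horizontal discontinuity: compare with the pixel below.
            if (y + 1 < height &&
                std::abs(pixels[i] - pixels[i + width]) > threshold) {
                flags[i] |= kHorizontalFlag;
            }
            // Vertical discontinuity: compare with the pixel to the right.
            if (x + 1 < width &&
                std::abs(pixels[i] - pixels[i + 1]) > threshold) {
                flags[i] |= kVerticalFlag;
            }
        }
    }
    return flags;
}
```

<p>Pixels on the last row and last column have no bottom or right neighbor, so the corresponding flag is never set for them.</p>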
<p>At the end of the first step, each pixel is marked with the horizontal discontinuity flag and/or the vertical discontinuity flag, if such discontinuities are detected. In the next step, we “walk” the marked pixels to identify discontinuity lines (sequences of consecutive pixels marked with the same discontinuity flag), and determine how they combine to form L-shaped patterns, as illustrated below:</p>
<p ><img  title="Figure 1" alt="Figure 1" src="http://software.intel.com/file/37399" /> <br /><b>Figure 1:</b><i> MLAA processing of an image, with Z, U, and L-shapes shown on the original image on the left</i></p>
<p>The third and final step is to perform blending for each of the identified L-shaped patterns.</p>
<p>The general idea is to connect the middle point of the primary segment of the L-shape (horizontal green line in the figure below) to the middle point of the secondary segment (vertical green line—the connection line is in red). The connection line splits each pixel into two trapezoids; for each pixel the area of the corresponding trapezoid determines the blending weights. For example, in the figure below, we see that the trapezoid’s area for pixel c5 is 1/3; so the new color of c5 will be computed as 2/3 * (color of c5) + 1/3 * (color of c5’s bottom neighbor).</p>
<p ><img  src="http://software.intel.com/file/37400" /> <br /><b>Figure 2:</b> <i>Computing blending weights</i></p>
<p>In practice, to ensure a smooth silhouette look, we need to minimize the color differences at the stitching positions of consecutive L-shapes. To achieve this, we slightly adjust the connection points on the L-shape segments around the middle position based on the colors of the pixels around the stitching point.</p>
<p> </p>
<h2 class="sectionHeading">Sample usage</h2>
<p>The camera can be moved around the scene using click-and-drag, and the mouse wheel can be used to zoom in or out. In addition, there are three blocks of controls on the right side of the sample’s window:</p>
<p ><img  src="http://software.intel.com/file/37401" /> <br /><b>Figure 3:</b><i> A screenshot of the sample in action</i></p>
<p>The first block controls the rendering of the scene:</p>
<ul>
<li><i>Pause Scene</i> toggles the scene’s animations on/off;</li>
<li><i>Show zoom box</i> toggles the <i>zoom box</i> feature on/off, which allows the user to take a closer look at a pixel area to better compare the antialiasing techniques. The area of interest can be changed by right-clicking to the new area to examine;</li>
<li><i>Scene complexity</i> simulates the effect of increasing scene complexity by using overdraw. The value (between 1 and 100, adjusted by the slider) indicates how many times the scene is rendered per frame (with overdraw).</li>
</ul>
<p>The second block selects which antialiasing technique to apply: MLAA, MSAA (4x), or no antialiasing. This allows comparison of the techniques’ performance and quality (the default choice is “MLAA”).</p>
<p>The last block of controls is only available if the antialiasing technique used is MLAA, and controls how the algorithm should be run:</p>
<ul>
<li><i>Copy/Map/Unmap only</i> copies the color buffer from GPU memory to CPU memory and back, but won’t perform any MLAA processing between the two copy operations. This allows measurement of the impact of the copy operations on the overall performance of the entire algorithm;</li>
<li><i>CPU tasks pipelining</i> turns the pipelining system on/off for CPU onloading of graphics tasks (the default is on) so that it is easy to see the benefit of pipelining;</li>
<li><i>Show found edges</i> runs the first part of the algorithm (find discontinuities between pixels), but the blending passes are replaced by a debug pass, where a pixel is:    
<ul >
<li>changed to solid green if a horizontal discontinuity has been found with its neighbor;</li>
<li>changed to solid blue if a vertical discontinuity has been found with its neighbor;</li>
<li>changed to solid red if both horizontal and vertical discontinuities have been found with its neighbors;</li>
<li>unchanged if no discontinuities have been found.</li>
</ul>
</li>
</ul>
<p ><img  src="http://software.intel.com/file/37402" /> <br /><b>Figure 4:</b><i> Gear tower scene with MLAA and the “show found edges” option enabled</i></p>
<h2 class="sectionHeading">Sample Architecture</h2>
<p>Without the pipelining optimizations, the sequence of events for each frame is:</p>
<p> </p>
<ol >
<li >
<div ><b>ANIMATE AND RENDER TEST SCENE</b></div>
</li>
<li >
<div ><b>MLAA STAGE (if MLAA is enabled)</b> 
<ul >
<li>
<div >Copy the color buffer where the scene was rendered to a staging buffer</div>
</li>
<li>
<div >Map the staging buffer for CPU-side access.</div>
</li>
<li>
<div >MLAA post-processing (the staging buffer is both input and output)</div>
</li>
<li>
<div >Unmap the staging buffer</div>
</li>
<li>
<div >Copy staging buffer back to the GPU-side color buffer</div>
</li>
<li>
<div >Render the zoombox (if applicable)</div>
</li>
</ul>
</div>
</li>
<li >
<div ><b>RENDER THE SAMPLE’S UI, PRESENT FRAME</b></div>
</li>
</ol>
<p>Except for the “MLAA post-processing” step (and again not considering the pipeline for now), all of these steps are implemented using standard Microsoft DirectX* methods.</p>
<p> </p>
<h2 class="sectionHeading">The Tasking API</h2>
<p>By nature the MLAA algorithm is easy to parallelize. For both the discontinuity detection and blending passes, we can process the color buffer in independent chunks (blocks of contiguous rows or columns). MLAA can be implemented using a task-based solution that automatically makes full use of all available CPU cores, while keeping the code core-count agnostic.</p>
<p>This sample uses a simple C-based tasking API that is implemented on top of the Intel® Threading Building Blocks (Intel® TBB) scheduler. This wrapper API was created to simplify the integration of the technique into existing codebases which already expose a similar tasking API (e.g. cross-platform game engines). An added benefit is the increased readability of the main source file <i>MLAA.cpp</i>.</p>
<p>The two important functions of the wrapper API are:</p>
<pre name="code" class="cpp">    BOOL CreateTaskSet(
        TASKSETFUNC                 pFunc,      //  Function pointer to the 
                                                //  Taskset callback function
        VOID*                       pArg,       //  App data pointer (can be NULL)
        UINT                        uTaskCount, //  Number of tasks to create 
        TASKSETHANDLE*              pDepends,   //  Array of TASKSETHANDLEs that 
                                                //  this taskset depends on.  The 
                                                //  taskset will not be scheduled
                                                //  until all tasksets in this list
                                                //  complete.
        UINT                        uDepends,   //  Count of the depends list
        OPTIONAL LPCSTR             szSetName,  //  [Optional] name of the taskset
                                                //  the name is used for profiling
        OUT TASKSETHANDLE*          pOutHandle  //  [Out] Handle to the new taskset
        );
</pre>
<p>And:</p>
<pre name="code" class="cpp">    //  WaitForSet will yield the main thread to the tasking system and return
    //  only when the taskset specified has completed execution.
    VOID WaitForSet(
        TASKSETHANDLE               hSet        // Taskset to wait for completion
        );
</pre>
<p>The callback function has the following signature:</p>
<pre name="code" class="cpp">typedef VOID (*TASKSETFUNC)(VOID* TaskInfo, INT iContext, UINT uTaskId, UINT uTaskCount);</pre>
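<p>To illustrate how a callback can use <code>uTaskId</code> and <code>uTaskCount</code> to pick its own chunk of rows, here is a minimal sketch. It is not taken from the sample: the callback uses plain C++ types in place of the Windows typedefs so that it compiles anywhere, and <code>RowRange</code>, <code>RowsForTask</code>, <code>ImageInfo</code>, and <code>InvertRowsTask</code> are hypothetical names. The per-row work (inverting a grayscale image) is a placeholder for a real MLAA pass.</p>

```cpp
#include <algorithm>

struct RowRange {
    unsigned begin;  // first row handled by the task
    unsigned end;    // one past the last row handled by the task
};

// Splits rowCount rows into taskCount contiguous blocks of nearly equal size.
// Tasks past the end of the image receive an empty range.
inline RowRange RowsForTask(unsigned rowCount, unsigned taskId, unsigned taskCount) {
    const unsigned rowsPerTask = (rowCount + taskCount - 1) / taskCount;  // round up
    const unsigned begin = std::min(rowCount, taskId * rowsPerTask);
    const unsigned end   = std::min(rowCount, begin + rowsPerTask);
    return RowRange{ begin, end };
}

// Hypothetical application data passed through the pArg pointer.
struct ImageInfo {
    unsigned char* pixels;
    unsigned width;
    unsigned height;
};

// Same parameter list as TASKSETFUNC, written with plain C++ types.
inline void InvertRowsTask(void* taskInfo, int /*context*/, unsigned taskId,
                           unsigned taskCount) {
    ImageInfo* image = static_cast<ImageInfo*>(taskInfo);
    const RowRange rows = RowsForTask(image->height, taskId, taskCount);
    for (unsigned y = rows.begin; y < rows.end; ++y) {
        for (unsigned x = 0; x < image->width; ++x) {
            unsigned char& p = image->pixels[y * image->width + x];
            p = static_cast<unsigned char>(255 - p);
        }
    }
}
```

<p>Because each task derives its range from its own index, no shared state is needed between tasks, and the same code works for any task count.</p>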
<p> </p>
<p>MLAA requires a dependency graph of three consecutive tasksets:</p>
<ul>
<li>The first taskset finds the discontinuities between pixels in the color buffer;</li>
<li>The second taskset performs the horizontal blending pass, and depends on the completion of the first taskset for the discontinuities information;</li>
<li>The third taskset performs the vertical blending pass, which depends on the completion of the first taskset for the discontinuities information, but also on the completion of the second taskset because of the transpose optimization (cf. corresponding section for details).</li>
</ul>
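<p>The shape of this dependency graph can be illustrated with a tiny serial stand-in for a tasking system. The class below is purely hypothetical and is not the sample’s tasking API: it runs everything on the calling thread and only demonstrates that a taskset runs after the tasksets it depends on.</p>

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class SerialTaskGraph {
public:
    using Handle = std::size_t;

    // Registers a taskset and the handles of the tasksets it depends on.
    Handle Add(std::string name, std::function<void()> work,
               std::vector<Handle> depends) {
        sets_.push_back(Set{ std::move(name), std::move(work),
                             std::move(depends), false });
        return sets_.size() - 1;
    }

    // Runs every taskset once all of its dependencies have completed and
    // returns the names in execution order.
    std::vector<std::string> RunAll() {
        std::vector<std::string> order;
        bool progressed = true;
        while (progressed) {
            progressed = false;
            for (Set& set : sets_) {
                if (set.done || !Ready(set)) continue;
                set.work();
                set.done = true;
                order.push_back(set.name);
                progressed = true;
            }
        }
        return order;
    }

private:
    struct Set {
        std::string name;
        std::function<void()> work;
        std::vector<Handle> depends;
        bool done;
    };

    bool Ready(const Set& set) const {
        for (Handle h : set.depends) {
            if (!sets_[h].done) return false;
        }
        return true;
    }

    std::vector<Set> sets_;
};
```

<p>For MLAA, the first taskset would be added with no dependencies, the horizontal blend with the first handle, and the vertical blend with both earlier handles.</p>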
<p>This dependency graph is expressed in <i>MLAA.cpp</i> as a sequence of three calls to <code>CreateTaskSet</code>; the taskset callback functions (implemented in <i>MLAAPostProcess.cpp</i>) are, respectively, <code>MLAAFindDiscontinuitiesTask</code>, <code>MLAABlendHTask</code>, and <code>MLAABlendVTask</code>.</p>
<p>If pipelining is not enabled, we wait for the last taskset to complete by calling <code>WaitForSet</code> with the handle of the last taskset. When the call returns, the MLAA work for the frame is complete. Things are just a little bit more complicated when using the pipeline.</p>
<p> </p>
<h2 class="sectionHeading">The CPU/GPU pipeline</h2>
<p>To get the maximum performance from our implementation, we have to keep both CPU and GPU sides fully utilized. Due to the data dependencies between the tasksets, full utilization can be achieved by interleaving the processing of multiple frames, as shown in the diagram below:</p>
<p ><img  src="http://software.intel.com/file/37410" /><br /><b>Figure 5:</b><i> Frames moving through the pipeline. The red blocks on the CPU main thread timeline illustrate that the main thread will take MLAA work if the application gets CPU-bound.</i></p>
<p>In other words, we have to build a pipelining system, where a pipeline is a sequence of CPU stages (workloads) and GPU stages, and the system is able to run multiple instances of that pipeline at the same time.</p>
<p>In our case, each instance of the pipeline corresponds to the processing of one frame. We have three steps:</p>
<p>Step 1 (GPU stage):</p>
<ul>
<li>Animate and render test scene;</li>
<li>Copy color buffer to staging buffer (using asynchronous GPU-side CopyResource).</li>
</ul>
<p>Step 2 (CPU stage):</p>
<ul>
<li>Map the staging buffer;</li>
<li>Perform the MLAA post-processing work (staging buffer is both input and output).</li>
</ul>
<p>Step 3 (GPU stage):</p>
<ul>
<li>Unmap staging buffer and copy it back to the GPU-side color buffer;</li>
<li>Finish frame rendering (zoombox, UI) and present.</li>
</ul>
<p>To implement this concept, we designed a simple Pipeline class. Each stage is represented by a <code>PipelineFunction</code> structure that specifies the function to be called, and the stage type. The callback function must have the following signature:</p>
<pre name="code" class="cpp">typedef void (*PPIPELINEFUNC)( UINT uInstance, ID3D11DeviceContext* pContext );</pre>
<p> </p>
<p>Where <code>uInstance</code> indicates which instance of the pipeline the function is being called from. A GPU stage uses a DirectX* query (of type <code>D3D11_QUERY_EVENT</code>) to signal its completion, but a CPU stage has to explicitly call the <code>Pipeline</code> class’s <code>CompleteCPUWait</code> method to signal completion:</p>
<pre name="code" class="cpp">    //  CompleteCPUWait signals an instance that expects a CPU wait in order
    //  to complete. When the Pipeline reaches that instance, it will be complete
    //  for that stage in the pipeline.
    VOID
    CompleteCPUWait(
        UINT                    uInstance );
</pre>
<p>Creating the pipeline instances requires a single call to the Init method. In our case, the code is:</p>
<pre name="code" class="cpp">	PipelineFunction Func[ 2 ];
	Func[0].Type = PIPELINE_EVENT_COMPLETE;
	Func[0].pFunc = AnimateAndRenderScene;
	Func[1].Type = PIPELINE_CPU_COMPLETE;
	Func[1].pFunc = MLAAPostProcessing;

	g_Pipeline.Init(g_NumPipelineInstances, ARRAYSIZE( Func ), Func, g_pPipelineQueries );
</pre>
<p>Where g_NumPipelineInstances is the number of pipeline instances to create (3 in this case).</p>
<p>To run the pipelines, we call the <code>Start</code> method each frame:</p>
<pre name="code" class="cpp">    //  Start is used to execute the Pipeline.  Processing will continue until
    //  an instance in the pipeline completes.  The index of that instance is
    //  returned.  PIPELINE_INVALID is returned if the Pipeline was set to flush
    //  and there are no more instances left in the Pipeline.
    UINT
    Start(
        ID3D11DeviceContext*    pContext,       //  Context to issue Event query on.
        
        BOOL                    bFlush = FALSE  //  If bFlush is TRUE, the Pipeline
                                                //  will not create new instances.
                                                //  Specify TRUE if you want to flush
                                                //  the Pipeline.
        );
</pre>
<p>Because the <code>Start</code> method returns the index of a completed instance of the pipeline, the third and last step doesn’t have to be called through the <code>Pipeline</code> class. The code of the last step executes right after the call to the <code>Start</code> method, indexing the data structures with the index returned by <code>Start</code>.</p>
<p>Pipelining cannot rely on the tasking API’s <code>WaitForSet</code> call: it blocks the calling thread, which would prevent frames from overlapping. The solution is to use a <i>completion taskset</i>, that is, a task that depends on the completion of all the MLAA tasks, and whose only work is to call the <code>CompleteCPUWait</code> method.</p>
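<p>The idea of a non-blocking completion signal can be sketched with a plain atomic counter. This is only an illustration under stated assumptions: <code>CompletionCounter</code> and <code>RunCompletionDemo</code> are hypothetical names, the sample's actual tasking API is not shown in this article, and the tasks are invoked sequentially here instead of on worker threads.</p>

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a "completion taskset": every MLAA task calls
// TaskDone() when it finishes, and the task that finishes last fires the
// callback. Nobody blocks while waiting for the set to complete.
class CompletionCounter {
public:
    CompletionCounter(unsigned taskCount, std::function<void()> onAllDone)
        : remaining_(taskCount), onAllDone_(std::move(onAllDone)) {
        assert(taskCount > 0);
    }

    void TaskDone() {
        // fetch_sub returns the previous value, so 1 means "this was the last task".
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // In the sample, this is where CompleteCPUWait(uInstance) would be called.
            onAllDone_();
        }
    }

private:
    std::atomic<unsigned> remaining_;
    std::function<void()> onAllDone_;
};

// Finishes `taskCount` dummy tasks and returns how many times the completion
// callback fired.
inline int RunCompletionDemo(unsigned taskCount) {
    int completions = 0;
    CompletionCounter counter(taskCount, [&completions] { ++completions; });
    for (unsigned i = 0; i < taskCount; ++i) {
        counter.TaskDone();  // assumption: in real code this runs on a worker thread
    }
    return completions;
}
```

<p>Because the callback runs on whichever task finishes last, the main thread stays free to submit GPU work for the next frame.</p>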
<p> </p>
<h2 class="sectionHeading">Intel® SSE optimizations</h2>
<p>The first pass of the MLAA algorithm identifies discontinuities between pixels. Conceptually, each pixel checks its color and compares it with its bottom neighbor (when looking for horizontal discontinuities) or right neighbor (when looking for vertical discontinuities) [Reshetov09, section 2.2.1].</p>
<p>In this sample, we have kept the simple discontinuity detection kernel of the reference implementation. A discontinuity exists between two pixels if at least one of their RGB color components differs by at least 16 (on the 0-255 scale of the RGBA8 format).</p>
<p>This definition works well in the sample and allows a very compact and efficient SIMD implementation. However, more complex approaches are possible, and sometimes necessary, to get the best possible results. For example, a pixel’s luminance could be used instead of directly comparing color values, and a variable threshold could be recomputed each frame to take into account the scene’s overall luminance and contrast [Luminance]. Depth values could also assist with edge detection, and any kind of custom data could be used to exclude specific zones of the color buffer from processing.</p>
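<p>For reference, the simple per-channel test that the sample keeps can be written in scalar form as follows. This is a minimal sketch, not the sample's code: <code>HasDiscontinuity</code> is a hypothetical name, and the pixel is assumed to keep its alpha component in the top 8 bits.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Returns true if at least one of the three color channels of the two RGBA8
// pixels differs by `threshold` or more. The top byte (alpha) is ignored.
inline bool HasDiscontinuity(std::uint32_t a, std::uint32_t b, int threshold = 16) {
    assert(threshold > 0);
    for (int shift = 0; shift < 24; shift += 8) {
        const int ca = static_cast<int>((a >> shift) & 0xFF);
        const int cb = static_cast<int>((b >> shift) & 0xFF);
        if (std::abs(ca - cb) >= threshold) {
            return true;
        }
    }
    return false;
}
```

<p>With the default threshold, a difference of 15 in one channel is not an edge, while a difference of 16 is.</p>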
<p>Because this step is independent from the rest of the algorithm, and we are running on the CPU, any data from the program can be used directly as input data. The only limits to the detection algorithm are:</p>
<ul>
<li>The fact that the algorithm must output a bit for each pixel indicating whether a discontinuity is detected;</li>
<li>The tradeoff between performance and quality/complexity;</li>
<li>The programmer’s imagination.</li>
</ul>
<p>As with the original reference implementation [Reshetov09, section 2.4], we work with an RGBA8 render target format, and the “vertical discontinuity” and “horizontal discontinuity” bit flags computed by MLAA are stored in-place in the two high bits of the pixel data (i.e., the two high bits of the alpha component, which is unaffected by the MLAA blending operations). This keeps the implementation simple while reducing the memory footprint and allowing optimizations in the next steps of the algorithm (see “The Transposing Optimization” section below).</p>
<p>Because each pixel is 32 bits of data, and each pixel can be processed independently of the others in this step, we can process 4 pixels at a time using Intel® SSE intrinsics. The detection code is very short.</p>
<pre name="code" class="cpp">	
	//---------------------------------------------------------------------------------------
	// Given a row of pixels, check for color discontinuities between a pixel and its
	// neighbor. Intel SSE implementation, so we process 4 pixels of the row at a time.
	// If a discontinuity is detected, the corresponding flag is set in the alpha component of the pixel.
	//---------------------------------------------------------------------------------------
	inline void ComparePixelsSSE(__m128i&amp; vPixels0, __m128i vPixels1, const INT EdgeFlag)
	{
	// Each byte of vDiff is the absolute difference between a pixel's color channel value, and the one from its neighbor.
		__m128i vDiff = _mm_sub_epi8(_mm_max_epu8(vPixels0, vPixels1), 
									 _mm_min_epu8(vPixels0, vPixels1));
	// We are only interested in differences of at least 16 (&gt;= 16), and not in alpha differences.
		const INT Threshold = 0x00F0F0F0;
		__m128i vThresholdMask = _mm_set1_epi32(Threshold);
		vDiff = _mm_and_si128(vDiff, vThresholdMask);
	// Each of the four lanes of the selector is 0xFFFFFFFF if no RGB discontinuity was detected, 0 otherwise.
		__m128i vSelector = _mm_cmpeq_epi32(vDiff, _mm_setzero_si128());
</pre>
<p>The reference implementation then proceeds to update the alpha component of each pixel, but uses inefficient serial code to do so. We can optimize this sequence by using the following Intel® SSE code:</p>
<pre name="code" class="cpp">	
	__m128i vEdgeFlag = _mm_set1_epi32(EdgeFlag);
	vEdgeFlag = _mm_andnot_si128(vSelector, vEdgeFlag);
	// vEdgeFlag now contains EdgeFlag for the pixels where a discontinuity was detected, 0 otherwise.
	vPixels0 = _mm_or_si128(vPixels0, vEdgeFlag);
	}
</pre>
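<p>The two fragments above can be combined into a self-contained function that is easy to test in isolation. This is a sketch under assumptions: <code>FlagDiscontinuities4</code> is a hypothetical name, it works on plain arrays with unaligned loads instead of the sample's work buffers, and it requires an x86 compiler with SSE2 support.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Compares 4 pixels in `row` with the 4 pixels in `neighbor` and ORs
// `edgeFlag` into every pixel of `row` whose R, G or B differs by 16 or more.
inline void FlagDiscontinuities4(std::uint32_t* row,
                                 const std::uint32_t* neighbor,
                                 std::uint32_t edgeFlag) {
    assert(row != nullptr && neighbor != nullptr);
    const __m128i p0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row));
    const __m128i p1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(neighbor));

    // Per-byte absolute difference: max(a, b) - min(a, b).
    __m128i diff = _mm_sub_epi8(_mm_max_epu8(p0, p1), _mm_min_epu8(p0, p1));
    // Keep only the high nibble of R, G and B: nonzero means a difference >= 16.
    diff = _mm_and_si128(diff, _mm_set1_epi32(0x00F0F0F0));

    // All ones in lanes with no discontinuity, zero in lanes with one.
    const __m128i noEdge = _mm_cmpeq_epi32(diff, _mm_setzero_si128());
    // andnot computes (~noEdge) & flag, so the flag survives only where an edge exists.
    const __m128i flags =
        _mm_andnot_si128(noEdge, _mm_set1_epi32(static_cast<int>(edgeFlag)));

    _mm_storeu_si128(reinterpret_cast<__m128i*>(row), _mm_or_si128(p0, flags));
}
```

<p>Calling it with <code>edgeFlag = 0x80000000</code> sets bit 31 only on the pixels that differ from their neighbor by at least 16 in a color channel; alpha-only differences leave the pixel untouched.</p>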
<p>We also rewrote the <code>MixColors</code> function (which computes a linear interpolation of two colors) as a full Intel® SSE implementation.</p>
<p> </p>
<h2 class="sectionHeading">The Transposing Optimization</h2>
<p>The next task of the algorithm is to find “discontinuity lines”, i.e., sequences of consecutive pixels which are marked with the same discontinuity flag (horizontal flag for horizontal blending pass, vertical flag for vertical blending pass) while walking rows and columns of our color buffer [Reshetov09, section 2.1].</p>
<p>Because discontinuity lines tend to be short, intuition suggests (and profiling data confirms) that the most expensive part of this operation is scanning <i>between</i> discontinuity lines, i.e., scanning the large areas of consecutive pixels where no discontinuity flag is set.</p>
<p>The good news is that we can check 4 pixels at a time using the <code>_mm_movemask_ps</code> Intel® SSE intrinsic when the following conditions are met:</p>
<ol>
<li>We are scanning 4 pixels stored at consecutive addresses;</li>
<li>The address of the first pixel is “Intel® SSE-aligned” (16 bytes alignment);</li>
<li>The discontinuity flag is stored as the high bit of the 32 bits of pixel data.</li>
</ol>
<p>During the horizontal blending pass, (1) is true (we scan horizontal rows of pixels in the color buffer represented as a 2D linear array of pixel data); (2) is true almost all the time (remember that 16 bytes alignment is equivalent to “the index of the starting pixel in the buffer is a multiple of 4” as the buffer is properly aligned); and (3) is true as we chose bit 31 to represent the horizontal discontinuity flag.</p>
<p>If all conditions are true, we compute the flags:</p>
<pre name="code" class="cpp">	__m128 PixelFlags = *(reinterpret_cast&lt;__m128*&gt;(WorkBuffer + OffsetCurrentPixel));
	// Creates a 4-bit mask from the most significant bits
	int HFlags = _mm_movemask_ps(PixelFlags);	
</pre>
<p>Five outcomes are possible, depending on the lowest bit set in <code>HFlags</code>:</p>
<ul>
<li>0 (the most common case by far): no discontinuity flag set in this group of 4 pixels, move to the next group of 4 pixels.</li>
<li>Bit 0 is the lowest bit set: discontinuity line starts at first pixel of the group.</li>
<li>Bit 1 is the lowest bit set: discontinuity line starts at second pixel of the group.</li>
<li>Bit 2 is the lowest bit set: discontinuity line starts at third pixel of the group.</li>
<li>Bit 3 is the lowest bit set: discontinuity line starts at fourth pixel of the group.</li>
</ul>
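<p>The decoding of the mask can be illustrated with portable code. This is a sketch, not the sample's implementation: <code>HighBitMask4</code> emulates what <code>_mm_movemask_ps</code> computes for four packed pixels, and both function names are hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>

// Portable equivalent of _mm_movemask_ps applied to 4 packed pixels:
// bit i of the result is bit 31 of pixels[i].
inline int HighBitMask4(const std::uint32_t* pixels) {
    assert(pixels != nullptr);
    int mask = 0;
    for (int i = 0; i < 4; ++i) {
        mask |= static_cast<int>(pixels[i] >> 31) << i;
    }
    return mask;
}

// Returns the index (0..3) of the first flagged pixel in the group,
// or -1 if the whole group can be skipped.
inline int FirstFlaggedPixel(int hFlags) {
    assert(hFlags >= 0 && hFlags <= 0xF);
    for (int i = 0; i < 4; ++i) {
        if (hFlags & (1 << i)) {
            return i;
        }
    }
    return -1;
}
```

<p>A return value of -1 corresponds to the common case where the scan simply advances by four pixels.</p>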
<p>This optimization is part of the reference implementation and works well for the horizontal blending pass, but cannot be applied directly to the vertical blending pass. As the code scans the buffer vertically, (1) is false (adjacent pixels in a column are not stored at consecutive addresses), and (3) is false as well (the vertical discontinuity flag is stored at bit 30 of the pixel data).</p>
<p>The problem with (3) is easy to work around by introducing a simple “shift left by one bit” operation if we are processing the vertical flags, transforming the code above to:</p>
<pre name="code" class="cpp">	int Shift = (EdgeFlag == EdgeFlagV) ? 1 : 0;
	__m128 PixelFlags = *(reinterpret_cast&lt;__m128*&gt;(WorkBuffer + OffsetCurrentPixel));
	PixelFlags = _mm_castsi128_ps( _mm_slli_epi32(_mm_castps_si128(PixelFlags), Shift) );
	int HFlags = _mm_movemask_ps(PixelFlags);
</pre>
<p>But (1) is still a problem. In addition, vertical scanning is extremely cache-unfriendly. Because of both of these issues, the vertical blending pass is about 3 times (300%!) as expensive as the horizontal blending pass in the reference implementation (as shown by profiling data).</p>
<p>Our solution to this issue is to make the vertical blending pass use the cache and Intel® SSE-friendly data access patterns of the horizontal blending pass by considering the color buffer as a matrix of pixels and transposing it between passes:</p>
<ul>
<li>Perform horizontal blending pass</li>
<li>Transpose the (horizontally blended) color buffer</li>
<li>Perform vertical blending pass</li>
<li>Transpose back color buffer</li>
</ul>
<p>This way the code for both blending passes is exactly the same (which adds the advantage of simplicity/readability), the only difference being which flag to scan for, and we benefit from all the optimizations and cache-friendliness of the horizontal pass.</p>
<p>In practice, the transpose operations are not implemented as separate passes/tasksets, but as the last part of their respective blending pass. This allows us to benefit from the cache “warmness”.</p>
<p>The two transpose operations are not free. The cost is the extra code execution time, one extra work buffer and a synchronization point between the horizontal and vertical passes (we must wait for <i>all</i> horizontal tasks to be done before we can start <i>any</i> of the vertical tasks, because we must wait for the color buffer to be fully transposed before starting the vertical pass work).</p>
<p>Even with these extra costs, the overall performance is significantly better than the previous approach. As expected, profiling data shows that both blending passes have equivalent performance.</p>
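<p>The transpose step itself can be sketched as a plain out-of-place matrix transpose. This is a simplified illustration under assumptions: <code>Transpose</code> is a hypothetical helper, the buffer is a tightly packed row-major array, and the sample fuses this work into the end of each blending pass instead of running it separately.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transposes a row-major width x height pixel buffer into a row-major
// height x width buffer, so that columns of the source become rows.
inline std::vector<std::uint32_t> Transpose(const std::vector<std::uint32_t>& src,
                                            std::size_t width,
                                            std::size_t height) {
    assert(src.size() == width * height);
    std::vector<std::uint32_t> dst(src.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            dst[x * height + y] = src[y * width + x];
        }
    }
    return dst;
}
```

<p>After the transpose, a column of the original image occupies consecutive addresses, so the row-scanning code of the horizontal pass can be reused unchanged. Transposing a second time, with width and height swapped, restores the original layout.</p>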
<p> </p>
<h2 class="sectionHeading">Performance Results</h2>
<p>MLAA performance was measured on the following two configurations:</p>
<ul>
<li>Code name “Huron River”: Intel® Core™ i7-2820QM Processor (Intel® microarchitecture code name Sandy Bridge, 4 cores 8 threads @ 2.30 GHz) with GT2 processor graphics, 4 GB of RAM, Windows 7 Ultimate 64-bit Service Pack 1</li>
<li>Code name “Kings Creek”: Intel® Core™ i5-660 Processor (code name “Clarkdale”, 2 cores 4 threads @ 3.33 GHz), with GMA HD Graphics (code name “Ironlake”), 2 GB of RAM, Windows 7 Ultimate 64-bit Service Pack 1</li>
</ul>
<p>We measured the average frame rendering time of our sample as a function of scene complexity for the different antialiasing settings. We also measured the rendering time for the <i>Copy/Map/Unmap only</i> mode to highlight the impact of the color buffer copy operations on the overall performance of the algorithm.</p>
<p> </p>
<h2 class="sectionHeading">Results</h2>
<p>The data shows that for the Huron River machine, the extra cost of MSAA 4x goes up linearly with the scene complexity (in fact, for all resolutions, the frame time when using MSAA 4x to render our test scene appears to be approximately double the frame time when no antialiasing is used). In contrast, the cost of MLAA appears more or less constant (around 4 ms/frame at 1280x800). This is consistent with our expectations because, unlike MSAA 4x, MLAA is executed only once per frame, regardless of scene complexity or number of draw calls.</p>
<p>We also observe that, except for very low complexity values (i.e., less than ~5), MLAA always outperforms MSAA 4x, regardless of resolution, and that the difference in performance grows with complexity (because, as noted above, the cost of MSAA 4x grows linearly with complexity whereas the cost of MLAA does not).</p>
<p>In the case of the Kings Creek machine, we can’t compare the cost of MLAA vs. MSAA 4x as the latter is not provided by the Ironlake hardware. The goal is then to determine if MLAA allows us to provide software antialiasing as an alternative with acceptable performance. Our measurements at the 1280x800 resolution show that again the cost of MLAA is largely independent of scene complexity, and the average value is ~7.5 ms (discarding the outlier data point at complexity = 100).</p>
<p>Interestingly, if we compare this result to a hypothetical MSAA 4x implementation with approximately the same performance profile as the Huron River one (frame time with MSAA 4x ~ 2x frame time with no antialiasing), we notice that again MLAA would outperform MSAA 4x for almost all complexity values (≥4 in this case).</p>
<p ><img  src="http://software.intel.com/file/37405" /> <br /><b>Table 1.</b><i> Rendering times of our test scene on the Kings Creek machine.</i></p>
<p ><img  src="http://software.intel.com/file/37406" /> <br /><b>Table 2. </b><i>Rendering times of our test scene on the Huron River machine.</i></p>
<p ><img  src="http://software.intel.com/file/37407" /> <br /><b>Figure 6: </b><i>Rendering times of our test scene for each of the antialiasing techniques, as a function of scene complexity. The bottom, flat curve is the difference between the “MLAA with pipeline on” curve, and the “No antialiasing” curve, measuring the cost of using MLAA [Huron River, res. 1280x800]</i></p>
<p ><img  src="http://software.intel.com/file/37408" /> <br /><b>Figure 7:</b><i> Rendering times of our test scene for each of the antialiasing techniques, as a function of scene complexity. The bottom, flat curve is the difference between the “MLAA with pipeline on” curve, and the “No antialiasing” curve, measuring the cost of using MLAA [Huron River, res. 1600x1200]</i></p>
<p ><img  src="http://software.intel.com/file/37409" /> <br /><b>Figure 8:</b><i> Rendering times of our test scene with MLAA on and off, as a function of scene complexity. The bottom, flat curve is the difference between the “MLAA with pipeline on” curve, and the “No antialiasing” curve, measuring the cost of using MLAA [Kings Creek, res. 1280x800]</i></p>
<h2 class="sectionHeading">References</h2>
<p>[Reshetov09] RESHETOV, A. 2009. “<a href="http://visual-computing.intel-research.net/publications/papers/2009/mlaa/mlaa.pdf">Morphological Antialiasing</a>”</p>
<p>[Perthuis 2010] PERTHUIS, C. 2010. MLAA in God of War 3. Sony Computer Entertainment America, PS3 Devcon, Santa Clara, July 2010.</p>
<p>[Jimenez et al., 2011] JIMENEZ, J., MASIA, B., ECHEVARRIA, J., NAVARRO, F. and GUTIERREZ, D. 2011. Practical Morphological Anti-Aliasing. In Wolfgang Engel, ed., GPU Pro 2. AK Peters Ltd.</p>
<p>[SIGGRAPH 2011] JIMENEZ, J., GUTIERREZ D., YANG, J., RESHETOV, A., DEMOREUILLE, P., BERGHOFF, T., PERTHUIS, C., YU, H., MCGUIRE, M., LOTTES, T., MALAN, H., PERSSON, E., ANDREEV, D. and SOUSA T. 2011. Filtering Approaches for Real-Time Anti-Aliasing. In <i>ACM SIGGRAPH 2011 Courses.</i></p>
<p>[Luminance] Definition of luminance for CRT-like devices:</p>
<p>INTERNATIONAL COMMISSION ON ILLUMINATION. 1971. Recommendations on Uniform Color Spaces, Color Difference Equations, Psychometric Color Terms. Supplement No.2 to CIE publication No. 15 (E.-1.3.1), TC-1.3, 1971.</p>
<p>And for LCDs:</p>
<p>ITU-R Rec. BT.709-5. 2008. Page 18, items 1.3 and 1.4</p>
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Alexandre De Pereyra</div>
</div>
<div id="vc-meta-pubdate">07-15-2011</div>
<div id="vc-meta-modificationdate">07-15-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div class="tbb">Intel® TBB</div>
</div>
<div id="vc-meta-category">
<div>Performance Analysis</div>
<div>Intel® SSE</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Efficient antialiasing techniques are an important tool of high-quality, real-time rendering. MSAA (Multisample Antialiasing) is the standard technique in use today, but comes with some serious disadvantages (incompatibility with deferred lighting, high memory/processing cost, inability to smooth non-geometric edges without alpha to coverage). Morphological Antialiasing (MLAA) is a new technique developed by Intel Labs to address these limitations. MLAA is an image-based, post-process filtering technique, that identifies discontinuity patterns and blends colors in the neighborhood of these patterns to perform effective antialiasing. This sample is based on the original CPU-based MLAA implementation, with improvements to greatly increase performance.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu/</link>
      <pubDate>Fri, 15 Jul 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu/</guid>
      <category>Visual Computing</category>
      <category>Game Development</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>&quot;MAXIS*-mizing&quot; Darkspore*: A Case Study of Graphic Analysis and Optimizations in Maxis’* Deferred Renderer</title>
      <description><![CDATA[ <p ><img title="Darkspore*" src="http://software.intel.com/file/37277" /></p>
<p>Download <a href="http://software.intel.com/file/38105">MAXIS-mizing Darkspore Game Performance with Intel® GPA 4.0</a> (PDF 3.4MB)</p>
<h2>Contents</h2>
<ul >
<li><a href="http://software.intel.com#newexperience">A New Gaming Experience Made Possible With Processor Graphics</a></li>
<li><a href="http://software.intel.com#everyone">Darkspore* is Designed for Everyone</a></li>
<li><a href="http://software.intel.com#rendering">Darkspore* Rendering</a></li>
<li><a href="http://software.intel.com#frameanalyzer">Hot Loading Shaders in Frame Analyzer</a></li>
<li><a href="http://software.intel.com#drewfirst">They Drew First (Deferred) Blood</a></li>
<li><a href="http://software.intel.com#otheroptimizations">Other Optimizations for Darkspore*</a></li>
<li><a href="http://software.intel.com#conclusion">Conclusion</a></li>
<li><a href="http://software.intel.com#references">References</a></li>
<li><a href="http://software.intel.com#authors">Authors</a></li>
</ul>
<a name="newexperience"></a>
<h2>A New Gaming Experience Made Possible With Processor Graphics</h2>
<p>Released in early 2011, the 2nd Generation Intel® Core™ processors have fundamentally changed the PC gaming landscape, with processor and graphics merged into a single piece of silicon. The tighter integration of graphics into the processor has revolutionized gaming performance and made your favorite games run and look great on mainstream machines. As a developer, the market for your games has opened up immensely.</p>
<p ><img src="http://software.intel.com/file/37278" /><br />Figure 1: PC Volumes Dwarf Consoles through 2014</p>
<p>Targeting the mainstream is possible and Intel provides you the tools to help achieve success in this market. Intel® Graphics Performance Analyzers (Intel® GPA) 4.0 was released at GDC 2011 with full support for this next generation of processors. Game developers including the Darkspore* team at Maxis* have taken full advantage of Intel® GPA 4.0 features to make sure their game fully utilizes the new processor graphics.</p>
<p><a name="everyone"></a></p>
<h2>Darkspore* is Designed for Everyone</h2>
<p>Darkspore* is best described as an online action RPG where you control a squad of three heroes to fight the Darkspore*, an evil infestation of creatures. The game includes a version of the award-winning Spore* Editor where you can fully customize not just basic features of the characters such as color, height, and textures, but also vertices in the characters’ meshes. With Darkspore*, Maxis* is targeting mainstream to high-end graphics hardware. At the Game Developers Conference* 2011, David Lee Swenson, the Lead Engineer on Darkspore*, presented how he used Intel® GPA 4.0 to help achieve this goal.</p>
<p><a name="rendering"></a></p>
<h2>Darkspore* Rendering</h2>
<p>For Darkspore*, Maxis* implemented a variation of the light pre-pass renderer described by [Engel 2008]. One main advantage of using a light pre-pass renderer is that it decouples lighting from the scene geometry. In a forward renderer, lighting is computed at the time the object is drawn. In a light pre-pass renderer, light is accumulated in an off-screen buffer and either applied in a post pass or sampled in a final pass per object. The Darkspore* renderer is composed of three main passes: deferred pass, lighting pass, and final pass. The deferred pass saves the material data for opaque objects in the scene, the lighting pass saves all lighting calculations into a light buffer, and the final pass uses the results of the previous passes to create the final frame. The deferred pass uses 2 render targets. The first render target is composed of world normals with a gloss term: Normals (RGB) + Gloss (A).</p>
<p ><img src="http://software.intel.com/file/37279" /><br />Figure 2: First render target = world space normals (RGB) + gloss (A)</p>
<p>Depth ([R*256]G), Specular Power (B), and a Toon ID (A) for a toon-style character outline are rendered to the second target of the deferred pass.</p>
<p ><img src="http://software.intel.com/file/37283" /><br />Figure 3: Second render target = Depth ([R*256]G) + Specular Power (B) + ToonID (A)</p>
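<p>One way to read the “Depth ([R*256]G)” notation is as a 16-bit depth value split across two 8-bit channels, with the high byte in R and the low byte in G. The sketch below illustrates that reading only. The exact encoding used by Darkspore* is not given in this article, and <code>PackDepth16</code> and <code>UnpackDepth16</code> are hypothetical names.</p>

```cpp
#include <cassert>
#include <cstdint>

// Assumption: depth is normalized to [0, 1] and quantized to 16 bits,
// with the high byte stored in R and the low byte stored in G.
inline void PackDepth16(float depth01, std::uint8_t& r, std::uint8_t& g) {
    assert(depth01 >= 0.0f && depth01 <= 1.0f);
    const std::uint32_t d = static_cast<std::uint32_t>(depth01 * 65535.0f + 0.5f);
    r = static_cast<std::uint8_t>(d >> 8);
    g = static_cast<std::uint8_t>(d & 0xFF);
}

// Recovers the normalized depth as (R * 256 + G) / 65535.
inline float UnpackDepth16(std::uint8_t r, std::uint8_t g) {
    return static_cast<float>(r * 256 + g) / 65535.0f;
}
```

<p>Splitting the value this way gives 16 bits of depth precision while leaving the B and A channels of the same render target free for the specular power and the toon ID.</p>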
<p>Following the deferred pass, there’s a lighting pass that renders up to 6 parallel lights as well as cloud shadows and the main shadow. Then, each point or spot light is rendered with gobos and/or shadows to a 16F light buffer: Diffuse (RGB) + Specular (A).</p>
<p ><img src="http://software.intel.com/file/37284" /><br />Figure 4: Light buffer = Diffuse (RGB) + Specular (A)</p>
<p>During the final pass, each object is rendered again, sampling the light buffer. Values for areas intended to glow are written to a second target. Then, the glow target is downsampled, blurred, and combined back together with particles and post-processing effects like fog, distortion, and the death effect.</p>
<p ><img src="http://software.intel.com/file/37285" /><br />Figure 5: Final pass = Color + Glow + Post FX + Particles + UI</p>
<p>At this point all it needs is the UI and the final frame is complete.</p>
<p ><img src="http://software.intel.com/file/37286" /><br />Figure 6: The final frame</p>
<p><a name="darksporegpa"></a></p>
<h2>Darkspore* and Intel® GPA</h2>
<p>Before Intel® GPA, the Darkspore* team at Maxis* was using a long list of tests and settings to understand and debug game performance. With Intel® GPA, the game can be run with the applicable command line options. Once a frame is captured, it’s just a matter of setting the X and Y axis to the appropriate metrics to get a useful profile of performance.</p>
<p ><img src="http://software.intel.com/file/37287" /><br />Figure 7: Darkspore* with command line options</p>
<p>Since Darkspore* was known to be pixel shader bound, the Darkspore* team typically would set the X axis to the “PS Duration” metric and the Y axis to “GPU Duration”. With these metrics, the taller the bar, the more GPU time the call takes, and the wider the bar, the more pixel shader time it takes.</p>
<p ><img src="http://software.intel.com/file/37288" /><br />Figure 8: Initial performance graph with four expensive calls</p>
<p>The four calls labeled above (A, B, C, D) are the most expensive in this frame capture. Calls A and C correspond to the deferred and final pass for the blood decals. Call B corresponds to the parallel lights, cloud shadows, and shadows. Call D runs the edge detection that was originally a bigger part of the game but has since been moved to the high spec configuration. Looking at the shaders for the blood decal calls (A and C), a few issues were found and fixed by the Darkspore* team:</p>
<ul>
<li>Tiling of decals was supported but never used</li>
<li>Vectors were normalized that were used only for a cubemap lookup</li>
<li>A Fresnel term was adding very little to the scene, given the fixed camera angle</li>
<li>Alpha test was implemented with both a clip instruction and blending</li>
<li>Normal calculation was overly complex with values that could be moved to the vertex shader</li>
</ul>
<p>After optimizing the decal shader and moving character shading to the high spec, re-profiling Darkspore* shows only Call B, which corresponds to parallel lights and shadows. Given the amount of work done in Call B, it is expected to be expensive, but grouping all these tasks into one call amortizes the cost of picking up the normal and depth values and recovering the position.</p>
<p ><img src="http://software.intel.com/file/37289" /><br />Figure 9: Re-profiling Darkspore* after optimizations to the decal shader</p>
<p><a name="frameanalyzer"></a></p>
<h2>Hot Loading Shaders in Frame Analyzer</h2>
<p>The Darkspore* team used Frame Analyzer to verify that the changes made to the decal shaders had the positive impact they expected, as shown above. They also took advantage of the shader hot-loading feature of Frame Analyzer to test changes on the fly. Frame Analyzer allows shaders to be replaced, in HLSL or shader assembly, for selected calls. By hot loading shaders in Frame Analyzer, you can immediately see the performance difference of the changes without having to capture another frame and hope to recreate enough of the same events in the scene to make it comparable.</p>
<p>Using this hot loading functionality, the Darkspore* team was able to rapidly test changes made to the decal shaders. After loading the modified decal pixel shader for calls A and C, the deferred and final pass draw calls, Frame Analyzer showed a 30% and 24% improvement respectively for these calls.</p>
<p ><img src="http://software.intel.com/file/37290" /><br />Figure 10: 30% improvement on Call A with modified decal shader</p>
<p><a name="drewfirst"></a></p>
<h2>They Drew First (Deferred) Blood</h2>
<p>Looking further at these two calls that draw the blood decals in Frame Analyzer, we can see their volumes if we set the render state to wireframe. The blood decals are effectively writing all of the pixels in the volume because there is no test in place to early out in the pixel pipeline. The stencil buffer can be used to kill pixels in the final pass.</p>
<p ><img src="http://software.intel.com/file/37291" /><br />Figure 11: Blood decals writing to too many pixels</p>
<p>After making the change to the final pass and doing a simple frame rate check at the beginning of the level, there was a noticeable frame rate improvement of about 2 FPS. In Frame Analyzer, this new stencil write test can be set on the decal calls to fully understand the effect:</p>
<ol>
<li>Select both calls A and C (deferred and final pass) and set STENCILENABLE to true. </li>
<li>Then for call A, set STENCILFUNC to D3DCMP_ALWAYS, the STENCILPASS to D3DSTENCILOP_REPLACE, and STENCILREF to 2. </li>
<li>For call C, set STENCILFUNC to D3DCMP_EQUAL and the STENCILREF to 2.</li>
</ol>
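<p>The effect of these three settings can be illustrated with a small CPU-side simulation. This is a simplified model, not Direct3D code: <code>SimulateDecalStencil</code> and its inputs are hypothetical, and the model assumes that the stencil value is written only where the deferred-pass pixel survives the depth test and the shader's clip.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct StencilDemoResult {
    std::size_t shadedWithoutStencil;  // final-pass pixels shaded with no stencil test
    std::size_t shadedWithStencil;     // final-pass pixels shaded with EQUAL, ref 2
};

// volumeCoverage[i]: pixel i is rasterized by the decal volume.
// decalVisible[i]:   the deferred pass actually writes the decal at pixel i.
inline StencilDemoResult SimulateDecalStencil(const std::vector<bool>& volumeCoverage,
                                              const std::vector<bool>& decalVisible) {
    assert(volumeCoverage.size() == decalVisible.size());
    const std::uint8_t kRef = 2;
    std::vector<std::uint8_t> stencil(volumeCoverage.size(), 0);

    // Call A (deferred pass): func ALWAYS, pass op REPLACE, ref 2.
    for (std::size_t i = 0; i < stencil.size(); ++i) {
        if (volumeCoverage[i] && decalVisible[i]) {
            stencil[i] = kRef;
        }
    }

    // Call C (final pass): func EQUAL, ref 2. Only marked pixels are shaded.
    StencilDemoResult result{0, 0};
    for (std::size_t i = 0; i < stencil.size(); ++i) {
        if (!volumeCoverage[i]) {
            continue;
        }
        ++result.shadedWithoutStencil;
        if (stencil[i] == kRef) {
            ++result.shadedWithStencil;
        }
    }
    return result;
}
```

<p>The gap between the two counts is the pixel shader work that the stencil test removes from the final pass.</p>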
<p ><img src="http://software.intel.com/file/37292" /><br />Figure 12: Blood decals now being stenciled correctly</p>
<p>Now the blood decals are only writing the appropriate pixels and the final pass draw call was improved by 65.1% as reported by Frame Analyzer. All of these rendering changes could be done live and in the same session within Frame Analyzer.</p>
<p><a name="otheroptimizations"></a></p>
<h2>Other Optimizations for Darkspore*</h2>
<p>The Darkspore* team made several miscellaneous optimizations to the trees, terrain, character detail, and render system. The forest levels in the game had lots of trees. Maxis* found all the trees' models had roots below the ground that were being rendered, adding unnecessary polygons that no one ever saw. These were promptly removed saving processing time.</p>
<p ><img src="http://software.intel.com/file/37293" /><br />Figure 13: Dense tree root geometry was drawn but not visible</p>
<p>In Darkspore*, the terrain mixes 4 textures together per pass. The artists have control of which textures are mixed per vertex. Large sections were found to only need one texture instead of mixing four. These triangles that had only one texture were instead rendered with a simpler material. This was a big win in some cases and smaller in others, but always a win.</p>
<p ><img src="http://software.intel.com/file/37294" /><br />Figure 14: Not all landscape triangles require blended textures</p>
<p>With the Spore* Editor, the player can customize their squad creatures with countless combinations of parts and equipment. This results in some very detailed but also very polygon-heavy creatures. The wireframe of the high LOD (level of detail) for a player creature shows the high polygon density of the creatures' models. NPCs were found to be at least this dense. Surprisingly, when these creatures got reduced to the one triangle per pixel level, they were hurting both pixel and vertex shader performance because of the way most graphics parts allocate 4 pixels to a "quad" and a minimum of one quad per triangle. This wasted three quarters of the pixel shader performance.</p>
<p ><img src="http://software.intel.com/file/37295" /><br />Figure 15: Character model with high level of detail</p>
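<p>The three-quarters figure follows from simple arithmetic on 2x2 pixel quads, shown in this small sketch. <code>QuadUtilization</code> is a hypothetical helper, and the model ignores everything except quad granularity.</p>

```cpp
#include <cassert>

// Fraction of pixel shader lanes doing useful work when a triangle covers
// `coveredPixels` pixels spread over `quadsTouched` 2x2 quads.
inline double QuadUtilization(int coveredPixels, int quadsTouched) {
    assert(quadsTouched > 0);
    assert(coveredPixels > 0 && coveredPixels <= 4 * quadsTouched);
    return static_cast<double>(coveredPixels) / (4.0 * quadsTouched);
}
```

<p>A one-pixel triangle touching a single quad gives a utilization of 1/4, which means 3/4 of the shaded lanes are thrown away.</p>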
<p>As part of the light pre-pass renderer, normals were originally saved in view space. The view space normals only required two channels but weren't worth the cost of pixel shader instructions to recover the normal. Also, a world space normal was needed for reflective/refractive objects which resulted in a transform in the pixel shader, which would be a really bad place for an extra matrix transform. In the end, it was worth an extra channel to keep the normal in world space and avoid adding the extra matrix transform to the pixel shader.</p>
<p ><img src="http://software.intel.com/file/37296" /><br />Figure 16: World space normal used in reflective/refractive objects</p>
<p><a name="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Using the various features available in Intel® GPA 4.0, the Darkspore* team was able to discover and fix bottlenecks in their graphics pipeline. At the end of the day, many of the optimizations made for mainstream graphics improved the overall gaming experience. With these optimizations in place, Darkspore* runs at well over 30 FPS on the 2nd Generation Intel® Core™ processors.</p>
<p><a name="references"></a></p>
<h2>References</h2>
<p>Engel, Wolfgang. Light Pre-Pass Renderer. March 16, 2008. <a href="http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html">http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html</a></p>
<p><a name="authors"></a></p>
<h2>Authors</h2>
<p><b>David Lee Swenson</b> (Electronic Arts*) is a 20-year software industry veteran. He’s Maxis* lead engineer on new rendering architecture and art pipelines for Darkspore*. Previously, David was responsible for the Spore* environment rendering and terraforming systems. He has also worked on console and PC titles at LucasArts*, 3DO*, Sierra On-Line*, Orion*, and MediaFactory.</p>
<p><b>Omar A Rodriguez</b> (Intel Corporation) is a graphics software engineer in the Intel® Software and Services Group, where he supports Intel® graphics solutions in the Visual Computing Software Division. He holds a B.S. in Computer Science from Arizona State University. Omar is not the lead guitarist for the Mars Volta*.</p>
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Omar A. Rodriguez</div>
</div>
<div id="vc-meta-pubdate">06-28-2011</div>
<div id="vc-meta-modificationdate">06-28-2011</div>
<div id="vc-meta-taxonomy">Case Studies</div>
<div id="vc-meta-category-product">
<div class="gpa">Intel® GPA</div>
</div>
<div id="vc-meta-category"></div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout">http://software.intel.com/file/38561</div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">This article details the optimization of Darkspore*, a game from Maxis, that enabled it to run at well over 30 fps on 2nd Generation Intel® Core™ processors. Using various features available in Intel® GPA 4.0, the Darkspore* team was able to discover and fix bottlenecks in their graphics pipeline. Many of the optimizations utilized the full features of processor graphics and improved the overall gaming experience for consumers.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/optimizations-in-maxis-deferred-renderer/</link>
      <pubDate>Mon, 27 Jun 2011 23:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/optimizations-in-maxis-deferred-renderer/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/optimizations-in-maxis-deferred-renderer/</guid>
      <category>Visual Computing</category>
      <category>Intel® Graphics Performance Analyzers Knowledge Base</category>
      <category>Intel® Graphics Performance Analyzers (Intel® GPA)</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Compiler settings for memory error analysis in Intel® Inspector XE*</title>
      <description><![CDATA[ <p><b>Introduction:</b><br />Memory error analysis in Intel® Inspector XE and/or Intel® Parallel Inspector can analyze most native binaries. However, some settings make analysis easier.</p>
<p>For the purposes of this article, "Intel Inspector XE" refers to the memory error analysis in Intel Inspector XE and/or Intel Parallel Inspector.</p>
<p><b>Useful Settings for memory error analysis in Intel® Inspector XE:</b></p>
<table cellpadding="0" cellspacing="0" border="1">
<tbody>
<tr>
<td width="90" valign="top">
<p><b>Linux* Switch</b></p>
</td>
<td width="90" valign="top"><strong>Windows* Switch</strong></td>
<td valign="top">
<p><b>Purpose</b></p>
</td>
</tr>
<tr>
<td width="90" valign="top">
<p>-g</p>
</td>
<td width="90" valign="top">
<p>/Zi, /ZI</p>
</td>
<td valign="top">
<p>Highly Recommended.</p>
<p>Intel Inspector XE uses the symbols to associate addresses with source lines.</p>
<p>Additionally, using this setting is one of the ways in which memory error analysis filters out false positives.</p>
</td>
</tr>
</tbody>
</table>
<p><b></b></p>
<p><b>Settings which impact memory error analysis in Intel Inspector XE:</b></p>
<table cellpadding="0" cellspacing="0" border="1">
<tbody>
<tr>
<td width="90" valign="top"><strong>Linux Switch</strong></td>
<td width="90" valign="top"><strong>Windows Switch</strong></td>
<td valign="top"><strong>Purpose</strong></td>
</tr>
<tr>
<td width="90" valign="top">
<p>-O0</p>
</td>
<td width="90" valign="top">/Od</td>
<td valign="top">
<p>Recommended for initial analysis.</p>
<p>Allows Intel Inspector XE to more easily associate errors with the correct source line.</p>
<p>Intel Inspector XE can also analyze optimized binaries, but it is difficult to pinpoint the source code location causing a problem in optimized assembly that does not map to specific source lines.</p>
<p>Note: While binaries built with -O0 are easier to analyze, it is also important to check for memory errors in the <b>release</b> (not -O0) builds of your binaries.</p>
</td>
</tr>
</tbody>
</table>
<p><b></b></p>
<p><b>Settings not recommended for use with memory error analysis in Intel® Inspector XE:</b></p>
<table cellpadding="0" cellspacing="0" border="1">
<tbody>
<tr>
<td width="90" valign="top">
<p><b>Linux Switch</b></p>
</td>
<td width="90" valign="top">
<p><b>Windows <br />Switch</b></p>
</td>
<td valign="top">
<p><b>Purpose</b></p>
</td>
</tr>
<tr>
<td width="90" valign="top">
<p>-fmudflap<br />-ftrapuv</p>
</td>
<td width="90" valign="top">
<p>/RTC[su1]</p>
</td>
<td valign="top">
<p>Not Recommended.</p>
<p>Compiler options that add functionality similar to Intel Inspector XE can cause false positives and false negatives in Intel Inspector XE.</p>
<p>The -fmudflap switch is known to cause false positives and false negatives with Intel Inspector XE. <br />-ftrapuv is known to cause false negatives. <br />/RTC[us] initializes uninitialized memory with a bit pattern, which prevents memory error analysis in Intel® Inspector XE from identifying uninitialized memory errors in your code. The errors this switch reports also partly duplicate those that memory error analysis detects at Level 4.</p>
<p>Switches such as these may impact performance without adding functionality.</p>
<p>These switches may be useful outside Intel Inspector XE and may catch additional issues that Intel Inspector XE does not find.</p>
<p>-fstack-security-check, which also adds functionality similar to Intel Inspector XE, is not known to cause false positives or false negatives with Intel Inspector XE.</p>
</td>
</tr>
<tr>
<td width="90" valign="top">
<p>-tprofile</p>
</td>
<td width="90" valign="top">/Qtprofile</td>
<td valign="top">
<p>Do not use.</p>
<p>This Intel Compiler setting is an alternative method of instrumentation for Intel® Thread Profiler. The instrumentation added by -tprofile is not supported by Intel Inspector XE.</p>
</td>
</tr>
<tr>
<td width="90" valign="top">
<p>-tcheck</p>
</td>
<td width="90" valign="top">/Qtcheck</td>
<td valign="top">
<p>Do not use.</p>
<p>This Intel Compiler setting is an alternative method of instrumentation for Intel® Thread Checker. The instrumentation added by -tcheck is not supported by Intel Inspector XE.</p>
</td>
</tr>
<tr>
<td width="90" valign="top">
<p>-msse4a, <br />-m3dnow</p>
</td>
<td width="90" valign="top">N/A</td>
<td valign="top">
<p>Do not use.</p>
<p>Binaries which use instructions not supported by Intel Processors may cause unknown behaviors in Intel Inspector XE.</p>
</td>
</tr>
<tr>
<td width="90" valign="top">
<p>-debug [keyword]</p>
</td>
<td width="90" valign="top">/debug<br />[keyword]</td>
<td valign="top">
<p>Not Recommended.</p>
<p>Intel Inspector XE works best with -debug full (the default). Other options, including parallel, extended, emit-column, expr-source-pos, inline-debug-info, semantic-stepping, and variable-locations, are not supported by Intel Inspector XE.</p>
</td>
</tr>
</tbody>
</table>
<p><b></b></p>
<p><b>Settings which have no impact on memory error analysis of Intel Inspector XE:</b></p>
<table cellpadding="0" cellspacing="0" border="1">
<tbody>
<tr>
<td width="250" valign="top">
<p><b>Linux Switch</b></p>
</td>
<td width="250" valign="top"><strong>Windows Switch</strong></td>
<td valign="top">
<p><b>Purpose</b></p>
</td>
</tr>
<tr>
<td width="250" valign="top">
<p>-static<br />-static-libgcc<br />-static-intel<br />-shared-libgcc<br />-openmp-link</p>
</td>
<td width="250" valign="top">
<p>/MDd, /MD, /MT, /MTd, /Qopenmp-link</p>
</td>
<td valign="top">
<p>These settings direct the compiler to link in various libraries statically or dynamically. These switches impact Intel® Amplifier XE and threading error analysis for Intel Inspector XE. Memory error analysis in Intel Inspector XE works with statically linked libraries.</p>
</td>
</tr>
<tr>
<td width="250" valign="top">
<p>-DTBB_USE_THREADING_TOOLS</p>
</td>
<td width="250" valign="top">
<p>/DTBB_USE_THREADING_TOOLS</p>
</td>
<td valign="top">
<p>Setting TBB_USE_THREADING_TOOLS causes Intel Threading Building Blocks (TBB) to be instrumented. This switch impacts Intel® Amplifier XE and threading error analysis for Intel® Inspector XE. Setting _DEBUG or TBB_USE_DEBUG will in turn set TBB_USE_THREADING_TOOLS.</p>
</td>
</tr>
<tr>
<td width="250" valign="top">N/A</td>
<td width="250" valign="top">
<p>/FIXED[:NO]</p>
</td>
<td valign="top">
<p>This setting allows binaries to be instrumented and is not required for Intel Inspector XE.</p>
</td>
</tr>
</tbody>
</table>
<p><b>Notes:</b> <br />1) Memory Error Analysis Level 1 (Memory Leak Detection) requires the following information in the executable and in all shared libraries of your application to walk the call stack properly:</p>
<p>a) Frame pointers: Use -fno-omit-frame-pointer.</p>
<p>b) Exception handling information: enabled via -fasynchronous-unwind-tables, -fexceptions, or -O0.<br /><br />2) Using Debug versions of the Microsoft C Runtime Libraries (/MDd and /MTd) enables the Microsoft* debug heap manager. See: <a href="http://software.intel.com/en-us/articles/using-the-microsoft-debug-heap-manager-with-memory-error-analysis-of-intel-parallel-inspector/">Using the Microsoft* debug heap manager with memory error analysis of Intel® Parallel Inspector</a>.<br /><br />Note: Other options may add frame pointers or exception handling information to your binary as a side effect, for example -fexceptions (which is the default for C++) or -O0. To make sure the executable (and shared libraries) contain this information, run the objdump -h &lt;binary&gt; command. The output should list an .eh_frame_hdr section.</p>
<p><b>More Information:</b></p>
<p>This article addresses the most obvious switches that developers may have concerns about. Most switches will work with Intel® Parallel Inspector and/or Intel Inspector XE, but not every switch combination is tested. If you have information regarding other switches, please add a comment to this article. If you have a question regarding a particular switch, please submit an issue to the <a href="http://software.intel.com/en-us/forums/intel-inspector-xe/">Intel Inspector XE Forum</a>.</p>
<p><b>Versions:</b><br />Intel® Inspector XE 2011<br />Intel® C++ Compiler 11.X,12.X<br />GCC Compiler 3.4.6 – GCC 4.5.0<br />MS Visual Studio 2005, 2008, 2010</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/compiler-settings-for-memory-error-analysis-in-intel-inspector-xe/</link>
      <pubDate>Tue, 05 Apr 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/compiler-settings-for-memory-error-analysis-in-intel-inspector-xe/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/compiler-settings-for-memory-error-analysis-in-intel-inspector-xe/</guid>
      <category>Tools</category>
      <category>Intel SW Partner program</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Dynamic Resolution Rendering</title>
      <description><![CDATA[ <link media="screen" href="http://software.intel.com/media/gamedev/css/3302_Intel_VC_01.css?v=11" type="text/css" rel="stylesheet" />
<link media="screen" href="http://software.intel.com/file/23729" type="text/css" rel="stylesheet" />
<table width="100" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td valign="top">
<div id="left_container">
<div id="header_content"><a href="http://software.intel.com/en-us/visual-computing/" title="Visual Computing Developer Community"><img height="96" width="727" src="http://software.intel.com/file/20493/" border="0" /></a></div>
<div id="left_content_container2">
<div id="showcase_01">
<div >
<h2>Dynamic Resolution Rendering</h2>
<p>The Dynamic Resolution Rendering sample demonstrates a technique for dynamically adjusting the resolution of the main rendering, with graphical user interface components drawn at the native resolution. This technique can help developers achieve their quality and performance goals on a wide range of hardware and scene profiles. The sample also shows how the addition of scaling filters, such as velocity-aware temporal spatial anti-aliasing, can enhance quality.<br /><br /><br /><a href="javascript:void(0)" onclick="ndownload('http://software.intel.com/file/37830')"><img src="http://software.intel.com/file/25370" border="0" /></a></p>
</div>
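<p>One simple way to drive the resolution is a feedback loop on frame time. The Python sketch below is an assumption made for illustration and is not the control method used by the sample. The function name next_resolution_scale, the proportional update rule, and the clamp limits are all hypothetical.</p>

```python
def next_resolution_scale(scale, frame_ms, target_ms,
                          min_scale=0.5, max_scale=1.0, rate=0.1):
    """Nudge the render-target scale toward the frame-time budget.

    scale is the fraction of native width and height used for the main
    scene. The GUI is assumed to be drawn at native resolution afterwards.
    """
    # A positive error means the frame finished faster than the budget.
    error = (target_ms - frame_ms) / target_ms
    proposed = scale * (1.0 + rate * error)
    return max(min_scale, min(max_scale, proposed))


# Example: aim for 30 FPS, which is a budget of roughly 33.3 ms per frame.
TARGET_MS = 1000.0 / 30.0
after_slow_frame = next_resolution_scale(1.0, frame_ms=40.0, target_ms=TARGET_MS)
after_fast_frame = next_resolution_scale(0.8, frame_ms=20.0, target_ms=TARGET_MS)
```

<p>A slow frame lowers the scale and a fast frame raises it, while the clamp keeps the result between the assumed minimum and the native resolution.</p>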
<div >
<object height="203" width="360" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" id="v_7525_1989">
<param name="id" value="v_7525_1989" />
<param name="name" value="v_7525_1989" />
<param name="flashvars" value="file=http://software.intel.com/media/videos/4/a/3/e/0/0/9/DRR_Video_360_V2.mp4&amp;autostart=false&amp;bufferlength=5&amp;allowfullscreen=true&amp;plugins=http://software.intel.com/common/swf/listen&amp;title=The+Dynamic+Resolution+Rendering+Sample" />
<param name="allowfullscreen" value="true" />
<param name="src" value="http://software.intel.com/common/swf/mediaplayer.swf" /><embed src="http://software.intel.com/common/swf/mediaplayer.swf" allowfullscreen="true" flashvars="file=http://software.intel.com/media/videos/4/a/3/e/0/0/9/DRR_Video_360_V2.mp4&amp;autostart=false&amp;bufferlength=5&amp;allowfullscreen=true&amp;plugins=http://software.intel.com/common/swf/listen&amp;title=The+Dynamic+Resolution+Rendering+Sample" type="application/x-shockwave-flash" name="v_7525_1989" height="203" width="360" id="v_7525_1989"></embed>
</object>
<center><b>Video:</b> <a href="http://software.intel.com/en-us/videos/the-dynamic-resolution-rendering-sample/">Dynamic Resolution Rendering Walkthrough Video <br /></a>(click link to view larger)</center>
<p><br /><b>Read:</b> <a href="http://software.intel.com/en-us/articles/dynamic-resolution-rendering-article/">Dynamic Resolution Rendering <br /></a><b>Blog Post:</b> <a href="http://software.intel.com/en-us/blogs/2011/07/18/dynamic-resolution-rendering-sample-now-live/">Dynamic Resolution Rendering Sample Now Live</a> - Blog post by Doug Binks</p>
</div>
<br clear="all" />
<div>
<table bgcolor="#ffffff" width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td><img height="37" width="531" src="http://software.intel.com/file/25372" /></td>
<td></td>
</tr>
</tbody>
</table>
<table bgcolor="#ffffff" cellpadding="0" bordercolor="#ffffff" cellspacing="6" border="0">
<tbody>
<tr>
<td valign="top">
<div align="center"><a alt="DynamicResolutionRendering_ScreenShot01_web.jpg" href="http://software.intel.com/file/37633" title="DynamicResolutionRendering_ScreenShot01_web.jpg"><img src="http://software.intel.com/file/37632" alt="DynamicResolutionRendering_ScreenShot01_thumb.jpg" title="DynamicResolutionRendering_ScreenShot01_thumb.jpg" /></a></div>
</td>
<td valign="top">
<div align="center"><a alt="DynamicResolutionRendering_ScreenShot02_web.jpg" href="http://software.intel.com/file/37635" title="DynamicResolutionRendering_ScreenShot02_web.jpg"><img src="http://software.intel.com/file/37634" alt="DynamicResolutionRendering_ScreenShot02_thumb.jpg" title="DynamicResolutionRendering_ScreenShot02_thumb.jpg" /></a></div>
</td>
<td valign="top">
<div align="center"><a alt="DynamicResolutionIllustration.jpg" href="http://software.intel.com/file/37638" title="DynamicResolutionIllustration.jpg"><img src="http://software.intel.com/file/37637" alt="DynamicResolutionIllustration_thumb.jpg" title="DynamicResolutionIllustration_thumb.jpg" /> </a></div>
</td>
</tr>
<tr>
<td valign="top">
<table align="center" cellpadding="6" cellspacing="6" border="0">
<tbody>
<tr>
<td valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td cellpadding="6" cellspacing="6">
<p><i>A view of the swinging bell with supersampling enabled and dynamic resolution control set to give 30FPS. The geometric temporal anti-aliasing uses velocity aware filtering to reduce ghosting artefacts.</i></p>
</td>
</tr>
</tbody>
</table>
</td>
<td valign="top">
<table align="center" cellpadding="6" cellspacing="6" border="0">
<tbody>
<tr>
<td valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td cellpadding="6" cellspacing="6">
<p><i>With dynamic resolution, parts of the scene that render faster can use a higher resolution while still meeting the chosen performance criteria. Geometric temporal anti-aliasing gives additional smoothness to the edges.</i></p>
</td>
</tr>
</tbody>
</table>
</td>
<td valign="top">
<div align="center">
<table align="center" cellpadding="6" cellspacing="6" border="0">
<tbody>
<tr>
<td width="161" valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td cellpadding="6" cellspacing="6">
<p align="center">An illustration of the basic principle of dynamic resolution.</p>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<br /><br /><!-- start of 3 column table -->
<table width="695" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td width="695" rowspan="2" valign="top">
<table width="695" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td valign="top"><img height="8" width="697" src="http://software.intel.com/file/22889" /></td>
</tr>
<tr>
<td valign="top" class="panel_bg_02">
<table width="695" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td width="12" rowspan="2"><img height="8" width="12" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
<td valign="top" height="4"><img height="8" width="8" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top">
<table width="100%" cellpadding="2" cellspacing="0" border="0">
<tbody>
<tr>
<td align="center" width="65%" valign="top" height="19" class="arrow"><span ><b>System Requirements</b></span></td>
<td align="center" width="35%" valign="top" height="19" class="arrow"><span ><b><a href="http://software.intel.com/en-us/articles/code/">Additional Code Samples</a></b></span></td>
</tr>
<tr>
<td align="left" width="33%" valign="top" height="19" class="arrow"></td>
<td align="left" width="33%" valign="top" height="19" class="arrow"></td>
</tr>
<tr>
<td align="left" width="33%" valign="top" height="19"><ol type="1">
<li>CPU: Intel® 2nd Generation Core i3 or better</li>
<li>GFX: uses DX11 graphics API on DX10 (or better) hardware </li>
<li>OS: Microsoft Windows Vista* / Windows 7* </li>
<li>MEM: 2 GB of RAM or better </li>
<li>Software: <ol type="1">
<li>DirectX SDK (June 2010 release or later)</li>
<li>Build with Microsoft Visual Studio 2010* or Microsoft Visual Studio 2008*</li>
</ol></li>
</ol>
<p>* Other names and brands may be claimed as the property of others.</p>
</td>
<td align="left" width="33%" valign="top" height="19">
<ul>
<li><a href="http://software.intel.com/en-us/articles/sandy-bridge-samples/" title="Sandy Bridge Samples">Sandy Bridge Samples</a></li>
<li><a href="http://software.intel.com/en-us/articles/game-engine-tasking-animation/">Game Engine Tasking</a></li>
<li><a href="http://software.intel.com/en-us/articles/shadowexplorer/">Shadow Explorer</a></li>
<li><a href="http://software.intel.com/en-us/articles/onloaded-shadows/" title="Onloaded Shadows">Onloaded Shadows</a></li>
<li><a href="http://software.intel.com/en-us/articles/avx-cloth/" title="AVX Cloth">AVX Cloth</a></li>
<li><a href="http://software.intel.com/en-us/articles/gpu-detect-sample/">GPU Detect</a></li>
</ul>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<!--bottom border for large box-->
<tr>
<td valign="top"><img height="8" width="697" src="http://software.intel.com/media/gamedev/_images/footer-bg-01.gif" /></td>
</tr>
<!--end border-->
</tbody>
</table>
</td>
<td width="10" rowspan="2"><img height="10" width="10" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td width="344" valign="top"></td>
</tr>
</tbody>
</table>
<!-- end of 3 column table --><br /><br /></div>
</div>
</div>
</td>
<td valign="top" ><!-- RHC -->
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td align="center" width="215">
<table align="center" width="223" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td height="4"><img height="4" width="232" src="http://software.intel.com/file/20516" /></td>
</tr>
<tr>
<td>
<table align="center" width="223" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td align="center" valign="top"><a href="http://www.intelsoftwaregraphics.com/?lid=5ceakfXf8Ho=&amp;siteid=cqMoF5H/37o="><img height="71" width="223" src="http://software.intel.com/file/20512" alt="Intel Visual Adrenaline" border="0" title="Intel Visual Adrenaline" /></a></td>
</tr>
<tr>
<td valign="top" >
<table width="223" cellpadding="0" cellspacing="0" border="0" >
<tbody>
<tr>
<td width="11" height="8"></td>
<td align="center" width="10"><img height="5" width="5" src="http://software.intel.com/file/20514" /></td>
<td align="left"><a href="http://software.intel.com/en-us/visual-computing/" title="Intel Adrenaline Developer Community" >Developer Community</a></td>
<td width="10"></td>
</tr>
<tr>
<td height="8"></td>
<td align="center"><img height="5" width="5" src="http://software.intel.com/file/20514" /></td>
<td align="left"><a href="http://www.intel.com/cd/software/partner/asmo-na/eng/index.htm" title="Intel Adrenaline Software Partner Program" >Intel® Software Partner Program</a></td>
<td></td>
</tr>
<tr>
<td height="8"></td>
<td align="center"><img height="5" width="5" src="http://software.intel.com/file/20514" /></td>
<td align="left"><a href="http://www.intel.com/Consumer/Game/index.htm" title="Intel Adrenaline Game On" >Game On</a></td>
<td></td>
</tr>
<tr>
<td height="8"></td>
<td align="center"><img height="5" width="5" src="http://software.intel.com/file/20514" /></td>
<td align="left"><a href="http://www.intelsoftwaregraphics.com/?lid=5ceakfXf8Ho=&amp;siteid=cqMoF5H/37o=" title="Intel Adrenaline Showcase" >Showcase</a></td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td valign="top" height="7"><img height="7" width="223" src="http://software.intel.com/file/20515" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top" height="4"><img height="6" width="6" src="http://software.intel.com/file/20494" /></td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
<br /><center>
<table cellpadding="0" cellspacing="0" border="0" id="nav_table">
<tbody>
<tr>
<td>
<table width="190" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td width="9"></td>
<td>
<div align="center" ><b>A Scalable 3D <br />Particle System</b><br /><a href="http://software.intel.com/en-us/articles/tickertape/" title="TickerTape"><img src="http://software.intel.com/file/25664/" alt="Download PDF" border="0" /></a><br /><br /><b>Benefits of SIMD</b><br /><a href="http://software.intel.com/en-us/articles/tickertape-part-2/"><img src="http://software.intel.com/file/25665/" alt="Download PDF" border="0" /></a><br /><br /><b>Visual Adrenaline</b><br /><a href="http://software.intel.com/sites/billboard/"><img src="http://software.intel.com/file/25369" alt="Download PDF" border="0" /></a><br /></div>
</td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</center><br /><center>
<table cellpadding="0" cellspacing="0" border="1" id="nav_table">
<tbody>
<tr>
<td>
<table width="190" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td width="9" class="right_container_hdr2"></td>
<td class="right_container_hdr2"><b>Intel Tools for Unreal Developers <br /><a href="http://software.intel.com/en-us/articles/epic-licenses-tbb-for-ue-licensees/">TBB for Unreal Engine</a></b></td>
<td class="right_container_hdr2"></td>
</tr>
<tr>
<td colspan="3" valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/file/20494" /></td>
</tr>
<tr>
<td width="9" class="right_container_hdr"></td>
<td class="right_container_hdr">
<h4>Related Links</h4>
</td>
<td class="right_container_hdr"></td>
</tr>
<tr>
<td colspan="3" valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/file/20494" /></td>
</tr>
<tr>
<td height="15"></td>
<td valign="middle"><a href="http://www.intel.com/software/graphics" title="Intel Visual Computing Home">Visual Computing Home</a></td>
<td></td>
</tr>
<tr>
<td></td>
<td>
<h3>Intel<sup>®</sup> Technologies</h3>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top"><a href="http://www.intel.com/software/sandybridge">Sandy Bridge</a><br /><a href="http://software.intel.com/en-us/articles/integrated-graphics/" title="Intel Visual Computing Technologies Integrated Graphics">Graphics</a><br /><a href="http://software.intel.com/en-us/articles/parallel-programming-vc/" title="Intel Visual Computing Technologies Parallel Programming">Parallel Programming</a></td>
<td></td>
</tr>
<tr>
<td colspan="3" valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/file/20494" /></td>
</tr>
<tr>
<td></td>
<td>
<h3>Focus Areas</h3>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top"><a href="http://software.intel.com/en-us/articles/game-dev/" title="Intel Game Development Focus Area">Game Development</a><br /><a href="http://software.intel.com/en-us/articles/artist-animator/" title="Intel Visual Computing Artist/Animator Focus Area">Artist/Animator</a><br /><a href="http://software.intel.com/en-us/articles/media/" title="Intel Visual Computing Media Focus Area">Media</a></td>
<td></td>
</tr>
<tr>
<td colspan="3" valign="top" height="4"></td>
</tr>
<tr>
<td></td>
<td>
<h3>Develop</h3>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top"><a href="http://software.intel.com/en-us/articles/tools-vc/" title="Intel Visual Computing Development Tools">Tools</a><br /><a href="http://software.intel.com/en-us/articles/code/" title="Intel Visual Computing Development Code">Code</a></td>
<td></td>
</tr>
<tr>
<td colspan="3" valign="top" height="4"></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</center><!--END right column Content --></td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/dynamic-resolution-rendering/</link>
      <pubDate>Thu, 31 Mar 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/dynamic-resolution-rendering/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/dynamic-resolution-rendering/</guid>
      <category>Visual Computing</category>
      <category>Code &amp; Downloads</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Morphological Antialiasing (MLAA)</title>
      <description><![CDATA[ <link media="screen" href="http://software.intel.com/media/gamedev/css/3302_Intel_VC_01.css?v=11" type="text/css" rel="stylesheet" />
<link media="screen" href="http://software.intel.com/file/23729" type="text/css" rel="stylesheet" />
<table width="100" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td valign="top">
<div id="left_container">
<div id="header_content"><a href="http://software.intel.com/en-us/visual-computing/" title="Visual Computing Developer Community"><img height="96" width="727" src="http://software.intel.com/file/20493/" border="0" /></a></div>
<div id="left_content_container2">
<div id="showcase_01">
<div >
<h2>Morphological Antialiasing</h2>
<p>MLAA is an image-based, post-process filtering technique which identifies discontinuity patterns and blends colors in the neighborhood of these patterns to perform effective anti-aliasing. It is the precursor of a new generation of <a href="http://www.iryoku.com/mlaa/">real-time antialiasing techniques that rival MSAA</a>. This sample is based on the original, <a href="http://visual-computing.intel-research.net/publications/papers/2009/mlaa/mlaa.pdf">CPU-based MLAA implementation provided by Reshetov in 2009</a>, with improvements to greatly increase performance. The improvements are:<br /><br />o Integration of a new, efficient, easy-to-use tasking system implemented on top of Intel® Threading Building Blocks (Intel® TBB).<br />o Integration of a new, efficient, easy-to-use pipelining system for CPU onloading of graphics tasks.<br />o Improvement of data access patterns through a new transposing pass.<br />o Increased use of Intel® SSE instructions to optimize discontinuity detection and color blending.<br /><br /><br /><a href="javascript:void(0)" onclick="ndownload('http://software.intel.com/file/37247')"><img src="http://software.intel.com/file/25370" border="0" /></a></p>
</div>
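<p>The first step described above, identifying discontinuities between neighboring pixels, can be sketched in a few lines. The Python below is a simplified illustration and is neither Reshetov's implementation nor the sample's SSE code. The function name find_discontinuities, the use of a single luminance value per pixel, and the threshold of 0.1 are assumptions.</p>

```python
def find_discontinuities(image, threshold=0.1):
    """Flag pixels whose right or bottom neighbor differs noticeably.

    image is a list of rows of luminance values in [0, 1]. Returns two
    grids of booleans: edges toward the right and edges toward the bottom.
    """
    height, width = len(image), len(image[0])
    right = [[False] * width for _ in range(height)]
    bottom = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if x + 1 != width:
                right[y][x] = abs(image[y][x] - image[y][x + 1]) > threshold
            if y + 1 != height:
                bottom[y][x] = abs(image[y][x] - image[y + 1][x]) > threshold
    return right, bottom


# A tiny two-row image with a diagonal step between dark and bright pixels.
step = [[0.0, 0.0, 1.0],
        [0.0, 1.0, 1.0]]
right_edges, bottom_edges = find_discontinuities(step)
```

<p>A full MLAA pass would go on to classify the shapes these edges form and blend colors along them. That part is omitted here.</p>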
<div >
<p>
<object height="203" width="360" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" id="v_8600_1986">
<param name="id" value="v_8600_1986" />
<param name="name" value="v_8600_1986" />
<param name="flashvars" value="file=http://software.intel.com/media/videos/8/c/2/4/9/6/7/MLAA.mp4&amp;image=http://software.intel.com/media/videos/8/c/2/4/9/6/7/8c249675aea6c3cbd91661bbae767ff1_player.jpg&amp;autostart=false&amp;bufferlength=5&amp;allowfullscreen=true&amp;plugins=http://software.intel.com/common/swf/listen&amp;title=CPU-Based+MLAA+Implementation+" />
<param name="allowfullscreen" value="true" />
<param name="src" value="http://software.intel.com/common/swf/mediaplayer.swf" /><embed src="http://software.intel.com/common/swf/mediaplayer.swf" allowfullscreen="true" flashvars="file=http://software.intel.com/media/videos/8/c/2/4/9/6/7/MLAA.mp4&amp;image=http://software.intel.com/media/videos/8/c/2/4/9/6/7/8c249675aea6c3cbd91661bbae767ff1_player.jpg&amp;autostart=false&amp;bufferlength=5&amp;allowfullscreen=true&amp;plugins=http://software.intel.com/common/swf/listen&amp;title=CPU-Based+MLAA+Implementation+" type="application/x-shockwave-flash" name="v_8600_1986" height="203" width="360" id="v_8600_1986"></embed>
</object>
</p>
<b>
<p ><b>Video:</b> <a href="http://software.intel.com/en-us/videos/cpu-based-mlaa-implementation/">CPU Based MLAA Implementation - Developer Walkthrough </a>from Alexandre De Pereyra (click to view larger)<br /><br /><br /><b>Read: </b><a href="http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu/">MLAA: Efficiently Moving Antialiasing from the GPU to the CPU</a> by Alexandre De Pereyra<br /><b><br />Blog Post:</b> <a href="http://software.intel.com/en-us/blogs/2011/07/18/cpu-morphological-antialiasing-mlaa-sample-now-live/">CPU Morphological Antialiasing (MLAA) sample now live!</a> - Blog post by Josh Doss</p>
</b></div>
<br clear="all" />
<div>
<table bgcolor="#ffffff" width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td><img height="37" width="531" src="http://software.intel.com/file/25372" /></td>
<td></td>
</tr>
</tbody>
</table>
<table bgcolor="#ffffff" cellpadding="0" bordercolor="#ffffff" cellspacing="6" border="0">
<tbody>
<tr>
<td width="350" valign="top">
<div ><a alt="MLAA_Found_Edges.PNG" href="http://software.intel.com/file/37117" title="MLAA_Found_Edges.PNG"><img src="http://software.intel.com/file/37114" alt="MLAA_Found_Edges_thumb.jpg" title="MLAA_Found_Edges_thumb.jpg" /></a></div>
</td>
<td width="350" valign="top">
<div ><a href="http://software.intel.com/file/37119"><img src="http://software.intel.com/file/37116" alt="MLAA_MLAA_ZB_thumb.jpg" title="MLAA_MLAA_ZB_thumb.jpg" /></a></div>
</td>
</tr>
<tr>
<td valign="top">
<table cellpadding="2" cellspacing="0" border="0" >
<tbody>
<tr>
<td valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>
<p><i>Viewing the found edges can be helpful when debugging MLAA. <br /></i></p>
</td>
</tr>
</tbody>
</table>
</td>
<td valign="top">
<table cellpadding="2" cellspacing="0" border="0" >
<tbody>
<tr>
<td valign="top" height="4"><img height="4" width="4" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>
<p><i>The MLAA sample contains a zoom box to compare different anti-aliasing techniques.<br /><br /></i></p>
</td>
</tr>
</tbody>
</table>
</td>
<td valign="top">
<div ></div>
</td>
</tr>
</tbody>
</table>
</div>
<br /><br /><!-- start of 3 column table -->
<table width="695" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td width="695" rowspan="2" valign="top">
<table width="695" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td valign="top"><img height="8" width="697" src="http://software.intel.com/file/22889" /></td>
</tr>
<tr>
<td valign="top" class="panel_bg_02">
<table width="695" cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr>
<td width="12" rowspan="2"><img height="8" width="12" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
<td valign="top" height="4"><img height="8" width="8" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top">
<table width="100%" cellpadding="2" cellspacing="0" border="0">
<tbody>
<tr>
<td width="65%" valign="top" height="19" class="arrow" ><span ><b>System Requirements</b></span></td>
<td width="35%" valign="top" height="19" class="arrow" ><span ><b><a href="http://software.intel.com/en-us/articles/code/">Additional Code Samples</a></b></span></td>
</tr>
<tr>
<td width="33%" valign="top" height="19" class="arrow" ></td>
<td width="33%" valign="top" height="19" class="arrow" ></td>
</tr>
<tr>
<td width="33%" valign="top" height="19" ><ol>
<li>CPU: 2nd Generation Core i5 or better suggested</li>
<li>GFX: DX10 capable graphics card </li>
<li>OS: Microsoft Windows Vista* or Microsoft Windows 7*</li>
<li>MEM: 2 GB of RAM or better </li>
<li>Software:
<ul>
<li>DirectX SDK (June 2010 release or later)</li>
<li>Build with Microsoft Visual Studio 2008* w/SP1 or Visual Studio 2010*</li>
</ul>
</li>
</ol>
<p>* Other names and brands may be claimed as the property of others.</p>
</td>
<td width="33%" valign="top" height="19" >
<ul>
<li><a href="http://software.intel.com/en-us/articles/sandy-bridge-samples/" title="Sandy Bridge Samples">Sandy Bridge Samples</a></li>
<li><a href="http://software.intel.com/en-us/articles/game-engine-tasking-animation/">Game Engine Tasking</a></li>
<li><a href="http://software.intel.com/en-us/articles/shadowexplorer/">Shadow Explorer</a></li>
<li><a href="http://software.intel.com/en-us/articles/onloaded-shadows/" title="Onloaded Shadows">Onloaded Shadows</a></li>
<li><a href="http://software.intel.com/en-us/articles/avx-cloth/" title="AVX Cloth">AVX Cloth</a></li>
<li><a href="http://software.intel.com/en-us/articles/gpu-detect-sample/">GPU Detect</a></li>
</ul>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<!--bottom border for large box-->
<tr>
<td valign="top"><img height="8" width="697" src="http://software.intel.com/media/gamedev/_images/footer-bg-01.gif" /></td>
</tr>
<!--end border-->
</tbody>
</table>
</td>
<td width="10" rowspan="2"><img height="10" width="10" src="http://software.intel.com/media/gamedev/_images/blank.gif" /></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td width="344" valign="top"></td>
</tr>
</tbody>
</table>
<!-- end of 3 column table --><br /><br /></div>
</div>
</div>
</td>
<td valign="top" ><!-- RHC -->
</td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/mlaa/</link>
      <pubDate>Wed, 30 Mar 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/mlaa/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/mlaa/</guid>
      <category>Visual Computing</category>
      <category>Code &amp; Downloads</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Free Speedup with Compiler Switches for Fast Math and Intel® Streaming SIMD Extensions</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/34689">Free Speedup with Compiler Switches for Fast Math and Intel® Streaming SIMD Extensions </a>[PDF 238KB]<br /><br /><br />
<h2 class="sectionHeading">Objective</h2>
The intention of this introductory article is to make developers aware of a simple optimization step they can do that doesn't require much effort or specialized knowledge.<br /><br /><br />
<h2 class="sectionHeading">Abstract</h2>
Compilation that can utilize Intel® Streaming SIMD Extensions (Intel® SSE) instructions, available on most x86 CPUs, can improve floating point performance even if the source code is not set up for single instruction multiple data (SIMD) processing. However, the Microsoft Visual Studio* compiler option to enable Intel SSE instructions is not enabled by default. This mini-article describes the simple steps to enable these instructions, as well as showing how to quickly recognize if your code is being properly optimized or not. In a sample floating-point-intensive serial loop, the performance improves 2X with just a recompilation.<br /><br /><br />
<h2 class="sectionHeading">Introduction</h2>
Each generation of Intel® architecture brings hardware improvements and new assembly instructions. The original Intel® Pentium® processor included the x87 math coprocessor, which ran the original floating point math instructions. In 1999, Intel introduced the first generation of Intel SSE with the Intel® Pentium® III processor, followed by Intel® Streaming SIMD Extensions 2 (Intel® SSE2) with the Intel® Pentium® 4 processor in 2000. These instructions have been around for a long time and are available on most CPUs in use today. In addition to the single instruction multiple data (SIMD) parallel instructions, Intel SSE includes a set of <b>serial instructions</b> that are faster than their x87 counterparts.<br /><br />A Windows* application developer using the Microsoft Visual Studio* C++ compiler will typically get the code debugged and working in Debug mode. With features implemented and correctness ensured, the next step is usually to make the application faster by building the program in Release mode. Unlike the Debug configuration, Release settings are generally intended for faster execution. Release mode is not a specific compiler option, but rather a "configuration", or set of compiler options. The specific compiler setting to enable Intel SSE instructions is turned off by default, yet for the compiler to generate the fastest possible program, it needs to be allowed to emit the best instructions for the job. The default is easy to override by using the <i>/arch</i> compiler option either on the compile command line or from within Microsoft Visual Studio* by changing the <i>Enable Enhanced Instruction Set</i> option in the project's C/C++ code generation properties dialog. The downside is that older CPUs from the 1990s will not be able to run a binary executable that includes Intel SSE instructions. In practice, many applications target a minimum hardware specification that is already at or above Intel SSE2. 
All applications that require a dual-core processor can assume that Intel® Streaming SIMD Extensions 3 (Intel® SSE3) is available (<a href="http://en.wikipedia.org/wiki/SSE3" target="_blank">http://en.wikipedia.org/wiki/SSE3</a>).<br /><br />The remainder of this article walks through the steps to enable Intel SSE instructions, demonstrates how to check what the compiler is generating, and finally shows the performance improvements that are possible with just serial Intel SSE operations.<br /><br /><br />
<h2 class="sectionHeading">Walkthrough</h2>
Getting the compiler to use Intel SSE and verifying the program (.exe) is using these instructions is easy. Checking the output of the compiler does not require experience with (or even willingness to deal with) assembly programming. If you can distinguish between the letter 'f' and 's' then you have the skills to distinguish the patterns that indicate which architecture features are being used. Navigating to where you need to go can be done with a breakpoint and selecting a right-click menu option.<br /><br />For this walkthrough, we consider a simple example loop that normalizes an array of 3D vectors.<br /><br />
<pre name="code" class="cpp">    for(int i=0; i!=n ;i++)<br />        N[i] = V[i] / magnitude(V[i]);<br /></pre>
Obviously the first thing that needs to be done is to ensure the code is working properly in Debug mode. Assuming this is so, the next step is to compile in Release mode. Next put a breakpoint at this loop, and then run the program by hitting F5. When the running program hits the breakpoint, right-click to bring up the menu and select Go To Disassembly.<br /><br /><img src="http://software.intel.com/file/34690" /><br /><br />If the compiler generated code using x87 instructions, then the disassembly view will appear similar to the following:<br /><br />
<pre name="code" class="cpp">      for(int i=0; i!=n ;i++)<br />  003238C8  mov         esi,dword ptr [ebp+8]<br />  003238CB  push        edi<br />  003238CC  mov         edi,dword ptr [dest]<br />  003238CF  add         esi,8<br />  003238D2  mov         ebx,4000h<br />        N[i] = V[i] / magnitude(V[i]);<br />  003238D7  fld         dword ptr [esi-4]<br />  003238DA  fld         dword ptr [esi-8]<br />  003238DD  fld         dword ptr [esi]<br />  003238DF  fld         st(1)<br />  003238E1  fmulp       st(2),st<br />  003238E3  fld         st(2)<br />  003238E5  fmulp       st(3),st<br />  003238E7  fxch        st(1)<br />  003238E9  faddp       st(2),st<br />  003238EB  fmul        st(0),st<br />  003238ED  faddp       st(1),st<br />  003238EF  fstp        dword ptr [ebp-4]<br />  003238F2  fld         dword ptr [ebp-4]<br />  003238F5  call        _CIsqrt (3255B0h)<br />  003238FA  fstp        dword ptr [ebp-4]<br />  ...<br /></pre>
The assembly instructions beginning with the letter 'f', including fmul, fld, faddp, and fmulp, are legacy x87 math coprocessor instructions. Furthermore, the second-to-last line, call _CIsqrt, makes a function call to compute the square root rather than computing it inline. This sort of assembly is not ideal for high-performance code.<br /><br /><img src="http://software.intel.com/file/34691" /><br /><br />To change this, open the project properties by right-clicking the project file and selecting 'Properties'. In the Properties dialog, expand the C/C++ group and click on Code Generation. Among the options that appear on the right will be Enable Enhanced Instruction Set. Change this setting to Streaming SIMD Extensions 2 (/arch:SSE2). Also change the Floating Point Model to Fast (/fp:fast), which lets the compiler relax strict floating point semantics, for example by keeping intermediate results in 32-bit single precision instead of promoting them to double. Changing these options requires a recompile of all the code to take effect.<br /><br />To inspect the new assembly, rerun the program; it should stop at the same breakpoint. Note that sometimes the optimizer makes it hard for the debugger to line up the source code with the assembly. If the compiler is aggressive with putting code inline, the program might never hit the breakpoint. To work around this, move the breakpoint to where the function gets called. However you manage to view the disassembly, the guts of the loop should now look like:<br /><br />
<pre name="code" class="cpp">        01385E80  movss       xmm0,dword ptr [eax-4]<br />        01385E85  movss       xmm1,dword ptr [eax-8]<br />        01385E8A  movss       xmm2,dword ptr [eax]<br />        01385E8E  movaps      xmm4,xmm0<br />        01385E91  mulss       xmm4,xmm0<br />        01385E95  movaps      xmm3,xmm1<br />        01385E98  mulss       xmm3,xmm1<br />        01385E9C  addss       xmm3,xmm4<br />        01385EA0  movaps      xmm4,xmm2<br />        01385EA3  mulss       xmm4,xmm2<br />        01385EA7  addss       xmm3,xmm4<br />        01385EAB  sqrtss      xmm4,xmm3<br />        01385EAF  movaps      xmm3,xmm5<br />        01385EB2  divss       xmm3,xmm4<br />        01385EB6  mulss       xmm0,xmm3<br />        01385EBA  mulss       xmm1,xmm3<br />        01385EBE  mulss       xmm2,xmm3<br />        01385EC2  movss       dword ptr [p],xmm1<br />        01385EC7  movss       dword ptr [ebp-28h],xmm0<br />        01385ECC  movq        xmm0,mmword ptr [p]<br />        01385ED1  movss       dword ptr [ebp-24h],xmm2<br /></pre>
The code here is using SSE serial instructions. The ss suffix on addss, mulss, and sqrtss indicates an add, multiply, or square root on one (serial) 32-bit single precision floating point number. The xmm registers are 128 bits wide, but only the lowest 32 bits are actually used.<br /><br /><br />
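The loop in this walkthrough relies on a vector type and a magnitude function that the article does not show. The following is a minimal sketch of what they might look like, so the experiment can be reproduced; the names float3, operator/ and normalize_all are assumptions made for illustration, not the author's code.<br /><br />

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical 3D vector type; the article does not show its definition.
struct float3 {
    float x, y, z;
};

// Component-wise division by a scalar, used by the loop as V[i] / magnitude(V[i]).
inline float3 operator/(const float3& v, float s) {
    return float3{v.x / s, v.y / s, v.z / s};
}

// Euclidean length of v. The three multiplies, two adds and the square root
// correspond to the mulss/addss/sqrtss sequence in the SSE listing.
inline float magnitude(const float3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Normalizes n vectors from V into N. Zero-length inputs are not handled.
void normalize_all(float3* N, const float3* V, std::size_t n) {
    for (std::size_t i = 0; i != n; i++)
        N[i] = V[i] / magnitude(V[i]);
}
```

Building a file like this once with the default settings and once with /arch:SSE2 and /fp:fast, then comparing the disassembly of normalize_all, should reproduce the 'f' versus 'ss' pattern described above, although the exact instructions depend on the compiler version.<br /><br /><br />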
<h2 class="sectionHeading">Performance Results</h2>
Although the assembly code isn't taking advantage of the full 128 bits that Intel SSE offers, it is still faster than the previous x87 code. Using x87, the runtime is 45 cycles per loop, whereas it only takes about 23 cycles per loop after flipping the fast math and Intel SSE switches on. These results were generated on an Intel® Core™ i7 processor and may differ on other x86 processors. Furthermore, any such results are dependent on the compiler and on how the source code is written. Note that this example was an ideal case showing only the timing difference of this particular loop - not the overall results of the application. Furthermore, not all floating point sections of code will be able to demonstrate this amount of speedup.<br /><br /><img src="http://software.intel.com/file/34692" /><br /><br /><br />
<h2 class="sectionHeading">Conclusion and Further Performance Improvements</h2>
In this introductory article we showed how Release code can be made even faster simply by flipping a switch to enable the compiler to use Intel SSE <b>serial</b> instructions and fast math. The performance benefits occur even without touching the source code. It is such a simple thing to do that you should feel free to tap your coworker on the shoulder and pass on this tidbit of knowledge.<br /><br />This loop could actually be sped up even more. Rather than using costly sqrt and div instructions, Intel SSE has a much faster approximate inverse square root assembly instruction whose result may be sufficient as is, or can be refined with Newton-Raphson, depending on how much accuracy is required in the normalized result. Further benefits come from utilizing the parallel rather than the serial instructions. In other words, it is possible to normalize more than one vector at a time. However, these next steps require some extra programming effort, usage of libraries or header files that have already been SIMD optimized, or a compiler that can automatically vectorize the code. SIMD with Intel SSE is a widely covered topic with many articles and examples available on the web, including the <a href="http://software.intel.com">Intel® Software Network</a>. Follow-up articles to this one will dive deeper into how to effectively use SIMD and talk about the Intel® Advanced Vector Extensions (Intel® AVX) to x86 that are now available in hardware (<a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a>). In particular, using 256-bit AVX, it is possible to rearrange the data on-the-fly and normalize 8 vectors at a time, bringing the average runtime cost down to 2.7 cycles per vector.
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Stan Melax</div>
</div>
<div id="vc-meta-pubdate">03-07-2011</div>
<div id="vc-meta-modificationdate">03-07-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Performance Analysis</div>
<div>Intel® AVX</div>
<div>Intel® SSE</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/free-speedup-with-compiler-switches-for-fast-math-and-intel-streaming-simd-extensions/</link>
      <pubDate>Tue, 08 Mar 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/free-speedup-with-compiler-switches-for-fast-math-and-intel-streaming-simd-extensions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/free-speedup-with-compiler-switches-for-fast-math-and-intel-streaming-simd-extensions/</guid>
      <category>Visual Computing</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Don&#39;t Spill That Register - Ensuring Optimal Performance From Intrinsics</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/34693">Don't Spill That Register - Ensuring Optimal Performance From Intrinsics</a> [PDF 79KB]<br /><br /><br />
<h2 class="sectionHeading">Objective</h2>
The goal of this article is to help developers ensure their C/C++ code with intrinsics produces the optimal assembly, and to show how to spot unnecessary register spilling. <br /><br /><br />
<h2 class="sectionHeading">Abstract</h2>
Programming with intrinsics can be as optimal as implementing code directly in assembly. Compared to assembly code, C/C++ code using intrinsics is subject to more compilation steps to generate the final code. Compilation in Debug mode, and possibly in Release mode with improperly set compilation flags, may generate code with seemingly unnecessary instructions for copying registers to and from the stack. Register copying can also result from the source code using more __m256 or __m128 variables than the number of corresponding registers available in hardware. From a simple example using intrinsics, this short article shows good and bad assembly produced and then explains what happened and how to avoid it. <br /><br /><br />
<h2 class="sectionHeading">Introduction</h2>
The topics of x86 intrinsics, Intel® Streaming SIMD Extensions (Intel® SSE), and the Intel® Advanced Vector Extensions (Intel® AVX) are discussed in detail online on the Intel Software Network site (<a href="http://software.intel.com/">http://software.intel.com</a>). An intrinsic looks just like a function call in C/C++ code, but the compiler sees it and turns that intrinsic into a single line of assembly. For example, consider the following code: <br /><br />
<pre name="code" class="cpp">         __m128 a = _mm_rsqrt_ss(b);  // a = 1.0f/sqrt(b) approx<br /></pre>
This line of code will cause the compiler to emit an RSQRTSS instruction at this spot. <br /><br />Intrinsics let the programmer do instruction level optimization directly, but without the burden of dealing with register allocation, loop syntax, etc. Developers sometimes ask, "Are intrinsics as optimal as assembly?" The answer is usually yes, or at least close to optimal. Furthermore, code with intrinsics is more future-proof, since code initially written for Intel SSE can be recompiled using Intel AVX. Intel AVX versions normally run faster than their Intel SSE counterparts on the same hardware. Given the ease of use and forward compatibility, intrinsics are the logical choice for optimizing to the hardware. <br /><br />To use intrinsics with confidence that the program is optimal, it is worth knowing how the code gets compiled and what to look out for. We look at a short intrinsic example and the corresponding assembly that should result, as well as a compilation that generated suboptimal results.<br /><br /><br />
<h2 class="sectionHeading">Generated Assembly</h2>
For this example, we use an intrinsic implementation of the loop s[i]=a*x[i]+y[i], commonly known as "saxpy", and show the code generated. <br /><br />
<pre name="code" class="cpp">  inline void saxpy_simd4(float* S,float _a,const float* X,const float *Y,int n)<br />  {<br />     __m128 a = _mm_set1_ps(_a); <br />     for(int i=0; i!=n ;i+=4)  // process 4 elements at a time<br />     {<br />         __m128 x = _mm_load_ps(X+i);<br />         __m128 y = _mm_load_ps(Y+i);  <br />         __m128 s = _mm_add_ps(_mm_mul_ps(a,x),y);  // a*x + y <br />         _mm_store_ps(S+i, s );<br />     }<br />  }<br /></pre>
When this x86 code gets compiled, it should look something like: <br /><br />
<pre name="code" class="cpp">    (AVX assembly instructions from loop listed only):<br /> 001B4FF0  vmulps      xmm1,xmm0,xmmword ptr xpoints (1B97A0h)[eax]  <br /> 001B4FF8  vaddps      xmm1,xmm1,xmmword ptr ypoints (1B9390h)[eax]  <br /> 001B5000  vmovaps     xmmword ptr dest (1C0440h)[eax],xmm1  <br /> 001B5008  add         eax,10h  <br /> 001B500B  cmp         eax,400h  <br /> 001B5010  jl          saxpy_128+20h (1B4FF0h)  <br /></pre>
This assembly sequence was generated by Microsoft Visual Studio* C++ Compiler 2010 with default release mode settings and /arch:AVX added to the command line. Only the instructions inside the loop, which are repeated many times, are shown. Register xmm0 is initially loaded with the constant a. The first 3 assembly instructions map directly to the intrinsics in the C++ code. The assembly is actually shorter than the corresponding C code, since the register loading intrinsics have been combined with the vmulps and vaddps instructions. The last 3 instructions implement the for loop over i. <br /><br />Compiling this small function in another project without the optimization flag set resulted in the following assembly:<br /><br />
<pre name="code" class="cpp"> 00B5F7C2  mov         eax,dword ptr [i]  <br /> 00B5F7C5  add         eax,4  <br /> 00B5F7C8  mov         dword ptr [i],eax  <br /> 00B5F7CB  cmp         dword ptr [i],100h  <br /> 00B5F7D2  jge         saxpy_128+122h (0B5F872h)  <br /> 00B5F7D8  mov         eax,dword ptr [i]  <br /> 00B5F7DB  vmovaps     xmm0,xmmword ptr xpoints (0B715A0h)[eax*4]  <br /> 00B5F7E4  vmovaps     xmmword ptr [ebp-1D0h],xmm0  <br /> 00B5F7EC  vmovaps     xmm0,xmmword ptr [ebp-1D0h]  <br /> 00B5F7F4  vmovaps     xmmword ptr [x],xmm0  <br /> 00B5F7F9  mov         eax,dword ptr [i]  <br /> 00B5F7FC  vmovaps     xmm0,xmmword ptr ypoints (0B71180h)[eax*4]  <br /> 00B5F805  vmovaps     xmmword ptr [ebp-1B0h],xmm0  <br /> 00B5F80D  vmovaps     xmm0,xmmword ptr [ebp-1B0h]  <br /> 00B5F815  vmovaps     xmmword ptr [y],xmm0  <br /> 00B5F81A  vmovaps     xmm0,xmmword ptr [x]  <br /> 00B5F81F  vmovaps     xmm1,xmmword ptr [a]  <br /> 00B5F824  vmulps      xmm0,xmm1,xmm0  <br /> 00B5F828  vmovaps     xmmword ptr [ebp-190h],xmm0  <br /> 00B5F830  vmovaps     xmm0,xmmword ptr [y]  <br /> 00B5F835  vmovaps     xmm1,xmmword ptr [ebp-190h]  <br /> 00B5F83D  vaddps      xmm0,xmm1,xmm0  <br /> 00B5F841  vmovaps     xmmword ptr [ebp-170h],xmm0  <br /> 00B5F849  vmovaps     xmm0,xmmword ptr [ebp-170h]  <br /> 00B5F851  vmovaps     xmmword ptr [s],xmm0  <br /> 00B5F859  vmovaps     xmm0,xmmword ptr [s]  <br /> 00B5F861  mov         eax,dword ptr [i]  <br /> 00B5F864  vmovaps     xmmword ptr dest (0B78240h)[eax*4],xmm0  <br /> 00B5F86D  jmp         saxpy_128+72h (0B5F7C2h)  <br /></pre>
Here we see that the compiler issued additional instructions that do not correspond to loop management or intrinsics in the original source code. What is happening here is that after registers are loaded from memory, they are copied to and from the stack. The explanation is that, from the C language perspective, the __m128 variables reside on the stack, and the compiler is just putting the data to the place where it was declared. It is the O2 optimization step, not the fact that we used intrinsics, that is normally responsible for removing such unnecessary copying. The extra copying will likely happen in Debug, but may also happen in Release mode if the project's Optimization setting is not Maximum Speed or /O2. The example shown here compiles to Intel AVX, but the same thing happens with Intel SSE as well. <br /><br /><br />
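Separately from the optimization settings, the saxpy_simd4 routine shown earlier uses aligned loads and steps by 4, so it implicitly assumes 16-byte aligned arrays and an n that is a multiple of 4. A hedged sketch of a variant that drops both assumptions follows; saxpy_any is a hypothetical name, and the unaligned load and store intrinsics may be slower than the aligned ones on some processors.<br /><br />

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Sketch of s[i] = a*x[i] + y[i] for arbitrary n and unaligned pointers.
// saxpy_any is a hypothetical name; the article's saxpy_simd4 assumes
// 16-byte aligned data and n divisible by 4.
inline void saxpy_any(float* S, float a_, const float* X, const float* Y, int n) {
    const __m128 a = _mm_set1_ps(a_);
    int i = 0;
    for (; i + 4 <= n; i += 4) {           // 4 elements per iteration
        __m128 x = _mm_loadu_ps(X + i);    // unaligned load
        __m128 y = _mm_loadu_ps(Y + i);
        _mm_storeu_ps(S + i, _mm_add_ps(_mm_mul_ps(a, x), y));
    }
    for (; i < n; i++)                     // scalar tail for the last n % 4 elements
        S[i] = a_ * X[i] + Y[i];
}
```

As with the original routine, the disassembly of a Release build with /O2 should contain only the loads, the multiply, the add, the store, and the loop bookkeeping, with no extra copies to and from the stack.<br /><br /><br />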
<h2 class="sectionHeading">Register Shortage</h2>
When temporary __m128 or __m256 variables are used for single instruction multiple data (SIMD) programming, the optimizing compiler usually does a good job of keeping these values in registers. Even with optimizations enabled, however, the compiler may still generate assembly code that copies temporary values to the stack. Consider, for example, a 3D spring (distance constraint) update written in the hybrid structure of arrays (SOA) style. The following example is based on code from the AVX cloth sample available on Intel's website: <br /><br />
<pre name="code" class="cpp">  // N (the number of groups of 8 springs) is assumed to be defined elsewhere<br />  void springupdate(__m256 A[][3], __m256 B[][3],__m256 &amp;restlen)<br />  {<br />    __m256 half = _mm256_set1_ps(0.5f);<br />    for(int i=0;i != N ; i++)  // 8*N constraints in total<br />    {<br />      // each a and b contain the xyz endpoints for 8 pseudo-springs<br />      __m256 *a=A[i];<br />      __m256 *b=B[i];<br />      __m256 vx  = _mm256_sub_ps(b[0],a[0]); // v.x=b.x-a.x<br />      __m256 vy  = _mm256_sub_ps(b[1],a[1]); // v.y=b.y-a.y<br />      __m256 vz  = _mm256_sub_ps(b[2],a[2]); // v.z=b.z-a.z<br />      __m256 dp  = vx*vx+vy*vy+vz*vz;        // squared length; assume operator overloads for add and mul <br />      __m256 imag= _mm256_rsqrt_ps(dp);      // approximate inverse magnitude<br />      // normalize v<br />      vx = _mm256_mul_ps(vx,imag); // vx *= inverse magnitude <br />      vy = _mm256_mul_ps(vy,imag); // vy *= imag<br />      vz = _mm256_mul_ps(vz,imag); // vz *= imag <br />      // dp*imag is the magnitude, so this is half of (magnitude - rest length)<br />      __m256 half_stretch = ( dp*imag - restlen) * half;<br />      // move endpoints a and b together <br />      a[0]=a[0]+ vx * half_stretch;    <br />      a[1]=a[1]+ vy * half_stretch;    <br />      a[2]=a[2]+ vz * half_stretch;    <br />      b[0]=b[0]- vx * half_stretch;    <br />      b[1]=b[1]- vy * half_stretch;    <br />      b[2]=b[2]- vz * half_stretch;   <br />    }   <br />  }<br /></pre>
For brevity, the above code assumes the obvious operator overloads are implemented and inlined. Even if all intrinsic calls were written out by hand, there is a good chance that compiling a routine like this will produce assembly code that copies data to and from the stack, loads the same data from arrays A and B multiple times, or both. While there is no limit to how many variables of this type a programmer can declare, there is only a limited number of hardware registers. When Intel AVX code is compiled into a 32-bit executable, the compiler has only 8 YMM registers available. Even with this oversimplified distance constraint equation, the code uses 6 registers for the endpoints, another 3 for the vector v between them, and one each for the inverse magnitude, rest length, magnitude minus rest length, half constant, and half stretch amount. That is 14 values in total, so the compiler must use the same register for more than one variable in this loop. It will therefore have to reload values from (and possibly copy values to) the stack. In this situation, the way to avoid register spilling is to compile the code as a 64-bit executable. Then, instead of just 8, the compiler has 16 YMM (256-bit) registers at its disposal, which is more than enough for this particular simulation. <br /><br /><br />
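The register tally in the paragraph above can be written out as a small compile-time check. This is only a sketch of the arithmetic: the constant and function names are hypothetical, and the total is an upper bound on register pressure, because a compiler can reuse a register once a value is no longer needed. <br /><br />

```cpp
// Values the text lists for one iteration of the springupdate loop.
constexpr int kEndpoints        = 6;  // a[0..2] and b[0..2]
constexpr int kVectorV          = 3;  // vx, vy, vz
constexpr int kInverseMagnitude = 1;  // imag
constexpr int kRestLength       = 1;  // restlen
constexpr int kMagMinusRest     = 1;  // dp*imag - restlen
constexpr int kHalfConstant     = 1;  // half
constexpr int kHalfStretch      = 1;  // half_stretch

constexpr int kLiveValues = kEndpoints + kVectorV + kInverseMagnitude +
                            kRestLength + kMagMinusRest + kHalfConstant +
                            kHalfStretch;

// YMM registers the compiler can use: 8 in a 32-bit build, 16 in a 64-bit build.
constexpr int ymm_registers(bool is_64bit) { return is_64bit ? 16 : 8; }

// True when every listed value could hold its own register.
constexpr bool fits_in_registers(bool is_64bit)
{
    return kLiveValues <= ymm_registers(is_64bit);
}

static_assert(!fits_in_registers(false), "a 32-bit build must share or spill registers");
static_assert(fits_in_registers(true), "a 64-bit build has room for all listed values");
```

Both assertions hold at compile time, which matches the conclusion that the 32-bit build must spill while the 64-bit build need not. <br /><br /><br />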
<h2 class="sectionHeading">Conclusion</h2>
Starting from a C/C++ intrinsics sample, we've shown both the good and the bad assembly code that a compiler can generate. The unnecessary register copying can result from compiler settings, such as building without fast-code optimization (/O2), or from building a 32-bit executable when more than 8 registers are needed. There may be other reasons why a compiler does not generate the assembly the programmer expects. Therefore, while intrinsics are often the preferred choice for code optimization, it is still a good idea to inspect the generated assembly to ensure the compiled result is as expected.
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Stan Melax</div>
</div>
<div id="vc-meta-pubdate">03-06-2011</div>
<div id="vc-meta-modificationdate">03-06-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div>Intel® AVX</div>
<div>Intel® SSE</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Compared to assembly code, C/C++ code using intrinsics is subject to more compilation steps to generate the final code. Improper compilation settings may result in assembly instructions that copy registers unnecessarily and reduce performance. Starting from a simple example using intrinsics, this short article shows the good and bad assembly produced, then explains what happened and how to avoid it.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics/</link>
      <pubDate>Mon, 07 Mar 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Game Development</category>
    </item>
  </channel></rss>
