<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Fri, 25 May 2012 23:00:47 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/visual-computing/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/visual-computing/type/technical-article/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Fluid Simulation for Video Games (part 13)</title>
      <description><![CDATA[ <p><b>By Dr. Michael J. Gourlay</b></p>
<h2 class="sectionHeading">Downloads</h2>
Download <a href="http://software.intel.com/file/43799">Fluid Simulation for Games (part 13)</a> [PDF 1.1MB]<br /> Download <a href="http://software.intel.com/file/43798">MjgIntelFluidDemo_Part13.rar</a> [RAR 2.3MB]<br /> <br /> <img src="http://software.intel.com/file/43797" height="471" width="740" /><br />
<p><b>Figure 1.</b> <i>Convex polyhedra interacting with a vortex particle fluid</i></p>
<h2 class="sectionHeading">Convex Obstacles</h2>
<p>Video games are compelling because they are interactive. Even visual effects should respond to other entities in the environment, especially those the user controls. Particle effects, including fluids, should therefore respond to rigid bodies of any shape. Those shapes should include airfoils that can experience lift.</p>
<p>This article, the thirteenth in a series, describes how to augment the fluid particle system described earlier so that it interacts with rigid bodies of any convex polyhedral shape and generates a lift-like force on those bodies. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-1/">Part 1</a> summarized fluid dynamics; <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-2/">part 2</a> surveyed fluid simulation techniques. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-3/">Part 3</a> and <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-4/">part 4</a> presented a vortex-particle fluid simulation with two-way fluid–body interactions that runs in real time. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-5/">Part 5</a> profiled and optimized that simulation code. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-6/">Part 6</a> described a differential method for computing velocity from vorticity, and <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-7/">part 7</a> showed how to integrate a fluid simulation into a typical particle system. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-8/">Part 8</a> explained how a vortex-based fluid simulation handles variable density in a fluid; <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-9/">part 9</a> described how to approximate buoyant and gravitational forces on a body immersed in a fluid with varying density. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-10/">Part 10</a> described how density varies with temperature, how heat transfers throughout a fluid, and how heat transfers between bodies and fluid. 
<a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-11/">Part 11</a> added combustion, a chemical reaction that generates heat. <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-12/">Part 12</a> explained how improper sampling caused unwanted jerky motion and described how to mitigate it.</p>
<p> </p>
<h2 class="sectionHeading">Collision Detection</h2>
<p>Detecting collisions between objects first entails computing the distance between them. Video games model most shapes with planar, polygonal faces and treat particles as though they have spherical shape. You therefore need to compute the distance between planes and spheres.</p>
<p> </p>
<b>The Math Behind Planes</b>
<p>A <i>plane</i> is a two-dimensional surface in a three-dimensional space. You can define a plane in various ways. One convenient representation uses a normal vector <img src="http://software.intel.com/file/43808" /> and a distance, <i>d</i>. The <i>plane equation</i> has this formulation:</p>
<p ><img src="http://software.intel.com/file/43809" /></p>
<p>All points with coordinates <img src="http://software.intel.com/file/43810" /> that satisfy this equation lie within the plane. When <img src="http://software.intel.com/file/43811" /> has unit length (<img src="http://software.intel.com/file/43812" />), <i>d</i> is the distance of the plane from the origin. You can use this value to represent a plane as a vector with four components (that is, a 4-vector plane): <img src="http://software.intel.com/file/43813" /> or <img src="http://software.intel.com/file/43814" />.</p>
<p>The distance, <i>D</i>, of a point <img src="http://software.intel.com/file/43815" /> from the plane <img src="http://software.intel.com/file/43814" /> is:</p>
<p ><img src="http://software.intel.com/file/43816" /></p>
<p>Notice that this equation has the same form as the equation for the plane, except that instead of equating it to zero, the formula tells you the distance of the point to the plane. This is obviously consistent because if a point has zero distance to the plane, then the point lies in the plane.</p>
<p>But wait a moment! This formula could result in negative values. For example, choose the point <img src="http://software.intel.com/file/43817" /> and the plane with <img src="http://software.intel.com/file/43818" /> and <i>d</i> = 1. The formula claims that point has negative distance from the plane. What the heck is <i>negative distance</i>?</p>
<p>A plane divides all of space into two halves. A <i>half-space</i> is the region of space on one side of a plane. The signed distance formula tells you whether a point lies in the positive half-space or the negative half-space of a plane, as Figure 2 depicts.</p>
<p ><img src="http://software.intel.com/file/43800" /></p>
<p ><b>Figure 2.</b> <i>Planes and half-spaces</i></p>
<p>Think of a plane as facing in the direction of its normal. Points behind the plane have negative distance. Points in front of the plane have positive distance. When a point lies behind a plane, we call its distance (which is negative) the <i>penetration depth</i>.</p>
<p>The <code>Plane</code> class shows an implementation of the planar formulae that builds upon a 4-vector class (Vec4):</p>
<img src="http://software.intel.com/file/43820" /><br /> <br />
<p>You can compute the distance of a sphere from a plane by computing the distance of the sphere's center (a point) from the plane, then subtracting the sphere's radius.</p>
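<p>The following is a minimal sketch of those formulae, not the <code>Plane</code> class from the demo code. The names <code>Vec3</code>, <code>Dot</code>, <code>SignedDistance</code>, and <code>SphereDistance</code> are hypothetical, and the sketch assumes the convention that points in the plane satisfy <code>Dot(normal, p) - d = 0</code>:</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// A plane stored as a unit normal and an offset d.
// Assumed convention: points p in the plane satisfy Dot(normal, p) - d = 0.
struct Plane
{
    Vec3  normal; // must have unit length
    float d;      // signed distance of the plane from the origin, along the normal
};

// Signed distance of a point from a plane:
// positive in front of the plane, negative behind it, zero in the plane.
inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    // The formula only yields a true distance when the normal has unit length.
    assert(std::fabs(Dot(plane.normal, plane.normal) - 1.0f) < 1.0e-4f);
    return Dot(plane.normal, point) - plane.d;
}

// Signed distance of a sphere's surface from a plane.
// A negative value means part of the sphere lies behind the plane.
inline float SphereDistance(const Plane& plane, const Vec3& center, float radius)
{
    return SignedDistance(plane, center) - radius;
}
```

<p>Under that convention, for a plane with normal (0, 0, 1) and <i>d</i> = 1, the origin lies behind the plane and has a negative signed distance.</p>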
<p> </p>
<p class="Note">Note: <i>Technically, the analogue of a plane (a hyperplane) exists in other dimensions. For example, in a two-dimensional space, a hyperplane is a line. In a four-dimensional space, a hyperplane is a three-dimensional volume. But this article discusses three-dimensional spaces, where planes are two dimensional.</i></p>
<p><b>Planes Make Convex Hulls</b></p>
<p>A <i>polytope</i> is a shape with flat sides. In two dimensions, polytopes are called <i>polygons</i>. In three dimensions, polytopes are called <i>polyhedra</i>.</p>
<p>A <i>convex shape</i> is one where any line segment between any two points in the shape is also in that shape. So, a <i>convex polytope</i> is such a shape with flat sides, as shown in Figure 3.</p>
<p>An array of planes can represent a convex polyhedron. You can describe each face, <i>i</i>, of the polyhedron using a plane representation, <img src="http://software.intel.com/file/43819" />. This is called a <i>half-space representation</i> (or <i>H-representation</i>).</p>
<p ><img src="http://software.intel.com/file/43894" /></p>
<p ><b>Figure 3.</b> <i>Convex versus non-convex polytopes</i></p>
<p>To determine whether a point is inside a polyhedron whose face normals point outward, compute the distance of that point to each face plane of the polyhedron. If all distances are negative, the point lies inside the polyhedron.</p>
<p ><img src="http://software.intel.com/file/43802" /></p>
<div ><b>Figure 4.</b> <i>Measuring distance or penetration depth between stationary spheres and polytopes<br /><br /></i></div>
<p>As Figure 4 depicts, computing the distance between a stationary point (or sphere) and the planes of a polytope does not always give an unambiguous measure of distance or penetration depth. Sometimes, the best measure of distance could be from an edge or vertex of the polytope. But the distance formula will always correctly tell whether a point is inside or outside a polytope. For a sphere near an edge or vertex, the formula is conservative: it can report an overlap slightly before the sphere actually touches the polytope.</p>
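<p>As a rough sketch of the inside test, assuming outward-facing unit normals and the convention <code>Dot(normal, p) - d = 0</code> for points in a plane, a half-space representation can be a plain array of planes. The names <code>ConvexHull</code>, <code>MakeBox</code>, and <code>IsInside</code> are hypothetical and are not taken from the demo code:</p>

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed convention: unit normal, and Dot(normal, p) - d = 0 for points in the plane.
struct Plane { Vec3 normal; float d; };

inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    return Dot(plane.normal, point) - plane.d;
}

// Half-space representation: one outward-facing plane per face.
typedef std::vector<Plane> ConvexHull;

// Hypothetical helper: an axis-aligned box centered on the origin.
inline ConvexHull MakeBox(float hx, float hy, float hz)
{
    assert(hx > 0.0f && hy > 0.0f && hz > 0.0f);
    ConvexHull box;
    box.push_back(Plane{{ 1.0f,  0.0f,  0.0f}, hx});
    box.push_back(Plane{{-1.0f,  0.0f,  0.0f}, hx});
    box.push_back(Plane{{ 0.0f,  1.0f,  0.0f}, hy});
    box.push_back(Plane{{ 0.0f, -1.0f,  0.0f}, hy});
    box.push_back(Plane{{ 0.0f,  0.0f,  1.0f}, hz});
    box.push_back(Plane{{ 0.0f,  0.0f, -1.0f}, hz});
    return box;
}

// A point is inside when it lies behind every face plane.
inline bool IsInside(const ConvexHull& hull, const Vec3& point)
{
    for (const Plane& plane : hull)
    {
        if (SignedDistance(plane, point) >= 0.0f)
        {
            return false; // in front of (or on) at least one face
        }
    }
    return true;
}
```

<p>This sketch treats points exactly on a face as outside, which matches the "all distances are negative" wording above.</p>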
<p> </p>
<p><b>Alternative Method</b></p>
<p>The point-to-plane method suffices when detecting collisions between particles and polytopes. Other algorithms exist to compute the distance between two convex shapes. One of the most famous and useful, especially among game developers, is the Gilbert–Johnson–Keerthi (GJK) distance algorithm. Although its code is fast and simple, the concepts are not easy to explain and are not in the scope of this article. Furthermore, determining penetration depth can be even more problematic and usually entails more sophisticated approaches, such as the Expanding Polytope Algorithm (EPA). See the "For Further Study" section at the end of this article for more information.</p>
<p> </p>
<p><b>Collision Detection</b></p>
<p>Detecting a collision between objects entails computing their separation distance or penetration depth. Also, when objects collide, you usually want to know the region of contact and <i>contact normals</i>—that is, the direction along which to apply force or displacement to separate the objects.</p>
<p>Although the point-to-plane distance formula will tell you whether a point lies inside a polytope, it will not unambiguously tell you its distance or penetration depth. In addition to the edge cases described earlier, determining penetration depth entails the relative direction of travel of the two objects. The correct answer depends on the configuration of objects before and after the collision. If objects lie inside each other (or if a particle lies inside a polytope), then you have detected the collision after it occurred. This is called <i>interpenetration</i>, and it should be avoided or corrected when it happens.</p>
<p>For visual effects involving hundreds or thousands of tiny particles, you can get adequate results by using the following simple algorithm:</p>
<ol>
<li>Given a query point and a polytope with its position and orientation, initialize <code>largestDistance</code> to an extremely large negative value.</li>
<li>For each plane in the polytope: <ol >
<li>Compute the distance between the query point and the plane.</li>
<li>If that distance exceeds <code>largestDistance</code>, then: <ol >
<li>Assign that distance to <code>largestDistance</code></li>
<li>Remember this plane index</li>
</ol></li>
</ol> </li>
<li>Return the plane index and <code>largestDistance</code>.</li>
</ol>
<p>For spheres, subtract their radius from the returned largest distance to get the separation distance. If that value is negative, the sphere interpenetrates the polytope.</p>
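<p>The steps above can be sketched as follows. This is an illustration rather than the demo's <code>ConvexPolytope::ContactDistance</code>: it assumes the query point has already been transformed into the polytope's local frame, and the <code>ContactResult</code> type is hypothetical:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed convention: unit normal, and Dot(normal, p) - d = 0 for points in the plane.
struct Plane { Vec3 normal; float d; };

inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    return Dot(plane.normal, point) - plane.d;
}

// Hypothetical result type: which face is closest to separating, and by how much.
struct ContactResult
{
    std::size_t planeIndex;
    float       largestDistance;
};

// Visit every face plane and keep the largest signed distance.
// Assumes queryPoint is expressed in the same frame as the planes.
inline ContactResult ContactDistance(const std::vector<Plane>& planes, const Vec3& queryPoint)
{
    assert(!planes.empty());
    ContactResult result = { 0, -std::numeric_limits<float>::max() };
    for (std::size_t i = 0; i < planes.size(); ++i)
    {
        const float distance = SignedDistance(planes[i], queryPoint);
        if (distance > result.largestDistance)
        {
            result.largestDistance = distance;
            result.planeIndex      = i;
        }
    }
    return result;
}
```

<p>For a sphere, call the routine with the sphere's center and subtract the radius from <code>largestDistance</code>; a negative result indicates interpenetration.</p>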
<p>From that information, you can compute a contact point and normal:</p>
<p>Given a query point, a plane, a polytope orientation, and the largest distance:</p>
<ol>
<li>Reorient the plane normal to world space.</li>
<li>Scale the normal for the returned plane index by the largest distance to get a penetration vector.</li>
<li>Subtract the penetration vector from the query point to get the contact point.</li>
</ol>
<p>Although this algorithm does not accurately measure distance for the edge cases, the distance and normal it returns yield sufficiently close results to work for collision response.</p>
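<p>A minimal sketch of the contact-point steps follows. The <code>Mat33</code> and <code>Contact</code> types and the row-major rotation layout are assumptions made for illustration; the demo code uses its own math classes:</p>

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical 3x3 rotation matrix stored as three rows.
struct Mat33 { Vec3 rows[3]; };

inline Vec3 Rotate(const Mat33& m, const Vec3& v)
{
    return Vec3{ Dot(m.rows[0], v), Dot(m.rows[1], v), Dot(m.rows[2], v) };
}

// Hypothetical result type: contact point and contact normal, both in world space.
struct Contact { Vec3 point; Vec3 normal; };

// queryPoint is in world space, localNormal is the face normal in the
// polytope's local frame, and largestDistance comes from the distance query.
inline Contact ContactPoint(const Vec3& queryPoint, const Vec3& localNormal,
                            const Mat33& orientation, float largestDistance)
{
    assert(std::fabs(Dot(localNormal, localNormal) - 1.0f) < 1.0e-4f);
    Contact contact;
    // 1. Reorient the plane normal to world space.
    contact.normal = Rotate(orientation, localNormal);
    // 2. Scale the normal by the largest distance to get a penetration vector.
    const Vec3 penetration = { contact.normal.x * largestDistance,
                               contact.normal.y * largestDistance,
                               contact.normal.z * largestDistance };
    // 3. Subtract the penetration vector from the query point.
    contact.point = Vec3{ queryPoint.x - penetration.x,
                          queryPoint.y - penetration.y,
                          queryPoint.z - penetration.z };
    return contact;
}
```

<p>Because the distance is signed, the same subtraction moves a separated point back toward the face and moves an embedded point out to it.</p>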
<p>The <code>ConvexPolytope</code> class implements these algorithms. (See the demonstration code that accompanies these articles for more details.)</p>
<img src="http://software.intel.com/file/43821" /><br /> <br />
<p class="Body">The routine <code>ContactPoint</code> uses information computed by <code>ContactDistance</code>:</p>
<img src="http://software.intel.com/file/43822" /><br /> <br />
<p><b>Broad Phase</b></p>
<p>The <code>ContactDistance</code> algorithm iterates over every face in a polyhedron. That process can get expensive. You can reduce that expense in a few ways:</p>
<ul>
<li>Only compute <code>ContactDistance</code> when the query point lies within a coarse bounding volume (such as a bounding sphere) that contains the polytope. That computation is much faster and can let you skip the more expensive <code>ContactDistance</code> until the query point is somewhat near the polytope. The accompanying code uses this technique.</li>
<li>If you only care about interpenetration, you can change <code>ContactDistance</code> to return immediately when it finds any distance-to-plane that is positive. The returned value will not necessarily be the largest distance, but when positive, you don’t care. Note that if you want to compute the distance to a sphere instead of to a point, then you would have to pass in the sphere radius and take that into account. The accompanying code includes a routine that uses this technique.</li>
</ul>
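<p>Both ideas can be combined in one routine. The sketch below is an assumption-laden illustration, not the routine from the accompanying code: <code>BoundedPolytope</code>, <code>MightTouch</code>, and <code>Interpenetrates</code> are hypothetical names, and the planes are assumed to be in the same frame as the sphere:</p>

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed convention: unit normal, and Dot(normal, p) - d = 0 for points in the plane.
struct Plane { Vec3 normal; float d; };

inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    return Dot(plane.normal, point) - plane.d;
}

// Hypothetical polytope with a coarse bounding sphere that contains it.
struct BoundedPolytope
{
    std::vector<Plane> planes;
    Vec3               boundingCenter;
    float              boundingRadius;
};

// Broad phase: can a sphere at 'center' possibly touch the polytope?
inline bool MightTouch(const BoundedPolytope& body, const Vec3& center, float radius)
{
    const Vec3 separation = { center.x - body.boundingCenter.x,
                              center.y - body.boundingCenter.y,
                              center.z - body.boundingCenter.z };
    const float reach = body.boundingRadius + radius;
    return Dot(separation, separation) <= reach * reach; // compare squared lengths
}

// Narrow phase with early out: stop at the first plane that separates
// the sphere from the polytope.
inline bool Interpenetrates(const BoundedPolytope& body, const Vec3& center, float radius)
{
    assert(radius >= 0.0f);
    if (!MightTouch(body, center, radius))
    {
        return false; // skipped the per-face loop entirely
    }
    for (const Plane& plane : body.planes)
    {
        if (SignedDistance(plane, center) - radius > 0.0f)
        {
            return false; // entirely in front of this face
        }
    }
    return true;
}
```

<p>The broad-phase test costs one squared-distance comparison, so particles far from the body never reach the per-face loop.</p>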
<p>To further reduce CPU cost, at the expense of memory and complexity, you can:</p>
<ul>
<li>Remember the plane from the previous iteration, and reuse that for the next attempt. If it’s still positive, there is no collision, and you can bail out after testing one plane. Note that this would entail storing another integer per particle. Because there can be tens of thousands of particles, that can add up.</li>
<li>Store face connectivity information and only visit adjacent faces whose distance would increase <code>largestDistance</code>. Doing so can significantly reduce the number of faces visited. Also, game engines often include such adjacency information. You might be able to exploit that information for particle collisions.</li>
</ul>
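<p>The first of those ideas might look like the sketch below. The routine name and the <code>planesTested</code> counter are hypothetical and exist only to make the saving visible; the accompanying code does not necessarily implement this optimization:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed convention: unit normal, and Dot(normal, p) - d = 0 for points in the plane.
struct Plane { Vec3 normal; float d; };

inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    return Dot(plane.normal, point) - plane.d;
}

// Interpenetration test that first retries the plane that separated this
// particle last time. 'cachedPlane' is the extra integer stored per particle.
// 'planesTested' reports how many planes were visited (illustration only).
inline bool InterpenetratesCached(const std::vector<Plane>& planes, const Vec3& point,
                                  std::size_t& cachedPlane, int& planesTested)
{
    assert(cachedPlane < planes.size());
    planesTested = 1;
    if (SignedDistance(planes[cachedPlane], point) > 0.0f)
    {
        return false; // still separated by the remembered plane
    }
    for (std::size_t i = 0; i < planes.size(); ++i)
    {
        if (i == cachedPlane)
        {
            continue; // already tested above
        }
        ++planesTested;
        if (SignedDistance(planes[i], point) > 0.0f)
        {
            cachedPlane = i; // remember the separating plane for next time
            return false;
        }
    }
    return true;
}
```

<p>A particle that drifts slowly past one face tends to stay in front of that same face from frame to frame, which is when the cached index pays off.</p>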
<br />
<p><b>Deepening Penetration</b></p>
<p>If a particle penetrates an object, it could end up closer to the opposite side of the object rather than the side it penetrated, as Figure 5 shows. This could happen for thin objects or fast particles. It is therefore useful to consider only those planes that the particle is moving deeper behind, that is, planes whose normals oppose the particle's velocity relative to the object.</p>
<p ><img src="http://software.intel.com/file/43803" /></p>
<p ><b>Figure 5.</b> <i>Measuring collision depth between a moving spherical particle and polytope</i></p>
<p> </p>
<p>The demo code accompanying this article contains a routine, <code>ConvexPolytope::CollisionDistance</code>, that implements this idea.</p>
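<p>One way to express that idea is sketched below. It is my own simplified reading of the approach, not a copy of <code>ConvexPolytope::CollisionDistance</code>; the <code>CollisionResult</code> type, the strict velocity test, and the convention of returning index -1 when no plane qualifies are all assumptions:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed convention: unit normal, and Dot(normal, p) - d = 0 for points in the plane.
struct Plane { Vec3 normal; float d; };

inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    return Dot(plane.normal, point) - plane.d;
}

// Hypothetical result type; planeIndex is -1 when no plane qualified.
struct CollisionResult
{
    int   planeIndex;
    float distance;
};

// Like the largest-distance query, but only consider planes that the particle
// is moving deeper behind. relativeVelocity is the particle's velocity
// relative to the body, in the same frame as the planes.
inline CollisionResult CollisionDistance(const std::vector<Plane>& planes,
                                         const Vec3& point, const Vec3& relativeVelocity)
{
    assert(!planes.empty());
    CollisionResult result = { -1, -std::numeric_limits<float>::max() };
    for (std::size_t i = 0; i < planes.size(); ++i)
    {
        if (Dot(planes[i].normal, relativeVelocity) >= 0.0f)
        {
            continue; // moving along or out through this face; ignore it
        }
        const float distance = SignedDistance(planes[i], point);
        if (distance > result.distance)
        {
            result.distance   = distance;
            result.planeIndex = static_cast<int>(i);
        }
    }
    return result;
}
```

<p>For a thin plate struck from above, a particle that ends up just inside the bottom face is still attributed to the top face, because that is the only face it is moving behind. A caller could fall back to the unfiltered query when the returned index is -1.</p>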
<p> </p>
<p><b>Continuous Collision Detection</b></p>
<p>The most accurate way to determine contacts would entail <i>continuous collision detection</i> (CCD), that is, detecting the collision just as it happens (instead of after the fact). CCD involves computing a <i>time of impact</i> (TOI) and either advancing the simulation up to that point or rewinding back to it. One way to approximate CCD and estimate TOI is to move into a reference frame where one object is stationary. The other object will still be in motion. Now, <i>sweep</i> that moving object across space to span the region it would occupy at all points during the test interval. If the swept shape intersects with the stationary shape, the two objects probably collided during that interval.</p>
<p>Sweeping a shape is relatively easy if its motion is pure translation but more difficult if its motion includes rotation. Video games therefore either treat only linear motion for continuous collision detection or simply use discrete collision detection and allow objects to interpenetrate.</p>
<p>For spherical shapes, like particles, the swept shape is a line segment with hemispherical caps, also known as a <i>capsule </i>or<i> sausage</i>. You can compute intersections between capsules and planes using simple formulae. But particle effects for video games do not need that level of sophistication, and it takes longer to compute than most games budget for effects.</p>
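<p>For reference, here is a sketch of the simplest such formula: a sphere translating in a straight line against a single infinite plane. It is not part of the accompanying code, the function name is hypothetical, and it returns the time of impact as a fraction of the step:</p>

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed convention: unit normal, and Dot(normal, p) - d = 0 for points in the plane.
struct Plane { Vec3 normal; float d; };

inline float SignedDistance(const Plane& plane, const Vec3& point)
{
    return Dot(plane.normal, point) - plane.d;
}

// Sweep a sphere from 'start' to 'end' (pure translation) against a plane.
// Returns true if the sphere touches or crosses the plane during the step.
// timeOfImpact is the fraction of the step, in [0, 1], at first contact.
inline bool SweptSphereHitsPlane(const Plane& plane, const Vec3& start, const Vec3& end,
                                 float radius, float& timeOfImpact)
{
    assert(radius >= 0.0f);
    const float startGap = SignedDistance(plane, start) - radius;
    const float endGap   = SignedDistance(plane, end) - radius;
    if (startGap < 0.0f)
    {
        timeOfImpact = 0.0f; // already interpenetrating when the step begins
        return true;
    }
    if (endGap > 0.0f)
    {
        return false; // the gap varies linearly, so it never reached zero
    }
    // The gap shrinks linearly from startGap to endGap; find where it is zero.
    timeOfImpact = (startGap > endGap) ? startGap / (startGap - endGap) : 0.0f;
    return true;
}
```

<p>A polytope would need this test per face plus a check that the contact lies within the face, which is part of why the full treatment costs more than the discrete test.</p>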
<p> </p>
<p><b>Concave Shapes</b></p>
<p>The technique described in this article applies directly to convex shapes. To apply it to concave shapes, you can either compute the convex hull of that shape or decompose the shape into convex components. See the "For Further Study" section for more information.</p>
<p> </p>
<h2 class="sectionHeading">Collision Response</h2>
<p>When the detection phase indicates that a particle interpenetrated an obstacle, the simulation must resolve the collision. In other words, it must push the particle outside the obstacle and adjust the fluid flow to satisfy boundary conditions.</p>
<p><a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-1/">Part 1</a>, <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-2/">part 2</a>, and <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-4/">part 4</a> explain boundary conditions and one way to solve them approximately, so I will not repeat that here. This article only describes changes that facilitate particles interacting with convex polyhedra.</p>
<p> </p>
<p><b>Simplified Vorton Interaction with Planes</b></p>
<p>The routine <code>SolveBoundaryConditions</code> iterates through each rigid body and collides vortons and tracers with that body by calling <code>CollideVortonsSlice</code> and <code>CollideTracersSlice</code>. As of this article, <code>CollideVortonsSlice</code> is a new routine, extracted from <code>SolveBoundaryConditions</code> from previous articles.</p>
<p>The code snippet below focuses on changes made to facilitate colliding with convex polytopes. Code in <span >purple bold</span> is new.</p>
<img src="http://software.intel.com/file/43823" /><br /> <br />
<p>Notice that this code first checks whether the particle lies within a bounding sphere, regardless of whether the obstacle is a sphere or polytope. That is a broad-phase collision test.</p>
<p>The routines <code>CollideTracersSlice</code> and <code>RemoveEmbeddedParticles</code> have similar changes. See the demonstration code accompanying this article for details.</p>
<p> </p>
<p><b>Parallelization</b></p>
<p>The routine <code>CollideVortonsSlice</code> was extracted from <code>SolveBoundaryConditions</code> to facilitate parallelizing it with Intel® Threading Building Blocks (Intel® TBB). In addition to extracting that code into its own routine, other changes were made. Previously, the corresponding code directly applied changes to the rigid body’s temperature and momentum. The old code performed operations like this:</p>
<ol>
<li>Read body temperature.</li>
<li>Compute heat exchange based on body temperature.</li>
<li>Write new body temperature.</li>
</ol>
<p>But when run in parallel, such updates cause a race condition, as shown in Table 1.</p>
<p> </p>
<p><b>Table 1.</b> <i>Parallel threads cause a race condition.</i></p>
<table  class="tableFormat1" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th valign="top" width="339">Thread 1</th><th valign="top" width="339">Thread 2</th>
</tr>
<tr>
<td valign="top" width="339">
<p class="CellBody">Read body temperature.</p>
</td>
<td valign="top" width="339">
<p class="CellBody">Read body temperature.</p>
</td>
</tr>
<tr>
<td valign="top" width="339">
<p class="CellBody">Compute heat exchange based on body temperature.</p>
</td>
<td valign="top" width="339">
<p class="CellBody">Compute heat exchange based on body temperature.</p>
</td>
</tr>
<tr>
<td valign="top" width="339">
<p class="CellBody">–</p>
</td>
<td valign="top" width="339">
<p class="CellBody">Write new body temperature.</p>
</td>
</tr>
<tr>
<td valign="top" width="339">
<p class="CellBody">Write new body temperature.</p>
</td>
<td valign="top" width="339">
<p class="CellBody">–</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
<p>Both threads update a value at the same address (temperature, in this example). Only one thread can "win."</p>
<p>You could solve this issue by synchronizing the code with mutex locks on the body temperature. But doing so would serialize that critical section of code, which would in turn defeat the purpose of parallelizing it.</p>
<p>Instead, have each thread accumulate changes in a variable local to each thread. When the thread terminates, have the parent thread accumulate those changes and apply them to the body. This might seem to be a perfect use case for Intel TBB’s <code>parallel_reduce</code> operation.</p>
<p>There is one more catch, however: That accumulation operation is not associative. (See <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-12/">Part 12</a> for details of a similar problem.) Even though addition is associative for real numbers, it is not for floating-point numbers, so the result depends on the order in which partial results are combined. Intel TBB’s <code>parallel_reduce</code> does not split and join deterministically. To ensure that this parallelized routine is deterministic, use Intel TBB’s <code>parallel_invoke</code> instead, and spawn and join threads manually.</p>
<img src="http://software.intel.com/file/43824" /><br /> <br />
<p>Create a functor to run <code>CollideVortonsReduce</code> with Intel TBB’s <code>parallel_invoke</code>:</p>
<img src="http://software.intel.com/file/43825" /><br /> <br />
<p>For comparison, the demonstration code accompanying this article also includes code for using Intel TBB’s <code>parallel_reduce</code>. It works—in the sense that it generates a usable result—but it is not deterministic, so using it would impede diagnosing issues.</p>
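<p>To show the principle without depending on Intel TBB, the sketch below swaps in <code>std::thread</code> for <code>parallel_invoke</code> and sums plain floats instead of body impulses and heat. The routine name <code>DeterministicSum</code> and the <code>grainSize</code> parameter are hypothetical. The split points depend only on the range and the grain size, and each pair of partial results is always combined as left plus right, so the floating-point result does not depend on thread scheduling:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sum values[begin, end) by recursive halving.
// Each half accumulates into its own local variable, so no two threads
// ever write to the same address, and the join order is fixed.
inline float DeterministicSum(const std::vector<float>& values,
                              std::size_t begin, std::size_t end, std::size_t grainSize)
{
    assert(begin <= end && end <= values.size() && grainSize > 0);
    if (end - begin <= grainSize)
    {
        // Small enough: accumulate serially, in index order.
        float sum = 0.0f;
        for (std::size_t i = begin; i < end; ++i)
        {
            sum += values[i];
        }
        return sum;
    }
    const std::size_t middle = begin + (end - begin) / 2;
    float left = 0.0f;
    // Spawn: the left half runs on a worker thread.
    std::thread worker([&values, &left, begin, middle, grainSize]()
    {
        left = DeterministicSum(values, begin, middle, grainSize);
    });
    // The right half runs on the current thread.
    const float right = DeterministicSum(values, middle, end, grainSize);
    // Join: wait for the worker, then combine in a fixed order.
    worker.join();
    return left + right;
}
```

<p>This sketch creates one thread per split, which is wasteful; a task scheduler such as the one behind <code>parallel_invoke</code> avoids that cost while keeping the same fixed split-and-join structure.</p>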
<p> </p>
<h2 class="sectionHeading">Results</h2>
<p>Let’s replace some of the spheres in previous articles with polytopes.</p>
<p>Although the demonstration code uses boxes, the algorithms and data structures support any convex polytope, as depicted later in Figure 8.</p>
<p> </p>
<p><b>Scenarios</b></p>
<p>Figure 6 shows a flat plate interacting with flames and smoke. Notice that particles move around the plate, and the plate causes vortices to shed from it.</p>
<img src="http://software.intel.com/file/43804" /><br />
<p><b>Figure 6.</b> <i>Various views of a plate above flames</i></p>
<p>Figure 7 shows a flat plate moving horizontally through fluid, leaving a wake with vortices. The code accompanying this article also includes demonstrations with the obstacle rotating about longitudinal and lateral axes, exhibiting the Magnus (curve ball) effect.</p>
<img src="http://software.intel.com/file/43805" /><br />
<p><b>Figure 7.</b> <i>Flat plate moving through fluid</i></p>
<p>Figure 8 shows a polyhedral airfoil moving horizontally through fluid, leaving a wake with vortices. This demonstrates that the technique applies to shapes other than boxes and spheres.</p>
<img src="http://software.intel.com/file/43806" /><br /> <br />
<p><b>Figure 8.</b> <i>Airfoil moving through fluid. This comes from Gourlay (2010), which used a formulation similar to that presented in this article.</i></p>
<p> </p>
<p><b>Performance</b></p>
<p>Table 2 shows how the collision-detection and response routines perform for the scenario with flames and smoke passing by the flat plate. This scenario had, on average, 49,000 tracer particles and 981 vortex particles (per frame), two spheres, and one box. Tracers hit bodies 8876 times per frame, and vortons hit bodies 63 times per frame. The benchmark ran 6000 frames for each run. The processor was a four-core (eight hardware threads with hyperthreading) Intel® Core™ i7-2600 running at 3.4 GHz.</p>
<p> </p>
<p><b>Table 2.</b> <i>Collision-detection and response routines for the smoke and flame scenario (run times in milliseconds)</i></p>
<table  class="tableFormat1" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th valign="bottom" width="79">No. of threads</th> <th valign="bottom" width="119">Solve boundary conditions</th> <th valign="bottom" width="119">SBC tracers</th> <th valign="bottom" width="119">SBC vortons</th> <th valign="bottom" width="119">Sim update</th> <th valign="bottom" width="119">Total (including render)</th>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">1</p>
</td>
<td valign="top" width="119">
<p class="CellBody">2.42</p>
</td>
<td valign="top" width="119">
<p class="CellBody">2.346</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.057</p>
</td>
<td valign="top" width="119">
<p class="CellBody">4.12</p>
</td>
<td valign="top" width="119">
<p class="CellBody">25.7</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">2</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.606</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.563</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0423</p>
</td>
<td valign="top" width="119">
<p class="CellBody">2.52</p>
</td>
<td valign="top" width="119">
<p class="CellBody">16.3</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">3</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.134</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.095</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0378</p>
</td>
<td valign="top" width="119">
<p class="CellBody">2.15</p>
</td>
<td valign="top" width="119">
<p class="CellBody">11.7</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">4</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.142</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.107</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0348</p>
</td>
<td valign="top" width="119">
<p class="CellBody">2.03</p>
</td>
<td valign="top" width="119">
<p class="CellBody">11.4</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">6</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.804</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.774</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0312</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.81</p>
</td>
<td valign="top" width="119">
<p class="CellBody">9.65</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">8</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.784</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.75</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0303</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.76</p>
</td>
<td valign="top" width="119">
<p class="CellBody">9.34</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">12</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.69</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.657</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0303</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.61</p>
</td>
<td valign="top" width="119">
<p class="CellBody">8.38</p>
</td>
</tr>
<tr>
<td valign="top" width="79">
<p class="CellBody">16</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.718</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.684</p>
</td>
<td valign="top" width="119">
<p class="CellBody">0.0336</p>
</td>
<td valign="top" width="119">
<p class="CellBody">1.64</p>
</td>
<td valign="top" width="119">
<p class="CellBody">8.59</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
<p>Figure 9 shows a plot of the data in the table.</p>
<p ><img src="http://software.intel.com/file/43807" /></p>
<br />
<p ><b>Figure 9. </b><i>Run times for the benchmark scenario</i></p>
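<p>A quick way to read Table 2 is to convert run times into speedup and parallel efficiency. The sketch below copies the "Solve boundary conditions" and "Total" columns from the table; the helper names are hypothetical. By this arithmetic, solving boundary conditions runs roughly 3.1 times faster with eight threads than with one:</p>

```cpp
#include <cassert>
#include <cstdio>

// One row of Table 2 (only two of the timing columns are copied here).
struct BenchmarkRow
{
    int    threads;
    double solveBoundaryConditions;
    double total;
};

static const BenchmarkRow kTable2[] =
{
    {  1, 2.42 , 25.7  },
    {  2, 1.606, 16.3  },
    {  3, 1.134, 11.7  },
    {  4, 1.142, 11.4  },
    {  6, 0.804,  9.65 },
    {  8, 0.784,  9.34 },
    { 12, 0.69 ,  8.38 },
    { 16, 0.718,  8.59 },
};
static const int kNumRows = sizeof(kTable2) / sizeof(kTable2[0]);

// Speedup: how many times faster the multithreaded run is.
inline double Speedup(double singleThreadTime, double multiThreadTime)
{
    assert(multiThreadTime > 0.0);
    return singleThreadTime / multiThreadTime;
}

// Parallel efficiency: speedup per thread (1.0 would be ideal scaling).
inline double Efficiency(double speedup, int threads)
{
    assert(threads > 0);
    return speedup / threads;
}

// Print speedup and efficiency of the boundary-condition solver per row.
inline void PrintSpeedups()
{
    for (int i = 0; i < kNumRows; ++i)
    {
        const double speedup = Speedup(kTable2[0].solveBoundaryConditions,
                                       kTable2[i].solveBoundaryConditions);
        std::printf("%2d threads: speedup %.2f, efficiency %.2f\n",
                    kTable2[i].threads, speedup, Efficiency(speedup, kTable2[i].threads));
    }
}
```

<p>Efficiency drops once the thread count exceeds the four physical cores, which is consistent with the flattening curve in the plot.</p>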
<p><b>Lift</b></p>
<p>Nonrotating spheres do not generate lift. But lift occurs on asymmetric shapes like flat plates and airfoils.</p>
<p>The algorithm to solve boundary conditions generates lift-like impulses. A flat plate moving horizontally through the fluid at an appropriate angle of attack should encounter lift pushing the plate upward. And indeed, this simulation generates a qualitatively similar result—but it’s from deflecting particles that bounce off the obstacle with partially elastic collisions. In contrast, simulations used in science and engineering calculate the pressure the fluid exerts on bodies, and that calculation can include lift. This simulation does not calculate pressure.</p>
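<p>To see why deflected particles produce a lift-like force, consider a single partially elastic bounce. The sketch below is an illustration under assumed names (<code>Bounce</code>, <code>BounceResult</code>, and a <code>restitution</code> coefficient between 0 and 1); it is not the demo's collision response, which also handles vorticity and heat:</p>

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical result type for one particle bounce.
struct BounceResult
{
    Vec3 newVelocity;   // particle velocity relative to the body after the bounce
    Vec3 impulseOnBody; // momentum handed to the body (equal and opposite to the particle's change)
};

// relativeVelocity: particle velocity relative to the body.
// contactNormal: unit normal of the struck face, pointing out of the body.
// restitution: 0 = perfectly inelastic, 1 = perfectly elastic.
inline BounceResult Bounce(const Vec3& relativeVelocity, const Vec3& contactNormal,
                           float particleMass, float restitution)
{
    assert(restitution >= 0.0f && restitution <= 1.0f);
    BounceResult result = { relativeVelocity, { 0.0f, 0.0f, 0.0f } };
    const float normalSpeed = Dot(relativeVelocity, contactNormal);
    if (normalSpeed >= 0.0f)
    {
        return result; // already separating; nothing to do
    }
    // Reverse the normal component of velocity, scaled by restitution.
    const float scale = (1.0f + restitution) * normalSpeed;
    result.newVelocity = Vec3{ relativeVelocity.x - scale * contactNormal.x,
                               relativeVelocity.y - scale * contactNormal.y,
                               relativeVelocity.z - scale * contactNormal.z };
    // The body gains the momentum the particle lost.
    result.impulseOnBody = Vec3{ particleMass * scale * contactNormal.x,
                                 particleMass * scale * contactNormal.y,
                                 particleMass * scale * contactNormal.z };
    return result;
}
```

<p>For a plate whose underside faces forward and down, say normal (0.6, 0, -0.8), a particle arriving with relative velocity (-1, 0, 0) is deflected downward, and the plate receives an impulse with a positive vertical component (lift-like) and a negative horizontal component (drag-like).</p>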
<p>Furthermore, the collision response algorithm does not generate new vortons; it only reassigns values for existing vortons in contact with the object. To be more physically accurate, objects should generate new vortons when necessary. For example, vorticity would be generated on the leeward side of the airfoil, and this would generate a low-pressure region behind and above the airfoil; the vertical component of the resulting force would be lift. See the "For Further Study" section for more information about more physically accurate ways to calculate realistic pressure and aerodynamic forces on objects immersed in a fluid.</p>
<p> </p>
<h2 class="sectionHeading">Summary</h2>
<p>Using a simple point-to-plane distance formula, you can make fluid effects interact with obstacles that have shapes commonly used to create models in a video game. The algorithm is easy to parallelize using Intel TBB. In the benchmark above, it resolved collisions for roughly 50,000 particles in less than a millisecond when run with six or more threads.</p>
<p> </p>
<h2 class="sectionHeading">Future Articles</h2>
<p>Liquids take the shape of their containers on all but one surface, so modeling liquids also implies modeling containers. Future articles will include extending boundary conditions to include interiors, which will allow for creating containers. That will pave the way for a discussion of free surface tracking and surface tension—properties of liquids.</p>
<p> </p>
<h2 class="sectionHeading">For Further Study</h2>
<ul>
<li>Casey Muratori posted a video (<a href="https://mollyrocket.com/849">https://mollyrocket.com/849</a>) that explains the GJK algorithm using straightforward geometry.</li>
<li>Lien and Amato ("Approximate Convex Decomposition of Polyhedra," 2006) describe an algorithm to decompose a concave model into nearly convex shapes. Lien’s Ph.D. dissertation (<a href="http://cs.gmu.edu/~jmlien/masc/uploads/Main/lien-dissertation.pdf">http://cs.gmu.edu/~jmlien/masc/uploads/Main/lien-dissertation.pdf</a>) contains pseudo-code for their algorithms and a comprehensive bibliography on the subject. Also see their technical report (<a href="http://cs.gmu.edu/~jmlien/masc/uploads/Main/cd3d_TR_2006.pdf">http://cs.gmu.edu/~jmlien/masc/uploads/Main/cd3d_TR_2006.pdf</a>).</li>
<li>The Wikipedia article on Convex Hull Algorithms (<a href="http://en.wikipedia.org/wiki/Convex_hull_algorithms">http://en.wikipedia.org/wiki/Convex_hull_algorithms</a>) describes algorithms to obtain the convex hull of a set of points, such as the vertices of a model.</li>
<li>In chapter 6 of <i>Vortex Methods: Theory and Practice</i>, Cottet and Koumoutsakos describe a vorticity creation algorithm to satisfy boundary conditions. In contrast to the algorithm presented here and in <a href="http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-4/">part 4</a>, their algorithm takes into account the entire body at once instead of only a single point at a time.</li>
<li>I presented a formulation similar to the one described in this article in "<a href="http://webstaff.itn.liu.se/~perla/Siggraph2010/content/posters/0008.pdf">Fluid-body simulation using vortex particle operations</a>," an animation poster session at the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), Los Angeles in 2010.</li>
</ul>
<h2 class="sectionHeading">About the Author</h2>
<p>Dr. Michael J. Gourlay works as a senior software engineer on interactive entertainment. He previously worked at Electronic Arts Inc. (EA Sports) as the software architect for the Football Sports Business Unit, as a senior lead engineer on <i>Madden* NFL</i>, on character physics and the procedural animation system used by EA on <i>Mixed Martial Arts</i>, and as a lead programmer on <i>NASCAR</i>*. He wrote the visual effects system used in EA games worldwide and patented algorithms for interactive, high-bandwidth online applications. He also developed curricula for and taught at the University of Central Florida, Florida Interactive Entertainment Academy, an interdisciplinary graduate program that teaches programmers, producers, and artists how to make video games and training simulations. Prior to joining EA, he performed scientific research using computational fluid dynamics and the world’s largest massively parallel supercomputers. His previous research also includes nonlinear dynamics in quantum mechanical systems and atomic, molecular, and optical physics. Michael received his degrees in physics and philosophy from Georgia Tech and the University of Colorado Boulder.</p>
<p> </p>
<p><sup>*</sup>Other names and brands may be claimed as the property of others.</p>
<p> </p>
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">05-16-2012</div>
<div id="vc-meta-modificationdate">05-16-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43797</div>
<div id="vc-meta-abstract">This is a series on fluid simulation for games. This article describes how to augment the fluid particle system described earlier, interact with rigid bodies with any polyhedral shape, and generate a lift-like force on those bodies.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-13/</link>
      <pubDate>Wed, 16 May 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-13/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-13/</guid>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>Surface Based Anti-Aliasing</title>
      <description><![CDATA[ <link media="screen" href="http://software.intel.com/media/gamedev/css/3302_Intel_VC_01.css?v=11" type="text/css" rel="stylesheet" />
<link media="screen" href="http://software.intel.com/file/23729" type="text/css" rel="stylesheet" />
<table border="0" cellpadding="0" cellspacing="0" width="100">
<tbody>
<tr>
<td valign="top">
<div id="left_container">
<div id="header_content"><a href="http://software.intel.com/en-us/visual-computing/" title="Visual Computing Research"><img src="http://software.intel.com/file/42465" border="0" height="96" width="727" /></a></div>
<div id="left_content_container2"><!-- START left content -->
<div id="showcase_01">
<h2>Surface Based Anti-Aliasing<br /> <br /> By Marco Salvi, Kiril Vidimce, Intel Corp.</h2>
<p>We present surface based anti-aliasing (SBAA), a new approach to real-time anti-aliasing for deferred renderers that improves the performance and lowers the memory requirements for anti-aliasing methods that sample sub-pixel visibility. We introduce a novel way of decoupling visibility determination from shading that, compared to previous multi-sampling based approaches, significantly reduces the number of samples stored and shaded per pixel. Unlike postprocess anti-aliasing techniques used in conjunction with deferred renderers, SBAA correctly resolves visibility of sub-pixel features, minimizing spatial and temporal artifacts.<br /> <br /> Read the paper: <a href="http://software.intel.com/file/43579">Surface Based Anti-Aliasing</a> [PDF 2.4 MB]</p>
<p ><img src="http://software.intel.com/file/43585" /></p>
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">05-02-2012</div>
<div id="vc-meta-modificationdate">05-02-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div>Research</div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43580</div>
<div id="vc-meta-abstract">We present surface based anti-aliasing (SBAA), a new approach to real-time anti-aliasing for deferred renderers that improves the performance and lowers the memory requirements for anti-aliasing methods that sample sub-pixel visibility. We introduce a novel way of decoupling visibility determination from shading that, compared to previous multi-sampling based approaches, significantly reduces the number of samples stored and shaded per pixel.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/surface-based-anti-aliasing/</link>
      <pubDate>Wed, 02 May 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/surface-based-anti-aliasing/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/surface-based-anti-aliasing/</guid>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>Cthulhu PC Debut on Ultrabooks, A Game Craft Journey</title>
      <description><![CDATA[ <div><b>By Bob Duffy</b></div>
<div><br /></div>
<img src="http://software.intel.com/file/43597" title="Call of Cthulhu: The Wasted Land" alt="Call%20of%20Cthulhu%3A%20The%20Wasted%20Land" />
<div><br /></div>
<div>We interview Tomas Rawlings of Red Wasp Design, developer of the H.P. Lovecraft-inspired PC game “<a target="_parent" href="http://www.appup.com/app-details/call-of-cthulhu:-the-wasted-land">Call of Cthulhu: The Wasted Land</a>” (available from <a target="_blank" href="http://www.appup.com/app-details/call-of-cthulhu:-the-wasted-land">Intel AppUp</a>), about his inspiration for the game, the challenges he faced in porting to Ultrabook PCs, and his recommendations for developers taking on similar challenges.  Here's what he has to say.
<div><span ><br /></span>
<div><span ><br /></span></div>
<div><span ><b><span >Section One: Inspiration</span></b></span><span ><br /></span><br /><b><img width="300" src="http://redwaspdesign.files.wordpress.com/2011/06/cthulhu_sketches_c.jpg"  />Bob:</b> The title is more than telling, and appears to be from an <a target="_blank" href="http://en.wikipedia.org/wiki/The_Call_of_Cthulhu">H.P. Lovecraft classic, “Call of Cthulhu”</a>.  What inspired you to take on this material?<br /><br /><b>Tomas:</b> I've been a fan of Lovecraft's fiction for a long, long time.  I love the <a target="_blank" href="http://www.chaosium.com/article.php?story_id=87">Call of Cthulhu RPG</a>, and I've been wanting to make a turn-based game for a while too.  Happily all 3 came together in this project!<br /><br /><b>Bob:</b> I’ve only made it past the first level, is there a Cthulhu in the story?<br /><br /><b>Tomas:</b> That would be telling.  Cthulhu him(it?)self is a HUGE entity that on the scale of the game would dwarf the screen if we tried to depict him.  However there are related entities that are key in the Call of Cthulhu RPG that you will meet and who will try to kill you.  However we don't want to spoil the ending for players.<br /><br /><b>Bob: </b>Why set this in the “Great War” ?<br /><br /><b>Tomas:</b> We wanted a setting that has not been overused in games (WW2 has for example) and yet had plenty of historical resonance and recognition – not to mention cool weapons we could use.  World War 1 achieved all that as a setting.  Also there is a personal connection, we're approaching the 100th Anniversary of the start of the war (2014) and my great grandfather, Sid Brown, who I knew well, fought in that war, so I wanted to remember him in a form I work with, games. <i>(<a target="_blank" href="http://redwaspdesign.wordpress.com/2011/07/20/meet-sid-sapper-brown-based-on-my-grandfather/ ">read blog</a>)</i><br /><br /><b>Bob:</b> It has to be challenging to base a game on copyrighted material such as HP Lovecraft’s story.  
Do you have any advice for developers wanting to take on stories with licensed characters?<br /><br /><b>Tomas:</b> I think it is key to understand why people love that story or its characters.  For me in this project that is easy as I'm a fan of Lovecraft's fiction and the Call of Cthulhu RPG.  But you still need to get into the essence of what you are working with.  Lovecraft's fiction is about the realization that we are not alone, and the entities we share existence with don't really like us nor want to share this world with us.  It's also about the gradual realization of how bleak and monstrous the wider universe is.</div>
<div><br /></div>
<div><br /></div>
<div><b><span >Section Two: Technical</span></b><br /><br />
<div><b>Bob:</b> What language is the application developed in? <img width="300" src="http://redwaspdesign.files.wordpress.com/2011/12/callofcthulhu_wastedland_hd_01.png?w=640&amp;h=426"  /><br /><br /><b>Tomas:</b> The whole project has been written in C++ using our own tech and tools to allow us to move easily between platforms, except we used FMOD as a sound manager.<br /><br /><b>Bob:</b> Can you tell us more about your development process for a game like this, from game concepts to final development?<br /><br /><b>Tomas:</b> Initially we agree the broad concept; a turn-based strategy RPG.  We chose this from a pile of concepts we'd put together.  Once that was settled the real work began; were we going to do 2D or 3D?  We picked 3D as 2D started to impose too many restrictions.  Then you need to decide how it is going to look; is the game area realistic in scale (as we chose), or representational (like Advance Wars).  Then I worked on the design and the levels and characters.  Stu (our artist) worked on some concept art then started to build all the 3D models.  Mike started coding the 3D engine and it just grew from there.  We're a small team and we've worked together lots before, which is a huge advantage for development.<br /><br /><b>Bob: </b>How long did it take to create the iPad version, before porting to the PC?<br /><br /><b>Tomas: </b>We didn't really create one version in isolation.  It took about a year before the first version was out.  However we were developing the engine and the art so that porting the game to other platforms was not going to mean a huge re-write each time.<br /><br /><b>Bob:</b> How long did the port to PC take?<br /><br /><b>Tomas:</b> For the PC port we re-authored the interface to be wide screen for laptops and modern monitors, this took a few days based on the iPad interface files.  
At the programming end we had to add some extra Windows code to deal with different resolutions, Windows 7 touch screen support and the location of saved game data.  All in it took around two months including testing.<br /><br /><b>Bob: </b>Did the development framework lend itself to porting the applications, and was the development framework (it) influenced by a desire to port the app?<br /><br /><b>Tomas:</b> The game is pretty self-contained as a single player experience, so we knew what we were doing as we moved to AppUp.<br /><br /><b>Bob: </b>What were key considerations in deciding to port the app to PC?<br /><br /><b><img  src="http://redwaspdesign.files.wordpress.com/2012/04/1360_gen03.png?w=640&amp;h=361" width="300" />Tomas:</b> Widescreen!  PCs have much more screen space than iOS devices so we wanted to make sure we took advantage of that.</div>
<div><b><br /></b></div>
<div><b>Bob:</b> The graphics are crisp, with an HD quality, especially in title scenes. Did you make or consider graphic changes for the app to show well on a PC / Ultrabook?<br /><br /><b>Tomas: </b>Yes, we decided early on that we should rework the graphics so that they would fit and look their best on most PCs/laptops but without compromising the frame rate.<br /><br /><br /><b>Bob: </b> The game play is very natural, it’s very easy to control the turn based system.  Were there any considerations on mouse / keyboard as input devices for the PC/Ultrabook version?<br /><br /><b>Tomas: </b>Porting from the iPad version i.e. the touch screen, it seemed natural that we keep the interface and input controls very simple.  We didn't want to use the keyboard just because it was there, so we kept all the controls on the mouse. That helps keep the user in control and avoids confusion over keyboard controls, especially if they are new to the turn-based genre.  <br /><br /><b>Bob:</b> Have you considered touch for the PC as well?<br /><br /><b>Tomas:</b> We've made the game work with Windows 7 Touchscreen machines, so you can use the touch screen to scroll and select units for information.  The mouse is a good input device for strategy games and so it was easy to make this work well on PC.  We also had to make sure that it worked as well on touch pads as with a mouse, which is where we've done lots of testing.<br /><br /><b>Bob:</b> The game is very responsive on my Ultrabook, nice frame rate etc.  Did you do work or consider doing things to take advantage of the processing capabilities of Ultrabooks?<br /><br /><b>Tomas:</b> The graphics system/AI and game logic were already well optimized for the other targeted platforms so we knew that the Intel chips would have no problems at all in running the game very quickly.  
We're looking at optimizing as a future update, especially on the PC platform allowing us to add extra visual effects.<br /><br /><b>Bob:</b> Your <a href="http://redwaspdesign.wordpress.com/2012/04/19/call-of-cthulhu-the-wasted-land-summoned-to-pc-via-the-intel-appup-center/" target="_blank">press statement</a> says this game will be available in multiple languages.  Is that new work and was it significant?<br /><br /><b>Tomas:</b> Yes - German, French, Italian and Spanish language versions.  We have an international fan-base on our social media channels and the Call of Cthulhu RPG is translated into a number of languages, so we were always keen to do this.  However it does require resources that we don't have in the office and so Intel's support here was invaluable though we organized the translations ourselves and pulled in a few favors.<br /><br /><b>Bob: </b>Were there any unexpected challenges when porting the app to PC?<br /><br /><b>Tomas: </b>All of us at Red Wasp cut our teeth on PC development back in the late 90's so we knew what to expect most of the time.  The challenge is always making sure it runs on every system possible at a good frame rate.<br /><br /><b>Bob: </b>Is the game threaded, to make use of multi-cores?<br /><br /><b>Tomas:</b> The game currently isn't multi-threaded but as our technology and game requirements advance we will definitely be adding support for multi-cores and we're excited about it.<br /><br /><b>Bob: </b>Threading Building Blocks should certainly help you guys out there, and speaking of Intel tools, did you make use of any Intel tools to compile and/or optimize the performance of your game, e.g. Graphics Performance Analyzer?<br /><br /><b>Tomas:</b> We used the AppUp SDK and debugger.  
The AppUp SDK was very easy to use!<br /> <br /><b>Bob: </b>Did you consider any other AppUp API for monetization?<br /><br /><b>Tomas:</b> We're not currently doing extra content, but will use these when we do.<br /><br /><b>Bob: </b>Do you see different monetization strategies for your games, per platform or is there one approach you take for a given title?<br /><br /><b>Tomas: </b>We aim to make a great game that players will love and hope that does the rest!  We know this is a gamer's game; it's not an Angry Birds, so we've been working hard to make sure our core audience of fellow gamers, geeks and Mythos fans know of us and what we're doing.<br /><br /><b>Bob: </b>Tell us a bit about your approach to marketing and outreach with a game like this; it seems I’m already hearing things on Twitter.<br /><br /><b>Tomas:</b> We started outreaching to people when development first started.  I was already blogging about Cthulhu related stuff as a fan (http://agreatbecoming.com/category/cthulhu-thursday/) so expanding this to talk about the game came naturally.  We started building relationships with the relevant press right from the beginning too which continue to grow and their support has been invaluable.</div>
<div><br /></div>
<div><b><span ><br /></span></b></div>
<div><b><span >Section Three: Closing Thoughts</span></b>
<div><br /><b>Bob:</b> What was it like working with Intel and the AppUp European team?<img width="300" src="http://redwaspdesign.files.wordpress.com/2011/12/callofcthulhu_wastedland_hd_07.png"  /><br /><br /><b>Tomas:</b> They've been great.  We've had lots of tech and marketing support and we've had no problems at all.  We're keen to stay working with Intel and AppUp.<br /><br /><b>Bob:</b> With this project being delivered to AppUp, would you do anything different next time?<br /><br /><b>Tomas:</b> Every time you make a game you always look back at it and see plenty you'd change the next time around.  One of the great things about digital delivery is we can (and have) iterated the game.<br /><br /><b>Bob:</b> Any recommendations to developers on working with Intel and developing apps for Ultrabooks and AppUp?<br /><br /><b>Tomas:</b> Do it (develop, I mean).  The PC is a great gaming platform and Ultrabooks and AppUp are taking advantage of that; you should too!<br /> <br />People can follow the game's progress and find out much more about the game at our site: <a target="_blank" href="http://www.redwaspdesign.com">www.redwaspdesign.com</a> or on twitter <a target="_blank" href="http://twitter.com/redwaspdesign">@redwaspdesign</a> or on Facebook <a target="_blank" href="http://www.facebook.com/redwaspdesign">www.facebook.com/redwaspdesign</a> <br /></div>
<div><br /></div>
<div>If you have additional thoughts and experiences please reach out to me on Twitter <a href="http://twitter.com/bobduffy">@bobduffy</a> and/or share them by responding in our comments section.<br /></div>
<div><br /></div>
<div>***</div>
<div><br /></div>
<div>For more information on developing Ultrabook apps for distribution and monetization via Intel AppUp, visit our <a href="http://software.intel.com/en-us/ultrabook">Ultrabook Community</a></div>
</div>
</div>
</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/game-craft-journey-for-cthulhu-pc-debut-on-ultrabooks/</link>
      <pubDate>Mon, 30 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/game-craft-journey-for-cthulhu-pc-debut-on-ultrabooks/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/game-craft-journey-for-cthulhu-pc-debut-on-ultrabooks/</guid>
      <category>Visual Computing</category>
      <category>Ultrabook</category>
      <category>Software Business Network</category>
    </item>
    <item>
      <title>OpenCL* Device Fission for CPU Performance</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/43564" target="_blank">OpenCL* Device Fission for CPU Performance</a> [PDF 762KB]<br /> <br />
<h2 class="sectionHeading">Summary</h2>
Device fission is an addition to the OpenCL* specification that gives more power and control to OpenCL programmers over managing which computational units execute OpenCL commands. Fundamentally, device fission allows the sub-dividing of a device into one or more sub-devices, which, when used carefully, can provide a performance advantage, especially when executing on CPUs.<br /> <br /> The newly released Intel® SDK for OpenCL* Applications 2012 is a comprehensive software development environment for OpenCL applications on 3rd generation Intel® Core™ processor family-based platforms. This SDK also provides developers with the ability to develop and target OpenCL applications on Intel CPUs of previous generations using both the Windows* and Linux* operating systems.<br /> <br /> The Intel SDK for OpenCL Applications 2012 provides a rich mix of OpenCL extensions and optional features that are designed for developers who want to utilize all resources available on Intel CPUs. This article focuses on device fission, available as an OpenCL 1.1 extension with this version of the SDK. Download your FREE copy of the Intel SDK for OpenCL Applications 2012 at: <a href="http://www.intel.com/software/opencl">www.intel.com/software/opencl</a>.<br /><br /> <br />
<h2 class="sectionHeading">What is Device Fission?</h2>
The OpenCL specification is composed of a hierarchy of several models including the Platform, Execution, Memory, and Programming Models. The highest level model, the Platform Model, consists of a host processor connected to one or more OpenCL devices. OpenCL devices execute commands submitted to them by the host processor. A device can be a CPU, GPU, or other accelerator device. A device further comprises one or more computational (or compute) units. For example, for a multicore CPU, a computational unit is a thread executing on a core. For a GPU, a computational unit is a thread executing on a stream processor or streaming multiprocessor (SM). As the number of computational units and threads has grown over time, it has become beneficial to exert more control over these resources, rather than treating them as a single homogeneous computing resource.<br /> <br /> To help control which computational units execute OpenCL commands, an important addition, named Device Fission, was made to the OpenCL specification to give more power to the OpenCL programmer.  Device fission is defined in the OpenCL 1.2 specification (and is available as an OpenCL 1.1 extension).<br /> <br /> Device fission is a useful feature that allows the sub-dividing of a device into two or more sub-devices. Google dictionary defines fission as “the action of dividing or splitting something into two or more parts.” After identifying and selecting a device from an OpenCL platform, you can further split the device into one or more sub-devices.<br /> <br /> There are several methods available for determining how sub-devices are created. Each sub-device can have its own context and work queue and its own program if needed. This enables more advanced task parallelism across the work queues.<br /> <br /> A sub-device acts just like a device would act in the OpenCL API. An API call with a device as a parameter can have a sub-device as a parameter. 
In other words, there are no special APIs for sub-devices, other than for creating one. Just like a device, a context or a command queue can be created for a sub-device. Using a sub-device allows you to refer to specific computational units within the original device.<br /> <br /> Sub-devices can also be further sub-divided into more sub-devices. Each sub-device has a parent device from which it was derived. Creating sub-devices does not destroy the original parent device. The parent device and all descendent sub-devices can be used together if needed.<br /> <br /> Device fission can be considered an advanced feature that can improve the performance of OpenCL code and/or manage compute resources efficiently. Using device fission does require some knowledge of the underlying target hardware. Device fission should be used carefully and may impact code portability and performance if not used properly.<br /><br /> <br />
<h2 class="sectionHeading">Why Use Device Fission?</h2>
In general, device fission allows the programmer to have greater control over the hardware platform by selecting which computational units are used by the OpenCL runtime to execute commands. The reason this control is useful is that, if used properly, it can provide better OpenCL performance or make the overall platform more efficient.<br /> <br /> Here are some example cases where device fission is useful.<br /> <br /> 
<ul>
<li>Device fission allows the use of a portion of a device. This is useful when there is other non-OpenCL work on the device that needs resources. It can guarantee the entire device is not taken by the OpenCL runtime.</li>
<li>Device fission can allow specialized sharing among work-items such as sharing an L3 cache or sharing a NUMA node.</li>
<li>Device fission can allow a set of sub-devices to be created, each with its own command queue. This lets the host processor control these queues and dispatch work to the sub-devices as needed.</li>
<li>Device fission allows specific sub-devices to be used to take advantage of data locality.</li>
</ul>
Later in this paper, strategies for using device fission are discussed in more detail, but first we’ll show how to code for device fission in OpenCL 1.2.<br /><br /> <br />
<h2 class="sectionHeading">How to Use Device Fission in OpenCL* 1.2</h2>
This section provides an overview on how to use device fission and create sub-devices in OpenCL 1.2. Refer to section 4.3 (Partitioning a Device) of the OpenCL 1.2 specification for further details.<br /> <br /> There are several partitioning types and options available when creating sub-devices. The three basic options for determining how to split or partition the device are:       
<ul>
<li>Equally – Partition the device into as many sub-devices as can be created, each containing a given number of computational units.</li>
<li>By Counts – Partition the device based on a given number of computational units in each sub-device. A list of the desired number of compute units per sub-device can be provided.</li>
<li>By Affinity Domain – Partition the device based on the affinity of the compute units to share the same level of cache hierarchy or to share a NUMA node. Sub-devices can be created from compute units that share the same: 		      
<ul>
<li>NUMA node</li>
<li>L4 Cache</li>
<li>L3 Cache</li>
<li>L2 Cache</li>
<li>L1 Cache</li>
<li>Next partitionable affinity domain</li>
</ul>
</li>
</ul>
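The arithmetic behind the first two options can be sketched in a few lines of C++. This is a minimal model rather than OpenCL code: <code>equal_partition_count</code> and <code>counts_fit</code> are hypothetical helpers, and the sketch assumes that compute units left over after an equal split simply go unused, that a request larger than the device is an error, and that a zero count is invalid.<br />

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: number of sub-devices an "Equally" partition yields
// for a device with total_units compute units split into groups of n.
// Returns -1 when the request cannot be satisfied (n is 0 or exceeds the
// device). Assumption: leftover compute units are simply not used.
int equal_partition_count(unsigned total_units, unsigned n) {
    if (n == 0 || n > total_units) {
        return -1;
    }
    const unsigned count = total_units / n;
    assert(count * n <= total_units);  // sanity check on the split
    return static_cast<int>(count);
}

// Hypothetical helper: a "By Counts" request fits when every count is
// non-zero and the counts together need no more units than the device has.
bool counts_fit(unsigned total_units, const std::vector<unsigned>& counts) {
    if (counts.empty()) {
        return false;
    }
    unsigned long long sum = 0;
    for (unsigned c : counts) {
        if (c == 0) {
            return false;  // assumption of this sketch: zero counts are invalid
        }
        sum += c;
    }
    return sum <= total_units;
}
```

For a device with 16 compute units, <code>equal_partition_count(16, 8)</code> gives 2 sub-devices, while a request for groups of 32 is rejected with -1.<br /> <br />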
“Next partitionable affinity domain” partitions the device along the next partitionable affinity domain, starting with NUMA and then proceeding down through L4, L3, L2, and finally L1 cache until it finds the first level at which the device can be further subdivided. For most NUMA platforms where the caches are integrated in the node, the next partitionable affinity domain is NUMA. For a non-NUMA platform, it would typically be the outermost cache level.<br /> <br /> These options are controlled by the programmer through a list of properties provided as a parameter in the clCreateSubDevices call, which is described in the next section.<br /> <br /> The partitioning types supported by the OpenCL implementation can be queried (described later in this article).<br /><br /> <br /> <b>Create a Sub-device</b><br /> The clGetDeviceIDs call in OpenCL finds the available OpenCL devices in a platform. Once a device is found, you can then create one or more sub-devices using the clCreateSubDevices call. This is normally completed after the selection of the device and before creating the OpenCL context.<br /> <br /> The clCreateSubDevices call is:<br />
<pre name="code" class="cpp">cl_int clCreateSubDevices (
	cl_device_id in_device,
	const cl_device_partition_property *properties,
	cl_uint num_devices,
	cl_device_id *out_devices,
	cl_uint *num_devices_ret)
</pre>
<ul>
<li><code>in_device:</code> The id of the device to be partitioned.</li>
<li><code>properties:</code> List of properties to specify how the device is to be partitioned. This is discussed below in more detail.</li>
<li><code>num_devices:</code> Number of sub-devices (used to size the memory for out_devices).</li>
<li><code>out_devices:</code> Buffer for the sub-devices created.</li>
<li><code>num_devices_ret:</code> Returns the number of sub-devices that device may be partitioned into according to the partitioning scheme specified in properties. If num_devices_ret is NULL, it is ignored.</li>
</ul>
<b>Partition Properties</b><br /> Understanding the partition properties is key for partitioning the device into sub-devices. After deciding the type of partitioning (Equally, By Counts, or By Affinity Domain), develop the list of properties to pass as a parameter in the clCreateSubDevices call. The property list begins with the type of partitioning to be used, followed by additional properties that further define the type of partitioning and other information, and then finally the list ends with a 0 value. The property list examples in the next section help illustrate the concept.<br /> <br /> The partition property that starts the property list is the type of partitioning:<br /> 
<ul>
<li>CL_DEVICE_PARTITION_EQUALLY</li>
<li>CL_DEVICE_PARTITION_BY_COUNTS</li>
<li>CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN</li>
</ul>
The next value in the list depends on the partition type:<br /> <br /> 
<ul>
<li>CL_DEVICE_PARTITION_EQUALLY is followed by N, the number of compute units for each sub-device. The device is partitioned into as many sub-devices as can be created that have N compute units in each sub-device.</li>
<li>CL_DEVICE_PARTITION_BY_COUNTS is followed by a list of compute unit counts. For each number in the list, a sub-device is created with that many compute units. The list of compute unit counts is terminated by CL_DEVICE_PARTITION_BY_COUNTS_LIST_END.</li>
<li>CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN is followed by the type of partitioning for the affinity, either NUMA, L4_CACHE, L3_CACHE, L2_CACHE, L1_CACHE, or Next Partitionable: 	      
<ul>
<li>CL_DEVICE_AFFINITY_DOMAIN_NUMA</li>
<li>CL_DEVICE_AFFINITY_DOMAIN_L4_CACHE</li>
<li>CL_DEVICE_AFFINITY_DOMAIN_L3_CACHE</li>
<li>CL_DEVICE_AFFINITY_DOMAIN_L2_CACHE</li>
<li>CL_DEVICE_AFFINITY_DOMAIN_L1_CACHE</li>
<li>CL_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE</li>
</ul>
</li>
</ul>
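The layout rules above can be sketched as two small builder functions. This is only an illustration under stated assumptions: <code>partition_property</code>, <code>kPartitionEqually</code>, <code>kPartitionByCounts</code>, and <code>kByCountsListEnd</code> are hypothetical stand-ins with placeholder values so that the sketch compiles without the OpenCL headers. Real code would use <code>cl_device_partition_property</code> and the CL_DEVICE_PARTITION_* constants from the OpenCL header instead.<br />

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins so the sketch compiles without OpenCL headers. The numeric
// values are placeholders, NOT the real OpenCL values; real code should
// use cl_device_partition_property and the CL_DEVICE_PARTITION_* constants.
using partition_property = std::intptr_t;
const partition_property kPartitionEqually = 101;   // placeholder value
const partition_property kPartitionByCounts = 102;  // placeholder value
const partition_property kByCountsListEnd = 103;    // placeholder value

// Builds { EQUALLY, n, 0 }: partition type, units per sub-device, terminator.
std::vector<partition_property> equally_list(unsigned n) {
    return {kPartitionEqually, static_cast<partition_property>(n), 0};
}

// Builds { BY_COUNTS, c0, c1, ..., LIST_END, 0 }: partition type, one count
// per requested sub-device, the counts terminator, then the list terminator.
std::vector<partition_property> by_counts_list(const std::vector<unsigned>& counts) {
    std::vector<partition_property> props;
    props.push_back(kPartitionByCounts);
    for (unsigned c : counts) {
        assert(c > 0);  // assumption of this sketch: every count is non-zero
        props.push_back(static_cast<partition_property>(c));
    }
    props.push_back(kByCountsListEnd);
    props.push_back(0);
    return props;
}
```

Once the stand-ins are replaced by the real OpenCL type and constants, the vector's <code>data()</code> pointer is what would be passed as the <code>properties</code> argument of clCreateSubDevices.<br /> <br />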
The last value in the property list is always 0.<br /><br /><br /> <b>Property List Examples</b><br /> This section contains examples of property lists.<br /> <br /> To make the examples concrete, we use an example target machine as our device. The target machine is a NUMA platform with 2 processors, each with 4 cores. There are a total of 8 physical cores in the machine. Intel Hyper-Threading Technology is enabled. There are a total of 16 logical threads in the machine. Each processor has an L3 cache that all 4 of its cores share. Each core has private L1 and L2 caches. With Hyper-Threading Technology enabled, each core has two threads, so each L1 and L2 cache is shared between two threads. There is no L4 cache. See Figure 1.<br /> <br /> <img src="http://software.intel.com/file/43561" /><br /> <b>Figure 1.</b> <i>Configuration of the Target Machine for Property List Examples</i><br /> <br /> The following table shows examples of property lists, assuming that the OpenCL implementation supports that particular partition type.<br /> <br /> Notice the property lists always begin with the type of partitioning and end with a 0.<br /> <br /> <b>Table 1.</b> <i>Property List Examples</i><br /> <br /> 
<table class="tableFormat1">
<tbody>
<tr>
<th>Property List</th> <th>Description</th> <th>Result on the Example Target Machine</th>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_EQUALLY, 8, 0 }</td>
<td>Partition the device into as many sub-devices as possible, each with 8 compute units.</td>
<td>2 sub-devices, each with 8 threads.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_EQUALLY, 4, 0 }</td>
<td>Partition the device into as many sub-devices as possible, each with 4 compute units.</td>
<td>4 sub-devices, each with 4 threads.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_EQUALLY, 32, 0 }</td>
<td>Partition the device into as many sub-devices as possible, each with 32 compute units.</td>
<td>Error! 32 exceeds the CL_DEVICE_PARTITION_MAX_COMPUTE_UNITS.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_COUNTS, 3, 1, CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0 }</td>
<td>Partition the device into two sub-devices, one with 3 compute units and one with 1 compute unit.</td>
<td>1 sub-device with 3 threads and 1 sub-device with 1 thread.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_COUNTS, 2, 2, 2, 2, CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0 }</td>
<td>Partition the device into four sub-devices, each with 2 compute units.</td>
<td>4 sub-devices, each with 2 threads.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,  CL_DEVICE_AFFINITY_DOMAIN_NUMA, 0 }</td>
<td>Partition the device into sub-devices that share a NUMA node.</td>
<td>2 sub-devices with 8 threads each. Each sub-device is located on its own NUMA node.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,  CL_DEVICE_AFFINITY_DOMAIN_L1_CACHE, 0 }</td>
<td>Partition the device into sub-devices that share an L1 cache.</td>
<td>8 sub-devices with 2 threads each. Each L1 cache is shared only by the two threads of its core, not across cores.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,  CL_DEVICE_AFFINITY_DOMAIN_L2_CACHE, 0 }</td>
<td>Partition the device into sub-devices that share an L2 cache.</td>
<td>8 sub-devices with 2 threads each. Each L2 cache is shared only by the two threads of its core, not across cores.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,  CL_DEVICE_AFFINITY_DOMAIN_L3_CACHE, 0 }</td>
<td>Partition the device into sub-devices that share an L3 cache.</td>
<td>2 sub-devices with 8 threads each. The L3 cache is shared among all 8 threads within each processor.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,  CL_DEVICE_AFFINITY_DOMAIN_L4_CACHE, 0 }</td>
<td>Partition the device into sub-devices that share an L4 cache.</td>
<td>Error! There is no L4 cache.</td>
</tr>
<tr>
<td>{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,  CL_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE, 0 }</td>
<td>Partition the device based on the next partitionable domain. In this case, it is NUMA.</td>
<td>2 sub-devices with 8 threads each. Each sub-device is located on its own NUMA node.</td>
</tr>
</tbody>
</table>
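The arithmetic behind the Partition Equally and Partition By Counts rows of Table 1 can be sketched in a few lines of C++. This is only a model under stated assumptions, not OpenCL code: equalPartitionCount and countsFit are hypothetical helpers, and a real implementation reports failures through error codes rather than the 0 and false used here.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model, not an OpenCL call: the number of sub-devices that
// Partition Equally yields on a device with totalUnits compute units.
// Returns 0 when the request cannot be met, standing in for the error a
// real implementation would report.
std::size_t equalPartitionCount(std::size_t totalUnits, std::size_t unitsPerSubDevice)
{
    assert(totalUnits > 0);  // a device exposes at least one compute unit
    if (unitsPerSubDevice == 0 || unitsPerSubDevice > totalUnits)
        return 0;
    return totalUnits / unitsPerSubDevice;  // integer division: any remainder is left over
}

// Hypothetical model: a Partition By Counts request fits only if the list is
// non-empty and the counts add up to no more than the device offers.
bool countsFit(std::size_t totalUnits, const std::vector<std::size_t>& counts)
{
    const std::size_t requested =
        std::accumulate(counts.begin(), counts.end(), std::size_t{0});
    return !counts.empty() && requested <= totalUnits;
}
```

With the 16 compute units of the example target machine, equalPartitionCount(16, 8) gives 2 and equalPartitionCount(16, 32) gives 0, which matches the first and third rows of the table.<br />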
<br /> <b><br />Hyper-Threading Technology and Compute Units</b><br /> If Hyper-Threading Technology is enabled, a compute unit is equivalent to a thread, and two threads share one core. If Hyper-Threading Technology is disabled, a compute unit is equivalent to a core, and one thread executes on the core. Code should be written to handle either case.<br /><br /> <br />
<h2 class="sectionHeading">Contexts for Sub-devices</h2>
Once the sub-devices are created, we can create contexts for them using the clCreateContext call. Note that if we use clCreateContextFromType to create a context from a given type of device, the context created does not reference any sub-devices that have been created from devices of that type.<br /><br /> <br />
<h2 class="sectionHeading">Programs for Sub-devices</h2>
Just as a program can be created for a device, a different program can be created for each sub-device. This is an efficient way to implement task parallelism.<br /> <br /> An alternative is to share a program among devices and sub-devices. Program binaries can be shared among devices and sub-devices: a program binary built for one device can be used with all of the sub-devices created from that device. If there is no program binary for a sub-device, the parent program will be used.<br /><br /> <br />
<h2 class="sectionHeading">Partitioning a Sub-device</h2>
Once a sub-device is created, it can be further partitioned by creating sub-devices from a sub-device. The relationship of devices forms a tree, with the original device as the root device at the top of the tree.<br /> <br /> Each sub-device will have a parent device. The root device will not have a parent.<br /> <br /> <img src="http://software.intel.com/file/43562" /><br /> <b>Figure 2.</b> <i>Device Partitioning Example</i><br /> <br /> In Figure 2, we show an example of a device being partitioned first using Partition By Affinity Domain – NUMA, and then one of the sub-devices being partitioned using Partition Equally.<br /><br /> <br />
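The parent relationship described above can be sketched as a walk up the device tree. The DeviceNode type and both helpers are hypothetical stand-ins so that the example compiles without the OpenCL headers; in real code the parent of a sub-device would come from a CL_DEVICE_PARENT_DEVICE query rather than from a stored pointer.<br />

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a device handle. In real code the parent would
// come from a CL_DEVICE_PARENT_DEVICE query rather than a stored pointer.
struct DeviceNode {
    const DeviceNode* parent;   // nullptr for the root device
    unsigned computeUnits;
};

// Walks parent links until it reaches the device that has no parent.
const DeviceNode* findRoot(const DeviceNode* device)
{
    assert(device != nullptr);
    while (device->parent != nullptr)
        device = device->parent;
    return device;
}

// Number of partitioning steps between the device and the root (0 for the root).
std::size_t partitionDepth(const DeviceNode* device)
{
    assert(device != nullptr);
    std::size_t depth = 0;
    for (; device->parent != nullptr; device = device->parent)
        ++depth;
    return depth;
}
```

For the layout in Figure 2, a sub-device created by Partition Equally from one of the NUMA sub-devices sits two steps below the root device.<br />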
<h2 class="sectionHeading">Query a Sub-device</h2>
The clGetDeviceInfo call has several additions to access sub-device related information.<br /> <br /> Prior to creating sub-devices, we can query a device using clGetDeviceInfo to see:<br /> <br /> 
<ul>
<li>CL_DEVICE_PARTITION_MAX_SUB_DEVICES: Maximum number of sub-devices that can be created for this device.</li>
<li>CL_DEVICE_PARTITION_PROPERTIES: Partition Types that are supported by this device.</li>
<li>CL_DEVICE_PARTITION_AFFINITY_DOMAIN: List of supported affinity domains for partitioning the device using Partition By Affinity Domain.</li>
</ul>
Of course, we recommend checking that the Partition Type you want to use is supported. Some OpenCL implementations may not support all types.<br /> <br /> After creating sub-devices, we can query sub-devices the same way devices are queried. Through querying, we can discover things like:<br /> <br /> 
<ul>
<li>CL_DEVICE_PARENT_DEVICE: Parent device for the given sub-device.</li>
<li>CL_DEVICE_PARTITION_TYPE: Current partition type in use for this sub-device.</li>
</ul>
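The recommended check that a partition type is supported can be sketched as a scan over the list of supported types. The sketch assumes that the list has already been fetched with clGetDeviceInfo and CL_DEVICE_PARTITION_PROPERTIES. PartitionProperty, the three k... tokens, and supportsPartitionType are hypothetical stand-ins with placeholder values, and treating a 0 entry as the end of the list is an assumption of this sketch.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for cl_device_partition_property so that the sketch compiles
// without the OpenCL headers.
using PartitionProperty = std::intptr_t;

// Placeholder token values for illustration only; real code would use
// CL_DEVICE_PARTITION_EQUALLY and its siblings from the OpenCL headers.
constexpr PartitionProperty kPartitionEqually          = 1;
constexpr PartitionProperty kPartitionByCounts         = 2;
constexpr PartitionProperty kPartitionByAffinityDomain = 3;

// Returns true if wanted appears in the first count entries of supported.
// A 0 entry is treated as "no further partition types" and ends the scan.
bool supportsPartitionType(const PartitionProperty* supported,
                           std::size_t count,
                           PartitionProperty wanted)
{
    assert(supported != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (supported[i] == 0)
            return false;
        if (supported[i] == wanted)
            return true;
    }
    return false;
}
```

Code that calls a helper like this before building a property list can fall back to another partition type, or to the unpartitioned device, when the preferred type is missing.<br />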
A query to a root device and all of its descendant sub-devices should return the same values for almost all queries. For example, when queried, the root device and all descendant sub-devices should return the same CL_DEVICE_TYPE or CL_DEVICE_NAME. The exceptions are the following queries:<br /> <br /> 
<ul>
<li>CL_DEVICE_GLOBAL_MEM_CACHE_SIZE</li>
<li>CL_DEVICE_BUILT_IN_KERNELS</li>
<li>CL_DEVICE_PARENT_DEVICE</li>
<li>CL_DEVICE_PARTITION_TYPE</li>
<li>CL_DEVICE_REFERENCE_COUNT</li>
<li>CL_DEVICE_MAX_COMPUTE_UNITS</li>
<li>CL_DEVICE_PARTITION_MAX_SUB_DEVICES</li>
</ul>
<p> </p>
<h2 class="sectionHeading">Release and Retain Sub-device</h2>
There are two calls that allow the programmer to maintain the reference count of a sub-device. We can increment the reference count (retain) or decrement the reference count (release) just like other OpenCL objects. clRetainDevice increments the reference count for the given sub-device. clReleaseDevice decrements the reference count for the given sub-device.<br /><br /><br /> <br />
<h2 class="sectionHeading">Other Considerations</h2>
Here are some items to check for in your code when using device fission.<br /> <br /> Check to see that device fission is supported for your device. Check the maximum number of sub-devices that can be created.<br /> <br /> Check to see that the device fission partition type is supported. This can be checked using the clGetDeviceInfo call.<br /> <br /> After creating the sub-devices, check to see that they were indeed created correctly. For example, if you are using Partition By Affinity Domain – L3 Cache, check to see that the expected number of sub-devices was created.<br /> <br /> It is also important to make your code robust and able to handle future changes in the target hardware architecture. Consider how the code will execute on a target machine with:<br /> <br /> 
<ul>
<li>New or different cache hierarchy</li>
<li>NUMA or Non-NUMA platforms</li>
<li>More or fewer compute units</li>
<li>Heterogeneous compute nodes</li>
<li>Hyper-Threading Technology enabled or disabled</li>
</ul>
<p> </p>
<h2 class="sectionHeading">Device Fission Code Examples</h2>
In this section, we show some simple code examples to demonstrate device fission.<br /> <br /> <b><br />Code Example #1 - Partition Equally</b><br /> In this code example, we use Partition Equally to divide the device into as many sub-devices as possible, each with four compute units. (Error checking on OpenCL calls is omitted.)<br />
<pre name="code" class="cpp">// Get Device ID from selected platform:

clGetDeviceIDs( platforms[platform], CL_DEVICE_TYPE_CPU, 1, &amp;device_id, &amp;numDevices);

// Create sub-device properties: Equally with 4 compute units each:

cl_device_partition_property props[3];
props[0] = CL_DEVICE_PARTITION_EQUALLY;  // Equally
props[1] = 4;                            // 4 compute units per sub-device
props[2] = 0;                            // End of the property list

cl_device_id subdevice_id[8];
cl_uint num_entries = 8;

// Create the sub-devices:

clCreateSubDevices(device_id, props, num_entries, subdevice_id, &amp;numDevices);

// Create the context:

context = clCreateContext(cprops, 1, subdevice_id, NULL, NULL, &amp;err);
</pre>
<b><br />Code Example #2  - Partition By Counts</b><br /> In this code example, we partition the device by counts with one sub-device with 2 compute units and one sub-device with 4 compute units.  (Error checking on OpenCL calls is omitted).<br />
<pre name="code" class="cpp">// Get Device ID from selected platform:

clGetDeviceIDs( platforms[platform], CL_DEVICE_TYPE_CPU, 1, &amp;device_id, &amp;numDevices);

// Create two sub-device properties: Partition By Counts

cl_device_partition_property props[5];
props[0] = CL_DEVICE_PARTITION_BY_COUNTS; // By Counts
props[1] = 2;                             // 2 compute units 
props[2] = 4;                             // 4 compute units 
props[3] = CL_DEVICE_PARTITION_BY_COUNTS_LIST_END; // End Count list
props[4] = 0;                             // End of the property list

cl_device_id subdevice_id[2];
cl_uint num_entries = 2;

// Create the sub-devices:

clCreateSubDevices(device_id, props, num_entries, subdevice_id, &amp;numDevices);

// Create the context:

context = clCreateContext(cprops, 1, subdevice_id, NULL, NULL, &amp;err);
</pre>
<b><br />Code Example #3  - Partition By Affinity Domain (NUMA)</b><br /> In this code example, we partition the device using Partition By Affinity Domain – NUMA. (Error checking on OpenCL calls is omitted).<br />
<pre name="code" class="cpp">// Get Device ID from selected platform:

clGetDeviceIDs( platforms[platform], CL_DEVICE_TYPE_CPU, 1, &amp;device_id, &amp;numDevices);

// Create sub-device properties: Partition By Affinity Domain - NUMA

cl_device_partition_property props[3];
props[0] = CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN; // By Affinity
props[1] = CL_DEVICE_AFFINITY_DOMAIN_NUMA;         // NUMA
props[2] = 0;                                      // End of the property list

cl_device_id subdevice_id[8];
cl_uint num_entries = 8;

// Create the sub-devices:

clCreateSubDevices(device_id, props, num_entries, subdevice_id, &amp;numDevices);

// Create the context:

context = clCreateContext(cprops, 1, subdevice_id, NULL, NULL, &amp;err);
</pre>
<h2 class="sectionHeading">Strategies for Using Device Fission</h2>
In this section, we discuss some different strategies for using device fission to improve the performance of OpenCL programs or to manage the compute resources efficiently. The strategies are not mutually exclusive as one or more strategies may be used together.<br /> <br /> One pre-requisite to leveraging the strategies is to truly understand the characteristics of your workload and how it performs on the intended platform. The more you know about the workload, the better you will be able to take advantage of the platform.<br /><br /> <br /> <b>Strategy #1: Create a High Priority Task</b><br /> Device fission can be used to create a sub-device for a high priority task to execute on dedicated cores. To ensure that a high priority task has adequate resources to execute when it needs to, reserving one or more cores for that task makes sense. The idea is to keep other less critical tasks from interfering with the high priority task. The high priority task can take advantage of all of the cores’ resources.<br /> <br /> <i>Strategy:</i> Use Partition By Counts to create a sub-device with one or more cores and another sub-device with the remaining cores. The selected cores can be exclusively dedicated to the high-priority task running on that sub-device. Other lower priority tasks can be dispatched to the other sub-device. <br /><br /> <br /> <b>Strategy #2: Leverage Shared Cache or Common NUMA Node</b><br /> If the workload exhibits a high level of data sharing between work items in the program, then creating a sub-device where all of the compute units share a cache or are located within the same NUMA node can improve performance. Without device fission, there is no guarantee that the work items will share a cache or share the same NUMA node.  <br /><br /><i>Strategy:</i> Create sub-devices that share a common L3 cache or are co-located on the same NUMA node. 
Use Partition By Affinity to create a sub-device for sharing an L3 cache or NUMA node.<br /><br /> <br /> <b>Strategy #3: Exploit Data Re-Use and Affinity</b><br /> Without device fission, submitting work to a work queue may dispatch it to a previously unused or “cold” core. A “cold” core is one whose instruction and data caches and TLBs (cache for address translations) may not have any relevant data and instructions for the OpenCL program. It will take time for data and instructions to be brought into the core and placed into caches and TLBs. Normally this is not an issue, but this can be a problem if the code does not run for a significant period of time. By the time the program warms up the processor caches, the program may have reached its end. Typically, this is not critical for medium and long running programs. The time penalty for warming up the processor can be amortized across longer execution times and it is normally not an issue. For very short running programs, however, it can be an issue. In this case, we need to take advantage of warmed processors by ensuring that subsequent executions of a program are routed to the same processors as previously used. This can also arise when larger applications are created from many smaller programs. The program executing before the current one accesses the data and brings it into the processor. The subsequent program can take advantage of that work. <br /> <br /> <i>Strategy:</i> Use Partition By Counts or Partition By Affinity to create a sub-device to specify specific cores for the work queue. Try to re-use the core’s warm caches and TLBs, especially for short running programs.<br /><br /> <br /> <b>Strategy #4: Enable Task Parallelism</b><br /> For certain types of programs, device fission can provide an improved environment for enabling task parallelism. Support for task parallelism is inherent in OpenCL with the ability to create multiple work queues for a device. 
The ability to create sub-devices can take that model to an even higher level. Creating sub-devices, each with its own work queue, allows more sophisticated task parallelism and runtime control. Examples are applications that act like “flow graphs,” where dependencies among the various tasks that make up the application help determine program execution. The tasks within the program can be modeled like nodes in a graph. The node edges or connections to other nodes model the task dependencies. For complex dependencies, multiple work queues with multiple sub-devices allow tasks to be dispatched independently and can ensure that forward progress is made.<br /> <br /> You can also create different sub-devices with different characteristics. The sub-device can be created while keeping in mind the types of tasks it will execute. There also may be cases where the host wants to or needs to balance the work across these work queues rather than leaving it to the OpenCL runtime.<br /> <br /> <i>Strategy:</i> Enable task parallelism by creating a set of sub-devices using Partition By Affinity or Partition Equally. Create a work queue for each sub-device. Dispatch work items to the work queues. The host can then manage the work across multiple work queues.<br /><br /> <br /> <b>Strategy #5: High Throughput</b><br /> There may be cases where absolute throughput is important, but data sharing is not. Suppose we have high throughput jobs to execute on a multiprocessor NUMA platform but there is limited or no data sharing between the jobs. Each job needs maximum throughput, e.g., it can use all of the available resources like on-chip caches. In this case, we might get the best performance if the jobs were executed on different NUMA nodes. We want to ensure that the jobs are not executed on a single NUMA node and have to compete for resources.<br /> <br /> <i>Strategy:</i> Use Partition By Affinity to create N sub-devices, one sub-device for each NUMA node. 
Each sub-device can then use all of its NUMA node’s resources, including all of the available cache.<br /><br /> <br />
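Strategy #1 can be sketched as a small helper that builds the Partition By Counts property list: one sub-device with the reserved compute units and one with the rest. The sketch assumes stand-in names (PartitionProperty, kPartitionByCounts, kCountsListEnd, reserveUnitsProps) with placeholder values so that it compiles without the OpenCL headers; real code would use cl_device_partition_property, CL_DEVICE_PARTITION_BY_COUNTS, and CL_DEVICE_PARTITION_BY_COUNTS_LIST_END.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for cl_device_partition_property so that the sketch compiles
// without the OpenCL headers.
using PartitionProperty = std::intptr_t;

// Placeholder token values for illustration only; real code would take
// these from the OpenCL headers.
constexpr PartitionProperty kPartitionByCounts = 2;
constexpr PartitionProperty kCountsListEnd     = 0;

// Builds a By Counts property list with two sub-devices: one holding
// reservedUnits compute units for the high priority task, and one holding
// the remaining compute units for everything else.
std::vector<PartitionProperty> reserveUnitsProps(std::size_t totalUnits,
                                                 std::size_t reservedUnits)
{
    assert(reservedUnits > 0 && reservedUnits < totalUnits);
    return { kPartitionByCounts,
             static_cast<PartitionProperty>(reservedUnits),
             static_cast<PartitionProperty>(totalUnits - reservedUnits),
             kCountsListEnd,   // ends the list of counts
             0 };              // ends the property list
}
```

On the 16-thread example machine, reserveUnitsProps(16, 2) describes one sub-device with 2 compute units for the high priority task and one with the remaining 14.<br />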
<h2 class="sectionHeading">Conclusion</h2>
To summarize, device fission is an addition to the OpenCL specification that gives more power and control to the OpenCL programmer to manage which compute units execute OpenCL commands. By sub-dividing a device into one or more sub-devices, we can control where OpenCL programs execute; used carefully, this can provide better performance and make more efficient use of the available compute resources.<br /> <br /> The Device Fission extension for OpenCL 1.1 is available on the OpenCL CPU device supported by the Intel SDK for OpenCL Applications 2012. The SDK is available at <a href="http://www.intel.com/software/opencl">www.intel.com/software/opencl</a>.<br /><br /> <br />
<h2 class="sectionHeading">APPENDIX: Device Fission in OpenCL 1.1</h2>
In OpenCL 1.1, device fission is available as an extension: OpenCL Extension #11 (cl_ext_device_fission), dated June 9, 2010. This section highlights most of the programming differences between the 1.1 extension and the OpenCL 1.2 specification.<br /> <br /> It is recommended to generally follow the OpenCL 1.2 API for device fission as the 1.1 Extension may be deprecated in the future.<br /><br /> <br /> <b>Include File</b><br /> The include file for 1.1 Extensions, cl_ext.h, should be added to the code:<br />
<pre name="code" class="cpp">#include &lt;CL/cl_ext.h&gt;
</pre>
<b><br />Partition By Names</b><br /> The 1.1 Extension supports an additional partition type not supported in OpenCL 1.2: Partition By Names. This allows the programmer to specify a list of compute unit names to partition the device. See the Extension document for more information.<br /><br /> <br /> <b>Properties</b><br /> The following table shows the equivalent properties for OpenCL 1.2 and the 1.1 Extension.<br /> <br /> 
<table class="tableFormat1" width="100%">
<tbody>
<tr>
<th width="50%">1.2 Property</th> <th width="50%">1.1 Extension Property</th>
</tr>
<tr>
<td>CL_DEVICE_PARTITION_EQUALLY</td>
<td>CL_DEVICE_PARTITION_EQUALLY_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_PARTITION_BY_COUNTS</td>
<td>CL_DEVICE_PARTITION_BY_COUNTS_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN</td>
<td>CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_AFFINITY_DOMAIN_NUMA</td>
<td>CL_AFFINITY_DOMAIN_NUMA_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_AFFINITY_DOMAIN_L4_CACHE</td>
<td>CL_AFFINITY_DOMAIN_L4_CACHE_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_AFFINITY_DOMAIN_L3_CACHE</td>
<td>CL_AFFINITY_DOMAIN_L3_CACHE_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_AFFINITY_DOMAIN_L2_CACHE</td>
<td>CL_AFFINITY_DOMAIN_L2_CACHE_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_AFFINITY_DOMAIN_L1_CACHE</td>
<td>CL_AFFINITY_DOMAIN_L1_CACHE_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE</td>
<td>CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT</td>
</tr>
</tbody>
</table>
<br /> <br />The end of the partition property list is the list terminator: CL_PROPERTIES_LIST_END_EXT.<br /> <br /> Please note most of the tokens above have different enumeration values between OpenCL 1.2 and the 1.1 Extension feature. The only exceptions are the list terminators.<br /><br /> <br /> <b>clGetDeviceInfo Selectors</b><br /> The table below lists matching cl_device_info values between the 1.1 Extension and OpenCL 1.2.<br /> <br /> 
<table class="tableFormat1" width="100%">
<tbody>
<tr>
<th width="50%">1.2 Selector</th> <th width="50%">1.1 Extension Selector</th>
</tr>
<tr>
<td>CL_DEVICE_PARENT_DEVICE</td>
<td>CL_DEVICE_PARENT_DEVICE_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_PARTITION_PROPERTIES</td>
<td>CL_DEVICE_PARTITION_TYPES_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_PARTITION_AFFINITY_DOMAIN</td>
<td>CL_DEVICE_AFFINITY_DOMAINS_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_REFERENCE_COUNT</td>
<td>CL_DEVICE_REFERENCE_COUNT_EXT</td>
</tr>
<tr>
<td>CL_DEVICE_PARTITION_TYPE</td>
<td>CL_DEVICE_PARTITION_STYLE_EXT</td>
</tr>
</tbody>
</table>
<br /> <b><br />API Changes</b><br /> The Create Sub Devices call in the 1.1 Extension is:<br />
<pre name="code" class="cpp">    cl_int clCreateSubDevicesEXT( 
        cl_device_id in_device,
        const cl_device_partition_property_ext * properties,
        cl_uint num_entries,
        cl_device_id *out_devices,
        cl_uint *num_devices );
</pre>
Note that sizeof(cl_device_partition_property_ext) differs from sizeof(cl_device_partition_property).<br /> <br /> The Retain/Release Device API calls have an EXT suffix. They behave identically to their OpenCL 1.2 counterparts.<br /><br /> <br /> <b>Behavior Changes</b><br /> The 1.1 Extension does not support binary inheritance from the parent device. Binaries must be explicitly built for sub-devices.<br /> <br /> The 1.1 Extension specifies that partitioning a device participating in a context created by clCreateContext causes the context to reference the resultant sub-devices. This behavior is not supported in the Intel 1.1 Extension implementation and was deprecated in the OpenCL 1.2 specification.<br /> <br />
<div id="vc-meta" >
<div id="vc-meta-author"></div>
<div id="vc-meta-pubdate">04-30-2012</div>
<div id="vc-meta-modificationdate">04-30-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div class="oclsdk">Intel® SDK for OpenCL* Applications</div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43562</div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Device fission is an addition to the OpenCL* specification that gives more power and control to OpenCL programmers over managing which computational units execute OpenCL commands. Fundamentally, device fission allows the sub-dividing of a device into one or more sub-devices, which, when used carefully, can provide a performance advantage, especially when executing on CPUs.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance/</link>
      <pubDate>Fri, 27 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance/</guid>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>Microsoft* DirectCompute on Intel® Ivy Bridge Processor Graphics</title>
      <description><![CDATA[ <b>By Wolfgang Engel</b><br /><br />
<h2 class="sectionHeading">Download Article</h2>
Download <a target="_blank" href="http://software.intel.com/file/43342">Microsoft* DirectCompute on Intel® Ivy Bridge Processor Graphics</a> [PDF 762KB]<br /><br /><br />Microsoft* DirectCompute exposes the compute functionality of graphics hardware as a new shader type: the compute shader. A compute shader is similar to a vertex, geometry, or pixel shader and offers a programming interface that makes the massively parallel computation power of graphics hardware available to tasks outside of the normal raster-based graphics pipeline exposed with Microsoft* Direct3D* or OpenGL*.<br /><br /><b>Note:</b> <i>Although DirectCompute was introduced with Microsoft* DirectX* 11, it is possible to run a compute shader on Microsoft* DirectX* 10-, 10.1-, and 11-class hardware.</i><br /><br />DirectCompute has several advantages over application programming interfaces (APIs) that also offer a compute solution. First, it integrates well with Direct3D*, which means that it has not only efficient interoperability with the Direct3D* resources like textures and buffers, but many of the concepts and the language syntax are already well known to Direct3D* programmers. In addition, the interface is more generalized and simplified when compared to other computing APIs. Like Direct3D*, DirectCompute guarantees consistent results across different hardware.<br /><br /><br />
<h2 class="sectionHeading">DirectCompute Applications</h2>
Unlike the vertex or pixel shader, DirectCompute does not have a fixed mapping between the data it is processing and the threads doing the processing. One thread can process one or many data elements, and the application can control directly how many threads are used to perform the computation.<br /><br />DirectCompute suits algorithms that do not fit the fixed mapping between data and threads that the vertex or pixel shader requires and that do not need the rasterizer. Typical use cases include:<br /><br />
<ul>
<li>Game physics and artificial intelligence;</li>
<li>Applying algorithms in image space that require kernels that are more flexible when it comes to the relationship between data used and the threads applied on this data; and</li>
<li>The many advanced rendering effects that use the Read/Write abilities, such as order-independent transparency, ray tracing, and global illumination effects.</li>
</ul>
<h2 class="sectionHeading">DirectCompute Memory Model</h2>
Graphics hardware is expected to have different memory types, each of which favors certain access patterns. In DirectCompute, you can differentiate between <i>register-based memory, device memory</i>, and <i>group shared memory</i>.<br /><br /><b><i>Register-based Memory</i></b><br />DirectCompute uses the same set of registers as the other programmable stages of the Direct3D* pipeline (<i>DirectCompute registers</i>). Because you can only program shaders in the High-level Shader Language (HLSL), those registers are not accessible to you. Nevertheless, the compiler can generate intermediate assembly. It is difficult to tell how much value the assembly code has, because it is only intermediate and might be changed by the driver of the underlying hardware.<br /><br />The most interesting register-based memory is the temporary registers. Overall, 4096 temporary and indexed temporary registers—which are array like—are available. The regular temporary registers are named <i>r#</i>, while the indexed temporary array registers are named <i>x#[n]</i>. Those registers might be shared among all threads in flight in a processing core or among several processing cores. The driver compiler selects and allocates these registers automatically, without any direct influence from the shader programmer. There is a strong chance that more complex shaders will consume more temporary registers, thereby “starving” the hardware of temporary registers. A good graphics hardware profiling tool will flag the fact that the hardware is running out of temporary registers. You can then attempt to rewrite parts of the shader to reduce the number of temporary registers used.<br /><br /><b><i>Device Memory</i></b><br />Although data in temporary registers only persists during execution of a shader program, you also need data that is kept over a longer period of time and offers more storage. DirectCompute can store data in generic Direct3D* resources such as buffers and textures. 
Read/Write buffers and textures, structured buffers, and byte address (raw) buffers are available. To write into and read from textures and buffers, so-called memory views are available. DirectX* 11 introduced the Unordered Access view (UAV), which allows scattered writes and gathered reads. To read memory in a shader, DirectX* 10 introduced a Shader Resource view (SRV). A special category of device memory is the constant buffer, which favors access patterns where 16 consecutive memory reads are done. An example of this access pattern is reading a 4×4 matrix from constant memory.<br /><br /><b>Read/Write Buffers and Textures</b><br />Read-only texture memory is supposed to favor memory access patterns that are spatially close to each other. For example, bilinear filtering requires access to texels that are close to each other. DirectX* 11 added a new set of Read and Write textures:<br /><br />
<ul>
<li><code>RWBuffer</code></li>
<li><code>RWTexture1D, RWTexture1DArray</code></li>
<li><code>RWTexture2D, RWTexture2DArray</code></li>
<li><code>RWTexture3D</code></li>
</ul>
Here’s an example of how to define an RWTexture2D texture in a shader:<br />
<pre name="code" class="cpp">// Compute Shader code: RWTexture with Unordered Access View in u0
RWTexture2D&lt;float4&gt; output : register (u0);
</pre>
<b><br />Structured Buffers</b><br />A <i>structured buffer</i> is a buffer that contains elements of a structure. Here’s a simple example:<br />
<pre name="code" class="cpp">// Compute Shader code: structured buffer with Unordered Access View in u0
struct BufferStruct
{
  float4 color;
};
RWStructuredBuffer&lt;BufferStruct&gt; output : register(u0);
</pre>
To fill up a structured buffer in the compute shader, you can use code like this:<br />
<pre name="code" class="cpp">// Buffer stride; assumes data stride = data width (i.e., no padding)
uint stride = WindowWidth;

// DTid is the SV_DispatchThreadID
uint idx = (DTid.x) + (DTid.y) * stride;
output[idx].color = color;
</pre>
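The index computation in the shader snippet above can be checked on the CPU side with a small C++ sketch. linearIndex is a hypothetical helper, and like the shader code it assumes tightly packed rows, that is, a stride equal to the width with no padding.<br />

```cpp
#include <cassert>
#include <cstdint>

// Maps a 2D thread coordinate (x, y) to a linear element index, assuming
// rows are tightly packed (stride == width, no padding).
std::uint32_t linearIndex(std::uint32_t x, std::uint32_t y, std::uint32_t stride)
{
    assert(x < stride);  // x must fall inside one row
    return x + y * stride;
}
```

For a 640-pixel-wide window, the thread at (5, 2) writes element 5 + 2 * 640 = 1285 of the structured buffer.<br />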
The following code creates a structured buffer on the application level:<br />
<pre name="code" class="cpp">// 
// structured buffer
//
struct BufferStruct
{
  float color[4];
};

D3D11_BUFFER_DESC sbDesc;
sbDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
sbDesc.CPUAccessFlags = 0;
sbDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
sbDesc.StructureByteStride = sizeof(BufferStruct);

int Height = WindowHeight;
int Width = WindowWidth;
sbDesc.ByteWidth = sbDesc.StructureByteStride * Width * Height;
sbDesc.Usage = D3D11_USAGE_DEFAULT;
pd3dDevice-&gt;CreateBuffer(&amp;sbDesc, NULL, &amp;pStructuredBuffer);
</pre>
<b><br />Byte Address Buffers</b><br /><i>Byte address buffers</i>, or <i>raw buffers</i>, are a special type of buffer addressed using a byte offset from the beginning of the buffer. The byte offset must be a multiple of 4 so that it is word aligned.<br /><br />The type of raw buffers is always 32-bit unsigned <code>int</code>. Other data types would need to be cast to unsigned <code>int</code>. Raw buffers are useful for generating geometry with DirectCompute, because they can be bound as vertex and index buffers. In HLSL, they are declared as follows:<br />
<pre name="code" class="cpp">ByteAddressBuffer
RWByteAddressBuffer
</pre>
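<p>The addressing rule can be illustrated on the CPU. The following is a minimal C++ sketch, not DirectX* code: <code>RawBufferModel</code>, <code>Load</code>, <code>Store</code>, <code>FloatToBits</code>, and <code>BitsToFloat</code> are hypothetical names that only mimic how a raw buffer is read and written in 32-bit words at byte offsets that are multiples of 4, and how a float has to be reinterpreted as an unsigned integer before it is stored. It assumes 32-bit IEEE 754 floats.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical CPU-side model of a raw (byte address) buffer.
// Storage is a sequence of 32-bit words; every access uses a byte offset.
class RawBufferModel
{
public:
    explicit RawBufferModel(std::size_t sizeInBytes) : words_(sizeInBytes / 4, 0u) {}

    // Reads one 32-bit word; the byte offset must be a multiple of 4.
    std::uint32_t Load(std::size_t byteOffset) const
    {
        assert(byteOffset % 4 == 0 && byteOffset / 4 < words_.size());
        return words_[byteOffset / 4];
    }

    // Writes one 32-bit word; the byte offset must be a multiple of 4.
    void Store(std::size_t byteOffset, std::uint32_t value)
    {
        assert(byteOffset % 4 == 0 && byteOffset / 4 < words_.size());
        words_[byteOffset / 4] = value;
    }

private:
    std::vector<std::uint32_t> words_;
};

// Reinterprets the bits of a float as an unsigned int (no numeric conversion).
inline std::uint32_t FloatToBits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterprets the bits of an unsigned int as a float.
inline float BitsToFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

<p>In HLSL, the corresponding reinterpretation is done with the <code>asuint()</code> and <code>asfloat()</code> intrinsics.</p>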
DirectX* 11 aligns raw buffers to 16 bytes.<br /><br /><b><br />Constant Buffers</b><br />Constant buffers provide read-only access to data that is expected to be accessed as 16 consecutive float values. As long as they are accessed in order, the cost is similar to reading only one value.<br /><br />A shader has access to 4096 32-bit, four-component constants: 64 KB. Although DirectX* 10.x and DirectX* 11 define this as the upper limit for the size of a constant buffer, DirectX* 11.1 allows you to store many more constants in the constant buffer and to access a subrange of this buffer in a shader:<br />
<pre name="code" class="cpp">// Create constant buffer
typedef struct
{
	float diffuse[4]; // diffuse shading color
	float mu[4];    // quaternion julia parameter
	float epsilon;  // detail julia
	int c_width;      // view port size
	int c_height;
	int selfShadow;  // selfshadowing on or off 
	float orientation[4*4]; // rotation matrix
	float zoom;
} QJulia4DConstants;

D3D11_BUFFER_DESC Desc;
Desc.Usage = D3D11_USAGE_DYNAMIC;
Desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
Desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
Desc.MiscFlags = 0;

// must be multiple of 16 bytes
Desc.ByteWidth = ((sizeof( QJulia4DConstants ) + 15)/16)*16;
pd3dDevice-&gt;CreateBuffer(&amp;Desc, NULL, &amp;pcbFractal);
</pre>
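<p>The <code>ByteWidth</code> expression above rounds the structure size up to the next multiple of 16 bytes. A minimal sketch of that arithmetic follows; <code>RoundUpTo16</code> and <code>QJulia4DConstantsModel</code> are names introduced here for illustration only, and the 116-byte size mentioned in the comment assumes 4-byte <code>float</code> and <code>int</code>.</p>

```cpp
#include <cassert>
#include <cstddef>

// Same member layout as the QJulia4DConstants structure shown above.
// With 4-byte float and int this is 116 bytes, which is not a multiple of 16.
struct QJulia4DConstantsModel
{
    float diffuse[4];
    float mu[4];
    float epsilon;
    int   c_width;
    int   c_height;
    int   selfShadow;
    float orientation[4 * 4];
    float zoom;
};

// Rounds a size in bytes up to the next multiple of 16.
// Adding 15 before the integer division makes any remainder round up.
inline std::size_t RoundUpTo16(std::size_t sizeInBytes)
{
    return ((sizeInBytes + 15) / 16) * 16;
}
```

<p>For example, a size of 116 bytes rounds up to 128 bytes, while a size that is already a multiple of 16 is returned unchanged.</p>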
<b><br />Shader Resource View and Unordered Access View</b><br />As in the other shader stages of the DirectX* pipeline, DirectCompute supports a shader resource view (SRV) that allows a shader to read resource memory. In the case of a structured buffer, you can create an SRV as follows:<br />
<pre name="code" class="cpp">// 
// shader resource view on structured buffer
//
D3D11_SHADER_RESOURCE_VIEW_DESC sbSRVDesc;
ZeroMemory( &amp;sbSRVDesc, sizeof( sbSRVDesc ) );
// FirstElement/ElementOffset and NumElements/ElementWidth are unions,
// so only one member of each pair needs to be set.
sbSRVDesc.Buffer.FirstElement = 0;
sbSRVDesc.Buffer.NumElements = sbDesc.ByteWidth / sbDesc.StructureByteStride;
sbSRVDesc.Format = DXGI_FORMAT_UNKNOWN;
sbSRVDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
hr = pd3dDevice-&gt;CreateShaderResourceView((ID3D11Resource *) pStructuredBuffer, &amp;sbSRVDesc, &amp;pComputeShaderSRV);
</pre>
A UAV allows you to randomly scatter writes into byte address (raw) buffers and structured buffers, and then randomly gather while reading those buffers. DirectX* 11 can bind eight UAVs at the same time.<br /><br />Code for a UAV to a structured buffer might look like this:<br />
<pre name="code" class="cpp">// Unordered access view on structured buffer
D3D11_UNORDERED_ACCESS_VIEW_DESC sbUAVDesc;
ZeroMemory( &amp;sbUAVDesc, sizeof(sbUAVDesc) );
sbUAVDesc.Buffer.FirstElement = 0;
sbUAVDesc.Buffer.Flags = 0;
sbUAVDesc.Buffer.NumElements = sbDesc.ByteWidth / sbDesc.StructureByteStride; 
sbUAVDesc.Format = DXGI_FORMAT_UNKNOWN;
sbUAVDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
HRESULT hr = pd3dDevice-&gt;CreateUnorderedAccessView((ID3D11Resource *)pStructuredBuffer, &amp;sbUAVDesc, &amp;pComputeOutputUAV);
</pre>
In DirectX* 11.x, a UAV also allows a new access pattern called <i>append and consume</i>. This pattern allows for building and accessing data in list or stack form. An append and consume buffer is a structured or a raw buffer with a specially created UAV:<br />
<pre name="code" class="cpp">// Append/consume unordered access view on structured buffer
D3D11_UNORDERED_ACCESS_VIEW_DESC sbUAVDesc;
ZeroMemory( &amp;sbUAVDesc, sizeof(sbUAVDesc) );
sbUAVDesc.Buffer.FirstElement = 0;
sbUAVDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_APPEND;
sbUAVDesc.Buffer.NumElements = sbDesc.ByteWidth / sbDesc.StructureByteStride; 
sbUAVDesc.Format = DXGI_FORMAT_UNKNOWN;
sbUAVDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
HRESULT hr = pd3dDevice-&gt;CreateUnorderedAccessView((ID3D11Resource *)pStructuredBuffer, &amp;sbUAVDesc, &amp;pComputeOutputUAV);
</pre>
In HLSL, the <code>AppendStructuredBuffer&lt;T&gt;</code> provides the <code>Append(T)</code> method; the <code>ConsumeStructuredBuffer&lt;T&gt;</code> provides the <code>T Consume()</code> method.<br /><br /><b><i><br />Thread Group Shared Memory</i></b><br />Thread group shared memory (TGSM) is located in on-chip memory. You can consider it a cache to minimize off-chip bandwidth use. All the threads in a thread group can access this memory. In other words, TGSM allows threads within a given group to cooperate and share data. Reads and writes to shared memory are fast compared to global buffer loads and stores, close to the speed of register reads and writes.<br /><br />A common programming pattern is to have the threads within a group cooperatively load a block of data into shared memory, process the data, and then write out the results to a writable buffer. A typical example is storing all of the neighboring pixels for horizontal or vertical blur kernels for a post-processing pipeline.<br /><br />TGSM is not persistent between dispatch calls, so the result of one dispatch call needs to be stored somewhere else. TGSM is indicated in the HLSL shader code using the <code>groupshared</code> type qualifier:<br />
<pre name="code" class="cpp">groupshared float sharedmem[256];
</pre>
<h2 class="sectionHeading">DirectCompute Threading Model</h2>
The typical multithreading paradigm used in traditional CPU-based algorithms uses separate processor cores and threads for execution, coupled with a shared memory space and manual synchronization. High-end CPUs like Intel® Ivy Bridge have up to six cores, each supporting up to two threads.<br /><br />DirectCompute uses a different threading model. DirectCompute-capable devices can run thousands of threads, with flexible mapping of threads to data elements, and all of those threads execute the same shader or program, a model called <i>kernel in parallel</i>.<br /><br /><br /><b><i>Kernel Processing</i></b><br />A compute shader is considered a processing kernel when executed. A kernel is instantiated for each thread and applied to a set of data. The data is provided through Direct3D* resources bound to the DirectCompute stage. In other words, each hardware thread can be tasked with executing one individual invocation of a kernel that is the same for all threads in a dispatch call.<br /><br />That means that the data for a DirectCompute application must be splittable into parts small enough to be processed independently. A typical DirectCompute application works on data that consists of a large number of similarly structured pieces, the classical domain of graphics hardware.<br /><br /><br /><b><i>Dispatching Kernels</i></b><br />Executing a compute shader is also called <i>dispatching a kernel</i>. Two functions in DirectX* 11.x dispatch a kernel:<br />
<pre name="code" class="cpp">Dispatch(UINT ThreadGroupCountX, UINT ThreadGroupCountY, UINT ThreadGroupCountZ);

DispatchIndirect(ID3D11Buffer *pBufferForArgs, UINT AlignedOffsetForArgs);
</pre>
The first method expects three values that represent the number of thread groups in three dimensions that should be dispatched. For example, if an application calls the <code>Dispatch()</code> method with 4, 8, and 2, a total of 64 thread groups will be launched. The number of threads in each of those thread groups is specified in the compute shader.<br /><br />The second method, <code>DispatchIndirect()</code>, takes the same three thread group counts from a buffer instead of from function parameters, so an earlier GPU pass can fill in the arguments and the new job is dispatched based on the buffer data. The following code snippet shows a typical example of how to call <code>Dispatch()</code> and provide the number of threads in a thread group:<br />
<pre name="code" class="cpp">// C++ application code
pImmediateContext-&gt;Dispatch(Width / THREADSX, Height / THREADSY, 1 );

// HLSL compute shader code
[numthreads(THREADSX, THREADSY, 1)]
void CS_QJulia4D( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
...
</pre>
For the x direction, the width of the window is divided by the number of threads per thread group in x, which is defined in the HLSL shader code. With 16 threads in x in each thread group and a window width of 800, the application will use 50 thread groups in x. For the y direction, if the window has a height of 640 and each thread group has 32 threads in y, there will be 20 thread groups in y. This example therefore dispatches 50 × 20 = 1000 thread groups, each with 16 × 32 = 512 threads, so 512,000 threads are in flight.<br /><br />DirectX* 11.x supports 3D groups and threads. Think of the threads in a thread group in DirectX* 11 as a 3D array. Each thread “array” is in turn part of a 2D or 3D array of thread groups. A thread in a thread group is addressed by using registers that hold the indices of the thread and of its thread group.<br /><br /><br />
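<p>The arithmetic in this example can be checked with a few lines of C++. This is only a sketch: <code>GroupCount</code>, <code>DispatchSize</code>, and <code>MakeDispatch2D</code> are hypothetical helper names, not part of the DirectX* API. Unlike the plain division used in the <code>Dispatch()</code> call above, <code>GroupCount</code> rounds up, which is an addition made here so that a window size that is not a multiple of the group size would still be covered.</p>

```cpp
#include <cassert>
#include <cstdint>

// Number of thread groups needed to cover `size` elements with groups of
// `groupSize` threads. Rounds up; the shader would then have to skip the
// threads that fall outside the valid range.
inline std::uint32_t GroupCount(std::uint32_t size, std::uint32_t groupSize)
{
    assert(groupSize > 0);
    return (size + groupSize - 1) / groupSize;
}

// Thread group counts for one dispatch plus the threads in each group.
struct DispatchSize
{
    std::uint32_t groupsX;
    std::uint32_t groupsY;
    std::uint32_t groupsZ;
    std::uint32_t threadsPerGroup;

    std::uint32_t TotalGroups() const { return groupsX * groupsY * groupsZ; }
    std::uint32_t TotalThreads() const { return TotalGroups() * threadsPerGroup; }
};

// Mirrors Dispatch(Width / THREADSX, Height / THREADSY, 1) combined with
// [numthreads(THREADSX, THREADSY, 1)] in the shader.
inline DispatchSize MakeDispatch2D(std::uint32_t width, std::uint32_t height,
                                   std::uint32_t threadsX, std::uint32_t threadsY)
{
    DispatchSize d;
    d.groupsX = GroupCount(width, threadsX);
    d.groupsY = GroupCount(height, threadsY);
    d.groupsZ = 1;
    d.threadsPerGroup = threadsX * threadsY;
    return d;
}
```

<p>Calling <code>MakeDispatch2D(800, 640, 16, 32)</code> gives 50 × 20 × 1 thread groups of 512 threads each, matching the 512,000 threads in the text.</p>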
<h2 class="sectionHeading">Thread Addressing System</h2>
Each of the 512,000 threads in the example above executes an instance of a kernel or a compute shader. How does each kernel instance know which thread is executing it? Knowing this is important for indexing into data, and then reading data from Direct3D* resources.<br /><br />The DirectCompute runtime provides system values stored in registers to a kernel. Four registers hold this data:<br /><br />
<ul>
<li><code>vThreadID.xyz</code></li>
<li><code>vThreadGroupID.xyz</code></li>
<li><code>vThreadIDInGroup.xyz</code></li>
<li><code>vThreadIDInGroupFlattened</code></li>
</ul>
The values those registers hold are accessible in the compute shader via the following semantics:<br /><br />
<ul>
<li><code >SV_DispatchThreadID</code> - index of the thread within the entire dispatch; in each dimension it ranges from 0 to (number of thread groups × number of threads per group) – 1</li>
<li><code >SV_GroupID</code> - index of a thread group in the dispatch; for example, calling Dispatch(2,1,1) results in possible values of 0,0,0 and 1,0,0</li>
<li><code >SV_GroupThreadID</code> - 3D index of a thread within its thread group (the 3D version of SV_GroupIndex); if you specified numthreads(3,2,1), possible values for the SV_GroupThreadID input value have the range of values (0–2,0–1,0)</li>
<li><code >SV_GroupIndex</code> - flattened index of a thread within a thread group, varying from 0 to (numthreadsX * numthreadsY * numthreadsZ) – 1</li>
</ul>
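<p>The relationship between the four system values can be written down as a short CPU-side sketch. The type <code>UInt3</code> and the functions <code>DispatchThreadID</code> and <code>GroupIndex</code> are hypothetical names used only for this illustration; in a real compute shader the runtime supplies the values through the semantics listed above.</p>

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the HLSL uint3 type.
struct UInt3
{
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID,
// evaluated per component.
inline UInt3 DispatchThreadID(UInt3 groupID, UInt3 numThreads, UInt3 groupThreadID)
{
    UInt3 id;
    id.x = groupID.x * numThreads.x + groupThreadID.x;
    id.y = groupID.y * numThreads.y + groupThreadID.y;
    id.z = groupID.z * numThreads.z + groupThreadID.z;
    return id;
}

// SV_GroupIndex flattens SV_GroupThreadID:
// z * numthreadsX * numthreadsY + y * numthreadsX + x.
inline std::uint32_t GroupIndex(UInt3 numThreads, UInt3 groupThreadID)
{
    return groupThreadID.z * numThreads.x * numThreads.y
         + groupThreadID.y * numThreads.x
         + groupThreadID.x;
}
```

<p>With <code>numthreads(16, 32, 1)</code>, the last thread of the last group in the 800 × 640 example has a dispatch thread ID of (799, 639, 0) and a group index of 511.</p>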
A simple example for writing into a 2D texture is shown in the following source code:
<pre name="code" class="cpp">RWTexture2D&lt;float4&gt; output : register (u0); 
void CS_QJulia4D( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
...
	output[DTid.xy] = color;
}
</pre>
The following code shows how to access a 1D structured buffer in a compute shader:<br />
<pre name="code" class="cpp">struct BufferStruct
{
	float4 color;
};
RWStructuredBuffer&lt;BufferStruct&gt; output : register (u0); // UAV 0 
void CS_QJulia4D( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
...
	uint stride = c_width;  
	uint idx = (DTid.x) + (DTid.y) * stride;

	output[idx].color = color;  
}
</pre>
<h2 class="sectionHeading">Thread Synchronization</h2>
As with traditional multithreaded programming models, many threads can read and write the same memory location, and therefore there is a potential for memory corruption resulting from read-after-write hazards. To synchronize memory access of threads, <i>memory barriers</i> and <i>atomic functions</i> are available.<br /><br /><br /><b><i>Memory Barriers</i></b><br />In DirectX* 11.x, six different HLSL intrinsics, called <i>memory barriers</i>, can synchronize thread execution and memory writes:<br /><br />
<ul>
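<li><code>AllMemoryBarrier</code> and <code>AllMemoryBarrierWithGroupSync</code></li>
<li><code>DeviceMemoryBarrier</code> and <code>DeviceMemoryBarrierWithGroupSync</code></li>
<li><code>GroupMemoryBarrier</code> and <code>GroupMemoryBarrierWithGroupSync</code></li>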
</ul>
A <i>memory barrier</i> is a method for saying, “wait until the memory operations are complete.” You use such a barrier to ensure that when threads share data with one another in device memory or TGSM, the values written to memory have actually been written before other threads read them. In other words, there is an important distinction between the shader core executing a write instruction and that instruction actually being carried out by the GPU’s memory system. Depending on the underlying hardware, there can be a variable amount of time between writing a value and when it actually ends up at its memory destination.<br /><br /><code>GroupMemoryBarrier</code> applies to TGSM, <code>DeviceMemoryBarrier</code> applies to device memory, and <code>AllMemoryBarrier</code> applies to both memory types. The <code>*MemoryBarrierWithGroupSync</code> variants additionally stall until <i>all</i> threads in the group have reached the instruction and the memory operations outstanding at the time of the call have finished. A typical example for using a memory barrier is shown in the following code:<br />
<pre name="code" class="cpp">for (uint tile = 0; tile &lt; numTiles; tile++) 
{
    sharedPos[threadId] = particles[…];
       
    GroupMemoryBarrierWithGroupSync();

    // gravitation() uses sharedPos[] as input data
    acceleration = gravitation(…);
        
    GroupMemoryBarrierWithGroupSync();
}
</pre>
The granularity with which those barriers wait for outstanding memory operations is 4 bytes. Memory barriers are used to synchronize a whole group of threads: They are not an appropriate solution for synchronizing only a few threads in a thread group. This is where atomic functions come in handy.<br /><br /><br /><b><i>Atomic Functions</i></b><br />DirectX* 11.x supports atomic functions, or <i>interlocked functions</i>, in the compute and pixel shaders. They are guaranteed to operate atomically: each read-modify-write completes as one indivisible step, so concurrent updates from different threads to the same location cannot interleave. Here is a list of the atomic functions:<br /><br />
<ul>
<li><code>InterlockedAdd</code></li>
<li><code>InterlockedMin</code></li>
<li><code>InterlockedMax</code></li>
<li><code>InterlockedOr</code></li>
<li><code>InterlockedAnd</code></li>
<li><code>InterlockedXor</code></li>
<li><code>InterlockedCompareStore</code></li>
<li><code>InterlockedCompareExchange</code></li>
<li><code>InterlockedExchange</code></li>
</ul>
With the exception of <code>InterlockedExchange</code>, all functions accept only integer or unsigned integer input values in TGSM. For example, if the compute shader wants to keep a count of the number of threads that encounter a particular value, <code>InterlockedAdd()</code> can be called. <code>InterlockedCompareExchange()</code> compares the value of a destination to a reference value; if the two match, the third argument is written to the destination (Zink).<br /><br />Note that all of those integer atomics except <code>Add</code>, <code>Min</code>, and <code>Max</code> work as-is on floating-point numbers if their bits are passed in as integers with <code>asuint()</code>. <code>Min</code> and <code>Max</code> also work as-is on <code>asuint()</code> floats as long as all floats are non-negative.<br /><br /><br />
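<p>The reason an integer maximum works on non-negative floats is that their bit patterns, read as unsigned integers, sort in the same order as the float values. The following C++ sketch illustrates this on the CPU with <code>std::atomic</code>; it is not HLSL. The names <code>AsUint</code>, <code>AsFloat</code>, <code>AtomicMaxUint</code>, and <code>AtomicMaxNonNegativeFloat</code> are introduced here for illustration, and 32-bit IEEE 754 floats are assumed.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterprets the bits of a float as an unsigned integer.
inline std::uint32_t AsUint(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterprets the bits of an unsigned integer as a float.
inline float AsFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Atomic maximum on a 32-bit word, written as a compare-exchange loop.
inline void AtomicMaxUint(std::atomic<std::uint32_t>& dest, std::uint32_t value)
{
    std::uint32_t current = dest.load();
    while (current < value && !dest.compare_exchange_weak(current, value))
    {
        // On failure, compare_exchange_weak reloads `current`; retry.
    }
}

// Valid only for non-negative floats: a negative float has its sign bit set,
// so its bit pattern compares larger than that of any positive float.
inline void AtomicMaxNonNegativeFloat(std::atomic<std::uint32_t>& dest, float value)
{
    assert(value >= 0.0f);
    AtomicMaxUint(dest, AsUint(value));
}
```

<p>For instance, <code>AsUint(1.0f)</code> is smaller than <code>AsUint(2.0f)</code>, but <code>AsUint(-1.0f)</code> is larger than both, which is why the trick breaks down once negative values are involved.</p>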
<h2 class="sectionHeading">Example</h2>
The example program shows the Mandelbrot set. (There are good explanations of the algorithm on Wikipedia.) Jan Vlietinck’s website and other websites show implementations with source code. Figure 1 shows a screenshot of the example program.<br /><br /><img src="http://software.intel.com/file/43343" /><br /><b>Figure 1.</b> <i>Mandelbrot</i><br /><br />Figure 2 shows another screenshot of the demo application running on Intel® Ivy Bridge.<br /><br /><img src="http://software.intel.com/file/43344" /><br /><b>Figure 2.</b> <i>Mandelbrot</i><br /><br />This example is well suited to explaining a DirectX* 11 implementation, because it uses the minimum number of API calls needed to set up a DirectCompute application. The example code favors readability over production quality: it does not check every return value, it spends no effort on picking the best device, and the window size can’t be changed without recompilation. Let’s go through the implementation following the order of the DirectX* 11 calls.<br /><br /><br /><b>Setting Up the Device</b><br />The simplest way to set up a device is to call <code>D3D11CreateDeviceAndSwapChain()</code> with the default values. That means that whatever feature set the first device supports is exposed to the application. For an application that requires the DirectX* 11 feature level, device creation must therefore verify that the underlying hardware supports that feature level.<br />
<pre name="code" class="cpp">   //
   // Initialize Direct3D device and swap chain
   //	
   DXGI_SWAP_CHAIN_DESC sd;
// …
   // this qualifies the back buffer for being the target of compute shader writes 
   sd.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT | DXGI_USAGE_UNORDERED_ACCESS | DXGI_USAGE_SHADER_INPUT;
// …
   
   // return value -&gt; what the hardware supports
   D3D_FEATURE_LEVEL MaxFeatureLevel = D3D_FEATURE_LEVEL_11_0; 
   
   // we are asking for DirectX 11 feature level support here
   D3D_FEATURE_LEVEL FeatureLevel = D3D_FEATURE_LEVEL_11_0;

   HRESULT hr = D3D11CreateDeviceAndSwapChain(
							   NULL,
							   D3D_DRIVER_TYPE_HARDWARE,
							   NULL, 
							   D3D11_CREATE_DEVICE_DEBUG,
							   &amp;FeatureLevel,
							   1,
							   D3D11_SDK_VERSION,
							   &amp;sd,
							   &amp;pSwapChain,
							   &amp;pd3dDevice,
							   &amp;MaxFeatureLevel,
							   &amp;pImmediateContext);
</pre>
The code asks whether the hardware supports at least the DirectX* 11.0 feature level. If it doesn’t, the returned <code>HRESULT</code> indicates an error.<br /><br />With the swap chain created, a pointer to the back buffer can be retrieved, and a UAV can be created on it so that the compute shader can write into the back buffer:<br />
<pre name="code" class="cpp">DXGI_SWAP_CHAIN_DESC sdtemp;
pSwapChain-&gt;GetDesc(&amp;sdtemp);

// get access to the back buffer via a texture
ID3D11Texture2D* pTexture;
pSwapChain-&gt;GetBuffer(0, __uuidof( ID3D11Texture2D ), ( LPVOID* )&amp;pTexture );

// create shader unordered access view on back buffer for compute shader to write into texture
pd3dDevice-&gt;CreateUnorderedAccessView(pTexture, NULL, &amp;pComputeOutput );
</pre>
DirectX* 11-capable hardware supports <code>RWTexture2D</code> and allows you to use a UAV into this texture. The code above retrieves the back buffer from the swap chain and creates a UAV that points to the buffer. This way, writing into a structured buffer and reading this buffer later are not necessary.<br /><br /><br /><b><i>Constant Memory</i></b><br />The only buffer that needs to be allocated for this example is a constant buffer. The size of a constant buffer needs to be a multiple of 16 bytes:<br />
<pre name="code" class="cpp">//
// Create constant buffer
//
D3D11_BUFFER_DESC Desc;
Desc.Usage = D3D11_USAGE_DYNAMIC;
Desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
Desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
Desc.MiscFlags = 0;
Desc.ByteWidth = ((sizeof( MandelbrotConstants ) + 15)/16)*16; // must be multiple of 16 bytes

pd3dDevice-&gt;CreateBuffer(&amp;Desc, NULL, &amp;pcbFractal);		
</pre>
Later, in the <code>Render()</code> function, this constant buffer is filled by mapping it and copying the system memory copy of the constants into the mapped memory:<br />
<pre name="code" class="cpp">// Fill constant buffer
D3D11_MAPPED_SUBRESOURCE msr;
pImmediateContext-&gt;Map(pcbFractal, 0, D3D11_MAP_WRITE_DISCARD, 0,  &amp;msr);
*(MandelbrotConstants *)msr.pData = MandelC;
pImmediateContext-&gt;Unmap(pcbFractal,0);
</pre>
<b><i>Compiling the Compute Shader Kernel</i></b><br />A compute shader is compiled like any other shader in DirectX* 11:<br />
<pre name="code" class="cpp">//
// compile the compute shader
//
if(D3DX11CompileFromFile(L"Mandelbrot.hlsl", NULL, NULL, "Mandelbrot", "cs_5_0", 0, 
				0, NULL, &amp;pByteCodeBlob, &amp;pErrorBlob, NULL)!= S_OK)
  MessageBoxA(NULL, (char *)pErrorBlob-&gt;GetBufferPointer(), "Error", MB_OK | MB_ICONERROR);
if(pd3dDevice-&gt;CreateComputeShader(pByteCodeBlob-&gt;GetBufferPointer(), 
			pByteCodeBlob-&gt;GetBufferSize(), NULL, &amp;pCompiledComputeShader)!= S_OK)
  MessageBoxA(NULL, "CreateComputeShader() failed", "Error", MB_OK | MB_ICONERROR);
</pre>
The main difference is the specification of the <code>cs_5_0</code> target.<br /><br /><br /><b><i>Dispatching the Kernel</i></b><br />The application code that runs the compute shader is rather compact:<br />
<pre name="code" class="cpp">// Set compute shader
pImmediateContext-&gt;CSSetShader(pCompiledComputeShader, NULL, 0 );
 		
// UAV for CS output into back-buffer
pImmediateContext-&gt;CSSetUnorderedAccessViews(0, 1, &amp;pComputeOutput, NULL);
		
// CS constant buffer
pImmediateContext-&gt;CSSetConstantBuffers(0, 1, &amp;pcbFractal );

// Run the CS
pImmediateContext-&gt;Dispatch((gWidth) / 16, (gHeight) / 16, 1 );

// make it visible
pSwapChain-&gt;Present( 0, 0 );
</pre>
Before the <code>Dispatch()</code> call that invokes the compute shader execution, the compute shader is set, a UAV is set to write into the back buffer, and a constant buffer is set that holds data for the Mandelbrot algorithm.<br /><br /><br /><b><i>Thread Addressing System</i></b><br />The <code>Dispatch()</code> call shown above creates a grid from the window’s width and height, each divided by the thread group size of 16 in that direction. In other words, with a 640×480 window, the grid consists of 40 thread groups in the <i>x</i> and 30 thread groups in the <i>y</i> direction, while the <i>z</i> direction is one. Because the compute shader directly writes into the back buffer, you might think of this as tile-based rendering, where each 16×16-pixel tile is rendered by a thread group of 256 threads.<br /><br />The compute shader then defines the thread group size of 16×16×1 with the <code>numthreads</code> attribute.<br />
<pre name="code" class="cpp">[numthreads(16, 16, 1)]
void Mandelbrot( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
…
        output[ DTid.xy ] = color;
}
</pre>
The <code>SV_DispatchThreadID</code> system value is used to address the right pixel. This runtime-generated value provides the index of a thread within the entire dispatch. In the case of the Mandelbrot compute shader, the <code>output</code> variable writes to the 2D texture that is the back buffer. It is indexed with <code>DTid.xy</code> so that all tiles can be filled simultaneously by different instances of the compute shader.<br /><br /><br />
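<p>The per-pixel work is elided in the shader listing above. As a rough idea of what one thread might compute for its pixel, here is a CPU-side C++ sketch of the standard escape-time iteration for the Mandelbrot set. It is not the article’s shader: the names <code>MandelbrotIterations</code>, <code>PlanePoint</code>, and <code>PixelToPlane</code>, as well as the view rectangle, are assumptions made for this illustration.</p>

```cpp
#include <cassert>
#include <cstdint>

// Escape-time iteration for one point c = (cx, cy) of the complex plane,
// using z = z * z + c starting from z = 0.
// Returns the number of iterations performed before |z| exceeds 2, or
// maxIterations if the point never escapes (it is then treated as inside
// the set).
inline std::uint32_t MandelbrotIterations(float cx, float cy, std::uint32_t maxIterations)
{
    float zx = 0.0f;
    float zy = 0.0f;
    std::uint32_t i = 0;
    while (i < maxIterations && zx * zx + zy * zy <= 4.0f)
    {
        const float nextZx = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = nextZx;
        ++i;
    }
    return i;
}

// A point in the complex plane.
struct PlanePoint
{
    float re;
    float im;
};

// Maps a pixel (the equivalent of DTid.xy) to the center of that pixel in
// the complex plane. The view rectangle [-2.5, 1] x [-1.25, 1.25] is an
// arbitrary choice for this sketch.
inline PlanePoint PixelToPlane(std::uint32_t px, std::uint32_t py,
                               std::uint32_t width, std::uint32_t height)
{
    assert(width > 0 && height > 0);
    PlanePoint p;
    p.re = -2.5f + 3.5f * (static_cast<float>(px) + 0.5f) / static_cast<float>(width);
    p.im = -1.25f + 2.5f * (static_cast<float>(py) + 0.5f) / static_cast<float>(height);
    return p;
}
```

<p>A shader built along these lines would turn the iteration count into a color and write it to <code>output[DTid.xy]</code>. Every pixel is independent of every other pixel, which is what makes the problem a good fit for one thread per pixel.</p>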
<h2 class="sectionHeading">Summary</h2>
DirectCompute allows you to program compute shaders on Intel® Ivy Bridge hardware. Operations that need a more relaxed relationship between threads and data, or that do not require the rasterizer, can therefore be moved to the graphics hardware. Doing so allows you to balance the load between the CPU and processor graphics with a fine level of granularity.<br /><br /><br />
<h2 class="sectionHeading">References</h2>
<ul>
<li>Wikipedia on the Mandelbrot set, <a target="_blank" href="http://en.wikipedia.org/wiki/Mandelbrot_set">http://en.wikipedia.org/wiki/Mandelbrot_set</a></li>
<li>Registers used in cs_5_0, <a target="_blank" href="http://msdn.microsoft.com/en-us/library/hh447206(v=VS.85).aspx">http://msdn.microsoft.com/en-us/library/hh447206(v=VS.85).aspx</a></li>
<li>Jan Vlietinck, <a target="_blank" href="http://users.skynet.be/fquake">http://users.skynet.be/fquake</a></li>
<li>Jason Zink, Matt Pettineo, Jack Hoxley, “Practical Rendering &amp; Computing with Direct3D 11,” CRC Press, 2011; p. 305.</li>
</ul>
<p> </p>
<h2 class="sectionHeading">About the Author</h2>
Wolfgang Engel is the CTO/CEO &amp; Co-Founder of <a target="_blank" href="http://www.conffx.com/">Confetti</a>*, a think-tank for advanced real-time graphics research for the video game and movie industry. Previously, he worked for more than 4 years in Rockstar's core technology group as the lead graphics programmer. Some of his game credits can be found at <a target="_blank" href="http://www.mobygames.com/developer/sheet/view/developerId,158706">http://www.mobygames.com/developer/sheet/view/developerId,158706</a>. He is the editor of the <i>ShaderX</i> and <i>GPU Pro</i> books, the author of several other books, and speaks on graphics programming at conferences worldwide. He has been a DirectX* MVP since July 2006 and is active on several advisory boards in the industry. He also teaches the class “GPU Programming” at University of California, San Diego. You can find him on Twitter at @wolfgangengel.<br /><br />
<div id="vc-meta" >
<div id="vc-meta-author"></div>
<div id="vc-meta-pubdate">04-23-2012</div>
<div id="vc-meta-modificationdate">04-23-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div>IvyBridge</div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43343</div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Microsoft* DirectCompute exposes the compute functionality of graphics hardware as a new shader type: the compute shader. DirectCompute allows you to program compute shaders on Intel® Ivy Bridge hardware. This paper describes how.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/microsoft-directcompute-on-intel-ivy-bridge-processor-graphics/</link>
      <pubDate>Mon, 23 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/microsoft-directcompute-on-intel-ivy-bridge-processor-graphics/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/microsoft-directcompute-on-intel-ivy-bridge-processor-graphics/</guid>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>Pre-Compositing Textures for Terrain Rendering</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/43370">Pre-Compositing Textures for Terrain Rendering</a> [PDF 632KB]<br /><br /><br />
<h2 class="sectionHeading">Introduction</h2>
There are many ways to generate and render terrain, but for years one of the most common ways has been to generate a terrain from a heightmap and use a mask texture to blend, or composite, diffuse textures together. Depending on the number of textures blended, this can be a fairly time consuming operation as it typically involves multiple texture samples over a large number of pixels. One solution is to take advantage of the fact that both terrain geometry and terrain texturing are almost always static and don’t change every frame, or even during the entire game. This allows the programmer to switch the texture compositing from the GPU, where it is calculated every frame, to the CPU where it is only calculated when needed. If this concept is taken to the extreme, the textures for the entire terrain can be baked prior to run-time resulting in the “mega-texture.” The middle ground is where this sample focuses, showing how to composite both diffuse textures and normal maps in the area immediately surrounding the camera in order to save time on the GPU.<br /><br /><br />
<h2 class="sectionHeading">Texture Compositing Overview</h2>
Texture compositing (or texture splatting) is a technique where a mask (or blend) texture is used to determine how to blend textures together [1]. For example, if the terrain contains a grass texture, a single-channel texture can be used to denote how much grass should be visible, with 0.0 indicating no grass, 1.0 indicating fully opaque grass, and anything in between indicating partially transparent grass. For each additional texture that is added to the mix, an additional channel is needed in the blend texture to represent that texture. If an RGBA8888 format texture is used, four diffuse maps can be controlled, with a fifth texture blended in as a base texture. The fifth texture is applied first at full opacity with the remaining textures blended on top of it. In this sample, the blend texture is used to blend together five diffuse textures and five normal maps. Each normal map is paired with a diffuse texture and regulated by the same blend map channel. To combine the textures, a linear interpolation between the source and destination textures is used. This means that the order in which the textures are composited makes a difference. If both grass and dirt have a 1.0 in their associated channels, whichever is blended last will obscure the other. Figure 1 shows an example composition of grass (base texture), dirt, stone, rock, and snow. The blend map stores the blend factor for the individual textures in the rgba channels.<br /><br />
<p ><img src="http://software.intel.com/file/43385" /><br /><br /><b>Figure 1</b> - <i>Texture Splatting - In this example, a blend map is used to blend together 4 textures with a 5th being used as a base texture. The red square simply denotes that only a portion of the blend map is used to generate the composited texture.</i></p>
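<p>The blending rule described above can be sketched for a single texel in C++. This is an illustration under stated assumptions rather than the sample’s code: <code>Color</code>, <code>Lerp</code>, and <code>CompositeTexel</code> are hypothetical names, colors are plain floats in the range 0.0 to 1.0, and the four layers are blended in blend-map channel order.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One RGB texel with components in the range 0.0 to 1.0.
struct Color
{
    float r;
    float g;
    float b;
};

// Linear interpolation from a (t = 0) to b (t = 1).
inline Color Lerp(const Color& a, const Color& b, float t)
{
    Color c;
    c.r = a.r + (b.r - a.r) * t;
    c.g = a.g + (b.g - a.g) * t;
    c.b = a.b + (b.b - a.b) * t;
    return c;
}

// Composites one texel. The base texture is applied at full opacity; the
// four layers are then blended on top in order, each weighted by its
// blend-map channel. A later layer with weight 1.0 hides everything
// blended before it.
inline Color CompositeTexel(const Color& base,
                            const std::array<Color, 4>& layers,
                            const std::array<float, 4>& blend)
{
    Color result = base;
    for (std::size_t i = 0; i < layers.size(); ++i)
    {
        result = Lerp(result, layers[i], blend[i]);
    }
    return result;
}
```

<p>Because each step interpolates toward the next layer, swapping two layers that both have non-zero weights generally changes the result, which is the order dependence mentioned above. The same loop can be applied to the paired normal maps.</p>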
<h2 class="sectionHeading">What to Composite</h2>
In order for compositing to work on the CPU, the terrain needs to be broken into tiles. If it is not, then the composited texture is either going to require a massive amount of memory as the entire terrain will be composited, or the resolution will be very low as it will be stretched out across the terrain. CPU-composited tiles are chosen based on the camera’s position. The sample uses the tile that contains the camera and the eight surrounding tiles. When the camera crosses into a new tile, tiles no longer in the grid are dropped and new ones are added; tiles that weren’t dropped out of the grid don’t need to be recomposited.<br /><br />
<p ><img src="http://software.intel.com/file/43386" /><br /><br /><b>Figure 2</b> - <i>The nine tiles surrounding the camera (C) are composited. If the camera moves to a new tile, composited textures for the out of bounds tiles are marked for deletion (D) while textures for newly in-range tiles are marked for compositing (N).</i></p>
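<p>The bookkeeping shown in Figure 2 amounts to comparing two 3×3 grids. The following C++ sketch shows one way to express it; <code>Tile</code>, <code>CompositingGrid</code>, <code>GridUpdate</code>, and <code>UpdateGrid</code> are hypothetical names, tiles are identified by integer (column, row) pairs, and clamping at the edge of the terrain is left out for brevity.</p>

```cpp
#include <cassert>
#include <set>
#include <utility>

// (column, row) of a terrain tile.
typedef std::pair<int, int> Tile;

// The 3x3 block of tiles centered on the tile that contains the camera.
inline std::set<Tile> CompositingGrid(const Tile& cameraTile)
{
    std::set<Tile> grid;
    for (int dy = -1; dy <= 1; ++dy)
    {
        for (int dx = -1; dx <= 1; ++dx)
        {
            grid.insert(Tile(cameraTile.first + dx, cameraTile.second + dy));
        }
    }
    return grid;
}

// Result of moving the camera from one tile to another.
struct GridUpdate
{
    std::set<Tile> dropped; // tiles marked for deletion (D in Figure 2)
    std::set<Tile> added;   // tiles marked for compositing (N in Figure 2)
};

// Tiles present in both grids appear in neither set: they keep their
// composited textures and do not need to be recomposited.
inline GridUpdate UpdateGrid(const Tile& oldCameraTile, const Tile& newCameraTile)
{
    const std::set<Tile> oldGrid = CompositingGrid(oldCameraTile);
    const std::set<Tile> newGrid = CompositingGrid(newCameraTile);

    GridUpdate update;
    for (std::set<Tile>::const_iterator it = oldGrid.begin(); it != oldGrid.end(); ++it)
    {
        if (newGrid.count(*it) == 0)
            update.dropped.insert(*it);
    }
    for (std::set<Tile>::const_iterator it = newGrid.begin(); it != newGrid.end(); ++it)
    {
        if (oldGrid.count(*it) == 0)
            update.added.insert(*it);
    }
    return update;
}
```

<p>Moving the camera one tile horizontally or vertically drops three tiles and adds three; a diagonal move drops five and adds five.</p>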
Tiles not in the compositing grid are shaded on the GPU in the standard manner, with the textures blended there. This will generally be a much smaller number of pixels, as the majority of the visible terrain will be made up of composited tiles. In fact, since it is guaranteed that the GPU tiles will be farther away from the camera, a simpler shader can be used. For example, fewer textures or less detailed mip levels can be used.<br /><br /><br />
<h2 class="sectionHeading">Details</h2>
All of the work related to rendering the terrain occurs in the file <code>Terrain.cpp</code> and starts in the function <code>Terrain::Render()</code>. For each frame, all of the tiles are first sorted by their distance from the camera and culled to reduce unnecessary draw calls and pixel operations. Next, if CPU rendering is enabled, the tiles among the nine that are marked as having their compositing operations complete are rendered. Then, any tiles still being composited or outside of the compositing area are rendered using the compositing shader on the GPU. After all the tiles are submitted for rendering, the function <code>DetermineNewTiles()</code> is called, which, as the name implies, determines whether there are any new tiles that need to be composited and, if so, kicks off the necessary asynchronous tasks.<br /><br />The first thing <code>DetermineNewTiles()</code> does is check to see if any mipmap tasks are complete. If so, the composited texture resources are copied from the CPU to the GPU and the tile is marked for CPU rendering. Next, the function checks whether any compositing work is still ongoing; if so, it exits and no new work is added to the queue. This is done to ensure that no work in progress is cancelled. In practice, this will rarely happen as it would involve the camera crossing multiple tile boundaries within a fraction of a second. The most likely scenario is when the camera crosses near an intersection of tiles.<br /><br />After it is determined that there is no ongoing work, the function determines which tiles are new and need to be composited and which are old and can be dropped. Once the tiles that need to be composited are determined, two sets of asynchronous tasks are started. The first task is to generate the composited texture, which is done in <code>CompositeTileRange()</code>. 
The second task set, which depends on the first, generates the mipmaps for the composited texture and compresses it using the desired compression method, DXT5 in this sample. Since these are being done asynchronously, the main thread can continue on with its normal work. The dependent mipmap task set is what is checked at the start of the <code>DetermineNewTiles()</code> function in order to ascertain whether the CPU work is done, and the process then repeats.<br /><br />In this sample, the approach taken to compositing the textures is fairly straightforward. The blend texture is stretched over the entire terrain while the diffuse textures are each stretched over one terrain tile. Both the source and destination (composited) textures are the same size, so the source textures are essentially just copied to the destination after being modified by the blend value. The main issue to be aware of is that, unlike sampling on the GPU, texture operations on the CPU are not done at the pixel level, but instead at the destination resolution, which is most likely less than pixel resolution. The same problems of minification or magnification can occur, only they happen before the texture is sampled by the GPU. In an extreme case where point sampling is used on the CPU, half the data of the source texture is lost completely, as shown in Figure 3. This problem can be easily solved by choosing a destination resolution that is an even multiple of the source resolution.<br /><br />
<p ><img src="http://software.intel.com/file/43387" /><br /><br /><b>Figure 3</b> - <i>In this example both the source (left) and destination (right) texture resolutions are the same. Tiling the source texture twice in the horizontal direction does not produce the desired results. By tiling twice, the step size becomes two and every other texel is skipped and the information is lost.</i></p>
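The texel loss shown in Figure 3 can be reproduced with a little integer arithmetic. The following is a minimal sketch and not code from the sample: <code>pointSampleIndices()</code> and <code>distinctTexelsRead()</code> are hypothetical helpers that model a single row of texels and assume nearest-texel (point) sampling with wrap-around tiling.<br /><br />

```cpp
#include <cstddef>
#include <set>
#include <vector>

// For each destination texel in a row, return the index of the source texel
// that point sampling would read when the source row is repeated `tiles`
// times across the destination row. (Hypothetical helper for illustration.)
std::vector<std::size_t> pointSampleIndices(std::size_t srcWidth,
                                            std::size_t dstWidth,
                                            std::size_t tiles)
{
    std::vector<std::size_t> indices;
    indices.reserve(dstWidth);
    for (std::size_t x = 0; x < dstWidth; ++x) {
        // Map the destination texel into tiled source space, then wrap.
        const std::size_t tiledPos = (x * srcWidth * tiles) / dstWidth;
        indices.push_back(tiledPos % srcWidth);
    }
    return indices;
}

// Count how many different source texels contribute to the destination row.
std::size_t distinctTexelsRead(std::size_t srcWidth,
                               std::size_t dstWidth,
                               std::size_t tiles)
{
    const std::vector<std::size_t> indices =
        pointSampleIndices(srcWidth, dstWidth, tiles);
    return std::set<std::size_t>(indices.begin(), indices.end()).size();
}
```

With an 8-texel source tiled twice across an 8-texel destination, only source texels 0, 2, 4, and 6 are ever read, so <code>distinctTexelsRead(8, 8, 2)</code> is 4. Doubling the destination row to 16 texels brings the step size back to one, and all 8 source texels are read.<br /><br />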
To support blending textures on the CPU they need to be accessible on the CPU. This sample stores each source texture as 8 bits per channel. In addition, there needs to be a place to store the results of the blending operation so 9 destination buffers, with 8 bits per channel, are created along with enough extra space to store the mipmap chain. These are stored in an array of <code>CompositedTexture</code> structs along with some other information specific to each tile that has been composited, such as DirectX resource pointers, the current stage in the compositing pipeline, and timing information. After mipmaps are created, the textures are compressed using DXT5 compression with results written directly to a mapped staging buffer. This ends the mipmap task, which <code>DetermineNewTiles()</code> will detect on the next frame causing a <code>CopyResource()</code> function call to copy the staging buffer data to a GPU texture resource.<br /><br /><br />
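To give a feel for the "extra space to store the mipmap chain" mentioned above, here is a minimal sketch of how that size could be computed for an uncompressed buffer. The function <code>mipChainBytes()</code> is a hypothetical helper, not part of the sample, and the texture dimensions used with it are assumptions, since the sample's actual texture sizes are not stated here.<br /><br />

```cpp
#include <algorithm>
#include <cstddef>

// Bytes needed for an uncompressed texture plus its full mipmap chain,
// halving each dimension (never below 1) until the 1x1 level is reached.
// (Hypothetical helper for illustration.)
std::size_t mipChainBytes(std::size_t width, std::size_t height,
                          std::size_t bytesPerTexel)
{
    std::size_t total = 0;
    while (true) {
        total += width * height * bytesPerTexel;
        if (width == 1 && height == 1) {
            break;  // smallest mip level has been counted
        }
        width = std::max<std::size_t>(1, width / 2);
        height = std::max<std::size_t>(1, height / 2);
    }
    return total;
}
```

For a 4x4 texture with four 8-bit channels, the top level takes 64 bytes and the full chain takes 84 bytes. For square power-of-two textures, the chain adds roughly one third to the size of the top level.<br /><br />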
<h2 class="sectionHeading">Performance</h2>
As stated earlier, the goal of this sample is to save time on the GPU by reducing the complexity of the shader used for processing terrain pixels. Since this is basically a pixel shader optimization, the performance benefit varies greatly depending on how many pixels are processed, which in turn depends on resolution and camera orientation. For testing, the starting camera location and orientation were used (shown in Figure 4). The red area indicates the tiles being pre-composited on the CPU and the green areas are tiles being composited in the pixel shader on the GPU. The GPU compositing shader contains 12 texture samples by default: a blend map, a specular map, and 5 diffuse and 5 normal maps. The CPU composited shader contains three texture samples: a composited diffuse map, a composited normal map, and a specular map. The compositing time is fairly constant at about 300 ms per tile, which includes compositing both diffuse and normal maps, generating mipmaps, and DXT5 compression. All work was performed on a pre-release quad-core system with a 3rd generation Intel® Core™ processor (code-named Ivy Bridge GT2), hyperthreading, and 4 GB memory. The operating system was 64-bit Windows 7*.<br /><br />
<p ><img src="http://software.intel.com/file/43388" /><br /><br /><b>Figure 4</b> - <i>Sample with highlighting enabled. Red areas are pre-composited on the CPU while green areas are composited on the GPU.</i></p>
<p ><img src="http://software.intel.com/file/43389" /><br /><br /><b>Table 1</b> - <i>Frame times in ms for CPU compositing and GPU compositing at 1366x768 resolution. The number of textures decreases by removing pairs of diffuse and normal map textures from the GPU compositing shader. The CPU composited shader remains unchanged at three texture reads.</i></p>
<p ><img src="http://software.intel.com/file/43390" /><br /><br /><b>Table 2</b> - <i>Selected GPU metrics from Intel Graphics Performance Analyzer for rendering the tile closest to the camera. The main discrepancy between the two methods is in bandwidth used for all the texture reads and the total time required to render the terrain tile.</i></p>
One issue to be aware of that can affect performance is a power saving feature in Windows 7 called core parking. When activity is light on a core, Windows will “park” it, that is, not schedule work for that core until activity on the system has reached some threshold. This allows the processor to move parked cores into a low power state and save energy. Unfortunately, this can lead to slow frames in this sample. The reason is that, for the most part, only one core is being used for the main game loop. When the camera crosses a tile boundary, a large amount of work is suddenly kicked off, which fully subscribes all the cores. Since many of the cores are parked because activity has been light, the main thread, which is responsible for rendering, can be swapped out for a significant period of time until all of the cores become unparked. The end result is frame hitching when new tiles need to be composited.<br /><br /><br />
<h2 class="sectionHeading">Conclusion</h2>
For games that have either exceeded the frame time budget for terrain or would like to do more with terrain, pre-compositing terrain textures on the CPU is one approach that can help. Due to the relatively static nature of terrain, the work only needs to be done once and can be reused for many frames. For the test scene, the frame time improved by 20% and the actual GPU time for rendering a tile improved by 48%. Drawbacks include the increase in memory required on both the CPU and the GPU, and restrictions on the relative dimensions of source and destination textures. Benefits include faster rendering times for terrain tiles and, since the work doesn’t need to occur every frame, the option of doing more intensive operations such as additional texture layers or Wang tiles [3] for increased variability.<br /><br />One way to mitigate the memory issues would be to combine the mipmap and compression work with the compositing work. The compositing task could decompress textures, do the compositing work, generate mipmaps, and recompress all at once. This would save quite a bit of memory by allowing both source and destination textures to be stored in compressed format. Further, the compositing task could output the compressed texture directly to a mapped staging buffer, which would eliminate the need for buffers to store the composited textures on the CPU. Additionally, the time taken for compositing could potentially be decreased through algorithm improvements and SSE/AVX usage.<br /><br /><br />
<h2 class="sectionHeading">Thanks</h2>
A special thanks to Frank Luna for allowing us to use code from his book “Introduction to 3D Game Programming with DirectX 10.”[2]<br /><br /><br />
<h2 class="sectionHeading">References</h2>
1. Charles Bloom, “Terrain Texture Compositing by Blending in the Frame-Buffer (aka "Splatting" Textures),” Nov 2, 2000, Retrieved 14:29, December 13, 2011, from <a target="_blank" href="http://www.cbloom.com/3d/techdocs/splatting.txt">http://www.cbloom.com/3d/techdocs/splatting.txt</a><br /><br />2. Frank D. Luna, “Introduction to 3D Game Programming with DirectX 10”, Wordware Publishing, Inc., 2008<br /><br />3. Wang tile, Oct 6, 2011, In <i>Wikipedia, The Free Encyclopedia</i>. Retrieved 22:28, December 13, 2011, from <a target="_blank" href="http://en.wikipedia.org/w/index.php?title=Wang_tile&amp;oldid=454296576">http://en.wikipedia.org/w/index.php?title=Wang_tile&amp;oldid=454296576</a><br /><br />
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">04-23-2012</div>
<div id="vc-meta-modificationdate">04-23-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
<div></div>
</div>
<div id="vc-meta-category">
<div>IvyBridge</div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43388</div>
<div id="vc-meta-abstract">There are many ways to generate and render terrain, but for years one of the most common ways has been to generate a terrain from a heightmap and use a mask texture to blend, or composite, diffuse textures together.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/pre-compositing-textures-for-terrain-rendering/</link>
      <pubDate>Mon, 23 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/pre-compositing-textures-for-terrain-rendering/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/pre-compositing-textures-for-terrain-rendering/</guid>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>Case Study: Parallelizing a Recursive Problem with Intel® Threading Building Blocks</title>
      <description><![CDATA[ <h2>by Louis Feng</h2>
<h2></h2>
<h2>Download Article</h2>
Download <a target="_blank" href="http://software.intel.com/file/43272">Case Study: Parallelizing a Recursive Problem with Intel® Threading Building Blocks</a> [PDF 1.1MB]<br /><br />Recently I have been working closely with DreamWorks Animation engineers to improve the performance of an existing key library in their rendering system. Using a combination of algorithmic improvements and parallelization, we achieved an overall 35X performance improvement in some cases. One of the requirements for this project was to minimize changes to the library structure to control development cost. In this article, I will share some of the techniques I used to parallelize a recursive problem in the DreamWorks Animation rendering system.<br /><br />Before I dive into the details, let me give a quick overview of the application that I was trying to analyze and speed up. Artists use digital content creation tools to create scene assets, such as virtual cameras, lights, and 3D models. To render an image using these scene asset files, the files must first be parsed and then converted to rendering data structures before the renderer is executed. This conversion step is referred to as data conditioning. Scene assets are represented by a graph. The graph has nodes representing objects like cameras, lights, and models. A node could also reference another node as an instance, for example, in a forest of trees. The data conditioning step recursively transforms in-memory scene objects into the representation needed for rendering, copying all the necessary data. Data conditioning often involves a large amount of data and a large number of data objects. In practice, the data conditioning cost varies widely depending on the number and complexity of the objects in the scene. It's essential that this conversion operation is done quickly because it happens at the start of each rendering process. 
Hypothetically, if it takes an hour to render a frame, we don't want to spend 15 minutes in data conditioning. That would be a 25% overhead. More importantly, part of the interactive workflow, where a few seconds could make a big difference, also has to go through the data conditioning step. The main computation of the data conditioning library involved recursive graph traversal. My task was to figure out how to speed up this step as much as possible.<br /><br />
<p ><img src="http://software.intel.com/file/43240" /></p>
<b>Figure 1.</b> <i>Intel® VTune Amplifier XE Lightweight Hotspot shows the CPU utilization of the data conditioning library. The top area is showing the timeline and two green bars (each representing a thread). The active thread (second green bar) is showing CPU activities. The bottom half of the picture shows the overall CPU utilization.</i><br /><br />A few of the Intel tools and technologies, such as Intel® C++ Compiler, Intel® VTune Amplifier XE, Intel® Inspector XE, and Intel® Threading Building Blocks, were essential in achieving the performance improvements. I used Intel® VTune Amplifier XE's Lightweight Hotspot analysis to profile the library, see Figure 1. This picture is only showing the execution time inside the library, excluding all the system calls using the Module filter in VTune Amplifier XE. As shown in the time line, about 100 seconds are spent on data conditioning and overall CPU utilization is low. After some performance analysis, I found that the data conditioning library is a good candidate for parallel execution because in most cases, data objects can be processed independently.<br /><br />
<p ><img src="http://software.intel.com/file/43241" /></p>
<b>Figure 2.</b> <i>The conditioning library sequential execution call stack and corresponding computation cost. The middle column shows Self CPU time spent in a particular function. The right column shows Total CPU time, which includes Self time and the Self time of all functions that were called from that function. Here it is shown as a percentage of the total execution time of the run.</i><br /><br />Let's take a look at the call stack from the Hotspot analysis shown in Figure 2. <code>conditionObject()</code> is the main entry point for the recursive traversal. The call stack goes much deeper than what is shown in this picture. This type of recursion is called mutual recursion or indirect recursion. The <code>conditionObject()</code> method is called from many locations in the library. Almost 90% of the data transformation time is spent in <code>conditionObject()</code>, so it's a great target for parallelization.<br /><br />Intel® VTune Amplifier XE analysis provides valuable information on where the performance issues are. For example, I discovered that during a function call, an object was implicitly converted into an object of another class when passed in as a function parameter. Constructing a new instance of that class is fairly expensive and that function was called frequently. It would be difficult to detect such issues without a tool like VTune Amplifier XE. Over the course of the project, DreamWorks Animation engineers made many algorithmic and implementation improvements (such as fixing the object conversion issue) which sped up the single thread conditioning library performance by over 4.5X. Using TBB for parallelization, I was able to obtain an average of 6.25X additional speed up on an 8 core Xeon® system.<br /><br />Intel has many technologies for enabling parallelization: TBB, OpenMP*, Cilk Plus*, and many others. I chose the TBB library because the problems we are trying to solve are complex and TBB has solved most of them already. 
Also, DreamWorks has standardized on TBB as their primary programming model for data and task parallelism in their graphics code. The computation kernel (in this case, <code>conditionObject()</code>) can be considered as a TBB task. TBB already has thread-safe data structures and algorithms. We can leverage some of the important features of TBB, such as the high performance memory allocator, work-stealing based task scheduler, and synchronization primitives. One thing to note is that data conditioning involves allocating a large number of small objects. This is important to consider because the memory allocator could be a performance bottleneck in multithreaded applications due to synchronization. As I will show later, the TBB high performance memory allocator can help solve this problem (1).<br /><br />One great way to learn TBB is to study how it's used with design patterns (2). When I started working on this project, I considered whether I could simply apply some of the existing TBB design patterns to solve this problem. The recursion is similar to the Agglomeration and Divide and Conquer patterns, but in this case we are dealing with indirect recursion, and there is no clean way to convert it to direct recursion without changing the implementation. There are also data dependencies and thread-safety issues that need to be resolved.<br /><br />To figure out a solution, let's step back and look at the problem in a more abstract way. From the call stack, you can see that the recursive computation is called from many different locations. The number of nodes we need to transform is unknown ahead of time because nodes can create instances of other nodes. There are data dependencies between the nodes, which limit parallelism and increase complexity. Additionally, although one of the design goals of the data conditioning library is to be thread-safe, some parts of it are not. For example, the node data are accessed through a cache data structure called Context. 
This cache data structure is restricted to a single thread. Ideally, we want to enable multithreading while trying to minimize the changes to the library. More changes to the library means increased risks and complexity for the project.<br /><br />
<p ><img src="http://software.intel.com/file/43242" /></p>
<b>Figure 3.</b> <i>(a) Shows the control flow diagram of the example recursive program. We start by visiting the root node and do some work. For example, the work might involve allocating new objects (e.g. cameras, lights, and 3D models), and then processing and computing object data. If this node has children, we visit each child node and do some more work there. This is done recursively. (b) Shows the result of refactoring the code to prepare for parallelization.</i><br /><br />We can actually solve each of the problems independently. To remove the dependencies between the nodes, we restructure the computation in two ways. One is to do the bookkeeping of data objects up front so that the child node can be processed without blocking on the parent, see Figure 3. For example, we can keep track of all the nodes we have already visited and create an instance of the corresponding data object without actually filling in the data (which is the expensive part). This allows object instances to still reference each other. Another way to remove data dependencies is to extract the dependent operations into a post-processing step. For example, when one node object requests data from another node object, this type of operation has to be moved into the post-processing stage after task synchronization.<br /><br />
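The bookkeeping idea can be sketched in a few lines of C++. This is a simplified illustration under stated assumptions, not the library's code: <code>Node</code>, <code>RenderObject</code>, <code>Registry</code>, and <code>registerGraph()</code> are hypothetical names, and the sketch runs the bookkeeping as its own sequential pass, whereas the example program discussed below interleaves it with the traversal.<br /><br />

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a scene graph node and its render-side object.
struct Node {
    int id;
    std::vector<const Node*> children;
};

struct RenderObject {
    int sourceId;
    bool filled;  // would be set later by the expensive compute step
};

// Bookkeeping: one placeholder RenderObject per visited node.
class Registry {
public:
    // Returns the placeholder for `node`; `created` reports a first visit.
    RenderObject* getOrCreate(const Node& node, bool& created)
    {
        auto it = objects_.find(&node);
        created = (it == objects_.end());
        if (created) {
            std::unique_ptr<RenderObject> obj(new RenderObject{node.id, false});
            it = objects_.emplace(&node, std::move(obj)).first;
        }
        return it->second.get();
    }

    std::size_t size() const { return objects_.size(); }

private:
    std::unordered_map<const Node*, std::unique_ptr<RenderObject>> objects_;
};

// Cheap sequential pass: create placeholders and collect the objects whose
// data still has to be filled in. Each pending entry is independent work.
void registerGraph(const Node& node, Registry& registry,
                   std::vector<RenderObject*>& pending)
{
    bool created = false;
    RenderObject* obj = registry.getOrCreate(node, created);
    if (!created) {
        return;  // already visited, e.g. an instanced node
    }
    pending.push_back(obj);
    for (const Node* child : node.children) {
        registerGraph(*child, registry, pending);
    }
}
```

Because every node gets its placeholder before any data is filled in, objects can already hold references to each other, and a node that is instanced from several parents is registered only once. The entries collected in <code>pending</code> are the independent pieces of work that could then be handed to tasks.<br /><br />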
<p ><img src="http://software.intel.com/file/43244" /></p>
<b>Figure 4.</b> <i>Adding TBB into the mix. Instead of executing the compute kernel directly, a TBB task is created and spawned.</i><br /><br />While we don't know the number of tasks ahead of time, using TBB we can create them recursively on the fly. This may not be the optimal way of using TBB, but the flexibility is important to us, see Figure 4. The independent part of the computation can then run in parallel. To work around the thread-safety issues of the cache, we can create an instance of the data structure for each thread. Fortunately, the cache only uses a small amount of memory. If it were infeasible to create separate cache instances for each thread, I would look into changing the cache data structure to ensure thread-safety.<br /><br /><img src="http://software.intel.com/file/43245" /><br /><b>Figure 5.</b> <i>An example of a recursive program that traverses a graph and does some computation at each node of the graph.</i><br /><br />Let's look at the source code of a simplified example program, see Figure 5. This example has a structure similar to that of the DreamWorks data conditioning library, with many details of the original library safely ignored for the purpose of this discussion. We have a simple program which builds a graph, and for each node in the graph we want to do some work through the <code>processNode()</code> function. A few parameters are used by <code>processNode()</code>: the graph node, a context, and the state. The context has a cache that's not thread-safe. The state object has everything else we need to carry around for the computation. Now we are going to make this program run in parallel.<br /><br /><img src="http://software.intel.com/file/43246" /><br /><b>Figure 6.</b> <i>Code refactoring to separate object data dependencies.</i><br /><br />If you recall, our solution for removing object data dependencies is to separate the work into multiple parts:<br /><br />
<ul>
<li>Bookkeeping to manage new objects and inter-object references.</li>
<li>Main computation kernel that processes and computes independent object data.</li>
<li>Post-processing on object data that have dependencies.</li>
</ul>
Everything else remains the same. Figure 6 shows the new structure of the code. The key is that, with these changes, all the interfaces remain intact. Any external calls to <code>processNode()</code> need not be changed.<br /><br />Now we are ready to add TBB into the mix. Since the <code>computeKernel()</code> function can be run in parallel safely, I will create a TBB task for it. Figure 7 shows the actual code to do just that. The boldfaced lines are new code I have added to the example program. <code>TASK_ROOT</code> is going to be the parent task of the tasks we are going to create later.<br /><br /><img src="http://software.intel.com/file/43247" /><br /><b>Figure 7.</b> <i>Added TBB code to spawn tasks for the compute kernel.</i><br /><br /><code>TASK_ROOT</code> is an <code>empty_task</code> because it doesn't actually do anything. It's important to set its reference count to 1 immediately. This lets TBB know that I am going to call <code>wait_for_all()</code> in a blocking style. Otherwise, <code>wait_for_all()</code> might return before all the tasks complete. I also have to destroy the <code>empty_task</code> explicitly when it's no longer needed. Now look at the <code>processNode()</code> function. Instead of calling <code>computeKernel()</code>, I create a <code>ComputeTask</code> as a child of our <code>TASK_ROOT</code>. The <code>allocate_additional_child_of()</code> call increases the reference count of the parent task. Then the child task is spawned.<br /><br /><img src="http://software.intel.com/file/43248" /><br /><b>Figure 8.</b> <i>The </i><code>ComputeTask</code> <i>that is run by the TBB scheduler.</i><br /><br />Figure 8 shows the implementation of <code>ComputeTask</code>, which inherits from the <code>tbb::task</code> base class. It keeps a copy of all the function parameters we passed to it so that it can continue to run when TBB schedules an instance of this task and runs <code>execute()</code>. 
In this case, the <code>computeKernel()</code> function is called with all the parameters and the proper values. While this code will run in parallel, we still have one remaining problem. Recall that <code>Context</code> is not thread-safe. So far we have one context that's shared by all the tasks and threads. We need to fix this so that we don't have race conditions. What we need is an instance of <code>Context</code> for each thread. We don't want to create a context for each task because that would be too expensive.<br /><br />There are two ways to get per-thread data. One way is for each thread to find out its own unique thread ID at run time. Another way is to use TBB thread local storage (TLS). TBB TLS is essentially a container that stores per-thread data. In any given running thread, you can ask this container for your local instance of the data. Each instance of the data is created only the first time a thread asks for it. For example, if your machine has 16 threads and you allocated only 5 threads to run TBB tasks, a maximum of 5 instances of the data will be created for these threads.<br /><br />The Intel® TBB team recommends using TLS rather than thread IDs for a number of reasons (3). TBB advocates task-based parallelism. It wants us to stay away from exposing the underlying threads. If you know the thread ID, then you can do things that TBB was not intended to be used for. TBB allows you to get the thread ID, but only use it if you have very good reasons. Another benefit of using thread local storage is that you don't have to worry about how many threads you are working with or what type of system you are running on.<br /><br /><img src="http://software.intel.com/file/43249" /><br /><b>Figure 9.</b> <i>Use TBB thread local storage to access data that cannot be shared between threads.</i><br /><br />Figure 9 shows how I used thread local storage. 
Instead of passing in the context when the task was created, I used the thread local context instance. I declared a thread local storage type using the TBB <code>enumerable_thread_specific</code> class. I also created an object called <code>THREAD_CONTEXT</code>, initialized with an exemplar context object, which will hold our thread local data. When <code>THREAD_CONTEXT.local()</code> is called, <code>THREAD_CONTEXT</code> first checks whether a context object has already been allocated for this thread. If it has, <code>local()</code> simply returns a reference to it. Otherwise, a new instance of the context object is created (through the copy constructor) and then returned.<br /><br />
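To make the behavior of <code>local()</code> concrete, here is a minimal standard-library sketch of the same idea: a container that lazily copy-constructs one instance per calling thread from an exemplar. It is a stand-in for illustration only and is not how TBB implements <code>enumerable_thread_specific</code>. The names <code>PerThread</code> and <code>cacheEntries</code> are hypothetical, and the single mutex is used purely for brevity.<br /><br />

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the article's non-thread-safe Context.
struct Context {
    int cacheEntries;
};

// Simplified illustration of lazily created per-thread instances.
template <typename T>
class PerThread {
public:
    explicit PerThread(const T& exemplar) : exemplar_(exemplar) {}

    // Returns the calling thread's instance, copy-constructing it from the
    // exemplar the first time this thread asks for it.
    T& local()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::thread::id self = std::this_thread::get_id();
        auto it = instances_.find(self);
        if (it == instances_.end()) {
            it = instances_.emplace(self, exemplar_).first;
        }
        return it->second;  // std::map references stay valid after inserts
    }

    // Number of threads that have requested an instance so far.
    std::size_t size()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return instances_.size();
    }

private:
    T exemplar_;
    std::mutex mutex_;
    std::map<std::thread::id, T> instances_;
};
```

A task body would call <code>local()</code> on a shared <code>PerThread&lt;Context&gt;</code> object and work on the returned reference. Repeated calls from the same thread return the same instance, and threads that never call <code>local()</code> never cause an instance to be created.<br /><br />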
<p ><img src="http://software.intel.com/file/43250" /></p>
<b>Figure 10.</b> <i>RDL library performance comparison. Shot 2 is a medium size scene and our improvements resulted in 35X speed up, from the original 98.56 seconds run time down to 2.79 seconds. Legend: ST = single thread, MT-TBB = multithreaded using TBB, MT-TBB-Malloc = multithreaded using TBB and TBB malloc.</i><br /><br />In summary, I have divided the computation kernel at each node into three parts:<br /><br />
<ul>
<li>Bookkeeping to keep track of the objects and allow node objects to reference each other if needed</li>
<li>Post-processing that handles the data dependencies, so that the computation kernel can run independently in parallel.</li>
<li>Thread local storage is used for non-thread-safe data to avoid race conditions.</li>
</ul>
Using these techniques, I was able to speed up the data conditioning library by 6.25X on average on an 8 core Xeon system without making major changes to the library interface. One of the test shots, shot 2, was sped up by 35X compared to the original ~100 seconds of run time, see Figure 10. Notice the speed improvements when TBB malloc is used to replace the standard malloc. I have done additional tests on shots of various sizes and the performance improvements are consistent. Of course, there is still room for improvement. This library uses mutexes in a couple of locations to avoid thread safety issues. I believe we can improve the performance further by reducing the number of synchronizations. Some data that didn't require locking achieved over 8X speed up on an 8 core Xeon. TBB 4.0 introduced the flow graph to solve graph related problems such as the one we discussed here. The TBB flow graph could help simplify and solve some of the object data dependency problems.<br /><br />1. <b>O'Neill, John, Wells, Alex and Walsh, Matt</b>. Optimizing Without Breaking a Sweat. <i>Intel® Software Network</i>. [Online] [Cited: 11 29, 2011.] <a href="http://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat/">http://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat/</a>.<br /><br />2. TBB Documentation. [Online] <a target="_blank" href="http://threadingbuildingblocks.org/documentation.php">http://threadingbuildingblocks.org/documentation.php</a>*.<br /><br />3. <b>Robison, Arch.</b> Abstracting Thread Local Storage. <i>Intel® Software Network</i>. [Online] <a href="http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/">http://software.intel.com/en-us/blogs/2008/01/31/abstracting-thread-local-storage/</a>.<br />
<div id="vc-meta" >
<div id="vc-meta-author"></div>
<div id="vc-meta-pubdate">04-13-2012</div>
<div id="vc-meta-modificationdate">04-13-2012</div>
<div id="vc-meta-taxonomy">Case Studies</div>
<div id="vc-meta-category-product">
<div class="tbb">Intel® TBB</div>
<div></div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/43240</div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Intel worked closely with DreamWorks Animation engineers to improve the performance of a key rendering system library by up to 35X in some cases.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/case-study-parallelizing-a-recursive-problem-with-intel-threading-building-blocks/</link>
      <pubDate>Fri, 13 Apr 2012 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/case-study-parallelizing-a-recursive-problem-with-intel-threading-building-blocks/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/case-study-parallelizing-a-recursive-problem-with-intel-threading-building-blocks/</guid>
      <category>Parallel Programming</category>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Improving the Compute Performance of Video Processing Software Using AVX (Advanced Vector Extensions) Instructions (by Eli Hernandez and Larry Moore)</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/41302">Improving the Compute Performance of Video Processing Software Using AVX (Advanced Vector Extensions) Instructions</a> [PDF 311KB]<br /><br />
<h2 class="sectionHeading">Abstract</h2>
Modern x86 CPUs permit instruction level parallelism (e.g. SIMD) on register vectors of at most 128 bits. Second Generation Intel® Core™ Processors include the first generation of AVX (256-bit operators), which permits increased parallel processing. This paper outlines a case study in which AVX instructions are used to improve the compute performance of a de-saturation algorithm. The paper also discusses how future integer based AVX instructions might be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms.<br /><br />
<h2 class="sectionHeading">1. Introduction</h2>
Modern x86 CPUs permit Instruction Level Parallelism (ILP), such as Single Instruction Multiple Data (SIMD), on vectors of at most 128 bits. These register vectors can be used to process multiple data elements with fewer instructions. Second Generation Intel® Core™ Processors (codenamed Sandy Bridge) included the first generation of AVX, which is a 256-bit instruction set extension to the Intel® Streaming SIMD Extensions (Intel® SSE).<br /><br />The first generation of AVX included a wide range of instructions designed primarily to accelerate compute intensive algorithms performing arithmetic operations on floating point data. However, even if an algorithm is integer based, using AVX instructions could potentially increase the algorithm’s performance without sacrificing accuracy of the results. In video processing algorithms, the pixel channels are often stored as 8-bit unsigned integers (bytes) and processed as 32-bit or larger integer values. Therefore, most video algorithms require conversion of pixels to and from a wider format. Wider bit widths are used for calculation accuracy and smaller formats are used to save space. Typically, floating-point units are not used because the extra conversion cost is not repaid by a significant improvement in accuracy. However, AVX is capable of greatly improving the runtime performance of video processing software and a vast number of other software applications through increased parallelism.<br /><br />This paper describes a case study in which AVX instructions are used to enhance the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD state to AVX based SIMD. The paper also discusses how future generations of AVX may be able to further aid performance optimization and enable greater performance of video processing.<br /><br />
<h2 class="sectionHeading">2. Intel SIMD Overview</h2>
On Intel SIMD architectures, a vector register can store a group of data elements of a single data type (e.g. floats or integers). The vector registers of Sandy Bridge are 256 bits wide, whereas those of all earlier processors since the Intel® Pentium III are 128 bits wide. Each vector register (called YMM in Sandy Bridge) can store 8 floats, eight 32-bit integers, 32 chars, etc. AVX instructions operate on the full 256 bits, but SSE can only operate on 128 bits.<br /><br />A SIMD-enabled processor can execute a single operation on multiple data elements. An operation performed simultaneously on multiple data elements is a vector process. SIMD vectorization is the process of converting an algorithm from a scalar to a vector implementation. The multiply function in the sample code below illustrates the difference between the scalar and SIMD vector processes.<br /><br /><img src="http://software.intel.com/file/41320" /><br /><br />
<p ><img src="http://software.intel.com/file/41321" /></p>
<div ><b>Figure 1:</b><i> This illustrates the difference between scalar and vector processes. The scalar version would have 16 loads, 8 multiplications and 8 stores. SSE can potentially have 4 loads, 2 vector multiplications and 2 stores. AVX would use 2 loads, 1 large vector multiplication and 1 store. The VMUL labels were shortened to hide the distinction between the various versions of vector multiplication instructions. VMUL multiplies vectors A and B element pair by element pair and stores the results in another vector. Suppose for simplicity that loads and stores cost 3 cycles, every multiplication costs 1 cycle, and pipelining is ignored. Then the scalar version spends 80 cycles to compute 8 elements while the AVX version spends 10 cycles, yielding a theoretical speedup of 8x. This illustrates why SIMD vectorization has become an important part of optimizing application performance. Given the observed performance benefits of SIMD, automatic SIMD vectorization has also become a keystone feature in advanced compilers.<br /></i></div>
<br />
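The cycle counts in the Figure 1 caption can be checked with a few lines of arithmetic. The following is a minimal sketch in Python of that toy cost model only; the function name <i>cycles</i> and the constant names are illustrative and do not come from the article's sample code.<br /><br />

```python
# Toy cost model from the Figure 1 caption: loads and stores cost 3 cycles,
# a multiplication (scalar or vector) costs 1 cycle, pipelining is ignored.
LOAD_STORE_CYCLES = 3
MUL_CYCLES = 1


def cycles(loads, muls, stores):
    """Total cycles for the given counts of loads, multiplications and stores."""
    return (loads + stores) * LOAD_STORE_CYCLES + muls * MUL_CYCLES


# Operation counts for multiplying 8 element pairs, as listed in the caption.
scalar = cycles(loads=16, muls=8, stores=8)
sse = cycles(loads=4, muls=2, stores=2)
avx = cycles(loads=2, muls=1, stores=1)

print(scalar, sse, avx)  # 80 20 10
print(scalar / avx)      # 8.0
```

Under the same assumptions the SSE version lands at 20 cycles, halfway between the scalar and AVX versions on a logarithmic scale.<br /><br />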
<h2 class="sectionHeading">3. Video Processing Code</h2>
Typical video processing algorithms calculate pixel values using a triple for-loop (for each frame, for each X, for each Y). This loop nest is typically an area of high CPU utilization (i.e. a hotspot). Hotspots in video processing applications are excellent candidates for optimization with AVX.<br /><br />A simple approach to optimizing with SIMD is to take advantage of the latest processor features, such as AVX. The following sections describe the optimization process using AVX instructions to enhance the performance of a de-saturation algorithm. The serial code implementation is briefly discussed, and AVX-based SIMD instructions are then used to optimize the de-saturation algorithm. Finally, this chapter ends with the performance results of the optimized code.<br /><br /><br />
<h2 class="sectionHeading">3.1. Desaturation - Sample Code</h2>
The typical implementation of the Desaturation algorithm uses the incoming pixel values to compute a luminance value. The luminance value is applied to all outgoing pixels to de-saturate images as part of processing video for output.<br /><br />As the sample code below shows, the algorithm traverses the image row by row to get pixel data, and each pixel's channel values (blue, green, and red) are used to calculate the luminance value. In order to achieve high accuracy, the algorithm converts the one-byte channel values to single-precision floating point. The floating-point values are used in a dot-product type of operation to compute the luminance value. The Desaturation sample algorithm uses the fLuminance(…) function to convert pixel channel values from byte to float. The conversion happens implicitly by typecasting each channel value to float. With constant weights for Red, Green and Blue, the fLuminance(…) function then uses the float values to compute the luminance, which is applied to the video output.<br /><br /><img src="http://software.intel.com/file/41322" /><br /><img src="http://software.intel.com/file/41323" /><br /><br />Note that the conversion of channel data from byte to float occurs implicitly by typecasting to float. Although the scalar code looks simple, the assembly code generated by the compiler is much more complex. Analysis of the generated assembly code shows that the implicit byte-to-float conversion can be performed with fewer instructions by using the more efficient AVX instructions. As we have observed, the serial code calculates one channel and one pixel at a time. Nothing is computed in parallel (ignoring pipelining and reordering). Refer to <b>Appendix A</b> for the assembly code.<br /><br /><br />
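For readers without access to the images, the scalar procedure described above can be sketched as a small reference model. This is a Python sketch, not the article's C code. The weights are an assumption (the common Rec. 601 luma constants), since the article's constants are not shown in the text, and the names <i>luminance</i> and <i>desaturate</i> are hypothetical.<br /><br />

```python
# Assumed Rec. 601 luma weights; the article's actual constants are not shown.
WEIGHT_RED, WEIGHT_GREEN, WEIGHT_BLUE = 0.299, 0.587, 0.114


def luminance(blue, green, red):
    """Weighted sum (dot product) of one pixel's channels, computed in float."""
    return (WEIGHT_BLUE * float(blue)
            + WEIGHT_GREEN * float(green)
            + WEIGHT_RED * float(red))


def desaturate(image):
    """image: list of rows; each row is a list of (blue, green, red) bytes."""
    out = []
    for row in image:                      # traverse row by row
        out_row = []
        for blue, green, red in row:       # one pixel at a time
            # Round to nearest and clamp to the byte range 0..255.
            gray = min(255, max(0, int(luminance(blue, green, red) + 0.5)))
            out_row.append((gray, gray, gray))
        out.append(out_row)
    return out


frame = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (0, 0, 255)]]
print(desaturate(frame))
```

Each output pixel carries the same value in all three channels, which is what makes the image appear gray. Nothing in this model runs in parallel, matching the serial code's one-channel, one-pixel-at-a-time behavior.<br /><br />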
<h2 class="sectionHeading">3.2. Desaturation - Optimization with AVX</h2>
This section outlines the transformation of the serial code and describes how AVX, SSE4.1 and SSE2 instructions optimize the de-saturation algorithm. As illustrated in Section 2, with SIMD we can work on many items at once. Therefore, the load, store, conversion and math operations can be done in parallel. The algorithm below describes how we can use instruction level parallelism (via AVX instructions) to significantly improve performance. Note that the algorithm is restricted to instructions that are available today, not the idealized future instructions we discuss later. Therefore, lines 19, 20 and 21 involve an intermediate step to convert 32-bit integers back down to 8-bit unsigned values.<br /><br /><img src="http://software.intel.com/file/41332" /><br /><img src="http://software.intel.com/file/41333" /><br /><br />
<div ><b>Figure 2. </b><i>De-saturation algorithm</i><br /><br /></div>
<br />With Figure 2 as the backbone of the de-saturate routine, we can implement the real code. The motivations for using a procedure similar to Figures 2 and 3 are that:<br /><br />
<ul>
<li>AVX provides greater throughput for parallel processing of single-precision floating-point data than any past Intel SIMD x86 extension (MMX, SSE, SSE2, SSE3, SSE4.1, SSE4.2).</li>
<li>The SIMD cost of casting a byte (8-bit unsigned char) to an integer (32-bit signed integer) to single-precision floating point (32-bit float) and back is less than the cost of multiple calls to the equivalent scalar code using just bytes or integers.</li>
<li>Using byte-based SIMD with this procedure gives poor precision; its parallel performance is not considered.</li>
<li>Using integer-based SIMD with this procedure gives acceptable precision. However, AVX currently has no integer arithmetic instructions, so integer code cannot take full advantage of the 256-bit registers.</li>
<li>Using float-based SIMD with this procedure gives very good precision and offers higher performance than the options described above.</li>
</ul>
<img src="http://software.intel.com/file/41334" /><br /><img src="http://software.intel.com/file/41335" /><br /><br /><b>Figure 3. </b><i>De-saturation code optimized with AVX<sup>1</sup></i><br /><br />The algorithm and AVX code shown in Figures 2 and 3 convey exactly the same process line for line. Notice that only lines 9 and 16 do the real work. These lines each perform 8 single-precision floating-point multiplications in parallel, for a total of 16 multiplications in 2 instructions instead of 16 individual multiplication instructions. Everything else is overhead needed to make use of the parallel instructions or to increase precision.<br /><br />Despite the overhead, this code still improves performance by 1.45x. If integer-based instructions existed with parallelism equivalent to that of single-precision floating point, we could further increase performance. In that case, lines 6, 8, 10, 11, 14, 15, 17 and 18 could be eliminated. Lines 9 and 16 would operate on integers instead. Lines 19, 20 and 21 could be replaced by a single pack instruction (integer to byte). Of course, there are other hypothetical instructions that could be introduced with future AVX generations. The potential performance gain is left as an exercise for the reader. For the assembly instructions generated by the intrinsic functions used in the inner [ix] loop, refer to <b>Appendix B</b>.<br /><br /><br />
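The widen, convert, multiply and narrow sequence described above can be modeled lane by lane. The sketch below is a Python model of one 8-wide step, not a transcription of Figure 3. The names <i>vmul_chunk</i> and <i>vmul_stream</i> are hypothetical, the float-to-integer conversion is assumed to truncate, and the 0.5 weights are chosen only to keep the arithmetic exact.<br /><br />

```python
LANES = 8  # single-precision floats held by one 256-bit register


def vmul_chunk(byte_lanes, weight_lanes):
    """Model one 8-wide step: widen bytes, convert, multiply, narrow."""
    assert len(byte_lanes) == LANES and len(weight_lanes) == LANES
    as_int32 = [int(b) for b in byte_lanes]        # byte -> 32-bit integer
    as_float = [float(v) for v in as_int32]        # integer -> float
    products = [v * w for v, w in zip(as_float, weight_lanes)]  # 1 vector multiply
    back_to_int = [int(p) for p in products]       # float -> integer (truncation assumed)
    return [min(255, max(0, v)) for v in back_to_int]  # saturate to a byte


def vmul_stream(values, weights):
    """Process values 8 at a time and count the vector multiplies used."""
    out, vector_multiplies = [], 0
    for start in range(0, len(values), LANES):
        out += vmul_chunk(values[start:start + LANES],
                          weights[start:start + LANES])
        vector_multiplies += 1
    return out, vector_multiplies


result, multiplies = vmul_stream(list(range(0, 160, 10)), [0.5] * 16)
print(result)
print(multiplies)  # 2
```

Sixteen element multiplications complete in two vector steps, which mirrors the 16-multiplications-in-2-instructions observation about lines 9 and 16. Every other line of <i>vmul_chunk</i> is conversion overhead.<br /><br />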
<h2 class="sectionHeading">3.3. Desaturation - Performance Test Results<b><sup>3</sup></b></h2>
In this study, the de-saturation algorithm optimized with AVX achieved a 1.45x speedup over the serial code. To gather performance data, the de-saturation algorithm was applied to a 1440x1080 image in a loop of 100 iterations. Performance was measured as the elapsed time (in milliseconds) taken to de-saturate the image. The following performance numbers were consistently observed:<br /><br />
<table cellpadding="10" cellspacing="0" border="0">
<tbody>
<tr>
<td>Serial Code:</td>
<td>1264 milliseconds</td>
</tr>
<tr>
<td>Code with AVX:</td>
<td>873 milliseconds</td>
</tr>
<tr>
<td>Performance Scaling:</td>
<td>1.45x or 1264ms/873ms</td>
</tr>
</tbody>
</table>
<br />A kernel (small application program) was used to run the algorithm. A kernel with a 1.45x scaling typically translates to a performance improvement of 10% to 15% when measured at the workload level. However, for video processing this rule of thumb does not apply. Suppose you are applying the de-saturation algorithm to a video clip of one minute or longer. In that case, there will be more than 100 frames (images) to process. In theory, since more data has to be processed, the potential performance boost could be more than 1.45x, especially when processing full High-Definition (i.e., 1920x1080) video.<br /><br /><br />
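The numbers in the table can be reproduced directly, and the kernel-versus-workload rule of thumb can be related to Amdahl's law. The following is a hedged sketch in Python. The measured times come from the table above, while the function <i>workload_speedup</i> and the kernel fractions 0.30 and 0.42 are illustrative assumptions, not measurements from the article.<br /><br />

```python
serial_ms = 1264.0  # 100 passes over a 1440x1080 image, serial code
avx_ms = 873.0      # the same workload with the AVX code
passes = 100

speedup = serial_ms / avx_ms
print(round(speedup, 2))                    # 1.45
print(serial_ms / passes, avx_ms / passes)  # 12.64 8.73 (ms per pass)


def workload_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: only kernel_fraction of the run time is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Hypothetical kernel fractions of total application run time.
for fraction in (0.30, 0.42):
    print(fraction, round(workload_speedup(fraction, speedup), 2))
```

Under this model, a workload gain of 10% to 15% corresponds to the kernel accounting for roughly 30% to 42% of total run time. The closer the de-saturation loop comes to dominating the run time, the closer the workload gain gets to the full 1.45x.<br /><br />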
<h2 class="sectionHeading">4. Packed Integer Conversion Instructions</h2>
Since our optimized de-saturation algorithm uses one of the SSE4.1 instructions, we give an overview of the SSE4.1 packed integer conversions here, because other SSE4.1 instructions may be applicable to the optimization of other video processing algorithms. The Packed Integer Conversion instruction set contains 12 instructions for packed integer bit width conversions, any of which can be used to optimize code where the bit width of integer data is to be increased.<br /><br />The table in <b>Figure 4</b> lists the SSE4.1 instructions for packed integer conversions. The instructions support sign extension and zero extension conversions of byte to word, byte to double word, byte to quad-word, word to double word, word to quad-word, and double word to quad-word. Additionally, the chart shows a comparison of the SSE2 and SSE4.1 instructions needed to convert four (4) one-byte integers to four (4) 32-bit integers.<br /><br />The <i>pmovzxbd (byte to double word)</i> instruction was used a total of four (4) times in the de-saturate optimization. If these instructions gain support for the full 256-bit registers, the number of uses in the optimized algorithm will be reduced to two (2), further improving the loop performance.<br /><br /><img src="http://software.intel.com/file/41328" /><br /><b>Figure 4. </b><i>Instructions for bit width conversions of packed integers</i><br /><br />The source operand of a packed integer conversion instruction is either an XMM register or memory. The destination is always an XMM register. When accessing memory, no alignment is required unless alignment checking is enabled, in which case all conversions must be aligned to the width of the memory being referenced. The number of elements that can be converted and the width of the memory reference are illustrated in <b>Figure 5</b>. The alignment requirement is shown in parentheses.<br /><br /><img src="http://software.intel.com/file/41329" /><br /><b>Figure 5. Number of elements to process.</b> <i>P is Packed. MOV is Move (copy register). ZX is Zero Extend. SX is Sign Extend. B is Byte. W is Word. D is Double Word. Q is Quad-Word.</i><br /><br /><br />
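The difference between zero extension and sign extension is easiest to see on concrete byte values. The sketch below models the byte-to-double-word pair in Python. The function names reuse the instruction mnemonics for readability, but they are plain Python models, and the 16-bytes-per-iteration figure is an assumption inferred from the four uses of pmovzxbd mentioned above.<br /><br />

```python
def pmovzxbd(source_bytes):
    """Model of pmovzxbd: zero-extend the low 4 bytes to four 32-bit values."""
    return [b for b in source_bytes[:4]]


def pmovsxbd(source_bytes):
    """Model of pmovsxbd: sign-extend the low 4 bytes to four 32-bit values."""
    return [b - 256 if b >= 128 else b for b in source_bytes[:4]]


def conversions_needed(byte_count, elements_per_instruction):
    """Number of conversion instructions needed (ceiling division)."""
    return -(-byte_count // elements_per_instruction)


pixels = bytes([0, 127, 128, 255, 9, 9, 9, 9])
print(pmovzxbd(pixels))  # [0, 127, 128, 255]
print(pmovsxbd(pixels))  # [0, 127, -128, -1]

# Assumed 16 channel bytes per loop iteration: 4 per 128-bit conversion,
# versus 8 per conversion for a hypothetical 256-bit form.
print(conversions_needed(16, 4), conversions_needed(16, 8))  # 4 2
```

Pixel channels are unsigned, so the zero-extending form is the one that preserves values such as 128 and 255. The sign-extending form would turn them into negative numbers.<br /><br />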
<h2 class="sectionHeading">5. Conclusion</h2>
This paper has discussed how Second Generation Intel® Core™ Processors can increase parallel processing via AVX instructions and 256-bit registers. It outlined a case study in which AVX instructions were used to improve the compute performance of a de-saturation algorithm. The paper also discussed how future integer-based AVX instructions could be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms. The procedure described demonstrated how AVX instructions or their intrinsic functions can be used to improve the runtime performance of video processing applications. Despite some overhead incurred to set up for SIMD processing, the video de-saturation kernel still achieved a 1.45x speedup over the serial code.<br /><br /><br />
<h2 class="sectionHeading">About the Authors</h2>
Eli Hernandez is an Application Engineer in the Consumer Client and Power Enabling Group at Intel Corporation where he works with customers to optimize their software for power efficiency and to run best on Intel hardware and software technologies. Eli joined Intel in August of 2007 with over 12 years of experience in software development for the telecom and the chemical industry. He received his B.S. in Electrical Engineering in 1989 and completed Master Studies in Computer Science in 1991-1992 from the DePaul University of Chicago.<br /><br />In 2008, Larry Moore graduated from Saint Petersburg College with Honors. He received a Who's Who Among Students Award and was a member of Phi Theta Kappa Honor Society. In 2011, he spent 8 months at Intel as an application engineer intern, in DuPont, Washington. Currently, he is attending the University of South Florida at Tampa, Florida in an accelerated graduate program, pursuing both a Bachelor of Science and Master of Science in Computer Engineering. His current research involves computer aided verification of real-time systems and model checking. Larry is also a member of IEEE Computer Society. <br /><br /><br />
<h2 class="sectionHeading">Appendix A: Inner loop equivalent assembly of the serial code</h2>
Roughly 45 instructions are needed to process one iteration of the algorithm's inner loop, with a throughput of 1 pixel per iteration.<br /><br /><img src="http://software.intel.com/file/41330" /><br /><br />
<h2 class="sectionHeading">Appendix B: Equivalent assembly of inner loop optimized with AVX</h2>
Roughly 30 instructions are needed to process one iteration of the algorithm's inner loop, with a throughput of 4 pixels per iteration.<br /><br /><img src="http://software.intel.com/file/41331" /><br /><br /><sup>1</sup> Load and store operations in the optimized code assume that the data is aligned.<br /><sup>2</sup> Please see footnote 3 and section 3.3.<br /><sup>3</sup> The performance measurements in this section are the actual numbers from real tests. However, we do not guarantee that you will achieve the same performance.<br /><br />
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">02-08-2012</div>
<div id="vc-meta-modificationdate">02-08-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/41303</div>
<div id="vc-meta-abstract">This paper describes a case study in which AVX instructions are used to enhance the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD state to AVX based SIMD. The paper also discusses how future generations of AVX may be able to further aid performance optimization and enable greater performance of video processing.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/improving-the-compute-performance-of-video-processing-software-using-avx-advanced-vector-extensions-instructions/</link>
      <pubDate>Wed, 08 Feb 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/improving-the-compute-performance-of-video-processing-software-using-avx-advanced-vector-extensions-instructions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/improving-the-compute-performance-of-video-processing-software-using-avx-advanced-vector-extensions-instructions/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Visual Computing Source</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>New Levels of CT Image Performance and New Levels in Radiation Dose Management</title>
      <description><![CDATA[ <i>Advanced Image Reconstruction Algorithm Running on Intel® Xeon® Processors</i><i><br />By Kerry Evans (Intel), Terry Sych (Intel), and Steven Johnson (Intel) in collaboration with GE Healthcare</i> <i><br /><br /></i><br /> <br />
<h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/41110">New Levels of CT Image Performance and New Levels in Radiation Dose Management</a> [PDF 1.5MB]<br /> <br />
<h2 class="sectionHeading">Introduction</h2>
Advances in computed tomography (CT) have increased dramatically during the past ten years, offering a non-invasive technique for examining patients as an alternative for exploratory surgeries that were once routine clinical practice. Significant strides have also been made in the effort to minimize radiation exposure associated with CT.<br /> <br /> Paving the way for improved image quality at lower dose is GE*'s model-based iterative reconstruction (MBIR) technology, called Veo*<sup>1</sup> (pronounced vay-oh). MBIR algorithms are able to reduce noise<sup>2</sup> and improve the resolution of reconstructed images compared to traditional reconstruction methods. This allows physicians to obtain the information desired for diagnosis using dramatically less dose. In development, the MBIR approach showed great promise by significantly increasing image quality, but processing times were too long for clinical use. This obstacle was overcome when GE and Intel engineers teamed up to reduce the processing time from several days for one case to one hour for multiple cases. Veo is now implemented in the GE Discovery* CT750 HD (Figure 1), and in early evaluation has generated head, chest and abdomen CT cases using X-ray dosages of less than 1 millisievert (mSv), well below the typical average dosages of 5 to 10 mSv required today. By comparison, average exposure to background radiation is about 3 mSv per year.<sup>3</sup><br /> <br /> The ultimate goal in CT, of course, is to provide clinicians with the highest quality images as fast as possible, maximizing diagnostic accuracy and speed, while optimizing staff productivity and patient throughput. This brief reviews key MBIR technology and the steps taken to dramatically decrease CT image processing time on servers equipped with the Intel® Xeon® processor.<br /> <br />
<p ><img src="http://software.intel.com/file/41111" /></p>
<div ><b>Figure 1.</b> <i>GE Discovery* CT750 HD</i><br /></div>
<br /> 
<hr />
<br /> <small><sup>1</sup> In clinical practice, the use of Veo may reduce patient CT dose depending on the clinical task, patient size, anatomical location and clinical practice. A consultation with a radiologist and a physicist should be made to determine the appropriate dose to obtain diagnostic image quality for the particular clinical task.<br /> <br /> <sup>2</sup> Noise is measured as pixel noise standard deviation.<br /> <br /> <sup>3</sup> Mettler, Jr. FA, et al., Effective Doses in Radiology and Diagnostic Nuclear Medicine: A Catalog, Radiology, July 2008, Vol. 248, No. 1, pp. 254-263<br /><br /> <br /></small>
<h2 class="sectionHeading">Historical Perspective</h2>
CT scanning is a significant advance in medical imaging for aiding in the diagnosis of a wide range of illnesses and injuries. The raw data from a CT scan is not an image, and therefore is not readable by a human; the data needs to be processed to reconstruct and produce cross-sectional images clinicians can view. CT dates back to the 1970s in the UK, where Godfrey Hounsfield developed the first CT scanner that was used to scan a human brain.<sup>4</sup><br /> <br /> Despite major advances in X-ray sources, detector arrays, gantry mechanical design and computer performance, one component of computed tomography (CT) scanners has remained virtually constant for the past 25 years: the reconstruction algorithm.<sup>5</sup><br /> <br /> <span >Transition to MBIR</span> <br /> A technique known as filtered back-projection (FBP) has been the foundation of commercially available CT reconstruction techniques since the 1970s. Its advantages are speed and formulas with a closed-form solution, requiring just a single pass over the acquired data. Although FBP is still widely used today, image quality is particularly sensitive to both signal and noise levels. The signal level is established by selecting the proper scanning protocols (output level of the X-ray source, CT gantry rotation speed, detector collimation, helical pitch, etc.). Noise can be the result of fluctuations in the X-ray flux, the detector signal generation process, electronic noise in the data acquisition system, as well as the attenuation properties of the patient body to the X-ray. While these noise sources can be characterized, FBP does not account for them in the reconstruction process. Thus, as radiation level is decreased (i.e., signal level is reduced) image quality suffers. In order to make the mathematics manageable, FBP ignores the important geometric properties of the CT system and assumes perfect responses for all components (i.e., an ideal point X-ray source, an ideal point detector, and an infinitely small image voxel). These simplifications lead to a suboptimal spatial resolution of the reconstructed images.<br /> <br /> A powerful new class of iterative reconstruction algorithms has been designed to reconstruct much higher quality images from CT scan data obtained at greatly reduced radiation levels. Veo’s model-based iterative reconstruction (MBIR) algorithm utilizes system noise models, in addition to X-ray absorption models, to produce high resolution images that allow clinicians to see anatomy in much more detail. Compared to FBP reconstruction, Veo produces much sharper HD images, as shown in Figure 2.<br /> <br />
<p ><img src="http://software.intel.com/file/41112" /></p>
<div ><b>Figure 2.</b> <i>Image Quality Comparison</i><br /></div>
<br /> <span >Today and Future</span> <br /> Due to limitations in computing power and reconstruction technology, model-based iterative approaches have not been practical for commercial CT scanners until now. Processing times have been reduced to a point where Veo can be used in today’s clinical workflow. Continued decreases in processing times are expected over time, driven by advancements in computing hardware and enhancements to MBIR algorithms.<br /> <br /> <i>Advanced model-based algorithms can extract more information from CT scan data. Our design philosophy with Veo* is to provide previously unattainable levels of combined resolution improvement and noise reduction in CT images in order to enhance diagnostic information at dramatically lower dose. </i><b>- Jiang Hsieh, Chief Scientist, GE Healthcare.<br /><br /></b> <br />
<h2 class="sectionHeading">Benefits of Veo* Technology</h2>
Launched in 2008 as the first iterative reconstruction technology, adaptive statistical iterative reconstruction (ASiR) is making a profound impact on over 1,000 GE CT scanners by enabling radiologists to obtain the images they desire. ASiR may allow the use of a lowered mA protocol, thereby reducing the required dose by up to 50 percent on Discovery CT750 HD<sup>6,7</sup>. Veo offers the next step in performance. Today, routine exams like chest, abdomen and pelvis scans may require 4-10 millisieverts (mSv) of X-ray exposure. However, Veo technology promises to deliver enhanced image quality using less than 1 mSv.<br /> <br /> When compared to previous GE reconstruction methods, Veo’s capabilities change the rules of CT imaging by applying revolutionary new modeling techniques to deliver lower noise, resolution gain and artifact suppression. Clinicians will benefit from higher spatial resolution and improved low-contrast detectability when diagnosing or treating disease. The impact is like the movement from standard TV to high-definition TV.<br /> <br /> “We are at the beginning of a very interesting advancement in CT imaging, and the future of Veo* lies in every application. In my opinion, it is one of the major advancements in CT imaging—as important as the development of helical and multi-detector CT.”<br /> Jean-Louis Sablayrolles, MD, Centre Cardiologique du Nord (CCN), Saint-Denis, France.<br /> <br /> Remarkably, Veo has the ability to deliver these improvements at unprecedented low dose levels, benefiting patients, as illustrated in Figure 3.<br /> <br />
<p ><img src="http://software.intel.com/file/41113" /></p>
<div ><b>Figure 3.</b> <i>Clinical Images Using Three Different Reconstruction Methods</i><br /></div>
<br /> 
<hr />
<br /> <sup>6</sup> In clinical practice, the use of ASiR may reduce CT patient dose depending on the clinical task, patient size, anatomical location and clinical practice. A consultation with a radiologist and a physicist should be made to determine the appropriate dose to obtain diagnostic image quality for the particular clinical task.<br /> <br /> <sup>7</sup> ASiR dose reduction was measured on a standard 20cm water phantom. The test involved maintaining constant pixel standard deviation as the mA was reduced, from 300 to 150mA, at 120 kV.<br /><br /> <br />
<h2 class="sectionHeading">MBIR Algorithm Challenges</h2>
The key to reconstruction algorithm design is to accurately model the interactions of radiation with matter as X-rays are produced in the tube, attenuated through the patient, measured at the detector, and transformed into a digital signal. In order to deliver a faithful representation of the actual process, individual models for measured noise statistics, system optics, radiation and detection physics, medical image characteristics, etc., have to be developed to reconstruct images that accurately represent the acquired scan data.<br /> <br /> Because of the complexity of the model description, no closed-form solution is available. Therefore, the estimate of the solution is iteratively determined through multiple passes over the scan data.<br /> <br /> MBIR reconstruction requires much more computing power than its predecessor, filtered back-projection. At the early stages of research, the MBIR algorithm took multiple days to process an image, clearly too long for use in clinical practice. Subsequently, the reconstruction time has been decreased to the point where multiple exams can be processed per hour using a multi-pronged approach incorporating the following:<br /> 
<ul>
<li>Algorithm optimization resulting from joint research of GE engineers and academic research partners</li>
<li>Intel® processor microarchitecture improvements made over several years</li>
<li>A very dense, high performance IBM* Blade Server system</li>
<li>Algorithms tuned by GE and Intel engineers to increase speed on Intel Xeon processors</li>
</ul>
<p> </p>
<h2 class="sectionHeading">MBIR Optimizations</h2>
In 2006, GE and Intel began collaborating with the objective of dramatically reducing the processing time of the MBIR algorithms. GE produced a simpler version of the MBIR application that was easier to benchmark and sent it to Intel for analysis. Intel looked closely at memory accesses and profiled where the CPU spent the most time. Impressive performance gains were achieved by restructuring the algorithms to run more efficiently on multi-core processors and reordering data structures to reduce cache misses and paging. Over this time period, Intel introduced two new generations of processors, boosting performance significantly with the Intel Xeon processor that features an integrated memory controller and Intel® QuickPath Interconnect (Intel® QPI) technology. These hardware and software enhancements are described in more detail in the following sections.<br /> <br /> <span >Multithreading</span> <br /> Initially, the MBIR algorithm was single threaded and could only run on a single processing core. The code was split into two threading models: a data parallel threading model that splits the data elements among the available threads, and a process parallel model that divides types of processing among the threads. This multithreaded approach enabled the software to take advantage of the full processing capacity of a board with two Intel Xeon processors, for a total of 8 cores and 16 threads. Threading the code produced a speedup of about 10 times, as illustrated in Figure 4.<br /> <br />
<p ><img src="http://software.intel.com/file/41114" /></p>
<div ><b>Figure 4.</b> <i>Speed up from Software and Hardware Optimizations</i><br /></div>
<br /> There are 16 threads because the processor incorporates Intel® Hyper-Threading Technology (Intel® HT Technology)<sup>8</sup>, which allows each physical core to process two software threads concurrently, increasing performance by as much as 30 percent.<sup>9,10,11</sup> More details about Intel HT Technology are presented in the section below. Multithreading made it necessary to introduce a low overhead synchronization mechanism to keep all cores working together and prevent data issues, such as race conditions.<br /> <br />
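The data parallel threading model described above can be sketched in a few lines. This Python sketch shows only the partitioning idea and is not GE's implementation. The function names <i>process_slice</i> and <i>data_parallel</i> and the per-element work are hypothetical, and in CPython a thread pool illustrates the structure rather than the speedup.<br /><br />

```python
from concurrent.futures import ThreadPoolExecutor


def process_slice(data, start, stop):
    """Hypothetical per-element work on one thread's own index range."""
    return [value * 2 for value in data[start:stop]]


def data_parallel(data, thread_count):
    """Split the elements into contiguous ranges, one range per thread."""
    if not data:
        return []
    chunk = -(-len(data) // thread_count)  # ceiling division
    ranges = [(start, min(start + chunk, len(data)))
              for start in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # map() returns results in range order, so the output order is stable.
        parts = pool.map(lambda r: process_slice(data, r[0], r[1]), ranges)
        return [value for part in parts for value in part]


print(data_parallel(list(range(10)), thread_count=4))
```

Because each thread reads and writes only its own range, this sketch needs no locking. Shared state between threads is what makes a synchronization mechanism necessary.<br /><br />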
<div class="sidebar">
<h2 class="sectionHeading">Two Threads Per Core</h2>
Intel® Hyper-Threading Technology (Intel® HT Technology)<sup>8</sup> provides separate data paths for two tasks, which means the processor maintains two execution states at the same time. As a result, the CPU will process another task if the task it’s executing stalls (e.g., waiting for an I/O device), which eliminates wasteful idle time.<br /> <br /> The performance improvement derived from Intel HT Technology is illustrated in Figure 5, showing three multi-tasking examples. First, the tasks are executed sequentially, task 1 followed by task 2. Second, the tasks are assigned alternating time slots. These first two examples require about the same amount of time because they both incur significant delays when the CPU must wait for data. Third, Intel HT Technology executes both tasks concurrently, taking advantage of idle time to work on another task and thus reducing overall execution time. The key benefits of Intel HT Technology are greater performance, lower power and smaller form factor, compared to adding another processor to increase compute capacity.<br /> <br /></div>
<p ><img src="http://software.intel.com/file/41115" /><br /> <b>Figure 5.</b> <i>Three Multi-Tasking Examples</i></p>
<span >Data reordering</span> <br /> The data stride was analyzed, which identified how to order the 3D data to reconstruct the image in the most efficient manner. A close examination of the data indices revealed the stride was not uniform and data accesses were scattered all over the 4GB dataset. The data was rearranged to ensure indices always increased and to maximize cache reuse from subset to subset. These changes also allowed the processors’ prefetchers to work more efficiently, which improved the cache hit rate.<br /> <br /> Using the Intel® VTune™ Performance Analyzer, it was possible to identify additional penalties associated with memory access and memory paging. With proper tuning of the memory configuration, these penalties became insignificant. Overall, data reordering and memory tuning doubled performance.<br /> <br /> <span >Data tuning</span> <br /> The Intel VTune Performance Analyzer also helped identify hot spots in the algorithm. Critical areas were hand-coded using Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics to maximize use of the vector processing capabilities of the Intel Xeon processor.<br /> <br /> <span >Compiler change</span> <br /> GE engineers also switched from the GNU* GCC* compiler to the Intel® C++ Studio XE, improving performance by about 20 percent.<br /> <br /> During the course of the project, Intel launched two new generations of Intel Xeon processors with greater performance, which further decreased the processing time of the MBIR algorithm. In all, the combination of the previously discussed software optimizations and Intel® microarchitecture improvements enabled an approximately hundredfold increase in efficiency of the MBIR algorithm. Some of the Intel Development Tools available to help optimize software are described in the section below.<br /> <br />
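The effect of the data reordering step can be illustrated on a toy access pattern. The sketch below is in Python and assumes a one-dimensional array and a known access order. The actual layout of the CT dataset is not described in this article, and the names <i>reorder_for_sequential_access</i> and <i>average_stride</i> are hypothetical.<br /><br />

```python
def reorder_for_sequential_access(data, access_order):
    """Copy data so that the i-th access reads position i of the new array."""
    return [data[index] for index in access_order]


def average_stride(indices):
    """Mean absolute distance between consecutive accesses."""
    steps = [abs(b - a) for a, b in zip(indices, indices[1:])]
    return sum(steps) / len(steps)


access_order = [7, 0, 5, 2, 6, 1, 4, 3]  # scattered, non-uniform stride
data = [10 * i for i in range(8)]

reordered = reorder_for_sequential_access(data, access_order)
new_order = list(range(len(reordered)))  # indices now always increase

print(average_stride(access_order))  # 4.0
print(average_stride(new_order))     # 1.0
print(reordered)
```

After the copy, the same values are visited in the same logical order, but the indices increase by one each time. That is the kind of pattern hardware prefetchers and caches handle well.<br /><br />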
<div class="sidebar">
<h2 class="sectionHeading">Intel Development Tools Overview</h2>
Developers of signal processing applications have a wide choice of development tools from Intel and the broad Intel ecosystem. The benefits of using these comprehensive tool suites are many and impact every phase of the software development process.<br /> <br /> <span >Intel® C++ Compiler</span> <br /> The Intel® C++ Compilers for Linux* and Microsoft* Windows* operating systems are optimized to harness key properties of Intel® architecture processors and deliver optimal performance. They take advantage of a complex set of heuristics to decide which assembly instructions can best optimize the performance in various areas, including memory access, branch prediction, vectorization and floating point operations.<br /> <br /> <span >Intel® Math Kernel Library (Intel® MKL)</span> <br /> Intel® Math Kernel Library (Intel® MKL) is a library of highly optimized, extensively threaded math routines that rely heavily on floating point computations for maximum performance. Core math functions include BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math and more. <br /> <br /> <span >Intel® Integrated Performance Primitives (Intel® IPP)</span> <br /> Intel® Integrated Performance Primitives (Intel® IPP) offers a rich set of library functions and codecs capable of speeding up the development of highly optimized routines for the handling of multimedia formats and data of any kind. They have been hand optimized at a low level to provide maximum performance and ease of use with Intel® processor-based platforms. <br /> <br /> <span >Intel® VTune™ Performance Analyzer</span> <br /> Designed to help developers find bottlenecks in their applications, the tool profiles how the application is using CPU time and computing platform resources throughout the code. 
<br /> <br /> <span >Intel® Application Debugger</span> <br /> A rich and user friendly Eclipse* RCP-based graphical user interface, combined with OS signal and thread awareness, enables developers to cross-debug more easily by finding coding issues that affect application runtime behavior. <br /> <br /> <span >Eclipse*-based Integrated Development Environment</span> <br /> Intel® software development products can be used with the Eclipse Integrated Development Environment (IDE).<br /><br /> <br /></div>
<h2 class="sectionHeading">Advancing Image Reconstruction</h2>
Over the past several years, CT has continued to demonstrate value due to its versatile diagnostic capability, non-invasive application and ability to visualize fine anatomic detail. Innovative new technologies have improved the diagnostic information available to clinicians while lowering radiation dose. GE believes further advancements, like Veo, offer potential for another leap in resolution and reduction of patient radiation exposure - even below the level of today's CT scanners.<br /> <br /> This is possible, in part, due to the exceptional computing performance of Intel Xeon processors and dense server design from IBM. In addition, Intel engineers helped GE optimize their MBIR algorithms to increase speed on Intel Xeon processors from several days to under an hour. Another important consideration for GE was the fact that both Intel and IBM support long product life cycles, a necessity in this market segment where medical equipment must undergo long and extensive regulatory approval.<br /> <br /> To learn more about Computed Tomography solutions from GE Healthcare, please visit <a href="http://www.gehealthcare.com/euen/ct/products/discovery_ct750hd/index.html" target="_blank">http://www.gehealthcare.com/euen/ct/products/discovery_ct750hd/index.html</a><br /> <br /> To learn more about Intel's solutions for Embedded Computing, please visit <a href="http://www.intel.com/go/medical">www.intel.com/go/medical</a>.<br /> <br />
<h2 class="sectionHeading">About the Authors</h2>
Kerry Evans is a Platform Architect in the Intelligent Systems Group at Intel Corporation where he works with customers to design and implement solutions based on Intel hardware and software technologies. Kerry joined Intel in 2005 with 30 years of experience in software development, healthcare and medical research. He received his B.S. in Electrical Engineering in 1975 and M.Eng. and Ph.D. in Bioengineering in 1977 and 1979, respectively, from the University of Utah. He holds 4 US patents.    <br /><br />Terry Sych is a Staff Software Engineer in the Platform Architecture Enabling group at Intel Corporation. He joined Intel in 1992, and has worked on performance analysis and software optimization of enterprise applications for the last 10 years. Terry works with software vendors analyzing, tuning, and optimizing applications. He received a B.S. degree in Computer Engineering from the University of Michigan in 1981 and an MSEE from the University of Minnesota in 1988.  He holds 3 US patents.<br /><br />Steven Johnson is the GE Healthcare Alliance Manager at Intel.  <br /> <br /> 
<hr />
<br /> <sup>4</sup> Source: <a href="http://nobelprize.org/nobel_prizes/medicine/laureates/1979/perspectives.html" target="_blank">http://nobelprize.org/nobel_prizes/medicine/laureates/1979/perspectives.html</a><br /> <br /> <sup>5</sup> Source: Xiaochuan Pan, Emil Y Sidky and Michael Vannier, 2009 <a href="http://iopscience.iop.org/0266-5611/25/12/123009" target="_blank">http://iopscience.iop.org/0266-5611/25/12/123009</a><br /> <br /> <sup>8</sup> Intel® Hyper-Threading Technology (Intel® HT Technology) requires a computer system with an Intel® processor supporting Intel HT Technology and an Intel HT Technology enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See <a href="http://www.intel.com/products/ht/hyperthreading_more.htm" target="_blank">www.intel.com/products/ht/hyperthreading_more.htm</a> for more information including details on which processors support Intel HT Technology.<br /> <br /> <sup>9</sup> Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors.  Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.<br /> <br /> <sup>10</sup> For more information go to <a href="http://www.intel.com/performance" target="_blank">http://www.intel.com/performance</a>.<br /><br /> <sup>11</sup> For an application example, see <a href="http://software.intel.com/en-us/articles/intel-hyper-threading-technology-analysis-of-the-ht-effects-on-a-server-transactional-workload">http://software.intel.com/en-us/articles/intel-hyper-threading-technology-analysis-of-the-ht-effects-on-a-server-transactional-workload</a><br /><br />
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">01-25-2012</div>
<div id="vc-meta-modificationdate">01-25-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/41109</div>
<div id="vc-meta-abstract">Veo, GE Healthcare's new CT Scanner reconstruction technology, provides high resolution CT images allowing radiologists to maximize diagnostic accuracy at an optimized low dose to the patient. GE's advanced algorithms and Intel® Xeon® high performance processor technology reduce noise, decrease reconstruction time, lower radiation dose and produce higher quality CT images.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/new-levels-of-ct-image-performance-and-new-levels-in-radiation-dose-management/</link>
      <pubDate>Wed, 25 Jan 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/new-levels-of-ct-image-performance-and-new-levels-in-radiation-dose-management/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/new-levels-of-ct-image-performance-and-new-levels-in-radiation-dose-management/</guid>
      <category>Visual Computing</category>
      <category>Visual Computing Source</category>
    </item>
    <item>
      <title>Intel® Graphics Performance Analyzers Instrumentation Walkthrough</title>
      <description><![CDATA[ <b>NOTE: This article was written using Intel GPA 4.3; though not the latest version of the product, many of the techniques outlined here are useful with recent versions of GPA. To download the latest release, see the <a href="http://www.intel.com/software/gpa/">Intel GPA Home Page</a>.</b><br /><br /> 
<hr />
<br /><br />
<h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/40463">Intel® Graphics Performance Analyzers Instrumentation Walkthrough</a> [PDF 1MB]<br /> <br />
<h2 class="sectionHeading">Intel® GPA Platform Analyzer Overview</h2>
Intel® Graphics Performance Analyzers (Intel® GPA) Platform Analyzer is an instrumentation-based tool.  The fundamental data element of the instrumentation API is a task.  A task is a logical group of work on a specific thread.  A task may correspond to code in functions, scope blocks, case blocks in switch statements, or any significant piece of code as determined by the developer.  The instrumentation API provides functionality to describe various constructs such as dependencies between tasks.  Instrumented tasks are displayed in a timeline view by Intel GPA Platform Analyzer.  Besides your defined tasks, you’ll see other information displayed on the timeline.  Intel® graphics drivers, the DirectX* interceptor used by Intel GPA, and other Intel libraries like the Intel® Media SDK come pre-instrumented and will display relevant information on Intel GPA Platform Analyzer.  Even if you don’t add any instrumentation to your code, you will at the very least see pre-instrumented libraries and/or graphics driver information.  By default, you will be able to see the amount of time and the order in which frames are processed on the CPU and the GPU.  This is helpful when determining if the application is CPU or GPU bound.<br /> <br />
<p ><img src="http://software.intel.com/file/40455" /></p>
<div ><b>Figure 1. </b><i>Intel® GPA Platform Analyzer user interface</i><br /></div>
<br /> Intel GPA Platform Analyzer is made up of several panels, which are numbered in the figure above.  Panel 1 is the timeline view that displays threads and tasks per thread.  Panels 2 and 3 show metrics related to selected tasks in the timeline.  Panel 4 shows hardware tracks for each of the processor cores. Panel 4 is not displayed by default but can be enabled from the Profiles window of the Intel GPA Monitor as described in Appendix A : Intel® GPA Monitor. Refer to the Intel GPA help file for further details about the user interface.<br /> <br /> The workflow for using Intel GPA Platform Analyzer is simple.  Identify sections of code whose execution time and order are better understood visually. Good candidates for instrumentation are AI simulation, rigid body physics, collision detection, rendering, or any other system that requires detailed understanding.  Add instrumentation to the identified sections as described in the rest of this article.  Run the Intel GPA Monitor and the instrumented application.  From the Intel® GPA System Analyzer HUD overlaid on screen, capture a trace for Intel GPA Platform Analyzer.  Open the captured trace in Intel GPA Platform Analyzer and begin analysis.  Instrumentation can be used to visually confirm the execution order and distribution of tasks, to profile an application, and even to describe complex relationships between systems.  Unlike tools such as Intel® GPA Frame Analyzer that focus on specific frames, Intel GPA Platform Analyzer casts a much wider net and shows not only the GPU activity but also what is happening on the CPU.  As the name suggests, Intel GPA Platform Analyzer shows a broader picture of the platform.<br /><br /> <br />
<h2 class="sectionHeading">Leading Game Middleware Instrumented for Intel® GPA Platform Analyzer</h2>
Most games in the industry use middleware solutions for various subsystems such as user interface, physics, vegetation, and many others.  Because middleware companies generally specialize in solving a particular problem, they are able to provide high-quality solutions.  Game developers want to release a game with many subsystems that work together to create an engaging and fun experience for the players.  Game middleware can be thought of as a black box that takes some input, does some work with the input, and returns results that can be used in a game.  Of course, nothing is free, and even highly optimized middleware products can end up being the bottleneck, or at the very least taking up more CPU time than desired.  In some cases, profiling the game can help the developer understand where the issues holding up the middleware might be.<br /> <br /> We’ve worked with some of the leading middleware companies in the game industry to lift the veil and show developers exactly what the middleware is doing behind the scenes.  The following middleware companies are a few that have instrumented their latest products for Intel GPA Platform Analyzer:<br /> <br />
<p ><img src="http://software.intel.com/file/40456" /></p>
<div ><b>Figure 2. </b><i>Instrumented game middleware</i><br /></div>
<br /> If you are using any of these middleware solutions, you’ll see details about the work each is doing when you capture a trace for Intel GPA Platform Analyzer, as shown in Figures 3-5 below.  The instrumentation is probably not enabled on the “Release” version of these products, but is likely associated with a “Development” or “Profile” build. Refer to each product’s help or support for further information about enabling instrumentation for Intel GPA Platform Analyzer.<br /> <br /> The benefits of knowing and being able to see what these middleware products are doing range from being able to confirm whether middleware is the bottleneck, to being able to see what effect different input has on the middleware without having to add any instrumentation to your own code. The following screen captures show three examples of middleware products’ instrumentation.  These traces are from tech samples created by each of the middleware providers to show their own product features.<br /> <br />
<p ><img src="http://software.intel.com/file/40457" /></p>
<div ><b>Figure 3. </b><i>Autodesk Scaleform GFx* 4</i><br /><br /></div>
<br />
<p ><img src="http://software.intel.com/file/40458" /></p>
<div ><b>Figure 4. </b><i>Unity Web Player*</i><br /><br /></div>
<br />
<p ><img src="http://software.intel.com/file/40459" /></p>
<div ><b>Figure 5. </b><i>Geomerics Enlighten*</i><br /><br /></div>
<br />
<h2 class="sectionHeading">Simple and flexible instrumentation API</h2>
With a better idea of how Intel GPA Platform Analyzer displays an instrumented project, it is time to instrument some code.  Intel GPA Platform Analyzer comes with 32-bit and 64-bit libs found in <code>&lt;Intel GPA install dir&gt;\sdk\libs</code>.  Use the appropriate lib according to the project’s build requirements.  To begin adding instrumentation, include the ittnotify.h header in the files that will have instrumented code, and wrap the code in <code>__itt_task_begin/__itt_task_end</code> calls.<br /> <br />
<div class="dp-highlighter">
<ol start="1" class="dp-cpp">
<li class="alt"><span><span class="comment">// include the header in the file that will be instrumented</span><span> </span></span></li>
<li><span><span class="preprocessor"><b>#include &lt;ittnotify.h&gt;</b></span><b><span> </span></b></span></li>
<li class="alt"><b><span><span class="keyword">static</span><span> __itt_domain* g_pDomain = __itt_domain_create( </span><span class="string">"Domain.Name"</span></span></b><span> ); </span></li>
<li><span> </span></li>
<li class="alt"><span><span class="keyword">void</span><span> System::DoWork( </span><span class="keyword">void</span><span> ) </span></span></li>
<li><span>{ </span></li>
<li class="alt"><b><span> __itt_string_handle* szStringHandle = __itt_string_handle_create(<span class="string">"System::DoWork"</span><span>); </span></span></b></li>
<li><b><span> __itt_task_begin( g_pDomain, __itt_null, __itt_null, szStringHandle ); </span></b></li>
<li class="alt"><span> <span class="comment">// do work</span><span> </span></span></li>
<li><span><b> __itt_task_end( g_pDomain ); </b></span></li>
<li class="alt"><span>} </span></li>
</ol></div>
Calls to <code>__itt_task_begin/__itt_task_end</code> can be nested if necessary to create a hierarchical construct.  Later sections discuss three organization techniques available in the instrumentation API (task groups, markers, and relations) that help organize instrumented code.<br /><br /> <br />
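The example above creates its string handle every time <code>DoWork</code> runs. One common refinement is to create the handle once per call site with a function-local static and reuse it afterwards. The following is only a sketch of that idea: <code>StringHandle</code> and <code>CreateHandle</code> are hypothetical stand-ins for <code>__itt_string_handle</code> and <code>__itt_string_handle_create</code>, so the snippet compiles without the ITT header, and the call counter exists only to make the effect visible.<br /> <br />

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical stand-in for __itt_string_handle (the real type comes from ittnotify.h).
struct StringHandle {
    std::string name;
};

// Counts handle creations so the effect of caching can be observed.
static int g_handleCreateCalls = 0;

// Hypothetical stand-in for __itt_string_handle_create.
// A deque keeps existing elements at stable addresses when new ones are appended.
StringHandle* CreateHandle(const char* name) {
    static std::deque<StringHandle> s_registry;
    ++g_handleCreateCalls;
    s_registry.push_back(StringHandle{name});
    return &s_registry.back();
}

void DoWork() {
    // Initialized the first time control reaches this line, then reused on later calls.
    static StringHandle* const s_handle = CreateHandle("System::DoWork");
    (void)s_handle;  // an instrumented build would pass the handle to __itt_task_begin here
    // do work
}
```

Calling <code>DoWork</code> repeatedly in this sketch creates the handle only once, which keeps the handle lookup out of code that runs every frame.<br /> <br />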
<h2 class="sectionHeading">Create or Integrate into Instrumentation System</h2>
Most game engines have an instrumentation/profiling system.  Intel® Instrumentation and Tracing Technology (Intel® ITT) can integrate well into instrumentation systems.  If your game engine does not have an instrumentation system, you can build it using the Intel ITT instrumentation API described above.  Most instrumentation systems use macros to mark code sections for profiling.  All the examples above call the Intel ITT functions directly.  You can abstract the calls by either creating your own macros or integrating them into your existing instrumentation system.  Most instrumentation systems work in one of two ways:<br /> <br /> <ol>
<li>Begin/End markers: Delimit specific code sections with begin/end calls.</li>
<li>Scoped objects: Macros called at the beginning of the profiled code section create an instrumentation object that goes out of scope at the end of the code section.</li>
</ol> The Intel ITT instrumentation API lends itself out of the box to the Begin/End scenario.  As shown in the examples above, <code>__itt_task_begin</code> and <code>__itt_task_end</code> are the two major API calls.  The second scenario can be accomplished by creating a class that calls <code>__itt_task_begin</code> in the constructor and <code>__itt_task_end</code> in the destructor, creating an instrumentation system that supports both scenarios.  For example, here’s some code that should accomplish the feat:<br /> <br />
<pre name="code" class="cpp">#include &lt;ittnotify.h&gt;

class GPAScopedTask
{
  public:
  GPAScopedTask( __itt_domain* pDomain, const char* szTaskName )
  : m_pDomain( pDomain )
  { 
    __itt_string_handle* pTaskName = __itt_string_handle_createA( szTaskName );
    __itt_task_begin( m_pDomain, __itt_null, __itt_null, pTaskName ); 
  }
  ~GPAScopedTask( void ) 
  { 
    __itt_task_end( m_pDomain ); 
  }
  
  private:
  __itt_domain* m_pDomain;
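};

// Scoped form: TaskName is a C string; the task ends when the enclosing scope exits.
#define GPA_TASK( Domain, TaskName ) \
  GPAScopedTask _gpa_scoped_task_( Domain, TaskName )
// Begin/End form: TaskName is an __itt_string_handle*.
#define GPA_TASK_BEGIN( Domain, TaskName ) \
  __itt_task_begin( Domain, __itt_null, __itt_null, TaskName )
#define GPA_TASK_END( Domain ) __itt_task_end( Domain )</pre>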
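One reason the scoped-object approach is popular is that the end call cannot be forgotten: the destructor runs on every exit path, including early returns, and nested scopes close in reverse order. The sketch below illustrates that behavior under stated assumptions. <code>ScopedTask</code>, <code>g_events</code>, and <code>Update</code> are hypothetical names, and an in-memory event log stands in for the ITT calls so the snippet runs without the ITT header.<br /> <br />

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log used in place of the ITT runtime.
static std::vector<std::string> g_events;

class ScopedTask {
public:
    explicit ScopedTask(const char* name) : m_name(name) {
        g_events.push_back("begin " + m_name);  // where __itt_task_begin would be called
    }
    ~ScopedTask() {
        g_events.push_back("end " + m_name);    // where __itt_task_end would be called
    }
    // Copying would produce a second, unmatched end event.
    ScopedTask(const ScopedTask&) = delete;
    ScopedTask& operator=(const ScopedTask&) = delete;

private:
    std::string m_name;
};

bool Update(bool skipPhysics) {
    ScopedTask frame("Update");
    if (skipPhysics) {
        return false;  // the early return still closes "Update"
    }
    {
        ScopedTask physics("Physics");
        // do work
    }  // "Physics" ends here, before "Update"
    return true;
}
```

With begin/end calls written by hand, the early return in <code>Update</code> would need its own end call; the scoped object removes that bookkeeping.<br /> <br />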
<h2 class="sectionHeading">Organize Your Instrumentation</h2>
Using the calls and concepts presented in this document, you can start instrumenting at will.  The last thing we'll cover is a set of suggestions for organizing your instrumentation.  As you get going with instrumentation, you will quickly realize that you need an organization scheme in order to understand what your code is doing.  A consistent naming scheme for your tasks is a good place to start, and the instrumentation API provides some useful organization schemes of its own.  Using just the concepts we've already covered, you can organize your instrumented code well enough not only to make sense of it, but also to begin making assessments.  If your product is a middleware solution, using the following organization schemes will make the lives of your licensees much easier.<br /> <br /> A sane task naming scheme will keep your task hierarchy understandable and decipherable.  An easy way to keep your task naming scheme sane is to either use the function name or use <code>__FUNCTION__</code>.  This is easy to implement, and will not only give you names that you're familiar with, but will also help you identify code sections that might be a bottleneck.<br /> <br /> Once you have a sane naming scheme, the next step is to associate your instrumentation with an appropriate domain.  The domain is one of the parameters passed into <code>__itt_task_begin</code>.  With domains, you are able to control which tasks are saved into the trace from the Profiles window in the Intel GPA Monitor.  For this reason, name your domain something that correctly describes the associated tasks.  In the examples above, "Domain.Name" was used, but yours should be clearer.  For example, if you're instrumenting a middleware solution, use "CompanyName.ProductName" as the domain name.  You can also have multiple domains active and associate tasks appropriately.  
A list of all domains will appear in the Profiles window.<br /> <br /> After you've associated tasks with the appropriate domains, you will be ready to begin making assessments about where time is spent.  The concept of task groups is useful for organizing tasks, as well as for understanding performance on a subsystem level.  Create a task group per subsystem and associate tasks from that subsystem with the task group.  Creating the task group per frame will give you an idea of where time is spent per frame, as well as a visual representation of dips and spikes.  If you are instrumenting middleware, create a task group per frame that collects the middleware's tasks.  This will not only help licensees understand how much time your solution takes per frame, but will also help results make more sense when several middleware solutions exist in a single game.  Follow your domain naming scheme (CompanyName.ProductName) to name your task group.<br /> <br /> Now you should be ready not only to begin adding instrumentation to your game, middleware solution, or whatever code you wish, but to add it in a way that will help you take advantage of the visual representation.<br /><br /> <br />
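The naming advice above can be folded into the instrumentation macros themselves. The sketch below is one possible way to do that, not part of the Intel GPA SDK: <code>TRACE_FUNCTION</code> names a scope after the enclosing function using the standard <code>__func__</code> identifier (a close relative of <code>__FUNCTION__</code>), and both macros append <code>__LINE__</code> to the variable name so that two scopes can be declared in the same block. <code>NamedScope</code>, <code>g_taskNames</code>, and the macro names are hypothetical, and a simple name log stands in for the ITT calls.<br /> <br />

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical log of task names, used in place of the ITT runtime.
static std::vector<std::string> g_taskNames;

// Hypothetical scoped object; an instrumented build would begin a task in the
// constructor and end it in a destructor.
struct NamedScope {
    explicit NamedScope(const char* name) { g_taskNames.push_back(name); }
};

// Two-step concatenation so that __LINE__ is expanded before it is pasted.
#define TRACE_CONCAT_INNER(a, b) a##b
#define TRACE_CONCAT(a, b) TRACE_CONCAT_INNER(a, b)

// Names the scope after the enclosing function.
#define TRACE_FUNCTION() NamedScope TRACE_CONCAT(traceScope_, __LINE__)(__func__)
// Names the scope explicitly, for sections smaller than a function.
#define TRACE_SCOPE(name) NamedScope TRACE_CONCAT(traceScope_, __LINE__)(name)

void UpdatePhysics() {
    TRACE_FUNCTION();
    TRACE_SCOPE("UpdatePhysics.Broadphase");  // same block, different line: no name clash
    // do work
}
```

A macro that always declares a variable with the same name can be used only once per block; deriving the variable name from the line number avoids that limitation as long as each use sits on its own line.<br /> <br />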
<h2 class="sectionHeading">The rest (and then some) of the instrumentation API</h2>
In its latest release, Intel GPA supports the Intel® Instrumentation and Tracing Technology (Intel® ITT) API, an instrumentation API shared with other Intel® tools.  Intel ITT provides several constructs for organizing code instrumentation.  This section will describe the use of three of these constructs: task groups, markers, and relations.  Task groups can be useful to describe collections of tasks that may all serve a similar purpose, like AI.  A task group could encompass all tasks over several threads that involve AI.  Markers represent events in the execution time.  Markers can be used to signal specific events such as calling D3DDevice::Present.  Relations can be used to describe complex interactions between tasks, such as dependencies between tasks even across multiple threads.<br /> <br /> Task groups are useful to define logical groups of work.  For example, AI tasks may be executed on several threads, and thus not easily contained within a task or nested task hierarchy.  With task groups, the execution time of AI tasks can be aggregated and easily accessed from Intel GPA Platform Analyzer.  Task groups are created with <code>__itt_task_group</code>, and then tasks can be added to the task group with a call to <code>__itt_relation_add_to_current</code> or <code>__itt_relation_add</code>.  Read the Intel GPA SDK Reference to understand the subtle difference between these two functions.<br /> <br />
<div class="dp-highlighter">
<ol start="1" class="dp-cpp">
<li class="alt"><span><span class="comment">// include the header in the file that will be instrumented</span><span> </span></span></li>
<li><span><span class="preprocessor">#include &lt;ittnotify.h&gt;</span><span> </span></span></li>
<li class="alt"><b><span><span class="keyword">static</span><span> __itt_domain* g_pDomain = __itt_domain_create( </span><span class="string">"Domain.Name"</span><span> ); </span></span></b></li>
<li><b><span><span class="keyword">static</span><span> __itt_string_handle* g_pStringHandle = __itt_string_handle_create(</span><span class="string">"TaskGroupName"</span><span></span></span></b>); </li>
<li class="alt"><span><span class="keyword">void</span><span> System::DoWork( </span><span class="keyword">void</span><span> ) </span></span></li>
<li><span>{ </span></li>
<li class="alt"><span> <span class="comment">// Init task group</span><span> </span></span></li>
<li><b><span> m_nTaskGroupID = __itt_id_make( g_pStringHandle, 1 ); </span></b></li>
<li class="alt"><b><span> __itt_id_create( g_pDomain, m_nTaskGroupID );  <span class="comment">// should be destroyed in destructor</span><span> </span></span></b></li>
<li><b><span> __itt_task_group( g_pDomain, m_nTaskGroupID, __itt_null, g_pStringHandle ); </span></b></li>
<li class="alt"><span> __itt_string_handle* szStringHandle = __itt_string_handle_create(<span class="string">"System::DoWork"</span><span>); </span></span></li>
<li><b><span> __itt_task_begin( g_pDomain, __itt_null, __itt_null, szStringHandle ); </span></b></li>
<li class="alt"><b><span> __itt_relation_add_to_current( g_pDomain, __itt_relation_is_child_of, m_nTaskGroupID ); </span></b></li>
<li><span> <span class="comment">// do work</span><span> </span></span></li>
<li class="alt"><span> __itt_task_end( g_pDomain ); </span></li>
<li><span>} </span></li>
</ol></div>
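To make the idea of aggregation concrete, the sketch below sums task time per group regardless of which thread ran each task. It only illustrates the kind of per-group total that grouping makes possible; it does not describe how Intel GPA Platform Analyzer computes or stores its data. <code>TaskRecord</code> and <code>TotalTimePerGroup</code> are hypothetical names.<br /> <br />

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One finished task as a trace might record it (hypothetical layout).
struct TaskRecord {
    std::string group;    // task group the task was added to, for example "AI"
    int threadId;         // thread that executed the task
    double milliseconds;  // time spent in the task
};

// Sums task time per group, ignoring which thread ran each task.
std::map<std::string, double> TotalTimePerGroup(const std::vector<TaskRecord>& tasks) {
    std::map<std::string, double> totals;
    for (const TaskRecord& task : tasks) {
        totals[task.group] += task.milliseconds;  // missing keys start at 0.0
    }
    return totals;
}
```

For example, two "AI" tasks of 1.5 ms and 0.5 ms that ran on different threads contribute a single 2.0 ms total to the "AI" group.<br /> <br />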
Markers, as the name suggests, help to point to when discrete events occur in the execution time.  For example, the end of a frame is usually defined by a call to D3DDevice::Present.  Adding a marker when Present is called adds a visual representation for the end of the frame in the timeline view of Intel GPA Platform Analyzer.  Markers can be created with several scopes:  global, process, thread, and task.  The Intel GPA SDK Reference provides more details for choosing the appropriate scope.  Adding a marker is as simple as calling <code>__itt_marker</code>.<br /> <br />
<div class="dp-highlighter">
<ol start="1" class="dp-cpp">
<li class="alt"><span><span class="comment">// include the header in the file that will be instrumented</span><span> </span></span></li>
<li><span><span class="preprocessor">#include &lt;ittnotify.h&gt;</span><span> </span></span></li>
<li class="alt"><span><span class="keyword">static</span><span> __itt_domain* g_pDomain = __itt_domain_create( </span><span class="string">"Domain.Name"</span><span> ); </span></span></li>
<li><span><span class="keyword">void</span><span> System::DoWork( </span><span class="keyword">void</span><span> ) </span></span></li>
<li class="alt"><span>{ </span></li>
<li><span> __itt_string_handle* szStringHandle = __itt_string_handle_create(<span class="string">"System::DoWork"</span><span>); </span></span></li>
<li class="alt"><span> __itt_task_begin( g_pDomain, __itt_null, __itt_null, szStringHandle ); </span></li>
<li><span> <span class="comment">// do work</span><span> </span></span></li>
<li class="alt"><span> <span class="comment">// end of frame...call Present</span><span> </span></span></li>
<li><span> __itt_string_handle* szEndFrameMarker = __itt_string_handle_create(<span class="string">"EndFrameMarker"</span><span>); </span></span></li>
<li class="alt"><b><span> __itt_marker( g_pDomain, __itt_null, szEndFrameMarker, __itt_marker_scope_task ); </span></b></li>
<li><span> D3DDevice::Present(); </span></li>
<li class="alt"><span> __itt_task_end( g_pDomain ); </span></li>
<li><span>} </span></li>
</ol></div>
As described earlier, tasks are a logical group of work on a specific thread.  Tasks are associated with a specific section of code that takes some amount of time to execute.  When instrumenting code, it might be important to describe more symbolic relations between tasks beyond log levels and categories.  The Intel ITT API provides functionality to describe other semantic relations between tasks such as:<br /> <br /> 
<ul>
<li><code>__itt_relation_is_dependent_on</code></li>
<li><code>__itt_relation_is_sibling_of</code></li>
<li><code>__itt_relation_is_parent_of</code></li>
<li><code>__itt_relation_is_continuation_of</code></li>
<li><code>__itt_relation_is_child_of</code></li>
<li><code>__itt_relation_is_continued_by</code></li>
<li><code>__itt_relation_is_predecessor_to</code></li>
</ul>
With these relations, instrumenting a task scheduling system, for example, can fully describe the distribution of work on different threads and the dependencies between tasks.<br /> <br /> As demonstrated above, instrumenting code is not difficult with the Intel GPA Platform Analyzer instrumentation API, and the benefits of understanding how the code is behaving are vast.  Being able to understand at a high level what the low-level code is doing is important when targeting heterogeneous platforms.  Visualizing work distribution and execution order is important when work can be done on multiple CPU threads, and will become more important when work is distributed amongst various computing devices.<br /><br /> <br />
<h2 class="sectionHeading">Appendix A : Intel® GPA Monitor</h2>
This appendix provides more detail about the Intel® GPA Monitor; in particular, some of the features that are relevant to Intel GPA Platform Analyzer traces. This appendix is by no means an extensive description of all the features of GPA Monitor. Refer to the Intel GPA help file, which is the definitive guide for all Intel GPA features. For the purpose of this document, this appendix covers three features of Intel GPA Monitor:<br /> <br /> <ol>
<li>Running applications from Intel GPA Monitor</li>
<li>Enabling Hardware Context Data</li>
<li>Viewing and enabling domains</li>
</ol>
<p> </p>
<b>Running applications from Intel® GPA Monitor</b><br /> In order for Intel® GPA to attach to your instrumented application and provide the functionality to capture traces and frames and execute state overrides, you must run your application from the Intel GPA Monitor, as shown in Figure 6 below.<br /> <br />
<p ><img src="http://software.intel.com/file/40460" /></p>
<div ><b>Figure 6. </b><i>Running applications from Intel® GPA Monitor</i><br /></div>
<br /> To run an application, enter the path to the executable, the command line parameters, and the working folder, and then click the Run button. The “Add/Edit Profiles…” button opens the Profiles window, which is described in more detail in the following sections.<br /><br /> <br /> <b>Enabling Hardware Context Data</b><br /> Panel 4 in Figure 1 shows the Hardware Context Data of a trace in Intel GPA Platform Analyzer. This panel is not enabled by default, but it is easily enabled from the Profiles window of the Intel GPA Monitor, as shown in Figure 7 below.<br /> <br />
<p ><img src="http://software.intel.com/file/40461" /></p>
<div ><b>Figure 7. </b><i>Enabling hardware context data in the Profiles window</i><br /></div>
<br /> <b>Viewing and enabling domains</b><br /> Throughout the document, the concept of domains is described and used. The full list of domains associated with an instrumented application can be viewed in the Domains tab of the Profiles window, as shown in Figure 8 below. The application must first be run from the Analyze Application window, as shown in Figure 6 above. While the application is running, open the Domains tab in the Profiles window to view and enable the domains you want to include in the Intel GPA Platform Analyzer trace.<br /> <br />
<p ><img src="http://software.intel.com/file/40462" /></p>
<div ><b>Figure 8. </b><i>Domains tab in the Profiles window<br /></i></div>
<br /> <span >* All screenshots in this document were captured using Intel® Graphics Performance Analyzers 4.3<br /><br /><br /></span>
<h2 class="sectionHeading">About the Author</h2>
Omar A. Rodriguez is a software engineer in the Intel Software and Services Group, where he supports Intel graphics solutions in the Visual Computing Software Division. He holds a B.S. in Computer Science from Arizona State University. Omar is not the lead guitarist for the Mars Volta.<br /><br />
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">01-13-2012</div>
<div id="vc-meta-modificationdate">01-13-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div class="gpa">Intel® GPA</div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/40457</div>
<div id="vc-meta-abstract">Intel® GPA Platform Analyzer is an instrumentation-based tool. The Intel® GPA instrumentation API provides functionality to describe various constructs such as dependencies between tasks. This article describes how to instrument your code to use this tool.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-graphics-performance-analyzers-instrumentation-walkthrough/</link>
      <pubDate>Tue, 10 Jan 2012 23:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-graphics-performance-analyzers-instrumentation-walkthrough/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-graphics-performance-analyzers-instrumentation-walkthrough/</guid>
      <category>Visual Computing</category>
      <category>Intel® Graphics Performance Analyzers Knowledge Base</category>
      <category>Intel® Graphics Performance Analyzers (Intel® GPA)</category>
      <category>Visual Computing Source</category>
    </item>
  </channel></rss>
