<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sun, 08 Nov 2009 02:32:41 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/multi-core/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles feed</title>
    <link>http://software.intel.com/en-us/articles/multi-core/technical-article//all</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Visualize This! on Intel Software Network TV</title>
      <description><![CDATA[ <table border="0" width="100" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top">
<div id="left_container">
<div id="header_content"><a href="http://software.intel.com/en-us/visual-computing/" title="Visual Computing Developer Community"><img border="0" width="727" src="http://software.intel.com/file/20493" height="96" /></a></div>
<div id="left_content_container2"><!-- START left content -->
<div id="showcase_01">
<p>
<object height="341" width="700" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000">
<param name="src" value="http://blip.tv/play/hK0ki9Rqldp%2B%2Em4v" />
<param name="allowfullscreen" value="true" /><embed allowfullscreen="true" src="http://blip.tv/play/hK0ki9Rqldp%2B%2Em4v" type="application/x-shockwave-flash" height="341" width="700"></embed>
</object>
</p>
<p style="font-size:12px"><a href="http://www.intel.com/software/arti"><img border="0" align="left" src="http://software.intel.com/file/21976" alt="Arti Gupta" style="padding-right:10px;" /></a> <br />Watch Visualize This! on alternate Tuesdays at noon PST. <a href="http://software.intel.com/en-us/profile/334096/">Arti Gupta</a>, your community manager, talks with Intel experts and external luminaries about visual computing trends and Intel tools and technologies.</p>
<br />
<p style="font-size:12px"><b><br /><br />Show Schedule</b></p>
<table border="0" width="100%">
<tbody>
<tr>
<td width="110">
<div align="center"><img width="90" src="http://software.intel.com/file/23508" height="93" /></div>
</td>
<td valign="middle">Steve Pitzel, community manager for the Artist/Animator area of the Visual Computing community, will speak with Son Kim, winner of the first “user created content” contest at Project Offset. His creation, the Bug-Back toad, can be found <a href="http://www.projectoffset.com/forums/viewtopic.php?f=44&amp;t=905 ">here</a>.<br /></td>
<td valign="middle">11/10/2009</td>
</tr>
<tr>
<td width="110">
<div align="center"><img src="http://software.intel.com/file/23082" /></div>
</td>
<td valign="middle">Dr. Peter E. Raad, Professor and Executive Director of the Guildhall at SMU, will talk about game development in academia</td>
<td valign="middle">12/1/2009</td>
</tr>
<tr>
<td width="110">
<div align="center"><img src="http://software.intel.com/file/23291" /></div>
</td>
<td valign="middle">Lakshmi Narasimhan, Senior Application Engineer at Intel, will speak with Arti about the tips and challenges of cross-platform game development</td>
<td valign="middle">12/15/2009</td>
</tr>
<tr>
<td width="110"><br /><br /></td>
<td valign="middle">Holiday</td>
<td valign="middle">12/29/2009</td>
</tr>
<tr>
<td width="110"><br /><img width="74" src="http://software.intel.com/file/23082" height="90" /></td>
<td valign="middle">Topic tbd</td>
<td valign="middle">1/12/2010</td>
</tr>
<tr>
<td width="110">
<div align="center"><img src="http://software.intel.com/file/23248" /></div>
</td>
<td valign="middle">Dr. Michael Gourlay of the University of Central Florida will talk about fluid simulation</td>
<td valign="middle">1/29/2010</td>
</tr>
<tr>
<td width="110">
<div align="center"><img src="http://software.intel.com/file/23249" /></div>
</td>
<td valign="middle">Professor DJ Kehoe will talk about Artificial Intelligence engines</td>
<td valign="middle">2/9/2010</td>
</tr>
</tbody>
</table>
<p> </p>
<p><b>Past Episodes<br /></b></p>
<table border="0" width="100%">
<tbody>
<tr>
<td width="110">
<div align="center"><img src="http://software.intel.com/file/23083" /></div>
</td>
<td valign="middle">Hansoft CEO Patric Palm on why agile development tools are needed in game development</td>
<td valign="middle">11/3/2009</td>
</tr>
<tr>
<td width="110">
<div align="center"><img src="http://software.intel.com/file/23193" /></div>
</td>
<td valign="middle">Drew Sikora, Executive producer at GameDev.net, spoke about <a href="http://software.intel.com/en-us/blogs/2009/10/26/visualize-this-gamedevnet-trends-and-challenges-in-game-development/">Game Development trends and challenges</a></td>
<td width="100" valign="middle">10/20/2009</td>
</tr>
<tr>
<td width="110">
<div align="center"><img width="90" src="http://software.intel.com/file/23084" height="96" /></div>
</td>
<td valign="middle">Arti spoke with Intel researcher Robert Adams on <a href="http://software.intel.com/en-us/blogs/2009/10/07/visualize-this-building-the-future-virtual-worlds/">Building the future Virtual worlds<br /></a></td>
<td valign="middle">10/6/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23057" /></div>
</td>
<td valign="middle">From Intel Developer Forum in San Francisco: Arti spoke with Paul Lindberg, Intel engineer, on <a href="http://software.intel.com/en-us/blogs/2009/09/23/visualize-this-using-intel-parallel-studios-in-game-development/">How Parallel Studio can become an indispensable tool for game development.</a></td>
<td width="100" valign="middle">9/23/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23060" /></div>
</td>
<td valign="middle">Arti spoke with Intel Application Engineer Charles Congdon on the work he and his team have done with Dreamworks Animation.<br /><a href="http://software.intel.com/en-us/blogs/2009/09/15/visualize-this-dreamworks-and-intel-optimization-process/">Intel and Dreamworks optimization and re-architecture work</a>.</td>
<td valign="middle">9/15/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23064" /></div>
</td>
<td valign="middle">Intel's Kath Knobe and Ganesh Rao spoke about <a href="http://software.intel.com/en-us/blogs/2009/08/28/visualize-this-concurrent-collections-for-cc/" title="http://software.intel.com/en-us/blogs/2009/08/28/visualize-this-concurrent-collections-for-cc/">Concurrent Collections for C/C++</a></td>
<td valign="middle">8/28/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23070" /></div>
</td>
<td valign="middle">Arti spoke with Evelyn Watts, Field Services Manager at Corel, about Movie Factory 7 and their use of the Intel Media SDK. <br /><a href="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-corel-movie-factory-7-and-intel-media-sdk/" title="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-corel-movie-factory-7-and-intel-media-sdk/">Visualize this! Live from Siggraph - Corel Movie Factory 7 and Intel Media SDK</a></td>
<td valign="middle">8/17/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23071" /></div>
</td>
<td valign="middle">John Civatte, Director of Sales at BOXX Technologies, spoke about the 4850 workstation, the 10300 render farm, and Intel processors<br /><a href="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-boxx-products-and-their-use-of-intel-processors/" title="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-boxx-products-and-their-use-of-intel-processors/">Visualize this! Live from Siggraph - BOXX products and their use of Intel processors</a></td>
<td valign="middle">8/17/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23072" /></div>
</td>
<td valign="middle">Arti spoke with Carl Jacobson, VP of Marketing at Cakewalk, about Sonar 8 and their use of threading with Intel processors and tools<br /><a href="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-a-talk-with-cakewalk/" title="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-a-talk-with-cakewalk/">Visualize this! Live from Siggraph - A talk with Cakewalk</a></td>
<td valign="middle">8/17/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23192" /></div>
</td>
<td valign="middle">Arti spoke with Brad Peebler, Vice President at Luxology, about modo, Nexus, and how their use of threading, Intel processors, and threading tools has enabled them to meet their scale needs<br /><a href="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-luxology-and-its-use-of-multicore/" title="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-luxology-and-its-use-of-multicore/">Visualize this! Live from Siggraph - Luxology and its use of multicore</a></td>
<td valign="middle">8/17/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23073" /></div>
</td>
<td valign="middle">I spoke with Robert Hoffmann, Senior Product Marketing Manager at Autodesk, about Maya, 3ds Max, and Intel processors and tools<br /><a href="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-autodesk-maya-and-3ds-max/" title="http://software.intel.com/en-us/blogs/2009/08/17/visualize-this-live-from-siggraph-autodesk-maya-and-3ds-max/">Visualize this! Live from Siggraph - Autodesk Maya and 3ds Max</a></td>
<td valign="middle">8/17/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23311" /></div>
</td>
<td valign="middle">For today's show, Arti shared a new Intel product announced at Siggraph last week, the beta version of the Intel Media SDK. <br /><a href="http://software.intel.com/en-us/blogs/2009/08/14/visualize-this-intel-media-sdk-beta-launch/" title="http://software.intel.com/en-us/blogs/2009/08/14/visualize-this-intel-media-sdk-beta-launch/">Visualize this! Intel Media SDK beta launch</a></td>
<td valign="middle">8/14/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23074" /></div>
</td>
<td valign="middle">Our topic today is the Kaboom project and the new whitepaper on multi-threaded fluid simulation for games. Joining us today are Jeff Freeman and Quentin Froemke, software engineers in the Visual Computing Software Division at Intel.<br /><a href="http://software.intel.com/en-us/blogs/2009/07/28/visualize-this-05-the-kaboom-project/" title="http://software.intel.com/en-us/blogs/2009/07/28/visualize-this-05-the-kaboom-project/">Visualize this! 05 the Kaboom project</a></td>
<td valign="middle">7/28/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23075" /></div>
</td>
<td valign="middle">My guest for this show is Chris Cormack, product designer. Chris will talk to us about the release 2.1 of the Intel Graphics Performance Analyzer toolset<br /><a href="http://software.intel.com/en-us/blogs/2009/07/16/visualize-this-chris-cormack-gpa-product-designer-on-gpa-21/" title="http://software.intel.com/en-us/blogs/2009/07/16/visualize-this-chris-cormack-gpa-product-designer-on-gpa-21/">Visualize this! Chris Cormack GPA product designer on GPA 2.1</a></td>
<td valign="middle">7/16/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23076" /></div>
</td>
<td valign="middle">Our guest today is Chris Taylor, CEO of Gas Powered Games. Chris will talk to us about the making of Demigod<br /><a href="http://software.intel.com/en-us/blogs/2009/07/06/visualize-this-gpg-ceo-chris-taylor-on-demigod-and-its-use-of-gpa/" title="http://software.intel.com/en-us/blogs/2009/07/06/visualize-this-gpg-ceo-chris-taylor-on-demigod-and-its-use-of-gpa/">Visualize this! GPG CEO Chris Taylor on Demigod and its use of GPA</a></td>
<td valign="middle">7/6/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23077" /></div>
</td>
<td valign="middle">Scott Crabtree, Engineering Manager in the Visual Computing Software Division, shared the game demos developed by his team and discussed tips on combining parallel programming techniques with the power of multi-core processors for enhanced game performance.<br /><a href="http://software.intel.com/en-us/blogs/2009/06/19/visualize-this-game-demos-smoke-pet-me-destroy-the-castle-and-horsepower/" title="http://software.intel.com/en-us/blogs/2009/06/19/visualize-this-game-demos-smoke-pet-me-destroy-the-castle-and-horsepower/">Visualize this! Game Demos - Smoke, Pet Me, Destroy the Castle and Horsepower</a></td>
<td valign="middle">6/19/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23078" /></div>
</td>
<td valign="middle">Guest for this show was Steve Pitzel, Community manager for Visual Computing on the Intel Software Network. Steve and Arti talked about the new Artist / Animator resources area on the Visual computing community<br /><a href="http://software.intel.com/en-us/blogs/2009/06/12/visualize-this-artistanimator-resources/" title="http://software.intel.com/en-us/blogs/2009/06/12/visualize-this-artistanimator-resources/">Visualize This! Artist/Animator Resources</a></td>
<td valign="middle">6/12/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23079" /></div>
</td>
<td valign="middle">Joining me for this show is Steve Winburn – Senior Graphics Product Evangelist. Our topic - Intel’s recently announced Graphics Performance Analyzer toolset.<br /><a href="http://software.intel.com/en-us/blogs/2009/06/08/visualize-this-intel-graphics-performance-analyzer-toolset/" title="http://software.intel.com/en-us/blogs/2009/06/08/visualize-this-intel-graphics-performance-analyzer-toolset/">Intel Graphics Performance Analyzer toolset</a></td>
<td valign="middle">6/8/2009</td>
</tr>
<tr>
<td width="110" valign="top">
<div align="center"><img src="http://software.intel.com/file/23080" /></div>
</td>
<td valign="middle">
<p><a href="http://software.intel.com/en-us/blogs/2009/06/08/visualize-this-the-intel-visual-adrenaline-program-show/">Arti and Mandy Mock, Program Manager for Intel's Visual Adrenaline Program, talked about the Visual Adrenaline program. </a></p>
<p><a href="http://software.intel.com/en-us/blogs/2009/06/08/visualize-this-the-intel-visual-adrenaline-program-show/">The Intel Visual Adrenaline Program Show</a></p>
</td>
<td valign="middle">6/8/2009</td>
</tr>
</tbody>
</table>
<p> </p>
<br /></div>
</div>
</div>
</td>
<td valign="top" style="background-color: #E6E6E6;"><!-- RHC -->
<table border="0" width="100%" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td width="215" align="center">
<table border="0" align="center" width="223" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td height="4"><img width="232" src="http://software.intel.com/file/20516" height="4" /></td>
</tr>
<tr>
<td>
<table border="0" align="center" width="223" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="center" valign="top"><a href="http://www.intelsoftwaregraphics.com/?lid=5ceakfXf8Ho=&amp;siteid=cqMoF5H/37o="><img border="0" width="223" src="http://software.intel.com/file/20512" alt="Intel Visual Adrenaline" height="71" title="Intel Visual Adrenaline" /></a></td>
</tr>
<tr>
<td valign="top" style="background-image: url(http://software.intel.com/file/20513); background-repeat: repeat-x; height: 69px; background-color: #11436b;">
<table border="0" width="223" cellpadding="0" cellspacing="0" style="padding-top: 8px;">
<tbody>
<tr>
<td height="8" width="11"></td>
<td width="10" align="center"><img width="5" src="http://software.intel.com/file/20514" height="5" /></td>
<td align="left"><a href="http://software.intel.com/en-us/visual-computing/" style="color:#FFFFFF;" title="Intel Adrenaline Developer Community">Developer Community</a></td>
<td width="10"></td>
</tr>
<tr>
<td height="8"></td>
<td align="center"><img width="5" src="http://software.intel.com/file/20514" height="5" /></td>
<td align="left"><a href="http://www.intel.com/cd/software/partner/asmo-na/eng/index.htm" style="color:#FFFFFF;" title="Intel Adrenaline Software Partner Program">Intel® Software Partner Program</a></td>
<td></td>
</tr>
<tr>
<td height="8"></td>
<td align="center"><img width="5" src="http://software.intel.com/file/20514" height="5" /></td>
<td align="left"><a href="http://www.intel.com/Consumer/Game/index.htm" style="color:#FFFFFF;" title="Intel Adrenaline Game On">Game On</a></td>
<td></td>
</tr>
<tr>
<td height="8"></td>
<td align="center"><img width="5" src="http://software.intel.com/file/20514" height="5" /></td>
<td align="left"><a href="http://www.intelsoftwaregraphics.com/?lid=5ceakfXf8Ho=&amp;siteid=cqMoF5H/37o=" style="color:#FFFFFF;" title="Intel Adrenaline Showcase">Showcase</a></td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td height="7" valign="top"><img width="223" src="http://software.intel.com/file/20515" height="7" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td height="4" valign="top"><img width="6" src="http://software.intel.com/file/20494" height="6" /></td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
<div id="right_container3"><center><a href="http://software.intel.com/en-us/tv/"><img border="0" width="215" src="http://software.intel.com/file/20520" alt="Intel Visual Computing TV Show" height="172" title="Intel Visual Computing TV Show" /></a><br /><br /></center></div>
<div id="right_container3"><center><a href="http://software.intel.com/en-us/contests/thread-like-wildfire/contests.php"><img border="0" width="215" src="http://software.intel.com/file/21944" alt="Intel Thread Like Wildfire" height="172" title="Intel Thread Like Wildfire" /></a> </center></div>
<br /><center>
<table border="1" cellpadding="0" cellspacing="0" id="nav_table">
<tbody>
<tr>
<td>
<table border="0" width="190" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td width="9" class="right_container_hdr"></td>
<td class="right_container_hdr">
<h4>Related Links</h4>
</td>
<td class="right_container_hdr"></td>
</tr>
<tr>
<td height="4" colspan="3" valign="top"><img width="4" src="http://software.intel.com/file/20494" height="4" /></td>
</tr>
<tr>
<td height="15"></td>
<td valign="middle"><a href="http://www.intel.com/software/graphics" title="Intel Visual Computing Home">Visual Computing Home</a></td>
<td></td>
</tr>
<tr>
<td></td>
<td>
<h3>Intel<sup>®</sup> Technologies</h3>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top"><a href="http://software.intel.com/en-us/articles/integrated-graphics/" title="Intel Visual Computing Technologies Integrated Graphic">Integrated Graphic</a><br /><a href="http://software.intel.com/en-us/articles/larrabee/" title="Intel Visual Computing Technologies Larrabee">Larrabee</a><br /><a href="http://software.intel.com/en-us/articles/parallel-programming-vc/" title="Intel Visual Computing Technologies Parallel Programming">Parallel Programming</a></td>
<td></td>
</tr>
<tr>
<td height="4" colspan="3" valign="top"><img width="4" src="http://software.intel.com/file/20494" height="4" /></td>
</tr>
<tr>
<td></td>
<td>
<h3>Focus Areas</h3>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top"><a href="http://software.intel.com/en-us/articles/game-dev/" title="Intel Game Development Focus Area">Game Development</a><br /><a href="http://software.intel.com/en-us/articles/artist-animator/" title="Intel Visual Computing Artist/Animator Focus Area">Artist/Animator</a><br /><a href="http://software.intel.com/en-us/articles/media/" title="Intel Visual Computing Media Focus Area">Media</a></td>
<td></td>
</tr>
<tr>
<td height="4" colspan="3" valign="top"></td>
</tr>
<tr>
<td></td>
<td>
<h3>Develop</h3>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td valign="top"><a href="http://software.intel.com/en-us/articles/tools-vc/" title="Intel Visual Computing Development Tools">Tools</a><br /><a href="http://software.intel.com/en-us/articles/code/" title="Intel Visual Computing Development Code">Code</a></td>
<td></td>
</tr>
<tr>
<td height="4" colspan="3" valign="top"></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</center><!--END right column Content --></td>
</tr>
</tbody>
</table>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/visualize-this</link>
      <pubDate>Fri, 06 Nov 2009 14:15:34 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/visualize-this#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/visualize-this</guid>
      <category>Visual Computing</category>
      <category>Intel® Software Network TV</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Intel® Graphics Performance Analyzers (Intel® GPA) FAQ</title>
      <description><![CDATA[ <i><b>Intel® Graphics Performance Analyzers (Intel® GPA), Frequently Asked Questions (FAQ)</b></i><br /><br /><i>Q: What is Intel® GPA, and why would I want to use it?</i><br />A: Intel® Graphics Performance Analyzers (Intel® GPA) is a customizable suite of software tools that provide an in-depth analysis of a game or graphics application, allowing developers to quickly and efficiently pinpoint bottlenecks and optimize their games for Intel® Integrated Graphics–based PCs. GPA allows developers to analyze the game in its normal environment and perform experiments without modifying the code. Therefore, GPA can help developers ensure their games play well on a broader range of PCs, allowing them to reach new customers and prepare their software for the future of mobile gaming.<br /><br /><i>Q: I’ve seen that a new version of GPA, version 2.2, has been released; should I switch to this new version?</i><br />A: Yes! Download it today and start using version 2.2, as it contains a number of enhancements, as well as improved stability and reliability compared to the previous versions. The 2.2 release adds the following key features: support for Microsoft Windows* 7 OS, support for Microsoft DirectX* 10, and frame-based (rather than time-based) metrics in GPA System Analyzer. Also, 2.2 includes all features from the 2.1 release (such as new DX metrics in System Analyzer, and a pixel history option and enhanced buffer viewing options in Frame Analyzer). 
To see the full list of new features in 2.2, see the documentation provided with the 2.2 download; for 2.1 features, refer to this <a title="announcing GPA 2.1" href="http://software.intel.com/en-us/articles/GPA-version2dot1/">GPA Knowledge Base article</a>.<br /><br /><i>Q: What kinds of problems can GPA find?</i><br />A: If you have "hotspots" within your game, GPA can help pinpoint them either at the system level, or by analyzing all or part of a frame (and within that frame analyze each portion of the rendering pipeline). Once you've identified these issues, GPA can let you try different experiments to see if you can eliminate them. The benefit is that GPA can help improve your frame rate and/or allow you to add new visual effects while still providing an acceptable level of user interactivity. <br /><br /><i>Q: How do System Analyzer and Frame Analyzer help identify optimization opportunities in my game?</i><br />A: The System Analyzer application provides access to system-wide metrics for your game, including the CPU, GPU, Microsoft DirectX*, and the graphics driver. Within System Analyzer you can perform various "what-if" experiments to diagnose at a high level whether your game's performance bottlenecks are concentrated within one or more of these areas, helping you determine whether additional fine-tuning of your application using Frame Analyzer (if GPU-bound) or other Intel performance optimization products would be helpful. Once you've determined that the issue is within the GPU, the Frame Analyzer application allows you to drill down within a single graphics frame to pinpoint specific rendering problems, such as texture bandwidth, pixel shader performance, level-of-detail issues, or other bottlenecks within each portion of the rendering pipeline. 
For example, using the "simple pixel shader" you can determine what portion of the rendering time is being spent within the shaders; by editing the shaders within Frame Analyzer itself, you can determine whether you can achieve a faster rendering time at the same level of visual quality.<br /><br /><i>Q: What are the key advantages of GPA?</i><br />A: Intel has worked extensively with game developers to create a product that precisely meets their needs, so they can quickly optimize games. The key advantages of using GPA are:<br />
<blockquote>
<ul>
<li><span style="text-decoration: underline;">Intuitive interface:</span> Quickly find issues, without a lot of clutter; GPA’s easy work flow fits the way game developers want to optimize their games.</li>
<li><span style="text-decoration: underline;">In-depth, real-time analysis:</span><i> </i>Identify bottlenecks, experiment with changes, and see results in real time — all within GPA and without modifying the game code.</li>
<li><span style="text-decoration: underline;">Remote network model:</span> Eliminate the processing overhead of running tools on the same system as your game, thereby improving the overall accuracy of your results (so that you don't end up trying to optimize portions of your code that aren't an issue).</li>
<li><span style="text-decoration: underline;">Extensive API:</span> Extend the tools for your specific needs by adding your own metrics and/or using metrics gathered by GPA in your own analysis tools.</li>
<li><span style="text-decoration: underline;">Intel Integrated Graphics support:</span> Optimize games and graphics-intensive applications for Intel Integrated Graphics–based systems; GPA is the only toolset that can help you optimize your game on these devices.</li>
</ul>
</blockquote>
<i>Q: What are the GPA system requirements?</i><br />A: GPA requires a PC with a 1GHz or faster processor and 512MB of video RAM; 2GB of system memory is recommended. Product installation requires 100MB of disk space, and you'll need 5GB of disk space for all product features and all architectures. GPA also requires either Microsoft Windows XP* OS 32 bit edition with Service Pack 3, Microsoft Windows Vista* OS (32 or 64 bit version) with Service Pack 2, or Microsoft Windows 7* OS. For full GPA support, the game or application should use the Microsoft DirectX* 9 or Microsoft DirectX* 10 API (see below for comments on DX10.1 and DX11 support).<br /><i><br />Q: What graphics devices does GPA support?</i><br />A: GPA supports the Intel® G45 Express Chipset and the Mobile Intel® GM45 Express Chipset. Although Intel does not test or support the use of these tools on non-Intel graphics or older generations of Intel® Integrated Graphics chipsets, the tools do not block or otherwise prevent use with non-Intel graphics.<br /><br /><i>Q: What's the cost of GPA?</i><br />A: The GPA tool is available at no charge to members of Intel’s Visual Adrenaline Developer Program. For more information on membership in this free program, visit the <a title="Visual Adrenaline Home Page" href="http://www.intel.com/software/visualadrenaline" target="_blank">Visual Adrenaline Home Page</a>. Additionally, GPA can be purchased for $299 from the <a title="IBX for GPA" href="http://sx.intel.com/p-744-intel-graphics-performance-analyzers.aspx" target="_blank">Intel Business Exchange</a>. <br /><br /><i>Q: How do I start using GPA?</i><br />A: It's easy to get started with GPA. Most users can start using GPA immediately after installing the package, since GPA uses standard graphics drivers and doesn't require modifications to your game code. 
To get you up and running quickly, check out the <a title="GPA Quick Start Guide" href="http://software.intel.com/en-us/articles/intel-graphics-performance-analyzers-quick-start-guide/" target="_blank">GPA Quick Start Guide</a>, which takes you through the installation process, then shows you how to run the key GPA applications with a simple graphics application.<br /><br /><i>Q: How difficult is it to learn how to use the product?</i><br />A: The GPA product features an intuitive user interface that does not require extensive training to quickly access key performance metrics. Therefore, many users are able to realize the benefits of GPA very quickly. However, as GPA allows you to perform precise analysis and experiments for every portion of the graphics pipeline, users with a detailed knowledge of DX will be more readily able to utilize these advanced options within GPA.<br /><br /><i>Q: How do I get support for GPA?</i><br />A: The primary support model for GPA is through the <a title="GPA Support Forum" href="http://software.intel.com/en-us/forums/intel-graphics-performance-analyzers/" target="_blank">GPA Support Forum </a>and the <a title="GPA Knowledge Base Articles" href="http://software.intel.com/en-us/articles/intel-gpa-kb/all/1/" target="_blank">GPA Knowledge Base</a>. At the Support Forum you can ask questions about the product, share your experiences with other GPA users, and ask for assistance should you encounter issues when using the product. The Knowledge Base area contains various “tips &amp; tricks”, training material, and pointers to other information that may be of interest to GPA users.<br /><br /><i>Q: Where do I find out more information about GPA?</i><br />A: To find out more about the GPA tool suite, visit the <a title="GPA Home Page" href="http://www.intel.com/software/gpa" target="_blank">GPA Home Page</a>. 
The product’s home site provides detailed information about the tool, including information on how to download the tool, training and support resources, and videos on the product to help you get started quickly.<br /><br /><i>Q: Though GPA seems to be targeting game developers, will GPA work with other graphics applications?</i><br />A: GPA was specifically developed to meet the needs of game developers. However, the features of GPA could be used to analyze the performance of other visual computing applications. In other words, our expectation is that anyone developing graphics applications, both "expert" and "novice" alike, should be able to take advantage of the analysis and optimization capabilities of GPA. <br /><br /><i>Q: Will GPA eventually support older Intel graphics chipsets? </i><br />A: The latest Intel® graphics chipsets, namely the Intel® G45 Express Chipset and the Intel® Mobile GM45 Express Chipset, have hardware support for various GPU metrics that are not available in the older graphics chipsets. Therefore, even if these graphics drivers were updated, GPA would not be able to provide these metrics in these older devices. <br /><br /><i>Q: Will GPA support all future Intel graphics devices, including Larrabee?</i><br />A: Intel intends to continue offering the tools that allow developers to take the best advantage of Intel graphics devices, both now and into the future. We will continue to identify, with close cooperation from developers, the best tools to enable optimization and performance of these devices. <br /><br /><i>Q: What should I expect to see if I attempt to run GPA on non-supported graphics devices? </i><br />A: GPA will warn you when running on unsupported hardware, and then continue to work as best as possible. We have not tested GPA extensively on non-Intel hardware, but we have had reports of customers running GPA very successfully on some graphics hardware. Features will vary based upon the hardware capabilities of these devices. 
For example, System Analyzer on non-supported devices is not able to show metrics gathered from the graphics device (such as pixel draw rate), but most of the Frame Analyzer functions should work on any graphics device.<br /><br /><i>Q: What is your plan for supporting DX10.1 and DX11?</i><br />A: We are actively exploring enhancing GPA to support DX10.1 and DX11. Specific plans for these features will be announced at a later date. <br /><br /><i>Q: What is your plan for supporting OpenGL?</i><br />A: At this time, Intel does not have any plans to support OpenGL. In order to build the best DX tool possible, GPA is directly tied to DX.<br /><br /><i>Q: How does GPA compare with other Intel products such as Intel® VTune™ Performance Analyzer and Intel® Parallel Studio?</i><br />A: GPA is complementary to other Intel tools. It helps determine whether potential graphics performance bottlenecks exist, and then helps you analyze them and perform “what if” experiments to optimize the graphics portion of your application. If the bottlenecks are determined to be CPU issues, then the other Intel tools mentioned here can help identify those performance bottlenecks and optimize your application for the CPU. <br /><br /><i>Q: Does GPA provide an API so that I can add my own metrics, or "grab" the metrics for use in my own analysis tool?</i><br />A: Advanced users can install the GPA SDK; this includes an API that allows users to either create their own metrics, or access the various CPU and GPU metrics from within their own analysis tools. 
To access this interface, be sure to select "GPA SDK" when you install GPA; for help in using these API functions, refer to the "Intel® Graphics Performance Analyzer Core Services API Reference" from the Start Menu under the Graphics Performance Analyzers submenu.<br /><br /><i>Q: Have developers been able to use GPA to improve the performance of "real world" games?</i><br />A: Many developers have utilized GPA to demonstrate improved performance of games on Intel integrated graphics-based PCs. Many of these games can be found in the <a title="Game Gallery" href="http://software.intel.com/sites/billboard/index.php" target="_blank">Game Gallery</a>. A partial list of titles includes <i>Demigod</i>™ from Gas Powered Games*, <i>Empire: Total War </i>from Sega*, and <i>Ghostbusters, The Video Game </i>from Terminal Reality*. The performance gains in these games include both increased frame rate and additional visual features that improve the user experience.<br /><br /><i>Q: Do I have to modify the software for my game, or install special drivers, in order to be able to use GPA?</i><br />A: Using the latest standard graphics drivers available from Intel, your game can be analyzed without any modification by GPA. 
This is possible because GPA can access the CPU, driver, DX, and GPU metrics directly from the game environment, and therefore does not need you to insert special calls or load special drivers to analyze the game.<br /><br /><i>Q: How do I submit suggestions or feedback to the GPA team?</i><br />A: Use the <a target="_blank" title="submit feedback and suggestions on GPA" href="http://software.intel.com/en-us/forums/intel-graphics-performance-analyzers/">Intel GPA Support Forum </a>to submit suggestions on new features, and/or to comment on the features currently in the product.<br /><br /><i>Q: How fast is frame capturing if I need to catch a single frame in a fast-moving scene?</i><br />A: In the GPA 2.1 release, we've added the ability to "single-step" one frame at a time, which allows you to capture the specific frame that interests you. <br /><br /><br /><i>* Other names and brands may be claimed as the property of others.</i><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/gpa-faq</link>
      <pubDate>Wed, 04 Nov 2009 12:02:46 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/gpa-faq#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/gpa-faq</guid>
      <category>Visual Computing</category>
      <category>Intel® Graphics Performance Analyzers Knowledge Base</category>
      <category>Intel® Graphics Performance Analyzers (GPA)</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Do-it-yourself Game Task Scheduling</title>
      <description><![CDATA[ I attended my first demo party in 2008: Evoke in Germany. I was giving a talk about multi-core optimization in games and how to use Intel® Threading Building Blocks (Intel® TBB) to efficiently spread work over threads, when this question came up: Can I use this in 64K? The rules for 64K demos are simple, "65536 bytes maximum, one self contained executable," and the results are often unbelievable. Intel® TBB happens to be a really elegant and slim library but, at 200KB, it just won't do. But, I hate to say no. Inevitably, I couldn't help but contemplate the idea of a sort of working scale model of Intel® Threading Building Blocks. It would be a minimal task scheduler, something that would be easy to study, tear apart, and play with. I was on a mission!<br /><br />Nulstein is the demo I created to address this need. It shows a simple but effective method for implementing task scheduling that can be adapted to most game platforms. <a href="http://software.intel.com/file/23093/">Click this link to download the code to Nulstein</a>.<br /><br /><br />
<h1 class="sectionHeading">Scheduling Tasks</h1>
If you are not familiar with task schedulers and why they are useful in games, the key lies in the difference between a thread and a task. A thread is a virtually infinite stream of operations which blocks when it needs to synchronize with another thread. A task, on the other hand, is a short stream of operations that executes a fraction of the work independently of other tasks and doesn't block. These properties make it possible to execute as many tasks simultaneously as the processor can run physical threads, and the work of the task scheduler mainly comes down to finding a new task to start when one finishes. This becomes truly powerful when you add that a task can itself spawn new tasks, as part of its execution or as a continuation. While the idea of splitting work into a collection of smaller tasks is straightforward, dealing with situations where a thread would normally block can be trickier. Most of the time a task can simply consume other tasks until the expected condition arises, and otherwise it is usually a simple matter of splitting the work into two tasks around the waiting point and letting the synchronization happen <i>implicitly</i>. But we'll come back to this later on.<br /><br />Breaking work down into tasks and using a scheduler with task stealing is a convenient, powerful, and efficient way to make use of multi-core processors.<br /><br />From a programming standpoint, on a system with <i>n</i> logical cores, Nulstein creates <i>n-1</i> worker threads to assist the game's main thread with running the tasks. Each worker manages its own "pile of work," a list of tasks that are ready to run. Every time a task finishes, the worker picks the next one from the top of its pile; similarly, when a task is created, it is dropped directly onto the top of the pile. This is much more efficient than having one global job queue, as each thread can work independently without any contention. But there is a catch: some piles might become empty much faster than others. 
In these cases, the scheduler steals the bottom half of the pile of a busy thread and gives it to a starving thread. This turns out to limit contention considerably because only two threads are impacted by the mutual exclusion necessary to carry out this operation.<br /><br /><br />
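The pile-and-steal idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the Nulstein source: <span style="font-family: courier;">WorkPile</span>, <span style="font-family: courier;">SpinLock</span>, and the use of plain integers as stand-ins for tasks are hypothetical names chosen for this example.<br /><br />

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Minimal spin lock (assumption: only a few instructions are ever protected).
class SpinLock {
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical per-worker pile of ready tasks; ints stand in for real tasks.
class WorkPile {
public:
    // The owning worker drops new work on top of its pile.
    void Push(int task) {
        lock_.lock();
        tasks_.push_back(task);
        lock_.unlock();
    }

    // The owning worker takes its next task from the top; false when empty.
    bool Pop(int& task) {
        lock_.lock();
        const bool found = !tasks_.empty();
        if (found) {
            task = tasks_.back();
            tasks_.pop_back();
        }
        lock_.unlock();
        return found;
    }

    // A starving worker takes the bottom half of this pile (rounded up).
    // Only the victim and the thief ever contend on this lock.
    std::vector<int> StealBottomHalf() {
        lock_.lock();
        const std::size_t count = (tasks_.size() + 1) / 2;
        assert(count <= tasks_.size());
        const auto mid = tasks_.begin() + static_cast<std::ptrdiff_t>(count);
        std::vector<int> stolen(tasks_.begin(), mid);
        tasks_.erase(tasks_.begin(), mid);
        lock_.unlock();
        return stolen;
    }

    std::size_t Size() {
        lock_.lock();
        const std::size_t size = tasks_.size();
        lock_.unlock();
        return size;
    }

private:
    SpinLock lock_;
    std::deque<int> tasks_;  // front = bottom of the pile, back = top
};
```

The owner works at the top of the pile while a thief takes from the bottom, so the most recently created tasks stay with the thread that created them.<br /><br />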
<h1 class="sectionHeading">Tasks and Task Pool Overview</h1>
<p style="text-align: center;"><img src="http://software.intel.com/file/23493" /></p>
<br />
<div style="text-align: center;"><b>Figure 1</b><br /></div>
<br />The task engine code is in TaskScheduler.h/.inl/.cpp (header, inlines, and code). <span style="font-family: courier;">CTaskPool</span> is primarily a collection of <span style="font-family: courier;">CWorkerThread</span>, where the bulk of the logic resides. <span style="font-family: courier;">CInternalTask</span> is the abstract superclass for all tasks; you will use subclasses of this class in the implementation of <span style="font-family: courier;">ParallelFor</span> and <span style="font-family: courier;">CSorter</span>.<br /><br /><span style="font-family: courier;">ParallelFor</span> is the simplest form of parallel code: a loop where iterations can execute independently of each other. Given a range and a method to process a section, work is spread over available threads and <span style="font-family: courier;">ParallelFor</span> returns once it has covered the full range.<br /><br /><span style="font-family: courier;">CSorter</span> implements a simple parallel merge sort, spawning new tasks for blocks bigger than a given threshold. Although the goal is to reduce code size, this is done as a C++ template to avoid the overhead of calling a function every time two items need to be compared.<br /><br />There is a very convenient effect here: code using these functions can still be understood as serial code. Code around a <span style="font-family: courier;">ParallelFor</span> executes before and after it, just as it reads.<br /><br />For uses beyond simple looping and sorting, you will need to spawn your own tasks. This is quite simple too:<br /><br />
<pre name="code" class="cpp">{<br />    CTaskCompletion Flag;<br />    CMyTask* pTask;<br />	<br />    pTask = new CMyTask(&amp;Flag,…);<br />    pThread-&gt;PushTask(pTask);<br />    …<br />    pThread-&gt;WorkUntilDone(&amp;Flag);<br />}<br /></pre>
<br />Your specific task is implemented by <span style="font-family: courier;">CMyTask</span> and you use Flag to track when it is done. (Note that <span style="font-family: courier;">pThread</span> must be the current thread.) Once <span style="font-family: courier;">PushTask</span> has been called, the task is eligible to be executed by the scheduler, or may be stolen by another thread. The current thread can continue to do other things, including pushing more tasks, until it calls <span style="font-family: courier;">WorkUntilDone</span>. This last call will run tasks from the thread's pile, or attempt to steal from other threads, until the completion flag is set. Again, it looks as if your task had been executed serially as part of the call (and it might have, indeed).<br /><br />
<pre name="code" class="cpp">{<br />     CMyTask* pTask;<br />	<br />     pTask = new CMyTask(pThread-&gt;m_pCurrentCompletion,…);<br />     pThread-&gt;PushTask(pTask);<br />}<br /></pre>
<br />In this alternative form, the task is created as a continuation, and you don't wait for it to complete as whatever waits for you will now also wait for this new task. When possible, this is a better approach as this is less synchronization work.<br /><br />These basic blocks are enough to implement all sorts of parallel algorithms used in games, in a fashion that reads serially. You still have to worry about access to shared data, but you can continue to write code that works in a series of steps which remain easy to read.<br /><br /><br />
<h1 class="sectionHeading">Inside the scheduler</h1>
Looking at what is happening inside, you see <span style="font-family: courier;">CTaskPool</span> is the central object; it creates and holds the worker threads. Initially, these are blocked waiting on a semaphore; the scheduler is idle and consumes no CPU. As soon as the first task is submitted, it is split between all threads (if possible) and the semaphore is raised by <i>worker_count</i> in one step, waking all threads as close to immediately as possible. The pool keeps track of the completion flag for this root task and workers keep running until it becomes set. Once done, all threads go idle again and a separate semaphore is used to make sure all threads are back to idle before accepting any new task. Conceptually, all workers are always in the same state: either all idle or all running.<br /><br />The role of <span style="font-family: courier;">CWorkerThread</span> is to handle tasks, which can be broken down into processing, queuing, and stealing them.<br /><br />Processing - The <span style="font-family: courier;">threadproc</span> handles the semaphores mentioned earlier and repeatedly calls <span style="font-family: courier;">DoWork(NULL)</span> when active. This method pops tasks from the pile until there are no more, and then it tries to steal from other workers and returns if it can't find anything to steal. <span style="font-family: courier;">DoWork</span> also can be called by <span style="font-family: courier;">WorkUntilDone</span> if a task needs to wait for another to finish before it can continue; in this case the expected completion flag is passed as a parameter and <span style="font-family: courier;">DoWork</span> returns as soon as it finds it set.<br /><br />Queuing - Because of stealing, there is a risk of contention. Since operations on the queue require a lock, use a spinning mutex, because you need to protect only a few instructions. 
<span style="font-family: courier;">PushTask</span> increments the task's completion flag and puts the task at the top of the queue. In the special case when the queue is full, run the task immediately as this produces the correct result. It's also worth noting that if the queue is full, then other workers must be busy too or they'd be stealing from you. Tasks also get executed immediately in the special case of a single core system because there is no point in queuing work when there is nothing to steal it; the whole scheduler is bypassed and the overhead of the library disappears.<br /><br />Stealing - <span style="font-family: courier;">StealTasks</span> handles the stealing. It loops on all other workers checking if one wants to <span style="font-family: courier;">GiveUpSomeWork</span>. If a worker has only one task queued, it will attempt to split it in two and transfer a "half-task" to the idle thread. If it didn't split or if there is more than one, it will return half the tasks (rounding up). The fact that workers are spinning on <span style="font-family: courier;">StealTasks</span> when their queue is empty enables them to return to work as soon as a task becomes available. This is important in the context of a game where latency tends to be more important than throughput.<br /><br />There isn't much more to the scheduler than that. The rest is implementation details best left to discover in the source code. But before you do that, you should know how the Nulstein demo uses the scheduler to take maximum advantage of implicit synchronization and to achieve most of the frame in parallel.<br /><br /><br />
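As a rough illustration of the queuing rules described above, the following single-threaded C++ sketch shows a completion counter that is incremented on push and decremented when a task finishes, plus the "queue full, run it now" fallback. It deliberately leaves out locking and stealing, and the names <span style="font-family: courier;">Completion</span>, <span style="font-family: courier;">Task</span>, and <span style="font-family: courier;">Worker</span> are hypothetical simplifications rather than the actual Nulstein classes.<br /><br />

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a completion flag: counts tasks still pending.
struct Completion {
    std::atomic<int> busy{0};
    void MarkBusy(bool isBusy) { busy.fetch_add(isBusy ? 1 : -1); }
    bool IsDone() const { return busy.load() == 0; }
};

// A task knows which completion it reports to and what work it performs.
struct Task {
    Completion* completion;
    std::function<void()> run;
};

// Simplified worker: no locking and no stealing, so it is single-threaded only.
class Worker {
public:
    explicit Worker(std::size_t capacity) : capacity_(capacity) {}

    // Increment the completion, then queue the task on top of the pile.
    // If the queue is full, running the task right away gives the same result.
    void PushTask(const Task& task) {
        task.completion->MarkBusy(true);
        if (queue_.size() >= capacity_) {
            Execute(task);
            return;
        }
        queue_.push_back(task);
    }

    // Run queued tasks until the expected completion reports done.
    void WorkUntilDone(Completion* expected) {
        while (!expected->IsDone()) {
            // A real scheduler would try to steal here instead of asserting.
            assert(!queue_.empty());
            Task task = queue_.back();
            queue_.pop_back();
            Execute(task);
        }
    }

private:
    static void Execute(const Task& task) {
        task.run();
        task.completion->MarkBusy(false);
    }

    std::size_t capacity_;
    std::vector<Task> queue_;  // back = top of the pile
};
```

With a capacity of two, a third pushed task runs inside <span style="font-family: courier;">PushTask</span> itself, and the remaining two run during <span style="font-family: courier;">WorkUntilDone</span>.<br /><br />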
<h1 class="sectionHeading">A Parallel Game Loop</h1>
<p style="text-align: center;"><img src="http://software.intel.com/file/23494" /></p>
<br />
<div style="text-align: center;"><b>Figure 2</b><br /></div>
<br />There are traditionally two main phases in a frame: the <i>update</i> that advances time and the <i>draw</i> that makes an image. In Nulstein, these phases have been subdivided further to achieve parallelism.<br /><br />The update is split into two phases. The first is a pre-update phase where every entity can read from every other but cannot modify its public state. This allows every entity to make decisions based on the state at "previous frame." They will then apply changes during the second phase, which is the actual <i>Update</i>. The rule for this second phase is that an entity can write to its state but must not access any other entity. This enables both of these phases to run as simple <span style="font-family: courier;">ParallelFor</span>'s and is trivial to implement unless there is a hard dependency between entities and you can't use the previous frame's state. A classic example would be a camera attached to a car: you don't want the viewport to move inside the car and need to know its exact position and orientation before you can update the camera. In these cases an entity can declare itself dependent on another entity (or several entities) and be updated only once it has been updated. And because you know they have finished updating, it's okay for the dependent object to read the updated states. In the demo, this is how the small cubes manage to stay tightly attached to the corners of bigger cubes.<br /><br />
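The two update phases can be illustrated with a small C++ sketch. The <span style="font-family: courier;">Entity</span> fields and the "move halfway toward the next entity" rule are invented for this example; the point is only that each function follows the read/write rule of its phase, so a <span style="font-family: courier;">ParallelFor</span> could hand any sub-range to any thread.<br /><br />

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical entity: public state that others may read, plus a private
// decision made during pre-update and applied during update.
struct Entity {
    float position = 0.0f;     // public state
    float pendingMove = 0.0f;  // private to the entity
};

// Phase 1: may read the public state of any entity (still the previous
// frame's values) but writes only to the entity's own private data.
void PreUpdate(std::vector<Entity>& entities, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= entities.size());
    for (std::size_t i = begin; i < end; ++i) {
        const std::size_t next = (i + 1) % entities.size();
        entities[i].pendingMove =
            0.5f * (entities[next].position - entities[i].position);
    }
}

// Phase 2: writes the entity's own public state and touches no other entity.
void Update(std::vector<Entity>& entities, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= entities.size());
    for (std::size_t i = begin; i < end; ++i) {
        entities[i].position += entities[i].pendingMove;
    }
}
```

Because no range writes anything that another range reads within the same phase, the sub-ranges can be processed in any order, or concurrently, with the same result.<br /><br />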
<p style="text-align: center;"><img src="http://software.intel.com/file/23495" /></p>
<br />
<div style="text-align: center;"><b>Figure 3</b><br /></div>
<br />The draw phase is split into three phases. During the <i>Draw</i>, every entity is called to list items it needs to render and adds the items to a display list. Each list entry has a 64-bit key that encodes an ID and other data such as z-order, alpha-blending, material, and so on. This is done through a <span style="font-family: courier;">ParallelFor</span>, with each thread adding to independent sections of an array. During this phase, things like visibility culling and filling of dynamic buffers can be done in parallel. Once every entity has declared what it wants to draw, the array goes through <i>Sort</i>, which can be done in parallel too (although here, with on the order of a thousand entities, it doesn't make a difference). Finally, the scene is rendered, with each item in the sorted list calling back the parent entity, which does the actual draw calls.<br /><br />
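To make the sort-key idea concrete, here is a hedged C++ sketch of packing draw attributes into a single 64-bit integer so that an ordinary integer sort produces the render order. The field widths and their order (alpha flag, then material, then depth, then ID) are assumptions made for this example; the article does not specify the layout Nulstein uses.<br /><br />

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed layout, most significant bits first:
//   1 bit alpha-blended | 15 bits material | 16 bits depth | 32 bits item ID
inline std::uint64_t MakeKey(bool alphaBlended, std::uint32_t material,
                             std::uint32_t depth, std::uint32_t id) {
    assert(material < (1u << 15) && depth < (1u << 16));
    return (static_cast<std::uint64_t>(alphaBlended ? 1 : 0) << 63)
         | (static_cast<std::uint64_t>(material) << 48)
         | (static_cast<std::uint64_t>(depth) << 32)
         | static_cast<std::uint64_t>(id);
}

// Recover the item ID so the renderer can call back the owning entity.
inline std::uint32_t KeyToId(std::uint64_t key) {
    return static_cast<std::uint32_t>(key & 0xFFFFFFFFu);
}

// Sorting the raw keys orders items by the most significant fields first.
inline void SortDisplayList(std::vector<std::uint64_t>& keys) {
    std::sort(keys.begin(), keys.end());
}
```

With this layout, opaque items sort ahead of alpha-blended ones and are grouped by material, which is one common motivation for such keys.<br /><br />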
<p style="text-align: center;"><img src="http://software.intel.com/file/23496" /></p>
<div style="text-align: center;"><b>Figure 4</b><br /></div>
<br />Figure 4 shows two Intel® Thread Profiler captures of a release build running on an Intel® Core™ i7 processor at 3.2GHz, at the same scale, with the task scheduler on and off. Because this is a release build, there is no annotation, but the benefit of using the task scheduler is nevertheless quite clear; work is shown as green bars, with the serial case above and the parallel case below. The phase that remains serial is the <i>render</i> phase and it is mainly spent in DirectX and the graphics driver, with the gray line representing time spent waiting for vblank.<br /><br />Figure 5, below, shows the demo in profile mode, where it is instrumented to show actual work as solid sections. This gives an idea of how tasks spread over all threads (timings are not accurate: instrumentation has a massive impact on the performance of our spinning mutex).<br /><br />
<p style="text-align: center;"><img src="http://software.intel.com/file/23497" /></p>
<br />
<div style="text-align: center;"><b>Figure 5</b><br /></div>
<br />The resulting executable for this project is under 40K and if you use an exe packer, like kkrunchy by Farbrausch, it actually gets down to 16K. So, today, if you were to ask me whether you can use a task scheduler with stealing in a 64K, I can give you a definite yes! Beyond this feat, and because I believe we need to experiment with things to really understand them, I'm hoping that this project will provide people interested in parallel programming with a nice toy to mess around with.<br /><br />(For any project with less drastic size constraints, I recommend you turn to Intel® Threading Building Blocks as it provides a lot more optimizations and features.)<br /><br /><br />
<h1 class="sectionHeading">Bibliography</h1>
Reinders, James. Intel Threading Building Blocks. USA: O'Reilly Media, Inc., 2007.<br /><br />Pietrek, Matt. Remove Fatty Deposits From Your Applications Using Our 32-Bit Liposuction Tools. Microsoft Systems Journal, October 1996 issue.<br /><br />Ericson, Christer. Order your graphics draw calls around! <a href="http://realtimecollisiondetection.net/blog/?p=86">http://realtimecollisiondetection.net/blog/?p=86</a><br /><br /><br />
<h1 class="sectionHeading">About the Author</h1>
Jérôme Muffat-Méridol has been writing software for the past twenty years with a focus on applications with a graphic side to them. Before joining Intel, he wrote deepViewer, a photo browser built on a very innovative point &amp; zoom interface, applying the know-how gained in ten years of video game development: he previously was Technical Director at Bits Studios, a London-based studio specializing in console games.<br /><br />
<p style="text-align: center;"><img width="311" src="http://software.intel.com/file/23498" height="258" /></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling</link>
      <pubDate>Tue, 03 Nov 2009 12:03:04 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling</guid>
      <category>Visual Computing</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Parallelization of SMOKE Gaming Demo via Intel® Threading Building Blocks</title>
      <description><![CDATA[ <h1 class="sectionHeading">Abstract</h1>
This paper describes the steps in the process of characterizing and optimizing the already parallel SMOKE Gaming Demo [1] using Intel's software suite of tools. Code characterization was done mainly with the Intel® Thread Profiler component of the Intel® VTune™ Performance Analyzer optimization tool, while the parallel optimizations were done mainly using the Intel® Threading Building Blocks (Intel® TBB) [2] template library. It is demonstrated that the use of Intel® TBB's work-stealing task scheduler [3, 4] can significantly improve CPU utilization and frame update rate for the SMOKE Gaming-Demo. We argue that the techniques used here are applicable to other gaming codes as well.<br /><br />
<h1 class="sectionHeading">Introduction</h1>
The main frame processing loop of a typical computer game is composed of a few functional blocks. Typically these are the Artificial Intelligence (AI), Physics, Particles, and Rendering computational functions. A popular strategy for parallelizing this loop is to represent the functional blocks as stages of a pipeline which can then be run in parallel across available processors, where the normal dependencies between the stages of the pipeline apply. Naturally, in this approach theoretical scalability is also limited by the number and weights of the stages in the pipeline. Most implementations of this approach assign one stage to a physical processor. As a result, on machines where the number of processors is greater than the number of pipeline stages, this approach obtains no benefit from the additional processor resources available. On the plus side, the method can be implemented to maintain data locality with respect to processor cache as each data item is run through the pipeline.<br /><br />In pure serial implementations of the main computational loop, the heterogeneous tasks which make up the body of the loop are executed in an ordered fashion. In the parallel case one may relax this ordering and attempt a more asynchronous execution of the tasks as long as the fidelity of the output is not affected or is maintained to some acceptable degree. Such an approach to parallelization of the frame loop is intriguing in that scalability is now limited only by the availability of tasks ready to execute and cores to process them, subject, of course, to any and all constraints that apply between the different tasks. Further, one may parallelize each of the high level tasks into subtasks and exploit any fine grained parallelization that may be available. 
If one employs a thread pool to avoid oversubscription of the CPUs, the problem then becomes one of scheduling a pool of tasks to a pool of threads subject to any constraints that may apply.<br /><br />This paper examines the realizable benefits of the aforementioned task parallel approach as applied to the SMOKE computer game demo. The task parallel implementation is done with the Intel® TBB C++ template library. In a nutshell, the library provides generic parallel algorithms and concurrent containers [5, 6] which enable users to write parallel programs without directly creating and managing threads. Indeed, with this library, users need only focus on representing code in terms of tasks. This step can be done implicitly via library provided high level algorithms or explicitly via derivations of a base task class, also provided by the library. All aspects of thread management and the mapping of tasks to threads are handled by the library in a manner transparent to the user. Internally, the library treats tasks as user-level objects that are scheduled for execution by the Intel® TBB task scheduler. The task scheduler maintains a pool of native threads and a set of queues (one queue per thread) of tasks ready for execution. At initialization, the Intel® TBB task scheduler creates an appropriate number of threads in the pool (by default, 1 per hardware thread). During code execution the scheduler distributes tasks to threads using a randomized work-stealing algorithm. The decentralized (each thread has its own queue of tasks) work stealing mechanism is what enables the scheduler to achieve near optimal load balance and high scalability of parallel programs [3, 4]. 
All Intel® TBB algorithms are tested and tuned for the current generation of multi-core processors, and they are designed to scale as the core count continues to increase.<br /><br />A detailed review of Intel® TBB features specifically as they apply to computer games has already been provided by one of us in the past [7]. In this paper, we only focus on how Intel® TBB has been implemented in an optimized version of the SMOKE gaming demo. The source code we discuss here is publicly available on the Intel Software Network [1].<br /><br />The organization of the remainder of the paper is as follows: the next section describes the use of Intel® Thread Profiler to obtain a detailed characterization of the SMOKE code. The section also covers the code changes made to alleviate the main bottlenecks detected by Intel® Thread Profiler. Following that, we report the performance improvements obtained as a result of the code changes. We end with a summary of the present work and point to some directions for future investigations.<br /><br />
<h1 class="sectionHeading">Code Characterization, Parallelization</h1>
SMOKE was originally developed as a multi-threaded computer game demo implementing most of the major features of modern 3D games. It has been used to demonstrate efficient parallelization techniques applicable to the wide range of CPU-intensive computer games. As always, one can find room for improvement, and indeed initial runs of the SMOKE code under Intel® Thread Profiler revealed a sub-optimal overall concurrency level. A screenshot of the actual profile obtained is shown in Figure 1. The profile view (top pane) shows a concurrency level of slightly greater than 2 on 4 cores. The timeline view (bottom pane) shows the time based execution flow of the parallel code, namely the initialization step, the main computational loop, and the application wrap-up before termination. For SMOKE, this view also shows a region of considerable contention (marked in yellow by the tool). This part of the timeline corresponds to the main computation loop of the SMOKE code. <br /><br />
<p style="text-align: center;"><img src="http://software.intel.com/file/23500" /></p>
<br /><br /><b>Figure 1:</b> <em>Intel Thread Profiler output for the initial run of the SMOKE binary. The top pane of the picture is the "Profile View"; the bottom pane is the "Timeline View".</em><br /><br />With the initial profile in hand, the next step is to take a closer look at the timeline view. This is conveniently done using the zoom feature of Intel® Thread Profiler. One can select and magnify any region of the timeline view and, in the process, not only get a better view of the behavior of the threads but also automatically get the average concurrency level for the selected region. A zoomed view of just a few iterations (Figure 2) of the main computation loop showed two obvious things:<br />
<ul>
<li>The concurrency level did not change from iteration to iteration, implying that a very small subset could be chosen for detailed analysis</li>
<li>The iterations suffer from ending with a critical section, where changes to the game scene get distributed among worker threads before they start working on the next frame</li>
</ul>
<br />
<p style="text-align: center;"><img src="http://software.intel.com/file/23501" /></p>
<br /><br /><b>Figure 2:</b> <em>A zoomed-in view of a section of the timeline view is shown in the lower pane. This view focuses on only a few sample iterations of the main computational loop in the SMOKE code. The serialization between iterations is also clearly identified in this view. The concurrency level for these few iterations is shown in the top pane.</em><br /><br />Ultimately, any analysis with Intel® Thread Profiler has only one main goal: to uncover issues with an application's multi-threading design or implementation and associate those issues with particular lines in source code. As such, the last step of our investigation is to isolate the source code corresponding to the serialization mentioned above. Intel® Thread Profiler allows one to do this via a menu item (obtained by right clicking anywhere near but not on the yellow line of a transition) called "Transition Source View". In this case the view shows the transition from four worker threads to one main thread, when it enters the critical section. The source view of the tool also displays the function call stack, and walking up this stack for the main thread, we arrive at the location of two function calls at the top level of the main computation loop. These function calls distribute changes made to various game objects in the iteration just completed. This is as far as Intel® Thread Profiler can take the analysis, and now the scrupulous work of reading the code begins.<br /><br />Further investigation of the source showed that the "DistributeChanges()" functions responsible for updating game world objects and scenes cannot be parallelized in the manner written. Both of the functions contain data dependencies and use shared data structures. Their underlying data structures and the way threads work with them had to be modified in order for this functionality to become parallelizable. The changes to the relevant data structures were approached in the following manner:<br />
<ul>
<li>SMOKE uses an associative container (a map) to collect the change notifications made to game world objects. Find-and-Erase functionality in a thread-safe map introduces noticeable overhead. This overhead can be avoided if the map is replaced by a vector for which each thread has its own range of elements holding notifications for only it to process.</li>
<li>The number of change notifications is not known in advance, so in order to be able to divide up the vector between all threads, each thread first needs to generate the notifications in Thread Local Storage (TLS), count them, grow the vector atomically to the necessary length, and then concurrently copy change notifications from TLS to the common container. None of these steps requires explicit synchronization between threads.</li>
<li>After the above changes were implemented it turned out those functions that process different types of change notifications are not thread-safe either. Guarding access to certain common resources fixed this problem.</li>
</ul>
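<br />The per-thread range idea from the list above can be sketched in C++ as follows. This is an illustration under assumptions, not the SMOKE code: the type and function names are hypothetical, and a plain <span style="font-family: courier;">std::vector</span> with an atomic counter stands in for a concurrent container that can grow atomically. As a consequence, this sketch sizes the vector once between the reservation and copy steps, a simplification that the approach described in the text does not need.<br /><br />

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical change record; the fields are placeholders for this sketch.
struct ChangeNotification {
    int objectId;
    int changeMask;
};

// Shared container: a vector plus an atomic count of slots handed out so far.
struct SharedChanges {
    std::atomic<std::size_t> reserved{0};
    std::vector<ChangeNotification> items;
};

// Step 1 (each worker): after counting its thread-local notifications, a
// worker atomically reserves a private range and gets back its first index.
std::size_t ReserveRange(SharedChanges& shared, std::size_t localCount) {
    return shared.reserved.fetch_add(localCount);
}

// Step 2 (once, after every worker has reserved): size the shared vector.
// Simplification of this sketch; see the note in the lead-in.
void Commit(SharedChanges& shared) {
    shared.items.resize(shared.reserved.load());
}

// Step 3 (each worker): copy thread-local notifications into the worker's own
// range. Ranges never overlap, so no lock is needed for the copies.
void CopyLocal(SharedChanges& shared, std::size_t first,
               const std::vector<ChangeNotification>& local) {
    assert(first + local.size() <= shared.items.size());
    std::copy(local.begin(), local.end(),
              shared.items.begin() + static_cast<std::ptrdiff_t>(first));
}
```

The order in which workers reserve their ranges decides where their notifications land, but each worker always writes only to the slots it reserved.<br />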
<br />The changes outlined above were introduced into the code and Intel® Thread Profiler was used to assess the effect. Figure 3 below depicts a zoomed-in view of a previously single-threaded section of application run time. The majority of the application run time is now spent at concurrency levels of three or four.<br /><br />
<p style="text-align: center;"><img src="http://software.intel.com/file/23502" /></p>
<br /><br /><b>Figure 3:</b> <em>Zoomed-in view of a previously single-threaded section of code. The changes described in the text have rendered the section parallel with a concurrency level of ~3 to 4.</em><br /><br />The yellow bar in Profile View represents the time spent on synchronization between worker threads (i.e., overhead). Even with this overhead, however, the changes deliver an improvement in overall code performance. As with any synchronization point, the mechanism chosen to effect the synchronization is critical. The most commonly used mechanisms are atomic operations, user-level and kernel-level mutexes, fair mutexes, and reader-writer mutexes. Each has its own advantages and disadvantages, and some experimentation is required when choosing a mechanism for a particular application and workload. Here, Intel® Thread Profiler can assist in the following way. The timeline view of the region of interest shows that for this application and workload, worker threads "ping-pong" on the mutex/lock often and for a very short period of time. This indicates that user-level spin mutexes could be a good fit. Naturally, to be sure of the choice, one has to benchmark the application with respect to all possible mutex types and over the most commonly processed workloads on each of the target platforms. This approach will help account for hardware and software variations in the many possible environments in which the code may be run. In the case of SMOKE, these kinds of benchmarks confirmed the advantage of user-level synchronization. For this purpose, either an Intel® TBB spin_mutex or a Critical Section with a spin count would be a good candidate.<br /><br />Having addressed the serialization-between-iterations issue, we now turn to something fundamental in parallel processing: load balance. 
For the case at hand, the main loop is composed of a few different types of tasks, each of which takes a different amount of time to execute and offers a different level of potential parallelism. For example, the Rendering task usually takes a long time and, barring any low-level parallelism, is a single instance for each frame. AI tasks, on the other hand, usually take relatively small amounts of time to complete, with several instances possible per frame. AI tasks represent every "thinking" object in the game, and usually all of them can be executed in parallel on all available worker threads. Temporally, Physics tasks represent something in between AI and Rendering tasks. Physics tasks are therefore of "average" size and can appear in large numbers. The collision computation associated with them usually implies communication, which limits the concurrency level achievable in the handling of these tasks.<br /><br />To understand the implications of all this in practice, we once again rely on the timeline view of Intel® Thread Profiler. We utilize a standard feature that allows one to mark certain activities for each worker thread. This allows a detailed analysis of task sizes, their execution order, and therefore worker thread load balance. Figure 4 illustrates the situation for SMOKE.<br /><br />
<p style="TEXT-ALIGN: center"><img width="322" src="http://software.intel.com/file/23503" height="162" /></p>
<br /><br /><b>Figure 4:</b> <em>Intel® Thread Profiler screenshot showing load imbalance</em><br /><br />To address this situation, we use the task scheduler available in the Intel® TBB library. The scheduler maps available tasks to a thread pool in a manner designed to maximize concurrency. As mentioned in the introduction, the pool of threads is created and managed by the library in a fashion transparent to the user, so the user only focuses on representing code in terms of tasks. Tasks themselves are distributed to thread-local queues, and when a thread runs out of tasks in its own queue it can steal from another thread's queue. This is the feature that enhances concurrency [2-4].<br /><br />This is only part of the story, because the order in which tasks are spawned (and thus submitted to the scheduler) can also be expected to affect execution times. Indeed, if AI tasks get spawned first, Intel® TBB will balance the load via task stealing, and the threads of the thread pool will finish their execution of this group of tasks at almost the same time. Spawning the Rendering task last will mean that when all worker threads finally finish with the AI tasks, for example, only one thread will be able to pick up the serial Rendering task. The rest of the team will have to wait for the Rendering task to be executed.<br /><br />With these considerations in mind, the Intel® TBB task scheduler was used to improve the load balance on the CPUs. For the present work, explicit task-to-thread affinity was enforced to effect task prioritization. Indeed, this could also have been done using Intel® TBB's affinity partitioner. In this approach, an affinity ID (a relative Intel® TBB thread ID) is set for a certain task, which in effect makes it a prioritized task. A thread with this ID picks up the task as soon as it runs out of local work (tasks in the local queue). Thus, the high-priority tasks, which in our case are Rendering and Physics, would get assigned an affinity ID. 
As soon as an iteration starts, one of the worker threads picks up the Rendering task, and then another thread picks up the main Physics task, while the AI, Geometry and Particles tasks get divided among the rest of the worker threads by normal stealing. This would presumably result in a more finely balanced load and reduced wait times.<br /><br /><br />
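The effect of spawn order described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the Intel® TBB scheduler: tasks are handed out in spawn order to whichever worker becomes idle first, which only approximates work stealing, and the function names and task durations are hypothetical rather than measurements from SMOKE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Idealized model: tasks are taken in spawn order, each by the worker that
// becomes idle first. Returns the time at which the last worker finishes.
double makespan(const std::vector<double>& taskDurations, std::size_t numWorkers)
{
    assert(numWorkers > 0);
    std::vector<double> busyUntil(numWorkers, 0.0);
    for (double d : taskDurations) {
        auto earliest = std::min_element(busyUntil.begin(), busyUntil.end());
        *earliest += d;
    }
    return *std::max_element(busyUntil.begin(), busyUntil.end());
}

// Builds one frame's task list: numAi short AI tasks plus one long Rendering
// task, placed either first or last in spawn order.
std::vector<double> frameTasks(std::size_t numAi, double aiTime,
                               double renderTime, bool renderFirst)
{
    std::vector<double> tasks(numAi, aiTime);
    if (renderFirst) {
        tasks.insert(tasks.begin(), renderTime);
    } else {
        tasks.push_back(renderTime);
    }
    return tasks;
}
```

With 4 workers, twelve AI tasks of 1 time unit and one Rendering task of 6 units, spawning the Rendering task last gives a frame time of 9 units in this model (all workers finish the AI tasks at 3, then three of them idle for 6), while starting it first gives 6 units because the AI tasks fill the other three workers in the meantime.<br /><br /><br />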
<h1 class="sectionHeading">Results</h1>
In the preceding section, two main optimizations to SMOKE were described. First, the underlying data structures for object change notifications and the way worker threads accessed and processed these notifications were restructured. This gave roughly a 12-15% overall performance improvement. <br /><br />The other optimization was the inclusion of the Intel® TBB task scheduler to improve the load balance of the parallel execution. Prior to the rework, the CPU load on a 2x 4 core machine [8] was in the range of 55% to 65%. After the rework, the load on the cores was in the 90% to 95% range. The frame rate also improved by roughly 45% to 60% as a result of the rework. <br /><br /><br />
<h1 class="sectionHeading">Summary and Conclusions</h1>
In summary, straightforward use of Intel® Thread Profiler identified that (i) the code spent a noticeable amount of time undersubscribed; (ii) a significant amount of serialization existed in the main computational loop; (iii) the concurrency levels did not change from iteration to iteration of the main computational loop; (iv) undersubscription occurred as a result of synchronization between iterations; and (v) the undersubscription was traced to two functions responsible for object change notifications.<br /><br />Examination of the source code pointed the way for the functions to be restructured and parallelized with limited points of synchronization. The resulting code, however, still suffered from load imbalance. The Intel® TBB task scheduler was used to improve the overall CPU load and the balance of the code.<br /><br />This work has demonstrated the effectiveness of Intel® Thread Profiler in conjunction with the Intel® TBB library in achieving performance improvements to the SMOKE gaming demo code. One future opportunity for performance gain could be to examine the use of Intel® TBB's affinity partitioner. Other avenues have been suggested and discussed in [7]. All of these considerations are expected to apply to gaming codes in general, with little if any specificity to SMOKE itself. <br /><br /><br />
<h1 class="sectionHeading">Acknowledgments</h1>
We acknowledge the generous support received from Intel's Developer Products Division, Visual Computing Software Division, and Software Solutions Group during the course of this work and the writing of this paper.<br /><br /><br />
<h1 class="sectionHeading">References</h1>
<ol>
<li>SMOKE Game-Technology Demo, Intel Software Network, available at <a href="http://software.intel.com/en-us/articles/smoke-game-technology-demo/">http://software.intel.com/en-us/articles/smoke-game-technology-demo/</a></li>
<li>James Reinders, Intel Threading Building Blocks, O'Reilly Media, Inc, Sebastopol, CA, 2007.</li>
<li>Robert D. Blumofe and Charles E. Leiserson, "Scheduling Multithreaded Computations by Work-Stealing," in Proceedings of the 35th Annual IEEE Conference on Foundations of Computer Science, Santa Fe, New Mexico, November 20-22, 1994.</li>
<li>Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall and Yuli Zhou, "Cilk: An Efficient Multithreaded Runtime System," in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, California, July 19-21, 1995.</li>
<li>Michael Voss, "Demystify Scalable Parallelism with Intel Threading Building Block's Generic Parallel Algorithms," DevX.com, Jupiter Media, October 2006, at <a href="http://www.devx.com/cplus/Article/32935">http://www.devx.com/cplus/Article/32935</a>.</li>
<li>Michael Voss, "Enable Safe, Scalable Parallelism with Intel Threading Building Block's Concurrent Containers," DevX.com, Jupiter Media, December 2006, at <a href="http://www.devx.com/cplus/Article/33334">http://www.devx.com/cplus/Article/33334</a>.</li>
<li>Bradley Werth, "Optimizing Game Architectures with TBB", at <a href="http://www.gamasutra.com/view/feature/3970/sponsored_feature_optimizing_game_.php">http://www.gamasutra.com/view/feature/3970/sponsored_feature_optimizing_game_.php</a>.</li>
<li>The system used for testing was an Intel X5355: 2x4, 2.66 GHz, 8G RAM, Windows XP x64 Pro SP2, GeForce 8800 GTX. For more info see: <a href="http://en.wikipedia.org/wiki/Xeon#5300-series_.22Clovertown.22">http://en.wikipedia.org/wiki/Xeon#5300-series_.22Clovertown.22</a></li>
</ol><br />
<h1 class="sectionHeading">About the Authors</h1>
<b>Andrei Marochko</b> is a Senior Development Engineer in the Performance Analysis and Threading (PAT) group in Intel's Developer Products Division. With almost 20 years of experience as a software developer, he has worked in a wide range of areas including numerical analysis, GUI development for distributed client-server applications, threading libraries, threading runtimes and threading tools. He holds an MS degree from Ivanovo State Chemistry and Technology University. <br /><br /><strong>Bradley Werth</strong> is a Senior Software Engineer in Intel's Entertainment Technical Marketing Engineering group. He received his Computer Science MS from University of Oregon in 2005 and his BS from USC in 1996. His focus at Intel is on developing and optimizing game features that take maximum advantage of the PC platform. He has spoken at GDC and Austin GDC about effective methods for threading game architectures.<br /><br /><b>Anton Pegushin</b> has worked for Intel, Russia, for a little over 6 years. He started as part of the development team for the Intel® MPI Library and related cluster tools products. He is currently a Senior Technical Consulting Engineer for Intel's Threading and Performance Analysis group specializing in threading and in the use of Intel® Threading Building Blocks. He holds an MS in Applied Math from Nizhny Novgorod State University (NNSU), which he received in 2003. In 2007 he earned a PhD from Saratov State University.<br /><br /><b>Michael D'Mello</b> has spent the last 18 years in the computer industry specializing in Parallel Computing. Past employers include Thinking Machines Corporation, Convex Computer Corporation, and the Hewlett-Packard Company. He has been with Intel since 2003 and is currently a Senior Technical Consulting Engineer in Intel's Developer Products Division. He holds a Ph.D. 
in Chemical Dynamics from the University of Texas, Austin.<br /><br /><br /><br />
<p align="left">*Other names and brands may be claimed as the property of others.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/parallelization-of-smoke-gaming-demo-via-intel-threading-building-blocks-1</link>
      <pubDate>Mon, 02 Nov 2009 16:09:37 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/parallelization-of-smoke-gaming-demo-via-intel-threading-building-blocks-1#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/parallelization-of-smoke-gaming-demo-via-intel-threading-building-blocks-1</guid>
      <category>Visual Computing</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Deferred Mode Image Processing Framework: Simple and efficient use of Intel® multi-core technology and many–core architectures with Intel® Integrated Performance Primitives</title>
      <description><![CDATA[ <h1 class="sectionHeading">Introduction</h1>
In recent years, the resolution of image sensors has increased significantly, but image processing algorithms have not improved in efficiency. The bottleneck for nearly all image processing algorithms is memory access, and access time increases significantly when image data resides outside the L2 cache. Even with more computing power available, larger images make high performance harder to achieve because they increase the number of cache misses. To improve the performance of image processing applications, developers need to ensure that data is kept in the L2 cache as long as possible, and that parallelized image processing uses a slice size comparable to the L2 cache size. The Deferred Mode Image Processing (DMIP) framework, available with the Intel® Integrated Performance Primitives (Intel® IPP) product, provides an efficient software solution for this dilemma.<br /><br />DMIP provides both data parallel and task parallel frameworks built on Intel® Integrated Performance Primitives for full utilization of the computational power available on modern multi-core and many-core Intel architectures for image processing tasks.<br /><br /><br />
<h1 class="sectionHeading">DMIP in IPP 6.1</h1>
Intel® IPP version 6.1 introduces the DMIP component's support for the following image processing operations:<br />
<ul>
<li>Basic arithmetic unary and two-place operations (+, -, *, /, Abs, Ln, Min, Max, Sqrt etc)</li>
<li>Logical operations (&amp;, |, ~, etc)</li>
<li>Thresholding operations</li>
<li>Image type and channels conversion</li>
<li>Color conversion</li>
<li>Statistical operations</li>
<li>Image filters: 
<ul>
<li>General filters</li>
<li>Special filters (Box, Min, Max, Median, etc)</li>
<li>Fixed kernel filters (Sobel, Prewitt, Scharr, etc)</li>
<li>Morphology (erosion, dilation)</li>
</ul>
</li>
<li>Linear transform DFT/FFT</li>
<li>Polyadic operations with up to 5 arguments</li>
</ul>
DMIP has more than thirty built-in nodes to support these operations, which translates to around 1800 atom operations (IPP functions) across data types and color channels for image processing. In addition to the built-in nodes, developers can add user-defined nodes to DMIP.<br /><br />To see how DMIP works, let's consider a simple image processing task, edge detection using differentiation operators:<br /><br /><img src="http://software.intel.com/file/23398" /><br /><br />where I(x,y) is the source image and e(x,y) is the destination image with detected edges. For this case the source image is a 3-channel 8-bit image and the output image is an 8-bit grayscale image, which requires an additional color conversion of the source image. For calculation of the partial derivatives, we will use Sobel operators.<br /><br />As a first step, consider the following IPP code for the Sobel-based edge detector:<br /><br />
<pre name="code" class="cpp">IppiSize roi; // Input and output image sizes<br />Ipp8u* pSrcImg; int srcImgStep; // Input C3, 8u image in IPP image format<br />Ipp8u* pDstImg; int dstImgStep; // Output C1, 8u image in IPP image format<br /><br />Ipp8u* pGrayData; int grayDataStep; // Intermediate grayscale image<br /><br />Ipp16s* pIntBuffer; int IntBufferStep; // Intermediate buffer<br /><br />Ipp8u* pSBuffer; int SBufferSize, SBufferSize1; // Sobel operators temporary memory<br /><br />// Allocate memory for grayscale converted image<br />pGrayData = ippiMalloc_8u_C1(roi.width,roi.height,&amp;grayDataStep);<br /><br />// Calculate and allocate temporary filters memory<br />ippiFilterSobelHorizGetBufferSize_8u16s_C1R(roi,ippMskSize3x3,&amp;SBufferSize);<br />ippiFilterSobelVertGetBufferSize_8u16s_C1R(roi,ippMskSize3x3,&amp;SBufferSize1);<br /><br />// Use the larger of the two sizes so one buffer serves both filters<br />SBufferSize = SBufferSize &lt; SBufferSize1 ? SBufferSize1 : SBufferSize;<br /><br />pSBuffer = ippsMalloc_8u(SBufferSize);<br /><br />// Allocate memory for intermediate buffer. <br />// Its size is twice the image size and is used for dx and dy storage.<br />pIntBuffer = ippiMalloc_16s_C1(roi.width,roi.height*2,&amp;IntBufferStep);<br /><br />// Convert input image into grayscale image<br />ippiRGBToGray_8u_C3C1R((const Ipp8u*)pSrcImg, srcImgStep, pGrayData, grayDataStep, roi);<br /><br />// Calculate dx<br />ippiFilterSobelHorizBorder_8u16s_C1R((const Ipp8u*) pGrayData, grayDataStep,<br />               pIntBuffer, IntBufferStep, roi, ippMskSize3x3, ippBorderRepl, 0, pSBuffer);<br />// Calculate dy (stored in the second half of the intermediate buffer)<br />ippiFilterSobelVertBorder_8u16s_C1R((const Ipp8u*) pGrayData, grayDataStep,<br />               pIntBuffer+roi.height*(IntBufferStep/2), IntBufferStep, roi, ippMskSize3x3, ippBorderRepl, 0, pSBuffer);<br /><br />// Take absolute values of dx and dy.<br />ippiAbs_16s_C1IR(pIntBuffer, IntBufferStep, roi);<br />ippiAbs_16s_C1IR(pIntBuffer+roi.height*(IntBufferStep/2), IntBufferStep, roi);<br /><br />// Add them. Edges are detected.<br />ippiAdd_16s_C1IRSfs((const Ipp16s*) pIntBuffer+roi.height*(IntBufferStep/2), IntBufferStep, pIntBuffer, IntBufferStep, roi, 0);<br /><br />// Convert the image with edges to the dst image format<br />ippiConvert_16s8u_C1R( pIntBuffer, IntBufferStep, (Ipp8u*)pDstImg, dstImgStep, roi );<br /><br />// Free resources<br />ippiFree(pGrayData);<br />ippiFree(pIntBuffer);<br />ippsFree(pSBuffer);<br /><br /></pre>
<b>Figure 1.</b> IPP Sobel edge detector code<br /><br />This approach uses the highly optimized, parallelized IPP operations for processing the whole image. Since two intermediate buffers are required for storing the dx and dy partial derivatives, the intermediate memory requirement grows in proportion to the image size. Each operation is performed independently, and internal threads in IPP are synchronized after each operation has completed. In the Sobel edge detector in Figure 1, there are seven sync points per task and a significant number of cache misses.<br /><br />To minimize cache misses and increase performance, we will execute the same task with DMIP. Figure 2 shows a task graph of the Sobel edge detector. The circles in the graph designate operations, while the edges indicate data flow.<br /><br /><img src="http://software.intel.com/file/23399" /><br /><br /><b>Figure 2.</b> Sobel edge detector task graph<br /><br />Figure 3 shows the DMIP code in the symbolic API.<br /><br />
<pre name="code" class="cpp">Image Src(pSrcImg, Ipp8u, IppC3…); // Source image in DMIP format<br />Image Dst(pDstImg, Ipp8u, IppC1…); // Destination image in DMIP format<br /><br />Kernel Kh(idmFilterSobelHoriz,ippMskSize3x3,ipp8u,ipp16s); // Dx operator<br />Kernel Kv(idmFilterSobelVert,ippMskSize3x3,ipp8u,ipp16s); // Dy operator<br /><br />Graph O = ToGray(Src); // Named so that it is detected as a common expression.<br /><br />// Compile and execute task<br />Dst = To8u(Abs(O*Kh)+ Abs(O*Kv));<br /></pre>
<br /><br /><b>Figure 3.</b> The Sobel edge detector, DMIP implementation<br /><br />By default, the DMIP engine determines the optimal slice size to fit into L2 cache. In some cases, the optimal slice size can be changed according to preferences set by the application developer. Figure 4 shows DMIP code in symbolic API with separate compilation and execution processes to control slice size and limit the number of threads, with varying input image.<br /><br />
<pre name="code" class="cpp">Image Src(pSrcImg, Ipp8u, IppC3…); // Source image in DMIP format<br />Image Dst(pDstImg, Ipp8u, IppC1…); // Destination image in DMIP format<br /><br />Kernel Kh(idmFilterSobelHoriz,ippMskSize3x3,ipp8u,ipp16s); // Dx operator<br />Kernel Kv(idmFilterSobelVert,ippMskSize3x3,ipp8u,ipp16s); // Dy operator<br /><br />Graph O = ToGray(Src); // Named so that it is detected as a common expression.<br /><br />// Build task graph<br />Graph G = (To8u(Abs(O*Kh)+ Abs(O*Kv))) &gt;&gt; Dst;<br /><br />// Compile with the slice size chosen by the application<br />G.Compile(sliceSize);<br />// Execute task with the chosen number of threads<br />G.Execute(threadsNumber);<br /></pre>
<br /><br /><b>Figure 4.</b> The modified DMIP Sobel edge detector<br /><br />DMIP uses a compiler to split an image into small data portions (slices), then calls the same IPP functions as in Figure 1. The compilation and execution processes are described in the paper <i><a href="http://www.actapress.com/Abstract.aspx?paperId=32623">Deferred Image Processing in Intel® IPP Library</a></i> (presented at the 2008 Computer Graphics and Imaging Conference) and in the article <a href="http://software.intel.com/sites/billboard/archive/dmip.php">A Landmark in Image Processing: DMIP</a>. <br /><br />Let us review the image processing task. Image processing begins with a filtration (Flt operation), followed by conversion of the filtered data to unsigned char (To8u operation). "Src" refers to the operation of reading from the source image buffer, and "Dst" refers to the operation of writing to the destination image buffer. A parallel execution algorithm splits a slice into sub-slices to allow parallel processing of the sub-slices on the available CPUs, as shown in Figure 5. DMIP does not analyze thread dependencies, so the threads must be synchronized after each operation to avoid a data race condition.<br /><br /><img src="http://software.intel.com/file/23400" /><br /><br /><b>Figure 5.</b> Parallel slice execution.<br /><br /><br />
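The idea of slice-wise execution can be sketched in plain C++, without IPP or DMIP. The sketch below is an illustration under stated assumptions rather than the DMIP engine: the function names are hypothetical, the input is already a single-channel 8-bit image, borders are replicated as with <code>ippBorderRepl</code> in Figure 1, and the slices are processed one after another on a single thread. What it does show is that the dx and dy buffers only need to hold one slice rather than the whole image.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Reads a pixel, replicating the nearest edge pixel outside the image.
static int pixelAt(const std::vector<std::uint8_t>& img, int width, int height, int x, int y)
{
    x = std::min(std::max(x, 0), width - 1);
    y = std::min(std::max(y, 0), height - 1);
    return img[static_cast<std::size_t>(y) * width + x];
}

// Computes e = |dx| + |dy| with 3x3 Sobel kernels, one band of sliceRows rows
// at a time. Every operation runs over the current slice before the next
// slice is touched, so the intermediates stay small.
std::vector<std::uint8_t> sobelSliced(const std::vector<std::uint8_t>& src,
                                      int width, int height, int sliceRows)
{
    assert(width > 0 && height > 0 && sliceRows > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> dst(src.size());
    // Slice-sized intermediate buffers (not image-sized as in Figure 1).
    std::vector<std::int16_t> dx(static_cast<std::size_t>(sliceRows) * width);
    std::vector<std::int16_t> dy(dx.size());

    for (int y0 = 0; y0 < height; y0 += sliceRows) {
        const int rows = std::min(sliceRows, height - y0);
        // Operation 1 and 2: the two Sobel filters over the slice.
        for (int r = 0; r < rows; ++r) {
            for (int x = 0; x < width; ++x) {
                auto p = [&](int ox, int oy) {
                    return pixelAt(src, width, height, x + ox, y0 + r + oy);
                };
                const std::size_t i = static_cast<std::size_t>(r) * width + x;
                dx[i] = static_cast<std::int16_t>(
                    (p(1, -1) + 2 * p(1, 0) + p(1, 1)) - (p(-1, -1) + 2 * p(-1, 0) + p(-1, 1)));
                dy[i] = static_cast<std::int16_t>(
                    (p(-1, 1) + 2 * p(0, 1) + p(1, 1)) - (p(-1, -1) + 2 * p(0, -1) + p(1, -1)));
            }
        }
        // Operations 3 to 5: absolute values, sum, saturating conversion to 8 bits.
        for (int r = 0; r < rows; ++r) {
            for (int x = 0; x < width; ++x) {
                const std::size_t i = static_cast<std::size_t>(r) * width + x;
                const int e = std::abs(static_cast<int>(dx[i])) + std::abs(static_cast<int>(dy[i]));
                dst[static_cast<std::size_t>(y0 + r) * width + x] =
                    static_cast<std::uint8_t>(std::min(e, 255));
            }
        }
    }
    return dst;
}
```

Calling <code>sobelSliced</code> with <code>sliceRows</code> equal to the image height reproduces the whole-image approach, and any smaller slice height gives the same output with proportionally smaller intermediate buffers.<br /><br /><br />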
<h1 class="sectionHeading">Graph optimization technique</h1>
Performance analysis of this DMIP algorithm showed that the parallel slice execution shown in Figure 5 worked well on systems with 2 cores. However, due to the number of sync points, this example of parallel slice execution did not perform well on systems with more than 4 cores. In the previous example, during parallel slice execution there are 9 sync points per slice (8 intermediate and 1 between slices), that is, 9 times the number of slices per task. By using DMIP's graph optimization technique for whole algorithm analysis, sync points were reduced or eliminated, improving performance significantly on multi-core and many-core systems.<br /><br />DMIP uses an internal representation of the graphs to analyze a whole task, providing four levels of graph optimization. Each level is designed for specific aspects of workloads and, in general, is aimed at reducing the number of sync points. DMIP was designed to execute an arbitrary image processing task represented as a directed acyclic graph, and the optimization modes were developed without placing any restriction on the task graph.<br /><br />DMIP graph optimization supports several modes:<br /><br />0. Normal mode - simple parallel execution. See Figure 5.<br /><br />1. Light optimization mode - extraction of a branch with linear chains of nodes.<br /><br />2. Medium optimization mode - translation of the graph into a multilayer representation to extract independent or locally independent operations. This mode provides operations for task parallelization.<br /><br />3. High optimization mode - merging of graph layers.<br /><br />4. Aggressive optimization mode - removing sync points between slices.<br /><br />In the normal mode (0), a task is executed "as is", i.e., sequentially with synchronization after each operation.<br /><br />In the light optimization mode (1), the DMIP graph compiler extracts the chains of nodes and combines them into a container, which we will call 'SuperNode'. 
The nodes in the extracted linear chain are executed without intermediate synchronization, reducing the number of nodes and the number of sync points. For the Sobel edge detector example, 4 linear chains are extracted: <br /><br />1. Src-&gt;ToGray<br />2. Sv-&gt;Abs<br />3. Sh-&gt;Abs<br />4. Add-&gt;To8u-&gt;Dst<br /><br />There is one restriction: a filter operation must be the first node in a linear chain, which is needed to prepare the border pixels before filtering the fully processed data slice at the sub-slice level.<br /><br />In the medium optimization mode (2), the DMIP graph compiler uses linear chains to extract locally independent operations, i.e., a layer. All nodes on the same layer are independent, so they can be executed either 1) in parallel (task parallelism) or 2) sequentially but without sync points between them. The task execution strategy depends on the target micro-architecture. Mode (2) introduces two major task characteristics: task width and task height.<br /><br /><b>Definition.</b> <i>Task height - number of layers.</i><br /><b>Definition.</b> <i>Task width - maximum number of the nodes inside a layer.</i><br /><b>Definition.</b> <i>Thin Task - task width is equal to 1.</i><br /><br />A wider, shorter graph (greater width, smaller height) is more parallel, having less sequential code. Thin tasks cannot be accelerated using mode (2) because there are no parallel regions inside such tasks. For the Sobel edge detector example, the task height is 3 and the task width is 2. The first SuperNode, Src-&gt;ToGray, belongs to the first layer. The second and third SuperNodes, Sv-&gt;Abs and Sh-&gt;Abs, belong to the same layer, forming two task parallel regions (see Figure 2). The last SuperNode, Add-&gt;To8u-&gt;Dst, is in the third layer.<br /><br />The other two graph optimization modes (high and aggressive) are generally targeted at acceleration of "thin" tasks. 
During execution in these modes the use of a filter operation in the task is analyzed to decide which layers can be merged and whether the slices can be executed without intermediate thread synchronization.<br /><br />DMIP API to control graph optimization mode:<br /><br />
<pre name="code" class="cpp">// Set the desired optimization mode<br />idmStatus Control::SetGraphOptLevel(int level);<br />// Retrieve current graph optimization mode<br />int Control::GetGraphOptLevel(void);<br /></pre>
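<br /><br />The linear chain, task height and task width definitions given earlier for modes (1) and (2) can be made concrete with a small graph analysis in plain C++. This is a sketch of the definitions only, not the DMIP graph compiler: <code>TaskShape</code> and <code>analyzeGraph</code> are hypothetical names, nodes are assumed to be numbered in topological order, and the rule that a filter must start a chain is not modeled (in the Sobel graph the filters already start their chains).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct TaskShape {
    std::size_t chains;  // number of linear chains (SuperNodes)
    std::size_t height;  // number of layers
    std::size_t width;   // maximum number of SuperNodes in one layer
};

// Nodes are 0..nodeCount-1 and must be numbered in topological order,
// i.e. every edge goes from a lower index to a higher index.
TaskShape analyzeGraph(std::size_t nodeCount,
                       const std::vector<std::pair<std::size_t, std::size_t>>& edges)
{
    std::vector<std::vector<std::size_t>> preds(nodeCount);
    std::vector<std::size_t> outDeg(nodeCount, 0);
    for (const auto& e : edges) {
        assert(e.first < e.second && e.second < nodeCount);
        preds[e.second].push_back(e.first);
        ++outDeg[e.first];
    }

    // A node continues its predecessor's chain only when the link is
    // one-to-one: a single predecessor that has a single successor.
    std::vector<std::size_t> chainOf(nodeCount);
    std::size_t chains = 0;
    for (std::size_t v = 0; v < nodeCount; ++v) {
        const bool continues = preds[v].size() == 1 && outDeg[preds[v][0]] == 1;
        chainOf[v] = continues ? chainOf[preds[v][0]] : chains++;
    }

    // Layer of a chain = 1 + the deepest layer among the chains feeding it.
    std::vector<std::size_t> layerOf(chains, 0);
    for (std::size_t v = 0; v < nodeCount; ++v) {
        for (std::size_t u : preds[v]) {
            if (chainOf[u] != chainOf[v]) {
                layerOf[chainOf[v]] = std::max(layerOf[chainOf[v]], layerOf[chainOf[u]] + 1);
            }
        }
    }

    const std::size_t height =
        chains == 0 ? 0 : *std::max_element(layerOf.begin(), layerOf.end()) + 1;
    std::vector<std::size_t> perLayer(height, 0);
    for (std::size_t l : layerOf) ++perLayer[l];
    const std::size_t width =
        height == 0 ? 0 : *std::max_element(perLayer.begin(), perLayer.end());
    return {chains, height, width};
}
```

Numbering the Sobel graph as Src=0, ToGray=1, Sh=2, Sv=3, Abs(h)=4, Abs(v)=5, Add=6, To8u=7, Dst=8 and passing its nine edges yields 4 chains, height 3 and width 2, in agreement with the values stated above; a purely sequential graph yields one chain with height 1 and width 1, which is a thin task.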
<br /><br />The performance benefit that can be achieved in each DMIP mode depends on the workload's graph configuration. In some cases, especially in pixel-wise workloads, the task can be executed without any intermediate sync points. Figure 6 compares the numbers of synchronization points in the Sobel edge detector implemented for each DMIP optimization mode. Figure 7 shows speed-up of the Sobel edge detector implemented with the aggressive optimization mode.<br /><br />
<table border="0" cellpadding="0" cellspacing="0" class="tableformat1">
<tbody>
<tr>
<td>Mode</td>
<td>Number of sync points</td>
</tr>
<tr>
<td>Normal</td>
<td>9 per slice</td>
</tr>
<tr>
<td>Light</td>
<td>4 per slice</td>
</tr>
<tr>
<td>Medium</td>
<td>3 per slice</td>
</tr>
<tr>
<td>High</td>
<td>2 per slice</td>
</tr>
<tr>
<td>Aggressive</td>
<td>1 per slice</td>
</tr>
</tbody>
</table>
<b><br />Figure 6.</b> Intermediate sync points per mode<br /><br /><img src="http://software.intel.com/file/23401" /><br /><br /><b>Figure 7.</b> Speed-up of the Sobel edge detector using aggressive optimization mode<br /><br />In the case of a grayscale input image we see improved performance in the Sobel-based edge detector if we prepare the image border pixels before graph execution, as shown below.<br /><br />
<pre name="code" class="cpp">Image SrcI(pSrcImg, Ipp8u, IppC1…); // Source image in DMIP format<br />Image DstI(pDstImg, Ipp8u, IppC1…); // Destination image in DMIP format<br /><br />Kernel Kh(idmFilterSobelHoriz,ippMskSize3x3,ipp8u,ipp16s); // Dx operator<br />Kernel Kv(idmFilterSobelVert,ippMskSize3x3,ipp8u,ipp16s); // Dy operator<br /><br />SrcI.CopyBorder(…); // Create image border.<br />Graph O = Src(SrcI); // Named so that it is detected as a common expression.<br /><br />// Compile and execute task<br />DstI = To8u(Abs(O*Kh)+ Abs(O*Kv));<br /><br /></pre>
In this example, the aggressive optimization mode executes the entire task without any intermediate sync points because the filter node, which contains pre-calculated border pixels, executes immediately after SrcNode.<br /><br /><br />
<h1 class="sectionHeading">DMIP Task examples</h1>
The IPP DMIP component includes a simple console-based example, "dmip_bench", which demonstrates the benefits of DMIP over a traditional task implementation with IPP. Several of its workloads are considered below from the optimization point of view.<br /><br /><b>Harmonization filter</b><br /><br /><img src="http://software.intel.com/file/23402" /><br /><br />Where <img src="http://software.intel.com/file/23403" /> is a box filter, <img src="http://software.intel.com/file/23404" /> and <img src="http://software.intel.com/file/23405" /> are thresholds and <i>c</i> is a constant.<br /><br />DMIP code:<br /><br />
<pre name="code" class="cpp">Image A(…); // Source image in DMIP format<br />Image D(…); // Destination image in DMIP format<br /><br />Kernel K(idmFilterBox,…); // Box filter kernel<br />Ipp32f c;<br />Ipp8u Tmax, Tmin;<br /><br />Graph O = To32f(A); // Named so that it is detected as a common expression.<br /><br />// Compile and execute task<br />D = Max(Min(To8u(O-(O-O*K)*c),Tmax),Tmin);<br /></pre>
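<br /><br />For reference, the expression in the DMIP code above can be written out per pixel in plain C++. This is a sketch under stated assumptions, not the DMIP or IPP implementation: <code>harmonize</code> is a hypothetical name, the box filter is taken to be a 3x3 mean with replicated borders (the kernel size is elided in the original), the image is single-channel, and To8u is assumed to round to nearest and saturate.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// D = Max(Min(To8u(O - (O - O*K)*c), Tmax), Tmin), with O = To32f(A).
std::vector<std::uint8_t> harmonize(const std::vector<std::uint8_t>& a,
                                    int width, int height, float c,
                                    std::uint8_t tMin, std::uint8_t tMax)
{
    assert(width > 0 && height > 0 && tMin <= tMax);
    assert(a.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> d(a.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // O*K: 3x3 box (mean) filter with replicated borders
            // (assumed kernel size and border mode).
            float sum = 0.0f;
            for (int oy = -1; oy <= 1; ++oy) {
                for (int ox = -1; ox <= 1; ++ox) {
                    const int sx = std::min(std::max(x + ox, 0), width - 1);
                    const int sy = std::min(std::max(y + oy, 0), height - 1);
                    sum += a[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            const float o = a[i];                       // To32f(A)
            const float v = o - (o - sum / 9.0f) * c;   // O - (O - O*K)*c
            // To8u: round to nearest and saturate to [0, 255] (assumed rule).
            const float r = std::min(std::max(std::round(v), 0.0f), 255.0f);
            const std::uint8_t u = static_cast<std::uint8_t>(r);
            d[i] = std::max(std::min(u, tMax), tMin);   // Max(Min(..., Tmax), Tmin)
        }
    }
    return d;
}
```

The box filter is the only operation here that reads neighboring pixels, which is why it is the node that constrains how layers and slices can be merged; every other step is pixel-wise.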
<br /><br />This task has 3 linear chains; the longest linear chain contains 5 operations. Task width is 1 and task height is 4, resulting in a thin task that has no locally independent operations, so medium optimization mode provides no benefit. Since the box filter resides in the second layer, inter-slice optimization cannot be done. Using the high optimization mode, sync points after the box filter operation were removed, resulting in 1 sync point per slice with the task speed-up shown in Figure 8.<br /><br /><img src="http://software.intel.com/file/23406" /><br /><br /><b>Figure 8.</b> Speed-up of the Harmonization filter with high optimization mode<br /><br /><b>Exponential brightness correction</b><br /><br /><img src="http://software.intel.com/file/23407" /> , where <img src="http://software.intel.com/file/23408" />, <img src="http://software.intel.com/file/23409" /> and <img src="http://software.intel.com/file/23410" /> are constants.<br /><br />This task has 1 linear chain that incorporates all operations in the graph. Task width is 1 and task height is 1, and as in the previous task, this is a thin task that will see no benefit from medium optimization mode. This is a pixel-wise task, so there are no inter-slice dependencies. Using the aggressive optimization mode, there were 0 sync points per slice, with the task speed-up shown in Figure 9.<br /><br /><img src="http://software.intel.com/file/23411" /><br /><br /><b>Figure 9.</b> Speed-up of exponential brightness correction operation using aggressive optimization mode<br /><br /><b>Sepia toner</b><br /><br />A sepia filter operation calculates the pixel values according to the following formulae:<br /><br /><img src="http://software.intel.com/file/23412" /><br /><br />Where R, G, B is a pixel color triplet. 
<img src="http://software.intel.com/file/23413" /> , <img src="http://software.intel.com/file/23414" /> and <img src="http://software.intel.com/file/23415" /> are the color transformation coefficients, and <img src="http://software.intel.com/file/23416" />, <img src="http://software.intel.com/file/23417" /> and <img src="http://software.intel.com/file/23418" /> are the sepia tone colors.<br /><br />
<p style="text-align: center;"><img src="http://software.intel.com/file/23419" /></p>
<br /><br /><b>Figure 10.</b> Original (on the left) and sepia-toned image (on the right).<br /><br />This task has 1 linear chain that incorporates all 3 operations in the graph. Task width is 1 and task height is 1, so this thin task cannot benefit from the medium optimization mode. This is a pixel-wise task, so there are no inter-slice dependencies. Aggressive optimization mode results in zero sync points per slice. In this task, the slice size (in bytes) changes from operation to operation as the image is transformed from RGB to grayscale and then back to RGB color format. For each slice, half of the operations use one-third of the available cache size.<br /><br /><img src="http://software.intel.com/file/23420" /><br /><br /><b>Figure 11.</b> Speed-up of Sepia toner using aggressive optimization mode<br /><br /><br />
<h1 class="sectionHeading">Separable 2D FFT</h1>
Not all image operations are ready "as is" to be processed efficiently in the DMIP framework, so DMIP supports various techniques for restructuring operations efficiently. For example, histogram calculation requires extraction of the prologue, accumulation, and epilogue phases. The 2D FFT in DMIP is implemented in a separable way, consisting of three parts: 1D FFT calculation of the rows, transposition, and 1D FFT calculation of the columns as transposed rows. This computation structure uses pipelining to reduce overall jumps in memory during the columnar FFT calculations. Figure 12 shows the performance of the DMIP 2D FFT in comparison with the threaded IPP primitive.<br /><br /><img src="http://software.intel.com/file/23421" /><br /><br /><b>Figure 12.</b> Speed-up of the 2D forward FFT, IPP and DMIP+IPP versions, CPU clocks per element (fewer clocks = better performance)<br /><br /><br />
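<p>The rows / transpose / rows structure described above can be sketched as follows. This is a minimal illustration, not DMIP or IPP code: it uses a naive 4-point DFT instead of a real FFT, and the names <code>cplx</code>, <code>dft_rows</code>, <code>transpose</code> and <code>dft2d</code> are hypothetical. A final transpose is added here so the result comes back in the usual orientation.</p>

```c
#include <stddef.h>

#define N 4  /* illustrative image side; the 4-point twiddles below are exact */

typedef struct { double re, im; } cplx;

/* Twiddle factors exp(-2*pi*i*k/4) for k = 0..3: 1, -i, -1, i. */
static const cplx W4[N] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };

/* In-place naive 1D DFT of every row of an N x N image. */
static void dft_rows(cplx img[N][N])
{
    for (size_t r = 0; r < N; ++r) {
        cplx out[N];
        for (size_t k = 0; k < N; ++k) {
            cplx acc = {0, 0};
            for (size_t n = 0; n < N; ++n) {
                cplx w = W4[(k * n) % N];
                acc.re += img[r][n].re * w.re - img[r][n].im * w.im;
                acc.im += img[r][n].re * w.im + img[r][n].im * w.re;
            }
            out[k] = acc;
        }
        for (size_t k = 0; k < N; ++k)
            img[r][k] = out[k];
    }
}

/* In-place transpose of a square image. */
static void transpose(cplx img[N][N])
{
    for (size_t r = 0; r < N; ++r)
        for (size_t c = r + 1; c < N; ++c) {
            cplx t = img[r][c];
            img[r][c] = img[c][r];
            img[c][r] = t;
        }
}

/* Separable 2D DFT: transform the rows, transpose, transform the rows
   again (these are the original columns), then transpose back. */
void dft2d(cplx img[N][N])
{
    dft_rows(img);
    transpose(img);
    dft_rows(img);
    transpose(img);
}
```

<p>Because the column pass runs over transposed rows, both 1D passes walk memory contiguously, which is the memory-access benefit the separable layout is after.</p>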
<h1 class="sectionHeading">Architecture topology utilization</h1>
For efficient operation, DMIP uses the information it detects about the microprocessor: CPU type, number of physical cores, whether Intel® Hyper-Threading™ Technology (Intel® HT Technology) is enabled, number of cores sharing the same cache, etc. For example, the slice size is configured using knowledge of the cache size and the number of threads that share the same cache. DMIP always balances the cost of evicting data from the cache against full utilization of the cache and cores. Thread affinity is used to minimize the cost of thread context save/restore and OS thread migration. Each working thread is assigned to a corresponding core to minimize data migration from cache to cache. For example, a 2x Intel® Core™2 Quad processor-based architecture has 8 physical cores, so DMIP working threads are assigned to the cores in consecutive order.<br /><br />The Intel® Core™ i7 microprocessor has 4 physical cores with Intel® HT Technology enabled, so 8 hardware threads are supported. However, with this microprocessor, DMIP only uses 4 threads because DMIP is already highly optimized, so the additional Intel® HT Technology threads do not increase performance.<br /><br />Intel plans an API for Intel® IPP topology extraction and utilization for a future release of Intel® IPP.<br /><br /><br />
<h1 class="sectionHeading">Many-Core Architecture Challenges</h1>
Future Intel® architectures will have more than 8 cores. Video cards with many cores and threads that are currently being developed, as well as other many-core architectures, will require a slightly different parallelization approach, since they support significantly improved thread synchronization, and a different choice of DMIP internal processing data unit: slice or tile. Where pipelining with slice processing is not suitable for a task on a many-core architecture, DMIP can extract a parallel region in the task and use all threads to execute that region for best performance.<br /><br />For thin, high tasks, it is more efficient to transition from slice-based pipelining to tile-based pipelining, since the number of sync points is minimal and equal to the task height. Using a tile as the atomic data unit increases the number of independent units processed and requires implementation of dynamic thread load-balancing.<br /><br /><br />
<h1 class="sectionHeading">Conclusion</h1>
The Intel IPP DMIP library provides a solution for effective multi- and many-core architecture utilization for image processing tasks. It incorporates algorithm analysis for task acceleration on many-core architecture, utilizing all available cores in the most efficient way. DMIP also provides an intuitive API to simplify development, optimization and parallelization of image processing applications.<br /><br /><br />
<h1 class="sectionHeading">Online Resources</h1>
<ul>
<li>Intel® IPP Web site:<br />(<a href="http://www.intel.com/software/products/ipp">http://www.intel.com/software/products/ipp</a>)</li>
<li>Intel® IPP Technical Support Resource<br />(<a href="http://www.intel.com/software/products/support/ipp">http://www.intel.com/software/products/support/ipp</a>)</li>
<li>Intel® IPP Forum:<br />(<a href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/1274/ShowForum.aspx">http://softwarecommunity.intel.com/isn/Community/en-US/forums/1274/ShowForum.aspx</a>)</li>
<li>DMIP Manual (dmipman.pdf) is included with the Intel® IPP product files.</li>
<li>Dr. Dobb's Journal, 26 Apr 2009. Intel's Integrated Performance Primitives and Deferred Mode Image. (<a href="http://www.ddj.com/217100379?cid=RSSfeed_DDJ_All">http://www.ddj.com/217100379?cid=RSSfeed_DDJ_All</a>)</li>
</ul>
<h1 class="sectionHeading">About the Authors</h1>
<p><strong>Igor Belyakov</strong> is a software engineer in Intel's Integrated Performance Primitives Engineering team within the Software and Services Group (SSG). Igor is a technical leader of the DMIP project, specializing in automatic optimization and parallelization of image processing and computer vision algorithms on multi-core and many-core architectures. Igor also worked in the Speech Coding Team and is an expert in speech coding and in the real-time media data transmission protocol stack. He holds an MS in Mathematics from the Moscow State University, Mathematic and Mechanical Department.</p>
<p><strong>Bonnie Aona</strong> is a software engineer in the Intel Compilers and Languages Group within the Software and Services Group (SSG), working on optimizing and testing applications to take advantage of the latest Intel software and hardware innovations to achieve high performance and parallelism. Bonnie's career combines Software Quality Assurance and program management with software design for complex high-performance applications in computer graphics, real-time systems, scientific research, manufacturing, e-Commerce, aerospace and healthcare. She holds Master's degrees in Electrical and Computer Engineering from University of California at Davis.</p>
<br /><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/deferred-mode-image-processing-framework-simple-and-efficient-use-of-intel-multi-core-technology-and-manycore-architectures-with-intel-integrated-performance-primitives</link>
      <pubDate>Thu, 29 Oct 2009 17:05:49 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/deferred-mode-image-processing-framework-simple-and-efficient-use-of-intel-multi-core-technology-and-manycore-architectures-with-intel-integrated-performance-primitives#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/deferred-mode-image-processing-framework-simple-and-efficient-use-of-intel-multi-core-technology-and-manycore-architectures-with-intel-integrated-performance-primitives</guid>
      <category>Visual Computing</category>
    </item>
    <item>
      <title>Miser – A Dynamically Loadable Memory Allocator for Multi-Threaded Applications</title>
      <description><![CDATA[ <b>by Barry Tannenbaum</b><br /><br />(This is a follow-up to our earlier post on <a href="http://software.intel.com/en-us/articles/multicore-storage-allocation">multicore storage allocation</a>.)
<p> </p>
<p>While working with an early Cilk++ adopter, it quickly became apparent that the default memory allocator shipped with the Windows C Run-Time Library can be a bottleneck in a multithreaded application. The Windows memory allocator has a single lock which it uses to serialize access to its internal structures. While this is safe, it proved to cause a serious loss of parallelism in the customer’s application.</p>
<p>There are lots of memory allocators available, both open source and commercial. We looked at using a number of them but didn't find any that had the combination of features we were looking for. Ultimately we chose to write our own. Since the customer's program was a Windows application, we initially concentrated on the Windows implementation.</p>
<h1 class="sectionHeading">Features borrowed from Hoard</h1>
<p>We decided to build the new allocator inspired by the principles in <a target="_blank" href="http://www.cs.umass.edu/~emery/hoard/asplos2000.pdf">Hoard: A Scalable Memory Allocator for Multithreaded Applications</a> by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe and Paul R. Wilson. Hoard is designed to minimize lock contention between threads, false sharing and memory drift. Detailed information about Hoard can be found at the <a target="_blank" href="http://www.hoard.org/">Hoard website</a>.</p>
<p>We named our new memory allocator "Miser" to play on its relationship to Hoard. We have implemented Miser for both Linux and Windows, and this post describes data structures common to both, several Windows-specific implementation details, and several Linux-specific notes at the end.</p>
<h4>Superblocks</h4>
<p>The basic unit of allocation in Miser is a "superblock." A superblock holds a header, an array that records the size of each requested block of memory, and an array of same-sized memory blocks, or bins, which can be used to satisfy memory requests. Bin sizes are powers of 2, from 8 bytes through 256 bytes. Studies have shown that this will satisfy over 98% of memory requests. Requests for larger blocks are forwarded to the C Run-Time Library. The array of bins is placed against the back of the superblock, so memory blocks returned to the user are always naturally aligned.</p>
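<p>As a rough illustration of the power-of-2 bins, the following sketch maps a request size to a bin size. The function name <code>bin_size_for</code> and the convention of returning 0 for "forward to the C Run-Time Library" are assumptions made for this example, not Miser's actual interface.</p>

```c
#include <stddef.h>

static const size_t MIN_BIN = 8;    /* smallest bin, in bytes */
static const size_t MAX_BIN = 256;  /* largest bin, in bytes  */

/* Returns the bin size (8, 16, ..., 256) that would hold `request` bytes,
   or 0 if the request is too large and should go to the C runtime. */
size_t bin_size_for(size_t request)
{
    if (request > MAX_BIN)
        return 0;

    size_t bin = MIN_BIN;
    while (bin < request)
        bin <<= 1;  /* step up to the next power of 2 */
    return bin;
}
```

<p>With this mapping a 9-byte request lands in the 16-byte bin, so rounding can waste up to just under half of a block.</p>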
<h4>Global and Per-Thread Pools</h4>
<p>Miser, like Hoard, features a private pool of memory for each thread. Because the pool is thread-private, it can be accessed without locking. A pool is made up of superblocks which are kept in a roughly sorted order based on how much space is available in each superblock. The use of the per-thread pools is fundamental to improving application performance:</p>
<ul>
<li>Per-thread pools allow Miser to satisfy most memory requests without locking. </li>
<li>Per-thread pools limit false sharing. Unless a block is passed from one thread to another, all use of the memory accessed through a cache line should be from one thread. </li>
</ul>
<p>The global pool is not owned by any thread. If a per-thread pool cannot satisfy a request, it will take a superblock from the global pool. If the global pool is empty, Miser will allocate more memory from the operating system. As memory is released to a per-thread pool, it will contribute superblocks to the global pool to prevent memory drift.</p>
<h1 class="sectionHeading">Additional Features</h1>
<p>While the features inspired by Hoard provided a good foundation, we had the following additional requirements for Miser:</p>
<ul>
<li>It must be able to be loaded dynamically. This allows a component shipped as a shared library to use Miser instead of requiring that the application load it at startup. </li>
<li>It must be able to recognize blocks that it had not allocated, and pass those to the system C Run-Time Library. </li>
<li>It must be thread-safe and fast. </li>
<li>It must be able to simultaneously support the multiple versions of the C Run-Time Library that may be in use by an application. </li>
</ul>
<h4>Loading Miser dynamically</h4>
<p>The format for a Windows module, either an executable or a Dynamic-Link Library (DLL), is defined by the Windows Portable Executable File Format. Almost every Windows module imports functionality from some other module. These dependencies are described by a pair of tables: the Import Name Table and the Import Address Table (IAT). When a module is loaded into memory, the loader will "snap" each address in the Import Address Table to the entrypoint in the destination module. All calls to that function from the module will be made indirectly through this cell.</p>
<p>When Miser is loaded into an application, it will go through each of the modules in the application and redirect the entries in the Import Address Table for each of the C Run-Time Library memory allocation functions to its own entrypoint. This technique was first described by Matt Pietrek in <a href="http://msdn.microsoft.com/en-us/library/ms809762.aspx">Peering Inside the PE: A Tour of the Win32 Portable Executable File Format</a>. Since the modification of the IAT entry is a 32-bit write, it’s an atomic operation; the IAT entry is either the C Run-Time Library address or the Miser address.</p>
<p>Once Miser has been loaded into a process, it can never be unloaded. While Miser could restore the original values of the C Run-Time Library functions in the IATs, the C Run-Time Library functions cannot handle memory allocated by Miser.</p>
<h4>Recognizing blocks not allocated by Miser</h4>
<p>When Miser loads into the application, the application may have already allocated memory. There is no way for Miser to replace those blocks with blocks from its own resources. Since Miser redirects all calls to <code>free()</code> and <code>_msize()</code> to itself, it must be able to recognize memory that has been allocated by the C Run-Time Library and pass those calls to the appropriate version of the C Run-Time Library.</p>
<p>The Windows OS provides a number of user-mode APIs to manage memory. One of the most fundamental is <code>VirtualAlloc()</code> which allocates memory in 64K blocks on 64K boundaries. Miser will subdivide these into 4K superblocks, each of which starts on a 4K boundary. Not coincidentally, this is the page size for the 32-bit x86 Windows OS. The first thing in each superblock is a header which contains a "magic number" and other control information. So by masking off the low 12 bits of a memory address, we should find the header for the superblock containing the memory block. There are two checks to validate a superblock.</p>
<ol>
<li>The superblock header contains the correct "magic number." </li>
<li>The superblock header contains a pointer to the per-thread pool owning the superblock and an index into an array of pointers to the superblocks owned by the per-thread pool. The entry in the array should match the address of the superblock. </li>
</ol>
<p>If either of these tests fails, Miser will pass the call off to the Windows C Run-Time Library.</p>
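<p>The address masking and the two validation checks can be sketched as below. The structure layouts, the magic value, and the names <code>superblock_of</code> and <code>is_miser_block</code> are hypothetical; only the 4K alignment, the magic number, and the owner/index cross-check come from the description above.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define SUPERBLOCK_SIZE  4096u
#define SUPERBLOCK_MAGIC 0x4D495352u  /* hypothetical value, not Miser's */

struct pool;

/* Hypothetical header placed at the start of every 4K superblock. */
struct superblock {
    uint32_t     magic;  /* check 1: must equal SUPERBLOCK_MAGIC      */
    struct pool *owner;  /* per-thread pool that owns this superblock */
    size_t       index;  /* slot in owner->superblocks                */
};

/* Hypothetical per-thread pool with its array of owned superblocks. */
struct pool {
    struct superblock **superblocks;
    size_t              count;
};

/* Mask off the low 12 bits to find the candidate superblock header. */
struct superblock *superblock_of(const void *p)
{
    return (struct superblock *)((uintptr_t)p & ~(uintptr_t)(SUPERBLOCK_SIZE - 1));
}

/* Returns nonzero only if both checks pass. A real allocator must also
   consider that a foreign page could contain the magic value by chance. */
int is_miser_block(const void *p)
{
    const struct superblock *sb = superblock_of(p);

    if (sb->magic != SUPERBLOCK_MAGIC)
        return 0;                                    /* check 1 failed */

    const struct pool *owner = sb->owner;
    if (owner == NULL || sb->index >= owner->count)
        return 0;
    return owner->superblocks[sb->index] == sb;      /* check 2 */
}
```

<p>Reading the header is safe for any valid heap pointer because the header lies on the same 4K page as the block being tested.</p>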
<p>An additional benefit of being able to detect memory allocated by the C Run-Time Library and pass the request on is that Miser can pass any requests for more than 256 bytes off to the C Run-Time Library.</p>
<h4>Supporting multiple versions of the C Run-Time Library</h4>
<p>Each module loaded into an application can be built against a different version of the C Run-Time Library. In addition to the side-by-side support built into the OS, recent versions of the C Run-Time Library have had their version number in the DLL name. Miser supports all versions of the C Run-Time Library since Visual Studio .NET 2002. Since each version of the C Run-Time Library has its own heap, Miser keeps track of which version of the C Run-Time Library it has hooked and will forward calls to the appropriate version.</p>
<h4>Speed and thread-safety</h4>
<p>Thread-safety and speed are intimately tied. The more the code needs to take out a lock, the greater the chance that it will have to wait for some other thread. For most operations, Miser does not need to lock; it can usually satisfy a request from resources already available in the per-thread pool. Miser must take out a lock in the following circumstances:</p>
<ol>
<li><b>Accessing the global pool</b> – only one thread may be allowed to access anything in the global pool at any time. Any thread that attempts to access the global pool must lock it first, serializing access. </li>
<li><b>Freeing a block allocated from another per-thread pool</b> – Unlike Hoard, Miser does not maintain a lock in the superblock header. Blocks to be freed in some other per-thread pool are placed on that per-thread pool's remote free queue using interlocked instructions. The remote free queue is drained before attempting to allocate a block in the hopes that a freed block may satisfy the request. </li>
<li><b>Supporting <code>_msize()</code> for a block allocated by another thread</b> – The <code>_msize()</code> function allows a Windows application to query the C Run-Time Library about the size of a block of memory. The value returned is the size in bytes that was specified when the block was allocated. If the block was allocated by Miser in the current thread, we can safely access the information in a superblock owned by the per-thread pool. However, if the superblock is owned by another thread’s per-thread pool, then we must lock the per-thread pool before we attempt to validate the superblock against its superblock array. </li>
<li><b>Maintaining the list of superblocks</b> – Each per-thread pool maintains an array of the superblocks that it has allocated. The header for each superblock has a pointer to the per-thread pool that owns it, as well as an index into the per-thread pool superblock array. These are used to validate that a memory address is from a block that Miser allocated. Because another thread may access the array of superblocks to satisfy an <code>_msize()</code> call, the array of superblocks must be locked before it is updated. </li>
</ol>
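<p>The remote free queue from item 2 can be approximated with a lock-free list. The sketch below uses C11 atomics as a portable stand-in for the Windows interlocked instructions; the type and function names are hypothetical, and Miser's real queue layout is not described in this post.</p>

```c
#include <stdatomic.h>
#include <stddef.h>

/* A freed block is reused as a list node; every bin is at least 8 bytes,
   so it can hold one pointer. */
struct free_node {
    struct free_node *next;
};

struct remote_free_queue {
    _Atomic(struct free_node *) head;
};

/* Called by any thread that frees a block owned by another pool. */
void remote_free_push(struct remote_free_queue *q, void *block)
{
    struct free_node *node = block;
    struct free_node *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        node->next = old;  /* link in front of the current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, node,
                 memory_order_release, memory_order_relaxed));
}

/* Called only by the owning thread before it allocates: detaches the
   whole list in one atomic exchange and returns it for reuse. */
struct free_node *remote_free_drain(struct remote_free_queue *q)
{
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}
```

<p>Draining with a single exchange, rather than popping one node at a time, sidesteps the ABA problem that a compare-and-swap pop would have to handle.</p>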
<p>One important thing to note is that while Miser supports Cilk++ well, it is not tied in any way to the Cilk++ compiler or runtime. It should serve equally well for a multithreaded application using any of the threading technologies available.</p>
<h1 class="sectionHeading">Miser Limitations on Windows</h1>
<ul>
<li>Miser does not support any of the C Run-Time Library calls which allocate memory on an address boundary. </li>
<li>Miser assumes program correctness. It does not put any padding between blocks to attempt to detect writes beyond the bounds of the allocated block. Allocated blocks are butted directly against each other; Miser keeps the allocation size in a separate structure to allow it to maintain natural alignment in its bin arrays. </li>
<li>Because of its dependence on the Import Address Table to hook into a module, Miser can only be used with modules that use the DLL forms of the C Run-Time Library. Modules that link against the static C Run-Time Library do not go through the IAT, so Miser cannot hook into them. </li>
</ul>
<h1 class="sectionHeading">Performance</h1>
<p>The following is an example of performance gains achievable with Miser, from an earlier blog post (<a href="http://software.intel.com/en-us/articles/multicore-enabling-the-n-queens-problem-using-cilk">Multicore-enabling the N-Queens Problem Using Cilk++</a>).</p>
<p><img src="http://software.intel.com/file/23125" align="center" /></p>
<h1 class="sectionHeading">Miser on Linux</h1>
<p>Although Miser was initially written for Windows, since it performed better than Hoard in some of our tests, it was ported to Linux. The back-end code remained basically identical except for various system calls (e.g., changing calls to <code>VirtualAlloc()</code> to <code>mmap()</code>, etc.).</p>
<p>On the front-end, the mechanism for accessing the functions was rewritten. The Miser library exports <code>malloc()</code>, <code>free()</code>, <code>calloc()</code>, etc., and linking a binary with it will cause the loader to look for it at runtime. Additionally, preloading it by setting <code>LD_LIBRARY_PATH</code> causes objects to find Miser's allocators before those of the C runtime library. Miser still uses the system <code>malloc()</code> for large requests and it finds them in the default library using <code>dlsym()</code>.</p>
<p>In Linux, if Miser is loaded after a module has resolved the location of allocator functions, Miser does not hook itself in. You can still access Miser's <code>malloc()</code> using <code>dlsym()</code> and getting "malloc" from the library but, as in Windows, memory allocated with Miser must be freed with Miser. The system <code>free()</code> doesn't know what to do with Miser memory.</p>
<a name="Comments"></a>
<div id="listing">
<div class="post">
<h3>COMMENTS</h3>
<div class="Normal" align="left"><a name="comment27642"></a>Are you thinking about porting this to Mac OSX?</div>
<p class="postfoot">posted @ Thursday, January 08, 2009 6:01 PM by Mark</p>
<hr />
<div class="Normal" align="left"><a name="comment27643"></a>This is great but just wonder which version of Windows Memory Manager has used for the performance testing. Since I know Windows LFH (Low Fragmentation Heap: http://msdn.microsoft.com/en-us/library/aa366750(VS.85).aspx) is almost lock-free and more scalable with multicore CPU. From my testing its really hard to beat it.</div>
<p class="postfoot">posted @ Thursday, January 08, 2009 7:16 PM by Chae Lim</p>
<hr />
<div class="Normal" align="left"><a name="comment27648"></a>The comparison was against the C RunTime Library's malloc &amp; friends. Miser requires no code modifications except to load miser.dll into the process. <br /><br />- Barry</div>
<p class="postfoot">posted @ Friday, January 09, 2009 8:04 AM by Barry Tannenbaum</p>
<hr />
<div class="Normal" align="left"><a name="comment27654"></a>I think I watched a channel 9 interview where they said that someone on the kernel team at MS had taken on, and completed, the task of implementing object level locking. <br /><br />I would think that might alter the numbers.</div>
<p class="postfoot">posted @ Friday, January 09, 2009 2:33 PM by <a rel="nofollow" href="http://nickelcode.com">John Bender</a></p>
<hr />
<div class="Normal" align="left"><a name="comment27694"></a>The video you may be referring to is this one: http://channel9.msdn.com/shows/Going+Deep/Mark-Russinovich-Inside-Windows-7/ <br /><br />where Mark Russinovich discusses several enhancements coming in Windows7, among them the work done to dismantle the contention in the dispatcher lock and how it enabled the Windows team to tune the Memory Manager.</div>
<p class="postfoot">posted @ Monday, January 12, 2009 8:21 PM by Rick</p>
<hr />
<div class="Normal" align="left"><a name="comment27732"></a>How does miser compare to the tcmalloc (part of the google performance tools)?</div>
<p class="postfoot">posted @ Thursday, January 15, 2009 9:10 AM by Bradley C. Kuszmaul</p>
<hr />
<div class="Normal" align="left"><a name="comment27733"></a>I've only looked briefly at tcmalloc, but it appears to follow the same basic philosophy of a thread-local memory pool with a central pool that's shared. Differences are: <br /><br />- tcmalloc digs <b>much</b> more deeply into the application than Miser does on Windows, intercepting <b>ALL</b> memory allocation calls including the HeapAlloc and VirtualAlloc family of calls, as well as MapViewOfFile. <br /><br />- tcmalloc handles large allocations which Miser passes on to the CRTL. Our studies showed that allocations larger than 256 bytes were a small percentage of the allocations, so we didn't bother with them. Of course, that may not hold for your application. <br /><br />- tcmalloc will also patch into modules that are linked against the static CRTL, which Miser won't touch. <br /><br />I'll have to run some performance tests and see how they fare against each other. <br /><br />- Barry</div>
<p class="postfoot">posted @ Thursday, January 15, 2009 9:12 AM by <a rel="nofollow" href="http://www.cilk.com">Barry Tannenbaum</a></p>
<hr />
<div class="Normal" align="left"><a name="comment27737"></a>When you call MSVCRT malloc and free that eventually end up HeapAlloc and HeapFree. On Vista, Microsoft has changed memory manager so that normal heap can be converted to LFH on-the-fly when contention occurs between threads. So I recommend you run the perf test on Vista machine. <br /><br />Please check out following slide for LFH. <br /><br />http://www.i.u-tokyo.ac.jp/edu/training/ss/lecture/new-documents/Lectures/16-UserModeHeap/UserModeHeapManager.ppt</div>
<p class="postfoot">posted @ Thursday, January 15, 2009 11:52 AM by Chae Lim</p>
</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications</link>
      <pubDate>Wed, 28 Oct 2009 14:02:06 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications</guid>
      <category>Parallel Programming</category>
    </item>
    <item>
      <title>Multicore Storage Allocation</title>
      <description><![CDATA[ <b>by Charles Leiserson</b><br /><br />When multicore-enabling a C/C++ application, it's common to discover that <code>malloc()</code> (or <code>new</code>) is a bottleneck that limits the speedup your parallelized application can obtain. This article explains the four basic problems that a good parallel storage allocator solves:<br />
<p> </p>
<ol>
<li><b>thread safety,</b></li>
<li><b>overhead,</b></li>
<li><b>contention,</b></li>
<li><b>memory drift.</b></li>
</ol> 
<h1 class="sectionHeading">Thread safety</h1>
<p>Basic storage allocators are not thread safe, although recent efforts have started to remedy this problem for many concurrency platforms. In other words, improper behavior due to <a href="http://software.intel.com/en-us/articles/are-determinacy-race-bugs-lurking-in-your-multicore-application">races</a> on the storage allocator's internal data structures can result from two parallel threads attempting to allocate or deallocate at the same time. When threads have unrestricted access to the storage allocator, as shown below, they may end up "stomping on each other's toes," leading to anomalous behavior.</p>
<div style="text-align: center;"><img src="http://software.intel.com/file/23158" alt="1" border="0" height="227" width="287" /></div>
<p>The simple solution to this problem is for applications to acquire a mutex (mutual exclusion) lock on the allocator before calling <code>malloc()</code> or <code>free()</code>, as illustrated below, which lets only one thread access the allocator's internal data structures at a time.</p>
<div style="text-align: center;"><img src="http://software.intel.com/file/23159" alt="2" border="0" height="314" width="287" /> <br /></div>
<p>If the storage allocator is thread safe, the locking protocol is incorporated into the logic of the storage allocator itself.</p>
<h1 class="sectionHeading">Overhead and contention</h1>
<p>Two problems may arise when an allocator is made thread safe by locking.  The first is that allocation and deallocation may now be slower due to the overhead of locking.  The second is that contention may arise in accessing the storage allocator, which can slow down the application and limit its scalability.  Contention may not be a big problem for 2 or 4 cores, but as <a target="_blank" href="http://en.wikipedia.org/wiki/Moore%27s_law">Moore's Law</a> brings us dozens and even hundreds of cores per chip, contention can threaten scalability.</p>
<p>Both problems can be solved using a distributed allocator, which provides a local storage pool per thread, as illustrated below.</p>
<div style="text-align: center;"><img src="http://software.intel.com/file/23160" alt="3" border="0" height="359" width="286" /></div>
<p>A distributed allocator allows allocation and deallocation to run out of the local storage pool most of the time.  In the uncommon case that a thread's local pool is exhausted, the thread can obtain additional storage, typically in large blocks, from the global pool.  The contention problem is solved, because threads only rarely access the global pool.  The overhead problem is solved as well, because no locking is needed to access the local pool.</p>
<h1 class="sectionHeading">Memory drift</h1>
<p>Unfortunately, local pools introduce yet another problem, especially in concurrency platforms where storage is actively shared among threads or which load-balance a computation across the threads.  One thread A may continually allocate storage out of its local pool and pass it off to another thread B which frees it into its local pool.  When thread A's local pool runs out, it allocates more storage from the global pool.  This storage is passed to B, which proceeds to free it into its local pool.  Over time, B's local pool grows unboundedly, creating something akin to a memory leak, where the virtual-memory footprint of the application continues to grow.<br /><br />This memory drift problem can be solved in two ways.  One solution is for a thread whose local pool becomes too large to return some of its storage to the global pool.   The other is for all threads to return storage to the thread pool where the storage was allocated.  Either method can be implemented with low overhead, and both provide satisfactory solutions to the memory drift problem.</p>
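<p>The drift scenario, and the first of the two remedies, can be reproduced with a toy counter model. Everything here is an assumption made for illustration: the function name <code>simulate_drift</code>, the refill size, and the high-water mark are not taken from any real allocator.</p>

```c
#include <stddef.h>

static const size_t REFILL     = 4;  /* blocks moved per pool transfer     */
static const size_t HIGH_WATER = 8;  /* B returns blocks above this level  */

/* Thread A allocates one block per step and hands it to thread B, which
   frees it into its own local pool. Returns the number of blocks that had
   to be obtained from the operating system (the memory footprint). */
size_t simulate_drift(size_t steps, int return_excess)
{
    size_t local_a = 0, local_b = 0, global = 0, footprint = 0;

    for (size_t i = 0; i < steps; ++i) {
        if (local_a == 0) {              /* A's pool is exhausted */
            if (global >= REFILL)
                global -= REFILL;        /* refill from the global pool */
            else
                footprint += REFILL;     /* global pool empty: ask the OS */
            local_a = REFILL;
        }
        --local_a;                       /* A allocates a block ...        */
        ++local_b;                       /* ... and B frees it locally     */

        if (return_excess && local_b > HIGH_WATER) {
            local_b -= REFILL;           /* remedy: B gives storage back   */
            global  += REFILL;
        }
    }
    return footprint;
}
```

<p>In this model the footprint grows with the number of steps when B never returns storage, and stays bounded once B returns excess blocks to the global pool.</p>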
<h1 class="sectionHeading">Conclusion</h1>
<p><img src="http://software.intel.com/file/23161" align="right" />There are other problems that can arise with parallel storage allocators.  For example, <b><i>false sharing</i></b> is a particularly pernicious problem, where two threads access independent blocks of storage that happen to lie on the same cache line, leading to a thrashing of the cache coherency protocol in the processor.  A storage allocator that fails to respect cache line boundaries and gives blocks of storage that share the same cache line to different threads may induce false sharing, which is hard to detect, because the logic of the code shows that the threads are accessing independent locations.</p>
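<p>One common way to avoid false sharing between per-thread data is to align each item to its own cache line. The sketch below assumes a 64-byte line, which is typical on x86 but not universal; the names are hypothetical.</p>

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Each counter is aligned, and therefore padded, to a full cache line. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

/* Two adjacent counters in one line: prone to false sharing if each is
   updated by a different thread. */
alignas(CACHE_LINE) unsigned long packed_counters[2];

/* Two counters on separate lines: no false sharing between them. */
struct padded_counter padded_counters[2];

/* Returns nonzero if both addresses fall on the same cache line. */
int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

<p>An allocator gets the same effect by never handing blocks from one cache line to different threads, which is what per-thread superblocks achieve.</p>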
<p>Two examples of parallel storage allocators are <a target="_blank" href="http://www.cs.umass.edu/%7Eemery/hoard/">Hoard</a>, written by Emery Berger of the University of Massachusetts, and the <a href="http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications">Miser</a> allocator, distributed by <a href="http://www.cilk.com">Cilk Arts</a> as part of our <a href="http://www.cilk.com/multicore-products/cilk-solution-overview/">Cilk++</a> distribution. (More on Miser in an <a href="http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications">upcoming post</a> - stay tuned!)</p>
<a name="Comments"></a>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/multicore-storage-allocation</link>
      <pubDate>Wed, 28 Oct 2009 13:53:13 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/multicore-storage-allocation#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/multicore-storage-allocation</guid>
      <category>Parallel Programming</category>
    </item>
    <item>
      <title>Four Reasons Why Parallel Programs Should Have Serial Semantics</title>
      <description><![CDATA[ <b>by Steve Lewin-Berlin </b><br /><br />Some parallel programming environments require the developer to relearn the fundamentals of programming in order to think in parallel. Cilk++ takes a different approach. One basic design principle of Cilk++ is that Cilk++ programs have <i>serial semantics</i>, that is, a Cilk++ program can be understood (and executed) as a serial program. The Cilk++ keywords were designed to make this <i>serialization</i> look similar to the parallel Cilk++ program.
<p> </p>
<p>This simple but powerful principle guides more than just the appearance of Cilk++ programs. It also simplifies the process of developing and testing a Cilk++ program, and it allows Cilk Arts to provide powerful, efficient, and provably correct tools.</p>
<p>(Serial semantics should not be confused with sequential consistency, which is a concept invented by Leslie Lamport and discussed extensively in the literature of parallel computing. A parallel computing system with sequentially consistent memory guarantees that, in Lamport's words, "<i>... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.</i>" In contrast, serial semantics means that the program can be executed using a single thread of control as an ordinary serial program - one does not need a conceptual parallel model to understand -- or execute -- the program.)</p>
<h2>Cilk++ Serial Semantics <br /></h2>
<p>First, let's take a closer look at how Cilk++ provides serial semantics.</p>
<p>Cilk++ provides three new keywords: <span style="font-family: Courier New; font-size: 12px;">cilk_spawn</span>, <span style="font-family: Courier New; font-size: 12px;">cilk_for</span>, and <span style="font-family: Courier New; font-size: 12px;">cilk_sync</span>. Each of these has an intuitive serial interpretation:</p>
<table border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="border: 1pt solid windowtext; padding: 0in 5.4pt; width: 95.4pt;" valign="top" width="127">
<p class="MsoNormal"><b><i>Keyword<o:p></o:p></i></b></p>
</td>
<td style="padding: 0in 5.4pt; width: 2.5in;" valign="top" width="240">
<p class="MsoNormal"><b><i>Parallel meaning<o:p></o:p></i></b></p>
</td>
<td style="padding: 0in 5.4pt; width: 203.4pt;" valign="top" width="271">
<p class="MsoNormal"><b><i>Serial meaning<o:p></o:p></i></b></p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 95.4pt;" valign="top" width="127">
<p class="MsoNormal"><span style="font-family: &quot;Courier New&quot;;">cilk_spawn<o:p></o:p></span></p>
</td>
<td style="padding: 0in 5.4pt; width: 2.5in;" valign="top" width="240">
<p class="MsoNormal">Allow a called function to run in parallel with its caller<o:p></o:p></p>
</td>
<td style="padding: 0in 5.4pt; width: 203.4pt;" valign="top" width="271">
<p class="MsoNormal">Call a function normally, where the caller waits for the called function to return before resuming<o:p></o:p></p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 95.4pt;" valign="top" width="127">
<p class="MsoNormal"><span style="font-family: &quot;Courier New&quot;;">cilk_sync<o:p></o:p></span></p>
</td>
<td style="padding: 0in 5.4pt; width: 2.5in;" valign="top" width="240">
<p class="MsoNormal">Wait for parallel activity to complete<o:p></o:p></p>
</td>
<td style="padding: 0in 5.4pt; width: 203.4pt;" valign="top" width="271">
<p class="MsoNormal">Do nothing (there is no parallel activity)<o:p></o:p></p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 95.4pt;" valign="top" width="127">
<p class="MsoNormal"><span style="font-family: &quot;Courier New&quot;;">cilk_for<o:p></o:p></span></p>
</td>
<td style="padding: 0in 5.4pt; width: 2.5in;" valign="top" width="240">
<p class="MsoNormal">Allow multiple iterations of the body of a loop to run in parallel<o:p></o:p></p>
</td>
<td style="padding: 0in 5.4pt; width: 203.4pt;" valign="top" width="271">
<p class="MsoNormal">Execute a standard <span style="font-family: &quot;Courier New&quot;;">for</span> loop in which loop iterations execute one at a time<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p>As you can see, a Cilk++ program can be converted to a serial program quite easily: simply replace <span style="font-family: Courier New; font-size: 12px;">cilk_for</span> with <span style="font-family: Courier New; font-size: 12px;">for</span>, and delete the <span style="font-family: Courier New; font-size: 12px;">cilk_spawn</span> and <span style="font-family: Courier New; font-size: 12px;">cilk_sync</span> keywords. The Cilk++ development system provides a compile-time switch to build the "serialization" of a Cilk++ program. Using Cilk++ for Linux, the option <span style="font-family: Courier New; font-size: 12px;">[-fcilk-stub]</span> is used with cilk++ to "stub out" Cilk features. Using Cilk++ for Windows, run <span style="font-family: Courier New; font-size: 12px;">cilkpp</span> with the <span style="font-family: Courier New; font-size: 12px;">[/cilkp cpp]</span> option.</p>
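<p>The substitution described above can be imitated by hand in ordinary C++. The sketch below stubs out the three keywords with preprocessor macros so that a Cilk++-style function compiles as its serialization. It illustrates the idea only and makes no claim about how the compiler switch is implemented; <code>fib</code> and <code>square_all</code> are hypothetical examples.</p>

```cpp
#include <vector>

// Hand-written stand-ins for the serial meaning of each keyword.
#define cilk_spawn          // a spawn becomes an ordinary call
#define cilk_sync           // a sync becomes an empty statement
#define cilk_for for        // a parallel loop becomes a plain loop

// With the stubs in place this is just recursive Fibonacci.
long fib(int n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // would run in parallel with the next line
    long y = fib(n - 2);
    cilk_sync;                       // would wait for the spawned call
    return x + y;
}

// Iterations are independent, so the loop is safe serially or in parallel.
void square_all(std::vector<long>& values) {
    const int n = static_cast<int>(values.size());
    cilk_for (int i = 0; i < n; ++i) {
        values[i] = values[i] * values[i];
    }
}
```

<p>Because the stubbed program is plain C++, it can be handed to any debugger, profiler, or coverage tool without special support.</p>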
<p>Okay, I hear you saying, "Fine, your fancy parallel program can devolve into a serial program. So what? If I wanted the serial version, I would have just used C++ in the first place!" Glad you raised the issue! A parallel programming model with serial semantics maintains four real advantages over program models without serial semantics.</p>
<ol>
<li><b>The equivalent serial C++ program can easily be debugged and analyzed using existing development tools.</b> This is an ideal way to debug a parallel program without worrying about parallel interaction, and without needing to consider multiple stacks and program counters. Also, as vanilla C++, all symbol and debug information is available to standard system and third-party tools. Tools that are easy to run on the serialization but might require custom versions to operate on the parallel program include performance profilers, code coverage tools, security analysis tools, memory leak checkers, and any other tools that work with C++ programs.</li>
<li><b>The serial semantics of Cilk++ programs allows Cilk Arts to build better tools that analyze the performance and correctness of Cilk++ programs, and these tools offer strong guarantees about the program.</b> For example, our Cilkscreen race detector runs the test program in a serialized mode on a single processor. With information about the parallel structure of the program (i.e., where the spawns and syncs occur), Cilkscreen can analyze all possible schedules of the program that might execute on any number of processors. In other words, Cilkscreen can detect whether race conditions exist that could manifest under any possible execution of the program. The serial semantics of Cilk++ permits Cilkscreen to analyze the parallel program while running in a single worker (the Cilk++ name for a thread of execution) on a single processor.</li>
<li><b>The serial semantics of Cilk++ provides a real advantage to program testing and quality assurance.</b> Cilk++ is designed to ensure that parallel programs written in Cilk++ can be deterministic, reliably producing results identical to the serialization. This guarantee makes it easy to leverage traditional testing tools that assume that multiple runs of a program always return the same result if given the same input. The serial semantics of Cilk++ eliminates the need to account for timing and scheduling when evaluating program correctness, and it makes it simple to compare the output of multiple program runs for consistency. Repeatability is one of the key properties that make Cilk++ programs testable and reliable.</li>
<li><b>Serial semantics provides high performance while allowing you to design your program without specifying the number of processors on which the program will execute.</b> The Cilk++ runtime scheduler takes care of efficiently load balancing your program across however many processors are available. When a portion of your parallel computation executes on a single processor, Cilk++ can execute it just like ordinary C++ code, taking full advantage of all the compiler optimizations and runtime efficiencies that a good C++ system offers. By starting from good single-core performance, Cilk++ ensures that a program with sufficient parallelism gets good speedup whether it is run on a large number of processors or just a few.</li>
</ol>
<h2>What's the Catch? <br /></h2>
<p>Perhaps this all sounds too good to be true - and maybe now you're thinking, "Okay, so what's the catch?" All right, there are a few things I've glossed over. For one, since there are some kinds of parallelism that actually don't <i>have</i> serial semantics, you can't write programs using these parallelization strategies in Cilk++. For example, Cilk++ doesn't support producer/consumer, software pipelining, or message passing. Nevertheless, although we're giving something up to have serial semantics, our experience is that less is more. The four advantages outlined above readily make up for any loss in generality.</p>
<p>There's a second catch, however. Note that in the third point above, I said that Cilk++ programs can be deterministic, not that they are deterministic. How so? Well, the output of a program with a determinacy race (see "<a href="http://software.intel.com/en-us/articles/are-determinacy-race-bugs-lurking-in-your-multicore-application">Are Determinacy Races Lurking in Your Multicore Application</a>") can depend on how the program is scheduled on multiple processors, and the race condition that leads to this nondeterministic behavior is usually a bug. In most programming environments, you'd be stuck at this point trying to find the race bug by laborious program inspection. In contrast, in the Cilk++ programming environment, Cilkscreen can help you find and eliminate these races, giving you your determinism back. Thus, this "catch," though problematic, can be dealt with effectively precisely because of serial semantics: in effect, Cilkscreen uses the serialization as a benchmark for correctness of a parallel execution.</p>
<p>Sometimes, however, a Cilk++ program is intentionally nondeterministic. For example, some search programs use a concurrent hash table to "remember" the results of intermediate subsearches so that the subsearch doesn't have to be reexecuted if it reoccurs. In this context, each slot of the hash table may contain a mutual-exclusion lock to allow parallel branches of a computation to safely access and update the items stored in the slot. Cilkscreen correctly ensures that races are not declared on operations protected by the same lock.</p>
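<p>A table of this kind might look roughly like the following sketch, written in standard C++ with <code>std::mutex</code> rather than any Cilk++ lock type. The class <code>MemoTable</code>, the slot count, the hash constant, and the <code>count_paths</code> search are all assumptions chosen for illustration.</p>

```cpp
#include <array>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical memo table: each slot carries its own lock, so branches that
// hit different slots never contend with one another.
class MemoTable {
public:
    // Copies the cached result into value_out and returns true if present.
    bool lookup(std::uint64_t key, long& value_out) {
        Slot& slot = slot_for(key);
        std::lock_guard<std::mutex> guard(slot.lock);
        const auto it = slot.entries.find(key);
        if (it == slot.entries.end()) return false;
        value_out = it->second;
        return true;
    }

    // Records a result; a later store for the same key overwrites it.
    void store(std::uint64_t key, long value) {
        Slot& slot = slot_for(key);
        std::lock_guard<std::mutex> guard(slot.lock);
        slot.entries[key] = value;
    }

private:
    struct Slot {
        std::mutex lock;
        std::unordered_map<std::uint64_t, long> entries;
    };

    static constexpr int kSlotBits = 6;  // 2^6 = 64 slots

    // Multiplicative hash: the top kSlotBits bits pick the slot.
    Slot& slot_for(std::uint64_t key) {
        const std::uint64_t mixed = key * 0x9E3779B97F4A7C15ULL;
        return slots_[mixed >> (64 - kSlotBits)];
    }

    std::array<Slot, (1u << kSlotBits)> slots_;
};

// Example subsearch: monotone lattice paths across a rows-by-cols grid.
long count_paths(int rows, int cols, MemoTable& memo) {
    if (rows == 0 || cols == 0) return 1;
    const std::uint64_t key = (static_cast<std::uint64_t>(rows) << 32) |
                              static_cast<std::uint64_t>(cols);
    long cached = 0;
    if (memo.lookup(key, cached)) return cached;
    const long result = count_paths(rows - 1, cols, memo) +
                        count_paths(rows, cols - 1, memo);
    memo.store(key, result);
    return result;
}
```

<p>If the two recursive calls ran in parallel, which branch fills a given slot first would vary from run to run. That is the intentional nondeterminism: the order of table updates changes, but every stored value is the same regardless of who computes it, so the final answer does not.</p>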
<p>In practice, however, we have found that most Cilk++ programs require few, if any, locks. In particular, Cilk++'s parallel control constructs obviate the need for the locks that other parallel programming models require for interthread communication and synchronization. Moreover, for many common situations where locking would seem to be necessary to avoid a race, Cilk++ provides <a href="http://www.cilk.com/multicore-products/cilk-hyperobjects/">hyperobjects</a>, a novel data structure that can resolve races on global variables without sacrificing performance or determinism. By avoiding locks and using hyperobjects, Cilk++ sidesteps many performance anomalies caused by locks, such as lock contention, which can slow down parallel programs significantly, and deadlock, which may cause your application to freeze midexecution.</p>
<h2>The Bottom Line <br /></h2>
<p>To sum up, Cilk++'s serial semantics provide three key benefits - performance, reliability, and productivity:</p>
<ul>
<li>Scalable performance comes from a model that doesn't rely on the programmer knowing the number of available processors in advance. Developers can use traditional serial performance tuning tools to find hotspots and improve both serial and parallel performance.</li>
<li>Reliability is enhanced through the testing advantages of repeatability and determinism, as well as the ability to build an efficient and provably correct race detector that can compare the parallel semantics of the program with the serial semantics provided by the program's serialization.</li>
<li>Finally, programmers can be more productive when they use familiar paradigms for new development, and of course, legacy serial code can be converted into Cilk++ easily, because it does not need to be rewritten to adopt a completely different programming model such as data-parallel or message-passing styles.</li>
</ul>
<p class="postfoot"><span class="NormalBold">Tags: <a rel="tag" href="http://www.cilk.com/multicore-blog/?Tag=serial+semantics">serial semantics</a></span></p>
<a name="Comments"></a>
<div id="listing">
<div class="post">
<h3>COMMENTS</h3>
<div class="Normal" align="left"><a name="comment27229"></a>Nice article, and I'm always glad to hear about different approaches to concurrency issues.   <br /><br />The only thing that I see here is that it suffers from the master thread, worker thread problem of being confined to extremely well defined sets of data. Much like OpenMP this works great when you're talking about a multicore desktop with shared memory.   <br /><br />Using primitives like this is a great idea but it doesn't solve the whole problem which is both shared and non-shared memory types of concurrency.  <br /><br />/revealbias  <br /><br />Check out erlang :D</div>
<p class="postfoot">posted @ Friday, December 12, 2008 4:43 PM by <a rel="nofollow" href="http://nickelcode.com">John Bender</a></p>
<hr />
<div class="Normal" align="left"><a name="comment27230"></a>John,  <br /><br />Your observation that Cilk++ is designed for shared memory systems is absolutely correct, though not all shared memory systems fall in the desktop class. The trend that we see is toward larger shared memory systems.  <br /><br />I think Erlang offers some good ideas, but it doesn't solve the same problems that Cilk++ addresses, such as providing a parallel strategy for legacy code, easing the transition to parallelism for programmers trained to solve problems serially, and offering a high-performance and reliable solution that scales from one to many processors with minimal overhead.  <br /><br />Our goal is not to solve all concurrent programming problems with a single solution, but to offer a great answer to what we think is the problem that most developers will need to tackle over the next few years - making the transition from single processor systems to multicore parallel computers.</div>
<p class="postfoot">posted @ Friday, December 12, 2008 5:33 PM by Steve Lewin-Berlin</p>
<hr />
<div class="Normal" align="left"><a name="comment27266"></a>@Steve  <br /><br />Spot on with both points. Shared Memory does not mean desktop class, and there is room for both solutions to be sure.   <br /><br />I certainly agree that creating language constructs to ease the pain of parallelizing code is pretty important these days.</div>
<p class="postfoot">posted @ Monday, December 15, 2008 8:08 AM by <a rel="nofollow" href="http://nickelcode.com">John Bender</a></p>
<hr />
<div class="Normal" align="left"><a name="comment27307"></a>Hi Steve!  <br /><br />I disagree with the premise that scalable parallel programs can be expressed with control flow style on a shared memory space. Sure, we can upgrade legacy code for shared memory multiprocessors with a few markups, but at some scale, physical reality requires us to abandon the shared memory view (Matteo pointed me to the 'Horizons of Parallel Computation' paper which proves it).   <br /><br />When shared memory control flow style stops scaling, composable dataflow pipelines provide the only reasonable mechanism for optimizing process locality and interprocess communication (hardware designers call it "place and route").  <br /><br />For process graphs with each node having degree 2 (linear pipelines), the ratio between best and worst case placement of processes in a processor array grows as O(N^(3/2)) in the number of processing elements. At 8 cores you don't have a big problem synchronizing your shared memory space, but how will Cilk address communication issues among 64 cores?</div>
<p class="postfoot">posted @ Tuesday, December 16, 2008 12:48 PM by <a rel="nofollow" href="http://fpgacomputing.blogspot.com">Amir</a></p>
<hr />
<div class="Normal" align="left"><a name="comment27310"></a>Amir,  <br /><br />I have personally used shared-memory Cilk successfully on 256 processors ten years ago, when the 256 processors filled a room. (The room and the processors were known as ``SGI Origin''). We used Cilk to play chess at the time, and it was working perfectly well on that many cores.  <br /><br />That being said, I do find Gianfranco's argument persuasive, and it is probably true that in the fullness of time every computer system becomes a huge systolic array. However, we are nowhere close to that point (yet).</div>
<p class="postfoot">posted @ Tuesday, December 16, 2008 1:55 PM by Matteo Frigo</p>
<hr />
<div class="Normal" align="left"><a name="comment27312"></a>Hi Matteo,  <br /><br />These days 256 cores fills 4 slots on a PCI bus :)  <br /><br />Let me step back from my original statement a little. Many big problems (perhaps most of the biggest problems) do not have the sort of interprocess communication constraints to limit them in even widely distributed processor networks. Cilk will rock for these problems. Folding@home, crypto-cracking, chess, and many data parallel operations fit this space. Even the NP-Hard placement optimizations distribute very well with minimal interprocess communication so I don't need to optimize the placement of the parallel processes that are optimizing the placement of some other parallel processes...   <br /><br />Other problems, like N-body simulations, have a hopeless all-to-all IPC topology and can't be rescued by any placement optimizations.  <br /><br />But then there's this huge middle ground of problems like funnel sort, spectral transforms, filter banks, FDTD methods and production systems where process placement is built into the way we think about the problem. As we scale these problems, the widely held false assumption of O(1) memory access time breaks down to the physically accurate O((nln n)^(1/2)) from HOPC and the only option is to abandon the global memory model. Someone will have to rewrite these algorithms into communicating processes instead of synchronized tasks.  <br /><br />I'd like to explore this more. I'll have to talk to you when I'm done with my current contract.</div>
<p class="postfoot">posted @ Tuesday, December 16, 2008 3:31 PM by <a rel="nofollow" href="http://fpgacomputing.blogspot.com">Amir</a></p>
<hr />
<div class="Normal" align="left"><a name="comment27423"></a>Cilk appears, from what I can see, to be an evolution of the parallelization methods used in, say, Occam (ie: <b>simple</b>, <i>effective</i>, explicit parallelizing) with a large dose of the ideas from Manchester University's 1974 paper on parallel programming documenting how it should mostly be done in the compiler with minimal-to-no intrusion into the source.  <br /><br />Yes, these are old ideas, but let's face it. The Transputer failed to go anywhere useful and the Manchester paper was largely ignored. Parallel methods since then have been horribly complex and daunting to the point that the maxim has become Abandon Hope, All Ye Who Press Enter Here.  <br /><br />Yes, shared memory is a headache, as DSM (Distributed Shared Memory) is often slow or only available on very expensive iron. Some languages add a keyword of "mobile" to denote when a given block of parallel code need not communicate over shared memory and so isn't faced with this problem. I can't see going from three keywords to four killing the Cilk++ team, but if I were them, I'd want to make sure that was the right fourth keyword to add. If you want to keep the keywords down, you want to make sure that an addition is the best of all possible additions. A suggestion, even if it did sound neat, would be completely insufficient to base a fairly significant code change on.</div>
<p class="postfoot">posted @ Tuesday, December 23, 2008 5:52 PM by Jonathan Day</p>
</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/four-reasons-why-parallel-programs-should-have-serial-semantics</link>
      <pubDate>Wed, 28 Oct 2009 13:50:17 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/four-reasons-why-parallel-programs-should-have-serial-semantics#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/four-reasons-why-parallel-programs-should-have-serial-semantics</guid>
      <category>Parallel Programming</category>
    </item>
    <item>
      <title>Finding Performance Bottlenecks &amp; Data Races</title>
      <description><![CDATA[ <b>by Ilya Mirman</b><br /><br />At this point, we have several dozen organizations worldwide exploring Cilk++. When embarking on a multithreading project, the first question many folks ask is, "Where do I start?"
<p> </p>
<p>Because some applications have as few as 1,000 lines of code while others have 500,000 or more, there is a good deal of variation in the calendar time involved. The project can take as little as a couple of days, or stretch to a couple of months. But the workflow itself - illustrated below - is fairly consistent from project to project: start with the serial code and verify its correctness (steps 1 and 2); identify and address performance bottlenecks by adding Cilk++ keywords (steps 3 and 4); verify serial and parallel correctness, potentially employing hyperobjects to resolve races (steps 5 through 7); and potentially repeat this process, guided by performance targets.</p>
<p><img src="http://software.intel.com/file/23114" align="center" /></p>
<p>Because the initial projects typically target existing serial applications, the early Cilk++ projects typically jump to step 3, and the first question is often, "Where are the performance bottlenecks?" And shortly thereafter, "Did exposing the parallelism create a race condition?" Fortunately, there are some great tools available to answer both questions. Let's take a look at a recent example.</p>
<h3 class="sectionHeading">Identifying performance bottlenecks</h3>
<p><img src="http://software.intel.com/file/23115" align="left" />As an example of zeroing in on a hotspot, consider the case of Dan Mirman, a UConn Psychology Postdoctoral Fellow investigating how word meanings are organized and structured in the brain. As part of his research, Dan uses the <a rel="nofollow" target="_new" href="http://www.cnbc.cmu.edu/~mharm/research/tools/mikenet/">MikeNet Neural Network Simulator</a> library.</p>
<p>Dan's goal was to accelerate the research feedback loop. Running each simulation takes days, and dozens of neural network simulations are required for each research project. Speeding up simulations from days to hours could have a great impact on the research feedback loop. Software engineers at <a rel="nofollow" target="_new" href="http://www.persistentsys.com/">Persistent Ltd</a> (a Cilk Arts partner interested in providing multicore enablement services to their customers) used Intel's <a rel="nofollow" target="_new" href="http://www.intel.com/cd/software/products/asmo-na/eng/239144.htm">VTune Performance Analyzer</a> software to zero in on the hotspots in the code.</p>
<p>VTune helps you identify and characterize performance issues by collecting performance data from your application, organizing and displaying the data in a variety of interactive views (from system-wide down to source code or processor instruction perspective), identifying potential performance issues and suggesting improvements.</p>
<p>The sampling source view displays source code annotated with performance data. It turned out that three functions were responsible for 97% of the compute time:</p>
<blockquote><img src="http://software.intel.com/file/23116" align="center" /><br /></blockquote>
<blockquote>
<blockquote>
<p><img src="http://software.intel.com/file/23117" align="center" /></p>
</blockquote>
</blockquote>
<h3 class="sectionHeading"><span class="sectionBodyText">Exposing Parallelism with Cilk++ </span><br /></h3>
<p>Again with VTune's help, we can quickly drill down to the very lines responsible for the delay.</p>
<p><img src="http://software.intel.com/file/23118" align="center" /></p>
<p>In this case, it was the outer <b>for </b>loops in the three functions. It was a trivial effort to multithread them by replacing the serial <b>for </b>loops with <b>cilk_for</b> loops:</p>
<blockquote>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">// parallelized<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">void mikenet_matrix_vec_mult_t_p (Real * outvec,int nout,Real *invec,<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">int nin,Real **mat) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;"><span style="color: #0000ff;"><b>cilk_for</b></span> (int i = 0; i &lt; nout; ++i) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">for (int j = 0; j &lt; nin; ++j) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">outvec[i] += mat[j][i] * invec[j];<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
</blockquote>
<p>The following illustrates the performance gains achieved by multithreading MikeNet (4-core 1.7GHz system with 2GB of RAM and 512 KB cache/core).</p>
<p><img src="http://software.intel.com/file/23119" align="center" /></p>
<p>3.5X improvement on 4 cores is pretty darn good...but can we do better? It turns out we can: in one of the functions every memory access misses the cache, and there is thus an opportunity for a <a rel="nofollow" target="_new" href="http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times">more cache-friendly algorithm</a>.</p>
<p>Swapping the inner and outer loop results in a more cache-friendly algorithm:</p>
<blockquote>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">// inverted, parallelized (with a race on outvec).<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">void mikenet_matrix_vec_mult_t_p (Real * outvec,int nout,Real *invec,<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">int nin,Real **mat) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;"><b><span style="color: #0000ff;">cilk_for</span></b> (int j = 0; j &lt; nin; ++j) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">for (int i = 0; i &lt; nout; ++i) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">outvec[i] += mat[j][i] * invec[j];<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
</blockquote>
<p>But when we run the Cilkscreen race detector, we discover that there is now a race on <b>outvec</b>.</p>
<p>Fortunately, Cilk++ reducer hyperobjects are ideally suited for eliminating this race condition without requiring a lock on <b>outvec</b>. Here's the new code:</p>
<p> </p>
<blockquote>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">// inverted, parallelized with reducer.<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">void mikenet_matrix_vec_mult_t_p (Real * outvec,int nout,Real *invec,<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">int nin,Real **mat) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">array_reducer_t art(nout, outvec);<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;"><b><span style="color: #0000ff;">cilk::hyperobject&lt;array_reducer_t&gt; rvec(art)</span></b>;<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;"><span style="color: #0000ff;"><b>cilk_for</b></span> (int j = 0; j &lt; nin; ++j) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">Real *array = rvec().array;<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">for (int i = 0; i &lt; nout; ++i) {<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">array[i] += mat[j][i] * invec[j];<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size: 10pt; font-family: &quot;Courier New&quot;;">}<o:p></o:p></span></p>
</blockquote>
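<p>The listing above does not show <code>array_reducer_t</code>, so the following is only a sketch of the idea behind it, in standard C++ rather than the Cilk++ hyperobject API. Each chunk of the <code>j</code> loop accumulates into its own private view, and the views are folded together afterwards. The names <code>ArrayView</code> and <code>matrix_vec_mult_t_views</code>, the <code>chunks</code> parameter, and the choice of <code>double</code> for <code>Real</code> are assumptions. The chunks run one after another here for simplicity; since they write to disjoint buffers, they are the part that could run in parallel.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Real = double;  // assumption: MikeNet's Real is a floating-point type

// Hypothetical stand-in for one reducer view: a private accumulation buffer.
struct ArrayView {
    std::vector<Real> array;
    explicit ArrayView(std::size_t n) : array(n, 0.0) {}

    // Fold another view into this one by element-wise addition.
    void reduce(const ArrayView& other) {
        for (std::size_t i = 0; i < array.size(); ++i) {
            array[i] += other.array[i];
        }
    }
};

// Adds the transposed product to outvec: outvec[i] += sum_j mat[j][i] * invec[j].
// The j range is split into `chunks` pieces, each with a private view.
void matrix_vec_mult_t_views(Real* outvec, int nout, Real* invec,
                             int nin, Real** mat, int chunks) {
    assert(chunks > 0 && nout >= 0 && nin >= 0);
    std::vector<ArrayView> views(static_cast<std::size_t>(chunks),
                                 ArrayView(static_cast<std::size_t>(nout)));

    for (int c = 0; c < chunks; ++c) {
        // Half-open range [begin, end) of rows handled by this chunk.
        const int begin = static_cast<int>(static_cast<long long>(nin) * c / chunks);
        const int end = static_cast<int>(static_cast<long long>(nin) * (c + 1) / chunks);
        Real* array = views[static_cast<std::size_t>(c)].array.data();
        for (int j = begin; j < end; ++j) {
            for (int i = 0; i < nout; ++i) {
                array[i] += mat[j][i] * invec[j];  // no other chunk writes here
            }
        }
    }

    // Merge the views in chunk order, then add the total to the caller's vector.
    ArrayView total(static_cast<std::size_t>(nout));
    for (const ArrayView& view : views) total.reduce(view);
    for (int i = 0; i < nout; ++i) outvec[i] += total.array[i];
}
```

<p>Merging in a fixed order keeps the result repeatable for a given chunk count. Floating-point addition is not strictly associative, so a different chunk count may round slightly differently.</p>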
<p>The cache-friendly algorithm picked up a factor of 3X relative to the serial implementation on a single processor, and on 4 cores reached 8X relative to the original serial version (for direct comparison, the data from the previous chart is included in the chart below). As usual, improvement to a serial algorithm paid dividends when multicore-enabling it.</p>
<p> </p>
<blockquote><img src="http://software.intel.com/file/23120" align="center" /><br /></blockquote>
<h3 class="sectionHeading">Conclusion</h3>
<p>A key part of a multithreading project is figuring out early on where the performance bottlenecks are, exposing parallelism, and then verifying that the parallelism has not created a race condition. The combination of familiar profiling tools such as VTune and the Cilkscreen race detector takes a lot of guesswork out of multithreading.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/finding-performance-bottlenecks-data-races</link>
      <pubDate>Wed, 28 Oct 2009 13:47:39 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/finding-performance-bottlenecks-data-races#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/finding-performance-bottlenecks-data-races</guid>
      <category>Parallel Programming</category>
    </item>
    <item>
      <title>Making Your Cache Go Further in These Troubled Times</title>
      <description><![CDATA[ <b>by Will Leiserson</b><br /><br />One of our summer interns, Matthew Steele, suggested a <a target="_new" href="http://software.intel.com/en-us/articles/a-tale-of-two-algorithms-multithreading-matrix-multiplication">matrix-multiplication algorithm</a> that more effectively used the cache than an algorithm that might be more intuitive to a mathematician or physicist. No doubt, the intuitive triply-nested loop is the preferred solution of many software engineers. But Matt's function beat the intuitive one - even after the latter was cilkified. What made his algorithm perform so well? The short answer is that, although Matt's algorithm accessed all of the same memory addresses the same number of times as the intuitive algorithm, his function caused fewer cache misses. The original function caused the computer to spend more time loading and storing cache lines than executing the program.
<p> </p>
<p>Your computer's cache is divided into lines. When your CPU accesses a certain memory address, if it isn't in the cache, it will fetch a line from the next level out rather than a single word. This is a slow process, but if subsequent accesses to memory are nearby, there is a high probability that what the CPU needs is already in the cache. However, certain structures are unlikely to fit in the cache all at once, or at least they may be spread across many lines. Matrices are a prime example and the consequence is that the way in which you access elements of a matrix has a significant bearing on performance. Consider the following two C++ snippets accessing the same matrix along different dimensions:</p>
<pre>// Matrix times vector: the inner loop walks along a row.<br />for (int i = 0; i &lt; rows; ++i) {<br />    for (int j = 0; j &lt; cols; ++j) {<br />        out[i] += matrix[i][j] * in[j];<br />    }<br />}<br /><br />// Matrix transpose times vector: the inner loop walks down a column.<br />for (int i = 0; i &lt; cols; ++i) {<br />    for (int j = 0; j &lt; rows; ++j) {<br />        out[i] += matrix[j][i] * in[j];<br />    }<br />}<br /></pre>
<p>The two snippets have substantially different performance, which I'll illustrate with a recent example from an early adopter of Cilk++. Dan Mirman, a UConn Psychology Postdoctoral Fellow, brought his neural network simulation code to us because each time he ran it, it would take him a day to get results. Naturally, he wanted to parallelize it to improve performance. The two snippets above correspond directly to two functions in the <a target="_blank" href="http://www.cnbc.cmu.edu/~mharm/research/tools/mikenet/">MikeNet Neural Network Simulator</a> library.</p>
<p>The former function is used to multiply a matrix by a vector, and the latter apparently was copied and pasted and the indices were reversed to perform the multiplication of a matrix transpose by a vector. In fact, there was a trivial modification, taking into account cache usage, that turned out to be nearly as significant as parallelization. To understand the nature of the change, consider the following layout of a matrix in memory:</p>
<p><img src="http://software.intel.com/file/23105" border="0" height="189" width="445" /></p>
<p>Sequential accesses in the first function point to adjacent values in memory. Thus, accesses are quick. And the only time it misses the cache (on matrix accesses) is when it is done with the old line and will never load it again.</p>
<p>In the second function, rather than iterating through sequential addresses, every access is on a different line. In short, every access misses the cache. Furthermore, by the time the outer loop completes the first iteration, the second iteration's memory accesses are looking for lines that have long since been flushed. In other words, where the first function pulls each line into the cache once, the second pulls each line into the cache for every iteration of the outer loop.</p>
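<p>To make the difference concrete, here is a minimal sketch of a toy cache model that counts misses for the two traversal orders. Everything in it is an assumption made for illustration: the matrix is taken to be one contiguous row-major block of 4-byte floats starting at address 0, the cache is a fully associative LRU cache of 256 lines of 64 bytes, and the names <code>ToyCache</code> and <code>count_misses</code> are hypothetical. Real caches are set-associative, multi-level, and prefetch, so the counts show the trend rather than predict what any particular machine will do.</p>

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Toy fully associative LRU cache that only counts misses.
// The geometry passed in is illustrative, not a model of real hardware.
class ToyCache {
public:
    ToyCache(std::size_t line_bytes, std::size_t num_lines)
        : line_bytes_(line_bytes), num_lines_(num_lines) {}

    void access(std::size_t byte_address) {
        const std::size_t line = byte_address / line_bytes_;
        auto it = where_.find(line);
        if (it != where_.end()) {
            // Hit: mark the line as most recently used.
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        ++misses_;
        if (lru_.size() == num_lines_) {
            // Cache is full: evict the least recently used line.
            where_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(line);
        where_[line] = lru_.begin();
    }

    std::size_t misses() const { return misses_; }

private:
    std::size_t line_bytes_;
    std::size_t num_lines_;
    std::size_t misses_ = 0;
    std::list<std::size_t> lru_;  // front = most recently used
    std::unordered_map<std::size_t, std::list<std::size_t>::iterator> where_;
};

// Counts misses on matrix accesses only (in[] and out[] are ignored).
// row_wise = true walks along rows, false walks down columns.
std::size_t count_misses(std::size_t rows, std::size_t cols, bool row_wise) {
    const std::size_t kElemBytes = 4;  // assumed size of a float
    ToyCache cache(64, 256);           // assumed: 256 lines of 64 bytes
    const std::size_t outer = row_wise ? rows : cols;
    const std::size_t inner = row_wise ? cols : rows;
    for (std::size_t a = 0; a < outer; ++a) {
        for (std::size_t b = 0; b < inner; ++b) {
            const std::size_t r = row_wise ? a : b;
            const std::size_t c = row_wise ? b : a;
            cache.access((r * cols + c) * kElemBytes);
        }
    }
    return cache.misses();
}
```

<p>Under these assumptions a 512x512 matrix occupies 16384 cache lines. <code>count_misses(512, 512, true)</code> reports 16384 misses, one per line, while <code>count_misses(512, 512, false)</code> reports 262144, one per access.</p>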
<p>With this in mind, a simple modification to MikeNet achieved ~3.4x improvement to the overall program on a roughly 500x500 matrix without using Cilk++ at all! All that was required was to invert the two for-loops in the transpose function so that sequential iterations of the inner loop were accessing adjacent locations in memory:</p>
<pre>for (int j = 0; j &lt; rows; ++j) {<br />    for (int i = 0; i &lt; cols; ++i) {<br />        out[i] += matrix[j][i] * in[j];<br />    }<br />}<br /></pre>
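<p>As a self-contained sketch of the same interchange, the two functions below compute the transpose product both ways so the results can be compared. The flat row-major <code>std::vector</code> storage and the function names are assumptions made for this example; MikeNet's own data structures may differ.</p>

```cpp
#include <cstddef>
#include <vector>

// out = transpose(matrix) * in, original loop order.
// The inner loop strides down a column, touching a new row each step.
std::vector<float> transpose_times_vector_strided(
        const std::vector<float>& matrix, std::size_t rows, std::size_t cols,
        const std::vector<float>& in) {
    std::vector<float> out(cols, 0.0f);
    for (std::size_t i = 0; i < cols; ++i) {
        for (std::size_t j = 0; j < rows; ++j) {
            out[i] += matrix[j * cols + i] * in[j];
        }
    }
    return out;
}

// Same computation with the loops interchanged.
// The inner loop now reads adjacent elements of one row.
std::vector<float> transpose_times_vector_sequential(
        const std::vector<float>& matrix, std::size_t rows, std::size_t cols,
        const std::vector<float>& in) {
    std::vector<float> out(cols, 0.0f);
    for (std::size_t j = 0; j < rows; ++j) {
        for (std::size_t i = 0; i < cols; ++i) {
            out[i] += matrix[j * cols + i] * in[j];
        }
    }
    return out;
}
```

<p>Each <code>out[i]</code> still accumulates its terms in the same order of <code>j</code>, so the interchange changes only the order in which memory is visited, not the arithmetic performed on any one output element.</p>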
<p>Incidentally, if you use MikeNet or otherwise have similar matrix code, this is a simple modification that you can perform to save yourself some cycles. This is already enough to change the way Dan interacts with his program. But can the work be spread across multiple cores?</p>
<p>The original algorithm has some hidden parallelism in both functions that is easy to expose. In the original code, merely converting the outer for-loops into cilk_for-loops provides some limited parallelism. The change is not quite as trivial in the second function after it has been modified for cache locality, however. The output vector is indexed by the old outer loop variable, which now drives the inner loop, so every iteration of the new outer loop updates every element of the vector, and parallelizing that outer loop causes a data race on each element. Parallelizing the new inner loop instead would avoid the race, but so little work is done in it that the overhead of running it in parallel is significant. This, of course, is precisely the purpose of a reducer. Wrapping the output vector in a reducer eliminates the data race caused by parallelizing the new outer loop:</p>
<pre>array_reducer_t art(out, cols);<br />cilk::hyper_ptr&lt;array_reducer_t&gt; art_ptr(art);<br /><span style="color: blue;">cilk_for</span> (int j = 0; j &lt; rows; ++j) {<br />    float *tmp = art_ptr-&gt;get_value();<br />    for (int i = 0; i &lt; cols; ++i) {<br />        tmp[i] += matrix[j][i] * in[j];<br />    }<br />}<br /></pre>
<p>This reducer is about as expensive as a reducer can be. Every time it merges, it must walk down two vectors and sum each pair of elements (complexity: O(cols)). In general, one would much rather have reductions that perform trivial operations to cut down on overhead. Nevertheless, when I made the changes to the algorithm, even with the expensive reducer and the limited parallelism the cilk_for exposes, I saw a significant performance improvement on one of our testing servers.</p>
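<p>For readers without Cilk++ at hand, the idea behind the array reducer can be sketched in plain C++. This is not the Cilk++ reducer implementation: the function names are hypothetical, the matrix is assumed to be a flat row-major <code>std::vector</code>, and the chunks are processed one after another here so the example stays deterministic. In a parallel run, each chunk would be handled by its own worker with its own private view.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical merge step: element-wise sum of two views, O(cols).
void merge_views(std::vector<float>& left, const std::vector<float>& right) {
    for (std::size_t i = 0; i < left.size(); ++i) {
        left[i] += right[i];
    }
}

// out = transpose(matrix) * in, computed chunk by chunk over the rows.
// Each chunk accumulates into a private view, so no two chunks ever
// write to the same memory; the views are combined only in merge_views.
std::vector<float> transpose_times_vector_reduced(
        const std::vector<float>& matrix, std::size_t rows, std::size_t cols,
        const std::vector<float>& in, std::size_t num_chunks) {
    std::vector<float> out(cols, 0.0f);
    if (num_chunks == 0) num_chunks = 1;
    const std::size_t chunk = (rows + num_chunks - 1) / num_chunks;
    for (std::size_t begin = 0; begin < rows; begin += chunk) {
        const std::size_t end = std::min(rows, begin + chunk);
        std::vector<float> view(cols, 0.0f);  // private to this chunk
        for (std::size_t j = begin; j < end; ++j) {
            for (std::size_t i = 0; i < cols; ++i) {
                view[i] += matrix[j * cols + i] * in[j];
            }
        }
        merge_views(out, view);  // one O(cols) merge per chunk
    }
    return out;
}
```

<p>The cost structure matches the description above: every merge walks the whole vector, so more chunks mean more O(cols) merges. Because merging regroups the floating-point additions, results can differ from the serial loop in the last bits for general inputs.</p>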
<p><img src="http://software.intel.com/file/23106" align="center" border="0" /></p>
<p>Bear in mind that I parallelized three functions in the whole program, and the only difference between the two versions is that one of the functions has been altered to access its matrix along a different dimension.</p>
<p>To sum up, identifying parallelism is a key part of program analysis, but it isn't the only thing you should consider when thinking about performance. Especially for some of these large data structures (e.g., matrices), taking some time to look critically at how they interact with your computer's cache is just as significant.</p>
<p> </p>
<a name="Comments"></a>
<div id="listing">
<div class="post">
<h3>COMMENTS</h3>
<div class="Normal" align="left"><a name="comment26229"></a>Any relation to Guy Steele? :-)</div>
<p class="postfoot">posted @ Friday, October 31, 2008 1:52 AM by Chris Khoo</p>
<hr />
<div class="Normal" align="left"><a name="comment26235"></a>Matthew Steele is Guy Steele's son. He was one of our summer interns. The <a rel="nofollow" href="http://software.intel.com/en-us/articles/a-tale-of-two-algorithms-multithreading-matrix-multiplication">blog entry I referenced</a> was written by him.</div>
<p class="postfoot">posted @ Friday, October 31, 2008 10:24 AM by <a rel="nofollow" href="http://www.cilk.com/">William Leiserson</a></p>
</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times</link>
      <pubDate>Wed, 28 Oct 2009 13:44:24 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times</guid>
      <category>Parallel Programming</category>
    </item>
  </channel></rss>