<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Mark Randel</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/mark-randel/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Sandy Bridge and Game Development</title>
		<link>http://software.intel.com/en-us/blogs/2011/01/11/sandy-bridge-and-game-development/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/01/11/sandy-bridge-and-game-development/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 20:55:49 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Sandy Bridge]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/01/11/sandy-bridge-and-game-development/</guid>
		<description><![CDATA[When we were making Ghostbusters a few years ago, we were approached by Intel to try out their new 4 series graphics media accelerator systems. After working closely with Intel for over a year, we were very happy with the resulting performance of the game, and the Graphics Performance Analyzer tool. Ghostbusters out of the [...]]]></description>
			<content:encoded><![CDATA[<p>When we were making Ghostbusters a few years ago, we were approached by Intel to try out their new 4 series graphics media accelerator systems.  After working closely with Intel for over a year, we were very happy with the resulting performance of the game, and the Graphics Performance Analyzer tool.  Ghostbusters out of the box was able to run very well on the 4 series chips, but not with every feature turned on.</p>
<p>Now, it’s the beginning of 2011, and we have Sandy Bridge, the next graphics series processor from Intel.  Going back and trying Ghostbusters, I was in for a treat.  Not only did I enjoy going back to the Ghostbusters universe not having played the game in over a year (we have been busy working on the new Star Wars game for Kinect with the Infernal Engine), Ghostbusters ran smoothly with every feature turned on at 1024x768, the highest resolution I could get my beta hardware to run at.  More or less, this means great 720p gaming performance!</p>
<p>This is excellent news for console/PC game developers - your base system PC from Intel will be able to run your full graphics options (and then some) without the user having to purchase additional graphics hardware (unless they want to run at insane resolutions).  This is also good news because you will be able to purchase a laptop that you can do serious game development on, get great battery life, and not break your back when you need to travel overseas (or back home for Thanksgiving...).</p>
<p>Sandy Bridge isn’t just about graphics performance.  The high end Core i7 will give you 4 cores, 8 threads, and turbo boost.  What can we do with all of this extra horsepower?  Stan Melax (famous for his convex hull code, aptly named “StanHull”) has a great paper on cloth simulation on the Intel developer website that uses all available threads to demonstrate advanced cloth simulation.  This code uses the new 256-bit AVX vector instructions.  While the SoA (structure of arrays) approach may not be applicable to every algorithm due to memory access patterns, this is still quite impressive.</p>
<p>Zane Mankowski (and others) have a very interesting paper on offloading shadow map generation from the GPU to the CPU for added performance.  This is a very interesting technique for me in particular, as we used it in the nineties when graphics acceleration was in its infancy for spaceship and vehicle shadows.  It’s great to see this used for full blown shadow map generation - if your engine has all these threads available and isn’t using them for physics or animation, it’s a good example of load balancing - use all of the system, rather than overload one specific component.  You see a lot of PS3 developers lately using this technique on SPUs to speed up their performance as well for SSAO (screen space ambient occlusion), vertex transformations, HDR bloom generation, etc. All of these could be done on those extra CPU cores as well as needed.</p>
<p>In the Infernal Engine, we make good use of all available CPU and GPU resources as well.  Animation blending is something that we have used a lot of lately - it is typical for an AI character to have over 50 core animations in a blend tree, so that is a must to offload on other CPUs.  Animation is great for multithreading, because if you have a good parallel chain, you won’t need the data until next game loop iteration and it can fill in the leftover odd time gaps between the other parallel operations of physics, particles, and rendering.</p>
<p>No matter how much code you have in parallel, you will still have portions of your game engine code that remain serial.  Sandy Bridge will address this code quite nicely with Turbo Boost.  Turbo Boost can raise the clock of a busy core above the rated frequency when the other cores are idle and there is thermal headroom.  Ideally, this will be the main thread of your game engine, which will be scheduling up the other threads, performing synchronization, and running the game logic that inevitably cannot be run in parallel.</p>
<p>2011 is going to be an exciting year for game developers (and players too!!!)</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/01/11/sandy-bridge-and-game-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Game Editor Parallelization in the Infernal Engine</title>
		<link>http://software.intel.com/en-us/blogs/2010/05/10/game-editor-parallelization-in-the-infernal-engine/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/05/10/game-editor-parallelization-in-the-infernal-engine/#comments</comments>
		<pubDate>Mon, 10 May 2010 23:01:08 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/05/10/game-editor-parallelization-in-the-infernal-engine/</guid>
		<description><![CDATA[In previous blog entries, I have talked a lot about parallelization of the game loop and how to make good use of two or more threads while your game is running. But this is only the tip of the iceberg of what you can actually parallelize. While it is still ongoing (it may never end...), [...]]]></description>
			<content:encoded><![CDATA[<p>In previous blog entries, I have talked a lot about parallelization of the game loop and how to make good use of two or more threads while your game is running.  But this is only the tip of the iceberg of what you can actually parallelize.  While it is still ongoing (it may never end...), the Infernal Engine Editor has lots of parallelization too, still using the Infernal Engine Job Queue model previously discussed.  The Job Queue essentially is able to asynchronously queue up standard C code via callbacks in a first come, first served method.  It's a good model as most codebases are years old, and multiprocessing is still relatively new.  Here are some examples of major speedups in the Infernal Engine:</p>
<p>Shader Creation</p>
<p>When loading a level, precompiled binary shaders are loaded into memory.  This is a somewhat serial task - there is one disk drive and one file stream.  However, the time it takes for DirectX to return a handle to a created shader may not be trivial.  Shader creation is queued up in the job queue, and lazily filled in.  It is highly probable that by the time your level loads, shaders are not complete, and the editor needs to render.  So if the handle for the shader has not been returned, anything with this material will not get rendered.  In fact, any DirectX asset can be created asynchronously in this fashion!  DirectX is threadsafe (on the PC - don't try this on the 360) and will block itself accordingly.</p>
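<p>To make the queue-it-now, fill-it-in-later idea concrete, here is a minimal sketch in portable C++11.  It is not the Infernal Engine Job Queue: JobQueue, PendingShader, and createShaderJob are hypothetical names, the workers are plain std::thread objects, and the value 42 is only a stand-in for a real handle returned by the graphics API.</p>

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical names throughout; this is not the Infernal Engine API.
typedef void (*JobCallback)(void* userData);  // plain C-style callback

class JobQueue {
public:
    explicit JobQueue(std::size_t workerCount) {
        for (std::size_t i = 0; i < workerCount; ++i) {
            workers_.emplace_back([this] { workerLoop(); });
        }
    }

    // Lets the workers drain every queued job, then joins them.
    ~JobQueue() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            shuttingDown_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_) {
            worker.join();
        }
    }

    // Queue a callback; jobs are started in first come, first served order.
    void submit(JobCallback callback, void* userData) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            jobs_.push(Job{callback, userData});
        }
        wake_.notify_one();
    }

private:
    struct Job {
        JobCallback callback;
        void* userData;
    };

    void workerLoop() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return shuttingDown_ || !jobs_.empty(); });
                if (jobs_.empty()) {
                    return;  // shutting down and nothing left to run
                }
                job = jobs_.front();
                jobs_.pop();
            }
            job.callback(job.userData);  // run the job outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<Job> jobs_;
    std::vector<std::thread> workers_;
    bool shuttingDown_ = false;
};

// Example job: "creates" a resource and publishes its handle when done.
struct PendingShader {
    std::atomic<int> handle{0};  // 0 means "not created yet, skip rendering"
};

void createShaderJob(void* userData) {
    PendingShader* shader = static_cast<PendingShader*>(userData);
    shader->handle.store(42);  // stand-in for the real, slow creation call
}
```

<p>With something like this in place, the loader would call submit(createShaderJob, &amp;shader) for each shader as it is read from disk, and the renderer would simply skip any material whose handle still reads 0.</p>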
<p>Optimized Collision Generation</p>
<p>In the Infernal Engine Editor, we load a quick version of the collision geometry for a level for speed.  However, this is not in an optimized format since collision geometry may be edited on the fly.  For each room/bsp node (see previous blog entries for more information), we need an optimized collision representation of that geometry so the game will run quickly.  A job to create a bounding volume tree (BVT) for each node is queued up as well.  This happens nicely also in the background as a level quickly loads.  (A topic for future discussion: What is an efficient method to store a BVT on a modern multicore CPU?)</p>
<p>Lightmap Generation</p>
<p>In the previous generation, we used shadow maps for shadow casting.  However, as lightmap size has increased (up to 16K!!!), for accurate shadow casting, we will raytrace the render BVT to see if a light can hit a point.  On your Core i7 980X, this can run up to 8 threads at once, for almost exactly an 8x speedup.  If you have a lightmap farm setup in the Infernal Engine, each computer on the farm will also run in parallel.   (Another topic for future discussion:  For modern video game lighting, are lightmaps even necessary?)</p>
<p>Conclusion</p>
<p>Quite obviously, parallelization in your game editor is just as important as parallelization in your game loop.  There is an almost endless amount of work that designers, artists, and programmers actually do inside the game editor, and the faster this work gets done, the more time gets spent on polishing your game.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/05/10/game-editor-parallelization-in-the-infernal-engine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Faster Scenery Displacement Mapping</title>
		<link>http://software.intel.com/en-us/blogs/2010/02/09/faster-scenery-displacement-mapping/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/02/09/faster-scenery-displacement-mapping/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 20:52:26 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/02/09/faster-scenery-displacement-mapping/</guid>
		<description><![CDATA[Oftentimes, the simplest and most elegant algorithms are obvious, but take the longest to figure out. In one level for an Infernal Engine game, we wanted to fully displacement map every pixel in our scenery, but we couldn't afford the complexity of the shader on each and every pixel. One solution would be to [...]]]></description>
			<content:encoded><![CDATA[<p>Oftentimes, the simplest and most elegant algorithms are obvious, but take the longest to figure out.  In one level for an Infernal Engine game, we wanted to fully displacement map every pixel in our scenery, but we couldn't afford the complexity of the shader on each and every pixel.  One solution would be to have two sets of shaders - for the near polygons, use a displacement map shader, and for the far polygons, use the non-displacement map shader.  An easier way was to add an HLSL dynamic branch hint (the [branch] attribute), based upon the depth of the input pixel, as follows:</p>
<pre><code>
struct PS_INPUT {
    float2 uv : TEXCOORD0;
    float4 clipPos : TEXCOORD1;
    .
    .
    .
};

struct PS_OUTPUT {
    float4 color : COLOR0;
};

sampler2D map;
sampler2D displacementMap;
.
.
.

PS_OUTPUT displacementMapShader(PS_INPUT input) {
    PS_OUTPUT output;
    .
    .
    .
    float2 uvDisp = input.uv;
    [branch]
    if (input.clipPos.z &lt; 20.0F) {
        // Expensive code follows to perform the actual displacement mapping
        .
        .
        .
    }
    // Perform displaced or non-displaced texture lookup(s)
    float4 color = tex2D(map,uvDisp);
    .
    .
    .
    return(output);
}
</code>
</pre>
<p>There are some different rules for the PC and the consoles about when you are allowed to use a dynamic branch, but you should be able to make this little trick work for your specific case!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/02/09/faster-scenery-displacement-mapping/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fun with CULV and Ubuntu 9.10</title>
		<link>http://software.intel.com/en-us/blogs/2010/01/11/fun-with-culv-and-ubuntu-910/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/01/11/fun-with-culv-and-ubuntu-910/#comments</comments>
		<pubDate>Mon, 11 Jan 2010 21:12:31 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[CULV]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/01/11/fun-with-culv-and-ubuntu-910/</guid>
		<description><![CDATA[For the past few years, Ubuntu has been my favorite Linux distribution. With a fresh install of Ubuntu Desktop 9.10 on my CULV Core2 Duo SU7300 Timeline 1810T, I was getting around 3.5 hours of battery life. This was not even half the promised battery life (8 hours) from the manufacturer. Diagnosing and fixing the [...]]]></description>
			<content:encoded><![CDATA[<p>For the past few years, Ubuntu has been my favorite Linux distribution.  With a fresh install of Ubuntu Desktop 9.10 on my CULV Core2 Duo SU7300 Timeline 1810T, I was getting around 3.5 hours of battery life.  This was not even half the promised battery life (8 hours) from the manufacturer.</p>
<p>Diagnosing and fixing the battery life</p>
<p>To see how much battery you use at any given time, use the following command line:</p>
<p>$ cat /proc/acpi/battery/BAT0/state</p>
<p>To figure out what was chewing up my battery life, I started with Intel's Open Source PowerTOP application.  The latest version is available in the standard repositories, so all you have to do is install it with a simple Terminal command:</p>
<p>$ sudo apt-get install powertop</p>
<p>And run it with another:</p>
<p>$ sudo powertop</p>
<p>PowerTOP will let you know what is eating your battery life.  For this computer, the culprit was the wifi, which was not surprising in the least, as I would see the battery life greatly improve with it turned off.  The wifi was set to maximum power by default, and the current build of Ubuntu ships a version of the driver with the power savings mode disabled to boot.  Fortunately, there is a workaround for this: install the backport modules.</p>
<p>$ sudo apt-get install linux-backports-modules-karmic</p>
<p>We can write a script that will automatically set the wifi driver into powersave mode, as well as do the other commands that PowerTOP recommends.  We'll create a file called "99-savings" with the following contents:</p>
<p>#!/bin/bash<br />
iwconfig wlan0 power on<br />
echo min_power &gt; /sys/class/scsi_host/host0/link_power_management_policy<br />
echo 1500 &gt; /proc/sys/vm/dirty_writeback_centisecs<br />
# Auto Suspend Unused USB Buses (Timeout in seconds)<br />
  for i in /sys/bus/usb/devices/usb?/power/autosuspend<br />
  do echo 1 &gt; "$i"<br />
  done</p>
<p>Then make it executable with the following command:</p>
<p>$ chmod +x 99-savings</p>
<p>Then install it:</p>
<p>$ sudo install 99-savings /etc/pm/sleep.d<br />
$ sudo install 99-savings /etc/pm/power.d</p>
<p>A quick restart of the machine should put you into power savings mode.</p>
<p>Disable Swap</p>
<p>The easiest way to disable swap is to just not use it when you install.  I have not needed swap in the past 2 years of using Ubuntu, even with the lightweight app development and debugging that I do for fun.  If you installed swap, you can disable the partition in your /etc/fstab file.  Please make a backup copy before changing this important file.  Note that if you use any managed memory software like Java, you will probably need swap.</p>
<p>Disable Bluetooth</p>
<p>If you don't use Bluetooth, you can turn it off with the button on the front of the machine.  You can also turn off the Bluetooth service if you want, but it seems to stay asleep if not activated.</p>
<p>Install Adblock Plus into Firefox</p>
<p>Adblock Plus will help your battery life by not loading advertisements, which lowers both your wifi bandwidth and your CPU usage.  Also, if you choose not to install the closed source Adobe Flash Player, that will help your battery life as well.</p>
<p>Playback of HD Media</p>
<p>As far as I know, there is no way to use Intel's Clear Video technology on Linux.  However, you can still get great HD playback by using the power of your CULV processor.  If you are lucky enough to have a dual core CULV, you can compile the multithreaded version of mplayer yourself.  You will be able to play back 1080p clips flawlessly this way.  Follow the instructions at http://www.mplayer.hu to compile your own mplayer.</p>
<p>If you have a single core machine, you may have to turn off the loop filter for x264 playback.  So add "lavdopts=skiploopfilter=all" to your "~/.mplayer/config" file.</p>
<p>If all you want is easy 720p playback, vlc will do this for you:</p>
<p>$ sudo apt-get install vlc</p>
<p>Conclusion</p>
<p>I now have 7.5 hours of web surfing with a full battery under Ubuntu.  Considering the machine is rated at 8 hours and that is probably lowest brightness and wifi off, this is a great number.  It is a bit lower than I am getting under Windows 7, but there is more work that I could do to reach that number.  I am by no means a Linux guru.  Please comment below if you have other tips on Linux power savings!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/01/11/fun-with-culv-and-ubuntu-910/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fun with CULV</title>
		<link>http://software.intel.com/en-us/blogs/2010/01/04/fun-with-culv/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/01/04/fun-with-culv/#comments</comments>
		<pubDate>Mon, 04 Jan 2010 22:36:54 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[CULV]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/01/04/fun-with-culv/</guid>
		<description><![CDATA[CULV is the line of Intel's new ultra low voltage processors for laptops - this means higher performance and great battery life. It gives you more computing power when you need it (it even can run the Infernal Engine Editor!!!), hardware HD video playback, and stellar battery life that you are used to with your [...]]]></description>
			<content:encoded><![CDATA[<p>CULV is the line of Intel's new ultra low voltage processors for laptops - this means higher performance and great battery life.  It gives you more computing power when you need it (it even can run the Infernal Engine Editor!!!), hardware HD video playback, and stellar battery life that you are used to with your Atom.  Being lucky enough to have purchased a new Acer Aspire Timeline 11.6" with a 1.3GHz Core 2 Duo SU7300 (they are still in very short supply) I will talk about some things you can do to get the most out of your hardware and battery life under Windows 7...</p>
<p>Hardware HD Playback</p>
<p>Probably the best feature of the CULV is the ability to play back 1080p HD video flawlessly with the GMA4500HD chipset using Intel's Clear Video technology.  With the right software, the CPU alone has enough power for playback, but you will get better battery life by letting the GPU do it.  Since I mostly play back x264 videos, here is what I did to make this work:</p>
<p>1.  Install the latest version of Media Player Classic Homecinema from the CCCP (Combined Community Codec Pack) Project.  You may have to download and install the latest DirectX runtime from Microsoft as well.</p>
<p>2.  In the Options/Playback/Output dialog, select "EVR" (or "EVR Custom Presenter" if you need subtitle support)</p>
<p>3.  In the Options/Playback dialog, select "Auto-load subtitles" if you need subtitle support</p>
<p>4.  In the Options/Internal Filters dialog, on the Hardware Transform side, select "H264/AVC (DXVA)".</p>
<p>5.  In the Options/External Filters dialog, you will need to Add Filter "ffdshow video decoder", and set it to "Block".  You may have to block "DirectVobSub (auto loading version)" as well if you have previously enabled it.  We'll be using MPC-HC's internal subtitle renderer anyhow.</p>
<p>6.  If you play other formats, enable them in the Options/Internal Filters dialog, such as "Xvid/MPEG-4".</p>
<p>To make sure this is working after setting the above settings, play a video clip and watch your CPU usage using Sysinternals.  It should take about 12% for a 720p video.  You can also right click on the video, click on MPC Video Decoder, and it should show the DXVA Mode as H.264 bitstream decoder, ClearVideo(tm).  You may want to switch from the Realtek audio driver to the default Windows audio driver to save CPU usage.</p>
<p>There is also a beta Adobe Flash 10.1 driver that will hardware accelerate video playback as well for you Hulu junkies...CPU usage on it is still a bit high, but hopefully they will get this fixed before the final release.</p>
<p>How to maximize your battery life...</p>
<p>Making sure your laptop sleeps and wakes up correctly</p>
<p>Out of the box, my laptop would sleep and wake up nicely, but somewhere along the way the Atheros AR8131 Gigabit LAN driver was updated, and it caused the computer to wake up out of sleep and consume the battery rapidly.  To verify this, download Sysinternals from Microsoft and make sure your CPU usage doesn't spike uncontrollably after coming out of sleep mode.  If it does, you probably need to rollback your driver to the version that came with your machine.  After this, I recommend testing optional driver updates instead of blindly installing them.</p>
<p>If you decide to reinstall your OS clean, make sure you install every driver including the webcam, network, etc.  They may all affect your battery life and wakeup from sleep usage.  Make sure your computer is indeed idle when idle!</p>
<p>Lowering your brightness</p>
<p>This monitor is extremely bright.  I keep my brightness at 1 click above the lowest setting.  It is plenty bright and saves battery life substantially.  Also, in the GMA Properties Tray, you can select "Display Settings" and put the GPU into power saving mode for added savings.</p>
<p>Replace your mechanical HDD with an SSD</p>
<p>I was lucky enough to pick up an 80GB Intel X25-M SSD from a Black Friday sale.  Adding this drive increased my battery life by about an hour.  Not to mention that my boot time went down to about 10 seconds.   Since I went from 320GB to 80GB, I disabled hibernation, and reclaimed about 4GB of valuable disk space.  To disable hibernation and remove the file, start an Administrator Command Prompt, and type "powercfg -h off".</p>
<p>I also disabled swap to save disk space and increase battery life.  Win7 is very good about reminding you when you are close to running out of memory.  So far, I haven't come close to running out with 4GB in this machine.</p>
<p>Bump up your memory speed</p>
<p>I replaced the dual channel DDR2-667 memory that came with the laptop with dual channel DDR2-800 memory.   CPU usage dropped proportionately when playing back 1080i MPEG2 video captures from my MythTV box.  This means the CPU spends more time in C6 idle mode, and you get better battery life.  Playback time went up from around 3.5 hours to 4.0 hours for 1080i videos because of less CPU usage.</p>
<p>Bluetooth and Wifi</p>
<p>If you don't have any Bluetooth devices, you can keep Bluetooth disabled.  Just use the button on the front of the laptop to turn it off.  With it on, it will use an incremental amount of power, and saving every little bit counts when maximizing battery life.  If you don't change it, your wifi adapter should already be on maximum power savings.  To verify this, use the Control Panel, go to Power Options, change your plan settings, and view the Advanced Power Options.</p>
<p>Use a lightweight anti-virus program</p>
<p>I recommend either Avast or Microsoft Security Essentials.  Both are very minimal on your system resources and both are free for home use.</p>
<p>Something unrelated to battery life, but related to viruses:  I highly recommend you set your UAC (User Account Control) settings to "Always Notify".  If you're a heavy web surfer like me, you won't be subject to a drive-by web attack without at least being warned first.  UAC works great under Windows 7.  I write and debug software daily and am never annoyed by it anymore after upgrading from Vista.  Go to Control Panel-&gt;User Accounts-&gt;Change User Account Settings.  This may be the single most important anti-virus step you can make.</p>
<p>Keep your machine cool</p>
<p>Make sure that you don't block the ventilation holes when using your laptop.  If the machine gets hot, the fan will start running and you will consume more battery life.  This machine is tricky - it takes air in from underneath, and blows it out the side.  So putting it on a table works better than leaving it on your lap.</p>
<p>Final results</p>
<p>Surfing the web, I can get up to 9 hours of battery life under actual use (10 hours if I stick to the media light Wikipedia), much higher than the generous 8 hours that the computer is rated at with the lowest brightness and wifi off.   With hardware video playback of 720p video, I can get around 5 hours and 30 minutes, and with 1080i or 1080p, I can get just over 4 hours.  Idling the laptop with a full battery, it shows around 12 hours with the display on, and around 16 hours after keeping the display closed for a while.  Amazing!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/01/04/fun-with-culv/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Highlights and Challenges During Ghostbusters Development, Part 4</title>
		<link>http://software.intel.com/en-us/blogs/2009/07/07/highlights-and-challenges-during-ghostbusters-development-part-4/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/07/07/highlights-and-challenges-during-ghostbusters-development-part-4/#comments</comments>
		<pubDate>Tue, 07 Jul 2009 16:07:21 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[GPA]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/07/07/highlights-and-challenges-during-ghostbusters-development-part-4/</guid>
		<description><![CDATA[Synchronization between threads in the Infernal Engine Thread synchronization is a complicated problem and rarely discussed in practice. We came to our own conclusions via experimentation and what worked well for us during the production of Ghostbusters. Ghostbusters used two kinds of synchronization primitives, "crude locks" and "critical sections". A crude lock is the lowest [...]]]></description>
			<content:encoded><![CDATA[<p>Synchronization between threads in the Infernal Engine</p>
<p>Thread synchronization is a complicated problem and rarely discussed in practice.   We came to our own conclusions via experimentation and what worked well for us during the production of Ghostbusters.</p>
<p>Ghostbusters used two kinds of synchronization primitives, "crude locks" and "critical sections".   A crude lock is the lowest form of synchronization - it lets one thread sit in a loop until the other thread lets it continue.  Here is how a simple implementation of a crude lock class could look in C++...</p>
<p>class CCrudeLock {<br />
    volatile long value;<br />
public:<br />
    CCrudeLock() { value = 1; }<br />
    void lock() {<br />
        for (;;) {<br />
            if (1 == InterlockedCompareExchangeAcquire(&amp;value,0,1)) {<br />
                return;<br />
            }<br />
        }<br />
    }<br />
    void unlock() {<br />
        value = 1;<br />
    }<br />
};</p>
<p>As you can see in the lock code, if one thread tries to acquire the lock, and it is busy, it will sit in a tight loop burning CPU time until it gets the lock.  The InterlockedCompareExchangeAcquire function is an atomic function that will synchronize the access of a variable across multiple threads.  In assembly language, it could look like the following:</p>
<p>    mov ecx,dword ptr [esp+4]        ; Get the address of value into ecx<br />
    mov edx,dword ptr [esp+8]        ; Get "0" into edx<br />
    mov eax,dword ptr [esp+12]        ; Get "1" into eax<br />
    lock cmpxchg dword ptr [ecx],edx ; Compare the value of eax with the destination, and if equal, write edx into the destination<br />
    ret 12                                    ; eax = return value</p>
<p>So the guts of the InterlockedCompareExchangeAcquire function is the "cmpxchg" instruction with the "lock" prefix, which preserves the order of operations so that more than one processor does not interrupt this single instruction, i.e., it functions atomically.</p>
<p>On a cache coherent multiprocessor, releasing the crude lock is as simple as writing back to the memory location.  If another thread is attempting to acquire the lock, the atomic interlocked function will guarantee the order of operations.</p>
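<p>For readers without the Windows interlocked functions, roughly the same crude lock can be sketched with C++11 std::atomic.  This is an assumed equivalent, not the Ghostbusters code: CrudeLockSketch and countWithTwoThreads are hypothetical names, and compare_exchange_weak plays the role that the "lock cmpxchg" instruction plays above.</p>

```cpp
#include <atomic>
#include <thread>

// Portable sketch of the crude lock (hypothetical class, not engine code).
class CrudeLockSketch {
    std::atomic<long> value;
public:
    CrudeLockSketch() : value(1) {}  // 1 = free, 0 = held, as in CCrudeLock

    void lock() {
        for (;;) {
            long expected = 1;
            // Atomically: if value == 1, store 0 and report success.
            if (value.compare_exchange_weak(expected, 0,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
                return;
            }
            // Otherwise keep spinning; this burns CPU until the lock is free.
        }
    }

    void unlock() {
        // Release ordering publishes the protected writes to the next owner.
        value.store(1, std::memory_order_release);
    }
};

// Usage: two threads bump a shared counter, each increment under the lock.
long countWithTwoThreads(int perThread) {
    CrudeLockSketch lock;
    long counter = 0;  // only touched while the lock is held
    auto work = [&lock, &counter, perThread] {
        for (int i = 0; i < perThread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

<p>Without the lock, the two threads would race on the counter and the total would usually come up short.  With it, every increment is counted.</p>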
<p>The other type of synchronization primitive we used is the standard Windows CRITICAL_SECTION structure.  This structure is well documented, although the implementation may not be, so we will discuss how it might work internally.  A standard critical section is similar to the crude lock, with the loop being finite - if a certain amount of loop iterations happen, and it cannot acquire the lock, then yield the thread for an amount of time and start over.</p>
<p>class CCriticalSection {<br />
    volatile long value;<br />
public:<br />
    CCriticalSection() { value = 1; }<br />
    void lock() {<br />
        int loopCount = 0;<br />
        for (;;) {<br />
            if (1 == InterlockedCompareExchangeAcquire(&amp;value,0,1)) {<br />
                return;<br />
            }<br />
            loopCount++;<br />
            if (loopCount &gt; 400) {<br />
                // Spun long enough: give up the rest of this time slice<br />
                SwitchToThread();<br />
                loopCount = 0;<br />
            }<br />
        }<br />
    }<br />
    void unlock() {<br />
        value = 1;<br />
    }<br />
};</p>
<p>Note that the loopCount value is local to the lock function, so each thread keeps its own count and any number of threads can attempt to acquire our critical section at the same time.</p>
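<p>The spin-then-yield idea can also be written with standard C++11 facilities.  The sketch below is an assumption about how such a lock might look, not the actual CRITICAL_SECTION implementation: SpinThenYieldLock and countUnderContention are hypothetical names, std::this_thread::yield stands in for the yield call, and the limit of 400 spins is simply carried over from the example above.</p>

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical spin-then-yield lock; not the real CRITICAL_SECTION internals.
class SpinThenYieldLock {
    std::atomic<long> value;
public:
    SpinThenYieldLock() : value(1) {}  // 1 = free, 0 = held

    void lock() {
        int loopCount = 0;  // local, so every thread counts its own spins
        for (;;) {
            long expected = 1;
            if (value.compare_exchange_weak(expected, 0,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
                return;
            }
            if (++loopCount > 400) {
                // Spun long enough: let another thread run, then start over.
                std::this_thread::yield();
                loopCount = 0;
            }
        }
    }

    void unlock() {
        value.store(1, std::memory_order_release);
    }
};

// Usage: several threads increment one counter under the lock.
long countUnderContention(int threadCount, int perThread) {
    SpinThenYieldLock lock;
    long counter = 0;  // only touched while the lock is held
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&lock, &counter, perThread] {
            for (int i = 0; i < perThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& thread : threads) {
        thread.join();
    }
    return counter;
}
```

<p>Yielding matters most when there are more software threads than hardware threads, because a waiting thread that keeps spinning can delay the very thread that holds the lock.</p>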
<p>When is it appropriate to use a crude lock instead of a critical section?</p>
<p>In the Core i7 processor, Intel introduces 4 cores with 2 hardware threads each, also known as hyperthreading.  Although each core has two register contexts, one per thread, the two threads share a single set of execution units.  So if thread 1 is blocked on a core, thread 2 can start to execute if possible.   If two threads try to acquire a crude lock while they are both executing on the same core, a livelock situation can occur, because the spinning thread competes for execution resources with the thread that holds the lock.  If you are running on a Core2 Quad or other multi core, non hyperthreaded machine, you will not have this situation occur.</p>
<p>Ghostbusters was a multiplatform title, with the PC, Xbox 360, and PS3 supported.  For the PC and PS3, we used critical sections to synchronize code.  The PC could be hyperthreaded, and the PS3 is hyperthreaded.  Although the Xbox 360 is hyperthreaded, you are able to select which thread on which core you can run your code on, so in the end we were able to use the lighter crude lock for thread synchronization, because we could guarantee what hardware threads our game would run on.</p>
<p>Note that whenever you have a crude lock or critical section in your code, you need to guarantee that you will be accessing that resource for only a very short amount of time.  Staying inside of a critical section of your code for a long time will have negative effects on performance.</p>
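<p>One way to keep the locked region short is to do the expensive work first and take the lock only to publish the result.  The sketch below assumes std::mutex as a stand-in for either lock type, and SharedResults, expensiveComputation, and computeAndPublish are hypothetical names.</p>

```cpp
#include <mutex>
#include <vector>

// Hypothetical shared state guarded by a lock.
struct SharedResults {
    std::mutex lock;
    std::vector<int> values;
};

// Stand-in for real work; runs with no lock held.
inline int expensiveComputation(int input) {
    int result = 0;
    for (int i = 1; i <= input; ++i) result += i;
    return result;
}

inline void computeAndPublish(SharedResults &shared, int input) {
    const int result = expensiveComputation(input);   // outside the lock
    std::lock_guard<std::mutex> guard(shared.lock);   // held only for the push
    shared.values.push_back(result);
}
```

<p>The lock is released as soon as guard goes out of scope, so other threads only ever wait for the duration of a single push_back.</p>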
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/07/07/highlights-and-challenges-during-ghostbusters-development-part-4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Highlights and Challenges During Ghostbusters Development, Part 3</title>
		<link>http://software.intel.com/en-us/blogs/2009/06/30/highlights-and-challenges-during-ghostbusters-development-part-3/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/06/30/highlights-and-challenges-during-ghostbusters-development-part-3/#comments</comments>
		<pubDate>Tue, 30 Jun 2009 15:05:05 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/06/30/highlights-and-challenges-during-ghostbusters-development-part-3/</guid>
		<description><![CDATA[Game Optimization Challenges for Modern Hardware Although we seem to have hit a ~3GHz limit in processor speed, Moore's law may still be holding as more and more cores are added to a processor at this speed. As processors have gotten faster and faster, memory latency has gotten longer and longer over time. This means [...]]]></description>
			<content:encoded><![CDATA[<p>Game Optimization Challenges for Modern Hardware</p>
<p>Although we seem to have hit a ~3GHz limit in processor speed, Moore's law may still be holding as more and more cores are added to a processor at this speed.  As processors have gotten faster and faster, memory latency has gotten longer and longer over time.  This means that understanding how the architecture you are working on accesses memory is critical to the execution speed of your program, and for the first time, maybe even more important than the algorithms themselves.  Let me give you an example...</p>
<p>For collision detection with actors that move in the world, every level in Ghostbusters (and the Infernal Engine) is partitioned with a BSP tree.   Each node in the tree contains a linked list of actor pointers that are in that node.  Note that actors may not be in the leaves of the tree, but may be pushed up until one node fully contains the actor.</p>
<p>Here is a simplified example of what our world and actors looked like in previous generations of code...</p>
<p>class CActor {<br />
    .<br />
    .<br />
    .<br />
    CVector wPos;<br />
    float wRadius;<br />
    .<br />
    .<br />
    .<br />
    CActor *nextActorInBSP;<br />
    .<br />
    .<br />
    .<br />
};</p>
<p>struct SBSPNode {<br />
    float a,b,c,d;<br />
    SBSPNode *left,*right;<br />
    CActor *firstActorInBSP;<br />
};</p>
<p>void    raytraceActorsInNode(SBSPNode *node,CVector *wRayStart,CVector *wRayEnd) {<br />
    CActor *a = node-&gt;firstActorInBSP;<br />
    while (a != 0) {<br />
        checkCollision(&amp;a-&gt;wPos,a-&gt;wRadius,wRayStart,wRayEnd);<br />
        a = a-&gt;nextActorInBSP;<br />
    }<br />
}</p>
<p>So this is a very simple code example running through a linked list of actors and performing a simplified raytrace on them.   The problem with this code, though, is execution speed.  While this type of code worked well in previous years, it does not perform well on modern hardware due to the potential for cache misses while running through a linked list.   Certain hardware can have a penalty of over 500 clock ticks for an L2 cache miss and 37 clock ticks for an L1 miss.  Neither is acceptable if you want to run your code at 3GHz.   To ensure we run at maximum speed, we rearrange our structures as follows:</p>
<p>New BSP structure:</p>
<p>struct SActorPosRad {<br />
    CActor *actor;<br />
    CVector wPos;<br />
    float wRadius;<br />
};</p>
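<p>struct SBSPNode {<br />
    float a,b,c,d;<br />
    SBSPNode *left,*right;<br />
    Array&lt;SActorPosRad&gt; actorList;<br />
};</p>
<p>void    raytraceActorsInNode(SBSPNode *node,CVector *wRayStart,CVector *wRayEnd) {<br />
    int    i;<br />
    // count() is an assumed name for the Array element-count accessor<br />
    for (i = 0; i &lt; node-&gt;actorList.count(); i++) {<br />
        SActorPosRad *a = &amp;node-&gt;actorList[i];<br />
        checkCollision(&amp;a-&gt;wPos,a-&gt;wRadius,wRayStart,wRayEnd);<br />
    }<br />
}</p>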
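<p>The checkCollision function is not shown in this post.  As a rough idea of what a simplified raytrace against an actor's bounding sphere might look like, here is a sketch that tests whether the ray segment passes within wRadius of wPos.  The CVector layout and the bool return value are assumptions made for the example, not the engine's actual definitions.</p>

```cpp
#include <algorithm>

// Assumed layout; the engine's real CVector is not shown in the post.
struct CVector { float x, y, z; };

// Hypothetical stand-in: true if the segment wRayStart..wRayEnd comes
// within wRadius of the point wPos (a segment versus sphere test).
inline bool checkCollision(const CVector *wPos, float wRadius,
                           const CVector *wRayStart, const CVector *wRayEnd) {
    // Segment direction and vector from the segment start to the sphere center.
    const float dx = wRayEnd->x - wRayStart->x;
    const float dy = wRayEnd->y - wRayStart->y;
    const float dz = wRayEnd->z - wRayStart->z;
    const float px = wPos->x - wRayStart->x;
    const float py = wPos->y - wRayStart->y;
    const float pz = wPos->z - wRayStart->z;

    // Parameter of the closest point on the segment, clamped to [0, 1].
    const float lengthSq = dx * dx + dy * dy + dz * dz;
    float t = 0.0f;
    if (lengthSq > 0.0f) {
        t = (px * dx + py * dy + pz * dz) / lengthSq;
        t = std::max(0.0f, std::min(1.0f, t));
    }

    // Squared distance from the sphere center to that closest point.
    const float cx = px - t * dx;
    const float cy = py - t * dy;
    const float cz = pz - t * dz;
    return cx * cx + cy * cy + cz * cz <= wRadius * wRadius;
}
```

<p>Everything the test reads (position, radius, and the two ray endpoints) is a handful of floats, which is why packing wPos and wRadius together in SActorPosRad pays off.</p>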
<p>The Array template is a very simple dynamically sized array.  As long as actors don't move very far in one frame of a game, the contents of each node's array mostly don't change from frame to frame.  Insertion and deletion treat the array as a linear list, and the array is preallocated to a minimum size, so that allocation (and deallocation) will almost never happen.  This could even be a statically sized array, with actors that overflow a node pushed up in the tree.</p>
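<p>Our Array template itself is not shown here, so the following is only a sketch of the idea with std::vector standing in for it.  NodeActorList and its member names are hypothetical, and removing an entry by overwriting it with the last one is one possible choice that assumes the order of actors within a node does not matter.</p>

```cpp
#include <cstddef>
#include <vector>

struct CVector { float x, y, z; };   // assumed layout
class CActor {};                     // placeholder; only the pointer is stored

struct SActorPosRad {
    CActor *actor;
    CVector wPos;
    float wRadius;
};

// Stand-in for a node's actor array (assumption: std::vector).
struct NodeActorList {
    std::vector<SActorPosRad> items;

    // Preallocate so that normal insertions never hit the allocator.
    explicit NodeActorList(std::size_t minCapacity) { items.reserve(minCapacity); }

    void insert(const SActorPosRad &entry) { items.push_back(entry); }

    // Linear search, then overwrite the hit with the last entry and shrink.
    // Entry order is not preserved.
    bool remove(const CActor *actor) {
        for (std::size_t i = 0; i < items.size(); ++i) {
            if (items[i].actor == actor) {
                items[i] = items.back();
                items.pop_back();
                return true;
            }
        }
        return false;
    }
};
```

<p>Both operations walk memory front to back, and neither one allocates or frees as long as a node stays under its reserved capacity.</p>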
<p>This is a case where we will have to unlearn what has been taught in computer science classes for decades: a linked list may not be the fastest way to store data in memory anymore.  We're violating multiple rules we've been taught - not using a linked list, and using a linear search for inserting and deleting actors in this list.  Linear searches are slow, right?</p>
<p>Linear searches (if used properly) can be extremely fast due to the memory access pattern.  The cache line on a modern system is anywhere from 64 bytes to 128 bytes in size.  That means that to read any memory location, the whole cache line around it has to be read in as well.  So you get free, full-speed access to the neighboring data.  If you are constantly hopping around in memory, as a linked list traversal does, you will stall on each dereference while the memory around your pointer is brought in.</p>
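<p>To put rough numbers on this, the sketch below counts how many cache lines a traversal touches.  The function names are made up, the array is assumed to start on a cache line boundary, and the 24 byte entry size assumes an 8 byte pointer plus four floats for SActorPosRad (it would be 20 bytes with 4 byte pointers).</p>

```cpp
#include <cstddef>

// Cache lines touched when reading `count` entries packed back to back,
// assuming the array starts on a cache line boundary.
constexpr std::size_t contiguousLinesTouched(std::size_t count,
                                             std::size_t entryBytes,
                                             std::size_t lineBytes) {
    return (count * entryBytes + lineBytes - 1) / lineBytes;   // round up
}

// Worst case for a linked list: every node sits on a different cache line.
constexpr std::size_t scatteredLinesTouched(std::size_t count) {
    return count;
}
```

<p>With 100 actors of 24 bytes each, the packed array spans 38 cache lines of 64 bytes, while 100 scattered list nodes can cost 100 line fills, or more if the position and the next pointer of an actor land on different lines.</p>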
<p>So while linear searches may not be applicable for all types of data, they can certainly be used for tight game code.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/06/30/highlights-and-challenges-during-ghostbusters-development-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Highlights and Challenges during Ghostbusters Development, Part 2</title>
		<link>http://software.intel.com/en-us/blogs/2009/06/24/highlights-and-challenges-during-ghostbusters-development-part-2/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/06/24/highlights-and-challenges-during-ghostbusters-development-part-2/#comments</comments>
		<pubDate>Wed, 24 Jun 2009 17:42:05 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/06/24/highlights-and-challenges-during-ghostbusters-development-part-2/</guid>
		<description><![CDATA[Game Loop Parallelization in the Infernal Engine In the old days of single processor computers, your game loop would run every process for the game in single step, the results were 100% deterministic. Your game loop looked much like the following: 1. Run the tick code for every actor 2. Perform rigid body simulation 3. [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Game Loop Parallelization in the Infernal Engine</strong></p>
<p>In the old days of single processor computers, your game loop would run every process for the game in a single step, and the results were 100% deterministic.  Your game loop looked much like the following:</p>
<p>1.  Run the tick code for every actor<br />
2.  Perform rigid body simulation<br />
3.  Process particle effects<br />
4.  Figure out what is visible<br />
5.  Render your set<br />
6.  Render your actors<br />
7.  Render your particle effects<br />
8.  Show the frame<br />
9.  Repeat</p>
<p>With the advent of multiprocessor computers, game programming has to be a lot more complicated in order to take full advantage of all of the processors in the system.  Given a 3GHz Core2 Quad Extreme and a fast enough video card, Ghostbusters is able to keep all 4 cores 100% utilized when heavy action is occurring.   We'll discuss how we accomplished this feat and what you can do for your own game engine.</p>
<p>When we started on the next generation systems four years ago, we took a good look at the PS3, 360, and PC platforms.  The 360 had 3 general purpose cores, the PS3 had one general purpose core plus 6 coprocessors called SPUs, and the best PCs had two cores (we couldn't even imagine the Core i7 at this point).   As a cross platform engine, we had to come up with a model for multiprocessing that would limit the amount of specialized coding for each system.</p>
<p>We used the PS3 "job" model as the basis for our multithreading model for all systems.  The PS3 has one general purpose processor, which we used for our game loop, and for kicking off jobs that could run on the SPUs.  Since the PC and 360 do not have SPUs, we created as many extra job threads as there are CPUs in the system.  Each job queue thread (whether running on an SPU on the PS3 or on a CPU on the PC) would sit in a suspended state, and be woken up only if there was a job ready to process.  The job would then be processed, and the thread would check for another job to grab.  If there was another job ready, it would start, otherwise the thread would go back to sleep.  Jobs also need the ability to queue up more jobs.  I'll talk about our job queue more in-depth in the future.</p>
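<p>As a rough illustration of that worker behavior (and not our actual job queue), here is a minimal sketch built on std::thread and std::condition_variable.  JobQueue, push, and runChainedJobs are hypothetical names, and a condition variable stands in for whatever suspend and wake mechanism each platform provides.</p>

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobQueue {
    std::mutex mutex;
    std::condition_variable wake;
    std::queue<std::function<void()>> jobs;
    std::vector<std::thread> workers;
    bool shuttingDown = false;

    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> guard(mutex);
                // Sleep until a job is ready or the queue is shutting down.
                wake.wait(guard, [this] { return shuttingDown || !jobs.empty(); });
                if (jobs.empty()) return;          // shutting down and drained
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();                                 // run unlocked; may call push()
        }
    }
public:
    explicit JobQueue(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }
    void push(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> guard(mutex);
            jobs.push(std::move(job));
        }
        wake.notify_one();                         // wake one sleeping worker
    }
    // Lets the workers drain the remaining jobs, then joins them.
    ~JobQueue() {
        {
            std::lock_guard<std::mutex> guard(mutex);
            shuttingDown = true;
        }
        wake.notify_all();
        for (auto &w : workers) w.join();
    }
};

// Hypothetical usage: every job queues one follow-up job.
inline int runChainedJobs(unsigned workerCount, int jobCount) {
    std::atomic<int> done{0};
    {
        JobQueue queue(workerCount);
        for (int i = 0; i < jobCount; ++i) {
            queue.push([&queue, &done] {
                queue.push([&done] { ++done; });   // a job may queue more jobs
                ++done;
            });
        }
    }   // destructor drains the queue and joins the workers
    return done.load();
}
```

<p>Calling runChainedJobs(3, 100) runs 100 jobs plus their 100 follow-up jobs across three workers and returns 200.</p>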
<p>Our new parallel game loop looks like the following:</p>
<p>1.  Lock our physics simulation<br />
2.  Update each actor's position from physics simulation, queue up animation jobs, run tick code on each actor<br />
3.  Unlock the physics simulation<br />
4.  Kick off physics simulation<br />
5.  Process particle effects<br />
6.  Queue up visible objects in a display list<br />
7.  Kick off display list rendering job<br />
8.  Repeat</p>
<p>Note that when we queue up the display list, it contains the full state of what needs to be rendered without relying on any game data.  This requires copying data into the display list, such as the animation state of an actor, or instanced data for a particle effect.  Actor states need to be able to change while we are rendering the previous frame's data.  If there were multiple rendering passes, the display list data could be reused for those passes rather than entering them multiple times.</p>
<p>The Infernal Engine also had a distinct advantage for actor simulation - each actor was physically simulated as a rigid body or a constrained system of bodies, so the collision and movement would happen inside the physics engine.  To guarantee order of operations, especially for the AI, we still tick each actor in serial, but most of the actual work happens as jobs now.</p>
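<p>The copy-by-value idea can be sketched as follows.  ActorState, DisplayItem, and buildDisplayList are hypothetical names with made-up fields; the real display list carries far more state, and visibility culling is left out.</p>

```cpp
#include <vector>

// Hypothetical game-side state that keeps changing every tick.
struct ActorState { int id; float x, y, z; int animFrame; };

// Everything the renderer needs for one object, copied by value.
struct DisplayItem { int actorId; float x, y, z; int animFrame; };
using DisplayList = std::vector<DisplayItem>;

// Snapshot the actors into a display list that holds no pointers
// or references back into game data.
inline DisplayList buildDisplayList(const std::vector<ActorState> &actors) {
    DisplayList list;
    list.reserve(actors.size());
    for (const ActorState &a : actors)
        list.push_back({a.id, a.x, a.y, a.z, a.animFrame});
    return list;
}
```

<p>Because the list owns its data, the game loop can move an actor or advance its animation right after the snapshot and the frame being rendered will not see the change.</p>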
<p>Our physics engine, Velocity, was also rewritten to be massively parallel and run solely in the job queue.  Before we parallelized it, it looked like the following:</p>
<p>1.  Compute broad phase collision<br />
2.  Compute narrow phase collision one pair at a time<br />
3.  Divide up rigid bodies into islands<br />
4.  Solve islands one at a time</p>
<p>After converting Velocity to use jobs, it looked like this:</p>
<p>1.  Compute broad phase collision (fast single threaded job)<br />
2.  Queue up jobs for each narrow phase collision (massively parallel)<br />
3.  Divide up rigid bodies into islands (fast single threaded job)<br />
4.  Queue up jobs for each physics island, or sub-island if we had too many bodies (massively parallel)</p>
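<p>How Velocity divides bodies into islands is not covered in this post.  One common way to do it is a union-find pass over the contact pairs, sketched below with hypothetical names (IslandBuilder, addContact, countIslands).  Bodies are identified by index and each contact is a pair of body indices.</p>

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Union-find over body indices; bodies connected by contacts share a root.
class IslandBuilder {
    std::vector<int> parent;

    int find(int body) {
        while (parent[body] != body) {
            parent[body] = parent[parent[body]];   // path halving
            body = parent[body];
        }
        return body;
    }
public:
    explicit IslandBuilder(int bodyCount) : parent(bodyCount) {
        std::iota(parent.begin(), parent.end(), 0);   // each body starts alone
    }
    void addContact(int a, int b) {
        const int rootA = find(a);
        const int rootB = find(b);
        parent[rootA] = rootB;                        // merge the two islands
    }
    // One label per body; bodies with the same label belong to one island.
    std::vector<int> labels() {
        std::vector<int> out(parent.size());
        for (std::size_t i = 0; i < parent.size(); ++i)
            out[i] = find(static_cast<int>(i));
        return out;
    }
};

inline int countIslands(int bodyCount,
                        const std::vector<std::pair<int, int>> &contacts) {
    IslandBuilder builder(bodyCount);
    for (const auto &c : contacts) builder.addContact(c.first, c.second);
    const std::vector<int> labels = builder.labels();
    int islands = 0;
    for (int i = 0; i < bodyCount; ++i)
        if (labels[i] == i) ++islands;                // each root is one island
    return islands;
}
```

<p>Five bodies with contacts 0-1, 1-2, and 3-4 form two islands, so the solver could be given two independent jobs.</p>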
<p>The results of having a massively parallel game engine were stunning.  When we finally got rendering and simulation of the game in parallel in the last weeks of Ghostbusters, the game became solely render bound.  Jobs were totally asynchronous, and we were able to fully utilize 3 to 4 cores.  When there wasn't any action in the game, the game was waiting on the vertical blank.  With a lot of action, the job model allowed the heavy lifting to be absorbed over as many processors as the system had.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/06/24/highlights-and-challenges-during-ghostbusters-development-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Highlights and challenges during the Ghostbusters development, Part 1</title>
		<link>http://software.intel.com/en-us/blogs/2009/06/15/highlights-and-challenges-during-the-ghostbusters-development-part-1/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/06/15/highlights-and-challenges-during-the-ghostbusters-development-part-1/#comments</comments>
		<pubDate>Mon, 15 Jun 2009 20:29:37 +0000</pubDate>
		<dc:creator>Mark Randel</dc:creator>
				<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Graphics & Media]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/06/15/highlights-and-challenges-during-the-ghostbusters-development-part-1/</guid>
		<description><![CDATA[Ghostbusters was an unusually long project for us - we started in January 2006 with a prototype. For the first nine months of development, we were working on recreating the ballroom scene where Slimer is captured from the first movie, obtaining the movie license, and getting a green light to develop the project. At the [...]]]></description>
			<content:encoded><![CDATA[<p>Ghostbusters was an unusually long project for us - we started in January 2006 with a prototype.  For the first nine months of development, we were working on recreating the ballroom scene where Slimer is captured from the first movie, obtaining the movie license, and getting a green light to develop the project.  At the same time we were working on Ghostbusters, we knew we had something special with the Infernal Engine.  A few former Terminal Reality employees wanted to use our technology to create a game for themselves, and hence our engine licensing effort began as well.</p>
<p>We started off knowing we would be a cross platform game.  We had early PS3 hardware, early 360 hardware, and the dual core PC was the fastest money could buy.  Multithreaded game programming was just a concept - we had experience on the PS2 where you could get two coprocessors running independently from the main CPU, but nobody had made a game yet with multiple cores in mind.   As the lowest common denominator, we chose the PS3 multithreading model, with one main CPU controlling the game loop and multiple coprocessors performing small "job" tasks as the main game loop needed work done.</p>
<p>When we got final development hardware, we knew multiprocessing was the way to go.  At this point, the consoles were faster than their PC counterparts, and we had to reinvent how the Infernal Engine worked.  Ghostbusters was conceived and sold as a game with highly destructive environments, so our physics engine, Velocity, was the first part of the game to be multi-threaded.</p>
<p>Threading Velocity was quite a fun task and we are still making improvements to this date.  Velocity consists of two main parts, collision detection and the physics solver.   Collision detection is an obvious problem to thread - if you had 5000 processors and 5000 bodies, you could do 5000 collisions in one step.  However, the rigid body solver was a big challenge to thread.  If you have multiple "islands" of bodies touching each other, then you can solve the islands in parallel.  However, if you have one giant island of rigid bodies, like a huge pile of library books, things get really tricky as each body's accumulated forces depend slightly on its neighbors.</p>
<p>We also knew that for a game coming out in 2009, the bar would be quite high for particles and special effects.  We set out to write our 3rd particle engine, appropriately named Trifecta.  Trifecta would be a material driven particle system giving the artist more freedom than ever, with no code involvement.  We also decided early on to make it use a new technique called "soft particles."   Soft particles blur the intersections between the particles and the scenery and eliminate the billboard effects from previous efforts.  For the first time, you could have realistic smoke fill up a room!   Trifecta was originally written as single threaded - up to this point, our previous particle engines never took a lot of CPU time to simulate or render, but towards the end of Ghostbusters, it was clear to us that we were spending over 25% of our frame time on just special effects when the action was heavy.</p>
<p>We almost didn't ship Ghostbusters with Trifecta running in parallel - it was a highly object oriented system with layers and layers of effects that could be added to each particle system instance.  During development, we could not figure out any reasonable method for making such a highly object oriented system simulate and render in parallel.  Only after we thought we were done with the game and had a week's worth of rest did we figure it out.  It turned out cloning the particle instance data, rather than using traditional methods of overlapping rendering and simulation, worked extraordinarily well.  What took us almost a year to figure out was coded in two days, and slipped into the game, almost doubling the frame rate on the Xbox 360 and PC versions.   We could now even keep 4 cores fully busy (provided you have a very fast GPU) on your PC!!!   Most review copies of Ghostbusters went out without this final improvement.</p>
<p>One more challenge we had was to fit Ghostbusters on a single DVD-ROM not only for the Xbox 360, but for the PC version as well.  In order to fit an hour's worth of HD video on the ROM where every bit counted, we would need a modern compression system.  Luckily for us, the Open Video format was announced in November 2008, and we were able to quickly recode it to run in parallel on the 360 and PS3 in order to be fast enough.  We now had high quality 720p HD video at 6000kbps with 10 channels of audio.  (5.1 audio with 4 extra center channel languages for those of you counting...)</p>
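<p>To see why every bit counted, here is a back-of-the-envelope calculation.  It assumes that 6000kbps means 6000 x 1000 bits per second for the whole stream and that the video runs a full hour; videoBytes is a made-up helper name.</p>

```cpp
#include <cstdint>

// Bytes needed for a stream of the given bitrate and duration,
// taking 1 kilobit as 1000 bits.
constexpr std::uint64_t videoBytes(std::uint64_t kilobitsPerSecond,
                                   std::uint64_t seconds) {
    return kilobitsPerSecond * 1000 / 8 * seconds;
}

// One hour at 6000 kbps is 2,700,000,000 bytes, or about 2.7 GB.
static_assert(videoBytes(6000, 3600) == 2700000000ULL,
              "one hour at 6000 kbps should be 2.7 GB");
```

<p>Under those assumptions, an hour of video takes roughly 2.7 GB of the disc before any game data is added.</p>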
<p>Ghostbusters was an expensive project to make, and it switched publishers twice.  Sierra was dissolved after Vivendi Games was purchased by Activision.  Atari purchased the worldwide rights, but then sold their European distribution rights to Sony.  Each publisher had their own vision for what the game was, and we had our own vision as well.  After the first publisher change, our lead level designer took over the reins of the game and did a great job keeping the team focused on finishing it.</p>
<p>In future posts, I will discuss more of the technical challenges in depth.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/06/15/highlights-and-challenges-during-the-ghostbusters-development-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

