Using OpenMP to Parallelize a Game

Several weeks ago I did an unofficial Google survey to see how much information there was on the topic of using OpenMP for games.  There were not many posts on the subject, and most were rather negative about using OpenMP for games – the general consensus was that OpenMP is fine only for data decomposition of for loops, and since this does not apply to the main game loop, OpenMP was relegated to a minor role in developing parallel games.

I had not found any examples in my web search of anyone using OpenMP to parallelize a game.  It turns out it was relatively easy to parallelize this game demo using OpenMP tasks with some signal handling.  The advantage of using OpenMP, and my reason for wanting to give it a try, is the ability to incrementally parallelize a portion of the code; if you run into problems at run time, you can comment out a single line of code or simply not compile with the /openmp switch – this gets the developer back to the serial baseline whenever it is needed.  This lets a developer test parallel strategies in code relatively quickly.  For example, I was able to test the effect of adding Physics->Update, AI->Update, etc. to the list of parallel tasks – this led to race conditions, so I commented out the two OpenMP pragmas and went about my subsequent investigations. But this is getting ahead of my story…

Undaunted by the lack of examples, I decided to try my hand at using OpenMP tasks to parallelize the DestroyTheCastle game demo that Intel engineers put together for GDC several years ago.  At first I stumbled around trying to figure out just how to parallelize it. I decided to take an approach similar to the way the Intel engineers had previously used Windows QueueUserWorkItem in the original game demo: queuing up tasks and using event handling to coordinate when stages of work were completed and thus ready for rendering.

This approach essentially makes tasks out of the Physics, AI, and Particle calculations.  In the serial baseline of the code, these “tasks” are done one after the other in sequential order.  By using OpenMP tasks I was able to run these three tasks concurrently and achieve a nice speedup.

I was not at all sure OpenMP was even a sensible choice for parallelizing a game, because games are inherently event driven, with event handlers all over the place, and so they feel rather unstructured.  OpenMP requires a structured block of code to parallelize.  A parallel region must have only one entrance and one exit to be valid, and the compiler will complain otherwise.  So how do you make such an event-based application “structured” for the purposes of OpenMP?  In this case I created the parallel region inside the WinMain routine, which is found in ParallelDemo.cpp. The OpenMP parallel region starts near the top of WinMain and ends near the bottom of WinMain.  I then immediately placed a single region within the parallel region so that only one thread actually executed WinMain.  Well, you ask, how does that help?  Remember that the parallel region sets up a pool of threads, while the single construct is a worksharing construct that allows only one thread to execute the enclosed single region.  So what I have done by using this trick is really just create a pool of threads for use at other locations in the code, while leaving the behavior of the WinMain sequence of instructions unaffected.

After building the above mods with OpenMP and running the code to verify that the app still behaved like the serial baseline, I went on to place OpenMP tasks at strategic points in the code.  The version of ParallelDemo.cpp that contains the parallel/serial region trick is called ParallelDemoOpenMPSolutionActivity1.cpp.

After adding OMP task directives around calls to the Physics, AI, and Particle tick methods, I realized that I needed a way to know when the tasks were complete and thus ready for rendering.  I decided to use OMP taskwait directives at strategic locations in the code to ensure that all tasks were complete prior to rendering.  It turns out that, due to the event-based nature of the game, I had to sprinkle taskwait directives at all points where the code could exit.  So I had to include taskwaits in the various event handlers such as MsgProc, KeyboardProc, etc.  Failing to do this caused the app to crash when doing things such as bringing up menus or switching in and out of multi-threaded mode.

I saved this level of mods as ParallelDemoOpenMPSolutionActivity2.cpp.  It represents the fully OpenMP-parallelized version of the code.  Unfortunately, I did not see any speedup at this point.  A little playing around with Intel® Parallel Studio Amplifier helped me identify that I was over-synchronized, so I had to go back and do a little tuning, mainly ridding myself of one call to taskwait in OnFrameMove and replacing it with some event handling built from ResetEvent, SetEvent and WaitForSingleObject.  I decided to try the synchronization scheme used with the Windows threading QueueUserWorkItem approach, where events were used to indicate when tasks were completed.

In the new approach, saved as ParallelDemoOpenMPSolutionActivity3.cpp, I created an array of event handles named s_hTickDoneEvent.  There is an array element for each event I care about: Physics, AI, and Particles.  In this new approach, the events are used to keep track of the state of each task.  The event for Physics is reset prior to executing the Physics->tick task and set when the task is complete.  AI->tick and Particles->tick are handled in a similar fashion, each having its event reset prior to the tick and set immediately afterwards.

It turns out that a big performance gain is achieved by removing just one omp taskwait – the one at the top of FrameMove. Replacing it with calls to WaitForSingleObject, which waits for the associated event to become set, gave a nice speedup. Due diligence was then required to create and destroy the events and to reset and set them around each of my three tasks.  When these steps are complete, the code shows a decent speedup.

To evaluate performance, I ran the demo loop by pressing the “B” key in the demo.  All frame rates were measured with the graphics set to the maximum dimensions on my 22-inch screen. I recorded the minimum and maximum frame rates, but from a user-experience perspective, minimum frame rates were what really made the demo feel fast or slow, so I will restrict my metrics here to just minimum frame rates. My rough performance results on my Intel® Core™2 Quad CPU Q6600 @ 2.40GHz running Windows XP Professional and using DirectX 9.0c are as follows:

11 FPS:    Serial baseline minimum frame rate
12 FPS:    OpenMP Tasks & Taskwaits
60 FPS:    OMP Task w Event signals
99 FPS:    Windows QueueUserWorkItem with Signal Handling

So while the original QueueUserWorkItem with signal handling approach was the fastest threading methodology, the OMP tasks & taskwaits modified with a bit of signal handling was not far behind.  The advantage of the OpenMP method, however, is that I could easily get back to my serial baseline by just commenting out the #pragma omp statements, or else just not compiling with the OpenMP switch.  The other advantage was that I was able to try other parallel strategies quickly: I was able to see the effect of adding Physics->Update and AI->Update to the task list.  It will also afford me the ability to use data decomposition with OMP for at some point in future investigations.

A short PowerPoint gives an overview of the Destroy The Castle demo and the OpenMP parallelism approach. This presentation, as well as all the code and solution & project files for this example, is accessible at /en-us/courseware/course/category.php.  Download it and see if there are other enhancements you can make to this code, and if you use this in your class, please post to my blog to let me know!

Please refer to the Optimization Notice page for more details regarding performance and optimization in Intel software products.