Achieving Performance: An Approach to Optimizing a Game Engine

by Will Damon

Introduction

Wanting to try out something new for GDC 2002, a few game developers got together and invented the Dogma 2001 Challenge*. Chris Hecker and Sean Barrett, two of the original organizers, were brainstorming ideas for games and started to think about how many sprites could be rendered in real time with modern PC hardware, and more importantly, what kinds of interesting games could be created with such huge numbers of characters. The result was the 0th Indie Game Jam*. Twelve games were created, all of them built on the same engine, designed to handle 100,000 sprites in real time. Looking for an opportunity to demonstrate how a few subtle changes can positively impact the performance of a game engine on an Intel® Pentium® 4 processor-based system, I figured this engine provided an interesting subject.

This open-source engine was fairly well optimized before I even had a chance to open up the workspace, as we will see in the performance analysis. The original engineers really wanted to stress the hardware while maintaining a level of generality so that any number of games could be built with this engine as the core. Nevertheless, as our investigation reveals in the "Tuning Hotspots" section below, making a few slight modifications here and there can help make an already great game engine even better.

The goal for optimizing any game engine essentially equates to saving as many CPU cycles as possible in compute-intensive sections of the code so that they may be used for something more interesting. For example, saving a few thousand clock cycles on scene management or occlusion culling might not perceptibly boost the overall frame rate (anything beyond 85Hz becomes benchmark machismo anyway). However, it will allow the engine to spend those "extra" cycles on something like smarter or more complex AI, more realistic physical simulations, simply processing more characters, or anything else for that matter. The goal for this paper is to paint a step-by-step illustration of the optimization process for a game engine, and hopefully inspire you to try some of these techniques on your own.

Getting Started

The number one place to start is to determine a baseline performance figure that will become the metric against which we measure our success. Assuming access to the source code we hope to optimize, we start off by getting a build environment up and running. Usually, this simply means opening the project or make file and making sure all the relative dependencies are correct. Typically, everything is set up properly, but if the project will not compile, make sure you have a valid project: check for things like hard-coded paths in the environment settings, incompatible libraries, headers, and so forth. In this particular case, I had the DirectX* 8.1 SDK installed, but the source code used DirectInputCreate() and linked to dinput.lib, which was only available in DirectX 7.0 and earlier versions. My options at this point were to either define the DirectX version in a header or source file, or update the code to use the latest publicly available libraries. I chose the second approach: I replaced DirectInputCreate() with DirectInput8Create(), and linked to dinput8.lib (a sketch of the replacement call follows below). After that one modification, the project compiled and linked beautifully. Notice that at this point I haven't tried to integrate any new code or tools into the build process. I want a build environment similar (if not identical) to the one originally used to create the code, so that I can establish the best baseline performance possible.
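For reference, here is roughly what the replacement looks like. This is a minimal sketch rather than the engine's exact code; the hInstance variable and the error handling are placeholders:

    #define DIRECTINPUT_VERSION 0x0800
    #include <dinput.h>   /* link with dinput8.lib (and dxguid.lib for the IID) */

    LPDIRECTINPUT8 pDI = NULL;

    /* DirectInput8Create replaces the legacy DirectInputCreate() call */
    HRESULT hr = DirectInput8Create(hInstance, DIRECTINPUT_VERSION,
                                    IID_IDirectInput8, (void **)&pDI, NULL);
    if (FAILED(hr))
    {
        /* handle the error */
    }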

The project we're building compiles to a game called Angry God Bowling*. This particular source tree can build several different games based on a few #defines, but Angry God Bowling is the one that best lets us reproduce a similar workload every time we run the game. It is worth stressing the importance of a reproducible workload when talking about performance profiling. Ideally, the workload represents typical, though intense, game play that stresses all the components of the engine, specifically those components that may be good candidates for optimization. Angry God Bowling does a good job of stressing what we're interested in: the core components of the engine, such as scene management and collision detection. We could have chosen any of the games, as they all use the same core components, but this game provides one of the better workloads for our purposes, and was suggested by the creators of the engine as the benchmarking application.


Performance Analysis

Once the project compiles and runs without any problems, we can move to the performance analysis. The performance analysis tool used is the Intel VTune™ Performance Analyzer 6.0. The Pentium III and Pentium 4 processor families expose event counters that the VTune Performance Analyzer monitors during application runtime. This Event-Based Sampling, or EBS, is a low-overhead (less than 1%), system-wide profiling that helps identify which modules and functions are consuming the most time, and can give a detailed view into an application and the system as a whole. After the monitored program exits, the samples are mapped to modules and stored in a database within the analyzer program. We can then use the module view to see the results.

From the module view, we can drill down to application-specific information, known as the "hotspots view." The hotspots view helps us identify potential performance problems in classes and functions. Optionally, we may view hotspots by process, thread, module, function, or source file in a graphical view or a table format. The VTune Performance Analyzer has a robust feature-list that spans well beyond the scope of this paper; however, if you want to learn more about the tool, please visit the Intel® VTune™ Performance Analyzer web site. For our baseline performance measurements, we'll use a basic EBS project that runs for 20 seconds and monitors events that usually indicate coding pitfalls.

Counting More Than Clocks

Monitoring other events in addition to clockticks and instructions retired can often reveal potential causes of what appear to be pipeline stalls. Some such events are those that map directly to common coding pitfalls: 64K aliasing conflicts, split loads retired, MOB load replays retired (blocked store-forwards retired), SSE input assists, x87 input assists, and x87 output assists. Each one of these events indicates that the source code contains certain sequences of instructions that are potentially unfriendly to the microarchitecture in one way or another. MOB (memory order buffer) load replays retired, for example, indicates that store-to-load forwarding restrictions are not being observed. The Pentium 4 processor and Intel Xeon® processor use a store-to-load forwarding technique to enable certain memory load operations (loads from an address whose data has just been modified by a preceding store operation) to complete without waiting for the data to be written to the cache. There are size and alignment restrictions for store-to-load forwarding cases to succeed, and when a restriction is not observed, the memory load operation stalls.
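To make the restriction concrete, here is a minimal illustration (the variable names are hypothetical, not from the engine) of a pattern that triggers MOB load replays: a load that is not fully contained within a single preceding store cannot be forwarded.

    unsigned short halves[2];

    halves[0] = low;    /* 16-bit store */
    halves[1] = high;   /* 16-bit store */

    /* A 32-bit load overlapping both stores: neither store covers the
       whole load, so forwarding fails and the load must wait for the
       stores to reach the cache (an MOB load replay). */
    unsigned int combined = *(unsigned int *)halves;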

Evaluating when coding pitfalls are causing performance hits is difficult to do without the help of the analyzer. To determine whether there is something in the implementation that can be done to help speed things up, we profile the application with all the events we want to monitor. Though the VTune analyzer is capable of collecting data on multiple events simultaneously, it usually runs more than one sampling session to collect all the data because certain events cannot be monitored at the same time. After the sampling sessions are complete, we are presented with a graph similar to Figure 1.

Looking at the modules

After running a sampling session, the VTune Performance Analyzer displays the following graph. Each colored bar represents a different sampled event, and the pop-up box summarizes all the data.



Figure 1. A module view of application runtime performance

As you can see, the game consumed the majority (but not all) of the CPU cycles during the sampling session. In fact, this particular game doesn't really stress the Pentium 4 processor that much; only ~63% of the sampled clock cycles came from the game itself. The rest of the cycles were sampled mainly from the video drivers and the OS kernel. This is interesting because it raises the question of what (or whether) we should optimize: low-level code (which involves breaking out the optimization manual, dealing with latency tables, and so forth), or high-level algorithms and utilization of the rendering APIs.

Since this is our first look at the project, let's assume that we can achieve some kind of performance win. At this point, we don't know yet whether the gains will come from hand-optimizing a few routines, restructuring the entire rendering algorithm, trying a different compiler, or some combination of all of these. All we know at the moment is that we think it is possible to improve the engine's runtime performance. The remainder of this paper focuses on optimizing the engine code as it was written.

Drilling down to the hotspots view gives us more information about what is going on during application runtime. Figure 2 shows the hotspots view for the game. Notice that simulate_guy() consumes almost 35% of the clockticks, with a poor CPI (clockticks per instruction retired) of 2.063, more than two clocks for every instruction retired. Also notice that _ftol2 (floating-point to integer conversion) is the third largest spike.



Figure 2. A hotspots view of igj0.exe

At this point, we can take a couple of different approaches. We can either dive into the code, looking for areas where we can apply our knowledge of the microarchitecture to enhance performance, or we can think about the tools we are using, and how we are applying them.


Considerations: The Compiler

Before diving into hand-coded optimizations, the first things to look at are the project settings and build environment. Make sure all the appropriate optimization flags are set, and that the compiler is building a "Release" version of the project. The more aggressive you are with optimization flags, the more aggressive the compiler can be in generating fast binaries. Another consideration to keep in mind is the possibility of using a different tool for the job; in this case, consider using a different compiler than the one that ships with the development environment. Alternative compilers are either free (such as gcc), or have a free, fully functional trial version available for download (such as the Intel C/C++ Compiler). Typically you would want to try out such alternative tools first, because they can potentially provide a big win with a small time investment and very little effort. Our next step is to integrate the Intel C++ Compiler, and then run a new profile to measure the impact this tool has on the project.

Integrating the Intel® C++ Compiler

Converting the Visual Studio* .NET* project to Intel C++ Compiler association is quite easy. Just make sure the solution is not open in Visual Studio .NET, and run the command-line tool, slnconverter.

Now that the project is associated with the Intel tool, open it in the IDE, and then pull up the project properties. Notice that under the Configuration Properties there is now a new tab: Intel® Specific. This is the integration utility that lets you select a default compiler for the project. Select the Intel Compiler, and then select the command line options from the C++ tab.

Intel C++ Compiler Command Line Options

The Intel C++ Compiler can help in targeting a Pentium 4 processor-based system (though it is not limited to the Pentium 4 processor; it can target any Intel processor, as well as a generic x86 machine). We can tell the compiler to aggressively schedule instructions for the Intel NetBurst™ microarchitecture (the core microarchitecture of the Pentium 4 processor) with the –G7 switch. Additionally, we can invoke the auto-vectorizer with –QaxW, which tells the compiler to turn on the vectorizer and generate specialized code specifically for the Pentium 4 processor. Alternatively, we could have used –Qvec-, which disables vectorization but continues to generate processor-specific code. In most cases it is preferable to allow the compiler to detect operations in the program that can be done in parallel, and then convert the sequential code to process 2, 4, 8, or 16 elements in one operation (depending on the data type) by automatically using SIMD instruction sets (MMX™ technology, SSE, and/or SSE2).

The compiler also has the ability to send messages to the output debug window to let you know what it was able to vectorize, as well as what wasn't vectorized and why. The level of diagnostic detail is controlled with the –Qvec_reportn compiler option: when n is set to 0, no diagnostic information is generated, while setting n = 3 generates the most diagnostic information.
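As a concrete example of what the vectorizer looks for (this loop is illustrative, not from the engine), a simple loop over packed floats like the following can be converted to process four elements per SSE operation, and –Qvec_report3 will report whether it succeeded and, if not, why:

    /* Scales an array of positions in place; a canonical
       auto-vectorization candidate: unit-stride accesses and no data
       dependences across iterations. */
    void scale_positions(float *pos, float factor, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            pos[i] *= factor;
    }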

Let's set these three flags (–G7 –QaxW –Qvec_report3), apply the changes, and rebuild the project.


Revisiting the Modules

Sometimes using the Intel compiler as opposed to other compilers makes only a slight difference. Other times, such as in this case, the Intel compiler effectively optimizes the code for our target platform. The peak frame rate jumped from 90 fps to 100 fps; that's about an 11% increase in performance, and we didn't even touch the code. The profile changed slightly, too, as the Intel compiler inlined different routines, and generated different instructions for certain code sequences, such as converting a floating-point data type to an integer. Figure 3 illustrates the new profile of hotspots. Note that two of the spikes from the first run are now missing: simulate_guy() and _ftol2.



Figure 3. A profile of igj0.exe after rebuilding igj0.exe with the Intel C++ Compiler

The function simulate_guy() did not go away completely. Rather, the compiler inlined the code into drawAndRun(), and in fact, we can see that samples were still taken from simulate_guy() by drilling down to the source level in the VTune analyzer, as shown in Figure 4.



Figure 4: The source level view of simulate_guy()


Tuning Hotspots

As Figure 4 shows, the routines FLOAT2INT_BUILDING_{X|Y}() and BUILDING_TEST() are causing some stalls due to 64K aliasing conflicts and MOB load replays retired. To understand what is happening, and perhaps how we can resolve some of these issues, let's take a closer look at each of these routines.

Bit Operations: Calculating BUILDING_TEST(x, y)

Algorithm

BUILDING_TEST(x, y) is actually a macro used as part of the guy simulation to determine whether a guy has run into a building or the edge of the map. If we were looking at the source without a runtime performance analysis, we would probably not even glance at this bit of code, simply because it is so straightforward. However, because this code executes for every guy in the scene, a little latency adds up to a lot, especially when the result of the test determines the target of a branch.

Let's take a look at this macro to see if there is any optimization opportunity. Here is what the macro expands to:

    building_blocked[((y * BUILDING_X + x) >> 3) +
                     (BUILDING_X >> 3) + 1] & (1 << ((y * BUILDING_X + x) & 7));

Presumably the compiler converts the math on constants to a single constant, which is indeed the case if we take a look at the produced assembly. However, some math is left to runtime because of the (x, y) data dependencies, as sketched below.
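For illustration only (BUILDING_X is a compile-time constant in the engine; the value 256 here is a hypothetical stand-in), the work that survives to runtime reduces to:

    /* With BUILDING_X == 256, (BUILDING_X >> 3) + 1 folds to 33 at
       compile time; only the index math involving x and y remains. */
    int cell = y * 256 + x;                    /* y * BUILDING_X + x */
    int hit  = building_blocked[(cell >> 3) + 33] & (1 << (cell & 7));

Now let's consider how much time is actually spent on this code.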

Performance

On Intel Xeon and Pentium 4 processors, the shift operator has a latency of 4 clocks with a throughput of 1 per clock, while the ALU is capable of completing 2 adds every clock (latency and throughput of 0.5). Normally such small amounts of latency are not really a contributing factor, but call this macro even a fraction of 25,000 times per frame at well over 60 frames per second, which works out to more than a million calls per second, and that latency adds up fast, most of it coming from the shifts.

Improved Implementation

The terrain map is stored as an array of bytes, with each bit representing a cell on the map. Consequently, we can replace the second half of the calculation with a very small lookup table, and theoretically reduce the clocks spent on this test. Here is what the new code looks like:

    static int building_bit_field[] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };

    /*
        Use a lookup table to accelerate the guy-hit-building test
    */
    inline uint8 BuildingTestFast(int x, int y)
    {
        return building_blocked[((y * BUILDING_X + x) >> 3) +
                                (BUILDING_X >> 3) + 1]
             & building_bit_field[(y * BUILDING_X + x) & 7];
    }

 

The deeply pipelined and speculative nature of the Intel NetBurst microarchitecture makes it difficult to calculate exactly how much time this change saves, but running the application shows a slight improvement on average. Another sampling with the VTune Performance Analyzer confirms that it is indeed this fix that played a role in the improvement. Notice in Figure 6 that the clocktick and instructions-retired samples have both dropped by ~100, and the 64K aliasing conflicts have diminished by ~500. All in all, that is a pretty significant improvement for such a subtle change.

Type Conversions: Floating-point to Integer

Before we switched to the Intel compiler, we saw that _ftol2 was the third largest spike in the application. Though the routine is never called explicitly, the compiler generated calls to _ftol2 in order to perform the necessary conversion. Sometimes it is difficult to see how dependent code is on compiler-generated library calls, so I did a call graph analysis to get a clear picture. Figure 5 shows the call graph profile, highlighting the reliance on _ftol2 (before using the Intel compiler).



Figure 5. Call graph profile of igj0.exe before compiling with the Intel Compiler. Darker oranges represent a higher number of function calls, while grays represent a lower number of function calls.

Let's take a closer look at a small snippet of code from one of the more frequent offenders, MapGetBilerpSample(), to see why code generated by the Intel Compiler is faster in this case.

This bit of C code looks straightforward enough:

    float iy = floor(cy - 0.5);
    uint i0 = uint(iy);

Here is the assembly code produced by the Visual Studio .NET C++ compiler:

    ; float iy = floor(cy-0.5);
        fld     DWORD PTR _cy$21886[esp+52]
        fsub    QWORD PTR __real@3fe0000000000000
        fstp    QWORD PTR [esp]
        call    DWORD PTR __imp__floor

    ; uint i0 = uint(iy);
        call    __ftol2

This looks like what we would expect, but _ftol2 is causing some performance hits that we would rather avoid. In fact, it is causing almost 30% of the total MOB load replays retired, seriously blocking store-forwarding. Now let's take a look at the assembly generated by the Intel C++ Compiler to see why _ftol2 and its corresponding performance penalties do not show up in the new profile.

    ;;; float iy = floor(cy-0.5);
        fld     QWORD PTR [esp+104]                 ;509.26
        fadd    QWORD PTR _2il0floatpacket.217      ;509.26
        fstp    QWORD PTR [esp]                     ;509.26
        call    DWORD PTR __imp__floor              ;509.26

    ;;; uint i0 = uint(iy);
        fxch    st(1)                               ;512.24
        fnstcw  [esp+24]                            ;512.24
        movzx   eax, WORD PTR [esp+24]              ;512.24
        or      eax, 3072                           ;512.24
        mov     DWORD PTR [esp+32], eax             ;512.24
        fldcw   [esp+32]                            ;512.24
        fistp   QWORD PTR [esp+16]                  ;512.24
        fldcw   [esp+24]                            ;512.24

The Intel C++ Compiler chose to inline the code necessary to do the conversion from floating-point to integer, using a sequence of instructions that is friendlier to the Intel NetBurst microarchitecture, based on the switches we gave at compile time. The result is a much faster runtime: a spike of roughly 7% has been eliminated, and its remaining cost is spread throughout the entire application as inlined code.
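As an aside, when plain C truncation semantics are all that is needed, SSE can do the whole job in a single instruction, with no control-word traffic at all. This is a sketch of the general idea rather than anything the engine does; the next section applies the same family of instructions to the engine's scaled conversions:

    #include <xmmintrin.h>

    /* cvttss2si truncates directly, so no fnstcw/fldcw pair and no
       store/reload through memory is required. */
    static inline int ftoi_trunc(float f)
    {
        return _mm_cvttss_si32(_mm_load_ss(&f));
    }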

While the benefit of replacing _ftol2 with inline code is substantial for this game, we can still achieve slightly better performance by fine-tuning certain high-profile float-to-integer conversions. We achieve this by using SSE to do a scalar conversion of the data. This also allows us to alleviate some of the 64K aliasing conflicts, as well as retire more instructions per clock, compared to the FLOAT2INT_BUILDING macros.

Algorithm

The FLOAT2INT_BUILDING macros convert a floating-point value to an integer while scaling it. Again, let's expand the macro to understand what the code does:

    static double BUILDINGsekrit_temp;
    static double BUILDINGsekrit_temp2;

    #define FLOAT2INT_BUILDING_X(d) ((BUILDINGsekrit_temp = \
        (((d)-BUILDING_SCALE_X/2)+BUILDING_SCALE_X*6755399441055744.0)), \
        *((int *)(&BUILDINGsekrit_temp)))

    #define FLOAT2INT_BUILDING_Y(d) ((BUILDINGsekrit_temp2 = \
        (((d)-BUILDING_SCALE_Y/2)+BUILDING_SCALE_Y*6755399441055744.0)), \
        *((int *)(&BUILDINGsekrit_temp2)))
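The large constant is the classic double-precision "magic number" trick: 6755399441055744.0 is 1.5 x 2^52, and adding it to a value (in the default round-to-nearest mode) leaves the rounded integer in the low 32 bits of the double's mantissa. Multiplying the constant by the scale, as the macros do, folds a divide-by-the-scale into the same add, provided the scale is a power of two. Here is a minimal standalone sketch of the unscaled trick; this is illustrative, not the engine's code:

    /* Assumes a little-endian x86 target and round-to-nearest FPU
       mode; valid for |d| < 2^31. */
    static inline int fast_ftoi(double d)
    {
        union { double d; int i[2]; } u;
        u.d = d + 6755399441055744.0;   /* 1.5 * 2^52 */
        return u.i[0];                  /* low word of the mantissa */
    }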

 

Most of this math is done at compile time, but there is a data dependency on d, and the integer cast's return value is determined at runtime. The Intel Compiler uses inlined code to cast the double-precision floating-point value to a 32-bit integer, but going from single precision to double precision and back to an integer may still incur some penalty. To resolve this potential penalty, we can use scalar SSE instructions to convert a floating-point value to an integer while scaling. Here is what the code looks like:

    static const __m128 V_BUILDING_SCALE_X_RCP =
        _mm_set_ss(1.0f / BUILDING_SCALE_X);

    /* guy points to the character currently being simulated */
    int zx = _mm_cvtss_si32(_mm_mul_ss(_mm_load_ss(&guy->loc.x),
                                       V_BUILDING_SCALE_X_RCP));

Performance

We're avoiding an expensive (runtime) floating-point divide by creating a constant reciprocal value of the scaling factor. Then we simply multiply the location of the current guy by this reciprocal scaling factor and convert the value to an integer result. We can do the same thing for the "y" component, and avoid 64K aliasing conflicts and some MOB noise. The end result by itself is not a noticeable jump in frame-rate, but combined with earlier optimizations to the building test, we see an additional 2% increase in performance (~3 fps).
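For completeness, the matching "y" conversion mirrors the code above (the guy pointer name is assumed from context):

    static const __m128 V_BUILDING_SCALE_Y_RCP =
        _mm_set_ss(1.0f / BUILDING_SCALE_Y);

    int zy = _mm_cvtss_si32(_mm_mul_ss(_mm_load_ss(&guy->loc.y),
                                       V_BUILDING_SCALE_Y_RCP));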



Figure 6. A source level view of the optimized simulate_guy()


Final Results

Using the Intel C++ Compiler with optimization flags (–G7 –QaxW), combined with a couple of hand-coded optimizations, we are able to achieve an awesome performance increase of roughly 15%, with a peak frame rate of 103 fps.

Possible Directions

This paper only touches on enhancing the performance of the hotspots in a module by tweaking the code of those particular hotspots, and using an advanced optimizing compiler. While these techniques are effective, there are many other ways to approach optimization, ranging from modifying high-level algorithms to rearranging the core data structures of the engine. For example, by looking at the events profiled above, one might determine that the drawHF() function, which showed up as a hotspot in our original VTune analyzer profile, is suffering from blocked store-forwarding, and might choose to render the height-field (terrain map) with a different approach than making lots of calls into the OpenGL* API, such as glVertex3f(). In terms of data structures and algorithms, it may be possible to modify the entire drawAndRun() function to handle four guys simultaneously using the Streaming SIMD Extensions instruction set. Implementing this would probably be a moderate amount of work, and it is difficult to determine whether the return would justify the investment.

This is exactly why it is so important to consider the features of the platform when designing performance-critical algorithms and data structures from the start. While retrofitting SIMD instructions into a scalar algorithm is effective at times, and it may even be possible to "SIMDize" an entire function with very little effort (such as a 4x4 matrix transform), it is crucial to think about employing SIMD as early as the design stage of the data structures (such as choosing a hybrid AoS¹/SoA² vertex data structure aligned to a 16-byte boundary³, as opposed to a simple {x, y, z, u, v} vertex; see the sketch after the footnotes). There are many papers available on design and implementation details, and we encourage everyone writing software to review them.

¹ The abbreviation "AoS" stands for "Array of Structures"

² The abbreviation "SoA" stands for "Structure of Arrays"

³ An optimal hybrid vertex data structure would contain 4 components of each element and be aligned to a 16-byte boundary, e.g. __declspec(align(16)) struct vertex3f_h { float x[4], y[4], z[4]; };
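To make footnote 3 concrete, here is a hedged sketch of how such a structure lets one SSE operation advance four guys at once. The structure and function names are hypothetical, not from the engine:

    #include <xmmintrin.h>

    __declspec(align(16)) struct guys4 {
        float x[4], y[4], z[4];          /* four guys per structure */
    };

    /* One step of a toy integrator: four x coordinates updated with a
       single SSE multiply and add. Repeat the same pattern for y and z. */
    void step_guys4(struct guys4 *pos, const struct guys4 *vel, float dt)
    {
        __m128 vdt = _mm_set_ps1(dt);
        _mm_store_ps(pos->x, _mm_add_ps(_mm_load_ps(pos->x),
                                        _mm_mul_ps(_mm_load_ps(vel->x), vdt)));
    }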


Summary

In this paper, we've seen how interpreting the data that the VTune Performance Analyzer presents can help improve runtime performance through slight modifications to the source code of an engine. Additionally, we've seen that invoking the Intel C++ Compiler with its optimization flags can boost performance on its own. Using a combination of the Intel compiler and hand-tuned code at the C/C++ language level is an effective way to gain big performance wins without investing a great deal of time or massively restructuring the game engine code. Finally, we discussed the importance of considering low-level processor features in the early stages of high-level design.




About the Author

Will Damon was a Technical Marketing Engineer within Intel's Software Solutions Group. He has a bachelor's degree in Computer Science from Virginia Polytechnic Institute and State University*, where he graduated with honors. He has been with Intel for over a year, helping game developers enable their titles to achieve the highest performance possible on Intel® Pentium® 4 processor-based PCs. He welcomes email regarding optimization, mathematics, physics, artificial intelligence, or anything else related to real-time 3D graphics, and gaming.

