Experience and Key Strategies for Load Balancing Games


By Julien Hamaide

Download Article

Download Load Balancing Games (PDF 800KB)

With the increase in processing unit count, load balancing has become a challenge. Constantly loading all units and using all the available processing power requires specifically designed code. Many resources tell you how to organize or optimize your code and data, but few talk about scheduling and load balancing. This article explains how to effectively dispatch units of work to the available CPUs and adjust the game content to the available computing power.

SCHEDULING

Scheduling threads is the operating system’s responsibility. But operating system threads are heavy, and spawning and deleting a thread for each task is overkill. To limit this overhead, games use job systems, which consist of several worker threads that process jobs. The threads are created at the start of the system and are used during the entire game. Jobs can be made of C++ classes or C functions with arguments. Because you don’t use the operating system’s scheduler, however, you must implement a custom scheduler whose role is to dispatch the jobs to all available worker threads. The advantage is that you can set up custom algorithms and adapt them to the game’s needs. The following discussion assumes that the job system implements a dependency manager.

Constraints of a Game Environment

Games are real-time software. For each frame, every system must be updated. Every task must be completed on time or the frame rate will decrease. But not all tasks have the same importance. Tasks can be divided into two categories:

  • Required tasks in a frame. These tasks are required to complete the current frame. Physics, animation, rendering, and input processing tasks are necessary for correctly displaying the current frame. The constraint on this type of task is a hard constraint: Any delay directly introduces a delay in the frame.
  • Frame-delayed tasks. These tasks are important for the game, but their results are not needed immediately. Pathfinding is a perfect example. If this type of task is delayed, it may not be a problem immediately, because the following frames could absorb the delay. Nonetheless, you must ensure that the task is processed sufficiently early.
FIFO Scheduling

The simplest type of scheduling, first-in, first-out (FIFO) schedulers dispatch jobs in the order they were added. Although not sufficient for an entire game, FIFO scheduling can act as a first implementation and reference. This algorithm introduces little overhead: It just picks the last element of a list. That said, it doesn’t try to optimize poor scheduling order, either. Nor does it make any difference between frame-needed jobs and frame-delayed jobs.

One important aspect of a scheduler is its behavior over variable processor count. On gaming consoles, the number of available CPUs is known and doesn’t change over time. On a computer, the story is completely different. The number of CPUs is determined at game startup, and different architectures are available.

If there is only one thread, the scheduler executes tasks one after the other, as if the code were a single thread. Because the overhead is low, performance is similar to a single-threaded version. Because the number of CPUs in computers and consoles is growing, however, the scheduler will have random execution times, depending on the order in which the jobs are pushed and how the critical path is processed.

When the number of processors increases, the scheduler tends to reach the optimal time if it does not introduce latency. Jobs will be processed as soon as they are pushed and their dependencies are resolved.

Priority-based Scheduling

This scheduler type uses the priority you define to order the jobs. Although based on user input, priority-based scheduling can also be updated to increase performance. User input is kept at minimum granularity, so there is no need to give thousands of priority values. A good set contains these priorities:

  • Immediate. Run as soon as a worker thread is ready, or preempt a low-priority thread. This priority is useful if other jobs need this job’s output. For example, an animation could be needed to update the position of a grabbed object.
  • Needed for the end of the frame. This priority is set on jobs that are required but not used before the end of the frame-typically particle system or sound system updates.
  • Low-priority game updates. This priority is for the kind of task you want to execute as quickly as possible but without limiting higher-priority tasks. Examples include pathfinding and saving data to disk.
  • Idle jobs. Those jobs will use idle CPU time. Such jobs can be used to perform background tasks, such as sending statistics to a server or cleaning unused resources.

Unfortunately, some problems will appear. Low-priority and idle jobs might never be triggered if the game overwhelms the CPUs. If the target is 60 frames per second (FPS) and the current CPUs are capable of handling 50 FPS, a new frame will start just after the required jobs are handled. New immediate jobs are directly pushed as the new frame begins, preventing lower-priority jobs from being trigged. To avoid this situation, use a dynamic priority system. To supplement user-level priorities, a second priority value is introduced, the value of which is handled by the scheduler itself. For the low-priority scheduling problem, this value can be increased every time a job misses a launch opportunity. Its priority will then become superior to the priority of immediate jobs. The same problem might appear if an important job depends on a low-priority job. In this case, the priority is transferred to the job on which it depends.

This alternate priority value allows you to sort jobs within a priority category and boost some jobs into the upper category. You can compute the value using different parameters, and you can adapt the scheduler to your needs by tweaking that value. Here are a few of the features that you can use to modify job priority:

  • The number of dependent jobs. If a job blocks lot of other jobs, boosting its priority can be a huge win.
  • The length of the job. The scheduler can keep statistics about job duration. Longer jobs can be scheduled sooner. If they are scheduled too late, the game might wait for its end with no other jobs to execute.
  • Is the job part of the critical path? The critical path is the longer sequence of jobs in the dependency graph. Those jobs must be scheduled with a high priority.

ARCHITECTURE-SPECIFIC PROBLEMS

Architectures can vary among computers. The architecture type, the number of CPUs, the number of cores inside each CPU-everything can affect the low-level behavior.

Typical architectures are symmetric multiprocessor (SMP) and nonuniform memory access (NUMA). In SMP, each CPU has equal access to memory. The memory bus is then most likely the bottleneck. For NUMA, each CPU has its own memory attached. Therefore, it accesses its own memory quickly with its own bus, but accessing other CPUs’ memory can be slow, and latency can be high. Each operating system has its own application programming interface (API) to query system setup. Nothing specific can be done for SMP architectures, but things are different in a NUMA architecture: The scheduler must try to assign jobs on processors that have the data in their local memory. In Windows* operating systems, you can query information about your system using functions such as QueryWorkingSetEx to discover which NUMA node memory is allocated to. Figure 1 shows a four-processor NUMA and SMP architecture.


Figure 1. Typical topology of NUMA and SMP architecture


Cores of a Single CPU

When a CPU is multicore, each core uses a unique arithmetic logic unit (ALU) and Single Instruction, Multiple Data (SIMD) unit. If two computation-expensive jobs are scheduled on each core of a single CPU, the performance will degrade, so the scheduler should avoid scheduling such jobs on the same CPU. Users can tag jobs to be computation expensive, but those jobs are usually easy to spot. Animations or physics jobs are good examples of computation-expensive jobs.

Improve Code Cache Usage

Code is just data, and as such must be cached by the CPU. For better performance, the scheduler must schedule jobs using the same code on the same processor.

 

BALANCING THE SYSTEM: ADAPTING TO VARIABLE CPU COUNTS

If your game targets a wide range of machines, prepare it for multiple resolutions. For example, particle systems can be up-scaled, spawning more particles. Pathfinding can output a smoother path. Animation can use more bones for skeletons. Your game can increase its quality, using all the extra power it can. However, it should also be able to work on low-end machines. Players are used to tweaking the 3D settings-resolution, texture size, shadow definition, and so on-but it would be awkward for a player to tweak the artificial intelligence (AI) update rate or the pathfinder precision. The game must be able to adapt itself to the available power. But how do you prevent the game from running out of resources? What do you sacrifice first?

The AI System

Think of the AI system in three parts: seeing, thinking, and acting.

For an AI-driven character, seeing is acquiring all data needed to make a decision, thinking is combining all this data and making decisions, and acting is applying the decision. These three parts might not be updated at the same frequency. The acting layer is frame-required: The character must do something every frame, the new animation pose must be computed, and so on. But for the other two layers, the requirement is lower. There’s no need to update the sensory data each frame, nor does the current decision need to be challenged with the current data. But how do you determine the system’s update frequency?

One solution is to let the scheduler decide. The sensory system creates a first job with a low priority. If processing power is available, the scheduler will proceed with the job in the current frame. If not, the priority will be increased over time. Some frames later, the job will be finally executed. The trick is to schedule the next update within the current job (with some delay to prevent double-updating in the same frame). Using this technique, the system is updated as often as possible but no more. If the update rate is too low, the system can use extrapolation, such as is done in network play.

To implement such a system, two techniques are available:

  • Blackboards. Variables such as current decision and first visible enemy are continuous values. They could be updated at each frame. A blackboard allows you to keep the latest value of such variables. When an update job is executed, it writes its result to the blackboard but not instantaneously. The blackboard accumulates all changes during the frame and applies them at the end. The other jobs then never read inconsistent values, as variables don’t change during a frame.
  • Future objects. Requests such as pathfinding are less frequent and usually take more time to process. The use of future objects can help in such cases. The future object allows you to send a request and wait for the result. If the result is not yet available, the code falls back to another behavior. Future objects can contain intermediate results, such as first path segments.
The Component System

The main advantage of the component system is its modularity. Each entity is made of a component, each with its own data and executing its own task. Cutting the entities’ update code into pieces eases calling them at different frequencies. If your component system is data-oriented, different tables store each type of component, which are then updated in a row. For a given component type, only half of the table can be updated, thus dividing the frequency by two.

Take particular care about the time step: It should be accumulated over the skipped frame. To compute this step, the system keeps the last time steps, and each component stores the frame index it was last updated. Using the current frame index, the system can add time steps from skipped frames.

The Sound System

For the sound system, the story is completely different. The update is strict: The game must send its audio data on time or a glitch will be heard. You cannot balance by changing the update frequency. Instead, the idea here is to limit the number of sounds that can be played simultaneously. It’s easy for the scheduler to know what percentage of CPU the sound jobs are using, so the user can decide to limit this percentage to a fixed value. The sound system can then adapt its simultaneous play count using this value. If you use Digital Signal Processing (DSP) effects, the system can limit its simultaneous instances.

The Physics System

As most physics systems’ algorithms are based on an integration technique, the time step must be sufficiently small and constant. Some systems require you to break the time step into smaller pieces, and the update runs several times a frame. Users can't decrease the update rate, so the only solution is to decrease the need for CPU time. Several solutions exist:

  • If your gameplay does not depend on the physics system’s behavior, you can change your collision resolution algorithm. A discrete collision algorithm is less precise than continuous collision detection, but it is also less expensive. Changing this algorithm at runtime can be tricky depending on your framework. Generally, this is a start time decision.
  • Lower the distance at which objects are removed from the physical world.
  • Lower the distance at which objects do not interact with other objects but only with static geometry. Most physics systems assign two 32-bit integers to their objects. Those bits are considered layers. The first layer defines which layers of an object belong to the collision group; the second layer determines which layers an object intersects with in the collision mask. Use this system to limit collisions between objects when they are far from the camera. The test is fast and easy to set up, and you can update the bits depending on the object’s distance from the camera.

Because the first technique is only available at start time, use a heuristic to determine whether to activate it. The other solutions can be enabled at runtime.

The Streaming System

The streaming system is responsible for loading data from disk. If the data is compressed, the system will consume some CPU cycles, but you must pay attention to the impact this system can have on the frame rate. A good solution is to use a low priority for the streaming thread. But as in priority-based scheduling, “starving” may occur.

Preparing for the Unexpected

A huge difference between game consoles and computers is the existence of foreign processes. When users launch a game, they rarely close all running applications or shut down the operating system’s services. Some processes might wake up and perform some background work. When load balancing your game, you should always keep a bit of security in your timing measurements. On Windows* machines, GetProcessTimes and GetSystemTimes return the times (idle, user, and kernel) of the game process and the whole system. The application can sample those values and extract the time spent in other processes. The mean value of the recent peak is used as a security. On a six-processor machine with a game running at 60 FPS, the total available CPU time during a frame is 6 CPU * 1/60 FPS = 100 ms. If other processes have peaks around 20 ms per frame, the game should limit itself to 80 ms, even if in 90 percent of frames the other processes use less than 5 ms. As explained later, the load balancer integrates this constraint in its design.

 

THE LOAD BALANCER

You’ve seen different techniques for lowering the required CPU power. But when should the game enable them and in which order? The load balancer is responsible for activating each system in turn. The idea is to degrade or progressively improve each system in turn. Unfortunately, there is no unique solution: The answer really depends on the importance of each system in your game experience. A mood game will surely put more importance on music than a first-person shooter will. So instead of explaining a finished and tuned system, let’s explore the concept and how to tune your game to your needs.

The main load balancing criterion is the frame length (equivalently, the frame rate). The goal is to keep the highest constant frame rate. Even if the game can run at 40 FPS 90 percent of the time and 30 FPS for the last 10 percent, it might be more interesting to keep the game at 30 FPS. Changes in frame rate are more noticeable than a slightly lower frame rate. And keeping a slightly lower frame rate frees some CPU time for streaming and the unexpected.

At the end of each frame, the load balancer receives information about the last frame length and job lengths, then compares this time with previous frame lengths. Table 1 contains the frame length for a given scenario. In this example, the load balancer requires that the update rate be around 50 Hz. The load balancer maintains the list of level of detail (LOD) it can enable or disable (see Table 2). The time each LOD takes is computed at the start of the application and is used as an approximation of the time the tasks required to execute. Every time an LOD is enabled, its time is updated (using a moving mean). When the load balancer detects that the frame length has decreased several frames in a row, it inspects its lists and activates the next LODs. You provide this list, as it depends on what constitutes the more important details you want to keep. The load balancer will always activate those details in order. But when should the game start decreasing the frame rate?

Table 1. Example of Load Balancer Behavior

Frame indexFrame lengthInstantaneous frame rateTime waitingLoad balancer order
1 20 ms 50 Hz 0 ms Don't change
2 21 ms 47.6 Hz 0 ms Don't change
3 19 ms 52.6 Hz 1 ms Don't change
4 20 ms 50 Hz 0 ms Don't change
5 17 ms 58.8 Hz 3 ms Don't change
6 17 ms 58.8 Hz 3 ms Don't change
7 14 ms 71.4 Hz 6 ms Increase quality
8 18 ms 55.5 Hz 2 ms Don't change
9 18 ms 55.5 Hz 2 ms Increase quality
10 20 ms 50 Hz 0 ms Don't change

The load balancer can access other information in the list, as well: the minimum frame rate required to disable a feature. In this example, the developer decided that it’s more important to have precise pathfinding than to have a game at 50 Hz. But if the frame rate were to decrease below 30 FPS, the feature could be disabled. At startup, the load balancer targets a high frame rate but runs at the lowest LOD. The system adds a new LOD each frame until the system finds its equilibrium, preventing the system from stalling the first few frames.

Table 2. Example of an LOD List
LOD DescriptionEstimated time if activatedFrame rate requred
Divide sensory update rate by 2 1 ms 50 FPS
Divide face emotion component update rate by 2 1 ms 50 FPS
Decrease available sound channels by 5 0.5 ms 50 FPS

This technique allows a precise definition of the degradation of the game. Once the system is tuned, it is fully automatic, maintaining the influence of each LOD on the frame length. The frame rate is also linked to LOD selection. Although the design and the code are simple, the results are good. To tweak the system, an edit-and-continue rule system can come in handy. The ability to dynamically simulate lower-end machines (by limiting the number of CPUs) is also important. With those tools in hand, the rule list can be created and tuned in hours. This system was used in an unreleased game and is currently used in a game that will be released in 2012.

CONCLUSION

The scheduler is a complex system. As the conductor, it should be adapted to the score your game is playing. There is no unique solution, but you should select the best algorithm based on the code and data statistics. The algorithm should also be able to adapt to different architecture types. The key to success is analyzing, adapting, and profiling. But a scheduler alone won’t solve the problem. The load balancer is there to finish the work. It serves as an LOD coordinator, linking definition and quality to frame rate. You then have everything you need to shape the best experience for a wide range of machines.

 

About the Author

Julien Hamaide graduated as a multimedia electrical engineer at the Faculté Polytechnique de Mons (Belgium) at the age of 21. After two years of working on speech and image processing at TCTS/Multitel and three years leading a team on next-generation consoles at Elsewhere Entertainment, Julien started his own studio, called Fishing Cactus (www.fishingcactus.com). He has published several articles and spoken about such topics as 10Tacle's movement and interaction system and multithreading applied to AI. He now leads development at Fishing Cactus.

 

RESOURCES

Amdahl, G. (1967). “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities.” AFIPS Conference Proceedings (30):483–485.

Alexandrescu, A. (2001). Modern C++ design: generic programming and design patterns applied.

Hamaide, J. (2008). “Multithread Job and Dependency System.” Game Programming Gems 7.

Leiserson, C. What the $#@! is Parallelism, Anyhow? http://software.intel.com/en-us/articles/what-the-is-parallelism-anyhow-1.

Leiserson, C. Are Determinacy-Race Bugs Lurking in YOUR Multicore Application? http://software.intel.com/en-us/articles/are-determinacy-race-bugs-lurking-in-your-multicore-application.

Multiple Processors. http://msdn.microsoft.com/en-us/library/ms684251.aspx.

Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.