Gaming into the Future

Published: 09/09/2011, Last Updated: 09/09/2011

Research Area: Optimizing games for Multi-core Processors
Authors: Adit Lal, Lakshya Sivaramakrishnan, Nihal Armaan Alji, Shreya Nayak
Faculty mentor: Mrs. Shubha Bulla
Name of the Institution: Sir M Visvesvaraya Institute of Technology
Game engines facilitate rapid game development. Until now, they have been highly optimized to extract maximum performance from single-processor hardware. Physical limits have been reached, and single-core performance gains have slowed to become incremental; consequently, game engine performance gains have also become incremental. Hardware manufacturers are now shifting to multi-core processor architectures. This presents a challenge to developers because of the unfamiliarity and complexity of concurrent programming. Engines must address the issues of concurrency if they are to take advantage of the new hardware. Doubling cores is not the same as doubling clock speed: multi-core processors offer only relatively small performance gains for single-threaded applications; their real power comes from multithreading.
The basic design principle is that the engine should run on any number of cores and achieve a near-linear performance boost as cores are added. What we need is to restructure the algorithms involved in a game engine so that they run in parallel, while keeping their CPU complexity low. How far this is possible depends on the engine and how it was designed.
There is a law, called Amdahl's Law, which states: let P be the parallel fraction of the program and S the serial fraction, with P + S = 1; then the speedup obtainable with N computing units is

Speedup = 1 / (S + P/N)

This paper discusses the issues and approaches that need to be considered in the design of a multi-threaded game engine.

Background: A wave of multi-core panic spread across the Internet about 18 months ago. IT organizations, it was said, urgently had to improve application performance by an order of magnitude in order to cope with rising demand. We would not be able to meet that need because we were at the "end of the road" with regard to step changes in processor power and clock speed. Multi-core technology was the only sure route to improving the speed of applications but, unfortunately, our current "serial" programming techniques, and the limited multithreading capabilities of our programming languages and programmers, left us ill-equipped to exploit it. Multi-core mania gripped the industry.
It is revealing to plot developers' expectations for how the median system evolves over time, alongside the chip introductions and public forecasts from Intel regarding core counts.

Game programmers do need to start working with the current tools and low-level coding techniques that will allow them to exploit multi-core technology. For most programmers today, multi-core is a storm in a teacup. Once 8- and 16-core systems become mainstream, however, a single-threaded application may be leaving an order of magnitude of performance and throughput on the table. That is when multicore enablement moves from "nice to have" to "must have."

For the next few years, the performance gains in new chips will be fueled by three main approaches, only one of which is the same as in the past. The near-term performance growth drivers are:


  • Hyperthreading
  • Multicore
  • Cache

Problem Statement: The problem is that many game developers are rewriting their code now, but they should be wary of committing too heavily to the functional threading model, because they may lock themselves into a threading model that is not really scalable. For developers rewriting an engine, it is always better to take the time to do it right, so that the engine can last eight years instead of three.

This might seem like a backwards analogy but from a car's perspective, people exist solely to move cars between parking places. If the cars form a convoy, and a driver leaves his or her car, everyone else behind is stuck.

By using more CPU cores, a game can increase the number of rigid body physics objects for greater effects on the screen. Optimized games might also yield a smarter AI.

Methodology: Within 12-24 months, 8- and 16-core systems will be common. At that point, applications that are not effectively multithreaded will suffer an order-of-magnitude performance disadvantage relative to their competitors. So, if we have not yet picked a concurrency platform (a layer of software that coordinates, schedules, and manages multicore resources), here are a couple of considerations:



  • It'll take some time - at least months - to explore the leading contenders.
  • Once chosen, it'll take one or more release cycles to bring the multicore-enabled software to market.

Key things to remember:
  • Pick the number of cores we intend to target first
  • Look at the silicon roadmap and determine when that number of cores will be available
  • Add the time it takes for new processor chips to penetrate the installed base of our customers' systems
  • Subtract roughly two development cycles to give us enough time to multicore-enable our code base

Key Results and Discussion:

As for the n in n-core, we do not necessarily know what it will be in a few years, so we look for threading techniques that scale. The best bet is hence data parallel threading. When determining core availability, we always rely on what the OS tells us, not on the hardware directly. Thanks to the potential resource surplus offered by multicore processors, virtualization will become more common, and doing an end-run around the hypervisor could prove disastrous for the client.

Unfortunately, games do not have the type of architecture that allows for a 100% performance boost per core. But we can still see tremendous gains when we implement functional threading, data parallel threading, or a mix of the two.



  • Functional Threading - Our game encompasses multiple subsystems, such as the physics engine, particle system, audio mixing, AI, and rendering. With functional threading, we put each of these functions on its own thread. In this context, a function can also be any discrete task run on its own set of data.
  • Data Parallel Threading - With data parallel threading, a single task is split across multiple threads, so that each thread handles a different chunk of the same data set. We have the potential not only to optimize for massive performance gains, but also to introduce entirely new aspects of game play.

"What multithreading approaches have we explored?"


• Native threads and thread pools are mature approaches. Applications employing this type of concurrency platform have been widely deployed.
• OpenMP appears to be doing well in terms of both new evaluations and actual deployments - likely a credit to the fact that it has been around for a while, is an open standard, and works well for parallelizing "fat" loops, which are a common case in scientific codes.
• Intel's Threading Building Blocks, which appeared on the horizon within the last two years, appears to be widely evaluated.
• Cilk++ is pitched on the claim that, without it, multicore enablement requires a drastic rewrite of the code.
• Microsoft's Parallel FX Library does not yet appear to be widely explored or deployed.

Single-threaded game engines can still work, but they are becoming increasingly outclassed by multithreaded solutions that are more sophisticated, but also more complex to create and optimize. To stay competitive, now and in the future, games will have to implement multithreading techniques that take advantage of all the CPU cores a player's machine has to offer.  Parallel programming is about implementing an algorithm, using sets of instructions or tasks designed to be executed concurrently.


Unit of performance:
How many frames per second the program is updating. Within a single frame, we cannot have the first stage pass data to the second stage, the third stage depend on the second, and the fourth depend on the third, because then we have a perfectly serial, unparallelizable game-state loop.

But once you do have your game re-architected for parallelism, how do you squeeze out even more performance? For one thing, focus on the unique aspects of the chip architecture itself.

Cache-size optimization:
From a software interface it is vastly different, but conceptually, on the Cell processor we break our workload down into the smallest chunks. Each of the little Cell cores has only 256 KB of local memory, so there is no big memory space; we are forced to break the workload into small, bite-size chunks. Working in such small chunks also makes the program easier to debug.
The parallel game engine is a multi-threaded game engine designed to scale to all available processors within a platform. To do this, the engine executes different functional blocks in parallel so that it can use all available processors. This is often easier said than done, because many pieces of a game engine interact with one another, which can cause many threading errors. The engine takes these scenarios into account and has mechanisms for achieving proper data synchronization without being bound by synchronization locks. The engine also has a method for executing data synchronization in parallel, to keep serial execution time to a minimum.
Parallel Execution State
The concept of a parallel execution state in an engine is crucial to an efficient multi-threaded runtime. For a game engine to truly run in parallel, with as little synchronization overhead as possible, each system must operate within its own execution state and interact minimally with anything else going on in the engine. Using this mechanism greatly reduces synchronization overhead, allowing systems to act more independently.

Execution Modes

Execution state management works best when operations are synchronized to a clock, which means the different systems execute synchronously. The clock does not even have to be fixed to a specific frequency; instead it can be tied to frame count, such that one clock step equals the time it takes to complete one frame, regardless of its length.

Fig 1: Execution state using the free step mode

• Free Step Mode
This execution mode allows a system to take the time it needs to complete its calculations; it is free to use as many clock steps as it needs. A simple notification of a state change to the state manager is not enough: the changed data must be passed along with the notification. Given the extra memory operations, free step might be slower than lock step, although this is not necessarily true.

• Lock Step Mode
With this mode all systems must complete their execution in a single clock. Lock step mode is simpler to implement and does not require data to be passed with the notification because systems that are interested in a change made by another system can simply query that system for the value (at the end of execution).

Fig 2: Execution state using the lock step mode

The main difference between the two modes is that free step trades simplicity for flexibility, while lock step does the reverse.

Data Synchronization
It is possible for multiple systems to make changes to the same shared data. The messaging therefore needs a mechanism that can determine which value is correct. Two such mechanisms can be used:
  • Time: the system that made the change last, time-wise, has the correct value.
  • Priority: the system with the higher priority has the correct value. This can be combined with the time mechanism to resolve changes from systems of equal priority.



Fig 6:


Scope for future work:
Multi-core is the present and future of PCs and of all other games platforms, soon even mobiles.
Once chips move beyond the number of cores that can be easily utilized by dedicating a core to each large functional piece of a game, the whole architecture of games, and indeed the support offered by programming languages, will be forced to change.

Conclusion: What if we have an 8-core, 16-core, or 100-core machine? Parallel programming must be built into game engines from day one of coding if games are to live on into the future.


References: Online Resources and Whitepapers
Acknowledgements: Intel and Sir MVIT

