Intel worked with Creative Assembly* throughout the development of Shogun 2: Total War* to help them get the best possible performance from the Intel® Core™ i7-2820QM CPU and Intel® HD 3000 Graphics. Through a series of analyses and experiments using the Intel® tools Parallel Amplifier XE and Intel® Parallel Studio, we identified and removed numerous locks that were causing bottlenecks, producing a final frame rate on Intel® Core™ i7 processors that was 1.28X that of 2 core systems. The team also used Intel® Graphics Performance Analyzers throughout development to identify bottlenecks in the terrain shaders that were impacting performance on Intel® HD 3000 Graphics, and achieved a very playable frame rate of 28 frames per second. This case study looks at some of the efforts made by both Intel and Creative Assembly during development to make Shogun 2: Total War a top-class product on Intel® HD 3000 Graphics.
Getting Accurate Results
Intel engaged with Creative Assembly quite early in the development of Shogun 2. At that point, little of the game was functional, which made performance testing and monitoring difficult. What we needed was a repeatable workload, typical of the game, which we could use as a yardstick for future tests. We knew that later in the project we were going to rebuild and enable the replay system used in previous incarnations of the engine, but this was not possible when we started. We needed numbers early. Eventually we settled on Lua* scripted scenarios with managed cameras (similar to what you see at the start of the campaign battles). These worked quite well as a repeatable workload and sufficed for early testing, as the AI and everything else ran normally during the script. We used this system to help profile the game on a number of graphics devices, including the Intel® HD 3000 Graphics processor family (codenamed Sandy Bridge).
Intel worked with Creative Assembly to debug and refine the in-game replay system, which had been designed during previous incarnations of the Total War engine. In Total War, the replays are actually real game action, not a movie. During the replay all game code elements are fully active and are playing the scenario as surely as if you were pushing the buttons yourself. This made the replay system ideal for producing a repeatable and representative workload. However, there was a problem. As diehard Total War fans know, when you play back a replay you are normally free to move the camera wherever you want. This level of control makes for a great experience, but it also means the frame content is unpredictable: if you point the camera at the sky there is almost no content and consequently the frame rate is really high, but if you point it at a charging army the content is very busy and hence the frame rate is lower. With free camera movement it would have been impossible to repeat the same camera path in every test, and so impossible to make an accurate comparison of game performance between runs.
Changing the replay system was extra to the planned work but Creative Assembly agreed recording the camera was essential in order to provide accurate performance data on all tested platforms. Intel worked closely with Creative Assembly on the replay system. An onsite Intel engineer contributed design ideas and helped out by testing and debugging the replay system. Intel also developed a few battle scenarios to help debug the camera in the early stages of game development.
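To illustrate the idea of a recorded camera, here is a minimal C++ sketch of a camera track that stores timed samples while recording and returns an interpolated camera during playback. It is not the Total War replay code: the `CameraKey` and `CameraTrack` names, the position-only camera and the linear interpolation are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical camera sample: time in seconds plus a position.
// A real track would also carry orientation and field of view.
struct CameraKey {
    double time;
    double x, y, z;
};

// Recorded camera path. Keys must be appended in increasing time order.
class CameraTrack {
public:
    void record(const CameraKey& key) {
        assert(keys_.empty() || key.time > keys_.back().time);
        keys_.push_back(key);
    }

    // Returns the camera at time t, linearly interpolating between the two
    // surrounding keys and clamping to the first or last key outside the range.
    CameraKey sample(double t) const {
        assert(!keys_.empty());
        if (t <= keys_.front().time) return keys_.front();
        if (t >= keys_.back().time) return keys_.back();
        // First key with time > t; it always has a predecessor at this point.
        auto hi = std::upper_bound(
            keys_.begin(), keys_.end(), t,
            [](double value, const CameraKey& k) { return value < k.time; });
        auto lo = hi - 1;
        const double f = (t - lo->time) / (hi->time - lo->time);
        return CameraKey{t,
                         lo->x + f * (hi->x - lo->x),
                         lo->y + f * (hi->y - lo->y),
                         lo->z + f * (hi->z - lo->z)};
    }

private:
    std::vector<CameraKey> keys_;
};
```

In a sketch like this, sampling by elapsed replay time rather than by frame number means every run sees the same view at the same point in the battle, whatever frame rate the platform under test achieves.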
Intel® HD 3000 Graphics Investigation
To make Shogun 2 a success on Intel® HD 3000 Graphics, we wanted to have at least ‘medium’ settings for all the options, so any options that showed significant slowdown at or below ‘medium’ resulted in a deep dive with Intel® Graphics Performance Analyzers (GPA). Early in the enabling process we tried to focus on the graphics pipeline to determine what effects and settings levels we could apply for all the possible graphics options. The Intel® HD 3000 Graphics was capable of dealing with settings greater than ‘low’ for almost all the options, but we wanted to go beyond mere settings selection, and actually tweak some higher settings to make them perform better on Intel® HD 3000.
We investigated a number of areas where there was heavy graphics processing and discussed various ways to reduce the workload so that higher settings could be used on Intel® HD 3000 Graphics. In the following example we looked into the landscape renderer, since being able to default to one higher level of landscape detail made a significant visual difference. Because of the camera angle in Shogun 2, the landscape is rendered over most of the screen, which meant that any improvement in the landscape renderer would pay dividends.
Each horizontal bar is a single draw call. Those picked out in yellow are calls made by the terrain renderer. The vertical axis shows duration of draw call.
Figure 1. Intel® GPA shows that the draw calls which render the terrain take up a lot of time.
All pixels drawn by the terrain renderer are highlighted in pink. The terrain takes up a fair portion of the screen, and so probably should command a large portion of the frame's processing time. However, the stats suggested there might be things we could do.
Figure 2. Shows pixels drawn by the terrain renderer.
We investigated the game with Intel® GPA. The main workhorse, as always, was Frame Analyzer. Using Frame Analyzer you can see the time taken to carry out each draw call, then select individual draw calls and investigate their textures, shaders and so on. Pretty quickly, we identified the landscape as being the most expensive component of the scene. The first thing we noticed about the shader for the landscape was that it was 350 instructions long. That count included 21 texture reads from 14 different textures to arrive at the final image. The textures in question were quite large, so our initial thought was that we could be losing a lot of time accessing textures.
We were surprised to see that our initial assessment was wrong. Using Intel® GPA experiments, we found that replacing the landscape textures with 2x2 textures only reduced the landscape's share from 34% to 33% of the scene, which is almost no difference at all. Looking further, we saw that execution unit utilization on the Intel® HD 3000 Graphics was at almost 80%. The clincher was that replacing the shader with a simple one reduced the landscape's share of processing to 3%, which pointed to shader arithmetic rather than texture access as the bottleneck.
A partial rewrite of the landscape shader gave us a reduction in size, which in turn resulted in a performance improvement. The main change was the vectorization of a set of sequential operations in the shader, which greatly improved the execution time. Other minor changes were made to the number of detail textures and the way they were handled. As a result, we were able to set the landscape detail option a full level higher and keep more or less the same frame rate, with a significant improvement in visual quality.
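The case study does not say which shader operations were vectorized, so the following is only a sketch of the general idea, written in C++ rather than shader code. It assumes a hypothetical set of four detail-layer blend weights computed from a terrain height: first as four sequential scalar expressions, then as a single four-wide expression. `Float4`, `saturate` and both `blendWeights` functions are invented for illustration; in a real shader the four-wide form would use the shading language's built-in vector type.

```cpp
#include <cassert>

// Minimal stand-in for a shader float4 (hypothetical helper type).
struct Float4 {
    float x, y, z, w;
};

Float4 operator-(Float4 a, Float4 b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w};
}

Float4 operator*(Float4 a, Float4 b) {
    return {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w};
}

// Clamp to the [0, 1] range, as the shader intrinsic of the same name does.
float saturate(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

Float4 saturate(Float4 v) {
    return {saturate(v.x), saturate(v.y), saturate(v.z), saturate(v.w)};
}

// Sequential form: four separate scalar computations, one per detail layer.
Float4 blendWeightsScalar(float height, Float4 start, Float4 scale) {
    const float w0 = saturate((height - start.x) * scale.x);
    const float w1 = saturate((height - start.y) * scale.y);
    const float w2 = saturate((height - start.z) * scale.z);
    const float w3 = saturate((height - start.w) * scale.w);
    return {w0, w1, w2, w3};
}

// Vectorized form: the same arithmetic expressed as one four-wide operation.
Float4 blendWeightsVector(float height, Float4 start, Float4 scale) {
    const Float4 h{height, height, height, height};
    return saturate((h - start) * scale);
}
```

Both functions return the same weights. On a CPU this C++ is not guaranteed to compile to SIMD instructions; the point is only the shape of the rewrite, where several dependent scalar statements collapse into one expression over a vector.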
Multi-core Investigation
We really wanted Shogun 2 to take advantage of systems with multiple cores. We started work on multi-core optimization very early in the development of Shogun 2, long before the earlier mentioned replay camera had been implemented. At this point we were using static scenes of armies and comparing the frame rates on the 2 core and 4 core target systems with camera positions as near as we could get to identical. Our testing was so early in the development that many of the graphics techniques used to achieve the visual richness of Shogun 2 had yet to be implemented.
It’s important to get in early with multi-core optimizations, but how do you do that on an unfinished engine? What we did have was the core of the scenery and weather systems, and the AI and animation engine for the soldiers. After deliberation we felt that this was enough to give us meaningful results, if not 100% accurate ones, and our findings could be verified later once the game had been completed. As a result of earlier work with Intel on the Total War* titles Napoleon* and Empire*, Creative Assembly already used Intel® Threading Building Blocks (TBB) to thread systems involving repetitive tasks such as the animation of combatants and ships. What we found was that in designing the next generation Total War engine, Creative Assembly had had to rewrite some of these systems, and while the new design was innovative, the multi-core scaling we expected to see was not there.
We set aside an afternoon to investigate this using Intel® VTune™ Amplifier XE, a tool from Intel that lets you see exactly where execution time is being spent, right down to individual instructions. We took samples using Intel® VTune™ Amplifier XE on a 2 core and a 4 core system to compare threading performance. What we found was puzzling at first. It seemed that there was still a good percentage of threaded code in the engine. On a 4 core Hyper-threaded system we were getting about 275% code execution (equivalent to 2.75 cores running flat out) and on the 2 core HT system we were getting about 175% (equivalent to 1.75 cores running flat out), so there should have been some scaling, but the frame rates were doggedly identical, to within less than 1%. Drilling down with VTune we found the problem pretty quickly. The Windows* function WaitForSingleObject(), used to prevent multiple threads accessing the same code at the same time, was taking an oddly significant portion of the execution time.
A deeper examination of the code showed that the threading optimizations in trees, weather and animation used linked lists to store items for processing by the graphics thread. While it appeared that each of the threads used its own list, an optimization in the list management code completely separate from the systems we were examining meant that ‘under the hood’ all the systems used the same global list. Consequently, all the threaded optimizations were being cancelled out by a single lock they all shared in the engine core. Once we removed the offending lock and provided true separate lists for each thread the scaling returned and we began to see up to 1.28X performance increase between the 2 systems.
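The difference between the two designs can be sketched in a few lines of standard C++. This is not the engine's code: it uses `std::thread` and `std::mutex` rather than TBB and Windows synchronization, and the function names and the integer "items" are assumptions for illustration. The first function mirrors the problem case, where every worker appends to one shared list behind one lock. The second mirrors the fix, where each worker owns its list and the lists are merged once after the workers finish.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Problem pattern: every worker appends to one shared list, so every append
// serializes on the same mutex no matter how many cores are available.
std::vector<int> collectWithSharedList(int workers, int itemsPerWorker) {
    assert(workers > 0 && itemsPerWorker >= 0);
    std::vector<int> shared;
    std::mutex lock;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&shared, &lock, w, itemsPerWorker] {
            for (int i = 0; i < itemsPerWorker; ++i) {
                std::lock_guard<std::mutex> guard(lock);  // contended per item
                shared.push_back(w * itemsPerWorker + i);
            }
        });
    }
    for (std::thread& t : pool) t.join();
    return shared;
}

// Fixed pattern: each worker owns its list, so no lock is needed while
// producing. The lists are merged once, after all workers have joined.
std::vector<int> collectWithPerThreadLists(int workers, int itemsPerWorker) {
    assert(workers > 0 && itemsPerWorker >= 0);
    std::vector<std::vector<int>> perThread(static_cast<std::size_t>(workers));
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&perThread, w, itemsPerWorker] {
            std::vector<int>& mine = perThread[static_cast<std::size_t>(w)];
            for (int i = 0; i < itemsPerWorker; ++i) {
                mine.push_back(w * itemsPerWorker + i);
            }
        });
    }
    for (std::thread& t : pool) t.join();

    std::vector<int> merged;
    for (const std::vector<int>& list : perThread) {
        merged.insert(merged.end(), list.begin(), list.end());
    }
    return merged;
}
```

Both functions collect the same items. The second simply never makes one worker wait for another while it is producing, which is the property the engine had lost when its lists were silently routed through a single global list.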
Once we had scaling, it was a fairly simple matter to periodically check it as the project progressed to watch for sudden losses. If we saw a sudden drop, then we could look at the code added over the last period and track down the problem fairly quickly.
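A periodic check of this kind needs very little code. The sketch below assumes you already have an average frame rate for the same replay on each system; the function names and the tolerance parameter are hypothetical, and the case study does not describe how the check was actually automated, if at all.

```cpp
#include <cassert>

// Ratio of the 4 core frame rate to the 2 core frame rate for the same replay.
double scalingFactor(double fpsTwoCore, double fpsFourCore) {
    assert(fpsTwoCore > 0.0);
    return fpsFourCore / fpsTwoCore;
}

// True when scaling has dropped by more than `tolerance` below the baseline,
// which is the cue to review the code added since the last check.
bool scalingRegressed(double baseline, double current, double tolerance) {
    return current < baseline - tolerance;
}
```

For example, with a recorded baseline of 1.25 and a tolerance of 0.05, a build that measures 1.0 would be flagged, while one that measures 1.24 would not.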
One amusing event which occurred during the development deserves mention here. Although we had scaling as a result of the animation threading code, we would regularly see it drop off as battles progressed, as shown in Figure 3.
Graph showing the recorded framerate from Shogun 2 Total War after the global lock fix (vertical axis is frame rate, horizontal is in seconds). We were typically seeing 1.2X scaling through most of the battle, but as can be seen from the end of the trace the scaling always seemed to tail off.
Figure 3. 2 core versus 4 core frame rate over time.
Caution: Death can inhibit your frame rate!
This puzzle was eventually traced to an issue with corpses. In Shogun 2 the corpses of the fallen stay in view on the battleground. Once fallen, the corpses would still animate so they could be blown about by explosions, trampled by horses and so on. A tiny error in the code meant that once a model was marked as dead, the animation code swapped to an old path which did not have the threading optimizations in it. The net effect was that as more men died and moved to the unthreaded animation system, the performance improvement dropped off. The fix to the corpse code was fairly minor, and once it was complete we were seeing continuous scaling right through the thick of battle, at about 1.2X improvement from 2 core to 4 core.
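The shape of this bug can be sketched as follows. This is a hypothetical reconstruction, not the engine's code: `Model`, `AnimPath` and the selector functions are invented names, and the real path selection was certainly more involved. The sketch only shows why the share of work on the threaded path shrinks as casualties mount.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class AnimPath { Threaded, LegacySerial };

struct Model {
    bool dead;
};

// Hypothetical reconstruction of the bug: dead models fall back to an old,
// unthreaded animation path.
AnimPath selectPathBuggy(const Model& m) {
    return m.dead ? AnimPath::LegacySerial : AnimPath::Threaded;
}

// After the fix, the dead flag no longer changes which path animates a model.
AnimPath selectPathFixed(const Model&) {
    return AnimPath::Threaded;
}

// Fraction of models animated on the threaded path (1.0 for an empty list).
double threadedFraction(const std::vector<Model>& models,
                        AnimPath (*select)(const Model&)) {
    assert(select != nullptr);
    if (models.empty()) return 1.0;
    std::size_t threaded = 0;
    for (const Model& m : models) {
        if (select(m) == AnimPath::Threaded) ++threaded;
    }
    return static_cast<double>(threaded) / static_cast<double>(models.size());
}
```

With three of four models dead, the buggy selector leaves only a quarter of the animation work on the threaded path, while the fixed selector keeps all of it there.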
It wasn’t all good news for multi-core on Shogun 2. During the development we added threading to tree animation, and specifically to the level of detail (LoD) calculations for the trees. With the number of trees in a typical landscape this added up to quite a lot of parallel code and boosted the scaling to over 1.3X for a time. The Creative Assembly team then added an innovation: a system to batch the trees together and share LoD calculations across groups of trees. The new code was significantly faster on a single core than the threaded execution had been. By batching the trees there were so few LoD calculations that threading them had little or no effect, and since it is practically impossible to see any difference between the two tree processing systems, there was no benefit in keeping the more complicated parallel code. This was a classic example of the old adage that ‘the fastest piece of code is the one which does not execute at all…’ Hats off to Creative Assembly!
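The batching idea can be sketched as below. The details are assumptions: the case study does not describe how trees were grouped or how LoD was chosen, so `TreeBatch`, the distance-based `lodForDistance` rule and its parameters are all hypothetical. What the sketch shows is the saving itself, one LoD calculation per batch instead of one per tree.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 {
    double x, z;
};

// Hypothetical LoD rule: one level per `step` units of distance from the
// camera, capped at maxLod (higher numbers mean coarser detail).
int lodForDistance(double distance, double step, int maxLod) {
    assert(step > 0.0);
    const int lod = static_cast<int>(distance / step);
    return lod < maxLod ? lod : maxLod;
}

struct TreeBatch {
    Vec2 centre;            // representative position for the whole batch
    std::size_t treeCount;  // number of trees sharing the result
};

// One LoD calculation per batch; every tree in the batch reuses the result.
std::vector<int> lodPerBatch(const std::vector<TreeBatch>& batches, Vec2 camera,
                             double step, int maxLod) {
    std::vector<int> lods;
    lods.reserve(batches.size());
    for (const TreeBatch& b : batches) {
        const double d = std::hypot(b.centre.x - camera.x, b.centre.z - camera.z);
        lods.push_back(lodForDistance(d, step, maxLod));
    }
    return lods;
}
```

With two batches of 200 and 150 trees, this performs two distance and LoD calculations rather than 350, which leaves very little work for additional threads to share.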
Intel worked with Creative Assembly for about 9 months on Shogun 2. We worked on the game together to add multi-core optimizations which resulted in the game being 1.2X faster on a 4 core HT system compared to a 2 core one, and we added graphics optimizations which gave us more than reasonable performance on Intel® HD Graphics, proving that integrated graphics systems could hit the mark in the games environment. We all concluded from the development that early focus on CPU and graphics optimization is vital to making a successful game on modern hardware.
But the main thing Intel achieved was to help Creative Assembly produce a game which was worthy of Creative Assembly’s Total War lineage and at the same time demonstrate what Intel already knew: that Intel hardware has a great deal to offer game developers looking to excel in performance through multi-core optimizations and increase their potential market by embracing Intel® HD Graphics.
About Intel GPA
Intel® Graphics Performance Analyzers (Intel® GPA) is a powerful, agile developer tool suite for analyzing and optimizing games, media, and other graphics-intensive applications. Intel® GPA Frame Analyzer is a powerful, intuitive, best-in-class single frame analysis and optimization tool. Intel® GPA System Analyzer Heads-up Display (HUD) and Standalone provide straightforward initial analysis and interactive Microsoft Direct3D* pipeline state overrides. Intel® GPA Platform Analyzer provides a timeline view for analysis of tasks, threads, Microsoft DirectX*, OpenCL™ and GPU-accelerated media applications in context. Intel® GPA Media Analyzer shows how efficiently your code utilizes hardware acceleration on Intel® Core™ processor-based PCs with Intel® HD Graphics, and provides in-depth, real-time media performance analysis of encode and decode metrics.
About the Author
Steve Hughes spent over 12 years developing games for PC and various consoles with, he boasts, “at least 10 released games - hard to be sure…” before joining Intel as a Senior Application Engineer in 2008. Since joining Intel he has worked with many companies to try to synergize the relationship between their games and Intel hardware. When not gazing at code, he plays guitar, tries to polish telescope mirrors, and occasionally builds sheds.
* Other names and brands may be claimed as the property of others.
Appendix A: System Information
|CPU|Intel® Core™ i7-2820QM @ 2.3GHz|
|Graphics|Intel® HD Graphics 3000 @ 1300 MHz|
|OS|Windows 7 x64|
Appendix B: Tools
- Intel® Graphics Performance Analyzer 3.0 & 4.0
- Intel® VTune™ Amplifier XE + Parallel Studio
- Intel® Threading Building Blocks
- Creative Assembly internal tools