Inside the Intel and Creative Assembly* Collaboration

Intel has been helping game developers maximize performance on Intel® platforms from the beginning. In their role as trusted advisors, Intel application engineers help ensure that games scale across the span of computing power and resources. In this way, games deliver great experiences whether played on laptops or custom gaming rigs.

Today’s CPUs offer unparalleled power and performance. For game developers, ever-increasing core counts along with innovations in memory and other system resources enable exceptional experiences. More interactive non-player characters (NPCs). More effects that model real-world physics. More believable character animations. And even more potential to expand your game’s market reach.

Taking advantage of the power of Intel platforms often requires a carefully orchestrated approach to coding. Where it makes sense, spread workloads across all of the processor cores in a system. But in complex systems like games, threading alone is not enough, and it’s sometimes more important to:

  • Select and design your algorithms to manage memory access.
  • Design your data to make good use of caches.
  • Use Single Instruction Multiple Data (SIMD) instructions to accelerate math.
  • Investigate and use other advanced architecture features.
  • Use available tools such as Intel® VTune™ Amplifier or Intel® Graphics Performance Analyzers (Intel® GPA) to find and remove bottlenecks in the codebase.


Intel understands these considerations as well as anyone. To help its collaborators get the most out of its hardware, Intel cultivates long-term relationships with innovative game development studios of all sizes. Working as trusted advisors, Intel application engineers analyze content, code, and gameplay and then offer expert advice to help game developers bring their creative visions to life.

This paper looks at the progression of desktop and mobile Intel® multicore processors over the last 10 years. It also chronicles the decade-long collaboration between Intel and Creative Assembly* on the award-winning Total War* series. The paper examines many technical innovations that advancing technology has enabled—increased core counts, enhanced instruction sets, evolved memory and other sub-system resources, and optimizations for on-chip graphics processing.

After reading this paper, you will understand many of the techniques Creative Assembly and Intel used to ensure that the Total War series of games will run great on a full range of current and future Intel platforms, from laptops to desktop gaming rigs.

Two Cores Ignite a Relationship

In today’s crowded games market, standing out often requires stand-out performance.

Back in 2009, the Intel® Core™2 Duo processor was an ideal solution. The processor had been garnering industry praise for its breakthrough performance and efficiency, but it was still relatively new to market when Intel application engineer Steve Hughes brought it to Creative Assembly’s attention.

“I still remember Steve insisting that having two cores on a single chip wasn’t a trick,” Charlie Dell, technical director of Creative Assembly, recalled. “He kept saying there really are two computers on this thing.”

“Up to that point, our relationship with Intel was pretty low-key,” Charlie continued. “It took on a whole new character when we recognized the benefits that core scaling could bring to our Total War series.”

Having not one but two cores on a single chip gave Creative Assembly the ability to execute all of the AI code for Total War: NAPOLEON* on its own core. That left the other core free to render as many frames as possible. The result was a 300% increase in frame rate.

Multicore and Hyperthreading, What’s the Difference?

Multicore CPUs have two or more execution cores (or compute engines) on a single silicon chip. That chip plugs into a single processor socket, and the operating system perceives each core as a discrete processor.

Practically all modern CPUs have at least two cores. Some have eight or more. All multicore CPUs enhance things like multitasking—running multiple applications at the same time—and boost game performance by allowing you to do more work per clock cycle or game tick.

Hyperthreading technology, on the other hand, makes efficient use of a single execution core by allowing more than one thread to share that core’s resources. When multithreaded applications run on multicore processors with hyperthreading enabled, each core processes two threads at a time, so the available hardware resources are used more efficiently.

Most but not all Intel® Core™ processors support two threads per core.

For a more detailed look at the difference between multicore and hyperthreading, see Difference Between Physical Cores and Logical Processors.

How was that possible? Consider that, with single-core execution, the AI code would update every 1/10th of a second. In between AI ticks, the core would draw as many frames as possible (Figure 1).

Figure 1. Single-core execution. Between AI ticks, the core draws as many frames as possible.

Splitting the game code across two cores allowed the AI code to still update every 1/10th of a second, but the rendering thread was free to continue drawing as many frames as possible, lerping the data between the last two AI ticks (Figure 2). Because Core 1 didn’t have to pause to process AI ticks, moving AI onto Core 2 yielded a 300% frame rate increase.

Figure 2. Dual-core execution. AI ticks still occur every 1/10th of a second on Core 2, while Core 1 renders as many frames as possible.
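The interpolation step described above can be sketched in a few lines. This is a minimal illustration, not Creative Assembly's code: the `UnitState` type, the function names, and the choice to clamp rather than extrapolate are all assumptions made for the example. The render thread keeps the last two AI snapshots and blends between them according to how far the current frame sits between the two tick times.

```python
from dataclasses import dataclass


@dataclass
class UnitState:
    """Hypothetical snapshot of one unit as produced by an AI tick."""
    x: float
    y: float


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation: returns a when t == 0 and b when t == 1."""
    return a + (b - a) * t


def interpolate_state(prev: UnitState, curr: UnitState,
                      prev_time: float, curr_time: float,
                      render_time: float) -> UnitState:
    """Blend the last two AI snapshots for a frame drawn at render_time."""
    span = curr_time - prev_time
    if span <= 0:
        t = 0.0
    else:
        # Clamp so a late frame holds the newest snapshot instead of extrapolating.
        t = min(max((render_time - prev_time) / span, 0.0), 1.0)
    return UnitState(lerp(prev.x, curr.x, t), lerp(prev.y, curr.y, t))
```

With AI ticks 1/10th of a second apart, a frame drawn halfway between two ticks uses t = 0.5, so a unit that moved from x = 0 to x = 10 is drawn at x = 5.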

Two More Cores, Now What?

In 2010, Intel Core processors—Intel® Core™ i3, Intel® Core™ i5, and Intel® Core™ i7—made more cores and more powerful integrated graphics processors available to both mobile (laptop) and desktop PCs.

First Generation Intel® Core™ Family Processors in 2010

CPU Model          Intel® Core™ i3    Intel® Core™ i5    Intel® Core™ i7
Number of cores    2                  4                  4
Threads            4                  8                  8

The increased core count and, in some cases, thread count inspired Creative Assembly to build on the lessons they’d learned writing code for two cores. They tapped the additional multicore power to animate the cloth sails and rope rigging on naval ships in Total War: NAPOLEON.

According to Steve Hughes, “Cloth and rope simulations have a lot of repetitive tasks that are fairly easy to parallelize. Anyone new to parallelizing code would do well to start in similar fashion. For our work with Creative Assembly, it was a good primer for later work, which got a lot more complicated.”
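To make the "repetitive tasks that are fairly easy to parallelize" point concrete, here is a small sketch of one simulation step split across workers. It is an illustration under stated assumptions, not the game's cloth or rope solver: the particles only fall under gravity, the constraint solving that a real rope needs is left out, and `step_chunk` and `step_particles` are hypothetical names. In CPython, threads will not speed up pure-Python arithmetic, so the sketch shows only the structure: each task owns a contiguous chunk and never reads another chunk's data.

```python
from concurrent.futures import ThreadPoolExecutor

GRAVITY = -9.81  # m/s^2 on the vertical axis (assumed constant for the sketch)


def step_chunk(positions, velocities, dt):
    """Advance one chunk of particles; no particle reads another's state."""
    new_vel = [v + GRAVITY * dt for v in velocities]
    new_pos = [p + v * dt for p, v in zip(positions, new_vel)]
    return new_pos, new_vel


def step_particles(positions, velocities, dt, workers=4):
    """Split the particle arrays into contiguous chunks, one task per chunk."""
    n = len(positions)
    size = max(1, -(-n // workers))  # ceiling division
    ranges = [(i, min(i + size, n)) for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(step_chunk, positions[a:b], velocities[a:b], dt)
                   for a, b in ranges]
        results = [f.result() for f in futures]  # collected in chunk order
    new_pos = [p for chunk_pos, _ in results for p in chunk_pos]
    new_vel = [v for _, chunk_vel in results for v in chunk_vel]
    return new_pos, new_vel
```

Because the chunks are independent, the parallel result is identical to running `step_chunk` once over the whole array, which makes this kind of task a gentle starting point.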

While processing animation code on its own thread boosted frame rate, the team also realized that they could give priority to foreground objects and animate other things in the distance.

Foreground objects in this case were things like ships, their cloth sails, and rope rigging (Figure 3). Background objects were also ships, sails, and rigging, all rendered at a lower level of detail (LOD) than the foreground animations. Additional cores enabled more unique animation systems to run in the distance. More cores also made it possible to change the LOD at different distances. In other words, the game achieved better visual quality on machines with more available cores.

Figure 3. Ships in the foreground receive rendering and detail priority over faraway objects as shown in this scene from Total War: NAPOLEON*. (Image courtesy of Creative Assembly)

Multiple Cores Plus On-Die Graphics Processing

In 2011, 2nd generation Intel Core processors offered developers Intel® HD Graphics 3000. For gaming, this on-chip graphics processor presented a viable alternative to standalone graphics processors. Early access to an alpha software development platform with a second generation Intel Core processor gave the team their first opportunity to optimize their code for Intel Graphics.

According to Charlie, a lot of bespoke coding went into Total War: SHOGUN 2*. Core and thread counts had doubled to four cores, eight threads. Integrated processor graphics provided more and more power. All of this let the team play even more unique animations at different levels of detail based on distance from the virtual camera (Figure 4).

Figure 4. One of the many techniques used in the Total War series renders objects with varying levels of detail based on how close or far they are from the virtual camera.

But new features often come with new challenges. In this case it was pixel mask array (PMA) depth stalls. The problem persisted through 8th generation Intel Graphics. A patch is still available on GitHub*.

Intel and Creative Assembly have collaborated on six games in the popular Total War series: EMPIRE*, NAPOLEON, SHOGUN 2, ROME II*, WARHAMMER II*, and THREE KINGDOMS*. While the action takes place on battlefields dating back to feudal Japan, ancient and medieval Europe, eighteenth-century France, and beyond to fantasy realms populated by magic and monsters, Total War titles have these things in common:

  • Players command armies numbering in the thousands.
  • Campaign strategy and tactics are turn-based.
  • Battles stream in real time.
  • State-of-the-art animation techniques breathe life into each simulated battle.

All Things Considered: CPUs, GPUs, and Memory

So far, we’ve focused almost exclusively on the benefits of scaling offered by multicore processors. But that’s only part of the story. No matter how many cores are in a system, those cores and the code that runs on them have to play well with the system’s GPU and memory resources.

In other words, achieving better scaling performance involves a lot more than simply shifting workloads from a single CPU to multiple cores and threads.

According to Charlie and Steve, writing scalable, parallelized code means confronting an ever-present battle between CPU and GPU workloads:

  • Too much processing on the GPU prevents multicore scaling.
  • Too much processing on the CPU has a negative effect on overall performance.

“In the case of Total War,” Steve explained, “the key was striking the right balance in the mathematical relationship between CPU load and GPU load when adding large numbers of fighting units. For example, if you add 10,000 units, then the GPU frame time is extended by 10,000 multiplied by the time it takes to draw one unit. But on the CPU, the time it takes to prepare those 10,000 men increases by 10,000 multiplied by CPU preparation time divided by the number of cores.”

It’s fairly easy to see then that as you increase the number of units, you quickly become bottlenecked on the GPU. As a result, CPU scaling stops.
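Steve's relationship can be written as a small model. This is a sketch under assumptions, not a profiler: the per-unit costs are made-up inputs, the function names are hypothetical, and the model assumes CPU and GPU work overlap fully so that the slower side sets the frame time.

```python
def gpu_time_us(units, draw_us_per_unit):
    """GPU frame time: every unit adds one draw cost."""
    return units * draw_us_per_unit


def cpu_time_us(units, prep_us_per_unit, cores):
    """CPU frame time: preparation work is divided across the cores."""
    return units * prep_us_per_unit / cores


def frame_time_us(units, draw_us_per_unit, prep_us_per_unit, cores):
    """Assumes CPU and GPU overlap fully, so the slower side sets the pace."""
    return max(gpu_time_us(units, draw_us_per_unit),
               cpu_time_us(units, prep_us_per_unit, cores))


def bottleneck(units, draw_us_per_unit, prep_us_per_unit, cores):
    """Name the side that limits the frame under this simple model."""
    gpu = gpu_time_us(units, draw_us_per_unit)
    cpu = cpu_time_us(units, prep_us_per_unit, cores)
    return "GPU" if gpu >= cpu else "CPU"
```

For example, with 10,000 units, an assumed 1 microsecond to draw a unit and 4 microseconds to prepare one, the frame takes 40,000 microseconds on one core and 10,000 on four. Going to eight cores leaves it at 10,000 because the GPU is now the limit, which is the point at which CPU scaling stops.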

Techniques like pseudo impostors (see below) help reduce the CPU load by reducing the number of individual instances that need to be processed. Aggressive culling helps reduce the GPU load by reducing the amount of geometry that needs to be processed.

Serialized tasks that aren’t broken down into smaller, parallelized granular tasks are a major cause of CPU bottlenecks. Steve advised, “We found that the best way to run optimized but unthreaded tasks faster was to run them on a faster, more efficient CPU.”

Deadlocks and race conditions are two more common threats that cause errors or crashes. Identifying them is crucial.

Lockless programming is a way of avoiding the use of mutexes (locks) or critical sections that adversely affect performance by introducing synchronization points between threads.

The aim is to run with the minimum number of these synchronization points. Creative Assembly would turn to Intel performance profiling tools such as Intel VTune Amplifier to identify these conditions.
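One common way to cut synchronization points is to give each worker its own private result and merge once at the end, rather than having every worker lock a shared structure on every update. The sketch below illustrates that idea only. It is not a lock-free data structure and it is not taken from the Total War codebase; the faction tally and the names `tally_chunk` and `tally_units` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tally_chunk(factions):
    """Count units per faction using only worker-local state, so no lock is needed."""
    return Counter(factions)


def tally_units(factions, workers=4):
    """Tally in parallel chunks, then merge the partial results exactly once."""
    size = max(1, -(-len(factions) // workers))  # ceiling division
    chunks = [factions[i:i + size] for i in range(0, len(factions), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(tally_chunk, chunks))
    total = Counter()
    for partial in partials:
        # The single merge after the workers finish is the only point
        # where results from different workers meet.
        total.update(partial)
    return total
```

The alternative, a single shared counter guarded by a mutex, would force a synchronization point on every unit counted.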

In addition, years of experience taught the team to look for and resolve mis-spun threads.

The team also suggests that you:

  • Avoid having two cores working on the same cache line.
  • Profile performance constantly and look for hot spots and performance bottlenecks.
  • Structure data to optimize the render order of meshes.
  • Use Intel Graphics Performance Analyzers (Intel GPA) to identify hotspots and bottlenecks via instrumentation.
  • Pinpoint the most computationally expensive calls through frame analysis to spot and correct CPU- and GPU-bound issues.

Scaling to Increase Market Reach

Scaling games to PCs equipped with high-powered processors such as the Intel Core i7 and Intel® Core™ i9 is great. But a lot of people play on machines with less power. That includes laptops, which typically use Intel Graphics—on-chip graphics processors instead of standalone GPU cards. Scaling content as well as graphics ensures great experiences across a spectrum of CPUs. It also helps ensure that the game will reach the widest possible audience.

“The usage numbers on Steam* show that a significant number of the people playing Total War use laptops,” Steve pointed out. With that in mind, on ROME II in 2013, Creative Assembly implemented both multicore and Intel Graphics optimizations in the same title.

Building on the lessons learned while developing SHOGUN 2, two new features emerged as work on ROME II began: multitouch support and adaptive order-independent transparency (AOIT).

Multitouch, the ability to detect more than one touch point on a control surface, enhanced gameplay on devices that supported it. AOIT proved more elusive.

AOIT analyzes contributing color values where a lot of semitransparent objects overlap. It selects the four color values that contribute most to the final image. The algorithm is particularly useful in particle system-generated smoke or on computer-generated foliage where transparency softens leaf edges. According to Creative Assembly, it was difficult for untrained eyes to discern that fine details were enhanced.

The Intel® Iris® Graphics pixel synchronization extension supplied AOIT support. It allowed the comparison between the various color values to happen very quickly while avoiding race conditions.
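The idea of keeping only the most significant contributions can be illustrated on the CPU for a single pixel. This is a deliberately simplified sketch that follows the description above, not Intel's AOIT implementation, which runs in a pixel shader and relies on the pixel synchronization extension. The function name and the fragment layout are assumptions. Each fragment's weight is its alpha multiplied by the transmittance of everything in front of it, and only the k largest weights are kept.

```python
def composite_top_k(fragments, background, k=4):
    """Approximate blending of overlapping semitransparent fragments.

    fragments: list of (depth, (r, g, b), alpha); smaller depth is nearer.
    background: (r, g, b) seen through whatever light passes all fragments.
    """
    ordered = sorted(fragments, key=lambda f: f[0])  # front to back
    transmittance = 1.0
    weighted = []
    for _, color, alpha in ordered:
        # Share of the final pixel that this fragment is responsible for.
        weighted.append((alpha * transmittance, color))
        transmittance *= 1.0 - alpha
    kept = sorted(weighted, key=lambda w: w[0], reverse=True)[:k]
    result = [transmittance * channel for channel in background]
    for weight, color in kept:
        for i in range(3):
            result[i] += weight * color[i]
    return tuple(result)
```

When a pixel has k or fewer fragments, nothing is discarded and the result matches ordinary front-to-back alpha compositing. With more fragments, the smallest contributions are dropped, which is why the difference is hard to see.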

New Priorities

As Moore’s Law marches on, core and thread counts increase. This was evidenced in the four years between Creative Assembly’s release of ROME II and WARHAMMER. During that time, core and thread counts increased from four cores, eight threads to six cores, 12 threads, which in turn gave way to eight cores, 16 threads. As a result, developers were given even more new opportunities and new challenges.

Steve explained, “Scaling using cores only gets you so far. That’s because in multicore programming, the proportional performance increase from additional processing cores actually decreases with higher core counts.” In other words, Amdahl’s Law comes into play.
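Amdahl's Law puts a number on that diminishing return. The short sketch below evaluates the standard formula; the 80% parallel fraction used in the example is an assumed figure for illustration, not a measurement from any Total War title.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

If 80% of a frame's work is parallel, four cores give a speedup of about 2.5 and sixteen cores give about 4. No core count can exceed 5, because the 20% that stays serial always has to run on a single core.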

“Imagine a physics engine with one rock interacting with another rock. Processing collisions between the two takes a relatively small amount of processing power. But if you have a whole pile of rocks, or debris from exploding buildings, the collision detection complexity grows linearly, but the processing power required to handle it grows exponentially.”

That also applies to the vast armies in Total War. As you field more soldiers on a battlefield, each one needs to avoid colliding with the others. The number of soldiers increases linearly, but the processing power needed to handle them goes up far faster.
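A quick count shows how steep that growth can be. The sketch assumes the most naive approach, in which every soldier is tested against every other soldier. The source does not say how Total War detects collisions, and real engines normally prune most of these tests with spatial partitioning.

```python
def pair_checks(units):
    """Number of distinct pairs when every unit is tested against every other."""
    return units * (units - 1) // 2
```

Under that assumption, 1,000 soldiers need 499,500 pair tests per update and 10,000 soldiers need 49,995,000. Ten times the soldiers means roughly one hundred times the work.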

And that holds true for almost all other game visuals—surface realism in the form of human or animal skin, boulders, trees, buildings, flags, leaves, weapons, and monsters. “Adding more content to enhance game visuals and appeal isn’t as simple as it might sound.”

“Threaded code doesn’t play fair with memory,” Steve explained. “A lot of parallelized processes touch memory—virtual memory, L2 and L3 caches, VRAM, etc. As core counts increased, we had to pay closer attention to memory management and data housekeeping.”

Which brings us to another point—processor speed. As core counts increase, the ability to handle parallelized code alongside serial code benefits from the CPU’s raw physical speed. “We talk a lot about frame rate,” Steve said. “More cores let you render higher frame rates, and faster cores deliver even higher frame rates. But instead of raising the frame rate of the game, why not use the extra performance to generate more content that enhances visual quality, thus creating a more enhanced user experience?”

Think of it this way: if you’re processing AI ticks at 10 frames per second (FPS) to get good visuals, you can get even better visuals processing game ticks at 90 FPS. You’re using the same number of cores, but the net result is happening nine times faster.

Key to taking full advantage of that speed boost is ensuring that you’re running an efficient render pipeline. “I can’t stress enough how much effort we spend on this,” Steve continued. “Render pipelines are like an automotive assembly line. Car manufacturers often claim that they turn out cars at a rate of one every 45 to 90 seconds, but each of those cars typically spends anywhere from 10 to 20 hours or even longer on the assembly line. If you think of a render pipeline as an assembly line, code rolls off the pipeline smoothly and efficiently as long as the code is properly aligned.”
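The assembly-line analogy separates two numbers: how long one item spends in the pipeline (latency) and how often a finished item comes out (throughput). The sketch below models that with assumed stage times. It is an idealized pipeline in which every stage works on a different frame at the same time and nothing stalls, and the function names are hypothetical.

```python
def pipeline_latency(stage_times_ms):
    """Time one frame spends travelling through every stage."""
    return sum(stage_times_ms)


def pipeline_interval(stage_times_ms):
    """Time between finished frames: the slowest stage sets the pace."""
    return max(stage_times_ms)


def time_for_frames(stage_times_ms, frames):
    """Total time to finish a run of frames in an ideal, stall-free pipeline."""
    if frames <= 0:
        return 0.0
    return (pipeline_latency(stage_times_ms)
            + (frames - 1) * pipeline_interval(stage_times_ms))
```

With three stages of 4, 6, and 5 milliseconds, each frame takes 15 milliseconds from start to finish, yet a new frame is completed every 6 milliseconds. Shortening the slowest stage is what raises the frame rate.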

Tamas Rabel, Creative Assembly rendering technical director, went into great technical detail on Total War: THREE KINGDOMS’ rendering pipeline instruction queue for parallelized workloads at the Game Developers Conference 2019. You can view his presentation in the Intel Software Developer Zone.

Steve added that “we knew we needed to utilize every available Intel® hardware system resource. That included things such as dual-channel memory, Intel® Smart Caches, overclocking (processors whose model numbers include the letter K), Intel® Optane™ memory, and so on.”

Enter Laboratory Mode

This shift in focus fundamentally changed the nature of the collaboration between Intel and Creative Assembly. “Our partnership evolved around the time we started collaborating on WARHAMMER II. We were no longer zeroed in on optimizing and performance tuning. We were partnering with the studio on game design,” Steve said.

Charlie describes one of the first outcomes of that relationship. “Running six or eight super-fast, highly capable cores and using all of the coding techniques we developed throughout our collaboration with Intel, experiments revealed that we could get many thousands of men on an open battlefield. But things became challenging during other types of battles.” For example, during siege battles, rules that worked well for open-field conflicts broke down when many thousands of NPCs had to interact with siege towers and castle walls.

Figure 5a. Laboratory Mode controls in WARHAMMER II gave players control over parameters not normally exposed to them. (Image courtesy of Creative Assembly)

There just wasn’t enough physical space to accommodate that many soldiers. It was also too late in the development cycle to modify the game rules to accommodate the massive numbers of soldiers enabled by the latest processing and rendering power. Instead of leaving untapped compute power on the table, they used it to create competitive advantage. They invented “Laboratory Mode.”

Laboratory Mode in WARHAMMER II (Figures 5a and 5b) allowed players to tweak parameters that wouldn’t normally be exposed to them. Gravity, explosions, damage, and plenty of other effects could suddenly be dialed up, down, or turned off.

All this resulted in epic and often ridiculous battles. For example, if a player turned the explosion power up and gravity way down, when something blew up, soldiers would go flying hundreds of meters across the battlefield. Or off into space.

Figure 5b. WARHAMMER II Laboratory Mode let players create giant-sized war machines and monsters. (Image courtesy of Creative Assembly)

Players loved it. YouTube* filled with Laboratory Mode gameplay videos. Creative Assembly even created an official trailer to promote it.

Continually Leveling the Playing Field

By 2019, Total War: THREE KINGDOMS in main game mode provided still more realism with larger unit sizes. “Creative Assembly with Intel’s help strives to make Total War available on a wide range of hardware,” Steve said.

“Fielding an incredible number of soldiers was great if you’re in single-player mode. Or playing against other people using computers of similar or equal power,” Steve said. “The challenge has always been making it work when one player has a customized gaming rig and another is on an off-the-shelf laptop.”

9th Generation Intel Core Desktop Processors (other chips available)

CPU Model    Intel Core i5-9600K    Intel Core i7-9700K    Intel Core i9-9900K
Cores        6                      8                      8
Threads      6                      8                      16
Speed        Up to 4.6 GHz          Up to 4.9 GHz          Up to 5 GHz

9th Generation Intel Core Mobile Processors (other chips available)

CPU Model    Intel Core i5-9400H    Intel Core i7-9750H    Intel Core i9-9980HK
Cores        4                      6                      8
Threads      8                      12                     16
Speed        Up to 4.3 GHz          Up to 4.5 GHz          Up to 5 GHz

Ninth generation Intel Core processors offer:

  • Up to eight cores and 16 threads
  • Support for up to 64 GB dual-channel DDR4 2666 memory
  • Intel® UHD Graphics 630 to deliver 4K video and 360-degree viewing
  • Intel Optane memory with solid-state storage to speed load and launch times
  • Processors with K in their model number to support overclocking

“For our part,” Steve continued, “we helped Creative Assembly craft strategies to get the most out of Intel platforms.” Some steps were fairly straightforward, like loading game data that’s used most often into a single storage file that’s loaded into Intel Optane memory. This helps ensure that key game content gets retrieved very quickly, which in turn results in super-fast system responsiveness.

Other strategies were much more complex. For example, the sheer size of the cities in THREE KINGDOMS required special attention. Not only did cities encompass large numbers of individual buildings, those buildings were all destructible.

Finding optimal ways to render scalable amounts of debris and rubble as buildings were destroyed proved to be one of many rendering challenges. Applying techniques and lessons learned from previous Total War development efforts, Creative Assembly and Intel engineers used variable LOD techniques to great effect.

One such technique, pseudo impostors, was used to render variable LOD of buildings based on distance from camera, and helped boost performance across low-, medium-, and high-powered systems. Creative Assembly also used pseudo impostors to render low-polygon-count characters at two LOD levels (fewer than 200 triangles and fewer than 50 triangles).
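Distance-based LOD selection can be sketched as a lookup against a small table. The two triangle budgets come from the text above; everything else is an assumption made for illustration, including the distance thresholds, the level names, and the idea that a full-detail mesh is used up close.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LodLevel:
    name: str
    max_triangles: Optional[int]  # None means full-detail mesh, no budget here
    min_distance: float           # level applies at or beyond this camera distance


# Hypothetical table, sorted by min_distance in ascending order.
LOD_TABLE = (
    LodLevel("full_mesh", None, 0.0),
    LodLevel("pseudo_impostor_high", 200, 150.0),
    LodLevel("pseudo_impostor_low", 50, 400.0),
)


def select_lod(distance, table=LOD_TABLE):
    """Return the farthest level whose min_distance the object has reached."""
    chosen = table[0]
    for level in table:
        if distance >= level.min_distance:
            chosen = level
    return chosen
```

A game could also shift the thresholds per machine, pulling them closer on low-powered systems and pushing them out on high-powered ones, which is one way the same content scales across hardware.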

Prior to WARHAMMER, the team used a classic billboard system with pre-rendered animations. Particle system spark makers could take hundreds of megabytes of video RAM (VRAM). That limited the amount of material information that could be stored, so they only stored diffuse color. That meant any soldier with a spear looked like any other soldier with a spear. The effect was amplified as you got farther away from the camera. To differentiate opposing armies or units, they tinted soldiers in a color that signified they belonged to the same faction.

With WARHAMMER, thanks to the pseudo impostors technique, animations could use traditional kinematic skeletons, albeit skeletons that took only a couple of bits per bone. The result was more realism that scaled with available PC power.

Steve stressed the importance of designing the rendering pipeline with hyperthreading in mind from the start. “Orchestrating all the threads in a game is a careful balancing act. As a best practice, we recommend that you plan for hyperthreading at the beginning of any project to get the most benefit from it.”

Which brings us back to the constant battle between making things run as fast as possible on the CPU and GPU and balancing those workloads.

To that end, Steve advises:

  • Cull (don’t draw) things that are not visible.
  • Reduce vertex counts where possible using LOD and pseudo impostor techniques.
  • Reduce shader load by providing a range of rendering techniques that vary in complexity. Then select the right one for the target.
  • Identify scene elements that have a high cost and optimize them.
  • Thread tasks both in the main game and in the graphics thread (Total War games have two threads: a main thread that runs the game, and a graphics thread that prepares and sends content to the GPU).
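The first item in the list, culling, is often implemented as a bounding-sphere test against the planes of the view volume. The sketch below shows that generic test. It is not Creative Assembly's culling code, and the plane representation and the object dictionaries are assumptions chosen to keep the example short.

```python
def sphere_visible(center, radius, planes):
    """Conservative visibility test for a bounding sphere.

    planes: iterable of ((nx, ny, nz), d) with unit normals pointing
    into the view volume, so points inside satisfy n . p + d >= 0.
    """
    for (nx, ny, nz), d in planes:
        signed_distance = nx * center[0] + ny * center[1] + nz * center[2] + d
        if signed_distance < -radius:
            return False  # entirely behind this plane, so it cannot be seen
    return True


def cull(objects, planes):
    """Keep only the objects whose bounding spheres may be visible."""
    return [obj for obj in objects
            if sphere_visible(obj["center"], obj["radius"], planes)]
```

The test is conservative: a sphere that straddles a plane is kept, so nothing visible is dropped, at the cost of occasionally drawing an object that is just out of view.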

“The goal,” Steve said, “is to balance CPU work and GPU work on your whole set of target machines so that neither CPU nor GPU is a bottleneck on any system.”

Dynasty Mode

Laboratory Mode in WARHAMMER was a huge hit with gamers, so Creative Assembly, working with Intel engineers, took what they’d learned from Laboratory Mode and created a new arcade-style gameplay mode for players with high-powered PCs. In Dynasty Mode, individual players or teams score points as three heroes face off against wave upon wave of enemy units.


Figure 6. Players pick three heroes to battle waves of enemies in a new arcade-style gameplay mode that resulted from the collaboration between Intel and Creative Assembly. (Image courtesy of Creative Assembly)

Throughout the Total War franchise, armies consist of “units”—groups of soldiers. The new mode renders larger and larger unit sizes based on a PC’s specs. Machines with higher core counts and faster processors, such as those equipped with 9th generation Intel Core i9 processors, offer the best frame rate gains. That’s because larger unit sizes and interactions draw more heavily on the CPU. PCs with less power, however, don’t necessarily handle the excess CPU workload well.

Figure 7. In Dynasty Mode, when played on fast, highly capable PCs, enemy units can contain thousands of soldiers. (Image courtesy of Creative Assembly)

Game Changers

For games where jaw-dropping impact and performance play an essential role, super-fast, highly capable Intel cores help deliver the best gaming experience on any Intel platform.

“For more than a decade, the collaboration between Intel and Creative Assembly focused on making games that play well on any PC that meets the minimum spec requirements and play amazingly well on high-end PC systems,” Charlie said.

For their part, Intel application engineers analyzed content, code, and gameplay and offered expert advice. Occasionally they contributed code, usually in the form of shaders or Intel Graphics extensions. And their collaboration crossed over into game design, which resulted in innovative new gameplay modes to the delight of WARHAMMER II and THREE KINGDOMS players.

Creative Assembly reciprocated by test driving pre-release Intel® software developer tools and new hardware platforms. Regular support and onsite engineering meetings, along with intensive profiling and debugging sessions, ensured that whatever project was in the works moved forward. In the process, both teams shared their wealth of knowledge. There was always something new to learn, a new goal to achieve, and new hardware and software to assimilate.

At the end of the day, Creative Assembly has been able to release award-winning titles that take advantage of all the power on any given PC platform, delivering great performance on laptops and high-powered gaming rigs alike. And Intel continues to invest in new technologies that will take gameplay into the future.

What creative vision will Intel platforms help you bring to life?

Call to Action

Learn more about Intel platforms and get acquainted with Intel software tools, case studies, code samples, and how-to guides here.

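Product and Performance Information

1

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804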