This article proposes a simple and efficient multithreading solution to accelerate rendering for a number of animated 3D models. The solution enhances the performance of skeletal animation by using a thread pool, double buffering, and intermittent updates. This set of techniques performs CPU-based skeletal animation at performance levels that are competitive with GPU-based animation implementations, and it serves as a flexible alternative on multi-core systems.
Animation plays an important role in modern games. Skeletal animation, one of the most commonly used and technically advanced real-time animation techniques, treats the triangle mesh of a model as a skin and defines an underlying structure, called a ‘skeleton’, that influences the position of the vertices on the original mesh. An animation set transforms the skeleton through a series of poses. Because each vertex in the original mesh is transformed by one or more influencing bones in the skeleton, the rendering of animated 3D models becomes a significant performance bottleneck. This is particularly true when multiple animations are present as game scenes become more complicated, such as in massively multiplayer online role-playing games (MMORPGs) where many animated models are visible in battlefields or complex urban environments.
A number of software and hardware technologies have been used to optimize the performance of skeletal animation. Adjusting the Level of Detail (LOD) of model meshes and skeletons can reduce the complexity of the animation of distant models. Streaming SIMD Extensions (SSE) instructions may more than double the performance of a base C implementation. Animation can also be offloaded from the CPU to a GPU that supports vertex shaders, with significant performance gains.
Over the past few years, two trends have emerged in the PC industry. Most of the processors shipped for desktops and laptops have multiple execution cores, which means that any game that begins production today will be released in a market of mainstream multi-core processors. In addition, shipments of laptops are growing faster than those of desktops. Laptops, many equipped with integrated graphics, are increasingly common as game platforms. Consequently, how to best make use of the CPU and GPU resources of multi-core systems is an important consideration for game developers.
This article presents a simple and efficient multithreading solution for model animation that improves the performance of games by harnessing multi-core processor resources. The solution parallelizes the rendering pipeline of skeleton animation by using a thread pool, double buffering, and intermittent updates. A demo application for rendering many animated 3D models has been developed to validate the usefulness of the solution. Performance testing of the demo indicates that this solution benefits the skeletal animation of both GPU-based and, especially, CPU-based implementations. Furthermore, the CPU-based animation using this solution provides performance that is competitive with high-end discrete graphics cards, and it delivers significant improvements over mainstream integrated graphics.
The balance of this article is organized as follows:
- Section 2 introduces the technique of skeletal animation and compares CPU-based and GPU-based implementations.
- Section 3 presents the solution and its implementation.
- Section 4 describes a demonstration of the solution.
- Section 5 presents a performance benchmark and analysis.
- Section 6 describes some additional considerations.
Skeletal animation is one of the most commonly used and technically advanced real-time animation techniques. For skeletal animation, a model is converted into a simplified representation consisting of a number of bones that are organized into a hierarchy to define the model’s skeleton. The animation is defined by positioning the bones into a certain number of key frames specified as relative transforms for each bone. Bone positions can then be interpolated between key frames. The model’s skin, a polygon mesh, is layered on top of the bones. Each vertex of the mesh is influenced by one or more bones in the skeleton. Skinning, the process of matching the model’s skin to the current bone positions, modifies the position of each vertex based on the current bone positions. At runtime, the game engine determines the appropriate animation sequence and time, transforms the skeleton to the appropriate position, skins the model, and, finally, draws the model.
To determine the skeleton’s position for a given render frame, the game engine identifies the animations currently applied to a model (a model may have multiple animations applied simultaneously), and extracts the appropriate key frames for the animation sequences. The key frames contain information to transform each bone from its reference position to the position specified by the key frame. For each animation, the bone transformations are interpolated from the two surrounding key frames. If the model is influenced by multiple animations, all of those animations are blended together to get the final skeleton transforms. This work is typically performed on the CPU.
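The interpolation and blending steps described above can be sketched as follows. This is a minimal illustration rather than the demo's code: the `KeyFrame` layout, `sampleTranslation`, and `blend` are hypothetical names, and only bone translation is interpolated (a real engine would also interpolate rotation, typically with quaternions, and scale).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical key frame: a time stamp plus one bone's translation.
struct KeyFrame {
    float time;
    float x, y, z;
};

// Linearly interpolate a bone's translation at time t from the two
// surrounding key frames. Keys must be non-empty and sorted by time.
Vec3 sampleTranslation(const std::vector<KeyFrame>& keys, float t) {
    assert(!keys.empty());
    if (t <= keys.front().time) return {keys.front().x, keys.front().y, keys.front().z};
    if (t >= keys.back().time)  return {keys.back().x, keys.back().y, keys.back().z};
    std::size_t i = 1;
    while (keys[i].time < t) ++i;                // keys[i-1].time < t <= keys[i].time
    const KeyFrame& a = keys[i - 1];
    const KeyFrame& b = keys[i];
    float u = (t - a.time) / (b.time - a.time);  // 0..1 between the two keys
    return {a.x + u * (b.x - a.x), a.y + u * (b.y - a.y), a.z + u * (b.z - a.z)};
}

// Blend the results of two animations (for example Walk and Run);
// w is the weight of the second animation, in the range 0..1.
Vec3 blend(const Vec3& first, const Vec3& second, float w) {
    return {first.x + w * (second.x - first.x),
            first.y + w * (second.y - first.y),
            first.z + w * (second.z - first.z)};
}
```

Sampling each active animation with `sampleTranslation` and combining the results with `blend` yields the per-bone transform that the skinning step consumes.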
Skinning moves the model’s polygonal mesh to the correct animated relative position. Each vertex is first transformed from the mesh’s reference position to the bone’s reference position. Then the vertex is moved to its final position using the bone’s transform. If the vertex is influenced by multiple bones, each bone's influence is calculated independently and then blended together, possibly with different weights for each bone, to determine the final vertex position. Skinning may be performed in a vertex shader on the GPU (referred to here as "GPU-based skeletal animation"), or on the CPU (referred to here as "CPU-based skeletal animation").
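As a rough illustration of CPU-based skinning, the sketch below blends the positions produced by each influencing bone. The types and the `skinVertex` function are assumptions made for this example; each entry of `skinMatrices` is assumed to already combine the bone's mesh-to-bone reference transform with its current animated transform.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 3x4 affine transform: a 3x3 rotation/scale block plus a
// translation in the last column of each row.
using Mat3x4 = std::array<float, 12>;

Vec3 transformPoint(const Mat3x4& m, const Vec3& p) {
    return {m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3],
            m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7],
            m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11]};
}

// One bone's influence on a vertex; weights are expected to sum to 1.
struct Influence {
    std::size_t bone;
    float weight;
};

// Transform the reference-pose vertex by every influencing bone and
// accumulate the weighted results into the final position.
Vec3 skinVertex(const Vec3& restPosition,
                const std::vector<Influence>& influences,
                const std::vector<Mat3x4>& skinMatrices) {
    Vec3 out{0.0f, 0.0f, 0.0f};
    for (const Influence& inf : influences) {
        assert(inf.bone < skinMatrices.size());
        Vec3 p = transformPoint(skinMatrices[inf.bone], restPosition);
        out.x += inf.weight * p.x;
        out.y += inf.weight * p.y;
        out.z += inf.weight * p.z;
    }
    return out;
}
```

Because `skinVertex` reads only shared, immutable inputs and writes only its own result, vertices (and whole models) can be processed independently, which is the property the threading approach relies on.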
Because skinning calculations for each vertex can be performed independently, GPU-based skeletal animation can take advantage of the high throughput of modern GPUs. For middle- to high-end GPUs, skinning is faster and requires less memory than CPU-based implementations. However, for some applications it may be beneficial to perform skinning on the CPU. Performing skinning on the CPU improves compatibility across a wide range of systems. Older systems may have graphics cards that do not support the necessary features to perform skinning on the GPU. Graphics cards that do allow skinning on the GPU may have limitations that force large skeletal models to be subdivided into multiple meshes, which reduces efficiency.
Some rendering techniques require a mesh to be drawn multiple times, often to temporary buffers or surfaces. In this situation, any vertex transformations have to be recalculated on the GPU for every rendering pass, because intermediate results cannot be saved. This may cause performance to be limited by vertex program execution on today’s mainstream GPUs. In such cases, performance may improve when vertices are transformed on the CPU, where any transformations only need to be performed once. In addition, SSE instructions and multithreading optimization may yield performance competitive with (partial) GPU-based implementations. For these reasons, some computer games like DOOM* III and Quake* 4 adopt CPU-based skinning, allowing the games to run on a wide range of system configurations. Further, many algorithms that are commonly implemented on the CPU (e.g., Shadow Volumes and Collision Detection) require the animated mesh data. However, current graphics cards do not allow this data to be retrieved after it has been processed, so CPU-based animation can provide additional benefits.
The last stage of the rendering pipeline for 3D model animation draws the skin vertices to the screen. In this stage, the transformed vertices go through clipping, rasterizing, shading, etc. to be rendered on the screen. This stage is performed by the GPU.
The primary goal of our solution is to take maximum advantage of the parallelism in the rendering pipeline for skeletal animation based on software. A few key observations influenced the design of the solution. First, the pipeline is computationally intensive because of the necessity of massive matrix calculations. Second, the animation of different animated models is usually independent from one another, which allows the performance of the rendering pipeline for skeletal animation to benefit from multithreading. Third, in many cases (especially for MMORPGs), frames within a model’s animation over short time intervals are often so similar that it is unnecessary to update some models every frame. Intermittently updating the model animation can reduce bus traffic, improve the parallelism between the CPU and GPU, and avoid redundant calculations.
The architecture of this solution is illustrated in Figure 1. Three primary components compose this solution: thread pool, double buffering, and intermittent update. The thread pool hosts several animation threads, which perform the bone transformation calculations and skinning. Drawing is performed by the main thread, because calling graphics APIs such as DirectX* and OpenGL* from multiple threads incurs a performance penalty. Double buffering is used to maximize the parallelism and simplify the synchronization between animation threads and final draw calls. Finally, intermittent updating is an optional addition that controls the frequency of animation updates.
The following sections describe the details of these components, as well as the resource management strategy called for in this solution.
Figure 1: The solution architecture
When a model is animated, a separate animation thread is assigned to the model to process the skeleton transformation and skinning. To render a number of animated models in parallel, the thread pool is used to avoid the heavy cost of frequent creation and destruction of animation threads. The number of animation threads in the pool also scales with the number of animated models and processor cores. Moreover, it provides a straightforward programming interface through which the main thread is merely required to submit models to the task queue. The thread pool handles the scheduling of the task queue automatically. Using a well-wrapped thread pool, such as the Win32* system thread pool, the definition of a threaded model class is straightforward, as shown in the following passage of code:
The model is submitted to the thread pool by calling ‘QueueUserWorkItem’, the Win32 thread pool API, in the model class function ‘Animate()’. The third parameter of this function (the flags argument) can also specify the maximum number of threads in the thread pool, for example through the WT_SET_MAX_THREADPOOL_THREADS macro. Generally, the number of active threads in an application should be no more than the number of processor cores available. Oversubscription of threads does not increase performance, due to the overhead of thread scheduling.
Because the main thread must synchronize with the animation threads, a simple initial synchronization method is proposed. A counter tracking the number of models in the thread pool is accessed atomically by the main thread and the animation threads using the lightweight Win32 ‘Interlocked’ synchronization functions. When the counter is equal to zero, the main thread may start to draw the models.
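The pattern of submitting models to a pool and waiting on a shared counter can be sketched in portable C++. This is not the Win32 code the text refers to: `AnimationPool` is a hypothetical class, `std::thread` stands in for the system thread pool, and `std::atomic<int>` stands in for the Interlocked functions.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class AnimationPool {
public:
    explicit AnimationPool(unsigned threadCount) {
        assert(threadCount > 0);
        for (unsigned i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~AnimationPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    // Main thread: queue one model's animation task.
    void submit(std::function<void()> task) {
        pending_.fetch_add(1);               // one more model in the pool
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

    // Main thread: drawing may start once the counter reaches zero.
    void waitAll() const {
        while (pending_.load() != 0) std::this_thread::yield();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();                          // skeleton transform + skinning
            pending_.fetch_sub(1);           // this model is finished
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::atomic<int> pending_{0};
    std::vector<std::thread> workers_;
};
```

A frame would call `submit` once per visible model, then `waitAll` before issuing draw calls. Sizing the pool to the number of available cores (for instance from `std::thread::hardware_concurrency()`) follows the guidance above about avoiding oversubscription.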
Although the simple synchronization method described above avoids data races, it requires that all skinning be completed before rendering. Because skinning calculations for each frame are independent, double buffering is introduced to allow skinning and rendering to be performed simultaneously. Each animated model is given two buffers for skinning data. One buffer stores data for the current frame and is used by the main thread for drawing. The second buffer stores data for the next frame and is written to by an animation thread.
Figure 2. The mechanism of double buffering
When a model enters the camera’s field of view, the animation data in the read buffer may be either unready or too old for the main thread to draw. To avoid this issue, a life cycle is attached to each buffer. The buffer is assigned a new life when the latest animation data is placed in it, and its life is decreased in every subsequent frame. The main thread uses only a buffer with non-negative life to draw the model.
The following passage of code presents an implementation of double-buffering in the model class:
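A minimal sketch of such a class, under stated assumptions, might look like the following. The names `SkinBuffer` and `DoubleBufferedModel` are hypothetical, and the sketch assumes that `beginFrame` is called by the main thread only at a point where no animation task for the model is in flight.

```cpp
#include <array>
#include <cassert>
#include <utility>
#include <vector>

struct Vertex { float x, y, z; };

struct SkinBuffer {
    std::vector<Vertex> vertices;
    int life = -1;   // negative: no usable data yet, or data too old
};

class DoubleBufferedModel {
public:
    explicit DoubleBufferedModel(int maxLife) : maxLife_(maxLife) {
        assert(maxLife >= 0);
    }

    // Animation thread: store the next frame's skinned vertices in the
    // write buffer and give that buffer a fresh life.
    void writeNext(std::vector<Vertex> skinned) {
        SkinBuffer& write = buffers_[1 - readIndex_];
        write.vertices = std::move(skinned);
        write.life = maxLife_;
    }

    // Main thread, once per frame: swap if new data arrived, otherwise
    // age the data that is currently being displayed.
    void beginFrame(bool newDataReady) {
        if (newDataReady) {
            readIndex_ = 1 - readIndex_;
        } else if (buffers_[readIndex_].life >= 0) {
            --buffers_[readIndex_].life;
        }
    }

    // Main thread: vertices to draw, or nullptr if the read buffer is
    // unready or has expired.
    const std::vector<Vertex>* readable() const {
        const SkinBuffer& read = buffers_[readIndex_];
        return read.life >= 0 ? &read.vertices : nullptr;
    }

private:
    std::array<SkinBuffer, 2> buffers_;
    int readIndex_ = 0;
    int maxLife_;
};
```

Returning `nullptr` from `readable` lets the main thread skip a model whose data is missing or stale, which covers the case of a model that has just entered the field of view.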
When beginning the game loop, the main thread updates the status of double buffers. The following code implements double buffering in the game loop:
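One way to picture the loop is the sketch below, which is illustrative only. It uses `std::async` in place of the thread pool to stay short, and `Model`, `animate`, `draw`, and `runFrames` are hypothetical stand-ins. Each frame first synchronizes with the previous animation task and swaps buffers, then starts the next task, and finally draws from the read buffer while that task runs.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

struct Model {
    std::vector<float> buffers[2];  // skinned data: one read, one write
    int readIndex = 0;
    std::future<void> pending;      // task filling buffers[1 - readIndex]
};

// Stand-in for skeleton transformation and skinning of one frame.
void animate(std::vector<float>& out, int frame) {
    out.assign(3, static_cast<float>(frame));
}

// Stand-in for the draw call; returns the value it "drew" (-1 if empty).
float draw(const std::vector<float>& vertices) {
    return vertices.empty() ? -1.0f : vertices[0];
}

std::vector<float> runFrames(Model& model, int frameCount) {
    assert(frameCount >= 0);
    std::vector<float> drawn;
    for (int frame = 0; frame < frameCount; ++frame) {
        // 1. Update buffer status: wait for the previous task, then swap.
        if (model.pending.valid()) {
            model.pending.get();
            model.readIndex = 1 - model.readIndex;
        }
        // 2. Start animating the next frame into the write buffer.
        std::vector<float>& writeBuffer = model.buffers[1 - model.readIndex];
        model.pending = std::async(std::launch::async, animate,
                                   std::ref(writeBuffer), frame + 1);
        // 3. Draw the current frame from the read buffer in parallel.
        drawn.push_back(draw(model.buffers[model.readIndex]));
    }
    if (model.pending.valid()) model.pending.get();
    return drawn;
}
```

In this sketch the first frame finds an empty read buffer, which is the "unready" situation that the buffer life cycle is meant to detect.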
Double buffering also facilitates multithreaded programming, because programmers can assume that valid animation data for a model is available. For example, an animation thread processing one model can get the animation data of another model from its read buffer, which may enable some other parallel processes, such as parallel collision detections among models. The main drawback of double buffering is additional memory consumption. For instance, when rendering 100 models using CPU-based skinning, assuming that each has 5000 vertices with a vertex size of 32 bytes, the additional buffer occupies about a 16 MB memory footprint.
One of the primary reasons why many CPU-based skinning implementations are slower than GPU-based implementations is that skinned vertices are uploaded to the video card so frequently that bus bandwidth becomes a performance bottleneck. For many game scenes, it is not necessary to update the skinning data every frame. This is the case, for example, in a large battle scene in an MMORPG where many animated models are visible at the same time. The models do not need to be updated every frame to maintain a smoothly animated appearance. Also, because the cost of animation is amortized over multiple frames, more complex animations can be performed without adversely affecting frame rate.
The following code adds intermittent updates to the main game loop and the model class:
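A minimal sketch of the idea, assuming a single update interval shared by all models, is shown here. `IntermittentUpdater`, `updateFrequency`, and `maxInterval` are hypothetical helpers introduced for illustration; the helpers assume the update frequency is the frame rate divided by the interval.

```cpp
#include <cassert>

// Decides, frame by frame, whether animation tasks should be submitted.
class IntermittentUpdater {
public:
    explicit IntermittentUpdater(int interval) : interval_(interval) {
        assert(interval >= 1);
    }

    // Call once per rendered frame. Returns true on the frames that
    // start a new update interval; on other frames the main thread
    // simply redraws the data already in the read buffers.
    bool tick() {
        bool update = (frame_ % interval_ == 0);
        ++frame_;
        return update;
    }

private:
    int interval_;
    long long frame_ = 0;
};

// Animation updates per second for a given frame rate and interval.
inline double updateFrequency(double framesPerSecond, int interval) {
    assert(interval >= 1);
    return framesPerSecond / interval;
}

// Largest interval that keeps the update frequency at or above a minimum.
inline int maxInterval(double framesPerSecond, double minUpdatesPerSecond) {
    assert(minUpdatesPerSecond > 0.0);
    int interval = static_cast<int>(framesPerSecond / minUpdatesPerSecond);
    return interval < 1 ? 1 : interval;
}
```

With an interval of 1, `tick` returns true every frame, which matches animating each model every frame.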
The actual frequency of updating model animation is equal to the frame rate divided by the update interval. When the interval equals one, the solution is equivalent to animating each model every frame. If skinning is a bottleneck, larger update intervals will increase frame rate while decreasing the update frequency of the model. Game developers should select an appropriate interval to keep the update frequency above a minimum threshold (e.g., 15 per sec).
Resources associated with 3D models, such as vertex buffers, index buffers, and textures, are frequently accessed by the CPU and GPU while the animated model is being rendered. The location and usage of these resources have a great influence on the performance of the rendering pipeline of skeletal animation. For GPU-based animation, most of the workload is performed by the GPU. Typically in DirectX 9, the resources are placed in the local video memory as a “default resource” for the GPU to access quickly. This strategy is also effective in the proposed solution.
For CPU-based animation, resources such as vertex buffers of meshes are updated frequently by the CPU. Typically, these resources are placed in non-local video memory (such as AGP memory) as a “dynamic default resource” of DirectX 9. In the implementation described here, for example, the POOL_DEFAULT flag and the USAGE_DYNAMIC hint of DirectX 9 are used to create the vertex buffer of model meshes. Although the above resource-management strategy is the most beneficial one for traditional CPU-based animation, the resultant performance is still not as good as that of GPU-based animation in most situations. This is because GPUs are able to access the local video memory much faster than non-local memory, and traditional CPU-based animation uploads the mesh vertices every frame, which causes the bus to become a performance bottleneck.
Because this proposed solution updates the model animation intermittently, the resources are not uploaded every frame. Through the intermittent update approach, the solution turns the dynamic resource of model animation into a semi-dynamic one that can be stored as a “managed resource” instead of a “dynamic default resource.” DirectX 9 will locate the resource in an appropriate place to ensure that the CPU and GPU can access it efficiently.
To evaluate the usefulness of this solution, a demo that renders many animated 3D models was implemented. The demo, based on the DirectX sample ‘MultiAnimation’, demonstrates CPU-based and GPU-based skeletal animation with single-threaded and multithreaded versions of each.
This demo shows a number of models wandering on a planar floor (Figure 3). The models are controlled by either the application (default) or the user. Every model has three sets of animation: Loiter, Walk, and Run. The animation controller blends animation sets together to ensure a smooth transition when moving from one animation set to another. Collision detection is also performed, so models can block each other's movement. In this demo, the animation task of an animation thread consists of blending the animation sets, animating the skeleton, and skinning (in the CPU-based case). Models outside the camera frustum are not animated. In the application control mode, the camera can be moved with the arrow keys. In the user control mode, a selected model can be moved with the “A/W/D” keys, and the camera will follow the model.
Figure 3. The screen snapshot of our demo
A configuration file is used to initialize the application with various running modes. The demo can be configured with the following options:
- Method: choosing CPU-based skinning or GPU-based skinning
- Threads: the number of animation threads in thread pool; non-zero values indicate using the proposed multithread solution, and zero indicates the traditional single-thread implementation
- Models: the initial number of models in the demo
- Interval: the number of frames in an update interval; this option is only valid in the proposed solution
The source code of this demo is available using the reference in Appendix A.
The demo was tested on two system configurations:
- Configuration A: 2.6 GHz Intel® Core™2 Quad processor (4 cores); ATI Radeon X1900 Series*
- Configuration B: 2.4 GHz Intel® Core™2 Duo processor (2 cores); Intel® Integrated Graphics GM965
Table 1. Testing platform configurations
Note: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
On Configuration A, a desktop with a multi-core processor and a high-end discrete graphics card, the performance of the traditional single-thread implementations of skeletal animation is benchmarked as the baseline. The testing results are presented in Figure 4. Test 1 represents the performance data for CPU-based skeletal animation, which performs skinning on the CPU. Test 2 represents the performance data of GPU-based skeletal animation, which performs skinning on the GPU. The CPU-based implementation is much slower than the GPU-based one.
Figure 4. Benchmarking the performance of the traditional single-thread skeletal animation on Config A
Next, the performance of the multithreaded implementations of skeletal animation using the proposed solution is benchmarked. The testing results are presented in Figure 5. For the CPU-based animation, the outcome of this solution is much better than the traditional single-thread solution (compare Test 3 / 4 / 5 with Test 1) with a speedup of approximately 2.69x. For GPU-based animation, the proposed solution also shows obvious improvement over the single-threaded approach (compare Test 6 / 7 / 8 with Test 2), although the performance enhancement is not as high as that of the CPU-based animation. This solution makes the performance of CPU-based animation comparable with (compare Test 4 with Test 7) and even faster than GPU-based skinning (compare Test 4 with Test 2, or compare Test 5 with Test 8). Because the CPU-based animation has more parts of the rendering pipeline on the CPU than the GPU-based animation, it can leverage multithreading and decrease bus traffic.
Figure 5: Benchmark the performance of the Solution on Config A (animation threads = 3)
Figure 5 also indicates that the application performance does benefit from intermittent updating. However, while increasing the update interval results in higher frame rates, it causes the frequency of the animation update to decrease, as can be inferred from the data in Figure 5. The quality of the visual experience could suffer when the frequency of animation updates is too low (e.g., below 15 per second). In Test 4, the demo program uses an update interval of 6 frames to give the CPU-based animation performance comparable to the GPU-based animation while maintaining an update frequency between 21 and 36 updates per second.
Intel® Thread Profiler was used to capture the detailed thread behaviors of this solution. Intel Thread Profiler is a threading analysis tool that plugs into the Intel® VTune™ Performance Analyzer and can profile the load balance of threads and potential threading overheads. Using Intel Thread Profiler on the demo of this solution, the typical output (Figure 6) shows that the animation threads and main thread work almost completely in parallel (indicated by the “Fully Utilized” bar) with very low synchronization overhead due to the double buffering. The workloads (indicated by the “Active” bar) among animation threads are fairly well balanced due to the dynamic scheduling of the thread pool. Furthermore, the difference in the workloads between the main thread and the animation threads suggests that the latter can accommodate heavier-weight animation tasks that require more time to complete.
Figure 6: The typical output of Intel® Thread Profiler as profiling the demo of the solution
For validating the resource management strategy of this solution for CPU-based animation, the demo is benchmarked with various options of DirectX mesh creation. The testing results (Table 2) show that the managed resource has the best performance for the solution, and other resource strategies could cause the application to become locked in the multithread context.
Table 2: The performance influence of various resource management strategies on the CPU-based animation using the solution (metric: frames per second; columns: test number, animation threads = 0, animation threads = 3)
For Configuration B, a laptop configured with a dual-core processor and integrated graphics, the performance of the multithread implementations of skeletal animation using the proposed solution is benchmarked (Figure 7).
Figure 7: Benchmark the performance of the Solution on Config B (animation threads = 1)
The testing results show that the performance of CPU-based animation using this solution is much better (about 3.5x-4.2x) than that of GPU-based animation. Here, the GPU is the bottleneck, so performing the skinning on the CPU lessens the load on the GPU. The update interval has almost no influence on the performance of either the CPU-based or GPU-based animation on Config B. Intel's integrated graphics solutions make use of a Unified Memory Architecture in which the GPU has no local video memory and always accesses main memory for data, so there are no bandwidth savings with a larger update interval.
While the demo implementation only allows one update interval for all models, real game engines may need more control. A straightforward method for two update intervals is that the models requiring immediate update are rendered sequentially by the main thread, while other models may be submitted to the default thread pool. A more flexible method might involve a thread pool with multiple priority queues. The high priority queues store the quickly animated models, and the low priority queues store the slowly animated models. The animation threads in the thread pool process the high priority queues prior to the low priority queues. Because the main thread is synchronized with the related animation threads in every interval, priorities can be rebalanced so that low priority tasks are not ignored.
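The multiple-priority-queue idea could be sketched as below. This is a hypothetical data structure, not part of the demo: `PriorityTaskQueue` holds model names as stand-ins for animation tasks, uses only two priority levels, and omits the locking that a queue shared between threads would need.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

enum class Priority { High, Low };

// Two FIFO queues; high-priority (quickly animated) models are always
// handed out before low-priority (slowly animated) ones.
class PriorityTaskQueue {
public:
    void push(Priority priority, std::string modelName) {
        (priority == Priority::High ? high_ : low_).push_back(std::move(modelName));
    }

    bool empty() const { return high_.empty() && low_.empty(); }

    // Next model to animate: drain the high-priority queue first.
    std::string pop() {
        assert(!empty());
        std::deque<std::string>& queue = high_.empty() ? low_ : high_;
        std::string next = std::move(queue.front());
        queue.pop_front();
        return next;
    }

private:
    std::deque<std::string> high_;
    std::deque<std::string> low_;
};
```

Since the main thread waits for the pool at the end of every interval, the low-priority queue is drained before the next interval begins, so slow models are delayed but not starved.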
Although the solution was designed for skeletal animation, it can be extended to other animation types such as morphing, particles, cloth, soft body simulation, and animated textures. Besides transforming the skeleton and skinning, the animation task of the animation thread in this solution may comprise other computation-intensive workloads, such as physics simulation, AI, collision detection, and shadow volume extrusion.
This article proposes a simple and efficient multithreading solution to improve the performance of skeletal animation by harnessing multi-core processing power. The solution parallelizes the rendering pipeline of skeletal animation using a thread pool, double buffering, and intermittent updates. This solution is demonstrated to benefit both CPU-based and GPU-based skeletal animation, especially in the CPU-based case. The resultant performance is much better than that of the traditional single-threaded implementation. Furthermore, the CPU-based animation using this solution is competitive with high-end discrete graphics cards, and it delivers significant improvements for mainstream integrated graphics. The solution leverages multi-core processor power and saves GPU resources for rendering complex game scenes; it is suitable for GPU-bound games and games that require wide graphics compatibility. In addition, the solution may be extended to more types of animations. When creating games, developers may combine this solution with other optimization approaches to balance the workload of CPU and GPU hardware and achieve the best overall performance on multi-core systems.
The authors would like to thank Adam Lake, Bruce Zhang, Clay Breshears, and Ron Fosner for their technical review.
 J.M.P van Waveren, Id Software, Inc.; Optimizing the Rendering Pipeline of Animated Models Using the Intel Streaming SIMD Extensions
 MSDN: QueueUserWorkItem Function
 MSDN: Synchronization Functions
 Microsoft DirectX Documentation for C++: D3DCREATE_MULTITHREADED
 Microsoft DirectX Documentation for C++: Performance Optimizations
 Microsoft DirectX Documentation for C++: Resource Management Best Practices
 Microsoft DirectX SDK (April 2007): MultiAnimation Sample C++ (September 2003)
 Intel Thread Profiler: http://www.intel.com/cd/software/products/asmo-na/eng/219690.htm
Demo Source Code Download [ZIP 1.1MB]
About the Authors
Sheng Guo is an application engineer in Intel's Developer Relations Division, focusing on enabling online game ISVs with Intel advanced technologies and performance optimization. He has worked in real-time 3D graphics for several years. He has a Master’s degree in Computer Science from Nanjing University.
Dave Bookout is a graphics software engineer in Intel’s Visual Computing Software Division working on the graphics algorithms supporting Larrabee. He has a M.S. in Computer Science from the Oregon Graduate Institute and a B.A. in English from Purdue University.