Dynamic Volumetric Cloud Rendering for Games on Multi-Core Platforms

by Sheng Guo, Cage Lu, Xiaochang Wu
Software and Services Group, Intel Corporation


Clouds play an important role in creating images of outdoor scenery. Most games render clouds by mapping planar cloud textures onto the sky dome. This method is suitable when the viewpoint is close to the ground, but it is not visually convincing when the viewpoint approaches or passes through the clouds. For a realistic experience in flying games, players should see clouds that appear three-dimensional, with irregular shapes and realistic illumination. Implementing these features requires volumetric techniques to model, illuminate, and render clouds. However, because volumetric cloud techniques are inherently computationally intensive, applying them in games can be a challenge. Some cloud systems do support real-time rendering of large-scale volumetric clouds in games, but for performance reasons they generally must abandon the realistic dynamic features of clouds at run-time.

Multi-core platforms are now the mainstream of the PC market. However, because traditional game architectures were not designed for multi-core systems, most games running on multi-core processors cannot make full use of all cores. Using all cores would provide the performance headroom that games and flight simulators need to render more realistic volumetric clouds.

This article presents a technique for rendering dynamic volumetric clouds in games running on mainstream multi-core platforms. The technique is based on existing algorithms and uses a multithreading framework and Intel® Streaming SIMD Extensions (Intel® SSE) instructions to optimize performance. A demo called LuckyCloud was developed to implement and evaluate our solution. The LuckyCloud benchmark demonstrates that this technique scales well on multi-core platforms. Compared to previous static cloud systems, this solution enables real-time dynamic simulation and illumination with no additional performance impact during game play.


Over the past several decades, research in computer graphics has produced many volumetric techniques that can simulate, illuminate, and render clouds. Most of these techniques require a great deal of computing resources, which prevents them from running at interactive rates. Some recent techniques achieve real-time performance by implementing their algorithms in GPU shaders. However, implementing volumetric clouds in PC games remains quite challenging. Unlike applications dedicated to generating volumetric cloud images, games have a very limited performance budget for rendering clouds because they must process game logic and render other scene objects at the same time. To guarantee acceptable performance, some cloud systems used in games pre-process complex calculations offline, such as modeling and illumination. This leaves many dynamic effects to be desired, for example physics-based evolution, variable natural scattering illumination, and convincing fly-through effects.

Our solution is mainly inspired by Harris2001 [2] and Dobashi2000 [1]. Harris proposed a cloud system for flying games that uses a particle system to model and render volumetric clouds. The Harris system accelerated cloud rendering by using imposters for distant clouds, so it could generate large-scale clouds while still maintaining real-time performance. Harris used a simplified Rayleigh scattering model to implement multiple-forward-scattering illumination. This model achieves anisotropic light scattering in clouds, so that different cloud colors can be observed from different angles. To accelerate the shading, Harris calculated the incident light intensity for each cloud particle on the GPU. Unfortunately, this method requires reading back the pixel color from the frame buffer every time a cloud particle billboard is rendered. The pixel read-back operation is very expensive in the graphics pipeline because there can be hundreds of thousands of cloud particles in the system. The benefit of shading on the GPU in this way is easily lost to the overhead of frequent pixel read-backs, which instead become a severe performance bottleneck. To render volumetric clouds in real time, Harris' system must shade the cloud particles offline and merely render them at runtime. As a result, light intensity and direction are fixed in Harris' solution. The LuckyCloud solution adopts Harris' illumination model, but uses a different implementation to enable dynamic interaction between the clouds and the light at runtime.
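To make the angle dependence concrete, the standard Rayleigh phase function is p(θ) = 3/4 (1 + cos²θ), where θ is the angle between the light direction and the view direction. The snippet below is only an illustrative sketch: the function name is hypothetical, and whether the LuckyCloud source uses exactly this form is an assumption.

```cpp
#include <cassert>

// Standard Rayleigh phase function p(theta) = 3/4 * (1 + cos^2(theta)).
// cosTheta is the cosine of the angle between the light direction and the
// view direction, so it must lie in [-1, 1].
// RayleighPhase is a hypothetical name, not taken from the LuckyCloud source.
inline float RayleighPhase(float cosTheta)
{
    assert(cosTheta >= -1.0f && cosTheta <= 1.0f);
    return 0.75f * (1.0f + cosTheta * cosTheta);
}
```

Looking along the light direction (cosTheta = 1) gives twice the scattering of a perpendicular view (cosTheta = 0), which is the kind of variation that makes cloud colors change with the view angle.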

Another problem is that Harris' approach can only render static clouds. To simulate the dynamic evolution of clouds, LuckyCloud adopts Dobashi's cloud simulation method, which is based on a cellular automaton. In this method, the simulation space is represented by a 3D grid. Each cell of the grid is assigned three binary state variables: humidity (hum), clouds (cld), and activation factors (act). The state of each variable is either 0 or 1. Cloud evolution is simulated by applying a set of simple transition rules at each time step, as shown in Figure 1. The transition rules represent formation, extinction, and advection by winds. The density at each point is calculated by smoothing the binary distribution of the surrounding cells' cloud states. Compared to other simulation methods, the Dobashi method produces realistic cloud animation at a lower performance cost, which is the main reason it was chosen.

Figure 1: Dobashi's Simulation Process [1]
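The formation rules can be sketched in a few lines. In [1], a cell loses its humidity once it activates, becomes cloud once it has been activated, and activates when it is humid and a nearby cell is already active. The sketch below is a simplification under stated assumptions: it checks only the six face neighbours (the method in [1] uses a larger neighbourhood), it omits extinction and wind advection, and all type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical grid holding the three binary states (0 or 1) per cell.
struct CloudGrid {
    int nx, ny, nz;
    std::vector<std::uint8_t> hum, cld, act;

    CloudGrid(int x, int y, int z)
        : nx(x), ny(y), nz(z),
          hum(static_cast<std::size_t>(x) * y * z, 0),
          cld(hum.size(), 0), act(hum.size(), 0) {}

    bool InRange(int i, int j, int k) const {
        return i >= 0 && j >= 0 && k >= 0 && i < nx && j < ny && k < nz;
    }
    int Index(int i, int j, int k) const { return (k * ny + j) * nx + i; }
    // Cells outside the grid read as 0.
    int ActAt(int i, int j, int k) const { return InRange(i, j, k) ? act[Index(i, j, k)] : 0; }
    int CldAt(int i, int j, int k) const { return InRange(i, j, k) ? cld[Index(i, j, k)] : 0; }
};

// One formation step (simplified: six face neighbours, no extinction, no wind).
inline void StepGrowth(CloudGrid& g)
{
    CloudGrid next = g;
    for (int k = 0; k < g.nz; ++k)
        for (int j = 0; j < g.ny; ++j)
            for (int i = 0; i < g.nx; ++i) {
                const int c = g.Index(i, j, k);
                const bool neighbourActive =
                    g.ActAt(i - 1, j, k) || g.ActAt(i + 1, j, k) ||
                    g.ActAt(i, j - 1, k) || g.ActAt(i, j + 1, k) ||
                    g.ActAt(i, j, k - 1) || g.ActAt(i, j, k + 1);
                next.hum[c] = g.hum[c] && !g.act[c];                    // vapor is consumed
                next.cld[c] = g.cld[c] || g.act[c];                     // activated cells become cloud
                next.act[c] = !g.act[c] && g.hum[c] && neighbourActive; // activation spreads
            }
    g = next;
}

// Continuous density: average of the binary cloud state over a 3x3x3 box.
inline float Density(const CloudGrid& g, int i, int j, int k)
{
    int sum = 0;
    for (int dk = -1; dk <= 1; ++dk)
        for (int dj = -1; dj <= 1; ++dj)
            for (int di = -1; di <= 1; ++di)
                sum += g.CldAt(i + di, j + dj, k + dk);
    return sum / 27.0f;
}

// Test helper: a fully humid n*n*n grid with one active cell in the centre.
inline CloudGrid MakeSeededGrid(int n)
{
    assert(n > 0);
    CloudGrid g(n, n, n);
    for (std::uint8_t& h : g.hum) h = 1;
    g.act[g.Index(n / 2, n / 2, n / 2)] = 1;
    return g;
}
```

Calling StepGrowth repeatedly on MakeSeededGrid(3) turns the centre cell into cloud after the first step and its face neighbours after the second, and Density converts the binary result into a smooth value for shading.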

The Solution

In our solution, generating cloud images in real time involves three primary steps, in order: simulation, illumination, and rendering. Simulation and illumination are performed on the CPU, and rendering is primarily completed on the GPU. The simulation step uses Dobashi's method to model dynamic clouds and generates the density distribution of the cloud media. The illumination step calculates the scattering colors of cloud particles according to the light passing through the cloud density space. The illumination model is the same as Harris', but it is implemented on the CPU instead of the GPU. The rendering step is similar to Dobashi's and Harris' implementations: it synthesizes the final cloud image by drawing the shaded cloud particles with the traditional billboard splatting technique.

The CPU approach is proposed for the simulation and illumination based on the following considerations:

  1. CPU-based illumination avoids the performance bottleneck caused by the frequent read-back operations of the frame buffer in Harris' implementation.
  2. CPU-based implementation can reduce the GPU resources consumed by cloud rendering and lower the requirement for GPU functionality and performance so that a game can be compatible with a wider range of graphics cards.
  3. Multi-core has become the mainstream PC gaming platform, but most games do not take full advantage of all the cores of a multi-core processor. Those available computing resources can be used to handle and accelerate the cloud simulation and illumination, thus minimizing the performance impact on the game loop.

Cloud particles are shaded by the CPU-based illumination method, as illustrated in Figure 2 and described here:

  1. Cast a ray parallel with the sunlight from the sun to the cloud particle. Several sampling points are generated in the simulation space along the ray.
  2. Iteratively calculate the incident light intensity of every sampling point based on Harris' shading equations [2] until the cloud particle point is reached. During this process, the density of each sampling point is interpolated from the densities of the surrounding cells in the simulation space.
  3. Calculate the scattering light intensity of the cloud particle according to Harris' shading equations and use it as the color of the cloud particle.

Figure 2: The Illumination Method
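The ray march in steps 1 and 2 can be sketched as follows. This is a simplified stand-in rather than Harris' exact shading equations: each sample attenuates the light by exp(-tau) and adds back a forward-scattered fraction, in the spirit of the recurrence in [2]. The constants (extinction, albedo, forwardFraction) are placeholders, trilinear interpolation is an assumed choice for the density lookup, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical density field: one float per cell of the simulation grid.
struct DensityGrid {
    int nx, ny, nz;
    std::vector<float> d;

    DensityGrid(int x, int y, int z, float fill)
        : nx(x), ny(y), nz(z), d(static_cast<std::size_t>(x) * y * z, fill) {}

    // Cells outside the grid count as empty space.
    float At(int i, int j, int k) const {
        if (i < 0 || j < 0 || k < 0 || i >= nx || j >= ny || k >= nz) return 0.0f;
        return d[(k * ny + j) * nx + i];
    }
};

inline float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// Density at a continuous grid position, blended from the 8 surrounding cells.
inline float SampleDensity(const DensityGrid& g, float x, float y, float z)
{
    const int i = static_cast<int>(std::floor(x));
    const int j = static_cast<int>(std::floor(y));
    const int k = static_cast<int>(std::floor(z));
    const float fx = x - i, fy = y - j, fz = z - k;
    const float c00 = Lerp(g.At(i, j, k),         g.At(i + 1, j, k),         fx);
    const float c10 = Lerp(g.At(i, j + 1, k),     g.At(i + 1, j + 1, k),     fx);
    const float c01 = Lerp(g.At(i, j, k + 1),     g.At(i + 1, j, k + 1),     fx);
    const float c11 = Lerp(g.At(i, j + 1, k + 1), g.At(i + 1, j + 1, k + 1), fx);
    return Lerp(Lerp(c00, c10, fy), Lerp(c01, c11, fy), fz);
}

struct Vec3 { float x, y, z; };

// Marches from the sun side toward the particle and returns the light that
// arrives there. toSun is a unit vector from the particle toward the sun.
inline float IncidentLight(const DensityGrid& g, Vec3 particle, Vec3 toSun,
                           int numSamples, float step, float sunIntensity)
{
    assert(numSamples > 0 && step > 0.0f);
    const float extinction = 1.0f;       // placeholder constants, not taken from [2]
    const float albedo = 0.9f;
    const float forwardFraction = 0.5f;

    float intensity = sunIntensity;
    for (int s = numSamples; s >= 1; --s) {          // farthest sample first
        const float t = s * step;
        const float density = SampleDensity(g, particle.x + toSun.x * t,
                                               particle.y + toSun.y * t,
                                               particle.z + toSun.z * t);
        const float tau = extinction * density * step;   // optical depth of this segment
        intensity *= std::exp(-tau) + albedo * tau * forwardFraction;
    }
    return intensity;
}
```

In empty space the intensity reaches the particle unchanged, while a particle deep inside a dense cloud receives less light, which is what produces the darker cloud undersides. Step 3 would then turn this incident intensity into the particle color.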

With the relevant algorithms implemented for a single cloud, a multithreading framework was developed to render the whole cloud scene, and Intel® Streaming SIMD Extensions (Intel® SSE) instructions were used to optimize illumination performance for each cloud particle.


Multithreading Framework


The multithreading framework consists of two levels. The higher level performs task decomposition. To minimize the performance impact of cloud rendering on game play, the simulation and illumination steps of the cloudscape are separated from the main thread of the game loop and placed into a separate thread for execution. Because distributing rendering tasks across different threads is not recommended with graphics APIs such as Direct3D* and OpenGL*, the rendering step remains in the main thread and is performed together with the rest of the game scene.

Clouds and light usually change slowly in games, so it is unnecessary to update the cloud simulation and illumination in every frame. That is, the main thread does not need to wait for the cloudscape thread to produce the latest data; it can render the cloudscape from the old data more than once and simply pick up the new data at a specified synchronization point. In this way, the cloudscape thread amortizes its heavy load over multiple frames. There are several ways to synchronize the main thread and the cloudscape thread: every few frames, after a specified amount of time, or whenever the cloudscape thread has completed its task. The last form is also called free step mode; it prevents the main thread from stalling while it waits for the cloudscape thread to finish. As a result, the technique achieves the same performance as rendering static clouds. Our solution uses free step mode as the default synchronization method.
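The difference between free step and lock step synchronization can be sketched with standard C++ futures. This is a minimal illustration, not the LuckyCloud TaskManager: std::async stands in for the cloudscape thread, CloudData and UpdateClouds are hypothetical placeholders, and the "update" merely increments each value so the behavior is easy to observe.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <utility>
#include <vector>

using CloudData = std::vector<float>;   // hypothetical stand-in for shaded cloud particles

class CloudscapeUpdater {
public:
    explicit CloudscapeUpdater(CloudData initial) : m_current(std::move(initial)) {}

    // Call once per frame from the main thread; returns the data to render.
    // freeStep == true : never block, keep rendering old data until the job is done.
    // freeStep == false: lock step, wait for the running job before rendering.
    const CloudData& Sync(bool freeStep)
    {
        if (m_job.valid()) {
            const bool done = !freeStep ||
                m_job.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
            if (done) m_current = m_job.get();   // get() blocks only in lock step
        }
        if (!m_job.valid())                       // no job in flight: submit the next update
            m_job = std::async(std::launch::async, UpdateClouds, m_current);
        return m_current;
    }

private:
    // Placeholder for simulation + illumination; works on its own copy of the data.
    static CloudData UpdateClouds(CloudData data)
    {
        for (float& v : data) v += 1.0f;
        return data;
    }

    CloudData m_current;
    std::future<CloudData> m_job;
};
```

In lock step mode every call after the first returns freshly updated data, so the frame rate is bounded by the update cost. In free step mode the call returns immediately with whatever data is newest, which is why the frame rate matches that of static clouds.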

The lower level of our multithreading framework performs data decomposition using the fork-join model inside the cloudscape thread. Because a cloudscape usually contains many clouds, and the simulation and illumination of each cloud are independent of every other cloud, the update of one cloud instance can serve as the unit of decomposition and be performed by different sub-threads in parallel. Data decomposition further increases multi-core utilization and the frequency of cloudscape updates.
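The fork-join pattern described above can be sketched without any library dependency. The example below uses std::thread instead of Intel TBB purely to stay self-contained; Cloud, UpdateCloud, and UpdateAllClouds are hypothetical names, and the per-cloud work is reduced to a counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Cloud { int updates = 0; };   // hypothetical stand-in for one cloud instance

// Placeholder for the simulation and illumination of a single cloud.
inline void UpdateCloud(Cloud& c) { ++c.updates; }

// Fork: split the clouds into contiguous chunks, one worker thread per chunk.
// Join: wait for every worker before returning.
inline void UpdateAllClouds(std::vector<Cloud>& clouds, unsigned numThreads)
{
    assert(numThreads > 0);
    const std::size_t n = clouds.size();
    const std::size_t chunk = (n + numThreads - 1) / numThreads;   // ceiling division

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&clouds, begin, end] {
            // Each worker touches only its own range, so no locking is needed.
            for (std::size_t i = begin; i < end; ++i) UpdateCloud(clouds[i]);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

Because clouds do not share state, the workers need no synchronization beyond the final join. A task scheduler such as TBB additionally reuses threads and balances load, which this sketch does not attempt.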

Our multithreading framework is implemented with the Intel® Threading Building Blocks (Intel® TBB) task manager and the TBB parallel_for construct. TBB provides C++ templates for parallel programming and lets developers focus on tasks rather than thread details [6]. The pseudocode of the framework is as follows:

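bool bSubmitNewTask = false;
if ( bFreeStepMode ) {
    bSubmitNewTask = pTaskManager->isJobDone();   // poll only; never block
} else {
    // lock step: wait for the cloudscape thread here (call elided in the source)
    bSubmitNewTask = true;
}
if ( bSubmitNewTask ) {
    GetNewDataFromCloudScapeThread();
    // resubmit CloudScapeThreadFunction here (call elided in the source)
}
for ( int i = 0; i < uNumClouds; i++ ) {
    // render cloud i with the latest data (loop body elided in the source)
}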

Figure 3: Pseudocode in the Main Thread (Game Loop)

TaskManager manages the cloudscape thread and implements task parallelism (Figure 3). The simulation and illumination tasks in CloudScapeThreadFunction are submitted to the cloudscape thread at the appropriate time.

tbb::parallel_for ( tbb::blocked_range<int>( 0, uNumClouds, uNumClouds/uNumThreads),
*pForLoopToUpdateClouds );

Figure 4: parallel_for Construct in the Cloudscape Thread Function

The parallel_for construct in CloudScapeThreadFunction implements data parallelism (Figure 4). It splits the iterations of a for loop evenly among several sub-threads. The loop body object pForLoopToUpdateClouds contains the actual simulation and illumination calculations for every cloud (Figure 5).

for ( int i = range.begin(); i != range.end(); i++ ) { /* simulate, then illuminate, cloud i (body elided in the source) */ }

Figure 5: Pseudocode of the Actual Cloud Update Tasks in ForLoopToUpdateClouds
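A loop body for parallel_for is an object whose const operator() receives a sub-range of indices. The sketch below shows one possible shape for ForLoopToUpdateClouds; it is an assumption, not the code from the LuckyCloud source. IndexRange is a hypothetical stand-in that offers the same begin()/end() interface as tbb::blocked_range<int>, so the example compiles without TBB. SafeGrainSize is also hypothetical: it guards the grain size expression in Figure 4, because uNumClouds/uNumThreads evaluates to 0 when there are fewer clouds than threads, and blocked_range requires a positive grain size.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Cloud { int simulated = 0; int illuminated = 0; };   // hypothetical cloud state

// Stand-in with the same begin()/end() interface as tbb::blocked_range<int>.
struct IndexRange {
    int first, last;
    int begin() const { return first; }
    int end() const { return last; }
};

// Loop body in the shape parallel_for expects: a copyable object with a
// const operator() that processes one sub-range of cloud indices.
class ForLoopToUpdateClouds {
public:
    explicit ForLoopToUpdateClouds(std::vector<Cloud>& clouds) : m_clouds(&clouds) {}

    template <typename Range>
    void operator()(const Range& range) const
    {
        for (int i = range.begin(); i != range.end(); i++) {
            Cloud& c = (*m_clouds)[i];
            ++c.simulated;     // placeholder for the simulation step
            ++c.illuminated;   // placeholder for the illumination step
        }
    }

private:
    std::vector<Cloud>* m_clouds;   // non-owning; the cloudscape outlives the body
};

// Grain size for the range: an even split, but never less than 1.
inline int SafeGrainSize(int numClouds, int numThreads)
{
    assert(numThreads > 0);
    return std::max(1, numClouds / numThreads);
}
```

With TBB available, the same body could be passed to tbb::parallel_for together with tbb::blocked_range<int>(0, uNumClouds, SafeGrainSize(uNumClouds, uNumThreads)).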

Intel® Streaming SIMD Extensions (SSE) Optimization

The simulation and illumination of a volumetric cloud are performed in a simulation space represented by a 3D grid. Every sampling point or grid cell in the simulation space has the same attribute set (position, density, and so on) and follows the same processing rules, so these points can be processed in parallel. Multithreading optimizes performance at the granularity of the cloud. At the granularity of the sampling point, Intel SSE instructions are used to exploit the vector instructions and 128-bit registers in the processor cores.

Intel SSE is used to improve the performance of media, imaging, and 3D workloads, and it can significantly speed up vectorizable calculations. Intel SSE instructions can be written as compiler intrinsics for ease of use. All Intel SSE instructions in this article are written as intrinsics and compiled with Intel® C++ Compiler 10.1, which supports the latest Intel® processors.

As identified by the Intel® VTune™ Performance Analyzer [4], the hotspot functions in this program are illumination functions and the density functions called by them. These functions iterate over all sampling points to calculate the same attributes, such as density, transparency, light intensity, and so on. Intel SSE instructions _mm_load_ss and _mm_insert_ps were used to pack the same input attributes into a single 128-bit __m128 variable for every four sampling points. For example, each sampling point has an X coordinate attribute. Every four sampling points' X coordinates are packed together and are loaded into 128-bit __m128 variables as shown in Figure 6.

Figure 6: Pack Every Four Samples' X Coordinates into One Packed Variable

Then, Intel SSE intrinsics are used to perform the original scalar operations on these packed variables, four samples at a time. Finally, _mm_extract_epi32 and _mm_store_ps are used to write the results back to the sampling points.
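The pack, compute, and unpack pattern can be sketched as follows. This example is illustrative only and is not taken from the "_SSE" functions in the source code: it uses _mm_set_ps and _mm_storeu_ps (plain SSE) instead of _mm_load_ss, _mm_insert_ps, and _mm_extract_epi32 so that it compiles without SSE4.1 flags, and the computed quantity (squared distance to a point) as well as all names are hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>   // SSE intrinsics (__m128, _mm_*_ps)

struct SamplePoint { float x, y, z; };   // hypothetical sampling point layout

// Computes, for four consecutive samples, the squared distance to (cx, cy, cz).
// s must point to at least four samples; out receives the four results.
inline void SquaredDistance4_SSE(const SamplePoint* s,
                                 float cx, float cy, float cz, float out[4])
{
    assert(s != nullptr && out != nullptr);

    // Pack: gather the same attribute of four samples into one 128-bit register.
    // _mm_set_ps takes its arguments from the highest lane down to the lowest.
    const __m128 x = _mm_set_ps(s[3].x, s[2].x, s[1].x, s[0].x);
    const __m128 y = _mm_set_ps(s[3].y, s[2].y, s[1].y, s[0].y);
    const __m128 z = _mm_set_ps(s[3].z, s[2].z, s[1].z, s[0].z);

    // Compute: the scalar formula dx*dx + dy*dy + dz*dz, on four lanes at once.
    const __m128 dx = _mm_sub_ps(x, _mm_set1_ps(cx));
    const __m128 dy = _mm_sub_ps(y, _mm_set1_ps(cy));
    const __m128 dz = _mm_sub_ps(z, _mm_set1_ps(cz));
    const __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)),
                                 _mm_mul_ps(dz, dz));

    // Unpack: store the four lanes back to memory (unaligned store).
    _mm_storeu_ps(out, d2);
}
```

Gathering from an array of structures costs extra instructions on every call. Storing each attribute in its own contiguous array would allow direct 128-bit loads, a layout choice that often matters more than the arithmetic itself.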

Note: See the functions appended with the suffix "_SSE" in the source code for details. Refer to the Intel® IA-32 Developer's Manual and Intel SSE4 whitepapers for usage details and information about the Intel SSE instructions [5].


The LuckyCloud demo (Figure 7) was developed with DirectX* 9 to illustrate an implementation of this solution and validate its performance. The demo displays the cloudscape evolving in real time. You can use the arrow keys and the mouse to move the camera, pass through the clouds, and see how the clouds' scattering colors change with the view angle.

Figure 7: A Screenshot of the LuckyCloud Demo

As a performance benchmark, 16 large clouds were placed to cover the sky of the demo scene, giving about 60,000 cloud particles to illuminate and render. LuckyCloud was tested on an Intel® Core™ i7 processor-based platform with four CPU cores and SMT (Simultaneous Multi-Threading) support, which exposes eight logical cores to the OS.

Static cloud rendering performance was compared with the performance of rendering dynamic clouds in our solution. To simulate static clouds, the simulation and illumination steps were disabled using the Pause control in the demo UI. The frame rates of both scenarios were approximately 90 fps. This means that, with the same rendering method, rendering dynamic clouds with the technique described here achieves performance similar to rendering static clouds. The primary reason is that the additional computation for simulation and illumination does not compete for the CPU and GPU resources used by the original game loop. Instead, the technique primarily leverages the game's otherwise unused multi-core resources.

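Settings            1 Thread   2 Threads   4 Threads   8 Threads   2T:1T   4T:2T   8T:4T
Intel SSE off       9 fps      16 fps      32 fps      36 fps      1.78x   2.00x   1.13x
Intel SSE on        10 fps     18 fps      35 fps      40 fps      1.80x   1.94x   1.14x
Intel SSE Scaling   1.11x      1.13x       1.09x       1.11x

The 2T:1T, 4T:2T, and 8T:4T columns show thread scaling. The Intel SSE Scaling row shows the frame rate with Intel SSE on relative to Intel SSE off.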

Figure 8: Test Results for the Performance Scaling of Multithreading and Intel SSE Optimizations in Lock Step Mode

To test how cloudscape updates scale on multi-core processors, the demo was run in lock step mode. This mode updates the cloudscape every frame (by setting bFreeStepMode = false in Figure 3), so the frame rate measures the cloudscape update frequency. The demo was tested while varying the number of threads in the parallel_for construct (Figure 4) and enabling or disabling the Intel SSE code. The test results (Figure 8) indicate that the LuckyCloud technique scales well with the number of threads as long as the thread count does not exceed the number of CPU cores (four on the test platform). The Intel SSE optimization yielded a performance improvement of approximately 10%.
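The scaling figures in Figure 8 can be combined into an overall speedup and a parallel efficiency. The helpers below are a small worked example with hypothetical names; the only inputs are the frame rates reported in Figure 8.

```cpp
#include <cassert>

// Speedup of a new configuration over a baseline, both measured in fps.
inline double Speedup(double fpsBase, double fpsNew)
{
    assert(fpsBase > 0.0);
    return fpsNew / fpsBase;
}

// Parallel efficiency: speedup over the single-thread run divided by the
// number of threads (1.0 would be perfect linear scaling).
inline double ParallelEfficiency(double fpsOneThread, double fpsNThreads, int threads)
{
    assert(threads > 0);
    return Speedup(fpsOneThread, fpsNThreads) / threads;
}
```

With Intel SSE off, going from one thread to four gives Speedup(9, 32) ≈ 3.56x, an efficiency of about 0.89 on four cores. Going to eight threads gives Speedup(9, 36) = 4.0x, an efficiency of 0.5, which matches the observation that the extra SMT threads add only about 13% over four threads.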


Conclusion and Future Work


This article introduces a solution for real-time generation of dynamic volumetric clouds in games. The technique simulates and illuminates clouds on a multi-core CPU and is optimized with a multithreading framework and Intel SSE instructions. The LuckyCloud demo shows that this solution can perform computationally heavy cloud modeling and illumination in real time without an obvious performance impact on game play. Additionally, performance scales well with the number of processor cores. Two further lessons can be drawn from this solution:


  • Multithreading is one of the best ways to improve the performance of CPU-bound game features on multi-core processors.
  • Multi-core CPU resources can be used to enable advanced game effects. This is especially true for complicated computational tasks that do not need to execute in every frame. Such tasks can stay on the CPU and benefit from the multithreading optimization of our solution with few performance side effects on game play.


Future work includes further optimizing the rendering performance and implementing more realistic volumetric cloud algorithms. The laptop market is growing quickly, and in the future more people will play games on multi-core laptops. It therefore makes sense to improve cloud rendering performance on the integrated graphics found in laptops.

Currently, the billboard-splatting rendering technique used in the solution suffers from a high pixel fill rate. When there are thousands or millions of cloud particles, the fill rate is the primary performance bottleneck. Harris used the imposter technique to reduce the fill rate. Other methods include visibility culling and level of detail (LOD) to avoid rendering unnecessary cloud particles. Ray-casting volume rendering can also reduce the fill rate, but it increases the load on GPU shaders. As more CPU cores become available, the simulation and shading stages of this solution can be replaced by more sophisticated algorithms to render more realistic volumetric clouds, for example fluid-dynamics-based simulations [3] and illumination algorithms with higher-precision scattering models.

Source Download

Download LuckyCloud Source Code [5.43MB]


[1] Y. Dobashi, K. Kaneda, H. Yamashita, T. Okita, and T. Nishita. "A Simple, Efficient Method for Realistic Animation of Clouds". SIGGRAPH 2000, pp. 19-28.
[2] M. Harris and A. Lastra. "Real-Time Cloud Rendering". Computer Graphics Forum, vol. 20, pp. 76-84. Blackwell Publishers, 2001.
[3] Multi-Threaded Fluid Simulation for Games, https://software.intel.com/en-us/articles/multi-threaded-fluid-simulation-for-games
[4] Intel® VTune™ Performance Analyzer, https://software.intel.com/en-us/intel-vtune-amplifier-xe/
[5] Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Whitepapers, 45nm Next Generation Intel® Core™2 processor family (Penryn) and Intel® SSE4
[6] Intel® Threading Building Blocks, http://www.threadingbuildingblocks.org/



