Multi-Threaded Fluid Simulation for Games


Along with mentioning a strange rash, a sure way to kill a conversation is to bring up the topic of Computational Fluid Dynamics (CFD). But for those who wish to create fantastic-looking clouds, explosions, smoke, and other game effects, nothing could be more exciting! CFD is the section of fluid mechanics that finds numerical solutions to the equations describing the behavior of fluids, both liquid and gaseous, and their interaction with various forces. Until recent hardware advances, even approximating a solution was too computationally intense for real-time applications. However, the last year or two has shown a small but increasing amount of research and articles, as well as the release of some commercial products that deal with simulating fluids, not only in real time but in a video game environment. With the Kaboom project, we've created a modular, threaded CFD simulation that can be dropped into an existing project, hooked up to a rendering engine, and used to generate realistic, real-time fluid dynamics simulations, suitable for smoke or water.

We observed that many games are not reaching their full potential because they're unable to use all of the available CPU resources. Frequently, this is because they are not properly threaded for multi-core architectures. Rather than run the simulation on the GPU, we decided to use those extra cycles to produce a basic, real-time fluid simulation. By performing the simulation on the CPU as opposed to the GPU, the simulation has fast and easy access to all game resources (such as mesh information) and the game has access to the simulation results. This also leaves the GPU free to handle the graphics.

As we continued our research we found that although interest in fluid simulations was high, little concrete material was available, especially in the way of code and examples. We were unable to find examples of 3D or multi-threaded solvers. Most of the articles described ways to solve the equations but did not use a multi-threaded approach. Despite this, there seemed to be a lot of interest in the topic, so we decided that another goal of the project would be to make the resulting code modular and thereby fairly simple for someone to integrate into their own project. Developers can also use and expand upon the set of threaded, modular routines that we created for 3D CFD.


The code is based on an article and sample code written by Mick West that was recently posted on Gamasutra.[1],[2] Since one of our goals was to produce a modular, reusable code base, the first thing we did was to convert the existing sample to C++ and separate the simulation from the rest of the application, specifically from the rendering. This separation allows the simulation code to be easily added to an existing code base and rendering engine.

Next, we extended the code into three dimensions, which turned out to be a fairly straightforward exercise because none of the algorithms changed appreciably with the addition of another dimension. For example, the 2D diffusion step is solved by applying the diffusion operation to move values from a cell to its neighbors (Figures 1 and 2). This approach is extended in the 3D case so that instead of inspecting four neighboring cells we look at six neighbors (Figures 1 and 3). Other cases used the same principle as well.

Figure 1. Transitioning code from 2D to 3D

Figure 1. Transitioning code from 2D to 3D.

The following two code samples (Figures 2 and 3) show the code that performs the diffusion operation for non-border case cells in both the 2D and 3D cases. The border cells are those cells at the very edge of the grid that are missing one or more neighbors. These cells are handled separately as special cases. The transition to 3D is accomplished by adding in consideration for the z-axis. Apart from that, the algorithm remains essentially the same.

for (int x = 1; x < m_w - 1; x++)


  for (int y = 1; y < m_h - 1; y++)


    p_out->element(x, y) = p_in->element(x, y) + force *

      ( p_in->element(x,     y + 1) +

        p_in->element(x,     y - 1) +

        p_in->element(x + 1, y)     +

        p_in->element(x - 1, y)     -

        (4.0f * p_in->element(x, y)));



Figure 2. A code sample that shows 2D diffusion.

for (int x = 1; x < m_w - 1; x++)


  for (int y = 1; y < m_h - 1; y++)


    for (int z = 1; z < m_d - 1; z++)


      p_out->element(x, y, z) = p_in->element(x, y, z) + force *

        ( p_in->element(x,     y,     z + 1) +

          p_in->element(x,     y,     z - 1) +

          p_in->element(x,     y + 1, z)     +

          p_in->element(x,     y - 1, z)     +

          p_in->element(x + 1, y,     z)     +

          p_in->element(x - 1, y,     z)     -

          (6.0f * p_in->element(x,y,z)));




Figure 3. A code sample that shows 3D diffusion.

After transitioning the simulation into 3D, we began investigating how to break the work up for parallel processing. As a base for our multi-core work we used the Intel® Threading Building Blocks (TBB) package. We found TBB to be very competitive in performance with other threading APIs. Additionally, it has a lot of built-in functionality and structures, such as a thread pool, task manager, and multi-threaded memory management.

The idea behind threading the simulation is data decomposition: take the grid and break it into smaller portions. These smaller portions are submitted as tasks to the task queue and processed by the worker threads, independently of each other. The grid is broken up evenly until a small-enough granularity is reached, at which point further reductions become ineffective, or worse, inefficient due to thread management overhead. Figure 4 shows an example of how a 2D grid is broken into four tasks.

Figure 4. Breaking up a 2D grid into tasks and an individual cell adjacent to other tasks

Figure 4. Breaking up a 2D grid into tasks and an individual cell adjacent to other tasks

For example, in the simulation there is a diffusion step followed by a force update step and later, an advection step. At each step the grid is broken into pieces, and the tasks write their data to a shared output grid, which becomes the input into the next step. In most cases, this approach works well. However, a problem can arise if the calculation uses its own results in the current step. That is, the output of cell 1 is used to calculate cell 2. Looking at Figure 4, it is obvious that certain cells, like the shaded cell, will be processed in a task separate from the task in which their immediate neighbors are processed. Since each task uses the same input, the neighboring cells' data will be different in the multi-threaded version compared to the serial version. Whether this is an issue depends on which portion of the simulation we are in. Inaccuracies in the results in the diffusion step can occur, but they are not noticeable and don't introduce destabilizing effects. In most cases the inaccuracies are smoothed out by subsequent iterations over the data and have no lasting impact on the simulation. This kind of behavior is acceptable because the primary concern is visual appeal, not numerical accuracy. However, in other cases inaccuracies can accumulate, introducing a destabilizing effect on the simulation, which can quickly destroy any semblance of normal behavior. An example of this occurs while processing the edge cases in diffusion-where diffusion is calculated along the edges of the grid or cube. When run in parallel, the density does not diffuse out into the rest of the volume and instead accumulates at the edges, forming a solid border. For this situation and similar ones, we simply run that portion serially.

The main computational load of the simulation occurs when each step loops over the grid. These are all nested loops, so a straightforward way to break them up is by using the "parallel for" construction of TBB. In the example case of density diffusion, the code went from the serial version shown in Figure 3 to the parallel version shown in Figures 5 and 6.

parallel_for( tbb::blocked_range<unsigned int>( uBegin, uEnd,

              uGrainSize ), *diffusionContext );

Figure 5. Using Intel® Threading Building Blocks to call the diffusion algorithm.

Figure 5 shows how TBB is used to call the diffusion algorithm. The arguments uBegin and uEnd are the start and end values of the total range to process, and uGrainSize is how small to break the range into. For example, if uBegin is 0 and uEnd is 99, the total range is 100 units. If uGrainSize is 10, 10 tasks will be submitted, each of which will process a range of 10.

Figure 6 shows the actual algorithm called by the TBB task manager. The range to process is passed in, and the input and output grids are variables common to each job. As a result of using multi-threading, the performance in frames per second was improved by 3.76´ when going from one thread to eight threads on an Intel® Core i7 processor.

for ( unsigned int gridPos = range.begin();

      gridPos !=range.end(); gridPos++ )


  x = gridPos % m_w;

  y = ( gridPos / m_w ) % m_h;

  z = gridPos / ( m_w * m_h );


  // Update all the remaining cells that are not touching a

  // boundary.  6 neighbors total.

  if((x >= 1) && (x < m_w - 1) &&

     (y >= 1) && (y < m_h - 1) &&

     (z >= 1) && (z < m_d - 1))


    m_fluidOut->element(x, y, z) =

      m_fluidIn->element(x, y, z) + m_force *

     (m_fluidIn->element(x,     y,     z + 1) +

      m_fluidIn->element(x,     y,     z - 1) +

      m_fluidIn->element(x,     y + 1, z) +

      m_fluidIn->element(x,     y - 1, z) +

      m_fluidIn->element(x + 1, y,     z) +

      m_fluidIn->element(x - 1, y,     z) - 6.0f *




Figure 6.  A code sample that shows 3D, multi-threaded diffusion.


Although the goal of the project was to take unused CPU cycles and use them to compute a CFD simulation, the results won't have much impact without a nice, visual representation. There are many methods of rendering volumetric data, but we like the results we got using the ray casting method detailed in Real-Time Volume Graphics.[3] Using this method, the volumetric data is copied into a 3D texture, which is used in a pixel shader on the GPU. The simulation volume is rendered as a cube at the position of the volume in the 3D scene. For each visible pixel, a ray is cast from the camera to this pixel on the cube. The corresponding point in the 3D texture is found and sampled and the ray marches through the texture at a fixed step size, accumulating the density data as it goes.



With Kaboom, we created a modular fluid simulation that adds to the existing knowledge base. It operates in 3D, is independent of the rendering portion, and is multi-threaded to take advantage of multi-core processors. Of course, there is still plenty of future work that someone can do with this code sample. On the simulation side, a new or modified algorithm could be used to remove the serial requirements. Also, we found that a section of the code that does bilinear interpolation calculations takes up a significant amount of processing time, which could be reduced by storing the results in a lookup table. Another interesting avenue to explore is the introduction of arbitrary boundaries. Since the simulation is running on the CPU it has access to and can react with the geometry data of the scene. The rendering could be enhanced with additional effects, such as shadowing or dynamic lighting.

Source and Binary Download

About the Author

Quentin Froemke is a software engineer with Intel's Visual Computing Software Division where he enables the game industry to more effectively utilize multi-core and other Intel technologies.


For more information about Intel® Threading Building Blocks, visit


[1] West, Mick. "Practical Fluid Dynamics: Part 1" Gamasutra, 26 June 2008.

[2] West, Mick. "Practical Fluid Dynamics: Part 2" Gamasutra, 23 July 2008.

[3] Engel, Klaus, ed. Real-Time Volume Graphics, Wellesley, Massachusetts: A. K. Peters, Ltd., 2006.

For more complete information about compiler optimizations, see our Optimization Notice.


roguerobot's picture

to be sure though, I can't find where I originally read it, am I correct that Kaboom itself, stripped of TBB is free for commercial use? and where can I find the licensing info if it exists?

Aubrey W. (Intel)'s picture

Hello roguerobot, I saw you posted your question in the Intel® Threading Building Blocks forum as well. That's the best place to get an answer. I've brought it to the community managers' attention. Thanks for your question.

roguerobot's picture

so, I developed a plugin with the kaboom source for an open source game creation tool that can be used to make commercial games. Now I read that TBB is not free?

does this mean, I can't make a plugin with Kaboom that is used to develop a game for commercial use? and if not, can I remove the TBB parts and then it's fine? or has the free status of kaboom source for commercial use changed?

Igor Lopatin (Intel)'s picture

Hi, I've just tried it with Visual Studio 2010 and TBB 3.0, and it seems to be working in this environment as well, though some modification of project properties were required due to changed directory structure of new TBB release.

Jeffrey F. (Intel)'s picture


We chose not to distribute TBB as it is provided for free on our OpenSource site: or as a commercially licensed application.


anonymous's picture

Just one small comment: Could have distributed the tbb dlls ;)

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.