Creating a Particle System with Streaming SIMD Extensions

Published: 09/13/2011   Last Updated: 09/13/2011

by William Damon


Smoke, fire, water spray, dust, and more all have something in common. They all benefit from a good particle system. For the uninitiated, a particle system is essentially the management of a collection of non-static points in 3D space. In most common examples, each point, or particle, goes through an entire life cycle from "birth" through "death". Adjusting the parameters that affect a particle's life cycle allows the creation of various effects. The key ingredient of a particle system that makes the generated effects look so realistic is chaos; a bit of randomness in the behavioral modification of the particles. This paper dives into creating a particle system that takes full advantage of Intel® Architecture.

Sometimes a well-architected, well-implemented particle system isn't enough when it comes to performance. Wouldn't it be nice if it were possible to design all the math and physics behind these brilliant effects in such a way that the processor could handle four times the workload? If you are developing for the PC platforms, and your target is Intel® Pentium® III or Pentium® 4 processors, look no further than the Streaming SIMD Extensions.

Streaming SIMD Extensions, or SSE, let a single instruction operate on up to four pieces of data simultaneously; hence the acronym SIMD: Single Instruction, Multiple Data. Arranging and organizing the data optimally allows for the maximum benefit from these instructions. Hundreds of applications exist for SSE. Here we focus on one that is near and dear to the game programmer's heart.
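To make the "four at a time" idea concrete, here is a minimal sketch that adds two float arrays with SSE intrinsics. The function name add_arrays_sse and its parameters are illustrative and do not come from the example project; unaligned loads and stores are used so the sketch works on any arrays.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Adds two float arrays four elements at a time.
// Assumption: count is a multiple of 4. Unaligned loads/stores are used,
// so the caller's arrays need no special alignment.
void add_arrays_sse(const float* a, const float* b, float* out, int count)
{
    assert(count % 4 == 0);
    for (int i = 0; i < count; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);             // a[i..i+3]
        __m128 vb = _mm_loadu_ps(b + i);             // b[i..i+3]
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // four sums at once
    }
}
```

A scalar version of the same loop would run count times; this one runs count / 4 times.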

Accompanying this article, you may download an example project that contains the full source of a particle system implemented with SSE. Note that you must either use the Intel® C++ Compiler (recommended), or have the Microsoft* Visual C++ 6.0 Processor Pack installed. You may download a fully functional trial version of the Intel C++ Compiler free of charge at:

Intel Compilers

The processor pack is available for download at no charge at:

Microsoft Visual C++ 6.0 Processor Pack*

A Word on Design

The composition of most particle systems is basically the same, though implementation details may differ a great deal. Most often, the particle is set up as a single class or struct, depending on the language. Conceptually the particle system is the class responsible for managing a linked-list or array of particles. With each frame, the particle system must update the parameters of each particle by applying the rules that govern a particular effect. The particle system class is also responsible for rendering the particles with each frame. Before we discuss the intricate details of the particle system class, though, we must first review necessary considerations for composing the particle class.

The Particle Class

Probably the biggest mistake made when designing a particle system is overlooking the importance of memory management. As I mentioned earlier, a common approach is to create a linked list or something of a similar nature to maintain the particles, which can be bad for performance and is a potential source of cache thrashing. Since I am starting with an empty canvas, and one of my design requirements is performance, I'll begin with a bottom-up approach by designing a particle class in structure-of-arrays (SoA) format that takes full advantage of SIMD instructions. If you already have a particle system that uses a more conventional array-of-structures (AoS) format, you may still take advantage of SSE by passing your particles through a conversion routine, performing the appropriate calculations, and converting the data back to AoS format. You will incur a performance penalty by rearranging your data on the fly, but the penalty may be slight compared with the gain from using SSE.

The following particle class contains the same information as any other particle class might incorporate; the only difference is the organization of the data. When using an AoS format, the particle system class might create an array of particles as follows:

    class Particle
    {
    public:
        float x, y, z;    // Current position
        float vx, vy, vz; // Current velocity
        float ax, ay, az; // Acceleration vector
        float energy;     // Energy
        bool alive;       // Life
    };

    mParticleArray = new Particle[numParticles];


However, with SoA format, the particle data is created and arranged in this manner:

    class Particle
    {
    public:
        Particle(int numParticles); // Allocates one array per element
        ~Particle();                // Releases the arrays

        float *x, *y, *z;    // Current position
        float *vx, *vy, *vz; // Current velocity
        float *ax, *ay, *az; // Acceleration vector
        float *energy;       // Energy
        bool *alive;         // Life
    };

    mParticleArray = new Particle(numParticles);


With SoA, memory is allocated within the particle class so that each element of a particle is stored in its own array. Using this structure is only slightly different from using an AoS format: in AoS format, an element would be accessed as mParticleArray[i].x, whereas in SoA format, the same element would be accessed as mParticleArray->x[i].
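The conversion routine mentioned earlier, for systems that already store particles in AoS format, might look like the following sketch. It is reduced to position data only, and the names ParticleAoS, ParticlesSoA, to_soa, and to_aos are hypothetical. std::vector keeps the sketch short, but it does not guarantee the 16-byte alignment that aligned SSE loads need.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical AoS particle, reduced to position only.
struct ParticleAoS { float x, y, z; };

// Hypothetical SoA container: one array per element.
struct ParticlesSoA
{
    std::vector<float> x, y, z;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Copies AoS data into SoA layout so it can be processed four at a time.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& in)
{
    ParticlesSoA out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out.x[i] = in[i].x;   // AoS access: in[i].x
        out.y[i] = in[i].y;   // SoA access: out.y[i]
        out.z[i] = in[i].z;
    }
    return out;
}

// Writes the processed SoA data back into an existing AoS array.
void to_aos(const ParticlesSoA& in, std::vector<ParticleAoS>& out)
{
    assert(out.size() == in.x.size());
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        out[i].x = in.x[i];
        out[i].y = in.y[i];
        out[i].z = in.z[i];
    }
}
```

Both directions are simple copies, which is why the conversion cost stays linear in the number of particles.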

In addition to proper arrangement, memory must also be aligned to a 16-byte boundary for maximum SIMD friendliness. In general, data structures should be aligned to natural boundaries whenever possible, because the processor requires two memory accesses to access unaligned data (e.g. a doubleword that crosses a 4-byte boundary). A natural boundary for a double quadword (128 bits) is any address that is evenly divisible by 16. We will be working on 4 single-precision floating-point values simultaneously, so a 16-byte boundary is a wonderful fit (4 single-precision floating-point values of 32 bits each: 4 * 32 = 128 bits = 1 double quadword). Aligning dynamic memory is easily accomplished with the intrinsic for allocating memory, _mm_malloc().

    Particle::Particle(int numParticles)
    {
        // Second argument is the alignment in bytes (16-byte boundary).
        x = (float*)_mm_malloc(numParticles * sizeof(float), 16);
        // ...
    }



As with any other dynamic memory allocation, be sure to release the memory once it is no longer needed. In the case of the particle class, this is when an instance of the class is destroyed. Freeing memory allocated with _mm_malloc() is accomplished through the intrinsic, _mm_free().




    Particle::~Particle()
    {
        _mm_free(x);
        // ...
    }



Now that we have a particle class defined, and memory management is under control, let's move back up a level to the management of the particles.
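As a recap of the memory handling above, the following is a minimal, self-contained sketch that pairs _mm_malloc() with _mm_free() in one small class. The class ParticleSoA (trimmed to position and energy) and the helper is_aligned16 are hypothetical and are not taken from the example project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // also provides _mm_malloc / _mm_free

// Hypothetical, trimmed-down SoA particle class: position and energy only.
class ParticleSoA
{
public:
    explicit ParticleSoA(int numParticles)
        : count(numParticles),
          x(alloc(numParticles)), y(alloc(numParticles)),
          z(alloc(numParticles)), energy(alloc(numParticles))
    {}

    ~ParticleSoA()
    {
        // Memory from _mm_malloc must be released with _mm_free.
        _mm_free(x);
        _mm_free(y);
        _mm_free(z);
        _mm_free(energy);
    }

    // Copying is disabled so each array is freed exactly once.
    ParticleSoA(const ParticleSoA&) = delete;
    ParticleSoA& operator=(const ParticleSoA&) = delete;

    int count;
    float *x, *y, *z;
    float *energy;

private:
    // Allocates n floats on a 16-byte boundary.
    static float* alloc(int n)
    {
        void* p = _mm_malloc(n * sizeof(float), 16);
        assert(p != nullptr);
        return static_cast<float*>(p);
    }
};

// True when p sits on a 16-byte boundary, as aligned SSE loads require.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Tying the release to the destructor means the arrays cannot outlive the particle object that owns them.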

The Particle System Class

We would like to design a bit of flexibility into the particle system class because it can be effective and useful for several visual effects such as smoke, sparks, water, etc. The easiest approach is to adopt an object-oriented design with a base particle system class and specialized particle systems derived from that class. For example, smoke and sparks are very different in nature, but they both fit well into a particle system. The main difference between the two is the way in which they are affected by the environment in which they exist, and perhaps the manner in which they are rendered. Sparks consume oxygen, fall towards the ground, and burn out fast. In a more sci-fi environment, sparks may explode in space and remain unaffected by gravity and/or atmosphere. Smoke, on the other hand, may rise, fall, or remain static depending on temperature and composition. More advanced particle systems might include features such as collision, and support to include the particle system as part of an object hierarchy. For the sake of simplicity, the particle system class implemented here does not have such features. However, adding advanced features is easy given a good general design.
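A minimal sketch of such a hierarchy follows. The class names, the pure virtual Update(), and the physics constants are assumptions made for illustration; the example project's actual classes may differ. Only height and vertical velocity are modeled to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical base class: owns the particle data and defines the interface.
class ParticleSystem
{
public:
    explicit ParticleSystem(int n) : y(n, 0.0f), vy(n, 0.0f) {}
    virtual ~ParticleSystem() {}

    // Each effect supplies its own rules for advancing its particles.
    virtual void Update(float dt) = 0;

    std::vector<float> y, vy;  // height and vertical velocity
};

// Sparks accelerate towards the ground.
class SparkSystem : public ParticleSystem
{
public:
    explicit SparkSystem(int n) : ParticleSystem(n) {}
    void Update(float dt) override
    {
        assert(dt >= 0.0f);
        for (std::size_t i = 0; i < y.size(); ++i)
        {
            vy[i] -= 9.8f * dt;   // gravity (illustrative constant)
            y[i] += vy[i] * dt;
        }
    }
};

// Smoke drifts upwards at a constant speed.
class SmokeSystem : public ParticleSystem
{
public:
    explicit SmokeSystem(int n) : ParticleSystem(n) {}
    void Update(float dt) override
    {
        assert(dt >= 0.0f);
        for (std::size_t i = 0; i < y.size(); ++i)
            y[i] += 0.5f * dt;    // rise speed (illustrative constant)
    }
};
```

A manager can then hold pointers to the base class and call Update() on every system each frame without knowing which effect it is driving.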


Rendering a particle system in Direct3D* whose particles are arranged in SoA format is similar to rendering one whose particles are arranged in AoS format. For each frame, lock and fill a dynamic vertex buffer, unlock, and draw. The only variation is the way we extract the data from the particles with which we fill the vertex buffer. Here is where we see the first big advantage of using SoA and SSE. Without SSE, we would have to iterate through a loop n times (where n is the number of particles) and set each vertex of the vertex buffer one at a time. SSE, however, allows us to run through a similar loop n / 4 times, handling four vertices of the vertex buffer per iteration. That means only a quarter as many loop iterations! The same is true for updating the properties of particles in the particle system update method, which we will review in more detail in the next section.

API performance is an important consideration when discussing dynamic vertex buffers. Depending on size and usage, dynamic vertex and index buffer performance may vary. The DirectX* documentation describes three ways to approach the problem and how to make effective use of locking flags. You may find that discussion in the DirectX* 8.1 SDK documentation for C++ under:

DirectX Graphics/Programmers Guide/Direct3D Appendix/Programming Tips/Performance Optimizations

The example project included with this article locks and fills an entire single vertex buffer with all the particles each frame using the D3DLOCK_DISCARD flag, and then makes only one call to DrawPrimitive() or DrawIndexedPrimitive(). DrawPrimitive() is used when we let the API handle the details of expanding single points to quads and texturing those quads for rendering by setting the render state D3DRS_POINTSPRITEENABLE to TRUE. DrawIndexedPrimitive() is used when the application directly addresses these tasks with D3DRS_POINTSPRITEENABLE set to FALSE.

Point sprites are really just quads, so it is possible to render particles as anything. Just extract the position of each particle, generate the geometry to render accordingly, and stuff the geometry into an appropriate vertex buffer for rendering. The example project allows the API to do a lot of the work when using point sprites, so the custom vertex type needs only a single element: a position vector. Extracting the data from the particles is done as follows:

  • Load the three components of the position vector (x, y, and z) of four particles into XMM registers.
  • Unpack and shuffle the data into xyz format.
  • Check the life flag for each of the four particles. If the particle is alive, then stuff the extracted data into the vertex buffer. Otherwise, ignore it.


The code for the above algorithm is rather straightforward, and is shown in Listing 4 (below).

Rendering a point list as point sprites may be expensive, depending on a system's hardware and/or driver support. It is good to have a fallback routine that is fast, and fortunately SSE can help here. The idea is to do the same work the API would do, only faster. The algorithm works as follows:

  1. Transform a vertex into eye-space (i.e. camera-space).
  2. Calculate the screen-space point size.
  3. Convert the transformed vertex from eye-space to screen-space (i.e. map the transformed vertex to the viewport).
  4. Calculate and store in the vertex buffer the position and texture coordinates of the 4 corners of a quadrilateral.
  5. After completing steps 1-4 for all particles, generate the indices that describe the triangles that make up the quadrilaterals, and store those indices in an index buffer.


SSE allows the above algorithm to perform the same operations on four particles at a time, cutting the number of loop iterations by three quarters. A more detailed discussion on point sprites in DirectX* (including the formulae for steps 2-4) can be found in the DirectX* 8.1 SDK documentation for C++ under:

DirectX Graphics/Programmers Guide/Advanced Topics/Object Geometry/Point Sprites

Furthermore, the example project included with this article, which you may download with full source, includes examples of both the x87 and SSE implementations of these algorithms.
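The last step of the fallback algorithm, generating the indices, can be sketched as follows. The function name make_quad_indices, the corner ordering (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right), and the choice of 16-bit indices are assumptions for illustration; the example project may order its corners differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds a triangle-list index buffer for numQuads quads, where quad q
// occupies vertices 4q .. 4q+3 in the vertex buffer.
// Assumed corner order: 0 = top-left, 1 = top-right,
//                       2 = bottom-left, 3 = bottom-right.
std::vector<std::uint16_t> make_quad_indices(int numQuads)
{
    // 16-bit indices can address at most 65536 vertices.
    assert(numQuads >= 0 && numQuads * 4 <= 65536);

    // Two triangles per quad: (0, 1, 2) and (2, 1, 3).
    static const int corner[6] = {0, 1, 2, 2, 1, 3};

    std::vector<std::uint16_t> indices;
    indices.reserve(static_cast<std::size_t>(numQuads) * 6);
    for (int q = 0; q < numQuads; ++q)
        for (int k = 0; k < 6; ++k)
            indices.push_back(static_cast<std::uint16_t>(q * 4 + corner[k]));
    return indices;
}
```

Because the pattern depends only on the number of quads, the indices can be generated once and reused for every frame in which the particle count does not grow.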


Updating particles follows a similar pattern to rendering. Again, organizing the data in SoA format enables the update method to do just a quarter of the work of a conventional update method. Also note that the update method of the base particle system class is void. The derived particle systems (sparks, smoke, explosions, etc.) must each implement their own update method, overriding the base class method, because each system is so specialized. The attached example project includes a simple spark system with a relatively simple Update().

Each particle's life begins with a "random" energy value, and that energy wanes with time until the particle eventually fades out. When a particle dies, it can either be reborn as a new particle, or be gone forever. In the example project, the spark system stays alive indefinitely; however, choosing otherwise is a simple flip of a Boolean value. If you choose to implement a particle system with a short life span, it may be useful to keep track of how many particles in the system are alive. That way it becomes very easy to send a message to a particle system manager to release resources when all the particles in a system terminate.

"SIMDizing" the update method is much like the work we do to extract the particle data for rendering. In this case, however, we do not have to rearrange the data to fit an API. The data may remain in SoA format, allowing us to blast through updating four particles at a time. Here is how it works (each step uses intrinsics):

  • Load the three components of the position vector (x, y, and z) of four particles into some temporary variables of type __m128.
  • Load the three components of the velocity vector and acceleration vector similarly.
  • Update the velocity, position, color, life flag, energy, etc. of each particle.
  • Store back the updated information into the appropriate elements.
  • If a particle has died and the particle system life span is indefinite, respawn dead particles.


You can see the code for the above steps in the spark system implementation file in the downloadable example project.
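For readers without the project at hand, the load, update, and store steps can be sketched as below. This is not the spark system's code: the ParticleData struct, the function names, the simple Euler-style integration, and the linear energy drain are all assumptions. Every array is assumed to be 16-byte aligned and the particle count a multiple of 4. Respawning and the life flag array are left out; instead the function returns how many particles still have energy.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical view of SoA particle data; all pointers 16-byte aligned.
struct ParticleData
{
    float *x, *y, *z;     // position
    float *vx, *vy, *vz;  // velocity
    float *ax, *ay, *az;  // acceleration
    float *energy;
};

// Advances one axis of four particles: v += a * dt, then p += v * dt.
static inline void integrate_axis(float* pos, float* vel,
                                  const float* acc, __m128 dt)
{
    __m128 v = _mm_add_ps(_mm_load_ps(vel), _mm_mul_ps(_mm_load_ps(acc), dt));
    __m128 p = _mm_add_ps(_mm_load_ps(pos), _mm_mul_ps(v, dt));
    _mm_store_ps(vel, v);
    _mm_store_ps(pos, p);
}

// Updates all particles four at a time and returns the number that
// still have energy left afterwards.
int update_sse(ParticleData& p, int count, float dt, float energyDecay)
{
    assert(count % 4 == 0);
    const __m128 vdt   = _mm_set1_ps(dt);
    const __m128 drain = _mm_set1_ps(energyDecay * dt);
    const __m128 zero  = _mm_setzero_ps();
    int alive = 0;

    for (int i = 0; i < count; i += 4)
    {
        integrate_axis(p.x + i, p.vx + i, p.ax + i, vdt);
        integrate_axis(p.y + i, p.vy + i, p.ay + i, vdt);
        integrate_axis(p.z + i, p.vz + i, p.az + i, vdt);

        // energy = max(energy - drain, 0)
        __m128 e = _mm_sub_ps(_mm_load_ps(p.energy + i), drain);
        e = _mm_max_ps(e, zero);
        _mm_store_ps(p.energy + i, e);

        // One bit per lane whose energy is still positive.
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(e, zero));
        alive += (mask & 1) + ((mask >> 1) & 1)
               + ((mask >> 2) & 1) + ((mask >> 3) & 1);
    }
    return alive;
}
```

The returned count is one way to implement the bookkeeping suggested earlier: when it reaches zero, a manager can release the system's resources.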

Wrapping Up

The first and arguably most important part of creating a particle system is a good design. Be clear on the requirements of your particle system, but keep in mind flexibility, as well as the things that are important from a processor architecture perspective. Include performance in the design and testing of the particle system. Data organization and management is key, and SSE can help provide big wins in this space. Typically, no major algorithmic modifications are necessary to use the improvements SSE provides. Furthermore, the compiler intrinsics allow the programmer to easily write SSE code without going to the assembly level. Adding SSE support to your particle system may improve performance, allowing you to render more particles at the same frame rate, creating some awesome visuals! So download the example project presented here, add some features or change the characteristics of the particle system, and add some excitement to your 3D engine today!


References

  • Lander, Jeff, The Ocean Spray in Your Face, Graphic Content, July 1998
  • van der Burg, John, Building an Advanced Particle System, Game Developer, March 2000
  • Sarmiento, Sara, Getting Started with SSE and SSE2 for the Pentium® 4 Processor, Intel® Developer Services, January 2002
  • IA-32 Intel Architecture Software Developer's Manual Volume 1: Basic Architecture, Chapter 10
  • Microsoft* DirectX* 8.1 SDK Documentation for C++


Listing 4

    __declspec(align(16)) float v[4];
    // ...

    // Lock the vertex buffer. The listing is truncated in the source here;
    // the destination pointer and the D3DLOCK_DISCARD flag are assumed
    // from the surrounding text.
    if (FAILED(hr = m_pVB->Lock(0,
                                mNumAlive * sizeof(struct PointVertex),
                                (BYTE**)&pVertices,
                                D3DLOCK_DISCARD)))
    {
        return hr;
    }

    int remaining_particles = m_NumParticles % 4;

    // Fill vertex buffer with current data
    for (int i = 0; i < m_NumParticles - remaining_particles; i += 4)
    {
        // Load x, y, and z of four particles (aligned loads).
        xmm0 = _mm_load_ps(m_pParticles->x + i);
        xmm1 = _mm_load_ps(m_pParticles->y + i);
        xmm2 = _mm_load_ps(m_pParticles->z + i);
        xmm3 = _mm_set1_ps(1.0f);
        //       r3     r2     r1     r0
        // ------------------------------------
        // xmm0: x[i+3] x[i+2] x[i+1] x[ i ]
        // xmm1: y[i+3] y[i+2] y[i+1] y[ i ]
        // xmm2: z[i+3] z[i+2] z[i+1] z[ i ]
        // xmm3: 1.0    1.0    1.0    1.0
        // ------------------------------------
        xmm4 = _mm_unpacklo_ps(xmm0, xmm1);
        xmm6 = _mm_unpackhi_ps(xmm0, xmm1);
        xmm5 = _mm_unpacklo_ps(xmm2, xmm3);
        xmm7 = _mm_unpackhi_ps(xmm2, xmm3);
        // ------------------------------------
        // xmm4: y[i+1] x[i+1] y[ i ] x[ i ]
        // xmm6: y[i+3] x[i+3] y[i+2] x[i+2]
        // xmm5: 1.0    z[i+1] 1.0    z[ i ]
        // xmm7: 1.0    z[i+3] 1.0    z[i+2]
        // ------------------------------------
        xmm0 = _mm_shuffle_ps(xmm4, xmm5, _MM_SHUFFLE(1, 0, 1, 0));
        xmm1 = _mm_shuffle_ps(xmm4, xmm5, _MM_SHUFFLE(3, 2, 3, 2));
        xmm2 = _mm_shuffle_ps(xmm6, xmm7, _MM_SHUFFLE(1, 0, 1, 0));
        xmm3 = _mm_shuffle_ps(xmm6, xmm7, _MM_SHUFFLE(3, 2, 3, 2));
        // ------------------------------------
        // xmm0: 1.0 z[ i ] y[ i ] x[ i ]
        // xmm1: 1.0 z[i+1] y[i+1] x[i+1]
        // xmm2: 1.0 z[i+2] y[i+2] x[i+2]
        // xmm3: 1.0 z[i+3] y[i+3] x[i+3]
        // ------------------------------------
        if (m_pParticles->alive[i])
        {
            _mm_store_ps(v, xmm0);
            pVertices->v.x = v[0];
            pVertices->v.y = v[1];
            pVertices->v.z = v[2];
            pVertices++; // advance to the next vertex (assumed)
        }
        // ... (likewise for xmm1..xmm3 and alive[i+1]..alive[i+3])
    }

    // Complete filling the vertex buffer with remaining particles
    // ...



Download Example Project


About the Author

Will Damon was a Technical Marketing Engineer within Intel's Software Solutions Group. He has a bachelor's degree in Computer Science from Virginia Polytechnic Institute and State University*, where he graduated with honors. He has been with Intel for over a year, helping game developers enable their titles to achieve the highest performance possible on Intel® Pentium® 4 processor-based PCs. He welcomes email regarding optimization, mathematics, physics, artificial intelligence, or anything else related to real-time 3D graphics, and gaming.

