Download Article or Visit AVX Cloth Page
This article describes a code sample that uses the Intel® Advanced Vector Extensions (Intel® AVX) for computing mesh-based cloth simulation. A structure of arrays (SOA) implementation is used to maximize data parallelism enabling the usage of 256-bit (8 float) SIMD processing. Separate cloth objects are interleaved in memory. The processing data size is kept minimal by using a distance constraint-based solver. At the end of each frame, simulation state is efficiently transposed (8x8) into an array of structures (AOS) vertex buffer for rendering. Comparing optimized serial, 128-bit, and 256-bit implementations demonstrates that performance increases with SIMD width.
Single instruction multiple data (SIMD) processing works best with data-parallel applications where the data is arranged in a structure of array (SOA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Simple particle systems are a well-known simulation example where SOA techniques are easily applied and produce good scaling results. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid. Cloth simulation falls into this second category. Like most physics processing, simulating a piece of cloth is far from being embarrassingly data -parallel since all the vertices influence each other many times per frame. Common array of structure (AOS) approaches, such as intrinsic optimized 4D vector class libraries, are not able to take full advantage of the full width of 256-bit (8 float) SIMD capabilities of Intel AVX. Wider SIMD requires new thinking. This article describes an SOA approach for simulating cloth that demonstrates improved performance with increased SIMD width.
Data parallelism is maximized by using an SOA approach to cloth simulation. Because Intel AVX can operate on up to 8 floats in parallel, the approach is to interleave independent cloth sections together in groups of 8 and simulate 8 at a time. The number of vertices and topology for each member in a group of 8 must be the same. Admittedly, this is a usage limitation. However, for simulation performance to even matter in the big picture, an application is going to require many cloth objects. Having many instances of the same geometry is quite common. Furthermore, most other parameters, such as rest lengths, stiffness, friction, wind forces, and collision surfaces, can be unique per cloth.
The technique has other limitations. The processing is less adaptive. Even if only one cloth within a group requires processing, then all 8 will be updated - even if some could be considered at rest. Furthermore, should one cloth need to do a collision pass, then all will have to do the processing. However, it is possible for each cloth to interact with a different collider.
Cloth simulation techniques generally fall under one of two categories: spring-based and constraint-based.
Point-mass spring-based solutions with forward Euler integration are limited to weak springs that result in mesh stretching and oscillation and therefore are not suitable for cloth. Implicit integration [Baraff and Witkin SIGGRAPH98] with a modified conjugate gradient solver allows the usage of much stiffer springs. However, with operand gather indirection, such applications are limited by data throughput rather than floating point operations (FLOPS). Furthermore, keeping all the Jacobians in a sparse matrix creates a higher data footprint. Keep in mind that with an SOA strategy, that means there will be 8 times as much data iterated over many times per frame. There is a strong motivation to keep all the frequently accessed data in the first level CPU cache. Consequently, for a reasonable sized cloth mesh the data size of this preferred technique posed a problem.
For our sample, we wrote a very simple distance constraint-based (or iterative position projection) solver. These types of systems (T. Jakobsen [Gamasutra03], X. Provost [GI95]) are very easy to develop and are the most common approach used in video games today. In addition to available middleware cloth solutions, many studios have their own custom implementation tailored to the specific needs of their project. This algorithm updates the vertex position data in place and does not require much extra data. At a high level, the algorithm is simply iterates over each constraint and adjusts the endpoints to be slightly closer to the desired rest-length between them:
Repeat many times (8 to 32)
For each constraint a,b endpoints
v = b - a
u = v/||v||
e = (||v|| - restlength) * bias
a = a + u*e
b = b - u*e
This entire loop is repeated a number of times each frame so if a point is pulled at one end of a cloth object the motion will propagate across the mesh. With more iterations the cloth is more responsive and behaves less "stretchy".
It is not just polygon edges between neighboring vertices that make up the constraints. To model the physical properties of cloth, we use structural, shear, and bend constraints as shown in the following diagram:
Cloth Environment Interaction
Given the intention is to focus on Intel AVX acceleration of distance-constraint cloth, the remaining details of the overall physical simulation are kept as simple as possible, but sophisticated enough to make the sample interesting enough to motivate cloth usage.
In the code sample, a localized breeze moves forward and back in the environment with a wind direction and speed that changes with each pass. Wind is a vector function of the position in space w(x,y,z). Wind influences each point on the cloth proportional to the dot product between the relative wind velocity and the surface normal. The motion influence at each vertex is along the normal direction (lift). A wind drag influence was tested but, as this is uniform across each mesh, it didn't add to the visual appeal of the demo. In this implementation, there is no advanced aerodynamics modeling, such as the Bernoulli Effect, and cloth does not affect the wind.
Figure 2: Wind Lift Dynamics
Each vertex velocity is accelerated by gravity each frame. To slow down objects orthogonal to relative wind and those sliding along surfaces, the velocity is uniformly damped to provide friction. Collision is handed with a separate post-cloth-position projection pass. Interaction is one way - cloth does not influence objects in the scene.
System Architecture and Implementation
After wind, collision and other inputs are collected, the physical simulation is the first major step in the process. The mesh must then be rendered. The SOA data for normal and position along with texture coordinates are rearranged into a vertex buffer (AOS) suitable for Microsoft DirectX* immediate mode rendering. Because this happens to be an 8x8 transpose, we can exploit Intel AVX for this step as well. The resulting vertex buffer still interleaves the 8 cloth sections (at vertex granularity instead of per 32-bit float), so our rendering consequently will be in batches of 8. If desired, the transpose (SOA to AOS) could be rewritten to take 8 separate destination pointers.
Intel AVX is very powerful, and a single core is able to adequately drive a large quantity of cloth such that the rendering is at risk of becoming a significant hotspot. For the sample, using DirectX* 11 with some basic support libraries, we implemented a single pass rendering solution with no shadows to minimize rendering overhead. Using more advanced rendering would benefit from having a second thread for rendering.
The remainder of the code is entirely self-contained. We do not use a typical Intel® SSE-style 4D vector header file library since this would be AOS centric and therefore would not fully utilize Intel AVX. We had good success generating optimal assembly using high level C++ template SOA classes for common math types such as 3D vectors - class vec3<__m256>. This programming pattern has the benefit of keeping the source code to the algorithm itself as clean and simple as the serial version. Occasionally some hand tuning with intrinsics was necessary to avoid register spilling and generate optimal assembly code.
The solver loop is the key hotspot in this application. This loop iterates over the constraints which number about 6 times the vertex count. Furthermore this loop is repeated a number of times (8 to 32) to prevent perceptively stretchy, slow, or unresponsive cloth.
While there needs to be a number of repeated updates per constraint, full accuracy is not important per constraint update. Therefore we save some cycles by not applying the Newton-Rhapson refinement after taking the approximate reciprocal square root (intrinsic _mm_rsqrt_ps()).
The simulation performs best with 64-bit compilation to enable the generated code to use 16 YMM registers. 32-bit compilations were lower performing due to register spilling caused by having only 8 YMM registers.
With 32K first level cache, there can be 4K per cloth (or 1024 floats at 32-bit precision). The data for each point in the cloth mesh includes normal, new position, old position, inverse mass, and velocity. However, the key loop only iterates over the array of new positions and inverse mass. (Note that some additional memory is needed for the topology/constraints which are common to all 8 pieces.) Consequently, staying within L1 can be achieved by using fewer than 200 vertices. We kept within this limit when testing for the best performance possible.
The constraint solver loop is the key hotspot for cloth simulation. This code section was timed within a the sample application as well as within a standalone testing framework to ensure all code paths were optimized and measured properly. Hand-optimized serial, 128-bit Intel AVX, and 256-bit Intel AVX versions were tested on a single Intel® processor with the codename Sandy Bridge microarchitecture. The application is single-threaded. The following table shows the average number of CPU cycles required per constraint update as well as the overall application frame-rate for the simulated scene with many cloth objects totaling just over 37,000 points and about 193,000 constraints which the solver iterated over 16 times per frame.
|Performance||Serial||SIMD 4 (128-bit)||SIMD 8 (256-bit)|
|CPU Cycles per Contraint Update||35.5||9.3||5.6|
|Solver Speed up||1 (Baseline)||3.8X||6.3X|
|Overall Speed up||1 (Baseline)||3.4X||5.6X|
The results show reasonable scaling with increased SIMD width. In the key solver computation, using 256-bit adds more than a 50% performance boost over 128-bit. This makes a noticeable difference in the overall FPS. Clearly, SOA enables more efficient CPU utilization for this particular application.
Not all meshes will be small enough to fit within L1 cache. Fortunately, Sandy Bridge microarchitecture has good hardware prefetching. Additional tests simulating 8 larger cloth pieces with 2500 points, which requires using the L2 cache, took 6.5 cycles per constraint update - only about 16% slower. Note that such meshes require many more solver iterations to avoid being too stretchy. Rather than dense meshes, it is usually more effective to use subdivision surfaces or other tessellation to improve the visual fidelity of cloth.
As the code sample shows, a single core with Intel AVX provides enough floating point operations (FLOPS) to drive a gratuitous amount of cloth while still maintaining a high frame rate.
Intel AVX offers improved FLOPS performance if an application can find a way to utilize it. SOA style implementation is not the first thing that comes to mind when people think of cloth simulation. This novel approach is able to demonstrate good SIMD usage and could potentially be appropriate for video game scenes with lots of active cloth objects or deformable debris in the environment.
- D. Baraff, A. Witkin: "Large Steps in Cloth Simulation", SIGGRAPH 1998.
- Kwang-Jin Choi, Hyeong-Seok Ko: "Stable but Responsive Cloth", SIGGRAPH 2002.
- T. Jakobsen: "Advanced Character Physics", Gamasutra, January 21 2003.
- H. Enqvist: "The Secrets Of Cloth Simulation In Alan Wake", Gamasutra, April 29, 2010.
- X. Provot: "Deformation constraints in a mass-spring model to describe rigid cloth behavior", Graphics Interface 1995.
- Intel® Advanced Vector Extensions (Intel® AVX): http://software.intel.com/en-us/avx/
About The Author
Stan Melax received his Bachelor's and Master's degrees in Computing Science from the University of Alberta in Canada in the early 90s. He has spent much of his career in the game industry, including BioWare, EA, and Ageia. Stan is now a Graphics Software Engineer at Intel Corporation, where he gets to help developers write faster code and make better applications.