Imagine playing a video game where the player guides a character through a marsh in the pitch black dead of night; the only guiding light is a swarm of fireflies that follow the player. Or imagine playing a game where the player guides a character through a desert, kicking up dust clouds with each step. These effects can be computationally expensive, but using a multithreaded implementation, they can be added to a game and scaled based on the processing power of the given system.
Fireflies is a code sample demonstrating scalable ambient effects. In this sample, thousands of fireflies scatter, flock, and then return to settle and form a walking character. The ambient effect in the sample uses simple AI that includes flocking and collision avoidance with the terrain and surrounding trees. By utilizing task-based threading, the sample scales to use all available CPU cores on a target machine. All the necessary calculations for the AI are optimized by dividing the work into tasks that can be run in parallel. The task scheduler is written with Intel® Threading Building Blocks (Intel® TBB).
To get a "feel" for this code, download and run it. While it runs, switch between multithreaded and serial mode to see the performance difference that multithreading can bring. In the task-based threading mode, there is the option to change the number of tasks. Experimenting with these options on a multi-core machine makes it apparent that the number of tasks affects the performance of the sample. A low number of tasks, such as 1 or 2, yields lower performance, while a higher number of tasks yields a performance increase. Of course, changing the number of particles also affects the sample's performance. The user interface controls on the right-hand side were included so that a user can experiment with which settings work best on a given machine. When integrating an ambient effect like Fireflies, the goal is to add the best possible ambient effect without slowing down the overall application performance.
The Fireflies sample includes functionality to auto-scale the ambient effect. In the upper right-hand corner of the UI, there is a button labeled "Auto-Calibrate Optimal Number Particles". This button causes the sample to estimate the maximum number of particles that can be simulated while maintaining a base level of performance on the target machine. The auto-scaling makes the fireflies continuously flock close together, to simulate the highest CPU workload experienced in the sample. To achieve the greatest possible throughput, more tasks are spawned than the total number of logical hardware threads. This works well because Intel TBB automatically distributes the total workload, and fine-grained tasks are scheduled more consistently. After setting a value for the number of tasks, the sample tries different values for the number of particles to simulate, looking for the highest number of particles that can be simulated while still maintaining at least 30 frames per second.
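The calibration step described above can be pictured as a search over particle counts. The following is only a minimal sketch, not the sample's actual code: `calibrateParticleCount` and the `measureFps` callback are hypothetical names, the binary search is an assumed strategy (the article does not say how the sample walks through candidate values), and the sketch assumes the measured frame rate never rises as particles are added.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: returns the largest particle count in [lo, hi] whose
// measured frame rate stays at or above targetFps, or 0 if even `lo` is too slow.
// Assumes measureFps(n) does not increase as n grows.
int calibrateParticleCount(const std::function<double(int)>& measureFps,
                           int lo, int hi, double targetFps = 30.0) {
    assert(lo > 0 && lo <= hi);
    int best = 0;
    while (lo <= hi) {
        const int mid = lo + (hi - lo) / 2;
        if (measureFps(mid) >= targetFps) {
            best = mid;    // mid particles are sustainable; try more
            lo = mid + 1;
        } else {
            hi = mid - 1;  // too slow; try fewer
        }
    }
    return best;
}
```

In a real application, `measureFps` would run the simulation at its heaviest workload (the tightly flocking state mentioned above) for a short period and report the lowest frame rate observed.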
To "drill down" and visualize how the sample works, run the sample using the Profile build of the executable. The sample's Profile version has macros that capture frame activity and performance information in the Platform View of Intel® Graphics Performance Analyzers (Intel® GPA).
Divide Work Into Tasks
Compared to the serial version of the application, the computations performed per frame for each firefly in the multithreaded version are split among multiple tasks. When fireflies scatter from the model and later return, they perform the calculations necessary to flock together and to avoid obstacles, both the terrain and vertical obstructions such as pillars and trees. When the sample is running in serial mode, each firefly performs its flocking and collision detection tests in order, one after another.
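As a rough illustration of the serial mode, the sketch below updates every firefly in a single loop. It is not taken from the sample: the `Firefly` struct, `terrainHeight`, and `updateSerial` are hypothetical, the flocking is reduced to a single "steer toward the flock center" rule, and collision avoidance is reduced to keeping each firefly a minimum clearance above a flat terrain.

```cpp
#include <cassert>
#include <vector>

struct Firefly { float x, y, z; };  // y is height

// Hypothetical terrain query; a flat ground plane for illustration only.
float terrainHeight(float /*x*/, float /*z*/) { return 0.0f; }

// One serial pass: each firefly in turn moves a fraction `cohesion` of the way
// toward the flock center, then is pushed up if it is too close to the terrain.
void updateSerial(std::vector<Firefly>& flies, float cohesion, float clearance) {
    assert(cohesion >= 0.0f && cohesion <= 1.0f);
    if (flies.empty()) return;

    Firefly center{0.0f, 0.0f, 0.0f};
    for (const Firefly& f : flies) {
        center.x += f.x; center.y += f.y; center.z += f.z;
    }
    const float inv = 1.0f / static_cast<float>(flies.size());
    center.x *= inv; center.y *= inv; center.z *= inv;

    for (Firefly& f : flies) {
        f.x += (center.x - f.x) * cohesion;
        f.y += (center.y - f.y) * cohesion;
        f.z += (center.z - f.z) * cohesion;
        const float minHeight = terrainHeight(f.x, f.z) + clearance;
        if (f.y < minHeight) f.y = minHeight;
    }
}
```

A fuller flocking model would typically add separation and alignment rules as well; the point here is only that the serial version visits the fireflies one after another.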
On the other hand, when running the sample in multithreaded mode, the firefly flight calculations are broken up into tasks. In this case, a task simply refers to a fixed-size batch of firefly flight calculations, and separate tasks can execute on separate threads. All flight calculations are independent of each other, so they may easily be done in parallel. In theory, the more tasks there are, the more flight calculations can be completed in parallel. In reality, however, the parallelization of these calculations is limited by the actual number of CPU cores. Moreover, scheduling a task incurs overhead, so the amount of work assigned to each task should outweigh the scheduling overhead. Breaking up the work efficiently requires finding the right number of tasks to reach peak parallel performance without too much overhead cost.
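The idea of splitting the flight calculations into tasks can be sketched with the C++ standard library alone. This is an illustration under stated assumptions rather than the sample's scheduler: `Firefly`, `updateFlight`, and `updateInTasks` are hypothetical names, and the sketch starts one `std::thread` per task, whereas Intel TBB maps tasks onto a pool of worker threads (a similar loop is usually expressed there with `tbb::parallel_for` over a `tbb::blocked_range`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Firefly { float x, y, z, vx, vy, vz; };

// Stand-in for one firefly's flight calculation: a simple position update.
void updateFlight(Firefly& f, float dt) {
    f.x += f.vx * dt;
    f.y += f.vy * dt;
    f.z += f.vz * dt;
}

// Splits the fireflies into at most numTasks contiguous batches and updates
// each batch on its own thread. Batches never overlap, so no locking is needed.
void updateInTasks(std::vector<Firefly>& flies, std::size_t numTasks, float dt) {
    assert(numTasks > 0);
    const std::size_t n = flies.size();
    const std::size_t tasks = std::min(numTasks, std::max<std::size_t>(n, 1));
    const std::size_t chunk = (n + tasks - 1) / tasks;  // ceiling division

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&flies, begin, end, dt] {
            for (std::size_t i = begin; i < end; ++i) updateFlight(flies[i], dt);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for every batch to finish
}
```

The batch size is what trades parallelism against scheduling overhead: very small batches create many tasks whose startup cost can exceed the work they contain, while very large batches leave cores idle.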
Figure 2 shows a graph of the lowest frames per second (fps) recorded for various sizes of task sets. From Figure 2, it is apparent that there is a maximum number of particles that works well with a given number of tasks. With too many particles, increasing the number of tasks does not have a great impact: compute time is wasted spawning extra tasks without any performance benefit, and without more cores to run the extra tasks there is no increase in the parallel work being done. However, for a high number of particles, distributing the particle calculations across multiple tasks did yield a significant performance increase compared to simply running the sample serially. As shown in Figure 2, splitting particle calculations across as few as 4 tasks gave a performance increase of as much as 2x. Increasing the number of tasks to 12 yielded as much as a 4x performance increase.
Overall, from a performance standpoint it is advantageous to multithread an ambient effect so that it can take advantage of a multi-core processor. In addition, Intel TBB task-based threading allows the calculations to be distributed across all the available cores. As can be seen from Figure 3, the parallel portion of the simulation experiences a fairly linear increase in performance as the number of cores available for the simulation increases. This graph was obtained by measuring the estimated average time taken each frame to perform the purely parallel portion of the code, the firefly flight trajectory calculations, with a varying number of cores assigned to the sample.² The graph shows the estimated average number of firefly flight trajectory update calculations that can be done per frame given a certain number of cores.
One may notice that when running the sample with a small number of fireflies, the multithreaded mode still runs faster than the serial mode. Besides the fireflies' flight trajectory calculations, another important part of the code is how the task-based threading is used to parallelize the computation performed in setting up and rendering each frame. Running serially, the sample performs the usual frame setup, processing data for that frame, and rendering. However, when run in multithreaded mode, the processing of the fireflies is done in the previous frame and in parallel with the frame render, which in effect shortens the total time needed per frame. Below one can see the different sequence of steps executed in multithreaded frame activity compared to the serial frame activity. One can also see in Figure 5 a screenshot of all the sample's thread activity in a frame when run in multithreaded mode, as captured in Intel® GPA Platform View.
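The overlap between firefly processing and rendering can be sketched with two copies of the firefly data: the render step reads the current frame's data while a background task computes the next frame's data into a second buffer. This is an assumed structure for illustration only; `FrameState`, `simulate`, `render`, and `runPipelined` are hypothetical stand-ins, and `std::async` is used in place of the sample's Intel TBB tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

struct FrameState { std::vector<float> heights; };  // simplified firefly data

// Stand-in for the firefly processing: derives the next frame's data.
void simulate(const FrameState& in, FrameState& out) {
    out.heights.resize(in.heights.size());
    for (std::size_t i = 0; i < in.heights.size(); ++i) {
        out.heights[i] = in.heights[i] + 1.0f;
    }
}

// Stand-in for rendering: only reads the state and reports a checksum.
float render(const FrameState& s) {
    float sum = 0.0f;
    for (float h : s.heights) sum += h;
    return sum;
}

// Runs `frames` frames and returns what each frame rendered.
std::vector<float> runPipelined(FrameState current, int frames) {
    assert(frames >= 0);
    FrameState next;
    std::vector<float> rendered;
    for (int f = 0; f < frames; ++f) {
        // Start computing frame f+1's data from frame f's data...
        std::future<void> pending = std::async(
            std::launch::async, [&current, &next] { simulate(current, next); });
        // ...while this thread renders frame f. Both only read `current`.
        rendered.push_back(render(current));
        pending.get();             // processing must finish before the swap
        std::swap(current, next);  // frame f+1's data becomes current
    }
    return rendered;
}
```

Because the processing and the render both only read the current buffer and only the background task writes the other one, the two steps can run side by side without locks; the cost is the memory for a second copy of the firefly data.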
This sample shows an ambient effect that can be used to enhance a game and demonstrates how distributing the computation across multiple tasks yields multiple benefits. Not only does multithreading increase performance, but it also enables the ambient effect to scale easily across platforms with different CPU power. By changing the number of tasks used to perform the calculations and the number of objects needing calculations, developers can create scalable ambient effects, such as the one in the Fireflies sample, once and not have to worry about the processing power of their target platform. With a task-based threading methodology, developers can write code for ambient effects and have it run on a variety of processors, from Intel® Atom™ processors in netbooks all the way up to high-end desktop systems.
About the Author
Eliezer Payzer is an intern with Intel's Visual Computing Software Division, where he works on samples that demonstrate the power of Intel® architecture. He is finishing his Master's in Computer Science at the University of Southern California.
¹ Testing was completed on an Intel® Core™ i7-980X processor-based machine running at 3.33 GHz with 6 GB of RAM using an NVIDIA GeForce* GTX 285 graphics card.
² Data obtained on an Intel® Core™ i7-980X processor-based machine running at 3.33 GHz with 6 GB of RAM using an NVIDIA GeForce* GTX 285 graphics card. Processors were assigned to the sample by setting processor affinity in the Task Manager.