Multi-threading Line-of-Sight Calculations to Improve Sensory System Performance in Game AI

By Alex J. Champandard

The implementation described in this article was based on a prototype by Jeremy Tryba and a mini-game by Radu Cristea. The code listings were heavily adapted to make sense on their own. Björn Knafla and Julien Hamaide made numerous suggestions about design and performance improvements.

What Makes Sensory Information So Important?

Artificial intelligence (AI) is only as good as the information it gets: garbage in, garbage out. Even if you have delays of a few seconds or simply not enough information, it will show in the behaviors. In concrete terms, this means that you see the AI in action games picking obviously wrong places to take cover or even idling for multiple seconds despite having been exposed [1].

Obviously, there are ways to avoid such undesirable behaviors by providing the AI with better information, and that’s typically the responsibility of the sensory system. Generally, sensory systems can be designed to perform a variety of AI-related tasks, from perception to reasoning [4]. But the biggest challenge for performance is that line-of-sight (LoS) queries used to gather useful information from the world are often very expensive to run.

In practice, game developers use many common tricks to improve sensory performance, for instance:

  1. Using a random subset of nearby cover (also called stochastic selection) to limit the number of LoS checks [2].
  2. Approximating the LoS using a cache, which can be a circular map, cylindrical map, or cube map for locations in space [3].

Both of these tricks help approximate the same information, but you’ll still need LoS queries to get perfect information or deal with dynamic worlds. As such, any performance improvements are more than welcome here.

In this article, you’ll see how you can accelerate such LoS queries using multi-threading techniques through the concept of a centralized sensory system [8]. This approach also has benefits on lower-end hardware (by managing the LoS queries done every frame), but the next few pages focus on how it opens the door to multi-core architectures.

Hide & Seek and Performance Problems

You’d be surprised to hear how many challenges of modern action games are also in a Hide & Seek game! This is one of the reasons it was implemented as a mini-game prototype in our AI Sandbox. (The AI Sandbox, shown in Figure 1, is the C++ game framework behind AiGameDev.com, built on top of an open source stack of libraries and designed for prototyping game AI. It’s not available publicly yet, but you can find out more at http://aisandbox.com.)

Figure 1. Hide & Seek mini-game in the AI Sandbox. A single patrolling agent (in red) is moving around the world in a fixed patrol route, while the hiders (in yellow) are fleeing to cover, because their previous hiding location was invalidated. While hiding, dozens of LoS calculations are evaluated for nearby cover points.


In this Hide & Seek game, dozens of simulated computer characters run and find cover from each other. These AI agents use sensors to perceive their surroundings and to search for other agents to locate or hide from, while detecting obstacles to avoid or use as cover.

The simplest solution for enabling your agents to perceive their surroundings is to perform all LoS calculations wherever you need them in the code. Here’s an example of this approach:

bool HideBehavior::isCoverSafe(const Vector3& coverPosition, const Vector3& enemyPosition)
{
	btCollisionWorld::ClosestRayResultCallback cb(coverPosition, enemyPosition);
	m_pDynamicsWorld->rayTest(coverPosition, enemyPosition, cb);
	return (cb.hasHit() == false);
}

This function is called repeatedly from within a loop evaluating all sensible cover locations. As such, all agents execute the low-level rayTest calculations sequentially as they evaluate the LoS for each location themselves (see Figure 2). This code causes multiple problems:

  • The immediate execution of the LoS offers no opportunities for batching up calculations.
  • There may be uncontrollable computation spikes that can ruin the frame rate or worse.
  • The code accesses memory in an unpredictable pattern and wastes memory bandwidth by reading the same data repeatedly.

Figure 2. Multiple agents (in red) patrolling both a fixed route and around random positions and the hiders (in yellow), each trying to find cover hidden from their biggest threats. LoS queries are used for the multiple threats while finding cover but also to determine the validity of the current cover. The agent in the top left has been seen and has started moving to a new location.

In the next section, you’ll see how to remove the root cause of these issues.

Building a Sensory System

The basic idea for improving the code above is to have a centralized sensory system that’s responsible for all sensory processing, particularly the expensive LoS calculations. Sensory information is then requested by the actual AI logic, for instance from a Behavior component in each Agent, and the result is returned when it’s available.

Because there is only one sensory system in the whole game, it has the opportunity to process the information in its own time. This opens up the door for batching up the queries, managing the workload each frame, and, most importantly, allowing multiple threads to perform the calculations.

Exposing Sensory Queries

Using a central sensory system requires a different interface than the one shown in the previous section. Instead of a synchronous procedural call, you use an asynchronous request that specifies which calculations need to be done in a more declarative fashion. You achieve this by using Query objects in the following way:

class LineOfSightQuery : public Query
{
public:

	enum Visibility
	{
		UNKNOWN,
		VISIBLE,
		INVISIBLE
	};

	Visibility getVisibility() const;
	void setSourcePosition(const Vector3& pos);
	void setTargetPosition(const Vector3& pos);

	LineOfSightQuery();

protected:
	Visibility m_Visibility;

	Vector3 m_SourcePosition;
	Vector3 m_TargetPosition;

}; // class LineOfSightQuery


The result of this approach is that it decouples the sensory system interface from the underlying control flow. The AI code can submit queries and receive the results separately, either by polling every frame to see whether the LoS result is available yet or by using callbacks, as you see here:

class HideBehavior : public Behavior
{
protected:
	LineOfSightQuery m_Query;

	/**
	 * Request some information from the sensory system.
	 */
	void onStart()
	{
		Query::Observer observer;
		observer.bind(this);

		m_Query.setSourcePosition(m_CoverPoint);
		m_Query.setTargetPosition(m_Enemy.getPosition());
		m_pSensorySystem->addQuery(m_Query, observer);
	}

	/**
	 * Called when the result is available, if the query isn't
	 * cancelled in the meantime.
	 */
	void onCoverLosCalculated(Query&)
	{
		if (m_Query.getStatus() == SUCCEEDED)
		{
			Visibility v = m_Query.getVisibility();		
			// ...
		}
	}

	/**
	 * When the behavior exits, make sure the query is removed from
	 * the sensory system.
	 */
	void onStop()
	{
		m_pSensorySystem->cancelQuery(m_Query);
	}

}; // class HideBehavior

 

With this kind of system in place, the code becomes easier to follow, as there’s a clear place for making LoS requests, processing LoS calculations, and dealing with the results, instead of having the calculations scattered throughout the code. You’ll get better results from such an observer-based notification mechanism, although it might take some extra care to make sure it’s done in a thread-safe fashion.
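The polling alternative mentioned above is worth sketching as well. The `Status` enum and member names below are illustrative stand-ins, not the sandbox’s actual API; the point is simply that the behavior checks the query’s status once per frame instead of binding an observer:

```cpp
// Hypothetical status values mirroring the asynchronous Query interface.
enum class Status { PENDING, SUCCEEDED, FAILED };

// A minimal stand-in for a LoS query with a pollable status.
struct LineOfSightQuery
{
    Status status = Status::PENDING;
    bool visible = false;           // filled in by the sensory system
    Status getStatus() const { return status; }
};

// Polling variant of the hide behavior: instead of an observer
// callback, the behavior inspects the query during its per-frame update.
struct HideBehavior
{
    LineOfSightQuery query;
    bool coverSafe = false;

    void onUpdate()
    {
        if (query.getStatus() == Status::SUCCEEDED)
        {
            // The cover point is safe if the enemy cannot see it.
            coverSafe = !query.visible;
        }
        // Otherwise the result isn't ready yet; try again next frame.
    }
};
```

Polling is simpler to make thread-safe (the behavior only reads the result on the main thread), at the cost of one frame of extra latency compared to a callback.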

Peak performance improves only slightly, as the code is still single-threaded. There’s some extra overhead in memory and computation because of the new interface (which slows down performance), but the sensory system can now process all the queries at once, so cache coherence increases (which improves performance). This is one of the benefits of batching, even if you don’t use multi-threading. However, a major benefit of this new architecture is that it can scale up to support multiple cores and opens the door for better AI parallelism.


Parallelization of Sensory Jobs

Internally, the sensory system turns each query into a job during a preparation phase. This first phase (the pre-update) runs sequentially and sets up the workload to be done immediately after. The second phase (the update) processes each job independently; the jobs can be performed one by one or in parallel, because they have no interdependencies. All the LoS jobs finish in the same frame in which they start, so no jobs persist into the next frame. Finally, the third phase (the post-update) dispatches the observers to notify the code that originally made the request. This phase is intended to run quickly and sequentially to prevent any synchronization problems.
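The three-phase structure can be sketched in a few lines. This is a simplified, single-threaded illustration of the pipeline (the `Job` members and `tick` name are assumptions for the sketch, not the sandbox’s real interface); in the actual system the middle loop is the part handed to the parallel scheduler:

```cpp
#include <functional>
#include <vector>

// Hypothetical job: one LoS calculation plus the observer to notify.
struct Job
{
    bool finished = false;
    std::function<void()> onFinished;   // observer dispatched post-update
    void update() { finished = true; }  // stand-in for the actual ray test
};

struct SensorySystem
{
    std::vector<Job> jobs;

    // Phase 1 (pre-update, sequential): queries become jobs.
    void addQuery(std::function<void()> observer)
    {
        Job j;
        j.onFinished = observer;
        jobs.push_back(j);
    }

    void tick()
    {
        // Phase 2 (update): jobs are independent, so this loop is the
        // one that can safely run across multiple threads.
        for (Job& j : jobs)
            j.update();

        // Phase 3 (post-update, sequential): dispatch the observers on
        // the main thread to avoid synchronization problems.
        for (Job& j : jobs)
            if (j.finished && j.onFinished)
                j.onFinished();
        jobs.clear();   // no jobs persist into the next frame
    }
};
```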

Intel® Threading Building Blocks

In the current implementation, the main update phase is processed by the Intel® Threading Building Blocks (Intel® TBB) library [5], because this kind of problem is suited to even the simplest constructs that Intel TBB provides. As Björn Knafla mentions:

“Running independent jobs is a problem often called embarrassingly parallel: something easy to run in a data-parallel fashion. James Reinders calls such parallel problems ‘happy parallelism’ in his book Intel Threading Building Blocks (p. 9)” [6].

In practice, the code uses parallel_for, which splits the array of jobs into multiple sub-ranges and in turn divides them over the different threads available to the application. Here’s the code that does this:

class JobProcessor
{
public:
	typedef std::vector<Job*> Jobs;

	JobProcessor(Jobs& j)
	:	m_Jobs(j)
	{
	}

	/**
	 * Block operator for TBB that processes a range from the array.
	 */
	void operator()(const tbb::blocked_range<size_t>& range) const
	{
		for (size_t i = range.begin(); i < range.end(); ++i)
		{
			Job& job = *m_Jobs[i];

			// Queries can be set to cancelled while the update is
			// underway.
			if (job.getStatus() == Job::CANCELLED)
			{
				continue;
			}

			// Perform the line-of-sight calculations here, typically
			// done in one update.
			bool finished = job.update();
			if (finished)
			{
				// Set the flag so the post-update knows to dispatch
				// the observer.
				job.setStatus(Job::FINISHED);
			}
		}
	} // operator()

	Jobs& m_Jobs;

}; // class JobProcessor

void SensorySystem::update()
{
	tbb::blocked_range<size_t> range(0, m_Jobs.size());
	tbb::parallel_for(range, JobProcessor(m_Jobs));
} // update()

Intel TBB also provides a parallel_do construct, which has a much nicer application programming interface (API). However, it does not divide the workload up as well, because it doesn’t assume the array of jobs is static and known beforehand. In practice, the performance benefits were hardly noticeable with parallel_do; in such embarrassingly parallel situations, parallel_for fits better.
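To see why a static, known-size array splits so cleanly, here is a rough illustration of what parallel_for does with a blocked range: divide [0, n) into one contiguous sub-range per worker and run the same body over each sub-range concurrently. This is a simplified sketch using std::thread, not TBB’s actual work-stealing scheduler, and the function name is made up for the example:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative static range splitter: each worker gets one contiguous
// chunk of [0, n), so independent jobs never contend with each other.
void parallelForRange(std::size_t n, std::size_t numThreads,
                      const std::function<void(std::size_t, std::size_t)>& body)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + numThreads - 1) / numThreads;

    for (std::size_t t = 0; t < numThreads; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        if (begin >= end)
            break;
        workers.emplace_back(body, begin, end);  // body(begin, end) on a thread
    }
    for (std::thread& w : workers)
        w.join();
}
```

Because the chunks are computed once up front, the per-job scheduling overhead is near zero, which is exactly what makes parallel_for a good match for a fixed batch of LoS jobs.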

Under the Hood

The LoS checks are implemented using the Bullet Physics Library [7]. The library contains some prototype code to support parallelism in the narrow phase of the collision query, but not the broad phase. This means that the process of narrowing down the objects is done sequentially, but the collision checks against the individual objects are done in parallel.

The game prototype described in this article does not use Bullet’s multi-threading. Instead, parallelism is applied at a higher granularity; entire ray tests are run in parallel. Although this approach doesn’t allow as much control over the underlying data, the concept can be applied transparently to any collision system, even the algorithms that use cover-map lookups as approximations, as described in the introduction.

Figure 3. Top-down view of the AI Sandbox during a game of Hide & Seek. In this example, the hiders start at the bottom left (then proceed to find cover against the central blocks), and the seekers are spawned at the top right (then proceed to divide to patrol the map).

Results and Discussion

In the first test, the whole simulation was run as normal; graphics were enabled, all subsystems in the simulation were running, and the system targeted 30 frames per second (FPS). The results were gathered for 32 agents in the world over 30 seconds (both real time and in-game simulation time), which resulted in approximately 4000 LoS queries. The tests were run on an Intel® Core™ i7 processor-based system with Windows Vista*. The results were the average of six runs, with the best and worst results discarded.

Figure 4 shows the results. The best performance achieved was a 40% gain in average query execution time compared to the non-threaded version of the code. This was achieved using four threads. The result is most likely due to a combination of factors. Julien Hamaide pointed out that LoS algorithms can be ALU intensive and that the Intel® Core™ i7 processor has four cores with two hardware threads each. This could lead to the floating-point units being fully subscribed at four threads, so adding further threads (for example, 5–8) would simply make the situation worse.

Figure 4. Number of LoS queries per second as the number of threads increases under normal conditions in the game. Four threads produced the best-performing configuration. All results are normalized to the single thread’s performance.

Other possible explanations for the results beyond four threads:

  1. The overhead of using separate threads does not pay off under these conditions.
  2. The cache isn’t being leveraged to its full potential because of memory access patterns.

The following experiment helps shed a bit more light on this problem.

Stress Test

By design, the current implementation of the Hiding behavior in the game does not use LoS queries as intensively as it could. Only when cover locations are invalidated do the Hiders look for new cover, which requires a few dozen LoS queries, depending on the situation.

As an additional test, the simulation was run without animation and pathfinding to speed up the phases without sensory queries. The behavior was also modified to seek better cover immediately after reaching the current hiding location. Finally, the number of agents was bumped to 128. The simulation made over 60,000 queries in 5 seconds of game time, which was accelerated by disabling the rendering. This scenario was chosen to help reveal the full potential of multi-threading when applied to LoS queries, for example when used during precalculations.

Figure 5 shows these results. In terms of performance, eight threads reduced the average query time by 80% compared to the single-threaded version (also running through Intel TBB). In terms of multi-threading efficiency, four threads offered the best compromise, achieving around 93% of the theoretical peak acceleration over the single-threaded version. This makes it a good choice, as the other four threads could then be used for other sub-systems.

Figure 5. Time taken for the LoS queries as the number of threads increased under extreme load. Eight threads produced the best performance.

This stress test shows the importance of maximizing the calculations to be done in parallel. The fewer the queries, the smaller the benefit of parallelism; the overhead of setting up the calculations on multiple threads is harder to justify, as shown in the previous experiment.

Improving Performance

Because parallelizing LoS queries using Intel TBB works best under heavy load, it would obviously make sense to maximize the number of LoS queries. However, because each query has its price, that’s not necessarily the best thing to do. A hybrid policy is in order.

Thanks to the centralized sensory system, the code has control over when the calculations are performed. The queries can be batched up over multiple frames and calculated only every 2–4 frames, once there’s enough work to do. This would allow the multiple threads to get closer to their peak performance at the cost of, in the worst case, four frames of lag in the sensory information. As long as the lag isn’t measured in seconds, this is an acceptable tradeoff in sensory system implementations, because it helps simulate human reaction times.
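Such a hybrid policy can be sketched in a few lines. The class name, thresholds, and query handles below are hypothetical; the idea is simply to flush the batch either when it’s large enough to keep the worker threads busy or when the oldest query has waited the maximum number of frames:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical batching policy: accumulate queries and flush only when
// there's enough parallel work, or when the lag budget runs out.
class QueryBatcher
{
public:
    QueryBatcher(std::size_t minBatch, int maxLagFrames)
    :   m_MinBatch(minBatch), m_MaxLagFrames(maxLagFrames)
    {
    }

    void addQuery(int queryId) { m_Pending.push_back(queryId); }

    // Called once per frame; returns true when the batch should be
    // processed this frame (e.g. handed to tbb::parallel_for).
    bool shouldFlush()
    {
        if (m_Pending.empty())
        {
            m_FramesWaited = 0;
            return false;
        }
        ++m_FramesWaited;
        if (m_Pending.size() >= m_MinBatch || m_FramesWaited >= m_MaxLagFrames)
        {
            m_FramesWaited = 0;   // batch is big enough, or lag budget spent
            return true;
        }
        return false;
    }

    // Hand the accumulated queries over for parallel processing.
    std::vector<int> takeBatch()
    {
        std::vector<int> out;
        out.swap(m_Pending);
        return out;
    }

private:
    std::vector<int> m_Pending;
    std::size_t m_MinBatch;
    int m_MaxLagFrames;
    int m_FramesWaited = 0;
};
```

With minBatch tuned to the thread count and maxLagFrames set to 4, this gives the worst-case four frames of lag described above while keeping the batches large enough for parallelism to pay off.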

Summary

If one of your target platforms supports multi-threading and if your LoS checks are proving to be a bottleneck, parallelizing your sensory system can increase the amount of information for the AI to avoid obviously stupid decisions (for example, those decisions resulting from randomly chosen cover points). Using multi-threaded LoS jobs also helps ensure that the AI gets information about the world at predictable intervals, so the resulting behavior is more responsive to dynamic changes.

However, even if you need to support low-end hardware, the approach described in this article also has many benefits, including avoiding computation spikes, batching up queries for better memory access, and limiting the processing for each frame. In practice, when you implement a dedicated system to support modern hardware, it can often also support traditional platforms better.

In the next article, you’ll learn a better way of structuring the queries to help the sensory system maximize the benefits of the LoS batch calculations and reduce the lag between the requests being made to the sensory system and the result being provided back to the AI.

About the Author

Alex Champandard is the mastermind behind AiGameDev.com, the largest online hub and continuous training program for artificial intelligence in games. He has worked in the digital entertainment industry as a senior AI programmer for many years, most notably for Rockstar Games. He regularly consults with leading studios in central Europe, recently finishing a contract on the multiplayer bots for KILLZONE 2 at Guerrilla Games. Alex also organizes Paris Game AI Conference and is an Associate Editor for the IEEE Transactions on Computational Intelligence & AI in Games. He serves on the Program Committee for the AIIDE Conference.

References

[1] See “Common AI Challenges for Modern First-person Shooters (Part 1, Video)” by Alex J. Champandard.
[2] See “Terrain Annotations: From Designer Hints to Automated Pre-processing” by Alex J. Champandard.
[3] See “Waypoint Cover Maps and Efficient Raycasts on PS3 in Killzone 2” by Alex J. Champandard.
[4] See “Building an AI Sensory System: Examining the Design of Thief: The Dark Project” by Tom Leonard.
[5] See the Intel TBB home page.
[6] See Intel Threading Building Blocks by James Reinders (O’Reilly, 2007)
[7] See the Bullet Physics Library.
[8] See “A Programmer’s Guide to Building a Low-Level Sensory System” by Alex J. Champandard.
