English | 中文 | Русский | Français
2,857 Posts served
8,606 Conversations started
Synchronization between threads in the Infernal Engine
Thread synchronization is a complicated problem and rarely discussed in practice. We came to our own conclusions via experimentation and what worked well for us during the production of Ghostbusters.
Ghostbusters used two kinds of synchronization primitives, "crude locks" and "critical sections". A crude lock is the lowest form of synchronization - it lets one thread sit in a loop until the other thread lets it continue. Here is how a simple implementation of a crude lock class could look in C++...
class CCrudeLock {
volatile long value;
public:
CCrudeLock() { value = 1; }
void lock() {
for (;;) {
if (1 == InterlockedCompareExchangeAcquire(&value,0,1)) {
return;
}
}
}
void unlock() {
value = 1;
}
};
As you can see in the lock code, if one thread tries to acquire the lock, and it is busy, it will sit in a tight loop burning CPU time forever until it gets the lock. The InterlockedCompreExchangeAcquire function is an atomic function that will synchronize the access of a variable across multiple threads. In assembly language, it could look like the following:
mov ecx,dword ptr [esp+4] ; Get the address of value into ecx
mov edx,dword ptr [esp+8] ; Get "0" into edx
mov eax,dword ptr [esp+12] ; Get "1" into eax
lock cmpxchg dword ptr [ecx],edx ; Compare the value of eax with the destination, and if equal, write edx into the destination
ret 12 ; eax = return value
So the guts of the InterlockedCompareExchangeAcquire function is the "cmpxchg" instruction with the "lock" prefix, which preserves the order of operations so that more than one processor does not interrupt this single instruction, i.e., it functions atomically.
On a cache coherent multiprocessor, releasing the crude lock is as simple as writing back to the memory location. If another thread is attempting to acquire the lock, the atomic interlocked function will guarantee the order of operations.
The other type of synchronization primitive we used is the standard Windows CRITICAL_SECTION structure. This structure is well documented, although the implementation may not be, so we will discuss how it might work internally. A standard critical section is similar to the crude lock, with the loop being finite - if a certain amount of loop iterations happen, and it cannot acquire the lock, then yield the thread for an amount of time and start over.
class CCriticalSection {
volatile int value;
public:
CCriticalSection() { value = 1; }
void lock() {
int loopCount = 0;
for (;;) {
if (1 == InterlockedCompareExchangeAcquire(&value,0,1)) {
return;
}
loopCount++;
if (loopCount > 400) {
Yield();
loopCount = 0;
}
}
void unlock() {
value = 1;
}
};
Note that the loopCount value is local to the lock function, so that more than two threads can attempt to access our critical section at any one time.
When is it appropriate to use a crude lock instead of a critical section?
In the Core i7 processor, Intel introduces 4 cores, with 2 threads each, or hyperthreading. Although there are two sets of register contexts for each thread, there is only one execution unit. So if thread 1 is blocked on a core, thread 2 can start to execute if possible. If two code threads try to acquire the lock in a crude lock and they are both executing on the same processor, you can have a live lock situation occur. If you are running on a Core2 Quad or other multi core, non hyperthreaded machine, you will not have this situation occur.
Ghostbusters was a multiplatform title, with the PC, Xbox 360, and PS3 supported. For the PC and PS3, we used critical sections to synchronize code. The PC could be hyperthreaded, and the PS3 is hyperthreaded. Although the Xbox 360 is hyperthreaded, you are able to select which thread on which core you can run your code on, so in the end we were able to use the lighter crude lock for thread synchronization, because we could guarantee what hardware threads our game would run on.
Note that whenever you have a crude lock or critical section in your code, you need to guarantee that you will be accessing that resource for only a very short amount of time. Staying inside of a critical section of your code for a long time will have negative effects on performance.
| July 16, 2009 3:04 PM PDT
Mark Randel
|
We wrote our own instrumented profiler for the Infernal Engine - it worked very well, but didn't catch everything on the system level - VTune helped us find the really tough threading contention issues that were happening on different system configurations. |
| July 17, 2009 5:44 AM PDT
jimdempseyatthecove
|
Excellent article Mark. On both your locks consider inserting an _mm_pause(); in the fast path wait loop (after fail). This will relieve some processing overhead especially for HT companion thread. And for your critical section, be careful of what you use for Yield. If you use SwitchToThread(), normally a very good Yield technique, .AND. if you oversubscribe threads, it is possible that the running threads, all performing Yields, will lock out all pre-empted threads, including the thread holding the lock. Consider adding when the SwithchToThread fails n times, call Sleep(0) (or whatever equivelent you have). There is no requirement to use the "Compare", you can use InterlockedExchange, or InterlockedExchangeAcquire as this may save a few clock ticks. Finaly, because your lock variable is volatile, consider adding an if test immediately prior to the InterlockedExchange such that the InterlockedExchange is attempted only when the lock appears to be released. By doing this you relieve pressure on the memory bus (which generally reduces the time the lock holder holds the lock). Jim Dempsey |

Aaron Tersteeg (Intel)
16,428
Status Points:
16,428
Thanks you for sharing all your insight into what it takes to develop a multi-threaded game. What software development tools does your team use as they implemented multi-threading?
Aaron