TBB on NUMA platform?

Hi there,

I recently had the chance to test my raytracer, based on TBB 3.0, on a 48-core AMD NUMA platform (8 processors organized on 2 boards connected via HyperTransport, 4 CPUs per board). Sadly, the results were disastrous.
My first try was to simply use parallel_for with a grainsize of 24x24, which had given me the best results on SMP machines so far. This resulted in 48 cores actually being about 20% slower than 24 cores.

So my new approach was to just use a small parallel_for loop from 1 to the number of cores and maintain a simple stack from which I pull blocks to render (so just 1 atomic increment per tile, with about 2000 tiles per Full HD frame and no mutexes whatsoever). The results were a lot better: 24 cores were about 10% faster, 48 cores were about 30% faster than before.
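(In outline, the pulling scheme looks something like the following sketch; renderTile, numCores and the tile count are placeholders, and the lambda form of parallel_for is used only for brevity.)

#include <tbb/atomic.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void renderTile( int tile );        // placeholder for the per-tile work

tbb::atomic<int> nextTile;          // the shared "stack": really just a counter into the tile list
const int numTiles = 2000;          // roughly one Full HD frame worth of tiles

void renderFrame( int numCores )
{
    nextTile = 0;
    // one body invocation per core; each keeps pulling tiles until none are left
    tbb::parallel_for( tbb::blocked_range<int>( 0, numCores, 1 ),
        [&]( const tbb::blocked_range<int>& ) {
            for (;;) {
                int tile = nextTile.fetch_and_increment();   // the single atomic op per tile
                if ( tile >= numTiles )
                    break;
                renderTile( tile );
            }
        } );
}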

Nonetheless: 48 cores are only about 4% faster than 24 cores, which is a little bit ridiculous. What's even more interesting: using 24 cores I get about 50% CPU usage (exactly 4 of the 8 NUMA nodes run close to 100%, just as it should be). Going up to 48 cores gives me about 60% CPU usage, with 4 NUMA nodes still peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesn't improve if I massively increase the amount of work done per pixel, so I doubt that the atomic operation for pulling from my stack has any influence.

Although HyperTransport will slow down memory access a little (from what I have read so far it should be about 30% slower compared to direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me like the TBB scheduler / Windows Server 2008 R2 scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.

Does anyone have an idea what might be going wrong?

Michael

P.S. By the way, scaling up to 24 cores is pretty OK, considering there is still some serial part in my tracer:

1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
8 Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
24 Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good


The TBB task scheduler does not control thread distribution across available cores. That belongs entirely to the OS scheduler's domain.

One possible reason for the described behavior is your process having an affinity mask tying its threads to the cores of only one board, though I have no idea who could have set it. But this does not explain the peak at 32 cores unless your code does some blocking operations (like waiting on OS synchronization primitives or doing I/O).

Another possible explanation is that the amount of work is not sufficient to load up 48 cores. Though in this case you'd likely see all your cores busy (either due to the high overhead of too fine-grained partitioning, or because workers uselessly spin trying to find some work to do, occasionally diving into the OS kernel just to be woken up again soon).

And at last, the most probable reason: the memory bandwidth gets saturated at around 32 cores.
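(If you want to rule the affinity-mask possibility in or out quickly, a sketch along these lines should do; it assumes a single processor group, which holds with 48 logical CPUs.)

#include <windows.h>
#include <cstdio>

int main()
{
    DWORD_PTR processMask = 0, systemMask = 0;
    if ( GetProcessAffinityMask( GetCurrentProcess(), &processMask, &systemMask ) )
    {
        int processBits = 0, systemBits = 0;
        for ( DWORD_PTR m = processMask; m; m >>= 1 ) processBits += (int)( m & 1 );
        for ( DWORD_PTR m = systemMask;  m; m >>= 1 ) systemBits  += (int)( m & 1 );
        // on the 48-core box both counts should be 48 if nothing restricted the process
        printf( "process may run on %d of %d logical CPUs\n", processBits, systemBits );
    }
    return 0;
}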

Unlike a recent linear situation, correct alignment seems very important in a 2D situation, because cuts in the horizontal direction are repeated over a large number of rows; and because grainsize is not there to provide alignment, you have to make each unit in grainsize space represent a correctly aligned unit of work in hardware space. Maybe someday TBB will provide the sugar for that, but it's easy enough to do it yourself.

But if I can assume that pixels don't need to interact, I would suggest either not using 2D ranges at all (they seem made for physics problems that want to minimize border-to-surface ratios) and instead using just vertical ranges of rows, which could very well have horizontal parallel_for loops nested inside them, or letting the 2D grainsize degenerate to something horizontal, e.g., 256x1. The parallel_for logic will then process wide tiles that should have very little false-sharing overhead. Make it 10000x1 to almost guarantee horizontal stripes of work.
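(A sketch of the degenerate-2D-grainsize variant, assuming a per-pixel function renderPixel and using the lambda form of parallel_for for brevity; the huge column grainsize simply means "never split horizontally".)

#include <tbb/blocked_range2d.h>
#include <tbb/parallel_for.h>

void renderPixel( int x, int y );   // placeholder for the per-pixel work

void renderImage( int width, int height )
{
    tbb::parallel_for(
        tbb::blocked_range2d<int>( 0, height, 1,        // rows: grainsize 1, stripes can be 1 pixel high
                                   0, width, 10000 ),   // columns: grainsize 10000, never split a row
        [&]( const tbb::blocked_range2d<int>& r ) {
            for ( int y = r.rows().begin(); y != r.rows().end(); ++y )
                for ( int x = r.cols().begin(); x != r.cols().end(); ++x )
                    renderPixel( x, y );
        },
        tbb::simple_partitioner() );
}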

Just setting a very wide grainsize should give you a big boost for less than a minute of recoding, because 24x24 just doesn't sound right. Let us know what this does for you.

Your query suggests so many interesting questions. To start off, since you're talking about a ray tracer, presumably the overhead is NOT in writing the final pixels but doing the ray casting and intersection testing, bouncing around the 3D data structures following reflections and refractions. Any idea how those data are situated? Do they all reside in the memory on one card or the other? On one socket of the one card?

How's the comparative bandwidth saturation on the two cards? If we postulate a model where all the data are on one side and so you have fast and slow accesses, these scaling results would suggest that 24 cores don't saturate memory bandwidth because 32 cores give a higher number, but maybe saturation occurs between 32 and 48? Or maybe the data are distributed and this is a completely shallow analysis. More details are needed.

Do try that different grainsize ratio just to see what happens, but I've been way too optimistic about what you may expect with this amount of work involved per tile before the result is written, as Robert rightly pointed out, so quite probably the difference will disappear into the noise instead. I just quickly wrote this because there was this other question about grainsize just the other day and because 24x24 looks so strange: you might as well not provide a grainsize at all if you're using the (default) auto_partitioner, and with simple_partitioner the tiles are still going to be smaller than 24x24 (horizontally you'll get 1920->...->15 and vertically you'll get 1200->...->75->37,38->18,19, so the tiles will be 15x18 and 15x19, or just a little over 8000 per frame).

The grainsize is not the problem. 24x24 pixel tiles are the optimum, while 16x16 are nearly equally good, sometimes better because the workload can be distributed more evenly. It has to be 2D as well because of the way the raytracer works, and since it is actually one of the fastest tracers on the market, it is safe to assume that this is not the problem. As I said, just keeping the tiles on a stack and retrieving the work from the stack works best and is faster than any parallel_for loop, which actually makes sense, since this keeps each task running for the longest time possible while limiting the amount of interaction required. Also, when I increase the amount of work to be done (more indirections with many, many more rays to send through the scene), the necessary lock for the pop operation vanishes into the noise.

I think the memory issue might be the problem. Indeed, most of the time is spent in the intersection functions, at least that is what the profiler tells me. And it seems to be during the traversal of the acceleration structure, when a new node is fetched.

I allocate all memory in the main thread (using scalable_aligned_malloc), so I assume that the whole scene is on one board. According to a paper I found on the AMD website, cross-board accesses should take 30% longer, but I am not sure what bandwidth the system has. My scenes generally don't fit into any cache (usual memory requirements are between 2 GB and 8 GB, but we have also had scenes that required 24 GB), so it is really likely that the inter-board connection becomes the bottleneck.
I also tried a small test scene that should fit completely into cache while still doing a lot of work per pixel (many indirections and so on), and the render times were a bit more promising, with 24 cores requiring 18 seconds while 48 cores required 12 seconds, so it shows a much better speedup. CPU usage is still peaking at 50-60%, though.

The question is: what can I do about it? My idea would be to actually duplicate the data so that each board (or even each NUMA node) gets its own scene to work with. But how can I let the tasks know on which board they run? How can I determine this at runtime? Basically I would need something like "NUMANodeId()" that I could query when a task is executed, so I can redirect my memory accesses. Of course, the OS shouldn't interfere afterwards and move the thread to another node.

Any ideas?
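(For reference, a hedged sketch of such a query on Windows Server 2008 R2: GetCurrentProcessorNumber and GetNumaProcessorNode are the relevant Win32 calls, and keeping the answer stable would additionally require pinning the thread, e.g. with SetThreadAffinityMask, which is not shown here.)

#include <windows.h>

// "NUMANodeId()"-style query: which node is the calling thread on right now?
// Note that the answer can change whenever the OS migrates the thread.
static UCHAR CurrentNumaNode()
{
    DWORD cpu = GetCurrentProcessorNumber();    // logical CPU the caller is running on
    UCHAR node = 0;
    if ( !GetNumaProcessorNode( (UCHAR)cpu, &node ) )
        return 0;                               // fall back to node 0 on failure
    return node;
}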

Are you saying that "Windows 2008 Server R2" does not have "NUMANodeId()" yet, or wondering how to apply it with TBB and memory allocation? Last time I looked (not very recently, though, and I can't check again right now) there was no NUMA awareness in the scalable allocator, but that's where open source can work miracles.

Actually, the allocator does not necessarily have to be NUMA aware. Modern OSes internally use a first-touch mapping policy. Thus even if you allocated the whole array on one node but made the first access to a particular element from another one, the physical memory will be mapped from the second node's bank.

Evidently, to benefit from this feature the program must not initialize the whole array right where it was allocated, but postpone initialization until the moment the iteration space is partitioned and its chunks are ready to be processed by worker threads (so that the "first touch" happens on the nodes where the processing of the corresponding chunk will take place).

I was wondering how I should apply this with TBB.
All data is initialized in the main thread, so I assume it always ends up on one board.

I think you could try the following algorithm:

  1. Allocate the array(s) on the main thread, but do not load any data into them at this point
  2. Use your "small parallel_for from 1 to number of cores" to do higher-level partitioning, and initialize the corresponding parts of your array(s) in its body
  3. Use nested parallel_for to further partition the large chunks you get at the first level

This approach may not achieve optimal load balance, but at least it should demonstrate significant improvement IF your problem really is in inter-board communication.
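(A rough sketch of steps 1-3 above, assuming a flat array of N elements; initialValue and process are placeholders, and the lambda form of parallel_for is used for brevity.)

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/scalable_allocator.h>

float initialValue( size_t i );     // placeholder
void  process( float& x );          // placeholder

void firstTouchExample( size_t N, size_t numCores )
{
    // 1. allocate on the main thread, but do not write anything yet
    float* data = (float*)scalable_aligned_malloc( N * sizeof(float), 64 );

    // 2. higher-level partitioning: one chunk per core, first touched by its worker
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, numCores, 1 ),
        [&]( const tbb::blocked_range<size_t>& top ) {
            for ( size_t chunk = top.begin(); chunk != top.end(); ++chunk )
            {
                size_t begin = chunk * N / numCores;
                size_t end   = ( chunk + 1 ) * N / numCores;
                for ( size_t i = begin; i != end; ++i )
                    data[i] = initialValue( i );            // first touch happens here

                // 3. nested parallel_for over the same chunk for the actual work
                tbb::parallel_for( tbb::blocked_range<size_t>( begin, end ),
                    [&]( const tbb::blocked_range<size_t>& r ) {
                        for ( size_t i = r.begin(); i != r.end(); ++i )
                            process( data[i] );
                    } );
            }
        } );

    scalable_aligned_free( data );
}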

#7 "Actually, the allocator does not necessarily have to be NUMA aware."
The scalable allocator recognises threads for block-level ownership because that's all that matters on non-NUMA systems, but aren't blocks allocated all together from an arbitrary (from the point of view of NUMA) mmap region, so that even if a thread first touches a block it still ends up with memory whose mmap region was first touched on another NUMA node? Or have I missed a development? I'll have another look tonight...

From what I know (unfortunately I cannot find an appropriate document at the moment), both Linux and Windows do not actually commit physical memory right at the moment of the request (even when you call VirtualAlloc with the MEM_COMMIT flag). Instead they postpone mapping of a virtual memory page until the first access to this page. Thus your contiguous region of virtual address space ends up being mapped to physical memory pages belonging to different NUMA nodes.

#11 "Thus your contiguous region of virtual address space ends up being mapped to physical memory pages belonging to different NUMA nodes."
Ah, so it's per page, not per mmap region? But then the question remains what people who thought they'd get better performance with large memory pages are going to do (including Intel's own Page Size Extension up to 4MB, and with Westmere even going to 1 GB, if I can believe Wikipedia)?

#8 "I was wondering how I should apply this with TBB."
Assuming embarrassing parallelism except for shared access to the scene description, I suppose you need something like NLS (node-local storage), so to say, with the first user making a deep copy from the original? But how about letting each NUMA node process only its "own" frame, instead, or is there a serial dependency between successive frames that would prevent such a solution?

Even if there aren't dependency issues between frames, memory has to hold twice 2-24 GB of scene data, with half the number of available threads working on each frame. But if you're going to double the scene graph anyway, you might as well have a copy of the same scene on each board and let the threads on each board chase the local graph to process pixels for, say, its side of the viewport, or interleaved bands, whatever works best.

However, aren't we getting ahead of ourselves? Inter-board bandwidth saturation is still only a theory. Do you have any data to determine what might be happening? Intel Core processors have means to measure various bandwidth indicators; I presume those AMD chips have something similar via OProfile or some such?

Finally, perhaps I should ask what OS. I've wondered about the description provided for VirtualAlloc:

It's the "You can commit reserved pages in subsequent calls..." that gives me pause. I haven't found chapter and verse on Linux mmap (-1) (no file mapping) but what I have seen refers to copy-on-write semantics, which would suggest a dynamic commit of pages. As it stands, the TBB MapMemory call that uses VirtualAlloc does so like this:

return VirtualAlloc(NULL, bytes, (MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN), PAGE_READWRITE);

I haven't studied this code for quite a while, but the way I read it, it looks like VirtualAlloc would commit the whole BigBlock on the thread that initializes tbbmalloc, whereas if mmap works the way I think it does, threads writing to allocated pages would commit them locally then. Switching back to wild speculation, allocating the entire scene graph on one board would limit HyperTransport data transfer to one direction, whereas if elements of the scene graph were scattered by the whim of the initializing threads, the randomized accesses would at least use the HyperTransport in both directions.
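(If one wanted to force the issue rather than rely on lazy commit, the reserve-then-commit split described for VirtualAlloc can be done explicitly; a speculative sketch, with arbitrary placeholder sizes.)

#include <windows.h>

void reserveThenCommitExample()
{
    // reserve address space up front on the main thread; no physical pages yet
    const size_t totalBytes = (size_t)1 << 30;                        // e.g. 1 GB
    char* base = (char*)VirtualAlloc( NULL, totalBytes, MEM_RESERVE, PAGE_READWRITE );
    if ( !base )
        return;

    // later, on the worker thread that will use a given chunk, commit it there,
    // so that (with luck) the physical pages come from that thread's NUMA node
    const size_t chunkBytes  = (size_t)64 << 20;                      // e.g. 64 MB
    const size_t chunkOffset = 0;                                     // this worker's slice
    VirtualAlloc( base + chunkOffset, chunkBytes, MEM_COMMIT, PAGE_READWRITE );
}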

So, got data?

Yes, you are right, we were ahead of ourselves. I just did a burn-in test: a task going into a loooooong loop doing nothing more than summing up numbers.

So for the test the task

- runs 100% inside the cache
- makes no memory accesses that could limit bandwidth
- takes no locks whatsoever

Result: It sticks at 69% CPU usage.

Then I just created 48 threads myself and reran the test:

Result: 100% CPU usage.

The next step will be to really do the raytracing using those plain threads; I wonder how that will turn out. But right now I would say there is a serious bug inside TBB, whatever it is.

I will report back once I have the results.

OK, I finally managed to get TBB to use 100% CPU as well; using a parallel_for to start the tasks was a bad idea, it seems.

Now it scales equally to the plain-thread version, so when doing a lot of work per pixel, 48 cores are faster than 32 cores.

Scaling is still not optimal, but there may be another cause for this (I only get a factor of 27.5x for 48 cores, 24x for 32 cores and 20x for 24 cores).

Do you mean that you are not using parallel_for now? What do you do then?

That's what I am doing now:

_task_group_context->reset();

tbb::task_list tasks;

// one task per core; each task pulls tiles from the shared stack until it is empty
for( size_t idx = 0; idx < numCores; ++idx )
    tasks.push_back( *new( _main_dummy_task->allocate_child() )
                     TraceImageTask( _private, pixelOffset ) );

_main_dummy_task->set_ref_count( numCores + 1 );   // children + 1 for the wait
_main_dummy_task->spawn_and_wait_for_all( tasks );

This is called for every subframe I render (128 subframes). I'd like to skip all the push_back stuff, but I am not sure how I can prevent the tasks from getting deleted once they finish.

BUT: I get 100% CPU usage with that, yet it is slower than using

tbb::parallel_for( tbb::blocked_range(0, numCores, 1), vrRTHQTileTraceTask<....

which uses at most 69% of the CPU on scenes that don't do so much work between two calls.

And it is not a small difference: the task-based version renders for 300 seconds while the parallel_for version renders for 245 seconds.

So I am not sure why this is, and I am finally out of ideas. To me it looks like something is burning CPU cycles without doing any good.

"This is called for every subframe I render (128 subframes)."
So numCores is 48, and for each of the 128 subframes you give one task to each of those 48 cores? How does each core know what to do? Also, this way not enough parallel slack is being generated, causing the cores that finish first to sit idle until the next round of tasks. You should always try to generate a number of tasks that is many times the number of cores, so that the fraction of time that some cores are idling is relatively small. Normally parallel_for() does this for you, but if you are ever tempted to use something like numCores, an alarm flag should go up and you should probably use something like 10*numCores instead.

"I'd like to skip all the push_back stuff, but I am not sure how I can prevent the tasks from getting deleted once they finish."
recycle_as_child_of()

"tbb::parallel_for( tbb::blocked_range(0, numCores, 1), vrRTHQTileTraceTask<...."
Again, very suspicious code, with the use of numCores preventing parallel slack instead of encouraging it.

(Added) As for burning cycles: wouldn't you then be able to catch one in the act by attaching a debugger?

No. To render an image (1920x1080), for example, about 3600 tiles are created on a stack. Then for each core a task is started that does something like

while( stack not empty )
    doWork;

Once the stack is empty, the task returns (so each task will run until there is no more work to do). This is a pretty common approach in raytracing software and it seems to work fine....usually....

For the next subframe, the same is done (this can't be changed, since there might be changes between subframes, so it is essential to restart the tasks). So basically you render 128 x 3600 tiles - I'd say there is plenty of work to do in parallel for 48 cores, especially since a single pixel uses up many thousands of CPU cycles.

And as I've written, I have tried using a parallel_for to pull from the stack, or even just a parallel_for over the whole image, and it was consistently slower by a large margin, like everything else I tried. CPU usage went up, performance went down, that's it. And this is true for all platforms I tried (2x 4-core Nehalem, 2x 6-core Nehalem and the 8x 4-core AMD). It just doesn't make any sense at all.

I also just tried normal Windows threads, and they show the same behaviour: 100% CPU usage and a slower speed compared to the 69% CPU usage version.

Which partitioner do you use with parallel_for: simple, auto or by default?

Good point. I had the affinity_partitioner running; maybe that's why the parallel_for stopped at 69%. I will retry with the simple_partitioner and a grainsize of 1.

Just did another test:
I let the parallel_for loop run not from 0 to the number of cores but from 0 to the number of tiles, with an auto_partitioner. It turned out to be about 40% slower (72 s vs. 53 s) on my dual Nehalem compared to just popping the tiles from the stack.

In my opinion the partitioners don't work well if the amount of work done for each entry varies a lot. I guess they assume that each segment does basically the same amount of work, which is not the case for a raytracer (for example, if you look at a car, the headlights require magnitudes more work than the rest or the background).

Ah, yes, you have that stack... Weird indeed.

Wild idea: have you tried both tbb::mutex and tbb::spin_mutex for the stack, and do you get 100% CPU use in both cases? Have you tried to attach a debugger at various times, to find out how the threads are spending their time?

"In my opinion the partitioners don't work well if the amount of work done for each entry varies a lot. I guess they assume that each segment does basically the same amount of work, which is not the case for a raytracer (for example, if you look at a car, the headlights require magnitudes more work than the rest or the background)."
That is correct for auto_partitioner, so if you want to use that you should probably hash the tiles, e.g., by inverting the bits in a binary index.

(Added) To avoid confusion: auto_partitioner (the default) has that "problem", and simple_partitioner doesn't, so you might want to specify simple_partitioner explicitly, or use a hashed index with auto_partitioner.
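(One way to do that hashing, as a sketch: reverse the bits of the tile index so that neighbouring, similarly expensive tiles end up far apart in the range that auto_partitioner splits; the bit width and tile count are placeholders.)

// reverse the low 'bits' bits of v, e.g. reverseBits(1, 12) == 2048
static unsigned reverseBits( unsigned v, unsigned bits )
{
    unsigned r = 0;
    for ( unsigned i = 0; i < bits; ++i )
    {
        r = ( r << 1 ) | ( v & 1 );
        v >>= 1;
    }
    return r;
}

// with ~3600 tiles, 12 bits cover indices 0..4095: in iteration i process tile
// reverseBits(i, 12) and simply skip values >= tileCount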

OK, the simple_partitioner changed the CPU usage to 100%. But it still seems to hold true that using a lower CPU percentage is faster than using 100% CPU, no matter what I actually use to distribute the work. So something really seems to be wrong here.

Currently I use an atomic fetch_and_increment to get a tile from the stack. I just changed that to a spin_mutex and it slightly increased the time from 300s to 285s (the 70% version was at 245s).

I guess I am really hitting a bandwidth limit here. As far as I can understand what the AMD CodeAnalyst tells me, there are a lot of non-local memory accesses, and the functions that take up most of the time are the intersection functions, which are read-only, while the locks nearly disappear in the overall time.

Is there any chance you can break that tile stack up into multiple stacks, and partition the threads to seek their work on different stacks? It may ultimately lead to some load imbalance as you eventually try to balance the stacks, but it would reduce the number of threads simultaneously contending for stack access. And being up in the range of 48 threads, you're definitely up there where the number of agents contending for a particular lock could produce more serializing overhead holding up the whole system.

I think ultimately the formula for good work with so many players may be much like playing good jazz: you can get out front and impress everyone with your machine-gun precision and virtuosity, but if you pay attention to the other players and leave room for them, you'll generally get a better result.

"Currently I use an atomic fetch_and_increment to get a tile from the stack. I just changed that to a spin_mutex and it slightly increased the time from 300s to 285s (the 70% version was at 245s)."
From 300 seconds to 285 seconds would be a decrease, not an increase, so did you mean a "decrease" with those numbers or are the numbers different? But could you then also try tbb::mutex? My wild idea is about the potential overhead associated with 48 threads contending for access to a single resource, and I'm not 100.00% sure that a tbb::spin_mutex scales even if it is more efficient than tbb::mutex for low levels of contention.

(Added) It's sort of like several forum participants falling over each other all trying to be the first to solve your problem and claim the glory. :-)

Both mutices should work worse than the lock-free variant that Michael used. And a spin mutex normally works better than a kernel-mode mutex (for which tbb::mutex is a wrapper), unless there is oversubscription. Though tbb::mutex may result in less CPU load because threads will wait inside the kernel instead of spinning.

"Both mutices should work worse than the lock-free variant that Michael used."
Hmm... yes, fetch_and_increment uses a hardware lock on x86. But then why doesn't the use of spin_mutex decrease performance? (The plural is just "mutexes", though.)

"And a spin mutex normally works better than a kernel-mode mutex (for which tbb::mutex is a wrapper), unless there is oversubscription."
Can you remind us what the oversubscription is about? You have convoying with both kinds of mutexes, but even without oversubscription there's a potential problem with the thundering herd thing when using a spin_mutex, and I'm not sure that __TBB_LockByte handles that well, although I'm going by what I remember from an earlier version for that.

"Though tbb::mutex may result in less CPU load because threads will wait inside the kernel instead of spinning."
That was the reason for checking: getting a lock through the interconnect might make the spin_mutex scalability more of an issue. But I'm just guessing, really, until we know what happens after Michael removes "spin_".

(Added) __TBB_LockByte has not changed, I see. Still, each tile takes so much processing effort that the contention levels on this lock should be low enough for a significant benefit to be a surprise, anyway. That means I'm out of ideas for now.

Ah, sorry, I meant the performance increased; of course the time decreased. The numbers are correct: with fetch_and_increment the rendering took 300 seconds, with the spin mutex it was 285 seconds. I will also try the normal mutex (what about queuing_mutex?).

I also tried having multiple stacks without any lock, but it seems my scene was so imbalanced that it didn't show much of a difference. But I can recheck that tomorrow.

I will also try to increase the block size; maybe with the stack approach larger block sizes work better than the last time I tried them (which still used a simple 2D parallel_for). I will try to get about 500 blocks overall, so each thread will process about 10 blocks on average. Maybe this helps.

I just did another test, starting 2 instances of the software with 24 cores assigned to each instance. Those required between 260 and 270 seconds to render 2 frames of the scene (so essentially the 24 cores are faster than 48 cores for that particular scene), which would mean 135 seconds per frame. I am still not sure how to interpret these numbers; one thing that's for sure is that both instances use different memory areas, so they don't interfere with each other.

I have for a long time wanted somebody to try the following potential optimisation of spin locks with a high number of cores (48 certainly qualifies), provided here as a replacement for include/tbb/tbb_machine.h in tbb30_035oss.

(Correction) __TBB_LockByte, if #define'd, now refers to a redundant copy of the same code instead of to a tight loop, but it might then still be replaced just the same.

To #29.

[After skimming the web a little, I agree that the plural of "mutex" should be "mutexes", as it is not a loan word :) ]

In case of oversubscription a pure spin mutex aggravates the convoying issue, because if a thread holding the lock gets preempted, the remaining threads will just dumbly spin through the remainders of their time quanta, while with a kernel-mode mutex they quickly begin getting parked in the kernel, the ready queue becomes empty, and the preempted thread is allowed to continue.

Exponential backoff used in TBB spin mutexes alleviates the issue to some extent, especially when oversubscription is not too high and locked regions of code are short enough (so that if preemption occurs, there is enough yielding in the short spinning loop to quickly concede the CPUs to all waiting threads, including the one holding the lock).

I am honestly surprised that the FetchAndIncrement variant takes longer than the one with the mutex, because

  • with an interlocked instruction the locked region is the shortest possible;
  • FetchAndIncrement never fails (that is, there is no need to use it in a loop);
  • a mutex also uses an interlocked operation, and it is a heavier one (a CAS that can fail, and thus generally requires repetition in a loop).

"In case of oversubscription a pure spin mutex aggravates the convoying issue, because if a thread holding the lock gets preempted, the remaining threads will just dumbly spin through the remainders of their time quanta, while with a kernel-mode mutex they quickly begin getting parked in the kernel, the ready queue becomes empty, and the preempted thread is allowed to continue."
That seems somewhat plausible. Still, with spin_mutex preferably used only with resources that block only briefly, a thread waiting for access should get its turn fairly quickly, so it may be more efficient to spin a little while longer than to get preempted too quickly. So I find this a bit strange. I'll assume it has nevertheless been observed and measured?

"Exponential backoff used in TBB spin mutexes alleviates the issue to some extent, especially when oversubscription is not too high and locked regions of code are short enough (so that if preemption occurs, there is enough yielding in the short spinning loop to quickly concede the CPUs to all waiting threads, including the one holding the lock)."
I'm afraid I don't see what you mean. I see the exponential backoff primarily as a way to dynamically tune down contention for exclusive ownership of a lock's cache line, and my proposal in #31 elaborates on that. A problem with the current implementation is that backoff increases while a thread is waiting, not just when an acquisition attempt fails, so on a heavily contended resource many candidates spend more time waiting to try again than needed. Another is that after waiting a bit the contender goes straight to an expensive acquisition attempt instead of first listening whether it's a good time. The current implementation is also rather unfair by favouring newly arrived threads whose backoff has not yet increased so much.

"I am honestly surprised that the FetchAndIncrement variant takes longer than the one with the mutex, because
- with an interlocked instruction the locked region is the shortest possible;
- FetchAndIncrement never fails (that is, there is no need to use it in a loop);
- a mutex also uses an interlocked operation, and it is a heavier one (a CAS that can fail, and thus generally requires repetition in a loop)."
Indeed. Unless the hardware implementation of an interlocked increment has its own problems, one would expect it to be rather efficient, although I'm not sure that the difference is so significant in view of processor speed relative to communication across even a fast interconnect (well, "I don't know" might be more accurate). FetchAndIncrement doesn't fail on x86 because it has a lock attribute on the instruction, but on other architectures it can still be a loop. I have no real idea of its performance relative to a CAS loop.

I see the exponential backoff primarily as a way to dynamically tune down contention for exclusive ownership of a lock's cache line, and my proposal in #31 elaborates on that. A problem with the current implementation is that backoff increases while a thread is waiting, not just when an acquisition attempt fails, so on a heavily contended resource many candidates spend more time waiting to try again than needed. Another is that after waiting a bit the contender goes straight to an expensive acquisition attempt instead of first listening whether it's a good time. The current implementation is also rather unfair by favouring newly arrived threads whose backoff has not yet increased so much.
Well, actually the original purpose of the backoff was to

  • graciously handle the oversubscription situation I mentioned above (by dunking the spinning thread into the kernel periodically so that the kernel could reschedule earlier preempted thread holding the lock).
  • decrease power consumption if the code does a blocking operation under the spin lock (not the best practice generally, but people do it, and sometimes it may even be necessary).

But your idea of decreasing bus traffic by avoiding premature attempts to bring the cache line into exclusive state makes sense to me. I remember you already suggested this optimization a year or two ago. Soon after that, one of our developers made measurements on the new Intel platforms (Nehalem), and as far as I know what we have now in our machine.h was what gave better results on average. It is possible that (recent) architectures do exactly this kind of state check internally in CAS before attempting more costly actions, and thus achieve a similar effect to your implementation (or even a slightly better one, due to packing it all into a single instruction).

More generally speaking, doing this kind of performance analysis is a time-consuming endeavour, as we have to cover multiple platforms/OSes/compilers, and the test results tend not to be as reproducible as would be needed to quickly draw reliable conclusions. This is why optimizations that are not absolutely, obviously beneficial often do not make it into the codebase fast enough. Anyway, I'll ask our managers whether they have someone who could investigate the benefit of your proposal once again.

I guess I'm still in love with the idea, and if the number 48 (plus interconnect) won't tip the scales nothing will, so I decided to try and abuse Michael's predicament to give it one more chance. :-)

I tried to replace tbb_machine.h, but it gives me errors during compilation, probably because I am using 3.0 Update 1.
To be exact, these won't work (line 119):

#if !defined(__TBB_CompareAndSwap4) \
|| !defined(__TBB_CompareAndSwap8) \
|| !defined(__TBB_Yield) \
|| !defined(__TBB_full_memory_fence) \
|| !defined(__TBB_release_consistency_helper)

Or is there any compiler flag I need to set for these?

That's the version, building from source tbb30_035oss_src.tgz after replacing include/tbb/tbb_machine.h. On Windows you can probably use rand() instead of rand_r(), leaving out the context argument for this test, but I don't see why it would not build if you can build the unmodified source. I've never tried any of the prebuilt packages, so I don't know what the issues are.

I have high hopes, so I dare ask for a few minutes of your time, but don't count on anything, because it hasn't actually been tested on many cores, apparently disappointed on lower core counts when tested by Intel, and might even need some tweaking and tuning, if it does anything for you at all. Still, I think I remember seeing reports about good results with something similar.

Hmm, OK, I tried to replace rand_r with rand (which is not thread-safe, so probably not a good idea anyway). But this does not work at all; it just brings the whole program to a halt, even on my 16-core machine.

I am not sure what exactly you are trying to do. From skimming over the code, I guess the idea is this:

Usually, if you have many cores trying to access the same code block at the same time, this will happen over and over again. Now, if you assign a random sleep to the threads when they encounter the mutex, the chance increases that the next time they want to access the same piece of code they arrive in a different order, probably better interleaved, so they don't interfere as much.

Am I right with that?

Again, it's just an educated guess at this point, but it also might be just the thing for your 48 cores. Although first you should probably try to reproduce the difference you observed between the fetch-and-increment and the lock (with the lock as the very surprising winner), and try other locks (plain mutex, queueing mutex) to see what the variability is, and whether there's anything to work with. If the idea turns out to be successful, it might also make a difference for scheduling tasks.

The idea is to treat the lock like an Ethernet medium: wait until the line/lock appears to be free, wait a random amount of time after that, and if nobody is talking/has the lock yet, just go ahead and try; if unsuccessful, increase the interval from which to choose a random delay, unless yielding seems to be a better solution in the end (that's only for locks). An improvement could be to record in the lock what the current contention levels are, so that newly arrived contenders know how to behave politely, because there are many locks and they are not being constantly monitored by each thread the way the few Ethernet media are by the connected hardware dedicated to them. Perhaps it might also be better to adopt the actual Ethernet backoff algorithm instead of improvising one, especially with that possibly expensive rand(_r) call.
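(A sketch of that scheme as a standalone lock, not TBB's actual implementation; it uses C++11 atomics for brevity and the window sizes are guesses: listen until the lock looks free, wait a random slice of the current window, try the exchange, and widen the window only after a failed attempt.)

#include <atomic>
#include <cstdlib>
#include <immintrin.h>   // _mm_pause

class backoff_spin_lock
{
    std::atomic<int> flag;
public:
    backoff_spin_lock() : flag( 0 ) {}

    void lock()
    {
        unsigned window = 4;                  // initial random-wait window, a guess
        const unsigned maxWindow = 1024;      // clipping limit, also a guess
        for (;;)
        {
            while ( flag.load( std::memory_order_relaxed ) )    // listen until it looks free
                _mm_pause();
            unsigned delay = std::rand() % window;              // random politeness delay
            while ( delay-- )
                _mm_pause();
            if ( !flag.load( std::memory_order_relaxed ) &&
                 flag.exchange( 1, std::memory_order_acquire ) == 0 )
                return;                                         // acquired
            if ( window < maxWindow )                           // widen only after a failure
                window *= 2;
        }
    }

    void unlock() { flag.store( 0, std::memory_order_release ); }
};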

As a workaround for the freeze-up, you could omit the rand() call and take the full backoff value to see what that gives, or decrease that large number I've put in (if you do a diff you should see it), but I'll have another look tonight for what might have failed.

(Added) Just rand() seems to work on Core Duo/Linux.

(Added 2010-06-24) That big number at line 649 or so appears to need to be decreased to perhaps about 1000 or less, probably for oversubscribed longer blocks. Since I don't know how long a pause really lasts on x86 and friends, maybe the criterion should be a number of ticks instead. The maximum backoff should probably be decreased again as well: it was 16, which I thought was a bit small, but 1<<16 may be a bit too much again, although now it takes a larger number of failures to get there.

Raf,

For the random wait, consider using _rdtsc(), masking off the low n bits, and using that as the count in your _mm_pause loop. Note that the value of n can increase (to a ceiling level) on each failure to attain the lock. Pseudo code:

n = 0;
while( (lock) || (xchg(lock,1) == 1) )       // test, then test-and-set
{
    mask = (1 << n) - 1;                     // widen the random window on each failure
    if( n < nMax )
        ++n;
    c = _rdtsc() & mask;                     // pseudo-random pause count taken from the TSC
    for( int i = 0; i < c; ++i )
        _mm_pause();
}

Note, if the hold time can be very long, then think of inserting event or condition variable waits.

TBB is a tasking system not a thread system, so try to rework the code such that you never have long block times.

Jim Dempsey

www.quickthreadprogramming.com

That is probably a much more efficient source of randomness. I don't agree with the algorithm because it increases the potential wait on each unsuccessful probe/attempt, leading to longer waits earlier than needed and more unfairness: the idea is to wait only to avoid the thundering herd problem, which resembles a collision on Ethernet. But I don't have relevant hardware and budget to easily test any of this, so it's all just hypothetical. Surely somebody else must have already explored this, though?

Quoting Raf Schietekat

That is probably a much more efficient source of randomness. I don't agree with the algorithm because it increases the potential wait on each unsuccessful probe/attempt, leading to longer waits earlier than needed and more unfairness: the idea is to wait only to avoid the thundering herd problem, which resembles a collision on Ethernet. But I don't have relevant hardware and budget to easily test any of this, so it's all just hypothetical. Surely somebody else must have already explored this, though?

This is the point of having the clipping mask with an upper limit. When set at 4, you will obtain pseudo-random numbers in the range 0:15.

While you could always take 4 bits of the counter, in the case where 2 threads are competing for a short-lived resource, a wait of 15 _mm_pauses, which will happen 1/16th of the time, may be too long a latency. As threads collide for the resource, the universe of random counts increases (up to an upper limit), thus avoiding unnecessary LOCK XCHG instructions that "whack" the cache coherency.

Now, as to what a good upper limit is... the programmer will have to make this decision.

Jim Dempsey

www.quickthreadprogramming.com

We both use clipping, but the code in #41 still starts the backoff at an arbitrary point and will therefore try again at an arbitrary point, with the wait time range always increasing, even though it is now random. It seems better to keep monitoring (within a fraction of a quantum) until the lock appears free, and to only then back off for an increasing random amount of time, so that initially the reaction time remains limited even if the new contender has been blocked for a while.

Instead of pauses<10... in #31 (I don't know what a good number would be in this case), the test should probably be to get the current time quantum from sched_rr_get_interval() or whatever applies to Windows or any other O.S., divided by 5 or so, limited by 10 ms or so (to avoid surprises), multiplied with CLOCKS_PER_SEC, and to then use clock(), which I hope is cheap enough to be used in a test? This would then be the backstop against inappropriately long blocking or preempted owners, causing the thread to yield instead.
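(A Linux-flavoured sketch of that backstop: take a fraction of the round-robin quantum, cap it at 10 ms, convert to clock() ticks, and yield once the budget is exceeded; the divisor and the cap are just the guesses from the text above.)

#include <sched.h>
#include <time.h>

// spinning budget in clock() ticks: quantum / 5, capped at 10 ms
static clock_t spin_budget_ticks()
{
    struct timespec q;
    double quantum = 0.01;                                  // fallback: 10 ms
    if ( sched_rr_get_interval( 0, &q ) == 0 )
        quantum = q.tv_sec + q.tv_nsec * 1e-9;
    double budget = quantum / 5.0;
    if ( budget > 0.01 )
        budget = 0.01;                                      // never spin longer than 10 ms
    return (clock_t)( budget * CLOCKS_PER_SEC );
}

// usage: record start = clock() when spinning begins; once clock() - start
// exceeds spin_budget_ticks(), call sched_yield() instead of pausing again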

>>We both use clipping, but the code in #41 still starts the backoff at an arbitrary point and will therefore try again at an arbitrary point, with the wait time range always increasing, even though it is now random.

The code in #41 starts with an _mm_pause() loop iteration count of 0, {0:1}, {0:3}, {0:7}, ..., {0:2^nMax - 1}.

>>It seems better to keep monitoring (within a fraction of a quantum) until the lock appears free, and to only then back off for an increasing random amount of time, so that initially the reaction time remains limited even if the new contender has been blocked for a while.

You do have a point in that in #41 the latecomers start with a shift count of 0, thus giving them a higher average attempt-at-lock frequency. This could be adjusted (if desired) by placing the initializer for the shift count in a static, and then adjusting it dynamically: e.g., if any thread's local count is bumped after initialization and acquisition is not attained, perform a clipped increment of the initializer (this need not be an interlocked operation, as we are not keeping accurate counts). When a thread obtains the lock immediately, it can either reset the shift-count initializer with a direct write of 0, or back it off with a decrement and a test for underflow:

if( initializer && (--initializer < 0) )
    initializer = 0;

Where initializer is a volatile static

You need not use interlocked operations on this, since it is an aggregate tuning parameter, but you do need protection against quirks: e.g., initializer seen as -n (you set it to 0), initializer seen above nMax (you reset it to nMax), etc.

Jim Dempsey

www.quickthreadprogramming.com
