Under the hood: count_strings

It’s been a while since I’ve had a chance to play, but I’ve been curious what I might discover by applying Intel® Thread Profiler with user events to other TBB structures besides pipelines. I tried checking out Conway’s Game of Life, but the source download page wouldn’t reveal its sourcey goodness, so I moved on to another example, used for showing off concurrent hash maps: count_strings. It builds a test array of adjective-noun phrases and tallies them in a hash table, counting how many of each there are.
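
The example ships with TBB, so I won’t reproduce it verbatim; from memory, its core is roughly this shape (a sketch, not the actual source, with names following the concurrent_hash_map example in the TBB documentation):

```cpp
#include <string>
#include "tbb/concurrent_hash_map.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

using namespace tbb;

// Hashing policy for std::string keys: supplies hash() and equal().
struct MyHashCompare {
    static size_t hash(const std::string& s) {
        size_t h = 0;
        for (const char* c = s.c_str(); *c; ++c)
            h = h * 3 + *c;
        return h;
    }
    static bool equal(const std::string& a, const std::string& b) {
        return a == b;
    }
};

// The shared table: phrase -> occurrence count.
typedef concurrent_hash_map<std::string, int, MyHashCompare> StringTable;

const size_t N = 1000000;       // number of phrases in the test array
std::string Data[N];            // filled with random "adjective noun" phrases

// The kernel handed to parallel_for: each chunk of the index range
// inserts/increments entries in the shared table under an accessor lock.
struct Tally {
    StringTable& table;
    Tally(StringTable& table_) : table(table_) {}
    void operator()(const blocked_range<size_t>& range) const {
        for (size_t i = range.begin(); i != range.end(); ++i) {
            StringTable::accessor a;
            table.insert(a, Data[i]);   // find or create the entry, write-locked
            a->second += 1;             // bump the count for this phrase
        }
    }
};

void CountOccurrences() {
    StringTable table;
    // The grain size (1000 here) controls how finely the range gets split.
    parallel_for(blocked_range<size_t>(0, N, 1000), Tally(table));
}
```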

I could collect data in either a Linux or Windows environment, but I’ll start in Windows. Maybe later I’ll try some experiments on a Linux machine and see how the command line collector works. For now, though, we’ll play in a Windows world. Conveniently, the examples come with configuration files for several versions of Visual Studio, and these already have the essentials defined, such as the TBB include directory under C/C++ General (I always turn on symbols, even in a release build, in case I need them to plunge into a source view).

b080222-01.JPG

Likewise, we’ll need to link the TBB library, and that’s been set up already as well:
b080222-02.JPG

Compiling from here is pretty easy, but execution is a little underwhelming:
b080222-03.JPG

0.188 / 0.164 ≈ 1.15x, or about a 15% boost from adding three cores. What happened? We built some code, started it running into one end of a dark little pipe, and collected some results on the other end, but didn’t see anything in the middle.

That’s easy to fix. After a little bit of setup, we can collect a run of Intel Thread Profiler data: first we need to find the executable (stuck off in $TEMP), then let the wizard do most of the work.
b080222-04.JPG

This program did a serial and a parallel run, and the image above shows a single-threaded portion and a parallel portion, but it took a lot longer to run: from under two tenths of a second for each part to 25 seconds for both. Instrumentation from this tool usually slows programs to no worse than half speed, so this must mean the instrumentation is being hit a lot. The big thing, though, is that mass of yellow dominating the parallel section: lots of transitions that might explain why it’s scaling so poorly. Zooming in on the transition into the parallel section:
b080222-05.JPG

There’s quite a bit of transition traffic (the yellow lines) once all the threads start working, but what is that gap between the point where the pool threads are spawned and where they become active? A quick examination of the source code suggests this is when the phrase array is initialized. We can verify this using __itt_event notification. Here’s the code where the random phrases are being generated:
b080222-06.JPG
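
Roughly, that code looks like this (a sketch continuing the one above; the word lists here are illustrative, though the counts of six adjectives and ten nouns match what’s discussed later):

```cpp
#include <cstdlib>
#include <string>

// Illustrative word lists: six adjectives and ten nouns give 60 possible phrases.
static const char* Adjective[] = { "sour", "sweet", "bitter", "salty", "big", "small" };
static const char* Noun[] = { "apple", "banana", "cherry", "date", "fig",
                              "grape", "kiwi", "lemon", "mango", "peach" };

// Fill the test array with random "adjective noun" phrases.
static void InitializeData(std::string* data, size_t n) {
    for (size_t i = 0; i < n; ++i)
        data[i] = std::string(Adjective[std::rand() % 6]) + " " + Noun[std::rand() % 10];
}
```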

Each array element gets a phrase with one adjective and one noun, separated by a space. We’ll instrument this code with an event:
b080222-07.JPG

…and add a few lines to set up the event:
b080222-08.JPG
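
Pulling the two screenshots together, the additions amount to creating a named event once and bracketing the initialization loop with start/end calls, roughly like this (a sketch continuing the earlier one; the event name and placement are mine, and the header/library naming follows the ittnotify API, which may vary with the tool version):

```cpp
#include <cstring>
#include "ittnotify.h"   // Intel instrumentation API; link against the ittnotify library

// Create a named user event once, before the timed work begins.
__itt_event init_event =
    __itt_event_create("Data initialization", (int)std::strlen("Data initialization"));

// ... then, around the phrase-generation loop ...
__itt_event_start(init_event);
InitializeData(Data, N);         // the loop sketched above
__itt_event_end(init_event);
```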

Compiling those additions into the program, recollecting the data and zooming back to the same spot:
b080222-09.JPG

Moving the cursor over the timeline, you can just make out a little purple, L-shaped, blinking line with an affiliated pop-up, which identifies this as the data initialization event; it spans the range of this serial activity. Exploring the start of the timeline reveals the first instance, associated with initialization of the serial test. Let’s add a second user event around the parallel kernel, which should reveal the granularity of the kernel’s for loop:
b080222-10.JPG
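
As a sketch, the second event brackets the body of the Tally functor from earlier, so every chunk the scheduler executes fires one event instance (again, the event name is mine):

```cpp
// Created once at startup, alongside the initialization event.
__itt_event tally_event =
    __itt_event_create("Tally", (int)std::strlen("Tally"));

struct Tally {
    StringTable& table;
    Tally(StringTable& table_) : table(table_) {}
    void operator()(const blocked_range<size_t>& range) const {
        __itt_event_start(tally_event);     // one instance per executed chunk
        for (size_t i = range.begin(); i != range.end(); ++i) {
            StringTable::accessor a;
            table.insert(a, Data[i]);
            a->second += 1;
        }
        __itt_event_end(tally_event);
    }
};
```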

With this code in place, the timeline view changes substantially: lots of purple. There are events on all four threads of the parallel section and all along the serial section as well.
b080222-11.JPG
Zooming back to the transition from the serial section:
b080222-12.JPG

The Tally operator is called 1024 times in the serial section, as indicated by the number of instances of the Tally event. If you divide the number of phrases, 1,000,000, by 2 repeatedly, it takes 10 divisions to get the number below 1,000, the grain size used in the parallel_for invocation. 2^10 = 1024, so the grain size seems to be working correctly. Moving to the end of the parallel section:
b080222-13.JPG

The counter on thread 1, which has all the serial test user events, keeps counting, but separate instance counters on the other three threads start at 0. Each gets about 256 events (thread 1 total is roughly 1024+256 events), confirming that the parallel code is doing the same region splitting as the serial code. This makes perfect sense since both versions call the same CountOccurrences() function and only the number of threads in the pool varies.

Zooming in a little closer, there are many collisions between the threads, an indication that the threads are contending for resources. This is most likely due to access contention on the concurrent hash table, the manipulation of which is about all that the CountOccurrences() kernel does:
b080222-14.JPG

Notice that despite the conflicts, the concurrency level is very high: all four threads are being kept busy 99% of the time in this range. But high concurrency alone is not enough, given the poor scaling we’ve already measured. The frequent change of critical path from thread to thread shows how often the threads are interrupted by contention. This shows up better when zooming in still further:
b080222-15.JPG

Each thread must spin periodically. Examining the gaps using the time measurement interval feature of the timeline, these contention events seem to last about 3.5 to 4.5 µsec each.

Why all this contention? One problem is that count_strings spends most of its time accessing the concurrent hash table. Perhaps using separate hashes per thread for the initial collection, followed by a reduction of the partial sums, would work better (sketched at the end of this post).

Another problem is the scale of the test: the phrases are synthesized from a list of six adjectives and ten nouns. This simplifies the example but substantially raises the potential for contention. Considering only the raw statistics, there are 6 * 10 = 60 unique phrases that can go into the Data array, so there are 60 * 60 * 60 * 60 = 12,960,000 ways for four threads to simultaneously reference one of those 60 strings. It’s not obvious how to count the collision cases directly, but it’s easy to count the cases where all four strings are different: each string picked reduces the pool to avoid duplication, so the number of ways to pick 4 distinct strings out of 60 (in order) is 60 * 59 * 58 * 57 = 11,703,240. The number of collision cases is just the difference: 60 * 60 * 60 * 60 - 60 * 59 * 58 * 57 = 1,256,760. In a perfect hash table with these parameters, collisions will occur 1,256,760 / 12,960,000 = 9.7% of the time: almost 1 of every 10 accesses nets a collision. A non-uniform hashing algorithm might push the collision rate above this minimum. Increasing the number of processing elements will also increase the collision rate: with 8 processors working on 60 phrases, (60^8 - 60 * 59 * 58 * 57 * 56 * 55 * 54 * 53) / 60^8 = 38.6% of all accesses will be a collision, more than one in three.
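
The arithmetic is easy to double-check with a small helper (mine, not part of the example), which computes the idealized probability that k simultaneous accesses to n equally likely phrases are not all distinct:

```cpp
#include <cstdio>

// Idealized collision rate: probability that k simultaneous accesses to
// n equally likely phrases contain at least one duplicate.
double collision_rate(int n, int k) {
    double all_distinct = 1.0;
    for (int i = 0; i < k; ++i)
        all_distinct *= double(n - i) / n;   // n/n * (n-1)/n * (n-2)/n * ...
    return 1.0 - all_distinct;
}

int main() {
    std::printf("   60 phrases, 4 threads: %5.2f%%\n", 100.0 * collision_rate(60, 4));    // ~9.70%
    std::printf("   60 phrases, 8 threads: %5.2f%%\n", 100.0 * collision_rate(60, 8));    // ~38.58%
    std::printf("10000 phrases, 4 threads: %5.2f%%\n", 100.0 * collision_rate(10000, 4)); // ~0.06%
    std::printf("10000 phrases, 8 threads: %5.2f%%\n", 100.0 * collision_rate(10000, 8)); // ~0.28%
    return 0;
}
```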

Fortunately, this is an extremely small phrase count; the more phrases we have in the mix, the less likely a collision becomes. Just raising the number of phrases to 10,000, still a small number, drops the theoretical collision rate to roughly 0.06% for 4 cores or 0.3% for 8 cores. Adding that many phrases might take a bit of work, though. Perhaps as a future project, we can convert this program into a simple word counter that parses standard input; that would open a rich mine of test data for scaling up the test.
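
And to close, a sketch of the separate-hashes-per-thread idea mentioned above. This is speculation on my part, not code from the example, and it assumes a TBB version that provides tbb::combinable plus a compiler with lambda support:

```cpp
#include <map>
#include <string>
#include "tbb/blocked_range.h"
#include "tbb/combinable.h"
#include "tbb/parallel_for.h"

// One private map per thread while tallying, so no locking in the hot loop;
// the per-thread partial sums are reduced into a single table at the end.
typedef std::map<std::string, int> LocalTable;

void CountOccurrencesLocal(const std::string* data, size_t n, LocalTable& result) {
    tbb::combinable<LocalTable> partial;

    tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 1000),
        [&](const tbb::blocked_range<size_t>& range) {
            LocalTable& mine = partial.local();   // this thread's private map
            for (size_t i = range.begin(); i != range.end(); ++i)
                ++mine[data[i]];
        });

    // Reduction: merge each thread's counts into the final result.
    partial.combine_each([&](const LocalTable& t) {
        for (const auto& kv : t)
            result[kv.first] += kv.second;
    });
}
```

Whether that actually beats the concurrent hash here is something to measure with the same Thread Profiler setup.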
