Post-mortem

Posted by dotcsw

I'm creating this thread to organize the post-mortem analysis of the Running Numbers problem. Please put any new post-mortems here. If you want to go over Maze of Life and Consecutive Primes, please create similar threads in their respective forums.

Several contestants have already put their post-mortems in another thread. I will include them here by reference:

vdave got things started by describing his lookup table to break the 2 SSE instruction barrier.
lazydodo used a static lookup table to see if a potential solution was within the next 36 adds.
smayne's approach involved a mask of unmodified LSBs to see if the next 36 steps could be skipped.
jmfernandez's algorithm selected a candidate byte. He had other pending improvements.
duncanhopkins went with a batch approach, breaking the work up into 512*37 cycle sections.
VoVanx86 described his parity test for skipping 36 cycles ahead.

Let me know if I missed anything and I will edit this post. My own post-mortem is below.

- Rick

Posted by lazydodo

I posted my post-mortems of The Maze of Life and Consecutive Primes in the appropriate subforums. It should be interesting to see what other people did on those problems. I'm especially curious to hear jmfernandez's Maze of Life insights; he may have been 'only' third there, but he completely dominated that round in pure speed! (IMHO way too many bonus points were given for shortest answers and forum posts; based on pure speed the top 3 was jmfernandez, yoink and dotcsw, but hey, it's Intel's contest, they can do whatever they want :) )


Posted by vdave

I attached the source code of my final submission. It can be compiled with Intel C++ Compiler 12.0 (I was working on Windows; I did not try the Linux version) and uses only TBB and the standard C/C++ library. It gives terrible performance on systems with fewer than 16 cores, though, because the code has thread counts hardwired for the 32-core MTL environment.

Attachment: ic2011_a3.cpp (23.59 KB)
Posted by VoVanx86

There is my solution (fastest 1-4 threaded code):
RunningNumbers.zip - VS2008 solution with binaries
and simple realisation of vdave's ideas (not for contest):
RunningNumbersCycles.cpp- part of solution with changes (2 functions: ParallelOptimizedCycles2 and Precalculate)
RunningNumbersCycles.h


Posted by dotcsw

When I first read the Running Numbers problem description it screamed "SSE". Meanwhile the 37 cycle pattern whispered, "You have 40 cores. Use 37 threads to decompose it." Yet SSE and threading will only go so far when the basic algorithm is "grind it out". In order to go really fast one needs to optimize the algorithm.

BASIC CONCEPTS

Define block as a set of 37 cycles, beginning with the DWORD addition and ending after the 36th BYTE addition.

Define bucket as each of the 37 positions within a block. Bucket number 0 represents any state that occurs immediately after a DWORD addition. Bucket number 36 falls at the end of a block, after the 36th BYTE addition and immediately before the DWORD addition that will begin the next block.

Consider the sample problem where the solution is 574395734.

574395734 / 37 = 15524209 blocks
574395734 % 37 = 1 cycle

More precisely, a solution was found in bucket 0 after 15524209 blocks. Note that the "off by one" is due to the first cycle being cycle number 0. Bucket 0 represents the state after 1 cycle, 38 cycles, 75 cycles, etc.
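
To make the bookkeeping concrete, here is a tiny sketch (my own illustration, not contest code) that maps a cycle count onto block/bucket coordinates using the same off-by-one convention:

#include <cstdint>
#include <cstdio>

// Sketch: decompose a cycle count into complete blocks plus a bucket.
// Cycle counting starts at 0, so bucket 0 corresponds to counts 1, 38, 75, ...
int main()
{
    uint64_t cycles = 574395734ULL;        // sample solution from the problem
    uint64_t blocks = (cycles - 1) / 37;   // complete blocks before the hit
    unsigned bucket = (cycles - 1) % 37;   // position within the block
    printf("bucket %u after %llu blocks\n",
           bucket, (unsigned long long)blocks);  // bucket 0 after 15524209 blocks
    return 0;
}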

THREADING MODEL

Decompose the problem into 74 jobs. The first bank of 37 jobs seeks solutions where all bytes reach zero simultaneously. Each job focuses only on one bucket: job 0 takes bucket 0, job 1 takes bucket 1, and so on. Jobs 37 through 73 seek solutions where the bytes repeat their initial values, again with one job per bucket.

Separating zero jobs from repeat jobs simplifies the problem. While it's possible that the same bucket will have both types of solutions, those two searches will lead in different directions (unless the initial values were all zeros).

Ideally, there would be 74 cores to run all 74 jobs simultaneously. In practice on the MTL there are 40 or fewer cores. Create a thread pool with the number of threads equal to the number of cores. Assign jobs to threads through a simple work queue.

Each job will initially start at its given bucket within block number zero. After that it will advance one or more blocks at a time until it finds its goal state (zero/repeat) or determines that the goal state cannot be reached in this bucket.
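
As an illustration of the dispatch (a minimal sketch with my own names, not dotcsw's actual TBB code), the 74 jobs can be handed to a pool of worker threads through an atomic ticket counter:

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sketch only: runJob() is a hypothetical stand-in for the per-bucket search.
void runJob(int job)
{
    int bucket = job % 37;     // which of the 37 buckets this job owns
    bool repeat = job >= 37;   // jobs 37..73 look for the repeat state
    (void)bucket; (void)repeat;
    // ... per-bucket search would go here ...
}

int main()
{
    const int kJobs = 74;
    std::atomic<int> next(0);
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&] {
            for (int job; (job = next.fetch_add(1)) < kJobs; )
                runJob(job);   // each worker pulls jobs until the queue is empty
        });
    for (auto &th : pool) th.join();
    return 0;
}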

Let front be the cumulative BYTE step from the beginning of the block to this bucket:

front = bucket * BYTEadder (8 bit multiply)

Let back be the step from this bucket to the end of the block:

back = (36 - bucket) * BYTEadder (8 bit multiply)

To advance the accumulator acc by one block requires three steps:

acc += back // 8 bit addition
acc += DWORDadder // 32 bit addition
acc += front // 8 bit addition
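
In SSE terms the one-block advance is three packed additions. A minimal sketch (the names are mine; front and back are assumed to be precomputed as bucket * BYTEadder and (36 - bucket) * BYTEadder in packed-byte form):

#include <emmintrin.h>   // SSE2

// Advance the 128-bit accumulator by one full block for a given bucket.
static inline __m128i advanceOneBlock(__m128i acc, __m128i back,
                                      __m128i dwordAdder, __m128i front)
{
    acc = _mm_add_epi8(acc, back);         // finish this block's remaining BYTE adds
    acc = _mm_add_epi32(acc, dwordAdder);  // DWORD add that starts the next block
    acc = _mm_add_epi8(acc, front);        // BYTE adds up to this job's bucket
    return acc;
}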

Suppose the program determines that there is no solution for a given bucket in the next block. It can advance two blocks like this:

acc += back // 8 bit addition
acc += DWORDadder // 32 bit addition
acc += 36 * BYTEadder
acc += DWORDadder
acc += front

The intervening block is said to have been skimmed because no particular bucket was visited there. This concept can be generalized to advance by n blocks, skimming over n-1 blocks:

acc += back
acc += DWORDadder
for i = 1 to n-1
    acc += 36 * BYTEadder
    acc += DWORDadder
acc += front

This takes 2n + 1 instructions to advance by n blocks (i.e. 37n cycles). Here is the actual C++ code for a function to skim n blocks. It uses SSE intrinsics in an unrolled loop:

skim.cpp
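
Since the attachment isn't reproduced here, the following is only a rough reconstruction of what skim(n) could look like with SSE intrinsics, without the unrolling of the real version (names are mine; byteAdder36 is assumed to hold 36 * BYTEadder in packed-byte form):

#include <emmintrin.h>

// Advance n blocks (37n cycles) for one bucket, visiting the bucket only in
// the final block.  This follows the 2n+1 step pattern described above.
static __m128i skim(__m128i acc, unsigned n, __m128i back, __m128i front,
                    __m128i dwordAdder, __m128i byteAdder36)
{
    acc = _mm_add_epi8(acc, back);             // step to the end of the current block
    acc = _mm_add_epi32(acc, dwordAdder);
    for (unsigned i = 1; i < n; ++i)           // skim over n-1 whole blocks
    {
        acc = _mm_add_epi8(acc, byteAdder36);  // all 36 BYTE adds collapsed into one
        acc = _mm_add_epi32(acc, dwordAdder);
    }
    acc = _mm_add_epi8(acc, front);            // land back on this job's bucket
    return acc;
}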

BARREL LOCK

I named this algorithm barrel lock because it's analogous to opening a 4 dial lock when you already know the combination.

Partition the 128-bit accumulator into four logical dials, shown as four colors in the original post's diagram:

[diagram omitted: the 16 accumulator bytes colored red, green, blue and yellow according to their position within each DWORD]
Each dial consists of 4 non-contiguous bytes. The red dial is made up of the least significant byte of each DWORD while the yellow dial is the set of most significant bytes. A dial is not considered solved until all four bytes reach their target values.

The algorithm works in four phases from right to left. The first phase gets the red dial into position. The second phase works on the green dial without moving the red dial, and so on.
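
To illustrate what a dial test might look like in SSE (my sketch; the helper name and masking scheme are assumptions, not taken from the submitted code):

#include <emmintrin.h>

// Test whether one dial of the accumulator matches the target state.
// laneShift selects the dial: 0 = red (low byte of each DWORD), 1 = green,
// 2 = blue, 3 = yellow.
static inline bool dialMatches(__m128i acc, __m128i target, int laneShift)
{
    __m128i mask = _mm_set1_epi32((int)(0xFFu << (laneShift * 8)));  // the dial's 4 bytes
    __m128i diff = _mm_and_si128(_mm_xor_si128(acc, target), mask);  // nonzero where they differ
    return _mm_movemask_epi8(_mm_cmpeq_epi8(diff, _mm_setzero_si128())) == 0xFFFF;
}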

For example consider the test where:

SOURCE = d6b610c012e049601632f70094c145c0
BYTEadder = 1bbffa833f0e0f733ca510f5044ef24a
DWORDadder = 39499f0b48e0f3ec3cca214b7a1e0474

The solution proceeds as follows:

After phase   Accumulator                             Number of cycles
initial       d6b610c0 12e04960 1632f700 94c145c0     0
red           71f37400 087a6000 5adf7400 df42b000     3520
green         cf4f0000 aac80000 cd8f0000 18e40000     344512
blue          f8000000 40000000 f8000000 20000000     434389440
yellow        00000000 00000000 00000000 00000000     15332557248

The dials on an actual barrel lock are independent. Once a dial is in place you don't have to move it again. However, the Running Numbers problem more closely resembles an odometer than a barrel lock. The green dial doesn't advance without the red dial also advancing.

The basic concept to make this odometer act like a barrel lock is to figure out the period of each dial - the minimum number of blocks for it to return to its current state. The typical period for the red dial (with 94% probability) is 256 blocks. The green period will be larger because it's the number of blocks for both the green dial and the red dial to roll over to their current values. For now, assume that all four periods have their typical values:

redPeriod = 2^8
greenPeriod = 2^16
bluePeriod = 2^24
yellowPeriod = 2^32

Suppose the red dial is already in its target position of all bytes zero. The green phase of the algorithm can now safely advance 256 blocks at a time using the skim function, because that is the next time that the red dial will be in its target position. All intervening blocks are non-solutions because the red dial will be out of position. This is how the green phase works:

delta = redPeriod * 37              // cycles per skim
n = greenPeriod / redPeriod         // up to 256 iterations this phase
while n-- > 0
    if green dial of acc is in target position
        proceed to blue phase
    acc = skim(redPeriod)
    ncycles += delta
report no solution found for this job
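
Written out in C++ against the hypothetical skim() and dialMatches() sketches above (this mirrors the pseudocode, not the submitted source), the green phase might look like:

#include <cstdint>
#include <emmintrin.h>

// Returns true when the green dial reaches its target; acc and ncycles are
// updated in place so the blue phase can pick up where this left off.
bool greenPhase(__m128i &acc, uint64_t &ncycles, __m128i target,
                uint32_t redPeriod, uint32_t greenPeriod, __m128i back,
                __m128i front, __m128i dwordAdder, __m128i byteAdder36)
{
    const uint64_t delta = (uint64_t)redPeriod * 37;   // cycles covered per skim
    for (uint32_t n = greenPeriod / redPeriod; n-- > 0; )
    {
        if (dialMatches(acc, target, /*green dial*/ 1))
            return true;                               // proceed to the blue phase
        acc = skim(acc, redPeriod, back, front, dwordAdder, byteAdder36);
        ncycles += delta;
    }
    return false;                                      // no solution in this bucket
}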

The blue phase is nearly identical except it skims bigger chunks, up to 65536 blocks at a time:

delta = greenPeriod * 37
n = bluePeriod / greenPeriod
while n-- > 0
    if blue dial of acc is in target position
        proceed to yellow phase
    acc = skim(greenPeriod)
    ncycles += delta
report no solution found for this job

To put things in perspective, note that some of the other post-mortems describe checks that allow the next 36 cycles to be skipped. The skim within the blue phase is skipping up to 2,424,832 cycles (65536 * 37) at a time without checking for a solution.

Things get interesting in the yellow (final) phase. It could just skim bluePeriod blocks like the other phases did, but that's no longer necessary. The red, green and blue dials are already in place. The only dial still turning is the yellow one, and it doesn't matter how many carries it generates into the bit bucket. What matters is how far the yellow dial turns for each bluePeriod. This is a constant that I calculate during initialization while figuring out the periods (hint: it's often equal to the least significant byte of each DWORD in DWORDadder.)

Armed with this magic blueCarry constant, the yellow phase is simply:

delta = bluePeriod * 37
n = yellowPeriod / bluePeriod
while n-- > 0
    if yellow dial of acc is in target position
        report solution found at ncycles
    acc += blueCarry                // much faster than skimming
    ncycles += delta
report no solution found for this job
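
In C++ form (again against the hypothetical helpers above, not the submitted source), the yellow phase replaces the skim with a single packed 32-bit add of the precomputed blueCarry:

#include <cstdint>
#include <emmintrin.h>

// Returns true when the yellow dial reaches its target; carries out of the
// top byte of each DWORD simply fall off the end, which is exactly the point.
bool yellowPhase(__m128i &acc, uint64_t &ncycles, __m128i target,
                 uint32_t bluePeriod, uint64_t yellowPeriod, __m128i blueCarry)
{
    const uint64_t delta = (uint64_t)bluePeriod * 37;
    for (uint64_t n = yellowPeriod / bluePeriod; n-- > 0; )
    {
        if (dialMatches(acc, target, /*yellow dial*/ 3))
            return true;                        // solution found at ncycles
        acc = _mm_add_epi32(acc, blueCarry);    // much cheaper than a skim
        ncycles += delta;
    }
    return false;
}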

Now let's go back and examine the red (first) phase. It can take advantage of the fact that no carries will be coming in from below. Essentially it uses a "miniature barrel lock" algorithm with 8 dials, one for each bit. It dials them in one at a time from right to left:

delta = 37
for b = 0 to 7                      // iterate over eight bits of red dial
    if any bit b of red dial is not in target state
        acc = skim(2^b)
        cycles += delta
    delta *= 2
if all bits of red dial are in target position
    proceed to green phase
else
    report no solution found for this job
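
A sketch of the red phase in C++ (using the hypothetical skim()/dialMatches() helpers above; redBitMatches is my own invented per-bit probe, not a function from the submission):

#include <cstdint>
#include <emmintrin.h>

// Does bit b of every red-dial byte (the low byte of each DWORD) already
// match the target?
static inline bool redBitMatches(__m128i acc, __m128i target, int b)
{
    __m128i mask = _mm_set1_epi32(1 << b);                            // bit b of each low byte
    __m128i diff = _mm_and_si128(_mm_xor_si128(acc, target), mask);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(diff, _mm_setzero_si128())) == 0xFFFF;
}

// Dial in the red bits one at a time, right to left.
bool redPhase(__m128i &acc, uint64_t &ncycles, __m128i target, __m128i back,
              __m128i front, __m128i dwordAdder, __m128i byteAdder36)
{
    uint64_t delta = 37;                                  // cycles in one block
    for (int b = 0; b < 8; ++b)                           // one bit of the red dial per pass
    {
        if (!redBitMatches(acc, target, b))
        {
            acc = skim(acc, 1u << b, back, front, dwordAdder, byteAdder36);
            ncycles += delta;
        }
        delta *= 2;                                       // the next bit is worth twice as many blocks
    }
    return dialMatches(acc, target, /*red dial*/ 0);      // all 8 bits in place?
}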

That's the barrel lock algorithm in a nutshell. Four phases to solve four dials. C++ source code is here:

barrel.cpp

INITIALIZING BARREL LOCK

All that's left now is to initialize the four periods (they're not always powers of 256) and the blueCarry. This initialization depends upon the command line inputs but is common to all jobs so it is done prior to starting the threads.

Barrel lock is one of those algorithms where the initialization is more complicated than the main routine. I'm not going to cover all of the details such as the special case where redPeriod = 1. The executive summary is that it's initialized with linear code based on modular arithmetic that I worked out on paper. There are no loops or big lookup tables.

Define block adder as (DWORDadder + 36 * BYTEadder). Examine each DWORD of block adder and count the number of zero bits each ends with. For example, 0x91544B9C ends with two zero bits because 0xC is 1100 in binary. Let z0 = the number of trailing zeros in DWORD 0 of block adder, z1 for DWORD 1, and so on. If the minimum of (z0, z1, z2, z3) is 0, then redPeriod = 256. If the minimum is 1, redPeriod = 128. In general:

redPeriod = 2^(8 - min(z0, z1, z2, z3))
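
A sketch of that calculation in plain C++ (my own code, following the formula above; blockAdder is assumed to hold the four DWORDs of DWORDadder + 36 * BYTEadder):

#include <algorithm>
#include <cstdint>

// Portable trailing-zero count (no intrinsics, for clarity).
static unsigned ctz32(uint32_t x)
{
    unsigned n = 0;
    while (n < 32 && (x & 1) == 0) { x >>= 1; ++n; }
    return n;
}

uint32_t redPeriodOf(const uint32_t blockAdder[4])
{
    unsigned zmin = 8;                    // only the low byte of each DWORD matters here
    for (int i = 0; i < 4; ++i)
        zmin = std::min(zmin, ctz32(blockAdder[i]));
    return 1u << (8 - zmin);              // 256 when some DWORD is odd, 128 when the minimum is 1, ...
}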

Over the course of redPeriod blocks, how many carries will there be from the red dial to the green dial? It's usually just the low byte of each DWORD in DWORDadder. However, set bits in DWORDadder that are lower than bit z0 (or z1, z2, z3) have already been canceled out by the bits of 36 * BYTEadder, resulting in a 0 bit in the block adder. Otherwise z0 would have indicated this 1 bit. Build a bitmask from the z's:

zmask = {2^z3 - 1, 2^z2 - 1, 2^z1 - 1, 2^z0 - 1}

The carry from red to green is then given by:

redCarry = ((SOURCE & zmask) + DWORDadder) & ~zmask

Where SOURCE is the initial state given on the command-line.

Turn around and count the zero bits in each DWORD of redCarry, just like before. Let zcarry equal the minimum of these four values. Now we can calculate the other periods and blueCarry as:

greenPeriod = 2^(16 - zcarry)
bluePeriod = 2^(24 - zcarry)
yellowPeriod = 2^(32 - zcarry)
blueCarry = redCarry * bluePeriod
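
Pulling the remaining formulas together in plain per-DWORD C++ (a sketch that reuses ctz32 from the previous snippet; it follows the formulas as written and ignores the special cases mentioned above, e.g. a zero block-adder DWORD, redPeriod = 1, or zcarry >= 16):

#include <algorithm>
#include <cstdint>

struct LockInit { uint64_t greenPeriod, bluePeriod, yellowPeriod; uint32_t blueCarry[4]; };

LockInit initLock(const uint32_t source[4], const uint32_t dwordAdder[4],
                  const uint32_t blockAdder[4])
{
    uint32_t redCarry[4];
    unsigned zcarry = 32;
    for (int i = 0; i < 4; ++i) {
        uint32_t zmask = (1u << ctz32(blockAdder[i])) - 1;           // 2^z - 1 for this DWORD
        redCarry[i] = ((source[i] & zmask) + dwordAdder[i]) & ~zmask;
        zcarry = std::min(zcarry, ctz32(redCarry[i]));
    }
    LockInit r;
    r.greenPeriod  = 1ull << (16 - zcarry);
    r.bluePeriod   = 1ull << (24 - zcarry);
    r.yellowPeriod = 1ull << (32 - zcarry);
    for (int i = 0; i < 4; ++i)
        r.blueCarry[i] = redCarry[i] * (uint32_t)r.bluePeriod;       // per DWORD, mod 2^32
    return r;
}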

RUNTIME ANALYSIS

Define probe as any comparison between the accumulator and a full or partial goal state. Probes are more expensive than steps, not because they use more instructions but because of the inherent branching. Barrel lock seeks to minimize probes. In the worst case a job needs:

Phase    Max probes
red      9
green    256
blue     256
yellow   256
total    777

Here is the table of worst case steps for a job in terms of the number of SSE instructions (i.e. skimming n blocks takes 2n+1 steps):

Phase    Max steps (in instructions)
red      518
green    131,328
blue     33,554,688
yellow   256
total    33,686,790

The blue phase is the bottleneck due to the cost of skimming 64K blocks at a time. Therefore, the barrel lock algorithm is O(n) where:

n = (solution / (37 * greenPeriod)) % 256

In a typical test with an "all bytes zero" solution, only two jobs will make it beyond the red phase. The other 72 jobs fail to launch, freeing up threads to work on other jobs.

The two jobs that do launch are the one that ultimately finds the solution and job #73. You see, job 73 is the one that finds the inevitable repeat after 37 * yellowPeriod cycles. Unfortunately, it needs to take the maximum number of probes (777) to find it. My program includes a special early exit for job 73 if it appears to be bound for the maximum cycles.

PERFORMANCE

Here are my benchmarks on the Linux MTL batch node acano02 using the usual test suite:

Solution (in cycles)    Execution time (in milliseconds)
4774                    1.039
7196                    1.027
21593                   1.047
7179                    1.026
5039                    1.038
11419                   1.038
431787735               3.818
574395734               14.276
1241513984              1.051
1483695460              6.456
2626626062              4.233
4076988879              9.285
15332557248             10.994
79456894940             11.952

Three of these tests are noteworthy. The 21593 cycles test has a redPeriod of 128. All the rest have the typical 256 redPeriod. On the last test (lazydodo's repeat example) all 37 repeat jobs make it to the yellow phase and find solutions. That's when it's nice to have 40 cores. Finally, the famous 574395734 sample is a near worst-case scenario for barrel lock because it has a heavy blue phase: 237 (out of a possible 256) probes along with the maximum greenPeriod of 64K.

The worst case (max probes, max steps) breakdown for each phase of a job is:

Phase    Max execution time (in microseconds)
red      1
green    64
blue     14,878
yellow   1
total    14,944

The bottom line is that this program solves any problem in about 15 milliseconds or less. I'm pleased with this result but believe there's room for improvement. With sufficient modular arithmetic analysis, it may be possible to optimize the green and blue phases in a fashion similar to that of the yellow phase.

SOURCE CODE

Below is the complete source code as submitted for the contest. It's just one file (463 source lines of code plus heavy commenting) so I didn't bother to zip it.

main.cpp

- Rick LaMont

Posted by dotcsw
Quoting lazydodo: "based on pure speed the top 3 [on Maze of Life] was jmfernandez, yoink and dotcsw"


...with lazydodo a close fourth! I agree that jmfernandez deserves recognition for having a fast program that solved all 10 data sets. However, I can't complain about my score. My program was neither the fastest nor optimal. I wasn't really expecting any points other than 25 for posting on the forum. The way it was scored keeps the contest interesting and competitive for the next two problems.

- Rick

Posted by yoink

Thanks for all the detailed information, people; you've clearly put a lot of effort in and will no doubt do well in this round, and thanks for sharing. Unfortunately I don't have a great deal of interest to add about my solution, as it was a bit of a hurried one, and basically consisted of:

- Counting each 37-iteration sequence as a "group" (dword + 36 * byte), I started with a small block size of groups (128) which I processed with a parallel_for
- Each thread would be processing a smaller block of groups (e.g. 0-31, 32-63, 64-95, 96-127). The later ones would calculate their starting values based on a simple loop of additions (add dword, then add 36 * byte). The last one would store the starting value for the next group of blocks when it was done
- Processing an individual block just used _mm_cmpeq_epi8 and _mm_movemask_epi8 as many people would have, I assume (see the sketch after this list). I spent far too long trying to optimise the individual block loop unfortunately
- Once the parallel_for was done, it doubled the block size for the next loop, up to a set limit (2^20)
- Making the first block run single-threaded helped slightly
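
For reference, the byte-compare check described in the third bullet typically looks something like this (an illustrative sketch, not yoink's actual code):

#include <emmintrin.h>

// Compare all 16 accumulator bytes against a goal state (all zero, or the
// saved initial value) with one SSE compare plus a movemask.
static inline bool matchesGoal(__m128i acc, __m128i goal)
{
    __m128i eq = _mm_cmpeq_epi8(acc, goal);   // 0xFF in every byte that matches
    return _mm_movemask_epi8(eq) == 0xFFFF;   // true only if all 16 bytes match
}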

So nothing amazing there, and not much point uploading the code, as there are much more interesting solutions to look at. I had ideas on trying out algorithms to figure out whether a group had a solution without checking each value, but wasted too much time on the first one and ran out of time.

Anyway, good luck all in the results :)
