I broke the problem down into four tasks:

- Generate primes in the range
- Calculate sums of runs
- Generate perfect powers in range of the sums
- Find matches between the sums and the powers

For step 1 ended up using the Miller-Rabin test.

Instead of choosing random numbers (the way this algorithm usually

works) I used the basis {2, 7, and 61} because it's guaranteed to be

correct up to 4.7 billion (which is greater than 2^{32}.)

Step 2 is more interesting than it sounds. Instead of storing the raw

prime numbers in a table and summing them up as needed, I processed the

primes into a prefix sum

table. To recover the original prime number at index i, use sum[i+1] -

sum[i]. That's inconvenient but you get something big in return. The sum

of any sequence of primes from index i to j is given by sum[j+1] -

sum[i]. That means replacing k additions with 1 subtraction inside the main loop.

It's fast to build a sorted table of

perfect powers using an integer ipow(m, x) function and a priority queue

to keep them sorted. Heck, I assigned 39 threads to step 1 and only one

thread to step 3 and it still finished first. Having a table of perfect powers means never having to take an nth root.

Step 4 is where the action is. On VoVanx86's

huge "hundred million" test, steps 1 through 3 take one second and step 4 takes about 22 minutes.

I wrote five different search strategies only to throw four of them

away when the other one emerged as the fastest on all inputs. I call

it the *stride strategy* where *stride* is defined as the set of all k-pairs with the same value of k. Each thread takes a stride starting with k=2. It examines all k-pairs in the stride from the beginning of the range to the end. The cool thing about this is

that the sums gradually grow larger as you move downrange. Since the

power table is also sorted, I keep a pointer into it and do short

linear searches (usually no more than 2 steps) for values that match the

current sum. The body of the inner loop processes one k-pair:

while (p->n < sum) ++p; if (p->n == sum) file.printresult(left, right, p); sum = *++right - *++left;

Where p points into the power table and left/right are pointers into the prefix sum table spaced k units apart. The stride strategy is O(n^{2}) where n is the number of primes in the range.

That's about it other than threading issues such as load

balancing. There was one astonishing result: frequent barriers cause

the program to run faster! The stride strategy has excellent locality of reference as long as all of the threads run together as a pack. If one

thread falls behind (perhaps to output results) I begin to see some cache

thrashing. So I added a "metronome" counter to resynchronize the threads at constant intervals. The

optimum time between resyncs: 600 milliseconds. I was expecting it to be

closer to a minute.

- Rick LaMont

## Post-mortem

Problem 2 was a fun one (the one i liked best to be honest) these are the optimalisations i used:

1) Prime generation

Generating primes is fun and can be done quickly but there's one thing even better not having to calculate them! However a table of all primes < 2^32 turned out to be about 800+ megs, too massive,the load time will have eaten me alive. Plus a binary search to find the proper start end end indexes wouldn't have been too snappy either.

But what if there's a way to cut that massive table down in size and pretty much give us a direct index to the numbers we want? Well all primes > 5 you can express as 30k1, 30k7, 30k11, 30k13. Which is exactly 8 bits! Cuts the table down to a nicer 136 megs and it'll give you instant primes!

2) Determining if its a power or not

2a) Do we even need to bother checking a number?

Numbers that are a power of something have a funny property in binary, if you look at the lowest nibble it will *NEVER* be 2,6,10,12 or 14, so that quickly gets rid of about 30% of numbers with a single 'and' instruction and a few compares.

2b) The numbers that are left after that:

The sum of all primes in play is 425649736193687430, which square root is 65241837.5119591 meaning the highest number we would ever see as a base is 65241837. So i figured i'd loop though all numbers [2..65241837] ^ [2..100] see if its below 425649736193687431 and store the base+power+value (only the lowest power+base for numbers that are multiple powers) in a lookup table (i bet by now you all are going 'damn this guy *really* likes his lookup tables' )

Well turns out that table gets *BIG* really quickly but is quite managable for powers > 2.

But what about the powers of 2 then?

Well powers of 2 have the interesting property that the lower nibble always is 0,1,4 or 9, so if its that run a good old sqrt and see if its a square or not (i tried a lookup table here too but sqrt turned out to be faster)

Further improvements for powers > 2

Initially i had them all in a big sorted list which i did a binary search on which worked well but due to the size of the table not the best performer. So the speed it up i turned it into a really basic hash table using bits 21-43 of the number as a hash which gave me less then 10 numbers in most buckets which is stupidly fast to search through.

3) Threading

Just a parralel loop though the primes adding them up, not much to it really this was by far the easiest of the 3 problems to thread.

Most of my time on this problem was spend trying to figure out why the 40 core windows MTL box *REFUSED* to use all of the cores using both openmp or TBB, you always ended up on a random processor group (either cores 0-9 or 10-39) but never on all of them. Turns out that in the intel v11 compiler which was on the box OpenMP was not aware of processor groups (new in win7/2008R2) and TBB (which was aware) had a subtile bug in the code that assigned threads to cores. Found the bug made a quick work around (Details are somewhere in a thread in the TBB forum) and figured my solution would definitly have an edge over other ones that would end up not using all cores... and then intel moved us all out of the box cause it 'had issues' (i said it before, i'll say it again: BOO!)

In the end i ran out of time and the code ended up being a bit (and by a bit, i mean ALOT) messy but functional.

Warning due to the *MASSIVE* lookup tables the code is a whopping 112 megs compressed.