# Post-mortem

## Post-mortem

Problem 2 was a fun one (the one i liked best to be honest) these are the optimalisations i used:

1) Prime generation

Generating primes is fun and can be done quickly but there's one thing even better not having to calculate them! However a table of all primes < 2^32 turned out to be about 800+ megs, too massive,the load time will have eaten me alive. Plus a binary search to find the proper start end end indexes wouldn't have been too snappy either.

But what if there's a way to cut that massive table down in size and pretty much give us a direct index to the numbers we want? Well all primes > 5 you can express as 30k1, 30k7, 30k11, 30k13. Which is exactly 8 bits! Cuts the table down to a nicer 136 megs and it'll give you instant primes!

2) Determining if its a power or not

2a) Do we even need to bother checking a number?

Numbers that are a power of something have a funny property in binary, if you look at the lowest nibble it will *NEVER* be 2,6,10,12 or 14, so that quickly gets rid of about 30% of numbers with a single 'and' instruction and a few compares.

2b) The numbers that are left after that:

The sum of all primes in play is 425649736193687430, which square root is 65241837.5119591 meaning the highest number we would ever see as a base is 65241837. So i figured i'd loop though all numbers [2..65241837] ^ [2..100] see if its below 425649736193687431 and store the base+power+value (only the lowest power+base for numbers that are multiple powers) in a lookup table (i bet by now you all are going 'damn this guy *really* likes his lookup tables' )

Well turns out that table gets *BIG* really quickly but is quite managable for powers > 2.

But what about the powers of 2 then?

Well powers of 2 have the interesting property that the lower nibble always is 0,1,4 or 9, so if its that run a good old sqrt and see if its a square or not (i tried a lookup table here too but sqrt turned out to be faster)

Further improvements for powers > 2

Initially i had them all in a big sorted list which i did a binary search on which worked well but due to the size of the table not the best performer. So the speed it up i turned it into a really basic hash table using bits 21-43 of the number as a hash which gave me less then 10 numbers in most buckets which is stupidly fast to search through.

Just a parralel loop though the primes adding them up, not much to it really this was by far the easiest of the 3 problems to thread.

Most of my time on this problem was spend trying to figure out why the 40 core windows MTL box *REFUSED* to use all of the cores using both openmp or TBB, you always ended up on a random processor group (either cores 0-9 or 10-39) but never on all of them. Turns out that in the intel v11 compiler which was on the box OpenMP was not aware of processor groups (new in win7/2008R2) and TBB (which was aware) had a subtile bug in the code that assigned threads to cores. Found the bug made a quick work around (Details are somewhere in a thread in the TBB forum) and figured my solution would definitly have an edge over other ones that would end up not using all cores... and then intel moved us all out of the box cause it 'had issues' (i said it before, i'll say it again: BOO!)

In the end i ran out of time and the code ended up being a bit (and by a bit, i mean ALOT) messy but functional.

Warning due to the *MASSIVE* lookup tables the code is a whopping 112 megs compressed.

AllegatoDimensione
111.64 MB
5 post / 0 nuovi
Per informazioni complete sulle ottimizzazioni del compilatore, consultare l'Avviso sull'ottimizzazione

I broke the problem down into four tasks:

1. Generate primes in the range
2. Calculate sums of runs
3. Generate perfect powers in range of the sums
4. Find matches between the sums and the powers

For step 1 ended up using the Miller-Rabin test.
Instead of choosing random numbers (the way this algorithm usually
works) I used the basis {2, 7, and 61} because it's guaranteed to be
correct up to 4.7 billion (which is greater than 232.)

Step 2 is more interesting than it sounds. Instead of storing the raw
prime numbers in a table and summing them up as needed, I processed the
primes into a prefix sum
table. To recover the original prime number at index i, use sum[i+1] -
sum[i]. That's inconvenient but you get something big in return. The sum
of any sequence of primes from index i to j is given by sum[j+1] -
sum[i]. That means replacing k additions with 1 subtraction inside the main loop.

It's fast to build a sorted table of
perfect powers using an integer ipow(m, x) function and a priority queue
to keep them sorted. Heck, I assigned 39 threads to step 1 and only one
thread to step 3 and it still finished first. Having a table of perfect powers means never having to take an nth root.

Step 4 is where the action is. On VoVanx86's
huge "hundred million" test, steps 1 through 3 take one second and step 4 takes about 22 minutes.

I wrote five different search strategies only to throw four of them
away when the other one emerged as the fastest on all inputs. I call
it the stride strategy where stride is defined as the set of all k-pairs with the same value of k. Each thread takes a stride starting with k=2. It examines all k-pairs in the stride from the beginning of the range to the end. The cool thing about this is
that the sums gradually grow larger as you move downrange. Since the
power table is also sorted, I keep a pointer into it and do short
linear searches (usually no more than 2 steps) for values that match the
current sum. The body of the inner loop processes one k-pair:

```while (p->n < sum) ++p;
if (p->n == sum) file.printresult(left, right, p);
sum = *++right - *++left;```

Where p points into the power table and left/right are pointers into the prefix sum table spaced k units apart. The stride strategy is O(n2) where n is the number of primes in the range.

balancing. There was one astonishing result: frequent barriers cause
the program to run faster! The stride strategy has excellent locality of reference as long as all of the threads run together as a pack. If one
thread falls behind (perhaps to output results) I begin to see some cache
thrashing. So I added a "metronome" counter to resynchronize the threads at constant intervals. The
optimum time between resyncs: 600 milliseconds. I was expecting it to be
closer to a minute.

- Rick LaMont

My solution wasn't anything particularly technically amazing, but here goes :)

I just used the Sieve of Eratosthenes in the end; I'd intended to try the Sieve of Atkin, but it was much more complicated and I spent my time optimising other areas. Up to numbers about 1G, my basic sieve was pretty fast, so I didn't worry about it much beyond that; it was just a vector, which seemed good for memory consumption. Unfortunately, testing very late in the day in the 2G-4G range showed that performance dropped exponentially around there, so I'm not too confident; it may be just to do with the ancient laptop I was running it on (I had a lot of connectivity problems and never successfully got onto the MTL), but that's likely a vain hope. I did throw in a bit of threading: a parallel_for, after first sieving the first 200 numbers to ensure a good starting point, which made it a little faster. This was at the last minute, so I hope it didn't break anything for highly-multithreaded code, though it was fine for me.

The sums of primes was the worst bit of my performance, as I didn't come up with a good way to handle the O(N^2) computation. I tried storing recently used results, but that made it worse, as there wasn't much repetition.

Calculating perfect powers turned out pretty well, and was what I ended up spending much of my time on. I worked on the theory that, for a number that's a perfect power, every prime factor must be present at least twice in the factorisation, and also appear the same number of times (or a multiple of the same number of times, such as 2^4 * 3^2 = 2^2 * 2^2 * 3^2) as all other prime factors. Once I got this working, it seemed pretty fast.

My threading for the sums and perfect powers was all together; another parallel_for set individual threads to calculate a sum and then do the perfect power processing on it.

It'll be interesting to see how it goes :)

## Allegati:

AllegatoDimensione
8.36 KB

Hi all,

Here my small write up for the second problem. I have to start saying that I think that with this problem I am pretty sure I didnt make a very good work. Anyway, here is my solution.

For the primes list I have implemented the sieve of Eratosthenes. Since this is part is not critical, as already noted Vovanx86 and Rick Lamont, I didnt worry very much about that.

I used a parallel_for in order to parallelize the computation. In the parallel for, the computation of all sums starting in a specific prime are calculated by the same thread, so in every iteration a thread makes an addiction and check if the total sum is a perfect power.

The most important part of the implementation is how to check if a number is a perfect power. I didnt have time for testing different algorithms, so I decided to use a trial division test based in that some open source math libraries use this method. The basic idea is that if the sum is divisible by the prime e, then we can calculate a number e^r who also divide the sum, and if the sum is a perfect p-th power, then p has to be divisible by r. If we dont find e, we can use this result to limit the number of tries when we are checking for the perfect power.

About the solution source: (Windows/Linux, intel compiler/MSVC, TBB)

Solution attached

## Allegati:

AllegatoDimensione
17.12 KB

I did this - boolean lookup for primes less than 65536 and for those higher a sieve utilizing a lookup of all primes below 65536 - this worked pretty good for me. Now during the totally parallized lookup of these numbers within range I popped in a counter that equals the exact amount of possible combinations of sums for the given range. This counter was then used to determine the amount of threads required for calculating and writing out in case a perfect sum was found, waiting on a file lock served as my mutex. The summation was straightforward and I skipped any sort of optimization here since each ran in parallel with al the rest. Checking for the perfect power attribute was dealt with as follows: determine the approximate root using the log formula - which can bee seen as a lookup I believe, then checking and adapting below and over to check if it is exact. Well that is it.