Problem 2 was a fun one (the one i liked best to be honest) these are the optimalisations i used:
1) Prime generation
Generating primes is fun and can be done quickly but there's one thing even better not having to calculate them! However a table of all primes < 2^32 turned out to be about 800+ megs, too massive,the load time will have eaten me alive. Plus a binary search to find the proper start end end indexes wouldn't have been too snappy either.
But what if there's a way to cut that massive table down in size and pretty much give us a direct index to the numbers we want? Well all primes > 5 you can express as 30k1, 30k7, 30k11, 30k13. Which is exactly 8 bits! Cuts the table down to a nicer 136 megs and it'll give you instant primes!
2) Determining if its a power or not
2a) Do we even need to bother checking a number?
Numbers that are a power of something have a funny property in binary, if you look at the lowest nibble it will *NEVER* be 2,6,10,12 or 14, so that quickly gets rid of about 30% of numbers with a single 'and' instruction and a few compares.
2b) The numbers that are left after that:
The sum of all primes in play is 425649736193687430, which square root is 65241837.5119591 meaning the highest number we would ever see as a base is 65241837. So i figured i'd loop though all numbers [2..65241837] ^ [2..100] see if its below 425649736193687431 and store the base+power+value (only the lowest power+base for numbers that are multiple powers) in a lookup table (i bet by now you all are going 'damn this guy *really* likes his lookup tables' )
Well turns out that table gets *BIG* really quickly but is quite managable for powers > 2.
But what about the powers of 2 then?
Well powers of 2 have the interesting property that the lower nibble always is 0,1,4 or 9, so if its that run a good old sqrt and see if its a square or not (i tried a lookup table here too but sqrt turned out to be faster)
Further improvements for powers > 2
Initially i had them all in a big sorted list which i did a binary search on which worked well but due to the size of the table not the best performer. So the speed it up i turned it into a really basic hash table using bits 21-43 of the number as a hash which gave me less then 10 numbers in most buckets which is stupidly fast to search through.
3) Threading
Just a parralel loop though the primes adding them up, not much to it really this was by far the easiest of the 3 problems to thread.
Most of my time on this problem was spend trying to figure out why the 40 core windows MTL box *REFUSED* to use all of the cores using both openmp or TBB, you always ended up on a random processor group (either cores 0-9 or 10-39) but never on all of them. Turns out that in the intel v11 compiler which was on the box OpenMP was not aware of processor groups (new in win7/2008R2) and TBB (which was aware) had a subtile bug in the code that assigned threads to cores. Found the bug made a quick work around (Details are somewhere in a thread in the TBB forum) and figured my solution would definitly have an edge over other ones that would end up not using all cores... and then intel moved us all out of the box cause it 'had issues' (i said it before, i'll say it again: BOO!)
In the end i ran out of time and the code ended up being a bit (and by a bit, i mean ALOT) messy but functional.
Warning due to the *MASSIVE* lookup tables the code is a whopping 112 megs compressed.


