Mihai Moraru, Thibaut Patel – INSA de Lyon
Acceler8 - Parallel programming contest
Task 1 - Smallbrain numbers
Throughout this document we shall explain our solution and the steps we took in order to optimize it for parallel computing.
Our solution’s objective (besides correctness) is speed. We can approach faster execution times by a) reducing the workload and b) running faster. As such, our efforts focused on two main directions: perfecting the algorithm from a mathematical point of view (reducing the number of unneeded calculations); and trying to get the most out of the multiple processors (cores) we had at our disposal (running faster).
Parallelism and choice of language
In order to benefit from the availability of plenty of cores and to make our program scalable, we posed ourselves questions about optimal management of a thread pool and proper distribution of tasks with load balancing. However this concerns had already been tackled by researchers and were being dealt with by projects like Cilk, MPI and OpenMP. Following a quick documentation, the most compelling choice was the Intel Threading Building Blocks library for the C++ programming language. The factors which weighed in the decision were a previous (brief) exposure to the concepts of TBB, the short time available for the contest, the availability of the library as Free Software.
Our first solution iterated over the interval [begin, end] and checked each number against it’s smallbrain sum. We optimized the way the checking was done up to the point where the program took about 180 seconds to solve for the interval between 1 and 240. Regarding parallelism we used the templates from TBB to great avail. We designed our own classes which implemented the Range and Body concepts needed for parallel_for. The speedup was linear.
However, by extrapolation, our program would have taken almost 50 years to solve for all numbers between 1 and 263-1.
Motivated by the results posted on the Acceler8 forum, we realised that the problem was actually feasible in a reasonable amount of time. The key is in reducing the space over which we search for smallbrain numbers.
The serial part
Let f(x) be the function which assigns to an integer x the smallbrain sum corresponding to its digits. It is a non-continous function. Furthermore, the function is not injective, meaning that there can be multiple integers x which have the same smallbrain sum and that without considering different permutations (i.e. f(170) = f(446) = 344).
Mathematically what we are looking for are the intersection points of f(x) with id(x), where id(x) = x. Our first solution iterated over the X-axis. However, there are much fewer possible values on the Y-axis. Considering that many of them are not valid (i.e.: the small brain sum of a 3 digit number has to be between 100 and 999) we can generate only the valid candidates of smallbrain sums and check if they also belong to id(x).
From an algorithmic point of view, all numbers from 1 to 10D-1 can be arranged in a prefix tree of depth D in which each non-leaf node has exactly 10 children. The leaves represent D-digit numbers. A naive solution (our first one) would search the tree in a BFS manner in the given interval. Our second solution uses backtracking to generate viable candidates for a smallbrain number. Conceptually (the tree is not explicit), the algorithm does a DFS and backtracks as soon as the sum of the prefix generated so far is greater than 10D-1. This heuristic effectively prunes entire subtrees and drastically decreases the workload.
Another observation we can make is that in generating smallbrain candidates the order of the digits is not important. Thus we do not take into account different permutations of the same set of digits. This alone divides the search space by D! (factorial).
The backtracking algorithm was parametrized such that it could search in a well defined subtree of the D-digit number tree. It takes as arguments the interval in which to search and the starting node. The node is identified by depth, digit, path from root (prefix). In order to reduce calculations, we also pass the smallbrain sum of the prefix.
The parallel part
Now that the algorithm can search in a confined subtree, we can spawn multiple searches in parallel without fear of overlapping. We couldn’t easily adapt this search space to templates like parallel_for because of the difficulty of splitting into equal parts. We eventually found in the book of James Reinders on Intel Threading Building Blocks the chapter which describes the TBB task scheduler and how it can be used to split a problem recursively: “The scheduler works best with tree-structured task graphs [...] with tree-structured forking, it takes only O(log n) steps because some of the tasks created can go on to create subtasks.”
Our parallel part of the algorithm does just that. For each choice it makes it spawns a task. The TBB task scheduler chooses whether or not to execute it in parallel based on available physical threads. Once a certain depth has been reached, the algorithm switches to the serial part. The cutoff (actually the depth of the tree minus the depth of the level where we make the decision) was chosen empirically in order to strike a balance between generating potential parallelism and overhead.
Putting it all together
The algorithm searches for smallbrain numbers of a certain amount of digits. In order to search all smallbrain numbers in the given interval [begin end], we divide the interval into powers of ten and start the algorithm in parallel. For example, searching for smallbrain numbers between 187 and 68403 would spawn 3 root-searches in the intervals [187 999], [1000 9999] and [10000 68403] respectively. Each root-search will spawn children recursively thus profiting from available cores.
In order to optimize the speed of the program, we tried variations of the implementation:
iterative backtracking with stack for partial sums (bkt)
iterative backtracking which keeps a single sum (bkt2)
Parallel part - task recurrence patterns (see chapter 9 from James Reinders’ book)
blocking style with children
continuation style with children - recycling the parent as the continuation
Other optimizations (machine dependent)
try different sizes for the matrix which holds precomputed powers (10x20, 16x32, 20x10, 32x16) to see which one is the most cache friendly.
try both gcc and icc
find best -march flag for gcc
find the best value for cutoff
In our opinion the algorithm scales well when presented with more cores. The cutoff limit we imposed represents the distance from the bottom of the tree where the serial parts comes into action. As the depth of the tree grows, the number of tasks spawned at each level is multiplied by ten. Supposing that we had 128bit integers, the maximum depth would be 39 (corresponding to the 39 digits of 2128-1). With a cutoff of 6, there would be 1+10+102+...+1033 tasks which could be executed in parallel if enough cores were available.
Incidentally, the biggest smallbrain number has 39 digits and is smaller than 2128-1. This means that we could compute all smallbrain numbers without any change to the code supposing we had a 128bit computer.
The first solution outperforms the second one for intervals which span over less than 106. In order to achieve the best of both worlds, we have merged the two solutions together.
Our final program manages to find all smallbrain numbers between 1 and 263-1 in under 75 ms on the Intel Many Core Testing Lab.
J. Reinders, Intel Threading Building Blocks, O'Reilly 2007