# Smallbrain Numbers - solution (EN)

Mihai Moraru, Thibaut Patel – INSA de Lyon

Acceler8 - Parallel programming contest

Throughout this document we shall explain our solution and the steps we took in order to optimize it for parallel computing.

1. Objective

Our solutions objective (besides correctness) is speed. We can approach faster execution times by a) reducing the workload and b) running faster. As such, our efforts focused on two main directions: perfecting the algorithm from a mathematical point of view (reducing the number of unneeded calculations); and trying to get the most out of the multiple processors (cores) we had at our disposal (running faster).

1. Parallelism and choice of language

In order to benefit from the availability of plenty of cores and to make our program scalable, we posed ourselves questions about optimal management of a thread pool and proper distribution of tasks with load balancing. However this concerns had already been tackled by researchers and were being dealt with by projects like Cilk, MPI and OpenMP. Following a quick documentation, the most compelling choice was the Intel Threading Building Blocks library for the C++ programming language. The factors which weighed in the decision were a previous (brief) exposure to the concepts of TBB, the short time available for the contest, the availability of the library as Free Software.

1. First solution

Our first solution iterated over the interval [begin, end] and checked each number against it’s smallbrain sum. We optimized the way the checking was done up to the point where the program took about 180 seconds to solve for the interval between 1 and 240. Regarding parallelism we used the templates from TBB to great avail. We designed our own classes which implemented the Range and Body concepts needed for parallel_for. The speedup was linear.

However, by extrapolation, our program would have taken almost 50 years to solve for all numbers between 1 and 263-1.

1. Second solution

Motivated by the results posted on the Acceler8 forum, we realised that the problem was actually feasible in a reasonable amount of time. The key is in reducing the space over which we search for smallbrain numbers.

1. ## Theserialpart

Let f(x) be the function which assigns to an integer x the smallbrain sum corresponding to its digits. It is a non-continous function. Furthermore, the function is not injective, meaning that there can be multiple integers x which have the same smallbrain sum and that without considering different permutations (i.e. f(170) = f(446) = 344).

Mathematically what we are looking for are the intersection points of f(x) with id(x), where id(x) = x. Our first solution iterated over the X-axis. However, there are much fewer possible values on the Y-axis. Considering that many of them are not valid (i.e.: the small brain sum of a 3 digit number has to be between 100 and 999) we can generate only the valid candidates of smallbrain sums and check if they also belong to id(x).

From an algorithmic point of view, all numbers from 1 to 10D-1 can be arranged in a prefix tree of depth D in which each non-leaf node has exactly 10 children. The leaves represent D-digit numbers. A naive solution (our first one) would search the tree in a BFS manner in the given interval. Our second solution uses backtracking to generate viable candidates for a smallbrain number. Conceptually (the tree is not explicit), the algorithm does a DFS and backtracks as soon as the sum of the prefix generated so far is greater than 10D-1. This heuristic effectively prunes entire subtrees and drastically decreases the workload.

Another observation we can make is that in generating smallbrain candidates the order of the digits is not important. Thus we do not take into account different permutations of the same set of digits. This alone divides the search space by D! (factorial).

The backtracking algorithm was parametrized such that it could search in a well defined subtree of the D-digit number tree. It takes as arguments the interval in which to search and the starting node. The node is identified by depth, digit, path from root (prefix). In order to reduce calculations, we also pass the smallbrain sum of the prefix.

1. ## Theparallelpart

Now that the algorithm can search in a confined subtree, we can spawn multiple searches in parallel without fear of overlapping. We couldn’t easily adapt this search space to templates like parallel_for because of the difficulty of splitting into equal parts. We eventually found in the book of James Reinders on Intel Threading Building Blocks the chapter which describes the TBB task scheduler and how it can be used to split a problem recursively: “The scheduler works best with tree-structured task graphs [...] with tree-structured forking, it takes only O(log n) steps because some of the tasks created can go on to create subtasks.”

Our parallel part of the algorithm does just that. For each choice it makes it spawns a task. The TBB task scheduler chooses whether or not to execute it in parallel based on available physical threads. Once a certain depth has been reached, the algorithm switches to the serial part. The cutoff (actually the depth of the tree minus the depth of the level where we make the decision) was chosen empirically in order to strike a balance between generating potential parallelism and overhead.

1. ## Puttingitalltogether

The algorithm searches for smallbrain numbers of a certain amount of digits. In order to search all smallbrain numbers in the given interval [begin end], we divide the interval into powers of ten and start the algorithm in parallel. For example, searching for smallbrain numbers between 187 and 68403 would spawn 3 root-searches in the intervals [187 999], [1000 9999] and [10000 68403] respectively. Each root-search will spawn children recursively thus profiting from available cores.

1. Optimizations

In order to optimize the speed of the program, we tried variations of the implementation:

Serial part

• recursive backtracking

• iterative backtracking with stack for partial sums (bkt)

• iterative backtracking which keeps a single sum (bkt2)

Parallel part - task recurrence patterns (see chapter 9 from James Reindersbook)

• blocking style with children

• continuation style with children - recycling the parent as the continuation

Other optimizations (machine dependent)

• try different sizes for the matrix which holds precomputed powers (10x20, 16x32, 20x10, 32x16) to see which one is the most cache friendly.

• try both gcc and icc

• find best -march flag for gcc

• find the best value for cutoff

1. Scalability

In our opinion the algorithm scales well when presented with more cores. The cutoff limit we imposed represents the distance from the bottom of the tree where the serial parts comes into action. As the depth of the tree grows, the number of tasks spawned at each level is multiplied by ten. Supposing that we had 128bit integers, the maximum depth would be 39 (corresponding to the 39 digits of 2128-1). With a cutoff of 6, there would be 1+10+102+...+1033 tasks which could be executed in parallel if enough cores were available.

Incidentally, the biggest smallbrain number has 39 digits and is smaller than 2128-1. This means that we could compute all smallbrain numbers without any change to the code supposing we had a 128bit computer.

1. Improvements

The first solution outperforms the second one for intervals which span over less than 106. In order to achieve the best of both worlds, we have merged the two solutions together.

1. Results

Our final program manages to find all smallbrain numbers between 1 and 263-1 in under 75 ms on the Intel Many Core Testing Lab.

 Job ID Test count Cores Expectation (ms) Standard deviation Min (ms) Max (ms) 10236 100 max 70.54 0.31 66.93 75.19 10485 100 max 70.5 1.04 63.98 74.77 10487 100 1 2287.92 3.75 2280.71 2311.97 10488 1000 max 70.36 0.63 65.94 75.62

1. Bibliography
1. J. Reinders, Intel Threading Building Blocks, O'Reilly 2007

2. http://mathworld.wolfram.com/NarcissisticNumber.html

4

Catégories:
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

## Commentaires

Hi,

i cant' find the tarball?! is that normal ?!

i would be curious to put recursive traversal against iterative ones.
If you use gcc, do you turn the '-fomit-frame-pointer' flag on for the recursive version?!! that can make a difference!

thanks

I changed the article right after it was approved for publishing. The updated version (with the archive as an attachment) still waits for approval.
In the meantime here is a temporary link to the archive:
http://fex.insa-lyon.fr/get?k=qKpmucZTKevW5Spjk1t

Mihai

Voilà, Article mis à jour !