# Acceler8 contest

Andreolli Cédric - Garcia Pascal - Templé Arthur

Date: October 15th 2011 - November 15th 2011

### Abstract:

This report explains the approach we used to solve the "Maximum Subarray Problem" during the Intel Acceler8 contest. We are two fourth-year students and a teacher at INSA of Rennes.
The idea of the contest was to build an algorithm able to scale on computers with a large number of cores. Here are the different steps of the development process. We hope you will enjoy reading it as much as we enjoyed the contest.

# Introduction

## The maximum subarray problem

The first step was, of course, to understand the problem. The maximum subarray problem is a well-known algorithmic problem. It consists of finding the rectangle (a submatrix) with the maximum sum in a matrix of integers.

A lot of documentation about this problem can be found on the Internet. We started by studying and testing some of the algorithms we found, such as Kadane's algorithm.

We chose C++ for solving the problem. C++ offers the advantage of being quite low level when you need it, while still giving access to higher-level objects such as vectors, lists, etc. Besides, we are currently learning this language at INSA, so it was a good opportunity to practice.

## OpenMP

For the parallelization part, we chose to use OpenMP. We made this choice for two main reasons.

First, the video tutorial was really easy to understand and covered a lot of functionality that was really helpful for what we planned to do. Furthermore, the MTL did not require a lot of specific settings to work with OpenMP, and we thought it was a good idea not to lose time on compilation problems.

The second reason is that none of us had ever used it, and it is always interesting to discover new libraries. One of the main advantages OpenMP offers is that it allows you to parallelize your code incrementally: with very few changes, you can improve the speed of a sequential program, and this is one of the big strengths we found in this library.

Finally, OpenMP offers the advantage of adding very few lines of code. For example, you do not have to use multiple semaphores or mutexes to protect your critical data.

Once we had finished exploring the possibilities of OpenMP, we decided to split the project into independent tasks.
The next sections explain those tasks. You will also find the program documentation in the doc directory.

## Problems encountered

As we decided to work with C++, we started by writing a really simple file reader. We used the basic STL operations at first and proceeded as follows:

```cpp
std::ifstream file(fileName, std::ios::in);
std::string line;
int num;
while (std::getline(file, line)) {
    std::stringstream tmp(line, std::stringstream::in);
    while (tmp >> num) {
        // Parsing code here
    }
}
```

But actually, the line:

```cpp
while (tmp >> num) {
```

had really bad performance. We then decided to write our own integer parser. After a few tests, it turned out to be much faster. Then we started to think about parallelizing this step.
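The idea can be sketched as follows (a simplified reconstruction, not the exact code we submitted): the parser walks the character buffer directly, which avoids the locale handling and error checking of the stream extraction operator.

```cpp
// Minimal sketch of a hand-rolled integer parser: reads one integer
// starting at p and advances p past it. It only handles an optional
// '-' sign, digits, and space/newline separators (simplified version).
int parseInt(const char*& p) {
    while (*p == ' ' || *p == '\n') ++p;   // skip separators
    bool negative = (*p == '-');
    if (negative) ++p;
    int value = 0;
    while (*p >= '0' && *p <= '9')
        value = value * 10 + (*p++ - '0');
    return negative ? -value : value;
}
```

Skipping all the generality of `operator>>` is what makes this kind of parser faster on large files.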

We first took some time to see what was going on when we read a file and parsed it into a vector. As a matter of fact, we could see that the cores were absolutely not busy during this operation.
As there were a lot of hard drive accesses, the cores were spending most of their time waiting for data to arrive. It was not optimal, but in our view it was not possible to parallelize the file reading itself, because there is only one bus between the hard drive and the main memory.

## The way we solved it

Finally, after more tries, we realized that loading the whole file into main memory was quite fast. We call this array of characters buffer.
We then imagined a trick to use parallelization to speed up the parsing of the files.

Actually, once the file is in main memory, it is really fast to run through it. So we decided to make two passes over the whole file:
the first to get the matrix dimensions, the second to parse the file.

### Getting the matrix dimensions

At this point, we wanted to spend as little time as possible on this step, but as the buffer is an array of characters, it is really difficult to parallelize the whole process.
So we decided to sequentially read the first line (until a '\n' is found) and count the number of columns of the matrix thanks to the white spaces.
As the input file must respect some specifications, we are sure that the number of columns of the matrix is equal to the number of spaces on a line plus one.
Once this step is over, we just need to rush through the rest of the file counting the '\n' characters. This last step can easily be parallelized with
OpenMP and a #pragma omp for directive.

During this process, we register the addresses of the new lines into a vector named addressTab.
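The counting part of this pass can be sketched like this (a simplified version using a reduction; the real code also records the newline addresses into addressTab):

```cpp
#include <cstddef>

// Sketch: count the '\n' characters of the in-memory buffer in parallel.
// Without OpenMP the pragma is simply ignored and the loop runs sequentially.
long countLines(const char* buffer, std::size_t size) {
    long newlines = 0;
    #pragma omp parallel for reduction(+:newlines)
    for (long i = 0; i < (long)size; ++i)
        if (buffer[i] == '\n')
            ++newlines;
    return newlines;
}
```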

### Parsing the file

Once we have the matrix dimensions, we also have a vector (addressTab) which contains the addresses of all new lines in buffer.
We can now parallelize the file parsing. Depending on the number of cores we have, we split buffer into different parts based on the new line addresses (see Figure 1).

The goal of the process is to fill a two-dimensional vector, which we will call matrix. The elements in addressTab correspond to the addresses of the beginning of each line in buffer, and as a core is responsible for at least an entire line, it can put the parsed numbers at the correct positions in matrix.
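This filling step can be sketched as follows (a simplified version using the standard strtol instead of our own parser; addressTab and matrix are assumed to be already sized):

```cpp
#include <cstdlib>
#include <vector>

// Sketch: each OpenMP thread parses whole lines, so it can write the
// numbers at the correct positions in matrix without synchronization.
void parseRows(const std::vector<char*>& addressTab, int nCols,
               std::vector<std::vector<int>>& matrix) {
    const int nRows = (int)addressTab.size();
    #pragma omp parallel for
    for (int row = 0; row < nRows; ++row) {
        char* p = addressTab[row];             // start of this line in buffer
        for (int col = 0; col < nCols; ++col)
            matrix[row][col] = (int)std::strtol(p, &p, 10);
    }
}
```

Because rows never share a cache line of matrix in practice and each thread writes only its own rows, no locking is needed here.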

Our algorithm for solving the maximum subarray problem is faster when the matrix of n rows and m columns is such that n ≤ m. So we have two different functions for generating an optimal matrix (readLinesOrdered and readLinesReversed).
We did not factorize this part of the code for optimization reasons: factorizing would have added a test for every number we had to put in matrix, and on big files this could have been a significant waste of time.

# The one dimension algorithm

We actually do not have a lot to tell here.
We started to work on this algorithm and wrote some parallelized functions for it, but as it is already a very fast sequential algorithm (O(n) complexity), the improvements were not significant. In fact, the time needed to solve the problem was negligible compared to the time needed to read and parse the file.
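For reference, the sequential one-dimensional Kadane algorithm is short (coordinate tracking omitted for brevity):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One-dimensional Kadane: maximum sum of a contiguous subarray, O(n).
long maxSubarraySum(const std::vector<int>& a) {
    long best = a[0], current = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        // either extend the current subarray or restart at a[i]
        current = std::max<long>(a[i], current + a[i]);
        best = std::max(best, current);
    }
    return best;
}
```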

The two dimensions problem was hard enough, so we decided not to spend more time on this particular case.

# The two dimensions algorithm

As said before, the two-dimensional maximum subarray problem is a well-known problem, and documentation about it can be found on the Internet.
We decided to work on Kadane's algorithm, which is quite simple to understand.
We started to work on the parallelization process for this algorithm. The two-dimensional Kadane algorithm is a generalization of the one-dimensional case.
It is based on three nested for loops.
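As a reminder, here is a sequential sketch of those three nested loops (coordinate tracking omitted): the two outer loops fix the top and bottom rows, the columns are collapsed into running sums, and a one-dimensional Kadane pass runs over them.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential two-dimensional Kadane, O(n^2 * m): for every pair of rows,
// collapse the columns into running sums and run 1D Kadane over them.
long maxSubmatrixSum(const std::vector<std::vector<int>>& m) {
    const std::size_t rows = m.size(), cols = m[0].size();
    long best = m[0][0];
    std::vector<long> colSum(cols);
    for (std::size_t top = 0; top < rows; ++top) {
        std::fill(colSum.begin(), colSum.end(), 0L);
        for (std::size_t bottom = top; bottom < rows; ++bottom) {
            for (std::size_t c = 0; c < cols; ++c)
                colSum[c] += m[bottom][c];          // extend down one row
            long current = colSum[0];               // 1D Kadane over colSum
            best = std::max(best, current);
            for (std::size_t c = 1; c < cols; ++c) {
                current = std::max(colSum[c], current + colSum[c]);
                best = std::max(best, current);
            }
        }
    }
    return best;
}
```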

As we chose to use OpenMP, we had two main approaches: one was to use the #pragma omp for directive, the other the #pragma omp task directive.
We started with the easy solution (#pragma omp for). After running some tests, we could see that the cores were not busy throughout the whole process. That is the reason why we designed a solution with the #pragma omp task directive.

The idea was to compute the number of tasks we wanted to create (let's call it numberOfTasks) and launch the tasks. Each task is responsible for computing the maximal sum (and the associated coordinates) for a part of the matrix.
The first for loop of the Kadane algorithm iterates over the rows of the matrix.
The parallelization is done by dividing this loop among numberOfTasks tasks: each thread assigned to a task executes the for loop, but instead of incrementing by one, it increments by numberOfTasks.
Here is the corresponding part of the code (where n is the number of rows of the matrix):

```cpp
...
for (unsigned int rowStart = taskNumber; rowStart < n; rowStart += numberOfTasks) {
    ...
    for (...) {
        ...
        for (...) {
            ...
        }
        ...
    }
    ...
}
```

It is important to create more tasks than there are cores. Indeed, each task does not take the same computation time, and with exactly one task per core we would have to wait for the longest tasks at the end of the function. Increasing the number of tasks reduces this waiting time because the tasks are shorter.

## Our algorithm

After having parallelized the two-dimensional Kadane algorithm, we started to work on the algorithm itself.
First, we had the idea of creating a new one-dimensional array (called maxSumStartingAtRow) which contains, for each index i, an upper bound on the maximal sum you can obtain for a submatrix lying between row i and the end of the original matrix.

We used the maxSumStartingAtRow array to break out of the second for loop when the maximal sum already found in the current task was bigger than this upper bound (lines 9 to 12 in figure 2).

(Figure 2: the two-dimensional Kadane algorithm with the pruning test and break statement.)
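The pruning idea can be sketched sequentially as follows (a simplified reconstruction: variable names follow the report, but the real code tracks the coordinates and runs inside parallel tasks):

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

// Pruned two-dimensional Kadane (sequential sketch). maxSumStartingAtRow[r]
// is an upper bound on the sum of any submatrix whose top row is r; once a
// row has been fully processed without pruning, the bound becomes exact.
long prunedKadane2D(const std::vector<std::vector<int>>& m) {
    const std::size_t rows = m.size(), cols = m[0].size();
    std::vector<long> maxSumStartingAtRow(rows, LONG_MAX >> 2);
    long bestSoFar = LONG_MIN;
    for (std::size_t top = 0; top < rows; ++top) {
        long bestFromTop = LONG_MIN;
        bool pruned = false;
        std::vector<long> colSum(cols, 0);
        for (std::size_t bottom = top; bottom < rows; ++bottom) {
            // nothing starting at `top` can beat the best already found
            if (bestSoFar >= maxSumStartingAtRow[top]) { pruned = true; break; }
            for (std::size_t c = 0; c < cols; ++c)
                colSum[c] += m[bottom][c];
            long current = colSum[0];               // 1D Kadane over colSum
            bestFromTop = std::max(bestFromTop, current);
            for (std::size_t c = 1; c < cols; ++c) {
                current = std::max(colSum[c], current + colSum[c]);
                bestFromTop = std::max(bestFromTop, current);
            }
        }
        bestSoFar = std::max(bestSoFar, bestFromTop);
        if (!pruned)
            maxSumStartingAtRow[top] = bestFromTop; // refine the bound
    }
    return bestSoFar;
}
```

Run sequentially like this, the test never fires and the exact maximum is returned; the pruning pays off when several tasks fill different entries of maxSumStartingAtRow concurrently.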

To generate maxSumStartingAtRow, we first used a parallelized preprocessing operation. It started from the bottom of the matrix, computed some one-dimensional Kadane passes, and filled maxSumStartingAtRow from bottom to top, as illustrated in figure 3.

With this solution, the problem was that maxSumStartingAtRow was a really bad estimate of the real maximal sum starting at a row. This is quite easy to understand with the example in figure 4.

A solution we found was to compute, at regular intervals, some two-dimensional Kadane passes. Indeed, on huge arrays this algorithm is much more accurate, and it was used to decrease the difference between the real values and the computed values in maxSumStartingAtRow.

The last problem we had was the time needed for this preprocessing operation. Even though the preprocessing reduced the solving time of the two-dimensional algorithm, on big arrays (10000 x 10000) the total time only decreased by a few seconds (because of the long preprocessing).

Finally, our last two-dimensional algorithm does not use preprocessing at all. The trick is that the classical two-dimensional Kadane algorithm spends its time computing sums from one row to another. This allows us to use the solving part of our algorithm itself to fill maxSumStartingAtRow (the following piece of code is placed at line 19 in figure 2):

```cpp
...
if (!pruningOccured) {
    ...
}
```

The maxSumStartingAtRow vector is initialized with the following line:

```cpp
for (int i = 0; i < n; ++i) maxSumStartingAtRow[i] = LONG_MAX >> 2;
```

As maxSumStartingAtRow is used in an addition, we want to avoid overflow. This is why we divide the LONG_MAX value by 4 (a right shift by two bits).

Finally, we added a variable bestSoFar, shared by all threads, holding the best value found so far by all completed tasks. This value is used to initialize the sum variable (line 2 in figure 2: we replace sum = 0 by sum = bestSoFar) in order to cut parts of the matrix based on the best values found by other tasks.

Note that having more tasks than available cores is important for our pruning method too: threads can communicate partial results to the others more often and, in doing so, help each other prune parts of the computation.
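The sharing mechanism can be sketched as follows (a hypothetical reconstruction: this report does not show how bestSoFar is protected, so a critical section is assumed here, with a dummy per-task computation standing in for the real work):

```cpp
#include <algorithm>

// Sketch: each task computes a partial best and publishes it through a
// shared variable; a critical section is assumed for the update.
long bestOverTasks(int numberOfTasks) {
    long bestSoFar = -1;
    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < numberOfTasks; ++t) {
            #pragma omp task firstprivate(t) shared(bestSoFar)
            {
                long localBest = (long)t * t;   // stand-in for a task's result
                #pragma omp critical
                bestSoFar = std::max(bestSoFar, localBest);
            }
        }
        #pragma omp taskwait                    // wait for all partial bests
    }
    return bestSoFar;
}
```

Tasks that start later read a bestSoFar already raised by earlier tasks, which is exactly what makes the pruning test fire more often.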

To conclude, the complexity of our Kadane algorithm is still O(n² x m), but thanks to the cut in the second for loop, we can most of the time improve the speed of the resolution.

# The final algorithm

The final part of the algorithm is really simple: the method static void computeMaxSubArray(char* fileName) in the MaxSubArrayPb class only calls the different functions we wrote. Here are the different tasks executed by the algorithm:

Load the file:

This just loads the file into main memory.

Get the matrix size:

Here, the only goal is to get the dimensions of the input matrix and to fill a vector with the addresses of all the lines.

Find the right orientation:

As explained before, our algorithm is much more efficient with some particular arrangements of the initial matrix.

Generate the vector:

This operation turns the input file into a C++ two-dimensional vector.

Launch the right algorithm:

Depending on the number of rows of the input matrix, we choose to launch either the one-dimensional Kadane algorithm or our two-dimensional algorithm.

Reverse the result if necessary:

If the third step reversed the matrix, we need to rotate the result to get the correct output coordinates.

Print the result:

Probably no need of explanation here :-).

The main function, in the main.cpp file, only calls the computeMaxSubArray method on each file passed as a parameter.
It also defines the number of threads the algorithms have to use.
As it is really short, here is the code:

```cpp
int main(int argc, char* argv[]) {
    if (argc < 3) {
        cout << "Par...";
        ...
    }
    ...
    for (int i = 2; i < argc; ++i) {
        MaxSubArrayPb::computeMaxSubArray(argv[i]);
    }
    return 0;
}
```

# Conclusion

This contest was really interesting in a lot of different aspects. First of all, it involved team work between a teacher and students, which was really rewarding.
We all learned a lot of things on a topic we did not know well.
As computers have more and more cores, this kind of computation is probably going to become a very important issue in future application development.
This contest was the occasion to discover existing technologies. It was also the occasion to practice on a 40-core computer, which is not possible every day.
The topic of the contest, the maximum subarray problem, was an interesting problem to try to parallelize.
It was quite simple to understand, and it allowed us to use multiple ways to parallelize our program.

The resources put at our disposal by Intel were well adapted to beginners in parallel computing. We enjoyed learning by watching the video tutorial.

To conclude, it was a really rich experience, and we want to thank Intel for organizing this contest.