**Dragoș Papavă - Andrei Zanfir**

"POLITEHNICA" University of Bucharest

**Date:** October 15th 2011 - November 15th 2011

## Introduction - The maximum subarray problem

For the problem of the maximum subarray problem we wrote our code in C++ using the icpc compiler along with profiling. We used Kadane 2D as the solver and OpenMP as the parallelization environment, while adding numerous optimizations on top of that. We will next explain each of these decisions, and how it affected the final outcome of the problem.

## Overview of our approach - parallelization scheme

A good parallel program is the one having a good load balance. This means that there shouldn't be any idle threads, and at every moment each thread is capable of taking some work to be done. We pondered on using pthreads or OpenMP as the parallelization framework. Due to our good previous experience with the latter, we had chosen to work with OpenMP. Furthermore, it is a lightweight framework that maintains and even improves code readability. Also, it is very easy to use, eliminates our low-level work and the possibility of having bugs in that nasty part of the program, and lets us concentrate on the algorithm.

At the first glance on the problem, we considered that the best approach would be to make our code able to handle as many files as possible in parallel, where reading and solving the matrices were each of them parallelized. What would happen is that while some threads are running the algorithm on a large matrix, some others should read some other files and finish the computations on them.

A first approach in the parallelization of our program was to use the * #pragma omp for* construct, but it didn't seem to be the best one as it lacked the possibility of using nested parallelism. Moreover, load balancing was hard to do, as the work of a heavy working thread couldn't be shared if some of the other threads became idle. The second approach with

*had very little more overhead (still largely unnoticeable), but it compensated by giving us the ability of allocating each parallelizable small part of code to a worker thread.*

**#pragma omp task**Now that we had chosen to use the tasking feature of OpenMP, the program's backbone became the following:

*#pragma omp parallel*

{

#pragma omp single nowait

{

for( uint32_t fileNo=0; fileNo<num_files; fileNo++ ) {

#pragma omp task firstprivate(fileNo) untied

{

ReadFile(fileNo, ... );

if (matrix too small) // where parallelizing would be inefficient

RunSerialAlgo(fileNo, ...);

else

RunParallelAlgo(fileNo, ...);

}

}

}

}

{

#pragma omp single nowait

{

for( uint32_t fileNo=0; fileNo<num_files; fileNo++ ) {

#pragma omp task firstprivate(fileNo) untied

{

ReadFile(fileNo, ... );

if (matrix too small) // where parallelizing would be inefficient

RunSerialAlgo(fileNo, ...);

else

RunParallelAlgo(fileNo, ...);

}

}

}

}

Each of the above ReadFile and RunAlgo functions will be explained in detail in their own chapters.

## Reading the files

For reading the files, we tried different approaches like:

- iostream
- open/read
- fopen/fread
- mmap

Out of all of the above, mmap seemed to be the fastest method. This is a brief scheme of our file reader.* int fd = open(file, O_RDONLY);
struct stat stats;
fstat(fd, &stats);
double size_estimate = (double)stats.st_size * 0.16252216462834786774589 * 17.0/16.0; // We estimate the matrix size*

First, an estimate of the number of integers contained in the file is computed by knowing the file size. The 0.1625 constant is the number of integers in a 10000x10000 random generated file, divided by its size. For safety reasons, we increase this estimate by 1/16. The next code makes sure that we'll wait for other computations to finish if the needed memory for processing this file is more than 75% of available memory.

*while (free_OS_memory() * 0.75 <= size_estimate)*

sleep 15 milliseconds

sleep 15 milliseconds

Reading the file seemed a few percents faster if we advised the OS that the memmapped file will soon be needed and should be brought into RAM as soon as possible.

*char *map = (char*)mmap((void*)0, (size_t)stats.st_size, PROT_READ, MAP_SHARED, fd, (off_t)0);*

madvise(map, stats.st_size, POSIX_MADV_WILLNEED);

close(fd);

madvise(map, stats.st_size, POSIX_MADV_WILLNEED);

close(fd);

Next, we decide if the file should be read by only one thread or it should be split into multiple parts and read in parallel. We tried to keep our method as fast as possible, by trying to avoid rereadings (e.g. using fgets then parsing the returned buffer actually processes each character twice).

*if (stats.st_size < 1000 || the executable is being run in serial mode, e.g. 1 thread)*

DoSerialReading();

else

DoParallelReading();

DoSerialReading();

else

DoParallelReading();

For both methods, we have used a no-rereadings approach, by reading one character at a time and deciding whether it is the minus sign, belongs to a number, is a space or an endline, and then processing the character. All numbers being read were stored in a large vector who could be later indexed as a matrix (e.g. v[i*N+j]). The number of lines is the number of endline characters, and the number of columns is computed while reading the numbers, between the beginning of file and the first endline.

More interesting is the parallel reading, where we have split the file into num_threads equal parts. Each part was assigned to a task and was read to a separate vector of integers. We again counted the number of rows as the number of endlines found, but by not knowing the matrix size and how it was split among threads, the easiest idea was to divide the total number of integers read by the number of lines found, and that was the number of matrix columns. Before starting to read the numbers, each of the splitting indices were optimized so that they would point to the beginning of a number. After all threads finished reading, a large chunk of memory was allocated and each of these parts were copied. Parallelizing this copying also improved the speedup of the reading code.

## The algorithm

We used the Kadane 2D algorithm for its simplicity and innate parallel potential: for a starting row x and an ending row z that define a sub-rectangle, compute the sum of its columns; this will yield an array of dimension 1xN (where N is the number of columns) that is optimally solved by the Kadane 1D algorithm in O(n) time. Thus, the complexity is O(M^2 * N). We chose to parallelize this algorithm at its most outer loop to avoid thread overhead.

** Optimization **

Our optimization methods were more on the engineering side of things, emphasizing brute force computations.

* Pointer dereferencing

*storage_type *a_tmp = a + x * N - 1;*

* Loop unrolling of 32 iterations (this has been tested to be the best number):

*for(SizeType i = iend;i--;)*

{

iter_body;

iter_end;

}

{

iter_body;

iter_end;

}

The iter_end macro is the one that actually performs an iteration; the iter_body macro evaluates to something like this:

*#define iter_simple iter_end; i--;*

#define iter_simple4 iter_simple;\

iter_simple;\

iter_simple;

#define iter_simple8 iter_simple4;\

iter_simple4;\

iter_simple;

#define iter_simple16 iter_simple8;\

iter_simple8;\

iter_simple;

#define iter_simple32 iter_simple16;\

iter_simple16;\

iter_simple;

#define UROLL 32

#define iter_body iter_simple32

#define iter_simple4 iter_simple;\

iter_simple;\

iter_simple;

#define iter_simple8 iter_simple4;\

iter_simple4;\

iter_simple;

#define iter_simple16 iter_simple8;\

iter_simple8;\

iter_simple;

#define iter_simple32 iter_simple16;\

iter_simple16;\

iter_simple;

#define UROLL 32

#define iter_body iter_simple32

We made the loop backward because it will now compare to zero - this is actually a much better loop management. Given the fact that this is the critical loop, the most minor changes matter significantly.

* Preferring pre-increment operators to avoid the creations of temporary variables and compacting instructions, in the same sense:

*t += (*(++pr_tmp) += *(++a_tmp));*

* Slighly better branch hints using the

*__builtin_expect*compiler intrinsic:

*if(__builtin_expect(t >= 0, 1)) { }*

else {

t = 0;

j = N - i;

}

else {

t = 0;

j = N - i;

}

This set of optimization alone has provided a 300% running boost to the baseline method.

## Further optimization

** Sum initialization **

We used a smarter sum initialization to help multiple threads to be aware of a better sum already found; we thus avoid entering an if-branch too often. This is very similar to what the winners did, but we didn't exit the loops completely(!), we were satisfied just for the branch not to be entered too often. We also updated the initial guess of the sum, by maximizing over all the threads sums:

*if(S[z] > s_init)*

s_init = S[z] - 1; // -1 for disambiguation of the real maximum sum found

s_init = S[z] - 1; // -1 for disambiguation of the real maximum sum found

**Handling special cases**

The fact that the algorithm is O(M^2 * N) and not O(N^2 * M), is not a matter of randomness. This is actually the most cache-friendly way to go through the matrix. So, given an input matrix for which M > N, we decided to transpose the matrix in O(M*N) time, rather than calling another analogous algorithm that would have run in O(N^2*M). The transposing of the matrix is itself highly optimized, so to handle cases only when O(M * N) is not negligible with respect to an O(M^2*N) running time. We also tested for trivial inputs: all non-negative, all negative and only one non-negative.

**Makefile and profiling**

For the compiler we have used icpc, in order to exploit its profiling abilities that seemed very natural since we were dealing with an Intel platform. This is the makefile:* mkdir prof
icpc -fast -xhost -openmp -parallel -ipo msubarray.cpp -override_limits -o run -funroll-loops
./run
icpc -fast -xhost -openmp -parallel -ipo msubarray.cpp -override_limits -prof_dir=prof -prof_gen -o run -funroll-loops
./run 1 ./prof/profiling.txt >./prof/output.txt
icpc -fast -xhost -openmp -parallel -ipo msubarray.cpp -override_limits -prof_dir=prof -prof_use -o run -funroll-loops
rm -rdf ./prof*

Inside the makefile, we compile the code, we run it without arguments in order to create an input matrix (1000x1000), then we run it with profiling generation, and finally we compile again using that profiling information.

## Conclusions

OpenMP turned out to be a suitable candidate for the parallelization scheme, and its easiness provided invaluable for the smooth flow of the algorithm design. We don't feel like this being a bottleneck whatsoever, and we managed to adapt it very fast to meet our needs.

Though the algorithm we came up with is fast, we feel like more thought needed to be put into exploiting the random nature of the input matrix. We feel that the Takaoka algorithm would have been a much better choice, if not for the fact that it's non-trivial to implement and parallelize efficiently, and the fact that even if it's O(M*N*log(N^2)) running time, it has some big constants attached to it making it viable only for extremely large matrices.

On this note, the whole better sum initialization is working due to the random nature of the matrix, but the fact that we didn't prune accordingly made a whole world of difference.

To further conclude, we enjoyed toying with this problem and we hope the reader will find something helpful in this article.

## Click here to download the code

The archive password is "secret"