Maximum Subarray Problem - Simple Parallelization and Optimizations

University of Novi Sad

Faculty of Technical Sciences, Department of Computing and Control

authors: Predrag Ilkic, Nenad Jovanovic

Date: October 15th 2011 - November 15th 2011


This article explains our method for solving the maximum subarray problem during the Intel Acceler8 contest. The team consisted of one fourth-year university student and one fourth-grade high school student.

The problem was the following: in a given matrix, find the subarray (identified by its corner coordinates) that has the maximum sum of elements. The algorithm used for the solution was the well-known Kadane's algorithm adapted for two-dimensional arrays. We started the project using OpenMP, but at one point we switched to pthreads because we got much better results. The algorithm wasn't too complex to implement, so the transition went quite smoothly. The only downside of the pthread solution was that we switched to it much too late, so we never had the time to try out some of the ideas that had failed with OpenMP. The compiler used was gcc; no other compilers were tested.
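To make the approach concrete, here is a minimal sketch of 2-D Kadane running over a column prefix-sum matrix (how we actually build that matrix is described in the next section). The names pre, Best and kadane2d are ours for illustration only, not identifiers from the contest code:

/* Sketch: 2-D Kadane over a column prefix-sum matrix.
 * pre[r][c] holds the sum of column c over rows 0..r-1 (pre[0][c] == 0),
 * so the sum of column c between rows top..bot is pre[bot+1][c] - pre[top][c]. */
#include <limits.h>

typedef struct {
    long long sum;
    int top, left, bottom, right;
} Best;

static Best kadane2d(long long **pre, int rows, int cols)
{
    Best best = { LLONG_MIN, 0, 0, 0, 0 };

    for (int top = 0; top < rows; top++) {          /* outermost loop: the one we parallelize later */
        for (int bot = top; bot < rows; bot++) {    /* bottom row of the current strip */
            /* 1-D Kadane over the column sums of the strip [top..bot] */
            long long cur = 0;
            int start = 0;
            for (int c = 0; c < cols; c++) {
                long long v = pre[bot + 1][c] - pre[top][c];
                if (cur <= 0) { cur = v; start = c; }
                else          { cur += v; }
                if (cur > best.sum) {
                    best.sum = cur;
                    best.top = top; best.bottom = bot;
                    best.left = start; best.right = c;
                }
            }
        }
    }
    return best;
}

The two outer loops pick a pair of rows, and the inner loop is plain 1-D Kadane over the column sums of that strip, which is where the O(rows*rows*columns) complexity comes from.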

1 - matrix parsing

Matrix parsing and loading was done sequentially. During the OpenMP phase of the project, we tried to parallelize the matrix parsing and loading, but without success. The results were, at best, pretty much the same as the sequential version, so that approach was dropped. No parsing parallelization was attempted with pthreads due to the lack of time.

Matrix parsing was done using the standard fread() function, loading one chunk at a time into memory. The input was parsed first by counting the spaces to get the column count and then by counting the line breaks to get the row count. After acquiring the matrix size, we allocated the appropriate amount of memory. The matrix was then loaded into memory in an orientation depending on the number of rows and columns (whether or not rows > columns - this affects the complexity of the solution, which is O(rows*rows*columns)). This saved us from having to transpose the matrix later. The loading was done using our own integer-reading routine, which we found much faster than the standard fscanf(). The matrix from the file itself was never actually stored in memory; the integers read were used directly to generate the prefix sum matrix, and only the prefix sum matrix was used in the later processing.
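As a rough illustration, here is a sketch of the kind of hand-rolled integer reading we mean: it parses straight out of the file buffer and accumulates column prefix sums without ever storing the raw matrix. It assumes the whole file content is already sitting in buf (the contest code actually reads it in chunks with fread()), and buf, pre and parse_into_prefix are hypothetical names:

/* Sketch: parse integers from a text buffer directly into column prefix sums.
 * pre has (rows + 1) rows; row 0 is all zeros. */
#include <ctype.h>
#include <stddef.h>

static void parse_into_prefix(const char *buf, size_t len,
                              int rows, int cols, long long **pre)
{
    for (int c = 0; c < cols; c++)
        pre[0][c] = 0;                                /* pre[0][*] = 0 */

    size_t i = 0;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            /* skip separators (spaces, newlines) between numbers */
            while (i < len && !isdigit((unsigned char)buf[i]) && buf[i] != '-')
                i++;

            int neg = 0;
            if (i < len && buf[i] == '-') { neg = 1; i++; }

            long long v = 0;
            while (i < len && isdigit((unsigned char)buf[i]))
                v = v * 10 + (buf[i++] - '0');

            /* accumulate the column prefix sum instead of storing the element */
            pre[r + 1][c] = pre[r][c] + (neg ? -v : v);
        }
    }
}

Skipping fscanf() and its format-string machinery is where most of the parsing speedup comes from.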

2 - parallelization and optimizations

The whole parallelization process was done in OpenMP and later translated almost literally to pthreads. Since the Kadane algorithm consists of three nested for loops, the parallelization was done on the outermost for loop to decrease overhead and increase data locality and cache hit rate. All iterations of the outer loop are completely independent, so parallelizing the algorithm in this way enables great scalability. Each iteration is picked up for processing by one of the worker threads, one at a time, until all iterations are done. We tried different types of scheduling with OpenMP: static scheduling was the least effective, while dynamic and guided scheduling gave very similar results, so we proceeded with the dynamic scheduling just described. After the transition to pthreads, dynamic scheduling was kept and no other scheduling types were tried due to the aforementioned lack of time. Every worker thread collects its own results, from which the final result is later picked. This was done to eliminate any thread synchronization that might slow down performance.
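Here is a minimal sketch of that scheme with pthreads, assuming the shared row counter is claimed with a gcc atomic built-in; worker, next_row and kadane_strip are illustrative names rather than the actual contest functions:

/* Sketch: dynamic scheduling over the outermost loop with a shared counter
 * and per-thread results, so the result itself needs no locking. */
#include <limits.h>
#include <pthread.h>

typedef struct { long long sum; int top, left, bottom, right; } Best;

/* shared, read-only inputs (filled in by the parsing stage) */
static long long **pre;          /* column prefix sums, (rows + 1) x cols */
static int rows, cols;

/* the next unclaimed outer-loop iteration (top row of the strip) */
static volatile int next_row = 0;

/* 1-D Kadane over every strip that starts at row `top`, updating this thread's best */
static void kadane_strip(int top, Best *best)
{
    for (int bot = top; bot < rows; bot++) {
        long long cur = 0;
        int start = 0;
        for (int c = 0; c < cols; c++) {
            long long v = pre[bot + 1][c] - pre[top][c];
            if (cur <= 0) { cur = v; start = c; } else cur += v;
            if (cur > best->sum) {
                best->sum = cur;
                best->top = top; best->bottom = bot;
                best->left = start; best->right = c;
            }
        }
    }
}

static void *worker(void *arg)
{
    Best *best = arg;            /* this thread's private result slot */
    best->sum = LLONG_MIN;

    for (;;) {
        /* dynamic scheduling: atomically claim the next outer-loop iteration */
        int top = __sync_fetch_and_add(&next_row, 1);
        if (top >= rows)
            break;
        kadane_strip(top, best);
    }
    return NULL;
}

/* main() would spawn N workers with pthread_create(), join them, and keep
 * the Best with the largest sum among the per-thread slots */

Since each thread writes only to its own Best slot, the only synchronization left is the atomic increment of the row counter.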

3 - results

For starters, here are some matrix parsing and loading timings:

matrix size     timing
1000x1000       39.35 ms
2000x2000       153.2 ms
5000x5000       984.52 ms
8000x8000       2466.63 ms

And here are the timings of the whole algorithm:

matrix size     number of cores     timing
1000x1000       1                   1.523 s
                8                   0.223 s
                16                  0.132 s
                24                  0.104 s
                32                  0.089 s
                40                  0.082 s
2000x2000       1                   12.093 s
                8                   1.683 s
                16                  0.891 s
                24                  0.646 s
                32                  0.525 s
                40                  0.465 s
5000x5000       20                  11.886 s
                30                  7.584 s
                40                  6.041 s

No fancy graphs or pictures :)

4 - conclusion

As the results show, for smaller matrices the speedup drops off significantly because the sequential part of the code takes up a large share of the execution time. As the matrix gets larger, the speedup becomes almost linear.

The code, makefile and the readme file from the contest are attached. The password for the archive is "secret". We tried to make the code well commented, so it should be easy to understand. You can download all of it here.

To wrap up, we'd like to say that the contest was a pleasure, a great experience, and a great chance to learn something new and useful, as well as to try it out properly (the 40-core MTL was great :)).
