Maximum Subarray Problem - Simple Parallelization and Optimizations


University of Novi Sad


Faculty of Technical Sciences, Department of Computing and Control


Authors: Predrag Ilkic, Nenad Jovanovic


Date: October 15th 2011 - November 15th 2011


Introduction:


This article explains our method for solving the maximum subarray problem during the Intel Acceler8 contest. The team consisted of one fourth-year university student and one fourth-year high school student.


The problem was the following: in a given matrix, find the subarray (identified by its corner coordinates) with the maximum sum of elements. The algorithm used for the solution was the well-known Kadane's algorithm adapted for two-dimensional arrays. We started the project using OpenMP, but at one point we switched to pthreads because it gave much better results. The algorithm wasn't too complex to implement, so the transition went quite smoothly. The only downside of the pthreads solution was that we switched to it far too late, so we never had time to try out some of the ideas that had failed with OpenMP. The compiler used was gcc; no other compilers were tested.
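

For reference, the one-dimensional Kadane's algorithm that everything builds on looks roughly like the sketch below (a minimal illustration in plain C, not our contest code):

/* 1D Kadane: maximum-sum contiguous subarray in O(n).
   Tracks the start/end indices of the best run as it goes. */
long kadane_1d(const int *a, int n, int *start, int *end)
{
    long best = a[0], cur = a[0];
    int best_start = 0, best_end = 0, cur_start = 0;

    for (int i = 1; i < n; i++) {
        if (cur < 0) {
            /* a negative running sum can never help; restart here */
            cur = a[i];
            cur_start = i;
        } else {
            cur += a[i];
        }
        if (cur > best) {
            best = cur;
            best_start = cur_start;
            best_end = i;
        }
    }
    *start = best_start;
    *end = best_end;
    return best;
}

The two-dimensional adaptation fixes a pair of rows (top, bottom), collapses each column of that strip into a single sum, and runs this kernel over the resulting array; doing that for every pair of rows gives the O(rows*rows*columns) complexity mentioned below.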


1 - matrix parsing


Matrix parsing and loading were done sequentially. During the OpenMP phase of the project we tried to parallelize them, but without success: the results were at best pretty much the same, so that approach was dropped. No parsing parallelization was attempted with pthreads due to the lack of time.


Matrix parsing was done using the standard fread() function, loading one chunk at a time into memory. A chunk was parsed first by counting the spaces to get the column count and then by counting the linebreaks to get the row count. After acquiring the matrix size, we allocated the appropriate amount of memory. The matrix was then loaded into memory in an orientation that depends on whether rows > columns: since the complexity of the solution is O(rows*rows*columns), it pays to treat the smaller dimension as the rows, and storing the matrix accordingly spared us from transposing it later. The loading was done with our own integer-reading routine, which we found much faster than the standard fscanf(). The matrix from the file was never actually stored in memory: the integers read were used directly to generate the prefix-sum matrix, and only the prefix-sum matrix was processed in this solution.
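

The snippet below is a simplified, self-contained sketch of this idea, not the contest code: a hand-rolled integer reader working over an in-memory buffer, feeding column-wise prefix sums directly. We assume column-wise sums here, since that is the natural layout for the row-pair loop of 2D Kadane, and all names are ours for illustration; the real reader additionally worked chunk by chunk over fread() output.

#include <ctype.h>

/* Parse the next (possibly negative) integer starting at *p and
   advance *p past it. Assumes well-formed, whitespace-separated input. */
static long next_int(const char **p)
{
    const char *s = *p;
    while (*s != '\0' && *s != '-' && !isdigit((unsigned char)*s))
        s++;
    long sign = 1;
    if (*s == '-') { sign = -1; s++; }
    long v = 0;
    while (isdigit((unsigned char)*s))
        v = v * 10 + (*s++ - '0');
    *p = s;
    return sign * v;
}

/* Build column-wise prefix sums while parsing: pre[r*cols + c] holds
   the sum of column c over rows 0..r. The raw matrix itself is never
   stored; only this prefix-sum matrix is kept in memory. */
void load_prefix_sums(const char *buf, int rows, int cols, long *pre)
{
    const char *p = buf;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            long above = (r == 0) ? 0 : pre[(r - 1) * cols + c];
            pre[r * cols + c] = above + next_int(&p);
        }
}

With column prefix sums in place, the sum of column c between rows r1 and r2 becomes a single subtraction, pre[r2*cols + c] - pre[(r1-1)*cols + c], which is what keeps the innermost loop of the 2D algorithm cheap.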


2 - parallelization and optimizations


The whole parallelization process was first done in OpenMP and later translated almost literally to pthreads. Since Kadane's algorithm consists of three nested for loops, the parallelization was done on the outermost loop, to decrease overhead and increase data locality and cache hit rate. All iterations of the outer loop are completely independent, so parallelizing the algorithm this way scales very well. Each iteration is picked up for processing by one of the worker threads, and iterations are processed one by one until all of them are done. We tried the different scheduling types with OpenMP: static scheduling was the least effective, while dynamic and guided scheduling gave very similar results, so we proceeded with dynamic scheduling, which works exactly as just described. After the transition to pthreads, dynamic scheduling was kept and no other scheduling types were tried due to the aforementioned lack of time. Every worker thread collects its own results, from which the final result is later picked. This was done to eliminate any thread synchronization that might slow down performance.
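

In OpenMP this boils down to a #pragma omp parallel for schedule(dynamic) on the outermost loop. Below is a stripped-down sketch of the pthreads equivalent, not the contest code itself: dynamic scheduling is emulated with a shared counter that each worker bumps atomically (using gcc's __sync_fetch_and_add builtin) to grab the next outer-loop iteration, and every thread accumulates its best result in its own struct. All names and the exact struct layout are ours for illustration.

#include <pthread.h>
#include <limits.h>

struct result { long sum; int top, bottom, left, right; };

static int g_rows, g_cols;
static const long *g_pre;   /* column prefix sums, rows x cols */
static int g_next_top = 0;  /* next outer-loop iteration to hand out */

/* 1D Kadane over the strip sums of rows top..bottom, updating best. */
static void kadane_strip(int top, int bottom, struct result *best)
{
    long cur = 0;
    int cur_left = 0;
    for (int c = 0; c < g_cols; c++) {
        long strip = g_pre[bottom * g_cols + c]
                   - (top ? g_pre[(top - 1) * g_cols + c] : 0);
        cur += strip;
        if (cur > best->sum) {
            best->sum = cur;
            best->top = top;       best->bottom = bottom;
            best->left = cur_left; best->right = c;
        }
        if (cur < 0) { cur = 0; cur_left = c + 1; }
    }
}

/* Worker: atomically grab the next top row, then scan every bottom
   row below it. Results stay thread-local, so no locking is needed. */
static void *worker(void *arg)
{
    struct result *best = arg;
    for (;;) {
        int top = __sync_fetch_and_add(&g_next_top, 1);
        if (top >= g_rows)
            break;
        for (int bottom = top; bottom < g_rows; bottom++)
            kadane_strip(top, bottom, best);
    }
    return NULL;
}

/* Launch nthreads workers and merge their per-thread results. */
struct result find_max_subarray(int nthreads)
{
    pthread_t tid[nthreads];
    struct result res[nthreads];
    for (int t = 0; t < nthreads; t++) {
        res[t] = (struct result){ LONG_MIN, 0, 0, 0, 0 };
        pthread_create(&tid[t], NULL, worker, &res[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    struct result best = res[0];
    for (int t = 1; t < nthreads; t++)
        if (res[t].sum > best.sum)
            best = res[t];
    return best;
}

Handing out whole top rows keeps the scheduling overhead at one atomic operation per O(rows*columns) of work. Note also that the work per iteration shrinks as the top row moves down (fewer bottom rows remain to scan), so the workload is triangular; dynamic pickup balances this naturally, which is presumably why static scheduling was the least effective in our tests.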


3 - results


For starters, here are some matrix parsing and loading timings:

matrix size    timing
1000x1000      39.35 ms
2000x2000      153.2 ms
5000x5000      984.52 ms
8000x8000      2466.63 ms


And here are the timings of the whole algorithm:

matrix size    number of cores    timing
1000x1000      1                  1.523 s
               8                  0.223 s
               16                 0.132 s
               24                 0.104 s
               32                 0.089 s
               40                 0.082 s
2000x2000      1                  12.093 s
               8                  1.683 s
               16                 0.891 s
               24                 0.646 s
               32                 0.525 s
               40                 0.465 s
5000x5000      20                 11.886 s
               30                 7.584 s
               40                 6.041 s
8000x8000      20                 52.050 s
               30                 34.685 s
               40                 26.234 s


No fancy graphs or pictures :)


4 - conclusion


As the results show, for smaller matrices the speedup drops off sharply because the sequential part of the code (parsing and loading) takes up a large share of the execution time: for the 1000x1000 matrix, parsing alone takes about 39 ms, roughly half of the 82 ms total at 40 cores. As the matrix gets larger, the speedup grows to almost linear: for the 8000x8000 matrix, going from 20 to 40 cores cuts the time from 52.05 s to 26.23 s, a factor of almost exactly 2.


The code, makefile and readme file from the contest are attached. The password for the archive is "secret". We tried to comment the code well, so it should be easy to understand. You can download all of it here.


In closing, we'd like to say that the contest was a pleasure, a great experience and a great chance to learn something new and useful, as well as to try it out properly (the 40-core MTL was great :)).


