Maximum Subarray Problem Parallelization using PThreads
Catalin Ionut Fratila, Vlad-Marian Spoiala
University Politehnica of Bucharest
Faculty of Automatic Control and Computers, Computer Science Department
The algorithm we used for solving the maximum subarray problem was Kadane's 2D algorithm. A implementation of the algorithm is presented here: http://alexeigor.wikidot.com/kadane.
Our implementation was done in C. Parallelization was done using PThreads. We started to develop solutions in both TBB and PThreads, but decided to proceed with the PThreads because we obtained better results.
We used Intel's icc compiler for generating the executable. Using the icc compiler gave us a 3-5% performance improvement over the same code compiled using gcc. Other compilers were not tested.
Profiling was done locally using the Linux command line utility perf and Intel VTune. Perf was used to obtain raw profiling data that was of interest (number of cache misses, number of cycles etc.), while VTune was used for obtaining more complex profiling data like hotspots or concurrency information.
Since the input file does not contain the dimensions of the need input file we need to estimate the dimensions of the matrix. To do this we read the first line and determine the number of columns and the size of the first line. We then determine the total size of the file by doing and fseek to the end. We estimate the number of rows by dividing the total size of the file to the size of the first line.
Our first attempt attempt at reading the numbers was done using fscanf. Better results were obtained by giving up using fscanf and instead using strtok. In our final version we gave up using strtok and decided to parse the line by hand. We implemented a minimal and fast, but unsafe version of the atoi function. Our final version of the read function was 2-3 times faster than using fscanf to get the numbers from the line. VTune's hotspots analysis was used here to determine where most of the time is spent in the read function.
If our initial estimation of the total number of lines is smaller than the actual number of lines we reallocate the memory area reserved for storing the matrix.
We did not attempt to parallelize the read function.
Splitting the workload between threads was done with regard to the number of rows each thread receives. The total workload is represented by the total number of row pairs obtained from the 2 outer for loops in the Kadane 2D algorithm. We have a total of M * (M + 1) / 2 row pairs, where M is the number of lines. We try to obtain a balanced splitting of the pairs among the threads. This is done by assigning iterations of the most outer loop to each thread so that the number of pairs that each thread receives is close to M * (M + 1) / 2 / NT where NT is the number of threads.
For large number of threads and small number of rows threads some threads might not receive any work.
Each thread will compute a partial result for the iterations it was assigned. This partial result is composed of the maximum subarray sum value and the coressponding coordinates in the original matrix. The master thread goes through the partial results and selects the one with the maximum value for the subarray sum.
When running on the MTL each thread is forced to run on a certain core using the pthread_setaffinity_np function. Although this improves speedup significantly when using a large number of cores on the MTL, using this function on other machines caused a decrease of performance. Because of this its use is restricted to running our code on the MTL.
The threads are joined using a semaphore when the workload is completed: the main thread does NT waits on the semaphore, while each thread that finishes its workload (including the main thread) does a post operation on the semaphore. This was used to prevent serialization overhead from calling pthread_join for each thread from the main thread.
Since the problem is of O(M^2*N) complexity we transpose the matrix if the number of rows exceeds the number of columns.
The matrix is stored as a contiguous chunk of memory. This improves data locality.
The three loops are optimized by reducing the number of operations needed to do memory accesses. We use 3 pointers to access the numbers in the matrix: 2 are increased in a sequential manner (we add 1 to the address) while the third is increase in a non-sequential (we add N to the address, where N is the number of columns).
We present the results for three matrices of sizes 2000 x 2000, 5000 x 5000 and 6000 x 6000 when running on the MTL. Time is expressed in seconds. These are the results for single runs on machines that were not exclusively reserved for the job so there might be some variation in time when compared to running on a exclusively reserved batch node.
The code can be downloaded here: solution
Password for the archive is "secret"