# Maximum Subarray Problem using PThreads

Maximum Subarray Problem Parallelization using PThreads

# Maximum Subarray Problem Parallelization using PThreads

University Politehnica of Bucharest

Faculty of Automatic Control and Computers, Computer Science Department

The algorithm we used for solving the maximum subarray problem was Kadane's 2D algorithm. A implementation of the algorithm is presented here: http://alexeigor.wikidot.com/kadane.

Our implementation was done in C. Parallelization was done using PThreads. We started to develop solutions in both TBB and PThreads, but decided to proceed with the PThreads because we obtained better results.

We used Intel's icc compiler for generating the executable. Using the icc compiler gave us a 3-5% performance improvement over the same code compiled using gcc. Other compilers were not tested.

Profiling was done locally using the Linux command line utility perf and Intel VTune. Perf was used to obtain raw profiling data that was of interest (number of cache misses, number of cycles etc.), while VTune was used for obtaining more complex profiling data like hotspots or concurrency information.

# Parsing the input

Since the input file does not contain the dimensions of the need input file we need to estimate the dimensions of the matrix. To do this we read the first line and determine the number of columns and the size of the first line. We then determine the total size of the file by doing and fseek to the end. We estimate the number of rows by dividing the total size of the file to the size of the first line.

Our first attempt attempt at reading the numbers was done using fscanf. Better results were obtained by giving up using fscanf and instead using strtok. In our final version we gave up using strtok and decided to parse the line by hand. We implemented a minimal and fast, but unsafe version of the atoi function. Our final version of the read function was 2-3 times faster than using fscanf to get the numbers from the line. VTune's hotspots analysis was used here to determine where most of the time is spent in the read function.

If our initial estimation of the total number of lines is smaller than the actual number of lines we reallocate the memory area reserved for storing the matrix.

We did not attempt to parallelize the read function.

# Parallelization

Each thread will compute a partial result for the iterations it was assigned. This partial result is composed of the maximum subarray sum value and the coressponding coordinates in the original matrix. The master thread goes through the partial results and selects the one with the maximum value for the subarray sum.

When running on the MTL each thread is forced to run on a certain core using the pthread_setaffinity_np function. Although this improves speedup significantly when using a large number of cores on the MTL, using this function on other machines caused a decrease of performance. Because of this its use is restricted to running our code on the MTL.

The threads are joined using a semaphore when the workload is completed: the main thread does NT waits on the semaphore, while each thread that finishes its workload (including the main thread) does a post operation on the semaphore. This was used to prevent serialization overhead from calling pthread_join for each thread from the main thread.

# Other optimizations

Since the problem is of O(M^2*N) complexity we transpose the matrix if the number of rows exceeds the number of columns.

The matrix is stored as a contiguous chunk of memory. This improves data locality.

The three loops are optimized by reducing the number of operations needed to do memory accesses. We use 3 pointers to access the numbers in the matrix: 2 are increased in a sequential manner (we add 1 to the address) while the third is increase in a non-sequential (we add N to the address, where N is the number of columns).

# Results

We present the results for three matrices of sizes 2000 x 2000, 5000 x 5000 and 6000 x 6000 when running on the MTL. Time is expressed in seconds. These are the results for single runs on machines that were not exclusively reserved for the job so there might be some variation in time when compared to running on a exclusively reserved batch node.