Maximum subarray problem adapted for manycore architectures

Lupescu Grigore, Oprea Mihai
Implementation: C/C++ & OpenMP

Intro



In computer science, the maximum subarray problem is the task of finding the contiguous subarray within an N-dimensional array of numbers which has the largest sum. Several algorithms exist, out of which we choose Kadane2D because of its potential for parallelism, compared to Tadao Takaoka's algorithm, which cannot easily be adapted to run on multiple cores.

Intel's MTL (Manycore Testing Lab) is a good example of a manycore architecture and will be used to show the scalability problems that arise even in embarrassingly parallel problems such as Kadane2D. After connecting to the front-end, a user is able to submit jobs to one of the available clusters in the queue using qsub. A cluster consists of 4 Xeon processors, each with 10 cores and 24 MB of cache, and a total of ~260 GB of quad-channel RAM (Hyper-Threading is deactivated by default). Scalability is hard to achieve mainly because of the multiprocessor configuration, even on highly parallel problems such as Kadane2D.

Reducing sequential read time



Knowing the number of rows is not vital to the reading process. The “natural” approach is to read the file line by line with fgets (or a C++ equivalent) until the end of file. This brings about the next problem: how to store the matrix. The easiest way is to use std::vector, namely vector< vector<int> >. However, since vector is a class and uses member functions to store (push_back) and access (the overloaded operator []) elements, it is much less efficient than a plain int** matrix.
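
Below is a minimal sketch of this naive approach, assuming whitespace-separated integers and a generous upper bound on line length; read_matrix and LINE_BUF are illustrative names, not from the original implementation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_BUF (10000 * 12)   /* assumed upper bound on line length */

/* Naive reader: fgets + strtok + atoi, growing the storage with
   realloc as lines and numbers appear. */
int **read_matrix(const char *path, int *rows, int *cols)
{
    FILE *f = fopen(path, "r");
    if (!f) return NULL;

    char *line = malloc(LINE_BUF);
    int **m = NULL;
    *rows = *cols = 0;

    while (fgets(line, LINE_BUF, f)) {
        int *row = NULL, n = 0;
        for (char *tok = strtok(line, " \t\n"); tok != NULL;
             tok = strtok(NULL, " \t\n")) {
            row = realloc(row, (n + 1) * sizeof(int));   /* slow: one realloc per element */
            row[n++] = atoi(tok);
        }
        if (n == 0) { free(row); continue; }             /* skip empty lines */
        m = realloc(m, (*rows + 1) * sizeof(int *));
        m[(*rows)++] = row;
        *cols = n;
    }
    free(line);
    fclose(f);
    return m;
}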

The actual parsing of a matrix row is done by applying the atoi function to each ASCII-represented number. Breaking the line into tokens can be done either with strtok (or a C++ equivalent) or by manually iterating through the characters and copying each whitespace-delimited number to an auxiliary buffer. Both methods follow the same underlying principle and their performance is similar.
Since the algorithm's complexity, O(N*N*M) for Kadane2D, is highly dependent on the number of rows N, the matrix is transposed when necessary so that N is always smaller than M.
Keeping a linearized (1-dimensional) matrix is probably the most important optimization. Although it does not improve the time spent reading, it gives the overall algorithm an important speed-up.

Traditional 2-D matrix dynamic allocation does not keep the rows in a contiguous block of memory, since each row pointer is allocated separately. Because of that, when the algorithm uses a certain row and its corresponding block of memory is loaded into the cache, the rest of the cache block will usually contain useless data. On the other hand, if the matrix is allocated as one big contiguous block, then loading a row into the cache will load the next few rows as well, depending on the size of the row and on the size of the cache. For example, if a row contains 10000 integers (40 KB), a total of 100 rows will be loaded in a 4 MB cache when the first row is accessed. By comparison, there are no guarantees that more than one row will be loaded in the cache if each row is allocated separately. In order to take full advantage of this feature, the distribution of rows to each thread must be planned accordingly.
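
As a hedged illustration of the difference, the two allocation schemes can be sketched as follows (alloc_2d, alloc_1d and the AT macro are our names):

#include <stdlib.h>

/* Per-row allocation: each row is a separate heap block, so
   consecutive rows are not guaranteed to be adjacent in memory. */
int **alloc_2d(int rows, int cols)
{
    int **m = malloc(rows * sizeof(int *));
    for (int i = 0; i < rows; i++)
        m[i] = malloc(cols * sizeof(int));
    return m;
}

/* Linearized allocation: one contiguous block; row i starts at
   offset i * cols, so loading one row pulls its neighbours into cache. */
int *alloc_1d(int rows, int cols)
{
    return malloc((size_t)rows * cols * sizeof(int));
}

#define AT(m, i, j, cols) ((m)[(size_t)(i) * (cols) + (j)])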

A more efficient way of reading a file is to use an operating-system capability: mapping the file into RAM via the zero-copy mechanism. Under UNIX, this is done with the mmap function. To increase the speed even further, the madvise function is used to tell the operating system how the mapped memory will be accessed; in this case the file is read sequentially, so MADV_SEQUENTIAL is used.

In order to eliminate the need to resize the matrix through repeated reallocations, an estimate can be made of the maximum memory requirements of the input file. Since a file contains ASCII numbers separated by whitespace characters, the maximum number of elements it can contain is file_size / 2: in the worst case, where all numbers are single digits, there are two bytes per number (digit + whitespace). So the maximum size of the matrix is file_size / 2 * sizeof(int). After the reading is completed, the matrix is shrunk via realloc, so that all redundant allocated memory is freed.
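
A sketch of the whole mapped-read path, under the assumptions above (read_mapped is our name, error handling is omitted, and parse_all is the copy-free parser sketched after the next paragraph):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stddef.h>

size_t parse_all(const char *data, size_t len, int *out);   /* see below */

int *read_mapped(const char *path, size_t *count)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* zero-copy mapping of the file, with a sequential-access hint */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    madvise(data, st.st_size, MADV_SEQUENTIAL);

    /* worst case: single digit + whitespace = 2 bytes per number */
    size_t max_elems = st.st_size / 2;
    int *matrix = malloc(max_elems * sizeof(int));

    *count = parse_all(data, st.st_size, matrix);
    matrix = realloc(matrix, *count * sizeof(int));   /* free the slack */

    munmap(data, st.st_size);
    close(fd);
    return matrix;
}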

A more subtle optimization can be done in regard to the parsing of each number. When manually iterating through the elements of a row, two pointers are used to mark the start and end of a valid number. When a whitespace is encountered, the characters between the two pointers are copied to an auxiliary buffer and atoi is applied. However, atoi can be applied directly to the pointer that designates the start of a number. Because atoi stops at the first character that cannot be processed (in this case a whitespace character such as ' ' or '\n'), the parsing is done correctly. This saves a significant number of strcpy/memcpy operations: a 10000 x 10000 input file has 100 million integers and would otherwise require 100 million extra copy operations. Although this optimization seems not to apply to the strtok approach, strtok also uses an auxiliary buffer and is susceptible to the same redundancy.
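
A sketch of that copy-free loop (parse_all is our name; it assumes the file ends with a whitespace character such as '\n', so atoi never scans past the mapping):

#include <stdlib.h>
#include <ctype.h>
#include <stddef.h>

/* Parse every whitespace-separated integer in data[0..len) without
   copying: atoi is applied directly to the mapped buffer and stops
   at the first character it cannot process. */
size_t parse_all(const char *data, size_t len, int *out)
{
    size_t pos = 0, count = 0;
    while (pos < len) {
        while (pos < len && isspace((unsigned char)data[pos]))
            pos++;                         /* skip separators */
        if (pos >= len)
            break;
        out[count++] = atoi(&data[pos]);   /* no strcpy/memcpy needed */
        if (data[pos] == '-')
            pos++;                         /* step over the sign... */
        while (pos < len && isdigit((unsigned char)data[pos]))
            pos++;                         /* ...and the digits */
    }
    return count;
}
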
The last optimization is tailored specifically to the maximum subarray problem. It consists of “pruning” the bordering lines and columns of the matrix that contain only negative numbers. Although pruning the columns has a very limited impact, and pruning the lines is a highly input-dependent optimization, the cases where it applies benefit from a noticeable speedup.

Finally, some results (reading the input only; random numbers):
Matrix          Size (MB)   Naive (sec)   Optimized (sec)
5000 x 5000     150         3.2           2.5
10000 x 10000   590         15.5          9.5


Kadane2D general multicore approach



The algorithm uses dynamic programming and searches through all possible combinations (x0, y0, x1, y1) for the subarray with the maximum sum. It is an extension of Kadane1D and has a complexity of O(row * row * col), or O(n^3) for a square matrix. Though there are algorithms that solve the maximum subarray problem with sub-cubic complexity (e.g., Tadao Takaoka's O(n^3 * sqrt(loglog(n) / log(n)))), Kadane is better suited for parallelism and easily recovers the difference.

1) Compute prefix sums on columns (parallel)
2) For each row combination (i, j), i < j, compute the difference diff[k] = A[j][k] - A[i][k] and search for the local max sum using Kadane1D (parallel)
3) From all the partial sums computed at 2), select the largest one (sequential); a sequential sketch of these steps follows below
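
For reference, here is a sequential sketch of these three steps on a linearized matrix (the kadane2d signature and the prefix layout are our choices; the caller provides prefix with (rows + 1) * cols entries):

#include <limits.h>

long long kadane2d(const int *a, int rows, int cols, long long *prefix)
{
    /* 1) prefix sums down each column: prefix[i][k] = sum of a[0..i)[k] */
    for (int k = 0; k < cols; k++)
        prefix[k] = 0;
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < cols; k++)
            prefix[(i + 1) * cols + k] = prefix[i * cols + k] + a[i * cols + k];

    /* 2) + 3) for every row pair (i, j), run Kadane1D on the column sums */
    long long best = LLONG_MIN;
    for (int i = 0; i < rows; i++)
        for (int j = i + 1; j <= rows; j++) {
            long long cur = 0, local = LLONG_MIN;
            for (int k = 0; k < cols; k++) {
                long long v = prefix[j * cols + k] - prefix[i * cols + k];
                cur = (cur + v > v) ? cur + v : v;   /* Kadane1D step */
                if (cur > local) local = cur;
            }
            if (local > best) best = local;
        }
    return best;
}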

General optimizations to Kadane2D
1) If row > col we flip the matrix so that col > row
2) If sides are all negative we ignore them (pruning)
3) If all the numbers are positive the solution is (0, 0, row_count, col_count)
4) If all the numbers are negative the solution is the max number (see the pre-scan sketch below)
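
A possible single-pass pre-scan for checks 3) and 4) could look like this (prescan and scan_t are illustrative names; the flags decide whether Kadane2D can be skipped entirely):

#include <limits.h>

typedef struct { int all_pos, all_neg, max_elem; } scan_t;

/* One pass over the linearized matrix: detect all-positive and
   all-negative inputs and remember the largest single element. */
scan_t prescan(const int *a, int rows, int cols)
{
    scan_t s = { 1, 1, INT_MIN };
    for (long long i = 0; i < (long long)rows * cols; i++) {
        if (a[i] < 0) s.all_pos = 0;
        if (a[i] > 0) s.all_neg = 0;
        if (a[i] > s.max_elem) s.max_elem = a[i];
    }
    return s;   /* all_pos -> whole matrix; all_neg -> max_elem alone */
}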

Parallel optimizations to Kadane2D
Kadane2D's algorithm consists mainly of 3 for loops, out of which the first 2 enumerate the (i, j) combinations of rows whilst the last – say k – iterates over the columns, where we apply Kadane1D.

for (int i = 0; i < row_count; i++)
    for (int j = i; j < row_count; j++)
        Kadane1D( k = 0 → col_count )

In other words, if we know the combination (i, j) of rows, we can apply Kadane1D using a single thread per combination. This leads to the question: “how do we split the work per thread into (i, j) combinations?” Since i = 0 → row_count and j = i → row_count, the combinations form a triangular space, which we can represent graphically.

A) Split the work in slices by first loop i
A simple way would be for each thread to get a slice, either by using a thread pool – somewhat inefficient – or simply by a formula such as sliceThread = Thread_TID, sliceThread += thread_num. This approach does not distribute the work equally amongst threads – for instance, in the example below the blue thread will always get less work than the red thread (a sketch follows the figure).

solutionA
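
A hedged OpenMP sketch of this slicing (solution_a and the work callback are illustrative; work(i, j) stands in for Kadane1D on the row pair):

#include <omp.h>

void solution_a(int row_count, void (*work)(int i, int j))
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int thread_num = omp_get_num_threads();
        /* sliceThread = tid; sliceThread += thread_num */
        for (int i = tid; i < row_count; i += thread_num)
            for (int j = i; j < row_count; j++)
                work(i, j);   /* slices with small i carry far more pairs */
    }
}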

B) Split the work in chunks by first loop – i (good/balanced)
The main problem we faced is how to split the work so that all threads receive a near equal amount of it. For instance, if we parallelize by “i”, the inner for becomes the work (for j = i → row_count, while i is fixed) and, past a certain index, more and more threads will receive less work. We have the following problem to solve: “given a half square, how can we evenly divide it so that all areas are equal?”

solutionB

One way to do this is to recursively compute the start and end positions so that all areas are equal. Counting positions backwards from the last row, the formula is:

x = sqrt( n_start^2 - n_start^2 / (threadNum - i) )

where n_start is the end position of the previous thread (the number of rows still unassigned) and i is the number/id of the current thread. We then have the following:

n_end = x, n_dim = n_start - x, end = start + n_dim.

For example, for a 100*100 matrix, we want to determine for each thread a combination (i_start, i_end) that it will work on. After recursively computing i_start and i_end for 4 threads we get approximately:

T0 (0, 14), T1 (15, 29), T2 (30, 50), T3 (51, 100)

What is interesting is that T0 with (0, 14) is nearly equal in work size to T3 with (51, 100), even though the difference i_end - i_start is quite different. This highlights how inefficient it would have been if every thread had received a chunk with an equal row count (T1_end - T1_start = T3_end - T3_start).
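
A small program sketching the recursive split for the 100-row, 4-thread example (integer rounding may shift the boundaries by one compared to the figures quoted above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    int row_count = 100, thread_num = 4;
    double n_start = row_count;   /* rows still unassigned, counted from the end */
    int start = 0;
    for (int i = 0; i < thread_num; i++) {
        double x = sqrt(n_start * n_start - n_start * n_start / (thread_num - i));
        int n_dim = (int)(n_start - x + 0.5);   /* rows given to this thread */
        int end = (i == thread_num - 1) ? row_count : start + n_dim;
        printf("T%d: (%d, %d)\n", i, start, end);
        start = end;
        n_start = x;
    }
    return 0;
}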

C) Split the work in group sequences
In this approach we group all threads and access the elements in a consecutive manner, from a logical thread-group point of view. Depending on the architecture and on the memory allocation, choosing to go on lines or on columns can have a significant impact on performance. The high efficiency of this method comes from data sharing: when going on columns, threads T1...TN work on combinations of the form {a0, a1}...{a0, aN} => {a0, a1, ..., aN} => N+1 chunks (columns) that need to be loaded into cache, fewer than in the previous methods, where the threads all demanded different regions, up to 2N chunks. Combining this with linear memory allocation leads to a significant performance increase compared to A) and B).

solutionC

We want to stress that {(0,1), (0,2), (0,3), (0,4)} is better than {(0,1), (10,11), (20,21), (40,41)} because it requires {0, 1, 2, 3, 4} = 5 chunks, compared to {0, 1, 10, 11, 20, 21, 40, 41} = 8 chunks.

thread_i = tid
while (thread_i < row_count * row_count)
{
    ... DO WORK = Kadane1D ...
    thread_i += thread_num
}
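
A hedged OpenMP version of this loop (solve_grouped is our name and kadane1d_on_pair is a hypothetical worker running Kadane1D on diff[k] = A[j][k] - A[i][k]):

#include <omp.h>
#include <limits.h>

long long kadane1d_on_pair(int i, int j);   /* hypothetical worker */

long long solve_grouped(int row_count)
{
    long long best = LLONG_MIN;
    long long n = (long long)row_count * row_count;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int thread_num = omp_get_num_threads();
        long long local_best = LLONG_MIN;
        /* consecutive linear indices: at any moment neighbouring
           threads work on pairs that share the same row i */
        for (long long idx = tid; idx < n; idx += thread_num) {
            int i = (int)(idx / row_count);
            int j = (int)(idx % row_count);
            if (i >= j) continue;            /* keep only pairs with i < j */
            long long s = kadane1d_on_pair(i, j);
            if (s > local_best) local_best = s;
        }
        #pragma omp critical
        if (local_best > best) best = local_best;
    }
    return best;
}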



Kadane2D performance analysis on MTL




As presented previously, we need to assign each thread a set of {(i, j) | i < j}, and we can do this in at least 3 ways: equal slices by i (solution A – inefficient), uneven chunks by i that are equal by Sum(i, j) (solution B), and finally by grouping zones so as to create redundancy (solution C).

Solution A does not balance the workload amongst threads and should yield bad results on anything but a single-core computer – the differences will increase as the core count increases. Comparing the latter two, B will yield weaker results than C due to the lack of data redundancy – in other words, B needs more bandwidth and cache than C.

On the MTL, working on a 10000x10000 matrix, solution C more than doubles the performance:
solution B (2N columns in cache) => 150 sec
solution C (N+1 columns in cache) => 60 sec

Considering this, the following questions arise:
1) Will “solution C” scale near perfectly on the MTL (given its high bandwidth and large caches)?
2) Will increasing redundancy help even more? - e.g., instead of {(0,1), (0,2), (0,3)} use {(0,1), (0,2), (1,2)}

Implementing solution C, we obtained the scores below on 10000x10000 and 5000x5000 matrices.

The answer to both questions is NO for a configuration similar to the MTL. This is due to the fact that the processors in a multiprocessor system have separate caches and the OS/middleware is in charge of thread distribution. Threads that require common elements can thus end up on different processors, and as the number of processors increases, solution C comes closer to solution B, which did not benefit from redundancy.

For the Intel MTL system, the conclusions for solution C are:

1) Spikes appear when work is not divided equally between processors – mainly due to cache misses and data distribution among processors
2) The best speedup is obtained up to 20 cores (2 processors), a moderate speedup up to 30 cores (3 processors), whilst only a modest speedup can be seen between 30 and 40 cores (3-4 processors)

10000 x 10000 scores

scores10K

scores10K_2

5000 x 5000 scores

scores5K

scores5K_2

Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products.