The classic 2D Kadane algorithm has a complexity of O(r²c), where r is the number of rows and c the number of columns.
We use it when there are more columns than rows; but instead of transposing the matrix in the opposite case, we developed a second algorithm that runs in O(c²r). It is essentially a transposition of the 2D Kadane algorithm:
for (size_t colStartIndex = 0; colStartIndex != numberOfCols; ++colStartIndex)
    for (size_t colEndIndex = colStartIndex; colEndIndex != numberOfCols; ++colEndIndex)
        // do one step of Kadane's 1D maximum subarray search (one step per row);
        // each 1D search is associated with a (colStartIndex, colEndIndex) pair
The position of the loop over rows is carefully chosen. It is not the innermost loop, because that would give very bad cache locality; and since it cannot be parallelized, we did not make it the outermost loop either.
We assumed that reading the file was a really slow, sequential process, and our work is based on that assumption. The other teams have since shown that this is not true, at least on the MTL.
Our goal was to start searching for the maximum subarray before the file was entirely read. We modified our algorithm implementations to work, in order, on slices of the matrix coming from the pipeline.
The pipeline reads slices of the input file serially; each slice is then parsed in parallel and handed to the search.
We chose to work on file slices smaller than the L3 cache and used circular buffers to avoid memory reallocation.
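The reuse idea can be sketched with a minimal fixed-capacity circular buffer: slots are allocated once and recycled, so passing slices through the pipeline never triggers a reallocation. The class and its interface are hypothetical, not our actual buffer:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative fixed-capacity circular buffer: Capacity slots are allocated
// once, and every push reuses a slot freed by an earlier pop.
template <typename T, std::size_t Capacity>
class CircularBuffer {
    std::array<T, Capacity> slots_{};
    std::size_t head_ = 0, size_ = 0;
public:
    bool full() const { return size_ == Capacity; }
    bool empty() const { return size_ == 0; }
    void push(const T& value) {  // caller must check full() first
        assert(!full());
        slots_[(head_ + size_) % Capacity] = value;
        ++size_;
    }
    T pop() {  // FIFO pop; the freed slot becomes reusable
        assert(!empty());
        T value = slots_[head_];
        head_ = (head_ + 1) % Capacity;
        --size_;
        return value;
    }
};
```

In the real pipeline the number of in-flight slices is bounded by the token count, which is what makes a fixed capacity sufficient.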
The O(c²r) algorithm does not need to remember the previous slice of the matrix: only two values, indexed by the (colStartIndex, colEndIndex) pair, are needed to continue each 1D Kadane search. The memory footprint is therefore independent of the number of rows.
Here is the pipeline definition for the O(c²r) implementation:

    InputFileReader(file, textSlices, ntoken)
        // read a chunk of the file (n rows)
        // parse this chunk to fill the corresponding matrix,
        // with a prefix sum done on each line
    MaximumSubarraySearchCCRStep(&kadaneSavedResults, &numberOfRows, &result)
        // add the matrix to the search
We already parallelized the parsing of the matrix through the pipeline. For the search itself, we used tbb::parallel_reduce() on the outermost loop (over colStartIndex in both algorithms). Given the triangular load balance of the problem, we defined a custom TBB range.
The size of this range is computed from its length multiplied by its position. The grainsize of the parallel_reduce is divided by the number of rows for the O(c²r) algorithm and by the number of columns for the other one. Splitting the range is done at the first third instead of the middle.
This is a bit tricky, but a range should stay as simple as possible, because its split function is called a very large number of times. All of our attempts at smarter splitting involving more computation failed to improve performance.
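The idea can be sketched with a TBB-style splittable range that cuts at the first third. Early colStartIndex values cover longer colEndIndex spans and thus more work, so the left part keeps fewer indices. The class name, the tag type, and the exact heuristic are illustrative, not our actual range:

```cpp
#include <cstddef>

// Stand-in for tbb::split (the tag TBB passes to a splitting constructor).
struct split_tag {};

// TBB-style splittable range over colStartIndex, kept deliberately cheap
// because the splitting constructor is called a very large number of times.
class TriangularRange {
    std::size_t begin_, end_, grainsize_;
public:
    TriangularRange(std::size_t b, std::size_t e, std::size_t grainsize)
        : begin_(b), end_(e), grainsize_(grainsize) {}
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
    bool empty() const { return begin_ == end_; }
    bool is_divisible() const { return end_ - begin_ > grainsize_; }
    // Splitting constructor: `r` keeps the first third (fewer indices but
    // more work per index), the new range takes the remaining two thirds.
    TriangularRange(TriangularRange& r, split_tag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 3),
          end_(r.end_), grainsize_(r.grainsize_) {
        r.end_ = begin_;
    }
};
```

A class with this interface (begin/end, empty, is_divisible, and a splitting constructor) satisfies the Range concept that tbb::parallel_reduce expects.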
We did not have time to try using parallel tasks directly instead of ranges and parallel_reduce. It might perform much better, allowing finer task splitting with less overhead.
Our program scaled very well, but there is significant overhead that does not depend on the number of cores. Here are some results on 40 cores:
- 1000x1000 : 1.64s user 0.06s system 2022% cpu 0.084 total
- 2000x2000 : 11.69s user 0.09s system 3138% cpu 0.375 total
- 4000x4000 : 108.18s user 0.25s system 3684% cpu 2.943 total
- 8000x8000 : 758.06s user 1.21s system 3865% cpu 19.642 total
- 10000x10000 : 1431.10s user 1.79s system 3895% cpu 36.781 total
The source code is available here: maxSubarraySearch-TBB-Pipelines.zip