Subarray

The subarray problem on a n*m matrix is sequentially solved using an algorithm known as the Kadane 2D algorithm. This algorithm has a O(n²m) complexity. The sequential algorithm is written using 3 loops :

for i in (0..n) // <- We parallelize that

for j in (i..n)

for k in (0..m)

//do work with matrix[j][k]

Our solution does not try to optimize the work performed inside the inner loop, so we skip the details of what is actually done inside. We chose to parallelize only the outer loop (index

**i**).In order to parallelize the outer loop on K cores, we chose to split it into K tasks of equal duration. This approach has several advantages :

- The algorithm is very simple : there is no need to steal work or do complex load balancing between the K cores.
- Each thread works on big continuous portions of the matrix, which maximizes cache usage.
- We know in advance what the threads are going to do and which data are going to be accessed so we can do smart NUMA optimizations.

In this article, we explain: how we achieved to split the work into K equal tasks and how we optimized the treatment of these tasks.

## 1-Creating K tasks of equal duration

Fig. 1 - K=4 equal areas in a triangle | In order to split a for i (0..n) loop into K tasks, one often create K tasks [i=0..n/K],[i=n/K..2*n/K]...[i=(K-1)*n/K,K]. However, this simple solution does not work well in our case because the second loop (index j) starts at index i. This means that when i==0, n iterations are done in the second loop and when i==n-1 only 1 iteration is done in the second loop! The amount of work depending on i is represented in Figure 1. This figure represents an example of the work to be done on a 250*m matrix. When i==0, 250 iterations are done; when i==250, only 1 iteration is done. The total quantity of work to be done is equal to the area of the triangle.Splitting the work into K equal tasks is equivalent to creating K equal areas inside the above mentioned triangle. For example, in Figure 1, representing the work to be done on a 250*m matrix, a close-to-optimal partionning is the following : - Thread 0 doing i (0-34) = 7939
**j**iterations (area A1) - Thread 1 doing i (34-74) = 7875
**j**iterations (area A2) - Thread 2 doing i (74-125) = 7860
**j**iterations (area A3) - Thread 3 doing i (125-250) = 7701
**j**iterations (area A4)
With this partionning, there is at most a 3% difference in the number of iterations performed by each thread. |

In order to find the last index that a thread

**idx**should process (e.g., 34 for thread 0 in the above example), we use the following formula:

int last_index = 0;

do {

last_index++;

} while((last_index)*(n) - (last_index+1)*(last_index)/2 < (idx+1) * n * (n - 1) / 2 / K);

Where n in the number of lines of the matrix, idx is the thread number and K the number of threads.

This loop increments last_index until the amount of work done between i=0 and i=last_index is equal to idx*(total-work-to-be-done/number-of-workers). The calculation of "the amount of work done" is the calculation of the area of a trapeze. (E.g., on figure 1 the area A1, the work done by thread 0, represents the area of a trapeze.)

Actually this could also be calculated with the following formula:

... but is is actually slower than doing the loop! (We think that the compiler is doing really smart things and that the loop is actually optimized and transformed into a much more efficient formula.)

The algorithm has an near optimal speedup between 10 and 40 cores (x3.94) and between 1 and 40 cores (x36.8). This means that, according to Amdhal's law, more than 99.77% of our code is parallel. For those interested, it takes 5.9s at 40 cores to parse a 10K*10K matrix.

We think that the speedup seen by the Intel team might have been a little lower due to our static partitionning of data: on the final test 2 cores were fully loaded, which means that our partionning was no longer optimal. Nevertheless our solution seems to have behaved quite nicely even when (intuitively) load balancing could have been required.

last_index = 2*n - (√(4*n*n-4*n+1)*K*K+((-4*idx-4)*n*n+(4*idx+4)*n)*K+(2*n-1)*K)/(2*K);

... but is is actually slower than doing the loop! (We think that the compiler is doing really smart things and that the loop is actually optimized and transformed into a much more efficient formula.)

## 2-NUMA optimizations

As mentioned earlier, we also do NUMA optimizations. :) In order to improve the locality of the memory accessed by the threads, we have:

- Created a thread pool per NUMA node in the system. Each thread pool is totally independent from the others. Each thread pool is controlled by a master thread scheduled on the same NUMA node as the pool it controls.
- The creation of the K tasks is done in parallel by each master thread (actually each thread creates K/4 tasks since there is 4 NUMA nodes on the MTL).
- Before giving the tasks to its workers, each master thread
**duplicates the matrix on the local NUMA node**. This ensures that, when the matrix does not fit in cache, the worker threads fetch data from their local memory. This optimization actually give a**+25%**performance boost at 40 cores. Lessons learned: pay attention to the data locality. ;) - (Note for those who might think that it is an incredible waste of memory: a 10K*10K matrix occupies 380MB in RAM. The MTL machines has 64GB. So one copy per node = a "waste" of 1.5GB = 2.3% of the memory of the machine = really negligible compared to the gain.)

## 3-Other performance optimizations

- Our approach falls back on the sequential algorithm when the parallel algorithm is considered too costly. (E.g. the cost of duplicating the matrix and managing the thread pool cannot be amortized.)
- Since the subarray algorithm is of O(n²m) complexity, it is sometimes worth to transpose the matrix before any computation, in order to have n<m. Experiments showed that transposing becomes worthy as soon as the difference in complexity is above 5K operations.
- Both reading and transposing the matrix are done in parallel using our thread pool. The input file is mapped in memory and each reader thread is responsible to parse 800Ko of the input file and creates a partial matrix corresponding to what it has read. All submatrices are then merged using a simple memcpy operation.

## 4-Figure for nerds

Time to present some results!

**Fig 2**-

*Speedup of our algorithm on a 10K*10K matrix*

The algorithm has an near optimal speedup between 10 and 40 cores (x3.94) and between 1 and 40 cores (x36.8). This means that, according to Amdhal's law, more than 99.77% of our code is parallel. For those interested, it takes 5.9s at 40 cores to parse a 10K*10K matrix.

We think that the speedup seen by the Intel team might have been a little lower due to our static partitionning of data: on the final test 2 cores were fully loaded, which means that our partionning was no longer optimal. Nevertheless our solution seems to have behaved quite nicely even when (intuitively) load balancing could have been required.

## 5-Code

Finally, here's a link to our code