Extended addition (described below ) is one of the most performance critical kernel in our code (that implements important functions from sparse linear algebra).

AddEx(double* A, int LDA, double *B, int LDB, int*C, ...) { for (jj = 0; jj < col; ++jj) { /*if jj segment is not empty */ if (seg[jj]) { for (i = 0; i < row; ++i) { A[C[i]] -= B[i]; } B += LDB; } A += LDA; } }

Compared, to something like SpMV ( which reads from indirect addresses), this both reads and write to indirect memory addresses

In general, we perform a number of extended additions operations on independent blocks A_{i} and B_{i} concurrently using openMP parallel for.

Assuming A[C[i]] -= B[i] takes 2 reads and 1 write ( C[i] is assumed to be in cache ) , so in total 3*row*col memory ops, measuring time shows low bandwidth of around 52 GB/sec obtained. While there can be load imbalance among openMP threads, but still 52 GB/sec is quite low. I seek suggestions to improve it. My experience with SIMD instructions is limited, however, I tried as follows.

I took the inner loop

for (i = 0; i < row; ++i) { A[C[i]] -= B[i]; }

and replaced it with following SIMDized loop

__m512i v_rel; __m512d v_A; __m512d v_B; __mmask8 mask; int row_8 = row/8*8; for (i = 0; i < row_8; i += 8) { v_rel = _mm512_extloadunpacklo_epi32 (v_rel, &C[i], _MM_UPCONV_EPI32_NONE, _MM_HINT_NT); v_rel = _mm512_extloadunpackhi_epi32 (v_rel, &C[i + 16], _MM_UPCONV_EPI32_NONE, _MM_HINT_NT); v_B = _mm512_extloadunpacklo_pd (v_B, &B[i], _MM_UPCONV_PD_NONE, _MM_HINT_NT); v_B = _mm512_extloadunpackhi_pd (v_B, &B[i + 8], _MM_UPCONV_PD_NONE, _MM_HINT_NT); v_A = _mm512_i32logather_pd (v_rel, A, _MM_SCALE_8); v_A = _mm512_sub_pd (v_A, v_B); _mm512_i32loscatter_pd (A, v_rel, v_A, _MM_SCALE_8); } /*handling remainders*/ ....

The performance improves using SIMD and now it is around 60 GB/sec. There seems to be plently of room for improvement.

Upper bound on sizes of row and col is ~128 (defined by user ) (while LDA and LDB can be large, only a continues portion of length upto 128 of B maps to upto 128 contgeous portion of A). The source vector B has fewer rows than A. In general C[i] is not monotonous (i.e. C[i]>=C[j] if i>j does not hold) .