# Question about vectorization of loops and general culture about floating-point arithmetic

## Question about vectorization of loops and general culture about floating-point arithmetic

Greetings,

I am developing an API for the implementation of Mimetic finite Differences as part of my Ph.D. research. I am currently using:

[ejspeiro@node01 ~]\$ icpc --version
icpc (ICC) 11.0 20081105

As I was coding a CCS Matrix implementation for my API, I came across a curious behavior: Given an array of length num_values_, say we need to count the number of zeros and non-zeros: My first raw approach was to insert an if clause testing for each values, as I traverse the array, but I noticed that I could just keep track of the non-zero values, and subtract this number from the total values. I thought I could then divide each value by itself thus creating a one, which could be casted out to int, and then be accumulated:

for (int ii = 0; ii < num_values_; ii++) {
num_non_zero_ = num_non_zero_ + (int) (array[ii]/array[ii]);
}

But this clearly would yield an error because in the case of array[ii] being equal to zero, this code may crash... although it has not happened yet, while compiling with icpc... so my first question is... why? So I changed it to:

for (int ii = 0; ii < num_values_; ii++) {
num_zero_ = num_zero_ + !(array[ii]);
}

In here, we keep track of of the number of zero values, under the assumption that denying any double number will result in 0, whereas the only way that such negation results in 1, is for the number to be 0.0. However, in the first approach, icpc will actually vectorize such loop:

icpc -c -O2 -Iinclude -MMD -MP -MF build/Release/Intel-Linux-x86/src/mtk_ccs_matrix.o.d -o build/Release/Intel-Linux-x86/src/mtk_ccs_matrix.o src/mtk_ccs_matrix.cc
src/mtk_ccs_matrix.cc(64): (col. 22) remark: LOOP WAS VECTORIZED.
src/mtk_ccs_matrix.cc(91): (col. 3) remark: LOOP WAS VECTORIZED.

However under my corrected approach, it does not...what do you guys think? How correct is my assumption on denying double values?

PS: If anyone have a comment on how to post better looking code, I would really appreciate it :) Thanks!

Eduardo J. Sanchez P.
publicaciones de 10 / 0 nuevos
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

Quick comment: I just noticed how to highlight code, but I can't find how to edit my Original Post :( Sorry!

Eduardo J. Sanchez P.

The forum appears not to permit editing of original posts, except possibly by a complicated procedure available only to certain moderators.

Looks are certainly in the eye of the beholder, but I'd prefer num_non_zero += (array[ii] != 0); as a clearer expression than !array[ii] (which seems more appropriate to boolean data types).

You could write

if(array[ii] != 0) ++num_non_zero;

If (annoyingly) -vec-report1 quotes "protects exception" as a reason for not vectorizing,  #pragma vector always should take care of it.

I'm not enough of an expert to know whether there's really a potential benefit to vectorizing this.  You should separate artificial goals of maximizing numbers extracted from a vectorization report from the real goals of your dissertation.

`num_non_zero += (array[ii] != 0);`

Wouldn't it be better to avoid the comparison among doubles? On my way to the office I thought about this suggestion, and maybe I could do this:

`num_non_zero += (abs(array[jj*(num_cols_) + ii] - 0.0) > tol);`

You should separate artificial goals of maximizing numbers extracted from a vectorization report from the real goals of your dissertation.

You are right. My actual goal is to improve my coding, while understanding the functioning of the icpc.

Thanks!

Eduardo J. Sanchez P.

Oops! Wrong chunk of code! With my second snippet of code I actually meant this:

`abs(array[ii] - 0.0) > tol`

Sorry :(

Eduardo J. Sanchez P.

As I tried this, I noticed that I should use '<' instead of '>'. However, with this case, and without utilzing the #pragma, I still do not get a vectorized loop, which once again, it is not necessary for my dissertation, but is curious nonetheless.

Thanks.

Eduardo J. Sanchez P.

I don't know whether the compiler will optimize away the - 0.0.

I assume num_non_zero is typed int.  If so, the latency of the operation is much less than for sum reduction with floating point data types, and there may not be suitable instructions for vectorization with the mixture of floating point and integer operations.  The vec-report2 might even give you a hint in that respect.

Assuming double array[] the initial fragment would be easily vectorizable, producing a hidden vector of type int on which you want to do a sum reduction, so it's fairly complicated.

This reminds me of the people who wrote Cray-I vectorizable code which was slower than optimized non-vector code on their 68020 workstation just to be able to brag about the appearance of their vectorization report.

TimP:

This reminds me of the people who wrote Cray-I vectorizable code which was slower than optimized non-vector code on their 68020 workstation just to be able to brag about the appearance of their vectorization report.

Sorry if I made it look like I was actually trying to attain the vectorization, just for bragging purposes. I do not... Again, I am simply trying to understand these concepts better.

Thanks.

Eduardo J. Sanchez P.

Eduardo,

Something you might try is to enable Cilk++, and not use Cilk++, rather use a feature of Cilk++ called CEAN to get C Array Notations. Then (untested code)

const int VecWidth = ... // 4, 8, 16, ???

int VecCounts[VecWidth];
VecCounts[0:VecWidth] = 0; // [start:length]
int iEnd = num_values - (num_values%VecWidth);
for(int ii=0; ii < iEnd; ii += VecWidth)
{ VecCounts[0:VecWidth] += (Array[ii:VecWidth] == 0) ? 1 : 0; }
if(iEnd != num_values)
VecCounts[iEnd:num_values-iEnd] += (Array[iEnd:num_values-iEnd] == 0) ? 1 : 0;

You can experiment with multiples of the SSE/AVX/AVX2 vector width

Align VecCounts and Array if possible.

Jim Dempsey

primary effects of CEAN notation toward vectorization are implied effects similar to #pragma simd:

cost model suspended (as with #pragma vector always)

aliasing analysis suspended, thus no need for __restrict (as with #pragma ivdep; this would not apply in this example, as you don't store through pointers)

associativity according to K&R over-riding -fp-model

availability of Cilk(tm) Plus reducers in addition to simd reduction

alignment of CEAN can be asserted only by __assume_aligned()

under AVX, CEAN may switch from AVX256 to AVX128 to improve optimization of suspected misalignment

## Deje un comentario

Por favor inicie sesión para agregar un comentario. ¿No es socio?