Vectorize Loop : Pointer+Offset => Mixed Data Types ???

Hello,

I'm new to vectorization and the ICC compiler. Using build 8.0.055.

After some diligence, I have tracked down a mixed-data-types message
to handle an equation of the type
(float *) = (float *) + (integer offset);

When I replace the integer offset with 0, I get successful vectorization; of course
that is not the equation I want. I've tried all kinds of casts and also
constructs like &pArray[intOffset], all to no avail.

Is there a right way to do this?
Is it really not possible to vectorize expressions like this?
Is it a bug?

The actual code looks like...

for (ifa=0; ifa < KFA; ifa++) {
float *pH2 = pHisto+(*pAY++);// *pAY is unsigned short
qst+= ((*pH2)*(*pAC++)*(aTC));// everything else is float or float*
}
Thanks for any help !
-rajeev-
Rajeev Rohatgi
rohatgi@photodetection.com


Did you try assigning the unsigned short value to a local float outside the loop, assuming that it is loop invariant?

Tim,

Thanks for your super-prompt reply. But alas *pAY is not loop-invariant
(note: *pAY++). What I have is a big input table (pHisto), a list of offsets
(pAY) and a list of coefficients (pAC) which I need to multiply and accumulate.
qst gets stored to an output table outside the loop.
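For concreteness, a minimal self-contained sketch of this kernel (the sizes, table contents, and the function name doit_sketch are invented for illustration, not from the real application) might look like:

```c
/* Minimal standalone sketch of the kernel described above: an input
   table (pHisto), a list of offsets (pAY), and a list of coefficients
   (pAC), combined in a multiply-accumulate. All data is made up. */
#define KFA 8u

static float pHisto[16];
static unsigned short pAY[KFA] = {0, 2, 4, 6, 1, 3, 5, 7}; /* offsets */
static float pAC[KFA] = {1, 1, 1, 1, 1, 1, 1, 1};          /* coefficients */

float doit_sketch(float aTC) {
    unsigned int ifa;
    float qst = 0.0f;
    for (ifa = 0; ifa < 16; ifa++)
        pHisto[ifa] = (float)ifa;       /* fill the input table */
    for (ifa = 0; ifa < KFA; ifa++) {
        float *pH2 = pHisto + pAY[ifa]; /* indirect load via offset */
        qst += (*pH2) * pAC[ifa] * aTC; /* multiply and accumulate */
    }
    return qst;
}
```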

Regards,
-rajeev-
Rajeev Rohatgi

Dear Rajeev,

If the compiler is able to report a mixed data type, then this is obviously not a bug, but simply a missing feature :-) in this version of the compiler (incidentally, the conversion of an [un]signed short into a float is supported within a computational context, but that does not help your example). Your specific example has a few complications that make straightforward vectorization with the current SIMD extensions difficult. In particular, the sequence

float *pH2 = pHisto+(*pAY++);

.. *pH2 ..

really induces an indirect load (i.e. a subscripted subscript, viz. *(pHisto+pAY[ifa]) in the second dereference), which is hard to implement efficiently in SIMD form. Below, I have crafted your example into code that will vectorize by default (due to the pragma), but I doubt it will exhibit any speedup in this form (give it a try though; I hope I've got all the data types right).

float doit(unsigned int KFA,
unsigned short *pAY,
float *pAC,
float *pHisto, float aTC) {
unsigned int ifa;
float qst = 0;
#pragma vector always // overrides all efficiency heuristics of the vectorizer
for (ifa=0; ifa < KFA; ifa++) {
float *pH2 = pHisto+(pAY[ifa]);
qst += ((*pH2)*(pAC[ifa])*(aTC));
}
return qst;
}

Maybe your code has certain characteristics that can be exploited to rewrite this loop into code that allows for more efficient vector code? As in an earlier posting, I would like to recommend the following Intel Press book as additional reading (it contains a full section on programming guidelines to make code more amenable to efficient vectorization):

The Software Vectorization Book http://www.intel.com/intelpress/sum_vmmx.htm

Let me know if this was useful.

Aart

Aart,

Thank you for your prompt and insightful reply. Certainly one useful aspect
has been to help me understand what may and may not be achievable, so I
can move forward and try something else. I will try the pragma always as
you suggest, also I am wondering if I may see a benefit to building a vector
of *pH2 prior to entering the loop, ie break my loop into two loops running one
after the other. (I wish it were as easy as just prebuilding a full pH2 table,
however pHisto changes on each iteration of the next-from-inner-loop.) Thank
you also for mentioning your book.

Regards,
-rajeev-
Rajeev Rohatgi

Indeed, if the maximum possible trip-count is somehow known (or a temporary array is made available via other means), then the second loop in

float temp[MAXNEEDED];
for (ifa=0;ifa < KFA;ifa++) {
temp[ifa] = *(pHisto+(pAY[ifa]));
}
for (ifa=0;ifa < KFA;ifa++) {
qst += ((temp[ifa])*(pAC[ifa])*(aTC));
}

will vectorize by default (making it more likely that some speedup will result for this reduction). Such (and other) relatively simple rewritings can have a big impact on how well the vectorizer performs. Ideally, we would like to do all transformations automatically, but until that great day, user rewriting is always greatly appreciated ;-).
Aart

Aart,

Thanks for the followup. Indeed I'm happy to rewrite... if I have a clue as to what might work and what things are problematic. What I'm doing right now
feels like (a) guessing alternate constructions (b) guessing what the diagnostic messages might be trying to tell me.

[A]. Sorry, but I haven't had success with your first proposed formulation.

qst= *pS2;
#if BWDEEEFORCE
#pragma vector always
for (ifa=0; ifa < KFA; ifa++) {
const float *pH2= pHisto+(*pAY++);
//const float *pH2= pHisto+pAY[ifa];
qst+= (*pH2)*(*pAC++)*(aTC);
//qst+= (*pH2)*(pAC[ifa])*(aTC);
}
#endif
*pS2= qst;
(Other than not putting the code into its own function, I believe I've captured
the essential details of your proposal.)

I wasn't able to get the above to vectorize, as you can see I tried different
combinations of pAx[ifa] and *pAx++ and always got either the "too complex"
message or the "mixed datatypes" message.

So what's "mixed datatypes" all about? I get the feeling the vectorizer is saying
it can either vectorize integer addition or floating-point multiplication, but not both. But then how is it able to do any of the standard examples involving
expressions like A[K]*B[K]? It seems to me this requires being able to do
pointer addition (or incrementing) _and_ dereference _and_ float multiplication. I'm sure I don't understand...

[B]. So I'm trying to split loops now and now reached

#ifdef BWDEEESPLIT
const float **pH3;
float *pH2;
#pragma ivdep
for (ifa=0, pH3=pH2Base; ifa < KFA; ifa++) {
*pH3++= pHisto+(*pAY++);
}
#pragma ivdep
for (ifa=0, pH3=pH2Base, pH2=pH1Base; ifa < KFA; ifa++) {
//const float *pH0= *pH3++;
//*pH2++= *pH0;
const float *pH0= pH2Base[ifa];
pH1Base[ifa]= *pH0;// :-( dereference too complex
}
for (ifa=0, pH2=pH1Base; ifa < KFA; ifa++) {
qst += (*pH2++)*(*pAC++)*(aTC);
}
#endif
Here pH1Base is a malloc'ed array of float and pH2Base is an array of float*.
The first and 3rd loop vectorize OK but the middle loop won't go.

Any more thoughts ?

Thanks,
-rajeev-
Rajeev Rohatgi

And the speed is...

Slower by 33% compared to vectorizing off. This is for the split loops with 2 out of 3 loops being vectorized. (So as it stands right now gcc 3.4.2 is still ahead by about 10%.)

Regards,
-rajeev-
Rajeev Rohatgi

Rajeev,

[A] The pragma-version with subscripts vectorizes with 8.1:

#pragma vector always
for (ifa=0;ifa < KFA;ifa++) {
const float *pH2= pHisto+pAY[ifa];
qst += (*pH2)*(pAC[ifa])*(aTC);
}

r.c(26) : (col. 1) remark: LOOP WAS VECTORIZED.

But it may not vectorize with 8.0 (I realize this is said way too often at this forum, but I just happen to have improved the indirect addressing detection lately :-). The pragma-version with pointer induction (viz. *pAY++ and *pAC++) should vectorize as well but, although the former construct is correctly normalized, the latter apparently is not. I made a note of this.

[B] I doubt splitting the loop into three rather than two (just to extract parallelism from the light-weight address computations) will pay off. Furthermore, I do not understand why you want the second loop to vectorize (it is indirectly addressed!). Did you try my 2-loop solution, where only the floating-point reduction itself is vectorized?

Aart,

[A] The pragma-version with subscripts vectorizes with 8.1

No apology necessary. Can you point me to a specific build and tell me where I should find it ?

[B] I think I tried the two loop solution, but it's easy enough to go back and try it again. Maybe I should wait and try it with 8.1 ? But I'm not sure I understand the remark about indirect addressing.... the computation steps are the same in all incarnations, just the order of doing the loops is different. I think I figured that if it refused to vectorize any of the pieces, the compiler wouldn't vectorize the whole package. But maybe this isn't the right way to think about it...

What I would really like is to have a single loop, and for the compiler to say: OK, I can vectorize the first and third steps, so that's what I'll do; the middle step (the dereference) I'll emit as 4 separate instructions because I can't do any better. Or does that waste too much
time waiting for fetches to complete? (FYI I'm hoping to move two loop levels out and use some parallel threading -- but that aspect is stuck on a link issue which I'm also waiting for help on. I mention this because a second thread might be able to make good use of a few wasted cycles.)

And I'd be willing to stick in a few lines of assembly if that would do the trick, but I'm a little out of my depth on that.

Once again thank you for your continuing help,
-rajeev-
Rajeev Rohatgi

Aart,

OK here's the story on splitting the loop into two loops (previous msg was
posted from home last night): my earlier split was different from your suggestion, I'd had the *pH2 (second dereference) kept with the multiplications (thinking that it was incompatible with the pointer+offset
calculation). So this morning I reorganized exactly the way you'd suggested
and got a slight improvement in performance (compared to no vectorization)

I **really** want to try a partially vectorized _single_ loop.

#define NEEESPLIT 2
#if NEEESPLIT==2 // Aart Bik recommendation
#pragma ivdep // does not vectorize
for (ifa=0, pH2=pH1Base; ifa < KFA; ifa++) {
//*pH2++ = *(pHisto+(*pAY++));
//*pH2++ = *(pHisto+(pAY[ifa]));
//pH2[ifa]= *(pHisto+(*pAY++));
pH2[ifa] = *(pHisto+(pAY[ifa]));
}
#pragma ivdep // does vectorize
for (ifa=0, pH2=pH1Base; ifa < KFA; ifa++) {
//qst+= (*pH2++)*(*pAC++)*(aTC);
//qst+= (pH2[ifa])*(*pAC++)*(aTC);
qst+= (pH2[ifa])*(pAC[ifa])*(aTC);
//qst+= (*pH2++)*(pAC[ifa])*(aTC);
}
#else
#if NEEESPLIT==3
<...snip...>
#endif

And here are results for my sample job, all times in seconds. Methodology is
to run the test job several times in succession, report the median time and
standard deviation for the compute phase (ie excluding Input file read and
Output file write, which are on local disk anyway). My test system is a
2.80GHz Xeon(P4) that reports 4 processors and 512kB cache. OS is RedHat
Enterprise Linux 3.0 and although configured for multiple users I am the
only user on it. I have reported times both for last Friday and today
because of the curious (but not huge) variations in the test. The earlier
builds are untouched from 9/10. Although I typically see LSB level
variations in the computed output between builds/platforms/optimizations/
compilers, it is worth noting that builds 816 (gcc) and 836 (icc no Vectorize)
produce _identical_ output. Yes there's probably room for tuning the icc
options.

                         Sep 10          Sep 14
Best gcc 3.4.2           48.60+/-0.38    46.22+/-0.61   (build 816)
icc 8.0.055 O3 xN        52.82+/-0.16    54.30+/-1.55   (build 836)
 +Vectorize (Split=3)    65.62+/-1.97
 +Vectorize (Split=2)    -n/a-           52.23+/-0.74   (build 1040)

Regards,

-rajeev-
Rajeev Rohatgi

Message Edited by rohatgi@photodetection.com on 09-14-2004 06:58 AM

Rajeev,

Memory references with, e.g., a non-unit stride or a subscripted subscript are indeed "vectorized" by loading the elements sequentially, and then shuffling the components into a vector for further SIMD processing. Such cases (and others, like very unaligned memory references) induce a lot of overhead and, hence, are typically rejected by the efficiency heuristics (yielding the message "vectorization possible but seems inefficient"). Vectorization can nevertheless be forced with pragmas to enable the programmer to experiment with whether the heuristic was right or not.
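As a rough illustration of that sequential-load-then-pack strategy, the following plain-C sketch mimics what forced vectorization of a subscripted subscript has to do (the 4-wide grouping, the remainder handling, and all names are illustrative only; a real compiler emits SSE loads and shuffles rather than this portable C):

```c
/* Sketch of "vectorizing" an indirect load: gather VL elements with
   scalar loads, then operate on the packed group as a unit. */
#define VL 4 /* SSE vector length for 32-bit floats */

float gather_mac(const float *histo, const unsigned short *idx,
                 const float *coef, unsigned int n) {
    float acc[VL] = {0, 0, 0, 0};
    float qst = 0.0f;
    unsigned int i, k;

    for (i = 0; i + VL <= n; i += VL) {
        float packed[VL];
        /* "gather": four scalar loads fill one vector's worth of data */
        for (k = 0; k < VL; k++)
            packed[k] = histo[idx[i + k]];
        /* once packed, the multiply-accumulate is plain SIMD work */
        for (k = 0; k < VL; k++)
            acc[k] += packed[k] * coef[i + k];
    }
    for (k = 0; k < VL; k++)
        qst += acc[k];                       /* horizontal reduction */
    for (; i < n; i++)
        qst += histo[idx[i]] * coef[i];      /* scalar remainder loop */
    return qst;
}
```

The scalar gather loop is exactly the overhead the efficiency heuristics object to.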

As for your example, I timed the following code on a 3GHz Pentium 4 Processor for increasing values of KFA (and sufficiently large N of course).

static float tmp[N];
static float pH2[N];
static float pAC[N];
static float pHisto[N];

static unsigned short pAY[N];

float doit(unsigned int KFA, float aTC) {
unsigned int ifa;
float qst = 0;
for (ifa=0;ifa < KFA;ifa++) {
tmp[ifa] = *(pHisto+(pAY[ifa]));
}
for (ifa=0;ifa < KFA;ifa++) {
qst += ((tmp[ifa])*(pAC[ifa])*(aTC));
}
return qst;
}

The speedup of vectorization over sequential execution of doit() (viz xN over O2) is listed in the table below. Although my code fragment is clearly a simplification of the real-life situation, this experiment shows potential speedup for any loop with more than 16 iterations, going up to more than twice as fast.

KFA S
=========
1 0.67
2 0.75
4 0.80
8 0.76
16 0.82
32 1.30
64 1.60
128 1.81
256 2.03
512 2.26
1K 1.81
2K 1.89
4K 1.93
8K 1.94
--
Aart

Rajeev,

Two more suggestions for improvement (I read your last message only after I had posted my speedup reply).

(1) If the memory references through pAC[] and tmp[] always start at a 16-byte aligned boundary (as enforced by the compiler in my code, but this may become unclear due to the use of pointer arithmetic), adding #pragma vector aligned will save the overhead of the runtime alignment optimizations that are applied by the vectorizer.

(2) As you stated in another thread, express a reduction of the form

for (i=0; i < N; i++)
r += a[i] * b[i] * c;

as

for (i=0; i < N; i++)
r += a[i] * b[i];
r *= c;

instead. Compilers are generally good at rewriting expressions or hoisting invariant computations out of a loop, but in this case the simplification of (a0*b0*c + a1*b1*c + a2*b2*c + ...) into (a0*b0 + a1*b1 + a2*b2 + ...) * c requires a simplification over the iterations of a reduction that is not done (yet) by the compiler.
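The equivalence of the two forms can be checked with a small sketch (function names and data are invented for illustration; with general floating-point data the two variants may of course differ slightly in rounding):

```c
/* Reduction with the loop-invariant factor inside the loop
   (original form: one extra multiply per iteration). */
float mac_inside(const float *a, const float *b, float c, unsigned int n) {
    float r = 0.0f;
    unsigned int i;
    for (i = 0; i < n; i++)
        r += a[i] * b[i] * c;
    return r;
}

/* Same reduction with the factor hoisted out of the loop
   (suggested form: a single post-multiplication). */
float mac_hoisted(const float *a, const float *b, float c, unsigned int n) {
    float r = 0.0f;
    unsigned int i;
    for (i = 0; i < n; i++)
        r += a[i] * b[i];
    return r * c;
}
```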
Aart

Aart,

As always, thank you for your continuing interest. Replying to various things in no particular order...

1. Msg "Vectorization possible but seems inefficient" -- I don't believe I have ever seen this message.

2. speedup of the doit() example. If I understand correctly, you are comparing [two loops, one of which is vectorized] to [two loops, neither of which is vectorized]. From my perspective this is irrelevant. What matters is the comparison of best-case-vectorized to best-case-unvectorized. So far, that would be [two loops, one of which is vectorized] compared to [one loop unvectorized]. For what it's worth KFA ranges from 1 to 124 with median
value 66 in my test job.

3. I am working on alignment as we speak.

4. Hoisting the constant out of the loop. I must say I am surprised that the compiler needs my help on this. I wonder if the fact that qst != 0 at loop entry causes it difficulty, or whether a pair of parentheses -- (a[i]*b[i])*c -- might help. For what it's worth, hoisting the constant out of the loop _did_not_ make the gcc 3.4.2 build any faster.

5. *** But my biggest question for you remains, should I look forward to getting help from you to reach a build that has a single loop, partially vectorized ?

Thanks,
-rajeev-
Rajeev Rohatgi

How about this?

Base (fast sequential code, optimized reduction):
float qst = 0;
for (ifa=0;ifa < KFA;ifa++) { // does not vectorize by default
float t = *(pHisto+(pAY[ifa]));
qst += t*pAC[ifa];
}
qst *= aTC; // post-mult.

Vector1 (forced vectorization of the indirect load, may require 8.1 update):
#pragma vector always
for (ifa=0;ifa < KFA;ifa++) { // vectorizes
float t = *(pHisto+(pAY[ifa]));
qst += t*pAC[ifa];
}

Vector2 (default vectorization, but requires additional storage):
for (ifa=0;ifa < KFA;ifa++) { // does not vectorize by default
tmp[ifa] = *(pHisto+(pAY[ifa]));
}
for (ifa=0;ifa < KFA;ifa++) { // vectorizes by default
qst += tmp[ifa]*pAC[ifa];
}

Speedups of Vector1 over Base (S1) and Vector2 over Base (S2) for the more typical lengths 32..128 you observed in your application.

KFA S1 S2
===============
32 0.96 0.77
48 1.15 0.88
64 1.28 0.96
80 1.38 0.95
96 1.43 1.03
112 1.52 1.07
128 1.53 1.08

So in my unit test, and hopefully in your application (do not forget to get the 16-byte alignment if possible!), simply adding a pragma to a slightly optimized single-loop version gives you the best performance (up to 1.5x faster than the best sequential version seen so far). So, if you are able to obtain an 8.1 update, let me know if you observe the same!
Aart

Aart,

Looks like I've got to get 8.1. What's in the Files Download area seems to be
8.1.021. Can you confirm that this is up-to-date in terms of the vectorizer upgrades you've mentioned?

For my part I converted the inner loop to aligned, and much as your results indicate, it is pretty much on par with "Base". For the heck of it I tried aligning
base, it got a little slower, but could be within variance.

I would also be interested to know if

__assume_align(X,16)
for (pX=X...) {
... *pX++ ...
}

works *identically* to

__assume_align(X,16)
for (i=0; i < N; i++) {
...X[i] ...
}

Do you think the non-vectorized code would care between X[i] and *pX++
construction ?

Also had some problems mixing pA=malloc() and pB=_mm_malloc. I got a segmentation fault when the code hit free(pA). I put in a problem report on
the premier support website.

Thanks,
-rajeev-
Rajeev Rohatgi

Aart,

8.1.021 successfully compiled and vectorized the single inner loop.

The speedup is 1.125 for my test problem (median KFA=66), [One loop partial vectorized 8.1.021] to [One loop un-vectorized 8.0.055]
16-byte alignment on all inner loop arrays.

ICC is also at last faster than what I was able to achieve with gcc 3.4.2, ~9% by today's measurements.

Thank you for your considerable help.

-rajeev-
Rajeev Rohatgi

PS Please don't undo any of this stuff in future releases ;-)

Rajeev,

Good to hear and thanks for letting me know! Note that __assumed_aligned(var,base) (sic!) should work for both cases you describe. For this case, however, I would recommend simply using

#pragma vector aligned
for (..) {
..
}

which states that an initial 16-byte alignment may be assumed for all unit-stride memory references in the loop. Also, in general, the compiler should generate the same code for *pX++ vs. X[i]. As a general guideline, however, use the latter form where possible, since it tends to make code more amenable to full analysis.
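To make the guideline concrete, here is a small sketch of the two equivalent forms (function names and data are invented for illustration):

```c
/* Pointer-increment form: legal, but harder for the compiler
   to normalize and analyze. */
float sum_ptr(const float *X, unsigned int n) {
    const float *pX = X;
    float s = 0.0f;
    unsigned int i;
    for (i = 0; i < n; i++)
        s += *pX++;
    return s;
}

/* Subscripted form: same result, generally preferred because the
   access pattern is explicit and amenable to full analysis. */
float sum_idx(const float *X, unsigned int n) {
    float s = 0.0f;
    unsigned int i;
    for (i = 0; i < n; i++)
        s += X[i];
    return s;
}
```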

Please don't hesitate to post other questions on vectorization; I enjoy this form of interaction (and don't forget to buy the book :-).

Aart
