Vectorize circular buffer?

Hi! I have a lot of circular buffers and I wonder if it is possible to vectorize the inner loops. They will typically look like this (read phase):

[cpp]void read(float *buf, float *u, const int N, int pbuf, const int bufferLength)
{
  int s;

  for(s=0; s<N; s++)
  {
    u[s] = buf[pbuf];
    pbuf++;
    if(pbuf >= bufferLength)
      pbuf = 0;
  }
}[/cpp]

I can make bufferLength a power of 2 and simplify the above to something like:

[cpp]void read(float *buf, float *u, const int N, int pbuf, const int bufferLength)
{
  int s;

  for(s=0; s<N; s++)
    u[s] = buf[(s + pbuf) & (bufferLength - 1)];
}[/cpp]

However, it still won't vectorize. The compiler says "loop was not vectorized: dereference too complex", which I kind of understand. So my question is: is there anything smart I can do here? Is it even possible to vectorize this kind of operation? Thanks!

You may need to tell the compiler that the pointers are not aliased. You can do this with the "restrict" keyword.

Also loop count should be known for vectorization.
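
A minimal sketch of the restrict suggestion (the function name scale is hypothetical, and this is untested): with both pointers restrict-qualified, the compiler may assume they don't alias, which removes one obstacle to vectorization.

```c
/* restrict (C99) tells the compiler u and buf never overlap, so the
   loop can be vectorized without a runtime aliasing check.
   (icc also accepts restrict in C++ with the -restrict option.) */
void scale(float *restrict u, const float *restrict buf, int n)
{
    for (int s = 0; s < n; s++)
        u[s] = 2.0f * buf[s];
}
```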

Yeah, you are right! I did that by using "#pragma ivdep", which seems to work, but I forgot it in my example. Are you sure about the loop count? It will not be known at compile time, but I see other loops in my code using the same loop count which do vectorize.

icc performs automatic loop splitting or peeling only for a limited group of cases, not likely to include any with an explicit conditional. You could write it out explicitly with an additional level of looping so as to take the pointer wrap out of the inner loop. You could write it as (at most a pair of) memcpy() if that is appealing.
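
The "pair of memcpy()" suggestion might look roughly like this untested sketch (the function name read_memcpy and the index update are assumptions, not from the thread):

```c
#include <string.h>

/* Read N floats starting at *pBuf from a circular buffer of
   bufferLength floats, using at most two memcpy() calls. */
void read_memcpy(const float *buf, float *u, int N,
                 int *pBuf, int bufferLength)
{
    int pbuf  = *pBuf;
    int part1 = bufferLength - pbuf;      /* floats before the wrap point */

    if (part1 >= N) {                     /* no wrap: one contiguous copy */
        memcpy(u, buf + pbuf, (size_t)N * sizeof(float));
        pbuf += N;
        if (pbuf == bufferLength)
            pbuf = 0;
    } else {                              /* wrap: tail piece, then head piece */
        memcpy(u, buf + pbuf, (size_t)part1 * sizeof(float));
        memcpy(u + part1, buf, (size_t)(N - part1) * sizeof(float));
        pbuf = N - part1;
    }
    *pBuf = pbuf;
}
```

This takes the pointer wrap out of the inner copy entirely, at the cost of the decision overhead per call.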
Your 2nd alternative is a good idea, with a possible version to try:

[cpp]void read(float *restrict buf, float *restrict u, const int pbuf, const int bufferLength)
{
#pragma vector always
  for(int s = 0; s < N; s++)
    u[s] = buf[pbuf + (s & (bufferLength - 1))];
}[/cpp]

which ought to be possible to auto-vectorize, at least with SSE4. If not, you might submit an actual reproducer on your premier.intel.com account to ask for advice. As this leads to scalar loads packed into parallel stores, one would expect less effectiveness than with loop splitting, depending of course on the values of N and bufferLength and whether you align u[].

The requirement on N is that it be clearly loop invariant; I hate to try to say it more technically than Om did. A possible alternative to restrict would be to define local copies within the scope containing the loop. As you said, #pragma ivdep should eliminate any analysis for possible aliasing, but it's better to use syntax which supports the optimization you want without having to tell the compiler not to check anything.
The forum seems to have updates delayed by hours today; I didn't see the intervening posts.

Thanks for your replies! I did some more tests, and this is what I found out. This code will vectorize:

[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
  int s;
  int pbuf = *pBuf;
  int moduloMask = bufferLength - 1;

#pragma ivdep
  for(s=0; s<N; s++)
    u[s] = buf[(s + pbuf) + moduloMask];
}[/cpp]

While this code won't:
[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
  int s;
  int pbuf = *pBuf;
  int moduloMask = bufferLength - 1;

#pragma ivdep
  for(s=0; s<N; s++)
    u[s] = buf[(s + pbuf) & moduloMask];
}[/cpp]

I would have hoped that the addition instruction would simply have been replaced with something like ANDPS. I guess I can code this using intrinsics like _mm_and_ps instead.

Wait! This can never work. There are no SSE instructions to load data from four different addresses, only from four consecutive addresses, right? So if the wraparound happens at a random boundary, this will never work. I guess it can be solved by writing extra (duplicate) data at the end of the buffer to catch any reads past the end of the buffer.
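
That padding idea might be sketched as follows (the names cb_write, cb_read, and the MAX_READ bound are hypothetical, and this is untested): the writer mirrors the first MAX_READ floats past the end of the buffer, so a read of up to MAX_READ floats starting anywhere is always contiguous.

```c
#define MAX_READ 512   /* assumed upper bound on floats read per call */

/* The buffer must be allocated with bufferLength + MAX_READ floats.
   Every store to the first MAX_READ positions is duplicated into
   the padding region past the end. */
void cb_write(float *buf, int bufferLength, int pos, float v)
{
    buf[pos] = v;
    if (pos < MAX_READ)                  /* mirror the head into the padding */
        buf[bufferLength + pos] = v;
}

/* No wrap test needed: the padding guarantees buf[pbuf..pbuf+N-1] is valid,
   so this loop is a plain contiguous copy the compiler can vectorize. */
void cb_read(const float *buf, float *u, int N, int pbuf)
{
    for (int s = 0; s < N; s++)
        u[s] = buf[pbuf + s];
}
```

The trade-off is one extra store per write near the head of the buffer, in exchange for branch-free, contiguous reads.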

u[s]=buf[(s+pbuf)+moduloMask];

The above is wrong. You need to & with moduloMask as in your 2nd code example.

Try something like this untested code

[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
  int s, e, r;
  int pbuf = *pBuf;

  s = 0;
  e = min(bufferLength, pbuf + N);

#pragma ivdep
  for(r = pbuf; r < e; r++)   /* first part: up to the end of the buffer */
    u[s++] = buf[r];

#pragma ivdep
  for(r = 0; s < N; r++)      /* second part: the wrapped-around remainder */
    u[s++] = buf[r];
}[/cpp]
Jim Dempsey

Yes, it's wrong! I just wanted to demonstrate that the compiler knows how to vectorize the construct as such (but using another operator than &). Sorry for not being clear about that. Your idea in the example is probably the best thing to do, to explicitly avoid the wraparound. I will look into that. I was hoping to be able to avoid that solution, as I also have more complicated functions with several pointers, and then it gets complicated, but I'll give it a try. Thanks for all your help!

stx,

Will you always be peeling the data out of the circular buffer in vector-sized chunks?
IOW, 2 for doubles, 4 for floats.

If so, consider defining your buffer as a union of floats (doubles) with __m128i data types (or cast to __m128i). Then use your modulo buffer index into the __m128i buffer.

Note, source and destination buffers must be aligned to 16-byte boundaries.

Jim Dempsey
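
A rough sketch of the union idea above (the names vecbuf, read_vec, and LEN_VECS are hypothetical; assumes an x86 target with SSE2 and 16-byte aligned buffers, untested):

```c
#include <emmintrin.h>   /* SSE2: __m128i */

#define LEN_VECS 256     /* buffer length in 4-float chunks (power of 2) */

/* Union view: the same storage addressed as floats or as whole
   16-byte vectors. The union is naturally 16-byte aligned because
   it contains __m128i members. */
typedef union {
    float   f[LEN_VECS * 4];
    __m128i v[LEN_VECS];
} vecbuf;

/* Copy n whole vectors (4*n floats) starting at vector index pv,
   applying the modulo mask to the vector index, not the float index. */
void read_vec(const vecbuf *src, __m128i *dst, int n, int pv)
{
    for (int i = 0; i < n; i++)
        dst[i] = src->v[(pv + i) & (LEN_VECS - 1)];   /* 16-byte moves */
}
```

As noted above, this only works when reads start on 4-float boundaries and move whole vectors at a time.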

Yeah, that sounds like a clever thing to do, but only if the read offset is a multiple of 4 (for floats). This is typically not the case in my application. I guess this can be solved by explicitly writing samples at both ends of the buffers when close to the wraparound? Thanks for your effort!

If the amount of data per move is sufficiently large, I suggest you take Tim's advice and use memcpy (twice when necessary). This saves you from having to write the alignment and partial SSE vector code into your function (read: more portable). When your data packets are small, though, it may be faster to just use an index on the array of floats. Not knowing the data flow characteristics makes it rather difficult to recommend a "best way".

Jim Dempsey

I see what you are saying. It seems like there is no perfect solution for this problem, so I will set up a test case to compare memcpy, plain loops, and explicitly split-up loops (to avoid the modulo). It might be that some of these cases are limited by memory bandwidth, and then SSE won't help much, I guess. Thanks again for your help!

Quoting stx: I see what you are saying. It seems like there is no perfect solution for this problem, so I will set up a test case to compare memcpy, plain loops, and explicitly split-up loops (to avoid the modulo). It might be that some of these cases are limited by memory bandwidth, and then SSE won't help much, I guess. Thanks again for your help!

memcpy, with the Intel compiler, includes a decision tree that determines length, alignment, and SSE capability issues, and when suitable it will take a path that uses SSE instructions. The decision tree has some up-front overhead.

Additionally, the memcpy may incur additional SSE register save/restore and invalidation of transient SSE registers. Meaning, if your circular buffer code gets inlined, it may avoid some of that SSE register save/restore and invalidation.

If your circular buffer is a choke point in your code, you might consider hand-tuning it. I would venture to guess that the deciding issue is, first, the average length of the copy, and second, the size of the ring buffer (which determines the frequency of split copies).

The other possible issue is whether you have a worst-case latency requirement. Usually this is not as much of a concern as average latency.

Jim Dempsey

Quoting jimdempseyatthecove
Quoting stx: some of these cases are limited by memory bandwidth, and then SSE won't help much I guess.

memcpy, with the Intel compiler, includes a decision tree that determines length, alignment, and SSE capability issues, and when suitable it will take a path that uses SSE instructions. The decision tree has some up-front overhead.

This includes invoking non-temporal move where appropriate, which seems likely, as you mention it may be memory bandwidth limited.
If you do need a significant part of the code of an optimized memcpy(), the function call saves a lot of code expansion in your application, and may improve instruction cache locality. For this reason, Intel compilers make memcpy() substitutions automatically in for() loops when it appears appropriate, based on the assumption that the loop count is at least 100 (should you not specify it by #pragma or a visible static array size declaration).
As Jim has been reminding us, the choice among such methods depends on whether you have long data moves or many short ones. We can go on and on presenting information you may not want, if you are unwilling to answer those questions.

>>This includes invoking non-temporal move where appropriate

In the case of the circular buffer, if this is a single-producer single-consumer queue .and. producer and consumer do not share L1 (HT siblings) .and. the time in the buffer is very short .and./.or. the data is processed immediately after extraction from the buffer (IOW producer and consumer do not share L1 and are using the circular buffer as inter-thread message passing), then non-temporal moves would not be appropriate, as you would want the data to be placed into and extracted from the L3 cache (or L2/L1 if HT siblings). Note, only L1 is written on a non-temporal move (other caches are invalidated); in the special case of message passing between HT siblings, the non-temporal moves (of short data) may remain fully cached in L1.

Is there a pragma or other function call to inhibit (or control) non-temporal moves?

#pragma vector temporal
memcpy(dst, src, n);

On this subject, I did notice that the CEAN extension does pay attention to #pragma vector temporal. So you could use

[cpp]if((index + n) <= bufferSize)
{
#pragma vector temporal
  dstAsFloats[0:n] = bufferAsFloats[index:n];
  index += n;
  if(index == bufferSize)
    index = 0;
} else {
  int part1 = bufferSize - index;
  int part2 = n - part1;
#pragma vector temporal
  dstAsFloats[0:part1] = bufferAsFloats[index:part1];
#pragma vector temporal
  dstAsFloats[part1:part2] = bufferAsFloats[0:part2];
  index = part2;
}[/cpp]

Jim Dempsey
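
For compilers without CEAN, the same two-part copy can be written with plain loops; here is an untested sketch (the function name copy_from_ring and the returned index update are assumptions — compilers other than icc will simply ignore the #pragma vector lines):

```c
/* Two-part copy out of a ring buffer, mirroring the CEAN version above:
   one contiguous copy when no wrap occurs, otherwise two.
   Returns the updated read index. */
int copy_from_ring(float *dst, const float *ring, int bufferSize,
                   int index, int n)
{
    if (index + n <= bufferSize) {        /* no wrap */
#pragma vector temporal
        for (int i = 0; i < n; i++)
            dst[i] = ring[index + i];
        index += n;
        if (index == bufferSize)
            index = 0;
    } else {                              /* wrap: tail piece + head piece */
        int part1 = bufferSize - index;
        int part2 = n - part1;
#pragma vector temporal
        for (int i = 0; i < part1; i++)
            dst[i] = ring[index + i];
#pragma vector temporal
        for (int i = 0; i < part2; i++)
            dst[part1 + i] = ring[i];
        index = part2;
    }
    return index;
}
```

Both inner loops are plain contiguous copies, so they should vectorize without ivdep or restrict gymnastics.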

memcpy() would choose non-temporal move only when the data segment is so large that it would evict most of the data currently in cache. Then, the (unforeseen) case where non-temporal is bad is the one where the code is written explicitly to use ("consume") first the tail end of the array which was just moved.
There are a number of cases where the default action of the Intel optimizer can defeat the purpose of code which uses a forward-backward scheme with the intent of improving cache locality between loops for large arrays.
The pragma would not change the action of memcpy(), but it can be used to control instruction selection for vectorizable for() loops.
If you don't know more than you are willing to tell us about your usage, you will have to experiment. If you perform a 2 part move, it's even conceivable that one should be non-temporal and the other (which you want to remain in cache) should be temporal.

Quoting tim18

As Jim has been reminding us, the choice among such methods depends on whether you have long data moves or many short ones. We can go on and on presenting information you may not want, if you are unwilling to answer those questions.

Alright, here is some more info. The length of the circular buffer will range from ~1000 to ~15000. However, the delay between the write pointer and the read pointer is variable over more or less the entire range, and not known at compile time. The number of floats processed per call, N, is also variable, in the range of 32 to 512 or so.

Also, the function will typically write at two places and read from two places in each call, which makes the explicit modulo stuff a bit cumbersome, but still feasible.

And again, thanks for all your help! Even though not all of these ideas might apply in my case, I still find them very interesting.

Quoting jimdempseyatthecove
On this subject, I did notice that the CEAN extension does pay attention to #pragma vector temporal. So you could use the two-part CEAN copy shown above.

Hi again. I have done some experiments with code similar to the one above, but with more reads and writes, and some arithmetic operations. The result is a 2:1 speed-up, which is great! The simplest test cases, with only a read and a write, do not see the same speed-up. My guess is that this is due to memory bandwidth being the limiting factor in those situations.

Quoting tim18
memcpy() would choose non-temporal move only when the data segment is so large that it would evict most of the data currently in cache.

This is all interesting. I didn't understand the temporal/non-temporal stuff before, but after reading this and experimenting a bit further, I can see that I gain another 10% by using #pragma vector nontemporal. This also makes sense according to what you write above.

Thanks!

The "temporal" choice of words and its effect on the caches is a bit confusing, whereas the SSE nomenclature might be less so.

Temporal has the effect of:

write into L1, into L2, into L3, and dribble in sequence to RAM.

Non-temporal has the effect of:

write into L1, invalidate L2, invalidate L3, and dribble into RAM, not necessarily in sequence and potentially with write combining.

The primary difference is:

pollute / populate L2 and L3
or
invalidate L2, L3 and dribble into RAM, not necessarily in sequence and potentially with WC.

As to whether it is pollute or populate, that depends on whether you want to immediately reuse the data or not.

Jim Dempsey
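
At the intrinsics level, the non-temporal behavior described above corresponds to streaming stores; a minimal sketch (the function name copy_nt is hypothetical; assumes an x86 target, 16-byte aligned destination, and a count that is a multiple of 4, untested):

```c
#include <xmmintrin.h>   /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Non-temporal copy: _mm_stream_ps writes around the cache hierarchy
   via write-combining buffers instead of populating L2/L3.
   dst must be 16-byte aligned; n must be a multiple of 4. */
void copy_nt(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
    _mm_sfence();        /* make the streaming stores globally visible */
}
```

This is what the compiler emits on your behalf under #pragma vector nontemporal; hand-coding it only makes sense if the copy really is a choke point.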

If you write at most 512 floats before continuing to code which may read what you have written, you would not expect a major issue with memory bandwidth. The major use for nontemporal is where you store so much data that temporal would evict what you need next from cache, which doesn't appear to be the case for you.
It does look like your buffer is big enough that if you fill most of it before reading it back, you would not expect the data to remain in L1 cache, which may be a point in favor of nontemporal. If you can get a 10% advantage in what looks like a marginal case, that could be useful information which you could get only by testing.

In your last posted code, it looks like the failure to vectorize is more likely still due to your omission of restrict qualifiers, or possibly to the lack of a #pragma vector always or #pragma vector nontemporal. Note that both of those pragmas ask for vectorization without evaluating efficiency, so it is possible that nontemporal could promote vectorization. It might be interesting to see whether the use of an unsigned bufferLength has any influence on the compiler's evaluation, possibly with combinations of signed and unsigned casts. unsigned tends to be inefficient for loop counters, subscripts, and bounds (counter to the opinion of many), but may be preferred for masking.
