Can icl unroll the loop with auto-vectorization?

Example:

#define LEN 1048576

BYTE orign[LEN];
BYTE copy[LEN];

......

for (int i = 0; i < LEN; i++)
{
    copy[i] = orign[i];
}

On P4, compiled with /O2 /QxN, the loop is auto-vectorized.

But it used only one register, XMM0. How can icl unroll the loop further to use all eight XMM registers?

kalven


The compiler has been tuned to unroll for maximum performance (though it may optimize more aggressively at /O3). Thanks to hardware register renaming, it can use multiple physical registers more effectively than could be done by explicitly naming all the programmable registers.
For such a long loop, memcpy() is suitable, and the Intel compiler headers will substitute automatically a special version tuned for Intel CPUs.
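To make the suggestion concrete, here is a minimal sketch of the two equivalent forms (buffer names follow the original post; `copy_buf` is renamed from `copy` only to avoid clashing with standard library names):

```c
#include <string.h>

#define LEN 1048576
typedef unsigned char BYTE;

static BYTE orign[LEN];
static BYTE copy_buf[LEN];

/* The original byte-copy loop, which icl auto-vectorizes at /O2 /QxN. */
void copy_with_loop(void)
{
    for (int i = 0; i < LEN; i++)
        copy_buf[i] = orign[i];
}

/* The memcpy form; with the Intel compiler headers this call is
   substituted by a tuned version such as __intel_fast_memcpy. */
void copy_with_memcpy(void)
{
    memcpy(copy_buf, orign, LEN);
}
```

Both produce identical results; the difference is only in which copy machinery the compiler ends up emitting.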

"For such a long loop, memcpy() is suitable, and the Intel compiler headers will substitute automatically a special version tuned for Intel CPUs"

Did you try it? I did. Yes, with icl, memcpy is compiled to __intel_fast_memcpy, but it took as much time as my loop with /O2 (with one megabyte of data, it took about 0.84 ms with streaming store on P4; and for memcpy, /O3 is useless). But what then if I have such a loop, but the src data type is not the same as the dst?

Are you saying the loop can be unrolled only with /O3? I'll have a try later. Thanks.

I think our disagreement is on your hypothesis that more massive unrolling will necessarily out-perform any of the versions you tried.
I have seen performance gains with memcpy(), compared with vectorized compilation in line. Those gains aren't spectacular, and may be associated with factors other than the degree of unrolling.
Massive unrolling isn't an automatic cure for a situation where performance is limited by memory bandwidth, rate of processing TLB misses, and the like. If you want to optimize for an old P4, such as was in production over 5 years ago, that would not be the priority target for current tuning of the compiler, so you may gain a little by tweaking.
Did you read the articles about early P4, where you could accelerate a loop by issuing a string of software prefetch before starting the loop? This didn't work for later models, where the interactions between software and hardware prefetch were cut, because it was much more common to get unfavorable interactions.

"I think our disagreement is on your hypothesis that more massive unrolling will necessarily out-perform any of the versions you tried."

Yes, I think there will be an improvement with further unrolling, and I have a reason: I had an MMX-intrinsics-optimized version using all eight MMX registers, and it took just 0.51 ms with the same data length; compared to 0.84 ms, that is a 1.64x improvement. Can you explain this? Because MMX uses __m64 and SSE2 uses __m128i (my P4 supports SSE2), I think an SSE2 version should improve on the MMX-intrinsics version. Maybe I should write SSE2 intrinsics to try.

And will you please give me a situation under which the loop is unrolled with auto-vectorization, and paste the asm result here?

PS: Don't worry about my processor; right now I just want to see a situation under which the loop is unrolled with auto-vectorization. Please give the processor and compiler options together with the asm result here. Thanks a lot.
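For what it's worth, the SSE2 intrinsics version mentioned above could be sketched roughly like this. This is a hypothetical illustration, not code from the thread: the function name `sse2_stream_copy` is mine, and it assumes 16-byte-aligned buffers and a length that is a multiple of 128 bytes. It mirrors the eight-register MMX kernel, but with XMM registers and streaming stores:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

typedef unsigned char BYTE;

/* Copies len bytes, 128 bytes (eight XMM registers) per iteration,
   with non-temporal stores. Assumes dst and src are 16-byte aligned
   and len is a multiple of 128. */
void sse2_stream_copy(BYTE *dst, const BYTE *src, size_t len)
{
    for (size_t i = 0; i < len; i += 128) {
        __m128i r0 = _mm_load_si128((const __m128i *)(src + i));
        __m128i r1 = _mm_load_si128((const __m128i *)(src + i + 16));
        __m128i r2 = _mm_load_si128((const __m128i *)(src + i + 32));
        __m128i r3 = _mm_load_si128((const __m128i *)(src + i + 48));
        __m128i r4 = _mm_load_si128((const __m128i *)(src + i + 64));
        __m128i r5 = _mm_load_si128((const __m128i *)(src + i + 80));
        __m128i r6 = _mm_load_si128((const __m128i *)(src + i + 96));
        __m128i r7 = _mm_load_si128((const __m128i *)(src + i + 112));
        _mm_stream_si128((__m128i *)(dst + i),       r0);
        _mm_stream_si128((__m128i *)(dst + i + 16),  r1);
        _mm_stream_si128((__m128i *)(dst + i + 32),  r2);
        _mm_stream_si128((__m128i *)(dst + i + 48),  r3);
        _mm_stream_si128((__m128i *)(dst + i + 64),  r4);
        _mm_stream_si128((__m128i *)(dst + i + 80),  r5);
        _mm_stream_si128((__m128i *)(dst + i + 96),  r6);
        _mm_stream_si128((__m128i *)(dst + i + 112), r7);
    }
    _mm_sfence();  /* order the streaming stores before later reads */
}
```

Whether this actually beats the MMX kernel is exactly the open question in this thread; on a bandwidth-limited copy the wider registers may not help.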

kalven

When the compilers choose to vectorize with non-temporal store, due to a known long loop length, they tend not to do any additional unrolling. There probably are reasons for that; among them, the loop stream detector is intended to eliminate the need for extra unrolling on current CPUs. Where the loop length is not known to be so long, they may unroll automatically, but not as much as you advocate. For that matter, libraries such as MKL may include loop versions with more unrolling than the compilers would generate, which may support your argument: a library can afford to create more versions, since they don't get repeated in line at every call site. However, on relatively short loops, massive unrolling works well only for certain loop lengths, and too many versions also add noticeable overhead.

"When the compilers choose to vectorize with non-temporal store, due to a known long loop length, they tend not to do any additional unrolling."

I don't think it's just with non-temporal store; even with ADD, there isn't any additional unrolling either.

MKL requires a separate payment, right? So far I am not ready to buy MKL.

Well, maybe I need to write the intrinsics myself if I find additional unrolling is helpful; that's a little disappointing.

kalven, if I remember correctly /Qunroll:n switch controls the loop unroll factor. There is also a pragma — #pragma unroll(n). For example, if you want the loop to be unrolled 8 times then add #pragma unroll(8) before the loop body. Of course, you could have also found that on your own by reading the compiler documentation.
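A minimal sketch of that suggestion follows. Note the pragma is specific to the Intel compiler; other compilers will typically just warn about an unknown pragma and ignore it. The function name `unrolled_copy` is mine for illustration:

```c
#define LEN 1048576
typedef unsigned char BYTE;

static BYTE src_buf[LEN];
static BYTE dst_buf[LEN];

void unrolled_copy(void)
{
    /* Ask icl to unroll the vectorized loop by a factor of 8.
       The /Qunroll:8 command-line switch has a similar effect,
       but applies globally rather than to this one loop. */
    #pragma unroll(8)
    for (int i = 0; i < LEN; i++)
        dst_buf[i] = src_buf[i];
}
```

The pragma must directly precede the loop it applies to.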

Thank you, Igor.

I tested the loop with #pragma unroll(8). Yes, it's unrolled while auto-vectorized. But the timing results are interesting: there wasn't any improvement on my P4 630 CPU, for either streaming store or non-streaming store.

As I posted here before, when I tested an MMX intrinsic version written by hand, there was an improvement.

kalven,

tim18 already told you there probably won't be any benefit from unrolling that loop. As for your MMX function, there is a chance that you have made a mistake somewhere — Intel's own memcpy() code could hardly be beaten by MMX code written using intrinsics unless you really know what you are doing (that is unless you are optimizing for a specific case). I would suggest checking the MMX code for correctness, and comparing the performance with Intel's own memcpy(). In most cases memcpy() gives best results on all target CPUs because it is already heavily optimized.

I don't think there is something wrong with my MMX code written using intrinsics, because I have used it for about half a year in my application.

I think there are some issues I don't know about yet.

Then how about your testing methodology? Are you using separate input and output buffers when comparing the speed of your MMX code and memcpy()? Can you share your test code with us?

Dear Igor, I used the same input and output buffers; first I tested StreamStore, and then QuickCopy. For data sizes, I tested 256K, 512K, and one megabyte. QuickCopy was always better than StreamStore. The code for StreamStore and QuickCopy is listed below. I don't understand QuickCopy beating StreamStore; maybe you can give me some hints.
void StreamStore(BYTE *dst, BYTE *src, int len)
{
#pragma vector nontemporal
#pragma ivdep
#pragma unroll(8)
    for (int i = 0; i < len; i++)
    {
        __assume_aligned(dst, 16);
        __assume_aligned(src, 16);
        *(dst + i) = *(src + i);
    }

    //memcpy(dst, src, len);
}
void QuickCopy(void *dst, void *src, int len)
{
    const int q_words_8 = len / 64;
    if (q_words_8 > 0)
    {
        Qmemcpy_Mmx(dst, src, q_words_8 * 8);
    }
    const int q_words_bytes = q_words_8 * 64;
    const int edge_bytes = len - q_words_bytes;
    if (edge_bytes > 0)
    {
        memcpy((char*)dst + q_words_bytes, (char*)src + q_words_bytes, edge_bytes);
    }
}
void Qmemcpy_Mmx(void *dst, void *src, int nQWORDs)
{
#define CACHEBLOCK 1024 // may perform differently with other sizes; 512, 1024 and 2048 gave the same result on a P4 630 CPU
    int n = ((int)(nQWORDs / CACHEBLOCK)) * CACHEBLOCK;
    int m = nQWORDs - n;
    if (n)
    {
        _asm // copy whole cache blocks
        {
            mov esi, src
            mov edi, dst
            mov ecx, n
            lea esi, [esi+ecx*8]
            lea edi, [edi+ecx*8]
            neg ecx
        mainloop:
            mov eax, CACHEBLOCK / 16
        prefetchloop:
            mov ebx, [esi+ecx*8]
            mov ebx, [esi+ecx*8+64]
            add ecx, 16
            dec eax
            jnz prefetchloop
            sub ecx, CACHEBLOCK
            mov eax, CACHEBLOCK / 8
        writeloop:
            movq mm0, qword ptr [esi+ecx*8   ]
            movq mm1, qword ptr [esi+ecx*8+8 ]
            movq mm2, qword ptr [esi+ecx*8+16]
            movq mm3, qword ptr [esi+ecx*8+24]
            movq mm4, qword ptr [esi+ecx*8+32]
            movq mm5, qword ptr [esi+ecx*8+40]
            movq mm6, qword ptr [esi+ecx*8+48]
            movq mm7, qword ptr [esi+ecx*8+56]
            movntq qword ptr [edi+ecx*8   ], mm0
            movntq qword ptr [edi+ecx*8+8 ], mm1
            movntq qword ptr [edi+ecx*8+16], mm2
            movntq qword ptr [edi+ecx*8+24], mm3
            movntq qword ptr [edi+ecx*8+32], mm4
            movntq qword ptr [edi+ecx*8+40], mm5
            movntq qword ptr [edi+ecx*8+48], mm6
            movntq qword ptr [edi+ecx*8+56], mm7
            add ecx, 8
            dec eax
            jnz writeloop
            or ecx, ecx
            jnz mainloop
        }
    }
    if (m)
    {
        _asm // copy the remainder
        {
            mov esi, src
            mov edi, dst
            mov ecx, m
            mov ebx, nQWORDs
            lea esi, [esi+ebx*8]
            lea edi, [edi+ebx*8]
            neg ecx
        copyloop:
            prefetchnta [esi+ecx*8+512]
            movq mm0, qword ptr [esi+ecx*8   ]
            movq mm1, qword ptr [esi+ecx*8+8 ]
            movq mm2, qword ptr [esi+ecx*8+16]
            movq mm3, qword ptr [esi+ecx*8+24]
            movq mm4, qword ptr [esi+ecx*8+32]
            movq mm5, qword ptr [esi+ecx*8+40]
            movq mm6, qword ptr [esi+ecx*8+48]
            movq mm7, qword ptr [esi+ecx*8+56]
            movntq qword ptr [edi+ecx*8   ], mm0
            movntq qword ptr [edi+ecx*8+8 ], mm1
            movntq qword ptr [edi+ecx*8+16], mm2
            movntq qword ptr [edi+ecx*8+24], mm3
            movntq qword ptr [edi+ecx*8+32], mm4
            movntq qword ptr [edi+ecx*8+40], mm5
            movntq qword ptr [edi+ecx*8+48], mm6
            movntq qword ptr [edi+ecx*8+56], mm7
            add ecx, 8
            jnz copyloop
            sfence
            emms
        }
    }
    else
    {
        _asm
        {
            sfence
            emms
        }
    }
}

kalven,

If you are using the same input and output buffers to test two different functions in the same test run, then the function you test second will have better performance because the first function will prefetch the data and the data will already be in the cache for the second function.

You have to use separate input and output buffers or to test only one function in a single test run.
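That methodology can be sketched as a small timing harness. This is a hypothetical illustration (the names `time_copy` and `plain_copy` are mine, not from the thread); the key points are that only one routine is timed per program run and that the source buffer is touched once before timing so both routines start from the same cache state:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define LEN (1 << 20)   /* one megabyte, as in the thread */
typedef unsigned char BYTE;

static BYTE src_buf[LEN];
static BYTE dst_buf[LEN];

/* Times one copy routine in isolation and returns seconds per copy.
   Run the program once per routine so the second routine never
   benefits from caches warmed up by the first. */
double time_copy(void (*copy_fn)(BYTE *, const BYTE *, int), int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        copy_fn(dst_buf, src_buf, LEN);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}

/* Baseline routine: the library memcpy. */
static void plain_copy(BYTE *dst, const BYTE *src, int len)
{
    memcpy(dst, src, len);
}
```

A run then looks like `printf("%.6f s per copy\n", time_copy(plain_copy, 100));`, with StreamStore or QuickCopy substituted in separate builds.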

Sorry for my confusing explanation.

I used the same input and output buffers, but in different test runs. For example, the first run was with StreamStore; in the second run I replaced StreamStore with QuickCopy.

What's more, as I understand it, streaming store technology won't use the write cache, right?

Then your timing results should be correct but I would still suggest measuring with a much larger buffer. You are right, streaming store uses write buffers and bypasses the cache. Is there any reason why your assembler code is not using XMM registers?

Thanks for your suggestion.

I used to use the Microsoft C++ compiler, which can't auto-vectorize the loop. And because of the processor my application ran on, I used just MMX intrinsics.

Now I have the Intel C++ compiler and a new processor, so it should do auto-vectorization with XMM. I think for such a simple loop the compiler should do at least as well as assembler code. And since XMM is 128-bit, I took it for granted that the result would be better. But practice shows that is not true.

It's disappointing if, for such a simple loop, you have to write assembler code to achieve the best performance even with the Intel C++ compiler.
