double loop optim

#include <iostream>
#include <omp.h>

int main()
{
    const double p0 = 0.51;
    const double p1 = 0.49;
    const int N = 100000;
    __declspec(align(16)) double opt[N];
    for (int i=0; i<N; i++) opt[i] = i;

    const double t0 = omp_get_wtime();
    for (int ts=N-1; ts>0; ts--)
        for (int i=0; i<ts; i++)
            opt[i] = p0*opt[i] + p1*opt[i+1];
    const double t1 = omp_get_wtime();

    std::cout << t1-t0 << std::endl;
}

I use icpc 10.0.020 and build it with
icpc -xT -O3 -openmp -o 2 2.cpp
I have a Mac Pro with a Core 2 Duo class Xeon (SSSE3), 4 MB L2 cache.

ps: the loops are vectorized
2.cpp(12): (col. 2) remark: LOOP WAS VECTORIZED.
2.cpp(17): (col. 3) remark: LOOP WAS VECTORIZED.

It takes 4.5s

Are there any suggestions to speed up more, perhaps rearranging some bits, or using intrinsics?

ps: the use of omp is purely to get the time function.

I looked at the source-annotated assembly... In the comments, there are references like #14.20. Are these section 14 item 20 of some document somewhere?



Those comments in the assembly are source line and column references.
Your code sample is recursive in the outer loop (each iteration of the outer loop has to wait for data produced by the previous iteration), so it seems unlikely any further speedup could be found. Speed may be limited by the rate at which data can move to and from registers, taking into account that you have an unaligned vector access.
You may be interested in examining which tradeoffs the compiler has made in order to deal with cache line boundary crossing. As you specified alignment, I assume you have arranged for the compiler to use aligned moves for the store and one of the loads. A trick there is to recognize that movaps does the same thing as movapd, but is sometimes faster, due to saving 1 byte of code. One of the techniques which the compiler sometimes uses is full cache line unrolling, so that unaligned parallel loads can be used for the operands which don't split cache lines, and the pair of operands which cross the boundary can be loaded individually.
The gcc method of using individual loads for the unaligned operand would be superior to exclusive use of unaligned parallel loads. Even then, you must allow 4 clock cycles per loop iteration pair for data movement, assuming full pipelining and disregarding the recursion stall cycles. That very limited analysis could explain a mere 50% performance increase from vectorization, in case that is what you see.
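For illustration, here is a minimal SSE2 intrinsics sketch of that load pattern (an aligned load for the opt[i] operand, two individual loads for the shifted opt[i+1] operand). It is not the compiler's exact code; the function name and remainder handling are purely illustrative, and it assumes opt is 16-byte aligned:

    #include <emmintrin.h>

    // One outer time step: opt[i] = p0*opt[i] + p1*opt[i+1] for i = 0..ts-1.
    static void step(double *opt, int ts, double p0, double p1)
    {
        const __m128d vp0 = _mm_set1_pd(p0);
        const __m128d vp1 = _mm_set1_pd(p1);
        int i;
        for (i = 0; i + 1 < ts; i += 2) {
            __m128d a = _mm_load_pd(&opt[i]);       // aligned load: {opt[i], opt[i+1]}
            __m128d b = _mm_load_sd(&opt[i+1]);     // low half:  opt[i+1]
            b = _mm_loadh_pd(b, &opt[i+2]);         // high half: opt[i+2]
            _mm_store_pd(&opt[i],
                _mm_add_pd(_mm_mul_pd(vp0, a), _mm_mul_pd(vp1, b)));
        }
        if (i < ts)                                 // odd ts: finish the last element scalar
            opt[i] = p0*opt[i] + p1*opt[i+1];
    }

The outer loop would then call step(opt, ts, p0, p1) for ts = N-1 down to 1.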
The incentive for undertaking such detailed work is limited, given that the Penryn CPUs are to be launched to market next week, and they should greatly reduce the cache line split penalty.
Another issue of detail is to try to fit your unrolled loop code (if not full cache line unrolled) into 4 16-byte aligned chunks. As you have no jumps inside the loop, this should be accomplished easily by making the top of the loop code body 16 byte aligned. Making it 32-byte aligned could improve efficiency of hardware prefetch. If icpc didn't do this, you can try it by editing the assembly (or checking a current gcc).

You'll be happy to know that ICC 10.1.011 for Windows doesn't vectorize this code. Great regression testing Intel, way to go.

Igor Levicki

The gain from vectorization only was 1.5s (from 5.8s to 4.3s = 26%).

The source-annotated assembly

;;; for (int ts=N-1; ts>0; ts--)
;;; for (int i=0; i<ts; i++)
;;; opt[i] = p0*opt[i] + p1*opt[i+1];
movaps _2il0floatpacket.3(%rip), %xmm3 #13.13
xorl %edx, %edx #11.2
movl $99999, %eax #
movaps _2il0floatpacket.4(%rip), %xmm2 #13.25
movsd _2il0floatpacket.5(%rip), %xmm1 #
movsd _2il0floatpacket.6(%rip), %xmm0 #
testl %eax, %eax #12.19
jle L_B1.13 # Prob 50%
movslq %eax, %rcx #12.3
cmpq $8, %rcx #12.3
jl L_B1.15 # Prob 10%
movl %ecx, %edi #12.3
movl %edi, %esi #12.3
andl $7, %esi #12.3
subl %esi, %edi #12.3
movslq %edi, %r8 #12.3
xorl %edi, %edi #
movq %r8, %rsi #
shlq $3, %rsi
L_B1.7:
movaps (%rsp,%rdi), %xmm5 #13.16
movsd 8(%rsp,%rdi), %xmm4 #13.28
movhpd 16(%rsp,%rdi), %xmm4 #13.28
movaps 16(%rsp,%rdi), %xmm7 #13.16
movsd 24(%rsp,%rdi), %xmm6 #13.28
movhpd 32(%rsp,%rdi), %xmm6 #13.28
movaps 32(%rsp,%rdi), %xmm9 #13.16
movsd 40(%rsp,%rdi), %xmm8 #13.28
movhpd 48(%rsp,%rdi), %xmm8 #13.28
movaps 48(%rsp,%rdi), %xmm11 #13.16
movsd 56(%rsp,%rdi), %xmm10 #13.28
movhpd 64(%rsp,%rdi), %xmm10 #13.28
mulpd %xmm3, %xmm5 #13.16
mulpd %xmm2, %xmm4 #13.28
addpd %xmm4, %xmm5 #13.28
movaps %xmm5, (%rsp,%rdi) #13.4
mulpd %xmm3, %xmm7 #13.16
mulpd %xmm2, %xmm6 #13.28
addpd %xmm6, %xmm7 #13.28
movaps %xmm7, 16(%rsp,%rdi) #13.4
mulpd %xmm3, %xmm9 #13.16
mulpd %xmm2, %xmm8 #13.28
addpd %xmm8, %xmm9 #13.28
movaps %xmm9, 32(%rsp,%rdi) #13.4
mulpd %xmm3, %xmm11 #13.16
mulpd %xmm2, %xmm10 #13.28
addpd %xmm10, %xmm11 #13.28
movaps %xmm11, 48(%rsp,%rdi) #13.4
addq $64, %rdi #12.3
cmpq %rsi, %rdi #12.3
jl L_B1.7 # Prob 99% #12.3
L_B1.9: # Preds L_B1.7 L_B1.15
movq %r8, %rdi #
shlq $3, %rdi #
movq %rcx, %rsi #
shlq $3, %rsi #
cmpq %rcx, %r8 #12.3
jge L_B1.13 # Prob 1%
L_B1.11:
movsd (%rsp,%rdi), %xmm5 #13.16
movsd 8(%rsp,%rdi), %xmm4 #13.28
mulsd %xmm1, %xmm5 #13.16
mulsd %xmm0, %xmm4 #13.28
addsd %xmm4, %xmm5 #13.28
movsd %xmm5, (%rsp,%rdi) #13.4
addq $8, %rdi #12.3
cmpq %rsi, %rdi #12.3
jl L_B1.11 # Prob 99% #12.3
addl $-1, %eax #11.2
addl $1, %edx #11.2
cmpl $99999, %edx #11.2
jb L_B1.4 # Prob 99%
xorl %eax, %eax #14.1
addq $800008, %rsp #14.1
L____tag_value__main.12: #
ret #14.1
L____tag_value__main.13: #
# Preds L_B1.5 # Infreq
xorl %r8d, %r8d #12.3
jmp L_B1.9 # Prob 100% #12.3
.align 1
L____tag_value__main.14: #
# mark_end;
.section __DATA,__data
# -- End _main
.section __DATA,__data
.align 4
.align 4

Is movaps _2il0floatpacket.3(%rip), %xmm3 the top of the loop body? Do I have to place a .align 16 just before that?

Thanks Tim,

The L_B1.7: is the main loop. Your .align would go in front of that line.

Jim Dempsey


Experiment with

for (int ts=N-1; ts>0; ts--) {
    for (int i=0; i<ts; i++) opt1[i] = p1*opt[i+1];          // target 2nd array opt1
    for (int i=0; i<ts; i++) opt[i] = p0*opt[i] + opt1[i];   // combine opt1 with opt
}

Although this appears to be twice the work, it may run faster.
I assume opt[ts] is initialized to 0.0 or some innocuous number.

The other method, using one statement, requires code to shuffle the results. This method will not. It does require additional memory for the opt1 array.

Jim Dempsey

with this,

const double t0 = omp_get_wtime();
for (int ts=N-1; ts>0; ts--) {
    for (int i=0; i<ts; i++) opt1[i] = p1*opt[i+1];
    for (int i=0; i<ts; i++) opt[i] = p0*opt[i] + opt1[i];
}
const double t1 = omp_get_wtime();

3.cpp(17): (col. 3) remark: FUSED LOOP WAS VECTORIZED.
>>> line 17 is the first inner loop: for (int i=0; i<ts; i++) opt1[i] = p1*opt[i+1];

4.23s instead of 4.33s
2.3% gain

I'll try now with aligning to loop code.

Putting .align 16 at the top of the loop says:

2.cpp:48:Alignment too large: 15. assumed.
With .align 15 it obviously takes way too long (the assembler treats the operand as a power of two, so this asks for 2^15-byte alignment).

I tried aligning the loop code on an 8-byte boundary; it's 5.2s instead of 4.3s.

perhaps this is as fast as one can go on this processor config?

Check your alignment directives. If you are using a linux assembler, you might follow the patterns used by gcc, where
.p2align 4,,7
.p2align 3
means to use up to 7 bytes of padding in order to get 16-byte alignment, followed by unconditional 8-byte alignment. If your alignment directive is taken to mean 2^15 (2 raised to power 15) byte alignment, it will not do what you think. "p2" reminds us that this facility was instituted for Pentium II, so people who use assembler have had some opportunity to get used to it.
Try making an object file and viewing the code there (for gnu binutils, objdump -S) so you can see how the align is implemented.
It looks like your loop body may be too large for alignment to help much. The most likely reason for alignment to help would be if it avoids an awkward split of the last instruction (jl L_B1.7) across a 16-byte boundary.
The largest loop alignment which is likely to help is 32-byte alignment, but the platform doesn't give adequate support for this. It's worth bearing in mind when you test.


If you go the double inner loop method, each loop can be aligned separately.

Also, if ts gets large to the point where both arrays opt and opt1 won't fit in the processor cache, then consider fragmenting up the work. From your other code examples posted, I would assume ts would have been reduced to an amount of memory that fits in cache.


But why does 10.1.011 for windows fail to vectorize this at all?

Igor Levicki

>>But why does 10.1.011 for windows fail to vectorize this at all?

Premier Support should be able to tell you.

Another potential workaround for the two inner loops (opt and opt1) is to perform each loop in increments of 2, using two statements (explicitly stating the pairs), as sketched below. Then, after each loop, test whether ts was odd and if so perform the last step. Forcing and declaring the opt arrays aligned would help too.
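A minimal sketch of that pairing (continuing the two-array variant above, with N, p0, p1, opt and opt1 as already declared; purely illustrative):

    for (int ts = N-1; ts > 0; ts--) {
        int i;
        for (i = 0; i + 1 < ts; i += 2) {           // explicit pairs
            opt1[i]   = p1*opt[i+1];
            opt1[i+1] = p1*opt[i+2];
        }
        if (ts & 1)                                 // odd ts: one element left over
            opt1[ts-1] = p1*opt[ts];
        for (i = 0; i + 1 < ts; i += 2) {
            opt[i]   = p0*opt[i]   + opt1[i];
            opt[i+1] = p0*opt[i+1] + opt1[i+1];
        }
        if (ts & 1)
            opt[ts-1] = p0*opt[ts-1] + opt1[ts-1];
    }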

The compiler should have been able to make this determination. The "cost" to you is a handful of statements.

You can wait for a fix, if that is what you want.

Jim Dempsey

There's a bug in the Windows compiler related to optimization with EH (exception handling).

If you comment out the last code line (std::cout), it should vectorize.

Jennifer, can that bug affect anything else? Would the code vectorize if you disable SEH?

Igor Levicki

It only affects loop vectorization with code containing EH. It happens with some cases like this, not always.

Disabling SEH doesn't help. But moving code around does. For this case, moving the "std::cout" line into a function works.

Would using printf() instead of std::cout help? In other words, is it related to using std:: namespace somehow?

Igor Levicki

Yes, using printf() instead helps, at least on 10.0 for Windows.
On linux/mac, there's no bug.



aligning the 2 inner loops didn't yield much compared to the non-explicit align case.

With the 2 inner loops version, the assembly generated looks better than previous case.

movaps loads from opt[] and opt1[] (8 doubles, because the cache line size is 64 bytes) into xmm registers, some unaligned moves (shifted by 1) from opt[] into other xmm registers, additions of the xmm registers (after the multiplications), then aligned stores back to opt[] and opt1[], and then a jump to the beginning label of the loop.

All examples run with sizeof(opt) + sizeof(opt1) < 2 MB, so everything stays inside the L2 cache (so I haven't tested with fragmentation).

4.2s... I doubt much more gain can be obtained?

I'm curious whether Penryn's SSE4 dot-product instruction could bring any benefit here.
In the one-inner-loop case, p0*opt[i] + p1*opt[i+1] looks neatly like a dot product.
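For reference, here is a hypothetical SSE4.1 form of that update using _mm_dp_pd (a guess at what it would look like, not something I have timed). Note that it produces only one result per dppd, so the packed mulpd/addpd pair may still win:

    #include <smmintrin.h>   // SSE4.1 (Penryn)

    const __m128d coef = _mm_set_pd(p1, p0);     // low lane p0, high lane p1
    for (int i = 0; i < ts; i++) {
        __m128d v = _mm_loadu_pd(&opt[i]);       // {opt[i], opt[i+1]}
        __m128d r = _mm_dp_pd(v, coef, 0x31);    // p0*opt[i] + p1*opt[i+1] in the low lane
        _mm_store_sd(&opt[i], r);
    }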

If you want the compiler to consider special optimizations for short fixed loop lengths, such as 1, you need the
#pragma loop count (1,n2,...)
which expresses your preferences, as well as setting -xS. You may not like the generated code, if it does in fact generate separate versions for various cases.
On Penryn, optimized SSE2 dot product code may be best for long loops.


The optimizer may have done this for you but from the ASM code you posted a while back it looked like the optimizer burned some instructions to avoid what looked like an unnecessary load from memory. Try making the inner loop go in steps of 2

opt[i] = p0*opt[i] + p1*opt[i+1];
opt[i+1] = p0*opt[i+1] + p1*opt[i+2];

The ASM code you posted seemed like the code tried to move the opt[i+1] from register to register. In this case I think what looks like a redundant read might be offset by what may end up as always writing in aligned pairs of doubles.

You might be at the case of diminishing returns and would be better off focusing your efforts elsewhere.

Jim Dempsey

"Optimizer" is generating awfull code for this loop:

	const int N = 100000;
	__declspec(align(16)) double opt[N];
	for (int i = 0; i < N; i++) {
		opt[i] = i;
	}

Take a look at this HORRIBLE asm code:

	xor		eax, eax
	movdqa		xmm1, oword ptr ds:oword_41F1C0 ; constant 2 2 2 2
	movdqa		xmm0, oword ptr ds:oword_41F1D0 ; constant 3 2 1 0
	cvtdq2pd	xmm2, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+427580h], xmm2
	cvtdq2pd	xmm3, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+427590h], xmm3
	cvtdq2pd	xmm4, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+4275A0h], xmm4
	cvtdq2pd	xmm5, xmm0
	paddd		xmm0, xmm1
	movaps		oword ptr [eax+4275B0h], xmm5
	add		eax, 40h
	cmp		eax, 0C3500h
	jb		short loc_40104E

The compiler is using cvtdq2pd so abundantly, as if it had ZERO clocks of latency!!! I understand that the Core architecture has lower latency so it may not be a problem, but NetBurst has 8 clocks latency and 3 clocks throughput for CVTDQ2PD. The question is — is it really necessary?

Why not something like this?

		xor	eax, eax
		movapd	xmm7, xmmword ptr [c1] ; constant 8.0 8.0 (increment)
		movapd	xmm0, xmmword ptr [c1 + 16] ; constant 0 1
		movapd	xmm1, xmmword ptr [c1 + 32] ; constant 2 3
		movapd	xmm2, xmmword ptr [c1 + 48] ; constant 4 5
		movapd	xmm3, xmmword ptr [c1 + 64] ; constant 6 7
	fill:
		movapd	[opt + eax + 0], xmm0
		movapd	[opt + eax + 16], xmm1
		movapd	[opt + eax + 32], xmm2
		movapd	[opt + eax + 48], xmm3
		addpd	xmm0, xmm7
		addpd	xmm1, xmm7
		addpd	xmm2, xmm7
		addpd	xmm3, xmm7
		add	eax, 64
		cmp	eax, 800000 ; N * 8
		jb	fill

For me it works faster — 0.0368 vs. the compiler's 0.0371. If you change movapd to movntpd in the loop body and insert an sfence after the jb, you get 0.0355.

Igor Levicki

Maybe you meant something more like

double di = 0;
#pragma vector nontemporal
for (int i = 0; i < N; i++) {
    opt[i] = di++;
}

"the compiler" isn't clairvoyant about your wishes. It does know enough to cut back on unrolling by default when using nontemporal. Down in the 1 to 3% performance difference range you are interested in, so many factors intervene that I'm not going to ask for enough information to attempt to reproduce your assertions.

Tim, that loop I disassembled is from the above example you were debating with finjulhich. I just pointed out that the compiler has chosen weird instructions for the task at hand, nothing more. Your workaround example is fine, but tell me honestly, who writes their C++ code like you just did? Certainly not people who are new to the compiler and to optimization in general.

Furthermore, speed difference I got was 4.5% which is nothing to sneeze at especially if the processing takes more time than in this simple example.

Finally, your workaround doesn't unroll to the full cache line size even though the compiler "knows" that the array is aligned. Instead you get this:

	movaps	xmm1, oword ptr ds:oword_41F1C0 ; constant 2 2
	xor	eax, eax
	movaps	xmm0, oword ptr ds:oword_41F1D0 ; constant 0 1
	movntpd	oword ptr [eax+427580h], xmm0
	addpd	xmm0, xmm1
	add	eax, 10h
	cmp	eax, 0C3500h
	jb	short loc_401055

Note that there is also no sfence instruction after this loop even though the values are used in the next one.

Also note that the compiler uses movaps instead of movapd, which is something the Optimization manual says we shouldn't do — it says that we should use properly typed instructions whenever possible. Either fix the compiler or change the rule in the optimization manual if it is really beneficial to use movaps instead of movapd.

Igor Levicki

You knew about the desirability of moving int to double conversions out of the inner loop. It's been written in optimization guides going back to before the IA was invented, and it was not until P4 that there was any instruction support for double to int conversion in IA. As far as I know, the only reason it may not be common knowledge among C++ programmers is most of them don't care about 100% changes in performance, let alone 4%. I doubt that OP cared about this much performance change in a single initialization loop.
As already mentioned, the compiler defaults to minimum unrolling when using movnt. I assume this is based on testing; you can throw in #pragma unroll(4) if it suits you. The compiler performs whole cache line unrolling in order to avoid unaligned moves which cross the cache line boundary; the possibility of that has already been eliminated in this example. movnt would be used only where memory speed is limiting, so it makes sense to minimize the size of generated code.
There appear to be differences of opinion about replacing movapd by movaps. A movapd which is awkwardly aligned, with a 16-byte boundary in the first few bytes of the instruction, could be much slower than the movaps, or saving space so as to fit the parameters for loop acceleration could make the difference. It's difficult for a compiler to recognize and avoid those problems, other than by always using movaps.
Why don't you submit an issue pointing out the discrepancy between the documentation in the optimization manual and what the compiler does here, asking for clarification? Maybe some gcc developer wrote something about the pros and cons? What happens on AMD when they extend movaps to permit unaligned moves?

You knew about the desirability of moving int to double conversions out of the inner loop.

Tim, this is not about what I know, but about compiler optimization. It was the compiler that didn't move the conversions out of the loop — on the contrary, it unrolled them!

What I am saying is that the compiler should be smart enough to figure out that the conversion is not needed in the loop. You shouldn't need to write any special code constructs or to add any pragmas especially for such a simple loop.

As already mentioned, the compiler defaults to minimum unrolling when using movnt.

And I believe this to be an error. Intel's optimization manual says that non-temporal stores use write buffers and they do write combining so they can write out 64 bytes in one burst instead of using 8x 8 byte bus transactions.

If they don't have a full cache line to operate on, there is a chance they will have to be flushed before they fill completely if the write buffers (which are a scarce resource) are needed for some other purpose, thus negating any benefit of using MOVNT in the first place.
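As a sketch of what I mean (intrinsics form rather than asm, assuming opt is 16-byte aligned and N is a multiple of 8): four streaming stores cover one full 64-byte line per iteration, and an sfence follows the loop:

    #include <emmintrin.h>

    __m128d v0 = _mm_set_pd(1.0, 0.0);           // {0, 1}
    __m128d v1 = _mm_set_pd(3.0, 2.0);           // {2, 3}
    __m128d v2 = _mm_set_pd(5.0, 4.0);           // {4, 5}
    __m128d v3 = _mm_set_pd(7.0, 6.0);           // {6, 7}
    const __m128d inc = _mm_set1_pd(8.0);
    for (int i = 0; i < N; i += 8) {
        _mm_stream_pd(&opt[i],     v0);          // one 64-byte line per iteration
        _mm_stream_pd(&opt[i + 2], v1);
        _mm_stream_pd(&opt[i + 4], v2);
        _mm_stream_pd(&opt[i + 6], v3);
        v0 = _mm_add_pd(v0, inc);
        v1 = _mm_add_pd(v1, inc);
        v2 = _mm_add_pd(v2, inc);
        v3 = _mm_add_pd(v3, inc);
    }
    _mm_sfence();                                // drain the write-combining buffers before opt is read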

Now the question arises whether those developers who write compiler optimizations have read the optimization manual or not?!? I am beginning to seriously doubt that.

As for movaps vs. movapd — the manual clearly states that one should use typed instructions. I see this rule broken and I wonder what the purpose of such a rule is if those who wrote it are the same ones who are breaking it?!?

About submitting an issue over this discrepancy — isn't it enough that I am submitting issues for serious bugs I am discovering in the compiler code generation (like the latest issue #458830 for example)? You expect me to be the beta tester for an unpolished product which we paid for plus I am forced to perform my own regression testing for each compiler build you push out and as if that is not enough now I should also be the one who will make sure that your compiler writers work in sync with your documentation writers and CPU engineers?!?

I think that is a little bit too much to ask from a customer. On the other hand, I can offer you my services so if you hire me I will gladly start coordinating those groups the way I believe they should be coordinated but don't expect me to do it for free.

About gcc developers — I don't care. The only person whose opinion on optimization I value is Agner Fog. About AMD — I don't care. If they want movaps to access unaligned memory its fine with me. My memory is always aligned so I won't see any difference.

And what about the compiler not using SFENCE after non-temporal store?

Igor Levicki
