Instruction dependencies

Hi, I am new to SSE programming and I was just wondering if someone more experienced could explain this to me:

When I sum elements in an array like this:

for(i...N; i+=16){
    sumA += array[i];
    sumB += array[i+4];
    sumC += array[i+8];
    sumD += array[i+12];
}

Why is it not faster compared to: for(i...N; i+=4){ sum += array[i]; } ?

It is actually a bit slower. I was under the impression that this instruction-level parallelism should speed it up. I checked it using the Intel Architecture Code Analyzer and the performance-critical path is about the same (while the second version sums only one float4 at a time), so I would expect the first version to be faster. Can someone explain why my assumption is wrong and what is limiting the performance?

I have been testing it on a 06_17H CPU, compiled in both 32-bit and 64-bit by VS2010; the memory of the array is locked to get rid of page faults, and prefetches are in place.


What data types are sumA, sumB, sumC, sumD and array[N]?

Jim Dempsey

sumA, sumB, etc. are __m128; the array is a 16-byte-aligned array of floats.

So inside the loop I do this for every sum variable:
sumA = _mm_add_ps(sumA, _mm_load_ps(&elements[i]));

And I sum the temp sums after the loop finishes

Have you looked at the disassembly code? (/O3)

Your loop should require 4 temps for the fetch of the elements and 4 temps for sumA, ...
However, this is an AVX forum, so consider using _mm256_add_ps(...) where sumA, ... are __m256 types.
IOW each instruction adds 8 floats (change the loop to use +8).

Jim Dempsey

As Jim said, you should take a look at the disassembly, particularly as you don't care to give us a sufficient example to understand what you are up to or to try it ourselves, nor do you tell us which options you use. If you were not using a Microsoft compiler, there would be little reason to second-guess how a compiler would like to optimize a sum reduction for a C standard data type. Even so, it's difficult to understand why you wouldn't sum in the usual fashion (add 4 elements to each sum, if you went to the trouble of making those sums __m128 data types). In normal circumstances, one could expect 2 parallel sums to give more throughput than a single one if the loop is long enough, but more than 2 parallel sums would not be likely to gain more.


Jim, I am aware of the fact that this is an AVX forum; however, I have not found one that is more suitable. Please let me know if there is one. The compiler has all optimizations enabled.

Tim, I apologize if it is not clear what I am trying to achieve. I am just trying to understand why I don't experience any speedup if I unroll the loop and use more temporary variables. Furthermore, I am adding 4 elements to each sum; since that doesn't seem to be very clear, I am including the code.

__m128 Sums[4];

for(unsigned int i=0;i<4;i++){ Sums[i] = _mm_set_ps(0.f, 0.f, 0.f, 0.f); }

for(unsigned int i=0;i<ElementsCount;i+=16){
    Sums[0] = _mm_add_ps(Sums[0], _mm_load_ps(&pBuffer[i]));
    Sums[1] = _mm_add_ps(Sums[1], _mm_load_ps(&pBuffer[i+4]));
    Sums[2] = _mm_add_ps(Sums[2], _mm_load_ps(&pBuffer[i+8]));
    Sums[3] = _mm_add_ps(Sums[3], _mm_load_ps(&pBuffer[i+12]));
}

Sums[0] = _mm_add_ps(Sums[0], Sums[2]);
Sums[1] = _mm_add_ps(Sums[1], Sums[3]);
Sums[0] = _mm_add_ps(Sums[0], Sums[1]);

float Sum = 0.f;
for(unsigned int j=0;j<4;j++){
    Sum += Sums[0].m128_f32[j];
}

And the loop compiles into this:

000D1A70 addps xmm3,xmmword ptr [eax-20h]
000D1A74 addps xmm2,xmmword ptr [eax-10h]
000D1A78 addps xmm1,xmmword ptr [eax]
000D1A7B addps xmm0,xmmword ptr [eax+10h]
000D1A7F add eax,40h
000D1A82 dec ecx
000D1A83 jne 000D1A70

Which is what I would expect, so I did not try to compile it with the Intel compiler. So if I compare this unrolled version to the one with one sum variable, I can indeed see that the dependency path is shorter (as I mentioned before, they are about the same, but the unrolled version sums four vectors in one iteration of the loop). This corresponds with Intel's presentations and recommendations about instruction-level parallelism (e.g. GDC 2011). However, despite this fact, it is not faster (it is slightly slower). And I would expect it to be faster, since the load should take about 5 cycles and the latency of ADDPS is 3 cycles, and these dependencies should be partly avoided in the unrolled version. Finally, there is no speedup even with only 2 parallel sums.

It would be interesting to know typical values of ElementsCount in your tests, i.e. to let us know whether the buffer fits in the L1D cache or not.

Also, are you sure that you initialize the array properly before the test? If there are NaN values or denormals in the data, it can completely skew the timings.

Just a detail: since you are targeting a Penryn-generation CPU, I'd advise replacing the dec ecx / jne sequence with something else that allows macrofusion.

bronxzv, thank you very much for your post. I am aware of the macrofusion thing, but unfortunately I have had problems making VS generate the code I want in these cases. When I change the comparison operator of the loop, it generates cmp/jne, but it also adds a lea/add. And since I can't write asm code directly with VS in x64 mode, I just have to live with that.

The array is filled with random floats in <0, 1>.

Thank you for the question about the array size! I should have realized that and tested it with various sizes. When the data set is small enough to fit into the L1 D-cache, the unrolled version really runs about 2 times faster. When the array fits into L2, it is about 20-25% faster than the non-unrolled version. When the array does not fit into L2 (my CPU doesn't have an L3 cache), they run about the same.

Thank you very much for your help. I am just wondering: should I not be able to get the speedup even when the array is bigger than the cache, since I prefetch the data (I use prefetchnta, about 400 cycles ahead)? Why does the speed depend on the size of the array? Is it some overhead for managing data in the caches?

The Microsoft compiler used to have a minimally documented option like /favor:em64t which would avoid the unfavorable dec usage. It was unpopular, apparently because it carried the implication of catering to a deficiency of the Intel CPUs vs. AMD.
As your experiments show, when you don't have cache locality, you are likely to find that extra unrolling produces no performance gain, and hardware prefetch may work as well as software prefetch. If you are fetching data at the full rate of which the memory system is capable, those optimizations for cached data aren't effective.

Looking at the disassembly, something doesn't seem right with the loop.
eax is being incremented by 0x20, whereas in the statements above, each one
of the four instructions handles a 0x10-byte (16-byte) chunk of the data.

Have you verified that the two loops being compared are generating the correct code?

Chris, I apologize for that. You are obviously right; I was experimenting with only two temp variables and forgot to change back the loop step. I realized that when I was pasting the code, but forgot to change the asm too. Thanks for pointing it out; I edited that post.

Tim, I am not sure I understand that (or I don't understand what you are trying to say). I cannot see the difference between the small data set and the large one. The data is not in the cache in either case and it has to get there. So if I prefetch it by software/hardware prefetch, it should be there. The only difference I can see is that if the data set is large, some data must be evicted from the cache. However, since the data is not modified, it should not be written back or anything like that (as far as I know). Thus I believe the speedup should be the same in both cases. Now that I know it is cache related, I will run it through a profiler and see what it says.

Now look at the disassembly of the other (faster) loop.

Also, consider looking at the code generated by the Intel compiler.

The disassembly you posted for your _mm... loop looks as if it is optimized for size as opposed to speed. Use of temp registers might hide some of the fetch latencies.

Jim Dempsey

>The data is not in the cache in either case and it has to get there

When you initialize the array, the data is brought into the cache and probably kept in the L1D for small arrays (if you are not forcing a full cache flush between the initialization and each test (*1)).

Anyway, the way you explain it makes me think that you conduct the test only once per array. The timings should be very imprecise for small arrays, I guess: not only may some data already be in some cache level, but the resolution of your timer is probably not much better than the duration of the full run.

For accurate timings I'd suggest running each test several times with the same buffer: for example, run it once to warm up the data (load it into the caches), then time the same inner loop 10'000x or more with RDTSC. For the case where the entire array fits in the L1D, you'll be able to really compare the performance of your variants; otherwise it's more or less an L2/LLC/RAM bandwidth test.

Also, for reproducible timings you should disable SpeedStep in the BIOS (+ disable Turbo for Nehalem/Sandy Bridge targets).

*1: a classical issue goes as follows:

// test2 is measured as a lot faster than test1

// hey, now it's test1 the fastest!

I do indeed run multiple iterations of the tests to get some reasonable timings. I flush the caches after each iteration of a test and after I init the data. Thanks for the hint about SpeedStep; I had not realized that it was skewing the results a bit. So when the data set is small, I can see about a 20% speedup that decreases with the increasing size of the data set. However, I can now understand it after you pointed out that it is more or less a bandwidth test. So when the data set is small, the bus can deal with the memory transfer requests, but when the size increases it cannot deliver the data fast enough and that is limiting the performance. Thank you and everyone else for having the patience to explain it to me.
