missed optimization on SSE multiply-add loop

Matthias Kretz
July 10, 2009 3:31 AM PDT
#4 Reply to #3
I am more concerned about the loop stream detector than the instruction cache. Furthermore, I wonder why ICC puts all these loads next to each other. This should put some stress on the OOO engine.

Have you tried to experiment with "#pragma unroll"? Reducing the unrolling might improve the situation.

The LSD is what I meant. I just didn't recall the right name...

No, I didn't try #pragma unroll. Do you mean I should let icc unroll the whole 1000000-iteration loop? I guess you meant something else.

The "manual unrolling" in the code is there so that the code can make use of instruction-level parallelism. If you loop over only the 0s, then the 1s, and so on, every instruction in the loop depends on the one before it, so you have a single dependency chain and nothing is left to execute in parallel. The four independent multiply-adds are what let gcc reach peak performance.
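
To make the dependency-chain point concrete, here is a minimal sketch. It is not the code from this thread: the function names, the constants a and b, and the update x = x * a + b are assumptions chosen only for illustration. The first function has a single chain, so each multiply-add has to wait for the previous one. The second keeps four independent chains in the same loop body, which is the kind of manual unrolling described above.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps */

/* Hypothetical single-chain kernel: every iteration consumes the result
   of the previous one, so the multiply-adds cannot overlap. */
float madd_single_chain(float start, float a, float b, int iterations)
{
    __m128 x = _mm_set1_ps(start);
    const __m128 va = _mm_set1_ps(a);
    const __m128 vb = _mm_set1_ps(b);
    for (int i = 0; i < iterations; ++i) {
        x = _mm_add_ps(_mm_mul_ps(x, va), vb);
    }
    return _mm_cvtss_f32(x);  /* lowest lane of the vector */
}

/* Hypothetical four-chain kernel: x0..x3 never depend on each other,
   so the CPU is free to keep several multiply-adds in flight at once. */
void madd_four_chains(const float start[4], float a, float b,
                      int iterations, float out[4])
{
    __m128 x0 = _mm_set1_ps(start[0]);
    __m128 x1 = _mm_set1_ps(start[1]);
    __m128 x2 = _mm_set1_ps(start[2]);
    __m128 x3 = _mm_set1_ps(start[3]);
    const __m128 va = _mm_set1_ps(a);
    const __m128 vb = _mm_set1_ps(b);
    for (int i = 0; i < iterations; ++i) {
        x0 = _mm_add_ps(_mm_mul_ps(x0, va), vb);
        x1 = _mm_add_ps(_mm_mul_ps(x1, va), vb);
        x2 = _mm_add_ps(_mm_mul_ps(x2, va), vb);
        x3 = _mm_add_ps(_mm_mul_ps(x3, va), vb);
    }
    /* Store the lowest lane of each chain. */
    out[0] = _mm_cvtss_f32(x0);
    out[1] = _mm_cvtss_f32(x1);
    out[2] = _mm_cvtss_f32(x2);
    out[3] = _mm_cvtss_f32(x3);
}
```

With a = 1 and b = 1 each chain simply counts up by one per iteration, so calling madd_four_chains with start values {0, 1, 2, 3} and 10 iterations fills out with {10, 11, 12, 13}. Whether the four-chain version is actually faster depends on the compiler and the CPU, which is what this thread is about.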

