missed optimization on SSE multiply-add loop

Thomas Willhalm (Intel)
July 10, 2009 2:52 AM PDT
#3

It is not entirely clear to me why the icc code is slower. Is it the instruction decoder (because the loop just barely no longer fits into the instruction cache)? Or is the icc code worse at exploiting instruction-level parallelism?

Is there anything I can do to make icc generate the most performant variant?


Matthias,

I am more concerned about the loop stream detector than the instruction cache. Furthermore, I wonder why ICC puts all these loads next to each other. Grouping them like that should put some stress on the out-of-order (OOO) engine.

Have you tried to experiment with "#pragma unroll"? Reducing the unrolling might improve the situation.
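
To make the suggestion concrete, here is a minimal sketch of what such an experiment could look like. The kernel below is not the code from this thread: the function name multiply_add, its arguments, and the unroll factor of 2 are assumptions chosen only for illustration. The pragma is guarded so that only icc sees it; other compilers build the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical kernel (not the original poster's code):
 * y[i] += a[i] * x[i], processing four floats per iteration. */
void multiply_add(float *y, const float *a, const float *x, size_t n)
{
    /* Assumption for this sketch: n is a multiple of 4. */
    assert(n % 4 == 0);

#if defined(__INTEL_COMPILER)
#pragma unroll(2)   /* icc only: ask for an unroll factor of 2 */
#endif
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads, so the  */
        __m128 vx = _mm_loadu_ps(x + i);   /* sketch needs no aligned  */
        __m128 vy = _mm_loadu_ps(y + i);   /* buffers                  */
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(y + i, vy);
    }
}
```

Trying a few different factors, or "#pragma nounroll", and comparing the generated assembly and timings should show whether a smaller loop body helps.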

Kind regards
Thomas
