Vectorization - pragma asm interpretation

srimks
April 28, 2009 11:45 PM PDT
#3 Reply to #1
Quoting - Igor Levicki
Two LEA instructions at the function end are simply fillers (NOPs) to ensure proper alignment for the next function -- they aren't part of the function epilogue.

As for the prologue difference it is hard to tell without seeing the rest of the surrounding code. Most likely vectorization enables the compiler to "see" an opportunity for some other optimizations thus resulting in a bit shorter code which uses less variables.
Regarding the quoted point about the prologue difference, it probably helps to look at the prologues of both builds -

(a) Prologue with pragma vectorization -
{
44d960:                      55                                    push %rbp
44d961:                      48 83 ec 50                     sub $0x50,%rsp
44d965:                      49 89 f0                           mov %rsi,%r8
44d968:                      4c 63 c9                          movslq %ecx,%r9
...
...
}

(b) Prologue of the same code without any vectorization pragma -
{
44d960:                       48 83 ec 68                     sub $0x68,%rsp
44d964:                       49 89 f9                           mov %rdi,%r9
44d967:                       49 89 d0                          mov %rdx,%r8
44d96a:                       4c 63 d1                          movslq %ecx,%r10
...
...
}
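
To compare the two prologues numerically, here is a small C sketch that adds up how far each one moves %rsp and checks whether the stack stays 16-byte aligned. It counts only the instructions shown above (the elided "..." lines may contain more), it assumes the usual System V x86-64 convention that the return address occupies 8 bytes at function entry, and the helper names are my own, not anything the compiler produces.

```c
#include <stdint.h>

/* Hypothetical helper: bytes the prologue moves %rsp by.
   Each 64-bit push accounts for 8 bytes, plus the immediate of "sub $imm,%rsp". */
uint64_t frame_bytes(unsigned pushes, uint64_t sub_imm) {
    return 8u * (uint64_t)pushes + sub_imm;
}

/* Assumption: at entry the return address has just been pushed, so the
   stack is 8 bytes past a 16-byte boundary. After the prologue,
   8 + frame must be a multiple of 16 for %rsp to be aligned again. */
int keeps_16_byte_alignment(uint64_t frame) {
    return (8 + frame) % 16 == 0;
}
```

With these assumptions, (a) moves %rsp by 8 + 0x50 = 0x58 bytes and (b) by 0x68 bytes, and both leave the stack 16-byte aligned. So the build with the pragma uses 16 bytes less stack in the part of the prologue shown.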

In (a), i.e. with the pragma, the "push %rbp" instruction can be viewed as two micro-operations, "sub $8,%rsp" and "mov %rbp,(%rsp)". The advantage of this split is that the "sub $8,%rsp" micro-operation can be executed even if the value of RBP is not ready yet.
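
As a rough illustration of what a 64-bit push does, here is a toy C model. It assumes the usual semantics: RSP is decreased by 8, then the register is stored at the new top of the stack. The Machine struct and function names are hypothetical and only model the architectural effect, not how any particular CPU schedules the work internally.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: a small byte array as stack memory plus two registers. */
typedef struct {
    uint8_t  mem[64];
    uint64_t rsp;   /* offset into mem; the stack grows downward */
    uint64_t rbp;
} Machine;

/* Step 1: adjust the stack pointer. This does not read rbp at all,
   so it does not have to wait for the value of rbp. */
void push_step_sub(Machine *m) {
    m->rsp -= 8;
}

/* Step 2: store rbp at the new top of stack. This one needs rbp's value. */
void push_step_store(Machine *m) {
    memcpy(&m->mem[m->rsp], &m->rbp, sizeof m->rbp);
}

/* "push %rbp" modelled as the two steps in sequence. */
void push_rbp(Machine *m) {
    push_step_sub(m);
    push_step_store(m);
}
```

The point of separating the steps is only to show that the pointer adjustment has no data dependency on the register being saved.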

I don't think either prologue gains much over the other; with and without the vectorization pragma they do the same job. The only notable difference is that the build with the pragma has two "lea" instructions for alignment.

But the question arises - why have the "sub $0x68,%rsp" and "mov %rdi,%r9" of the build without the pragma been replaced with a single "push %rbp"?

Is it because "push %rbp" has better latency and reciprocal throughput?

~BR

