IACA analysis of loops

I have been using IACA for a few days, mainly to analyze the performance of my assembly code.
I noticed that it seems to analyze only the first pass of a loop, even though the loop runs for thousands of passes.

How did I find this?
I wrote two versions of my assembly code; for convenience I will call them Code A and Code B. They differ only slightly.
Code A runs faster than Code B.
Both Code A and Code B contain a loop.
The first pass of the loop in Code A is slower than in Code B,
but in later passes Code A is faster.

However, IACA reports that Code B is faster than Code A by 2 clock cycles in both throughput and latency.

Can IACA only analyze the first pass of a loop?

Hello zhangxiuxia,

Intel Architecture Code Analyzer calculates an optimistic theoretical throughput of a basic block.

I noticed you mentioned that "both Code A and Code B have a loop inside". This is a static analyzer, so having loops inside the analyzed code will lead to a false analysis.

Also, the analyzer doesn't take into account cache misses, split loads, bank conflicts, and other dynamic events that can occur. These can have a big effect on the actual performance and on the ratio between block A and block B that the analyzer predicted.

If you can share the two blocks and their respective Intel Architecture Code Analyzer analyses, I can comment further on the numbers.

Thanks,

Tal Uliel

Thank you very much for your reply. I don't have my code at hand right now; I will post it in the forum later.

By the way, I have a question about the microarchitecture of Intel processors.
When does branch prediction happen?
When the branch instruction's uop is issued,
or when the branch instruction is decoded?

I think it is the latter, but I am not sure.

Code A (faster than Code B)
Part of Code A:
.L11:
    testl %ebp, %ebp
    jle .L9
    xorl %r11d, %r11d
    xorq %rcx, %rcx
    .p2align 4,,7
    movdqu (%r13,%rcx,4), %xmm7      # col_idx: four columns at a time
    movdqa %xmm7, %xmm5
    movd %xmm5, %eax
    psrldq $4, %xmm5
.L4:
    movl 4(%rbx,%r11,4), %r10d
    xorpd %xmm1, %xmm1
    xorl %r8d, %r8d
    subl $3, %r10d
    movl (%rbx,%r11,4), %ecx
    subl %ecx, %r10d
    cmpl $0, %r10d
    jle .L7
    .p2align 4,,7
.L8:
    movupd (%r14,%rcx,8), %xmm0      # A._value: two at a time
    movdqu 16(%r13,%rcx,4), %xmm7    # col_idx: four columns at a time
    addl $4, %r8d                    # like j, but restarts from 0 for each inner loop
    movlpd (%rdi,%rax,8), %xmm2      # x value: one at a time
    movd %xmm5, %eax
    psrldq $4, %xmm5
    movhpd (%rdi,%rax,8), %xmm2
    mulpd %xmm2, %xmm0
    movupd 16(%r14,%rcx,8), %xmm4    # A._value: two at a time
    addq $4, %rcx                    # j
    movd %xmm5, %eax
    psrldq $4, %xmm5
    movlpd (%rdi,%rax,8), %xmm6
    movd %xmm5, %eax
    movhpd (%rdi,%rax,8), %xmm6
    movdqa %xmm7, %xmm5
    mulpd %xmm6, %xmm4
    movd %xmm5, %eax
    psrldq $4, %xmm5
    addpd %xmm0, %xmm1
    addpd %xmm4, %xmm1
    cmpl %r8d, %r10d
    jg .L8
.L7:
    movhlps %xmm1, %xmm3
    incl %esi
    addsd %xmm3, %xmm1
    movsd %xmm1, (%r12,%r11,8)
    incq %r11
    cmpl %ebp, %r11d
    jne .L4
.L9:
    subl $1, %r9d
    cmpl $0, %r9d
    jg .L11

code A analysis report
1 Intel Architecture Code Analyzer Version - 1.1.3
2 Analyzed File - spmv_pre_.o
3 Binary Format - 64Bit
4 Architecture - Intel microArchitecture - codename Nehalem
5
6 Analysis Report
7 ---------------
8 Total Throughput: 14 Cycles; Throughput Bottleneck: Port5
9 Total number of Uops bound to ports: 52
10 Data Dependency Latency: 34 Cycles; Performance Latency: 36 Cycles
11
12 Port Binding in cycles:
13 -------------------------------------------------------
14 | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
15 -------------------------------------------------------
16 | Cycles | 13 | 0 | 13 | 10 | 10 | 1 | 0 | 1 | 14 |
17 -------------------------------------------------------
18
19 N - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports
20 CP - on a critical Data Dependency Path
21 N - number of cycles port was bound
22 X - other ports that can be used by this instructions
23 F - Macro Fusion with the previous instruction occurred
24 ^ - Micro Fusion happened
25 * - instruction micro-ops not bound to a port
26 @ - Intel AVX to Intel SSE code switch, dozens of cycles penalty is expe
27 ! - instruction not supported, was not accounted in Analysis
28
29 | Num of | Ports pressure in cycles | |
30 | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
31 ------------------------------------------------------------
32 | 1 | 1 | | X | | | | | | X | | data16 nop
33 | 1 | X | | X | | | | | | 1 | | test ebp, ebp
34 | 0F | | | | | | | | | | | jle 0xc4
35 | 1 | X | | 1 | | | | | | X | CP | xor rcx, rcx
36 | 1 | 1 | | X | | | | | | X | | xor r11d, r11d
37 | 1 | | | | 1 | 1 | | | | | CP | movdqu xmm7, xmmwo
38 | 1 | X | | 1 | | | | | | X | CP | movdqa xmm5, xmm7
39 | 1 | X | | X | | | | | | 1 | | movd eax, xmm5
40 | 1 | 1 | | | | | | | | X | CP | psrldq xmm5, 0x4
41 | 1 | | | | 1 | 1 | | | | | | mov r10d, dword pt
42 | 1 | | | | | | | | | 1 | | xorpd xmm1, xmm1
43 | 1 | X | | 1 | | | | | | X | | xor r8d, r8d
44 | 1 | 1 | | X | | | | | | X | | sub r10d, 0x3
45 | 1 | | | | 1 | 1 | | | | | | mov ecx, dword ptr
46 | 1 | X | | 1 | | | | | | X | | sub r10d, ecx
47 | 1 | X | | X | | | | | | 1 | | cmp r10d, 0x0
48 | 0F | | | | | | | | | | | jle 0x72
49 | 1 | 1 | | X | | | | | | X | | nop
50 | 1 | | | | 1 | 1 | | | | | | movupd xmm0, xmmwo
51 | 1 | | | | 1 | 1 | | | | | | movdqu xmm7, xmmwo
52 | 1 | X | | 1 | | | | | | X | | add r8d, 0x4
53 | 2 | | | | 1 | 1 | | | | 1 | | movlpd xmm2, qword
54 | 1 | 1 | | X | | | | | | X | CP | movdqa xmm13, xmm5
55 | 1 | X | | 1 | | | | | | X | CP | movd eax, xmm5
56 | 1 | X | | | | | | | | 1 | CP | psrldq xmm13, 0x4
57 | 2 | | | | 1 | 1 | | | | 1 | CP | movhpd xmm2, qword
58 | 1 | 1 | | | | | | | | | CP | mulpd xmm0, xmm2
59 | 1 | | | | 1 | 1 | | | | | | movupd xmm4, xmmwo
60 | 1 | X | | 1 | | | | | | X | | add rcx, 0x4
61 | 1 | 1 | | X | | | | | | X | | movd eax, xmm13
62 | 1 | X | | | | | | | | 1 | CP | psrldq xmm13, 0x4
63 | 1 | X | | 1 | | | | | | X | | movdqa xmm5, xmm7
64 | 2 | | | | 1 | 1 | | | | 1 | | movlpd xmm6, qword
65 | 1 | 1 | | X | | | | | | X | CP | movd eax, xmm13
66 | 2 | | | | 1 | 1 | | | | 1 | CP | movhpd xmm6, qword
67 | 1 | X | | 1 | | | | | | X | | movd eax, xmm5
68 | 1 | 1 | | | | | | | | X | | psrldq xmm5, 0x4
69 | 1 | 1 | | | | | | | | | CP | mulpd xmm4, xmm6
70 | 1 | | | 1 | | | | | | | CP | addpd xmm1, xmm0
71 | 1 | | | 1 | | | | | | | CP | addpd xmm1, xmm4
72 | 1 | X | | X | | | | | | 1 | | cmp r10d, r8d
73 | 0F | | | | | | | | | | | jnle 0xfffffffffff
74 | 1 | | | | | | | | | 1 | CP | movhlps xmm3, xmm1
75 | 1 | 1 | | X | | | | | | X | | inc esi
76 | 1 | | | 1 | | | | | | | CP | addsd xmm1, xmm3
77 | 2 | | | | | | 1 | | 1 | | CP | movsd qword ptr [r
78 | 1 | 1 | | X | | | | | | X | | inc r11
79 | 1 | X | | X | | | | | | 1 | | cmp r11d, ebp
80 | 0F | | | | | | | | | | | jnz 0xffffffffffff
81 | 1 | X | | 1 | | | | | | X | | sub r9d, 0x1
82 | 1 | X | | X | | | | | | 1 | | cmp r9d, 0x0
83 | 0F | | | | | | | | | | | jnle 0xfffffffffff

code B
.L11:
    testl %ebp, %ebp
    jle .L9
    xorl %r11d, %r11d
    .p2align 4,,7
.L4:
    movl 4(%rbx,%r11,4), %r10d
    xorpd %xmm1, %xmm1
    xorl %r8d, %r8d
    subl $3, %r10d
    movl (%rbx,%r11,4), %ecx
    subl %ecx, %r10d
    cmpl $0, %r10d
    jle .L7
    .p2align 4,,7
.L8:
    movdqu (%r13,%rcx,4), %xmm5      # col_idx: four columns at a time
    movupd (%r14,%rcx,8), %xmm0      # A._value: two at a time
    movupd 16(%r14,%rcx,8), %xmm4    # A._value: two at a time
    addl $4, %r8d                    # like j, but restarts from 0 for each inner loop
    addq $4, %rcx                    # j
    movd %xmm5, %eax
    psrldq $4, %xmm5
    movlpd (%rdi,%rax,8), %xmm2      # x value: one at a time
    movd %xmm5, %eax
    psrldq $4, %xmm5
    movhpd (%rdi,%rax,8), %xmm2
    mulpd %xmm2, %xmm0
    movd %xmm5, %eax
    psrldq $4, %xmm5
    movlpd (%rdi,%rax,8), %xmm6
    movd %xmm5, %eax
    movhpd (%rdi,%rax,8), %xmm6
    mulpd %xmm6, %xmm4
    addpd %xmm0, %xmm1
    addpd %xmm4, %xmm1
    cmpl %r8d, %r10d
    jg .L8
.L7:
    movhlps %xmm1, %xmm3
    incl %esi
    addsd %xmm3, %xmm1
    movsd %xmm1, (%r12,%r11,8)
    incq %r11
    cmpl %ebp, %r11d
    jne .L4
.L9:
    subl $1, %r9d
    cmpl $0, %r9d
    jg .L11

1 Intel Architecture Code Analyzer Version - 1.1.3
2 Analyzed File - csr_spmv.o
3 Binary Format - 64Bit
4 Architecture - Intel microArchitecture - codename Nehalem
5
6 Analysis Report
7 ---------------
8 The analyzed block contains one or more instructions with issues.
9 The throughput and latency cycle counts do not account for those instructions.
10
11 Total Throughput: 9 Cycles; Throughput Bottleneck: Port0, Port2_ALU, Port2_DATA, Port5
12 Total number of Uops bound to ports: 35
13 Data Dependency Latency: 31 Cycles; Performance Latency: 35 Cycles
14
15 Port Binding in cycles:
16 -------------------------------------------------------
17 | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
18 -------------------------------------------------------
19 | Cycles | 9 | 0 | 8 | 9 | 9 | 0 | 0 | 0 | 9 |
20 -------------------------------------------------------
21
22 N - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports
23 CP - on a critical Data Dependency Path
24 N - number of cycles port was bound
25 X - other ports that can be used by this instructions
26 F - Macro Fusion with the previous instruction occurred
27 ^ - Micro Fusion happened
28 * - instruction micro-ops not bound to a port
29 @ - Intel AVX to Intel SSE code switch, dozens of cycles penalty is expe
30 ! - instruction not supported, was not accounted in Analysis
31
32 | Num of | Ports pressure in cycles | |
33 | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
34 ------------------------------------------------------------
35 | 1 | | | | 1 | 1 | | | | | | mov r10d, dword pt
36 | 1 | | | | | | | | | 1 | | xorpd xmm1, xmm1
37 | 1 | 1 | | X | | | | | | X | | xor r8d, r8d
38 | 1 | X | | 1 | | | | | | X | | sub r10d, 0x3
39 | 1 | | | | 1 | 1 | | | | | CP | mov ecx, dword ptr
40 | 1 | 1 | | X | | | | | | X | | sub r10d, ecx
41 | 1 | X | | X | | | | | | 1 | | cmp r10d, 0x0
42 | 0F | | | | | | | | | | | jle 0x7f
43 | 1 | X | | 1 | | | | | | X | | nop dword ptr [rax
44 | ! | ! | ! | ! | ! | ! | ! | ! | ! | ! | | ud2
45 | 1 | 1 | | X | | | | | | X | | mov ebx, 0x6f
46 | 1 | X | | 1 | | | | | | X | | addr32 nop
47 | 1 | | | | 1 | 1 | | | | | CP | movdqu xmm5, xmmwo
48 | 1 | | | | 1 | 1 | | | | | | movupd xmm0, xmmwo
49 | 1 | | | | 1 | 1 | | | | | | movupd xmm4, xmmwo
50 | 1 | 1 | | X | | | | | | X | | add r8d, 0x4
51 | 1 | 1 | | X | | | | | | X | | add rcx, 0x4
52 | 1 | X | | 1 | | | | | | X | | movd eax, xmm5
53 | 1 | X | | | | | | | | 1 | CP | psrldq xmm5, 0x4
54 | 2 | | | | 1 | 1 | | | | 1 | | movlpd xmm2, qword
55 | 1 | 1 | | X | | | | | | X | CP | movd eax, xmm5
56 | 1 | 1 | | | | | | | | X | | psrldq xmm5, 0x4
57 | 2 | | | | 1 | 1 | | | | 1 | CP | movhpd xmm2, qword
58 | 1 | 1 | | | | | | | | | CP | mulpd xmm0, xmm2
59 | 1 | X | | 1 | | | | | | X | | movd eax, xmm5
60 | 1 | X | | | | | | | | 1 | | psrldq xmm5, 0x4
61 | 2 | | | | 1 | 1 | | | | 1 | | movlpd xmm6, qword
62 | 1 | X | | 1 | | | | | | X | | movd eax, xmm5
63 | 2 | | | | 1 | 1 | | | | 1 | | movhpd xmm6, qword
64 | 1 | 1 | | | | | | | | | | mulpd xmm4, xmm6
65 | 1 | | | 1 | | | | | | | CP | addpd xmm1, xmm0
66 | 1 | | | 1 | | | | | | | CP | addpd xmm1, xmm4
67 | 1 | X | | X | | | | | | 1 | | cmp r10d, r8d
68 | 0F | | | | | | | | | | | jnle 0xfffffffffff
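
As an aside, in both reports the "Total Throughput" figure equals the largest value in the "Port Binding in cycles" row, and the listed bottlenecks are the ports that reach that value. The Python sketch below only re-derives that relationship from the numbers printed above; the dictionary keys and the function name `throughput_bound` are made up for illustration, and this is not a description of how IACA computes its result internally.

```python
# Port binding cycles copied from the two reports above.
# Key names ("0-DV", "2-D", ...) are illustrative labels, not IACA identifiers.
port_cycles_a = {"0": 13, "0-DV": 0, "1": 13, "2": 10, "2-D": 10,
                 "3": 1, "3-D": 0, "4": 1, "5": 14}
port_cycles_b = {"0": 9, "0-DV": 0, "1": 8, "2": 9, "2-D": 9,
                 "3": 0, "3-D": 0, "4": 0, "5": 9}


def throughput_bound(port_cycles):
    """Return (busiest port's cycle count, list of ports that reach it)."""
    worst = max(port_cycles.values())
    bottlenecks = [port for port, cycles in port_cycles.items() if cycles == worst]
    return worst, bottlenecks


print(throughput_bound(port_cycles_a))  # (14, ['5'])
print(throughput_bound(port_cycles_b))  # (9, ['0', '2', '2-D', '5'])
```

Note also that the two reports cover different instruction ranges (52 uops starting at the outer loop for code A, 35 uops starting at .L4 for code B), so the two totals are not a like-for-like comparison.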

I have a question about the definitions of some terms; I wonder whether I understand them correctly.

Data Dependency Latency: the number of cycles it takes to execute the data dependency
critical path (see below)

Data Dependency critical path: identifies the longest latency chain(s) of instructions
where inputs of one instruction depend on the output of previously executed
instructions.

My understanding of the Data Dependency critical path is this:
the latencies of the instructions on the chain are added together. Is that right?
Let me give an example:
# assume no other instruction writes %eax
addl %eax, %ebx     # throughput 1, latency 1
imull %ebx, %ecx    # throughput 1, latency 3
xorl %edx, %edx
incl %edx           # latency 1
imull %ecx, %ecx    # throughput 1, latency 3

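Then the latency is 1 + 3 + 3 = 7 cycles (the addl followed by the two dependent multiplies; the xorl/incl pair forms a separate, shorter chain).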
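
To check a sum like this by hand, one can walk the instructions in order and record when each result becomes available. The sketch below is a hypothetical model, not IACA's algorithm: the dependency lists are filled in by hand, the multiplies are written as two-operand imull, and the latencies are the ones assumed in the example (with 1 cycle assumed for xorl).

```python
# Each entry: (text, latency in cycles, indices of earlier instructions it depends on).
# Latencies are the assumed values from the example, not measured ones.
instructions = [
    ("addl  %eax, %ebx", 1, []),   # 0
    ("imull %ebx, %ecx", 3, [0]),  # 1: needs %ebx produced by 0
    ("xorl  %edx, %edx", 1, []),   # 2: independent of the chain above
    ("incl  %edx",       1, [2]),  # 3: needs %edx produced by 2
    ("imull %ecx, %ecx", 3, [1]),  # 4: needs %ecx produced by 1
]


def finish_times(instrs):
    """Cycle at which each result is ready, ignoring port and front-end limits."""
    finish = []
    for _text, latency, deps in instrs:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)
    return finish


def critical_path_latency(instrs):
    """Length of the longest dependency chain, in cycles."""
    return max(finish_times(instrs))


print(finish_times(instructions))           # [1, 4, 1, 2, 7]
print(critical_path_latency(instructions))  # 7
```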

Performance Latency: the number of cycles it takes to execute the performance critical
path (see below).
Port binding cycle summary.

Detailed report on the port binding of each instruction and the number of cycles the port
was bound.

Instructions on one or more of the code section critical paths. There are two types of
critical paths:

Performance critical path: identifies the longest latency chains based on the following
criteria:
Instructions whose inputs depend on the output of previously executed instructions.
Instructions that were delayed due to front-end pressure.

What I understand is:
performance latency is data latency plus delays due to front-end pressure

Hello,

Sorry for the delayed response.

"What I understand is: performance latency is data latency plus delays due to front-end pressure"
You understand correctly.
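
Read that way, the front-end contribution in each report is the difference between the two latency figures. A small sketch using the numbers from the two reports above (the dictionary layout and the function name `front_end_delay` are illustrative assumptions):

```python
# Latency figures copied from the two IACA reports in this thread.
reports = {
    "code A": {"data_dependency_latency": 34, "performance_latency": 36},
    "code B": {"data_dependency_latency": 31, "performance_latency": 35},
}


def front_end_delay(report):
    """Cycles attributed to front-end pressure under the reading confirmed above."""
    return report["performance_latency"] - report["data_dependency_latency"]


for name, report in reports.items():
    print(name, front_end_delay(report))
```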

Regarding the two blocks of code you posted:
The analyzer is a static tool, so it can't figure out how many times each of the
basic blocks inside the macro block you asked it to analyze will execute.

For throughput (TPT) analysis, analyzing only the basic block is critical, because instructions
from the prolog and epilog of the loop have only a small effect on overall
performance (assuming the loop is executed many times). The analyzer assumes the
code it receives is only the loop body; it doesn't identify the basic block within a larger macro
block.
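
A rough way to see why the loop body dominates: model the total cost as a one-time prolog/epilog cost plus a per-iteration cost. The cycle counts below are invented purely for illustration and are not measurements of code A or code B.

```python
def total_cycles(setup, per_iteration, iterations):
    """One-time prolog/epilog cost plus the loop body cost for every iteration."""
    return setup + per_iteration * iterations


# Hypothetical costs: variant_a pays more up front but has a cheaper loop body.
variant_a = {"setup": 30, "per_iteration": 10}
variant_b = {"setup": 20, "per_iteration": 12}

for n in (1, 5, 1000):
    a = total_cycles(iterations=n, **variant_a)
    b = total_cycles(iterations=n, **variant_b)
    print(n, a, b)
```

With these made-up numbers the variant with the cheaper setup wins for a single iteration, the two tie at five iterations, and the variant with the cheaper loop body wins clearly at a thousand, which is why only the loop body should be handed to the analyzer.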

For latency analysis, if none of the branches in the macro block are taken, then
the latency reported by the analyzer will be valid.

Tal
