Understanding Latency and Throughput

For a loop of 21 instructions (numbered 0 through 20 below), here are the latency and throughput analyses produced by IACA:

<pre>
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - foo_avx.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Latency

Latency Analysis Report
---------------------------
Latency: 27 Cycles

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

The Resource delay is counted since all the sources of the instructions are ready
and until the needed resource becomes available

| Inst | Resource Delay In Cycles | |
| Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | |
-------------------------------------------------------------------------
| 0 | | | | | | | | | | | mov rdx, qword ptr [rcx+0x40]
| 1 | | | | | | | | | | | vmovaps ymm0, ymmword ptr [rdx+rax*1]
| 2 | | | 1 | | | | | | | | vmulps ymm15, ymm0, ymmword ptr [r12+rax*8]
| 3 | | | | | | | | | 1 | | vmulps ymm14, ymm0, ymmword ptr [rbp+rax*8]
| 4 | 1 | | 1 | | | | | | 1 | | vmulps ymm13, ymm0, ymmword ptr [rdi+rax*8]
| 5 | | 1 | | | | | | | 2 | | vmulps ymm12, ymm0, ymmword ptr [rsi+rax*8]
| 6 | | | | | | | | | | | vaddps ymm8, ymm8, ymm15
| 7 | 2 | | 1 | | | | | | 2 | | vmulps ymm11, ymm0, ymmword ptr [rbx+rax*8]
| 8 | | 1 | | | | | | | | | vaddps ymm4, ymm4, ymm14
| 9 | 3 | | | | | | | | 3 | | vmulps ymm10, ymm0, ymmword ptr [r11+rax*8]
| 10 | | 1 | | | | | | | | | vaddps ymm7, ymm7, ymm13
| 11 | 4 | | | | | | | | 4 | | vmulps ymm9, ymm0, ymmword ptr [r10+rax*8]
| 12 | | 2 | | | | | | | | | vaddps ymm3, ymm3, ymm12
| 13 | 4 | | | 1 | | | | | 5 | CP | vmulps ymm0, ymm0, ymmword ptr [r9+rax*8]
| 14 | | | | | | | | | 5 | | add rax, 0x20
| 15 | | | | | | | | | | | cmp rax, r8
| 16 | | 2 | | | | | | | | | vaddps ymm6, ymm6, ymm11
| 17 | | 2 | | | | | | | | | vaddps ymm2, ymm2, ymm10
| 18 | | 2 | | | | | | | | | vaddps ymm5, ymm5, ymm9
| 19 | | 2 | | | | | | | | CP | vaddps ymm1, ymm1, ymm0
| 20 | | | | | | | | | | | jnz 0xffffffffffffff92

Resource Conflict on Critical Paths:
-----------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------
| Cycles | 4 0 | 2 | 0 0 | 1 0 | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------

List Of Delays On Critical Paths
-------------------------------
1 --> 13 1 Cycles Delay On PORT3_AGU
4 --> 13 1 Cycles Delay On Port0
7 --> 13 1 Cycles Delay On Port0
9 --> 13 1 Cycles Delay On Port0
11 --> 13 1 Cycles Delay On Port0
17 --> 19 1 Cycles Delay On Port1
18 --> 19 1 Cycles Delay On Port1

</pre>

<pre>
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - foo_avx.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: Port0, Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 8.0 0.0 | 8.0 | 5.0 5.0 | 5.0 5.0 | 0.0 | 1.5 | 1.5 | 0.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | mov rdx, qword ptr [rcx+0x40]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm0, ymmword ptr [rdx+rax*1]
| 2 | 1.0 | | 1.0 1.0 | | | | | | CP | vmulps ymm15, ymm0, ymmword ptr [r12+rax*8]
| 2 | 1.0 | | | 1.0 1.0 | | | | | CP | vmulps ymm14, ymm0, ymmword ptr [rbp+rax*8]
| 2 | 1.0 | | 1.0 1.0 | | | | | | CP | vmulps ymm13, ymm0, ymmword ptr [rdi+rax*8]
| 2 | 1.0 | | | 1.0 1.0 | | | | | CP | vmulps ymm12, ymm0, ymmword ptr [rsi+rax*8]
| 1 | | 1.0 | | | | | | | CP | vaddps ymm8, ymm8, ymm15
| 2 | 1.0 | | 1.0 1.0 | | | | | | CP | vmulps ymm11, ymm0, ymmword ptr [rbx+rax*8]
| 1 | | 1.0 | | | | | | | CP | vaddps ymm4, ymm4, ymm14
| 2 | 1.0 | | | 1.0 1.0 | | | | | CP | vmulps ymm10, ymm0, ymmword ptr [r11+rax*8]
| 1 | | 1.0 | | | | | | | CP | vaddps ymm7, ymm7, ymm13
| 2 | 1.0 | | 1.0 1.0 | | | | | | CP | vmulps ymm9, ymm0, ymmword ptr [r10+rax*8]
| 1 | | 1.0 | | | | | | | CP | vaddps ymm3, ymm3, ymm12
| 2 | 1.0 | | | 1.0 1.0 | | | | | CP | vmulps ymm0, ymm0, ymmword ptr [r9+rax*8]
| 1 | | | | | | 1.0 | | | | add rax, 0x20
| 1 | | | | | | 0.5 | 0.5 | | | cmp rax, r8
| 1 | | 1.0 | | | | | | | CP | vaddps ymm6, ymm6, ymm11
| 1 | | 1.0 | | | | | | | CP | vaddps ymm2, ymm2, ymm10
| 1 | | 1.0 | | | | | | | CP | vaddps ymm5, ymm5, ymm9
| 1 | | 1.0 | | | | | | | CP | vaddps ymm1, ymm1, ymm0
| 1 | | | | | | | 1.0 | | | jnz 0xffffffffffffff92
Total Num Of Uops: 29

</pre>
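One way to read the throughput report is that the block throughput matches the pressure on the busiest execution port. The sketch below checks that reading against the numbers in the "Port Binding In Cycles Per Iteration" row. This is an assumption about how the figure can be interpreted, not a statement of how IACA computes it internally; the names `port_cycles` and `bottleneck` are made up for illustration.

```python
# Cycles per iteration bound to each port, copied from the
# "Port Binding In Cycles Per Iteration" row of the throughput report.
port_cycles = {0: 8.0, 1: 8.0, 2: 5.0, 3: 5.0, 4: 0.0, 5: 1.5, 6: 1.5, 7: 0.0}


def bottleneck(pressure):
    """Return the highest per-port pressure and the ports that reach it.

    Assumption: the loop cannot run faster than its busiest port allows,
    so the largest value is a lower bound on cycles per iteration.
    """
    worst = max(pressure.values())
    ports = sorted(p for p, c in pressure.items() if c == worst)
    return worst, ports


block_throughput, busiest_ports = bottleneck(port_cycles)
print(f"{block_throughput:.2f} cycles per iteration, limited by ports {busiest_ports}")
```

Under that reading, the eight vmulps instructions all need port 0 and the eight vaddps instructions all need port 1, which agrees with the reported "Throughput Bottleneck: Port0, Port1".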

My naive analysis for the loop would have been:

<pre>

| Inst | Clock |
| Num | Num |
------------------
| 0 | 0 | mov rdx, qword ptr [rcx+0x40]
| 1 | 1 | vmovaps ymm0, ymmword ptr [rdx+rax*1]
| 2 | 2 | vmulps ymm15, ymm0, ymmword ptr [r12+rax*8]
| 3 | 3 | vmulps ymm14, ymm0, ymmword ptr [rbp+rax*8]
| 4 | 4 | vmulps ymm13, ymm0, ymmword ptr [rdi+rax*8]
| 5 | 5 | vmulps ymm12, ymm0, ymmword ptr [rsi+rax*8]
| 6 | 6 | vaddps ymm8, ymm8, ymm15
| 7 | 6 | vmulps ymm11, ymm0, ymmword ptr [rbx+rax*8]
| 8 | 7 | vaddps ymm4, ymm4, ymm14
| 9 | 7 | vmulps ymm10, ymm0, ymmword ptr [r11+rax*8]
| 10 | 8 | vaddps ymm7, ymm7, ymm13
| 11 | 8 | vmulps ymm9, ymm0, ymmword ptr [r10+rax*8]
| 12 | 9 | vaddps ymm3, ymm3, ymm12
| 13 | 9 | vmulps ymm0, ymm0, ymmword ptr [r9+rax*8]
| 14 | 9 | add rax, 0x20
| 15 | 10 | cmp rax, r8
| 16 | 10 | vaddps ymm6, ymm6, ymm11
| 17 | 11 | vaddps ymm2, ymm2, ymm10
| 18 | 12 | vaddps ymm5, ymm5, ymm9
| 19 | 13 | vaddps ymm1, ymm1, ymm0
| 20 | 14 | jnz 0xffffffffffffff92

</pre>

In the above I am assuming that a vmulps and a vaddps can execute in the same clock if all four inputs are available. So instructions 6 and 7 execute together in clock 6; likewise the pairs 8 and 9, 10 and 11, and 12 and 13 execute in clocks 7, 8, and 9 respectively. The integer instructions 14 and 15 can execute in parallel with the floating-point instructions.

So my naive conclusion would have been that the latency is 15 clocks (clocks 0 through 14). And, since 64 32-bit floats (the output of the 8 vaddps instructions, 8 floats each) are generated every 14 clocks, I would have said the throughput is 64/14 floats per clock, or about 4.57 floats per clock.
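For comparison, the same arithmetic can be written out as a small sketch. It assumes that IACA's "Block Throughput: 8.00 Cycles" is the number of cycles per loop iteration in steady state; the constant and function names are only illustrative.

```python
# Each 256-bit ymm register holds eight 32-bit floats.
FLOATS_PER_YMM = 256 // 32
# The loop body contains eight vaddps instructions, one per accumulator.
NUM_VADDPS = 8
FLOATS_PER_ITERATION = NUM_VADDPS * FLOATS_PER_YMM  # 64 floats


def floats_per_cycle(cycles_per_iteration):
    """Floats produced per clock for a given cycles-per-iteration figure."""
    return FLOATS_PER_ITERATION / cycles_per_iteration


# Naive estimate from the hand-made schedule: one iteration every 14 clocks.
naive_rate = floats_per_cycle(14)
# Assumption: the reported block throughput of 8.00 cycles is per iteration.
iaca_rate = floats_per_cycle(8.0)

print(f"naive: {naive_rate:.2f} floats per clock")
print(f"IACA:  {iaca_rate:.2f} floats per clock")
```

If that assumption holds, the throughput report implies a higher rate than the hand-made schedule, because successive iterations can overlap instead of running back to back.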

Given how far my naive static analysis is from these reports, it is clear that I do not understand the information IACA is providing. So please help me understand the latency and throughput information it reports.

Thanks,


Sorry, the format of the post got messed up, and I do not see a button to edit it, so I am attaching the post as a plain-text file. (It would help if the post editor provided a way to preview a post before hitting submit. I can see the edit button for replies, but not for the original post.)

Attachment: post-as-plain-text.txt (9.84 KB)
