Source code of IACA

zhangxiuxia

We would like to read the source code of some software we are interested in.
Can IACA be made open source?
How long until it is open source?

Israel Hirsh (Intel)

We do not plan to open-source the Intel Architecture Code Analyzer.
What are your requirements?

zhangxiuxia

I wonder whether IACA is based on a pipeline simulator.
I want to know which instructions are issued, executed, and retired in each cycle (so that I can choose an instruction order that keeps every port busy),
rather than just the Analysis Report.

Israel Hirsh (Intel)

Providing a more detailed view is beyond the scope of the Intel Architecture Code Analyzer. We believe that the current analysis report provides useful information to guide your instruction selection and scheduling. If you have questions regarding specific code snippets, you are welcome to post them and we'll try to help.

zhangxiuxia

Thanks for your kind help.

In fact, I have many questions about Intel processor details.
1) What uops are instructions decoded into?
For example, "pextr" is decoded into 2 uops and occupies 2 ports.
What are the uops that "pextr" is decoded into?
pextrd $01, %xmm1, %eax

Does the 1st uop read xmm1,
and does the 2nd uop write eax?

What about other instructions? Is there a reference?

2) Let's go to a code snippet from my test.
I want to know whether a pure load loop can fully use the pipeline (the data is smaller than the L1 cache).

In C code:

int pure_load(double *list, int n, double *sump)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < n; i++)
    {
        sum = list[i] + sum;
    }
    *sump = sum;
    return 0;
}

When using SIMD instructions, the assembly looks like this:

.L4:
    movupd (%rdi,%rax,8), %xmm0
    movupd 16(%rdi,%rax,8), %xmm1
    movupd 32(%rdi,%rax,8), %xmm2
    movupd 48(%rdi,%rax,8), %xmm3
    addpd %xmm0, %xmm5
    addpd %xmm1, %xmm6
    addpd %xmm2, %xmm7
    addpd %xmm3, %xmm8
    addq $8, %rax
    cmpl %eax, %esi
    jg .L4
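
For readers who prefer C, the structure of the assembly loop above can be approximated with a portable sketch. It keeps eight independent partial sums, which stand in for the four two-lane accumulators xmm5 to xmm8, and it processes 8 doubles (64 bytes) per iteration as the assembly does. The function name `sum_unrolled` and the tail handling are assumptions added for illustration; they are not part of the original post. Because the additions are reordered, the floating-point result can differ slightly from the scalar loop for general inputs.

```c
#include <stddef.h>

/* Sum n doubles with eight independent partial sums, mirroring the four
 * two-lane SSE accumulators (xmm5..xmm8) of the assembly loop.
 * Hypothetical helper, written for illustration only. */
double sum_unrolled(const double *list, size_t n)
{
    double acc[8] = {0.0};
    size_t i = 0;

    /* Main loop: 8 doubles (64 bytes) per iteration, like the assembly. */
    for (; i + 8 <= n; i += 8) {
        for (size_t k = 0; k < 8; k++)
            acc[k] += list[i + k];
    }

    /* Combine the partial sums once the main loop is done. */
    double sum = 0.0;
    for (size_t k = 0; k < 8; k++)
        sum += acc[k];

    /* Tail: leftover elements when n is not a multiple of 8. */
    for (; i < n; i++)
        sum += list[i];

    return sum;
}
```

Independent accumulators keep successive additions from waiting on each other, which is why the assembly uses four destination registers instead of one.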

I wrote a pipeline analysis tool that analyzes which instructions are executed, retired, and buffered in each cycle.
It shows that, except at the tail of the code, one load instruction is issued every cycle.

I used IACA too. It shows that, when the loop is executed many times, the throughput is 4 cycles for this snippet.

The total data is 4000*8 bytes.
Reading it should take 4000*8/16 = 2000 cycles,
but the actual performance is 3000 cycles.

Where is the performance lost?
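
The estimate above can be written out as a small calculation. The sketch below derives the same figure two ways: from the IACA throughput (cycles per loop iteration) and from the load count (one 16-byte load per cycle, as the post assumes). The function names are hypothetical, and the 8 elements per iteration come from the `addq $8, %rax` step in the assembly.

```c
#include <assert.h>

/* Estimated cycles = loop iterations * cycles per iteration.
 * Ignores any tail iterations and loop start-up cost. */
unsigned long expected_cycles(unsigned long n_elems,
                              unsigned long elems_per_iter,
                              unsigned long cycles_per_iter)
{
    assert(elems_per_iter > 0);
    unsigned long iters = n_elems / elems_per_iter;
    return iters * cycles_per_iter;
}

/* Lower bound from loads alone: number of loads / loads issued per cycle. */
unsigned long load_bound_cycles(unsigned long total_bytes,
                                unsigned long bytes_per_load,
                                unsigned long loads_per_cycle)
{
    assert(bytes_per_load > 0 && loads_per_cycle > 0);
    return total_bytes / bytes_per_load / loads_per_cycle;
}
```

With 4000 doubles, `expected_cycles(4000, 8, 4)` and `load_bound_cycles(4000 * 8, 16, 1)` both give 2000, so the two views of the loop agree with each other and with the figure in the post.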

zhangxiuxia

This is the output of IACA:

Analysis Report
---------------
Total Throughput: 4 Cycles; Throughput Bottleneck: Port1, Port2_ALU, Port2_DATA
Total number of Uops bound to ports: 10
Data Dependency Latency: 9 Cycles; Performance Latency: 13 Cycles

Port Binding in cycles:
-----------------------------------------------------------
|  Port  |  0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |
-----------------------------------------------------------
| Cycles | 1  |  0  |  4  | 4  | 4  | 0  | 0  |  0  |  1  |
-----------------------------------------------------------

N - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports
CP - on a critical Data Dependency Path
N - number of cycles port was bound
X - other ports that can be used by this instructions
F - Macro Fusion with the previous instruction occurred
^ - Micro Fusion happened
* - instruction micro-ops not bound to a port
@ - Intel AVX to Intel SSE code switch, dozens of cycles penalty is expe
! - instruction not supported, was not accounted in Analysis

| Num of |            Ports pressure in cycles            |    |
|  Uops  |  0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |    |
----------------------------------------------------------------
|   1    |    |     |     | 1  | 1  |    |    |     |     | CP | movupd xmm0, xmmwo
|   1    |    |     |     | 1  | 1  |    |    |     |     | CP | movupd xmm1, xmmwo
|   1    |    |     |     | 1  | 1  |    |    |     |     | CP | movupd xmm2, xmmwo
|   1    |    |     |     | 1  | 1  |    |    |     |     | CP | movupd xmm3, xmmwo
|   1    |    |     |  1  |    |    |    |    |     |     | CP | addpd xmm5, xmm0
|   1    |    |     |  1  |    |    |    |    |     |     | CP | addpd xmm6, xmm1
|   1    |    |     |  1  |    |    |    |    |     |     | CP | addpd xmm7, xmm2
|   1    |    |     |  1  |    |    |    |    |     |     | CP | addpd xmm8, xmm3
|   1    | 1  |     |  X  |    |    |    |    |     |  X  |    | add rax, 0x8
|   1    | X  |     |  X  |    |    |    |    |     |  1  |    | cmp esi, eax
|   0F   |    |     |     |    |    |    |    |     |     |    | jnle 0xfffffffffff
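
One way to read the report is to treat the busiest port as a lower bound on the cycles per iteration. The sketch below is only an illustration of that reading, not a description of how IACA computes its figure: it takes the per-port cycle counts and returns the largest one. The function name `max_port_cycles` is hypothetical.

```c
#include <stddef.h>

/* Return the highest per-port cycle count and store the index of the first
 * port that reaches it in *busiest. Assumes port pressure is the only
 * limit; front-end and dependency effects are ignored. */
unsigned max_port_cycles(const unsigned *cycles, size_t n_ports, size_t *busiest)
{
    unsigned best = 0;
    *busiest = 0;
    for (size_t p = 0; p < n_ports; p++) {
        if (cycles[p] > best) {
            best = cycles[p];
            *busiest = p;
        }
    }
    return best;
}
```

With the counts from the report for ports 0 to 5 (1, 4, 4, 0, 0, 1), the function returns 4 and reports port 1 first, which matches the reported Total Throughput of 4 cycles and the listed bottleneck ports.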

Tal Uliel (Intel)

Hello,

I couldn't reproduce the cycle count you are seeing above (3000 cycles for 4000
elements). I see ~2000 cycles, as the analyzer predicted.
How exactly did you measure the loop performance?

Thanks,
Tal
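
Since the thread turns on how the loop was measured, here is a minimal measurement sketch. It is an assumption about one reasonable method, not the method either poster used. It reads the time-stamp counter with `__rdtsc()` from `<x86intrin.h>` on x86 (GCC and Clang) and falls back to `clock_gettime` in nanoseconds elsewhere. The names `read_ticks`, `sum_plain`, and `min_ticks` are hypothetical. Note that the TSC counts at a fixed reference rate on most modern processors, so its ticks can differ from core clock cycles when the frequency changes.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */
#endif

/* Read a tick counter: TSC ticks on x86, nanoseconds on other targets. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
#endif
}

/* Simple kernel to time; stands in for the loop under test. */
double sum_plain(const double *list, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += list[i];
    return sum;
}

/* Run fn(list, n) reps times and return the smallest tick delta.
 * Taking the minimum discards runs disturbed by cold caches or interrupts. */
uint64_t min_ticks(double (*fn)(const double *, size_t),
                   const double *list, size_t n, int reps, double *result)
{
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < reps; r++) {
        uint64_t t0 = read_ticks();
        double s = fn(list, n);
        uint64_t t1 = read_ticks();
        *result = s;  /* store the result so the call is not optimized away */
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Repeating the run matters here: the first pass over a 32000-byte array pays for cache misses, while later passes find the data in L1, which is the case the 2000-cycle estimate describes.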
