Problems related to Intel Architecture Code Analyzer

Hello everyone!

I have used IACA for nearly a month and it is great!
It has helped me solve many problems that affect the performance of my code...
(and it is simple to use as long as you have the source code in hand)

However, I have noticed something strange...
Even though the analyzer correctly identifies rcpss/rsqrtss,
it treats them the same as divss/sqrtss respectively. (As far as I know,
both approximation instructions should run faster than the accurate
ones. However, the analyzer shows them blocking the divider port for 14
cycles.) Is it a bug or something else?

Thank you for your attention!

Hi,

Thanks for the input on Intel Architecture Code Analyzer.

You are correct, the rcp and rsqrt instructions were misclassified as divider operations.
We are working on providing a fix for this issue.

Tal

A fixed version is now available. (ver 1.0.2)

Tal

Hello, I'm trying out IACA 1.1 for Win32.
Is the following result a bug?

Intel Architecture Code Analyzer Version - 1.1.0
Analyzed File - regrename.obj
Binary Format - 32Bit
Architecture  - Intel AVX

Analysis Report
---------------
Total Throughput: 1 Cycles;             Throughput Bottleneck: FrontEnd, Port0
Total number of Uops bound to ports:  1
Data Dependency Latency:    1 Cycles;   Performance Latency:    1 Cycles

Port Binding in cycles:
-------------------------------------------------------
|  Port  |  0 - DV |  1 |  2 -  D |  3 -  D |  4 |  5 |
-------------------------------------------------------
| Cycles |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |
-------------------------------------------------------

N  - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports 2 and 3) 
CP - on a critical Data Dependency Path
N  - number of cycles port was bound
X  - other ports that can be used by this instructions
F  - Macro Fusion with the previous instruction occurred
^  - Micro Fusion happened
*  - instruction micro-ops not bound to a port
@  - Intel AVX to Intel SSE code switch, dozens of cycles penalty is expected
!  - instruction not supported, was not accounted in Analysis

| Num of |          Ports pressure in cycles          |    |
|  Uops  |  0 - DV |  1 |  2 -  D |  3 -  D |  4 |  5 |    |
------------------------------------------------------------
|   1    |  1 |    |  X |    |    |    |    |    |  X | CP | pxor xmm0, xmm1
|   0*   |    |    |    |    |    |    |    |    |    |    | pxor xmm2, xmm2
|   0*   |  X |    |  X |    |    |    |    |    |  X | CP | vpxor xmm3, xmm3, xmm4
|   0*   |    |    |    |    |    |    |    |    |    |    | vpxor xmm5, xmm5, xmm5

The instruction "vpxor xmm3, xmm3, xmm4" should decode to 1 uop, since its two source operands differ.
So I think IACA has the roles of the first and third operands mixed up.
Other operations whose destination register always becomes all zeros (e.g. xorps, xorpd, psub*, pcmpgt*) behave the same way.
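
To make the operand question concrete, here is a minimal sketch of a zeroing-idiom check. It is a hypothetical illustration, not IACA's actual logic: the function name, the mnemonic set, and the operand-list representation are all assumptions. The point is that the two-operand SSE form compares the destination with the source, while the three-operand AVX form has to compare the two sources.

```python
# Hypothetical illustration only; this is not IACA's code.
# A subset of the mnemonics named in the post (psub*, pcmpgt* expanded by hand).
ZEROING_MNEMONICS = {
    "pxor", "xorps", "xorpd",
    "psubb", "psubw", "psubd", "psubq",
    "pcmpgtb", "pcmpgtw", "pcmpgtd",
}


def is_zero_idiom(mnemonic, operands):
    """Return True if the instruction always zeroes its destination."""
    name = mnemonic.lower()
    is_avx = name.startswith("v")
    base = name[1:] if is_avx else name
    if base not in ZEROING_MNEMONICS:
        return False
    if is_avx:
        # AVX form: dest, src1, src2. Only the two sources matter.
        return len(operands) == 3 and operands[1] == operands[2]
    # SSE form: dest, src. The destination doubles as the first source.
    return len(operands) == 2 and operands[0] == operands[1]


print(is_zero_idiom("pxor", ["xmm2", "xmm2"]))           # SSE, same register
print(is_zero_idiom("vpxor", ["xmm3", "xmm3", "xmm4"]))  # AVX, sources differ
print(is_zero_idiom("vpxor", ["xmm5", "xmm5", "xmm5"]))  # AVX, sources equal
```

Comparing `operands[0]` with `operands[1]` in the AVX branch would classify "vpxor xmm3, xmm3, xmm4" as a zeroing idiom, which matches the 0-uop result in the report above.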

In addition, IACA segfaults on the following code, where the only instruction between the markers takes 0 uops.

;START_MARKER
mov ebx, 111
db 0x64, 0x67, 0x90

vpxor xmm0, xmm0, xmm0 ;0 uop

;END_MARKER
mov ebx, 222
db 0x64, 0x67, 0x90
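
For anyone reproducing this, the markers are plain byte sequences: in 32-bit code "mov ebx, imm32" assembles to BB followed by the immediate, and the db line supplies 64 67 90. Below is a minimal sketch of pulling out the bytes between the two markers. The helper name `marked_region` is hypothetical, and this is not how IACA itself parses the object file.

```python
# Hypothetical helper; byte values follow from the marker code above.
START_MARKER = bytes([0xBB, 0x6F, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90])  # mov ebx, 111
END_MARKER = bytes([0xBB, 0xDE, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90])    # mov ebx, 222


def marked_region(code):
    """Return the bytes between the start and end markers, or None."""
    start = code.find(START_MARKER)
    if start < 0:
        return None
    body_start = start + len(START_MARKER)
    end = code.find(END_MARKER, body_start)
    if end < 0:
        return None
    return code[body_start:end]


# vpxor xmm0, xmm0, xmm0 with a two-byte VEX prefix
VPXOR_XMM0 = bytes([0xC5, 0xF9, 0xEF, 0xC0])
blob = b"\x90" + START_MARKER + VPXOR_XMM0 + END_MARKER + b"\xC3"
print(marked_region(blob).hex())
```

The region in the failing case is not empty; it holds one instruction that happens to need no execution port.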

Hi,

Thank you for your input. The checking of the idiom was incorrect for the AVX version.

I'm currently working on fixing this and other issues. I will post an update when a fix is available on whatif.

Tal

A fixed version (1.1.1) was released today.

Tal
