babysam
August 19, 2009 4:05 AM PDT
Problems related to Intel Architecture Code Analyzer
Hello everyone!

I have used IACA for nearly a month and it is great!
It has helped me solve many problems that affected the performance of my code
(and it is simple to use as long as you have the source code in hand).

However, I have noticed something strange.
Although the analyzer correctly identifies rcpss and rsqrtss,
it treats them the same as divss and sqrtss respectively. As far as I know,
both approximation instructions should run faster than the accurate
ones, yet the analyzer shows them blocking the divider port for 14
cycles. Is it a bug or something else?

Thank you for your attention!
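
For readers unfamiliar with the trade-off behind this question, here is a minimal C sketch, not taken from the thread, of how rcpss is typically used in place of divss. It assumes an SSE-capable compiler for the `_mm_rcp_ss` intrinsic and falls back to a plain divide elsewhere; the function names `recip_approx`, `recip_refined` and `rel_error` are hypothetical. rcpss returns roughly 12 bits of precision, and one Newton-Raphson step brings the result close to single precision.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RCP 1
#else
#define HAVE_SSE_RCP 0
#endif

/* Approximate 1/a. With SSE this uses the rcpss intrinsic (about 12 bits of
   precision); without SSE it falls back to an exact divide so the sketch
   still builds. Hypothetical helper, not part of IACA. */
static float recip_approx(float a)
{
    assert(a != 0.0f);
#if HAVE_SSE_RCP
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(a)));
#else
    return 1.0f / a;
#endif
}

/* One Newton-Raphson step, x1 = x0 * (2 - a * x0), roughly doubles the
   number of correct bits in the estimate. */
static float recip_refined(float a)
{
    float x0 = recip_approx(a);
    return x0 * (2.0f - a * x0);
}

/* Relative error of an approximation against a reference value. */
static float rel_error(float approx, float exact)
{
    float d = approx - exact;
    if (d < 0.0f) d = -d;
    if (exact < 0.0f) exact = -exact;
    return d / exact;
}
```

Comparing `rel_error(recip_approx(3.0f), 1.0f / 3.0f)` with `rel_error(recip_refined(3.0f), 1.0f / 3.0f)` shows the precision gap that the approximation instructions trade for speed.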
Tal Uliel (Intel)
August 20, 2009 12:15 AM PDT
#1
Hi,

Thanks for the input on Intel(R) Architecture Code Analyzer.

You are correct, the rcp and rsqrt instructions were misclassified as divider operations.
We are working on providing a fix for this issue.

Tal


Tal Uliel (Intel)
August 24, 2009 1:08 PM PDT
#2 Reply to #1
A fixed version (1.0.2) is now available.

Tal


seizh
October 25, 2009 11:15 PM PDT
#5 Reply to #2
Hello, I'm trying out IACA 1.1 for Win32.
Is the following result a bug?
Intel(R) Architecture Code Analyzer Version - 1.1.0
Analyzed File - regrename.obj
Binary Format - 32Bit
Architecture  - Intel(R) AVX

Analysis Report
---------------
Total Throughput: 1 Cycles;             Throughput Bottleneck: FrontEnd, Port0
Total number of Uops bound to ports:  1
Data Dependency Latency:    1 Cycles;   Performance Latency:    1 Cycles

Port Binding in cycles:
-------------------------------------------------------
|  Port  |  0 - DV |  1 |  2 -  D |  3 -  D |  4 |  5 |
-------------------------------------------------------
| Cycles |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |
-------------------------------------------------------

N  - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports 2 and 3) 
CP - on a critical Data Dependency Path
N  - number of cycles port was bound
X  - other ports that can be used by this instructions
F  - Macro Fusion with the previous instruction occurred
^  - Micro Fusion happened
*  - instruction micro-ops not bound to a port
@  - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
!  - instruction not supported, was not accounted in Analysis

| Num of |          Ports pressure in cycles          |    |
|  Uops  |  0 - DV |  1 |  2 -  D |  3 -  D |  4 |  5 |    |
------------------------------------------------------------
|   1    |  1 |    |  X |    |    |    |    |    |  X | CP | pxor xmm0, xmm1
|   0*   |    |    |    |    |    |    |    |    |    |    | pxor xmm2, xmm2
|   0*   |  X |    |  X |    |    |    |    |    |  X | CP | vpxor xmm3, xmm3, xmm4
|   0*   |    |    |    |    |    |    |    |    |    |    | vpxor xmm5, xmm5, xmm5
The instruction "vpxor xmm3, xmm3, xmm4" should decode to 1 uop, because its two source operands differ and the result is therefore not guaranteed to be zero.
So I think IACA has the roles of the first and third operands reversed when it interprets this instruction.
Other instructions whose destination register always becomes all zeros when both sources are the same register (e.g. xorps, xorpd, psub*, pcmpgt*) show the same behavior.
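
The check the poster describes can be sketched in C as follows. This is an illustration only, not IACA's implementation: the `struct insn` layout and the function name `is_zeroing_idiom` are hypothetical, and the mnemonic list is limited to the instructions named in this post. The point is that the comparison must be between the two source operands; for the three-operand AVX form the destination plays no role.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical decoded form of an instruction; field names are illustrative
   and not taken from IACA. For two-operand SSE forms such as
   "pxor xmm2, xmm2", set src1 equal to dst. */
struct insn {
    const char *mnemonic;
    int dst, src1, src2;   /* register numbers, e.g. 3 for xmm3 */
};

/* Returns 1 when the instruction always produces all zeros, which is the
   case for the listed mnemonics only if BOTH SOURCES are the same register. */
static int is_zeroing_idiom(const struct insn *i)
{
    assert(i != NULL && i->mnemonic != NULL);
    const char *m = i->mnemonic;
    if (m[0] == 'v')       /* treat the AVX "v" prefix like the SSE form */
        m++;
    int known = strcmp(m, "pxor") == 0
             || strcmp(m, "xorps") == 0
             || strcmp(m, "xorpd") == 0
             || strncmp(m, "psub", 4) == 0
             || strncmp(m, "pcmpgt", 6) == 0;
    return known && i->src1 == i->src2;
}
```

Under this sketch, "vpxor xmm3, xmm3, xmm4" (dst 3, sources 3 and 4) is not an idiom, while "vpxor xmm5, xmm5, xmm5" is. A check that compared dst with src1 instead would misclassify the first case, which matches the report above.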

In addition, IACA segfaults on the following code, where the only instruction(s) between the markers decode to 0 uops.
;START_MARKER
mov ebx, 111
db 0x64, 0x67, 0x90

vpxor xmm0, xmm0, xmm0 ;0 uop

;END_MARKER
mov ebx, 222
db 0x64, 0x67, 0x90
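
To make the marker layout concrete, here is a small C sketch, not taken from IACA, that locates the marked region in a raw byte buffer. It assumes the 32-bit encodings of the sequences shown above: `mov ebx, 111` as BB 6F 00 00 00 and `mov ebx, 222` as BB DE 00 00 00, each followed by 64 67 90. The names `find_mark` and `marked_region` are hypothetical. How IACA itself parses the object file, and why it crashes on a 0 uop region, is not stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* mov ebx, 111 / mov ebx, 222 (opcode BB + imm32), each followed by the
   bytes 64 67 90. Encodings assume 32-bit code as in the post. */
static const unsigned char START_MARK[8] = {0xBB, 0x6F, 0, 0, 0, 0x64, 0x67, 0x90};
static const unsigned char END_MARK[8]   = {0xBB, 0xDE, 0, 0, 0, 0x64, 0x67, 0x90};

/* Returns the offset of the first occurrence of mark at or after from,
   or -1 if it does not occur. */
static long find_mark(const unsigned char *buf, size_t len, size_t from,
                      const unsigned char *mark)
{
    for (size_t i = from; i + 8 <= len; i++)
        if (memcmp(buf + i, mark, 8) == 0)
            return (long)i;
    return -1;
}

/* Returns the number of bytes between the start and end markers and stores
   the offset of the first marked byte in *start_out. Returns -1 if either
   marker is missing. A result of 0 means an empty region, which a caller
   should handle explicitly. */
static long marked_region(const unsigned char *buf, size_t len, size_t *start_out)
{
    assert(buf != NULL && start_out != NULL);
    long s = find_mark(buf, len, 0, START_MARK);
    if (s < 0)
        return -1;
    size_t body = (size_t)s + 8;
    long e = find_mark(buf, len, body, END_MARK);
    if (e < 0)
        return -1;
    *start_out = body;
    return e - (long)body;
}
```

For the snippet above, the region would hold only the bytes of the single vpxor instruction, so any per-uop statistic computed over it has to cope with a uop count of zero.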


Tal Uliel (Intel)
October 26, 2009 9:32 AM PDT
#6 Reply to #5
Hi,

Thank you for your input. The zeroing-idiom check was incorrect in the AVX version.

I'm currently working on fixing this and other issues. I will post an update when a fix is available on whatif.

Tal


Tal Uliel (Intel)
November 5, 2009 11:04 AM PST
#8 Reply to #6

A fixed version (1.1.1) was released today.

Tal



