Latency of instructions


mr_csaba

I just started using IACA, and it looks to me like the reported values differ from the documented ones. For example, when I run it on a single vmulpd operation, like IACA_START vmulpd r1, r2, r3 IACA_END, it reports 4 cycles for both data dependency latency and performance latency, whereas the latency of vmulpd is 5 cycles (Table C-2 in the Architecture Optimization Manual). Checking vaddpd gives the expected value of 3 cycles. How should I interpret the reported 4-cycle latency of the vmulpd instruction?

Tal Uliel (Intel)

Hi,

The AVX model posted on whatif isn't a Sandy Bridge model.
The 4-cycle multiply latency is one example of the discrepancy between the two
models.

As a rule, the numbers reported in the Optimization Guide
are the ones you should take into account.

Thanks,

Tal

mr_csaba

Thank you for your reply. Which model is the example if not a Sandy Bridge? How can I change it to a Sandy Bridge model? Here is the full output which shows that the architecture is Intel AVX and reports 4 cycles for the vmulpd. BTW, for the vaddpd the report shows 3 cycles, as expected.

Thanks in advance,
Csaba

Intel Architecture Code Analyzer Version - 1.1.3
Analyzed File - unroll_avx.obj
Binary Format - 64Bit
Architecture - Intel AVX

Analysis Report
---------------
Total Throughput: 1 Cycles; Throughput Bottleneck: Port0
Total number of Uops bound to ports: 1
Data Dependency Latency: 4 Cycles; Performance Latency: 4 Cycles

Port Binding in cycles:
-------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------
| Cycles | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-------------------------------------------------------

N - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports 2 and 3)
CP - on a critical Data Dependency Path
N - number of cycles port was bound
X - other ports that can be used by this instructions
F - Macro Fusion with the previous instruction occurred
^ - Micro Fusion happened
* - instruction micro-ops not bound to a port
@ - Intel AVX to Intel SSE code switch, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
------------------------------------------------------------
| 1 | 1 | | | | | | | | | CP | vmulpd ymm0, ymm1, ymm2


mr_csaba

I would really appreciate an explanation of the 4-cycle latency reported above for the vmulpd instruction, and/or of the "Sandy Bridge model", which I can't locate in the documentation.

Regards,
Csaba

Patrick Konsor (Intel)

The architecture called AVX in the latest release of IACA is not the Sandy Bridge architecture. In fact, it is not a real architecture at all: it is a hybrid that was provided to support AVX instructions before the Sandy Bridge release, so it is essentially the Westmere architecture plus AVX support and some bits of Sandy Bridge. The results that IACA reports for the AVX architecture won't correspond exactly to the actual results on a Sandy Bridge processor. Additionally, IACA provides only an optimistic estimate of performance/throughput.

The correct latency of vmulpd on Sandy Bridge is 5 cycles. The current release of IACA does not support the Sandy Bridge architecture. I will get back to you shortly about the possibility of releasing a version with proper Sandy Bridge support.
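
To see why the 4 vs. 5 cycle difference can matter in practice, here is a rough sketch in C of the usual back-of-the-envelope bound for a block of multiplies. It is a simplified model, not how IACA computes its numbers: the function name bound_cycles and its parameters are made up for illustration, and it assumes every multiply issues on a single port (Port0 in the report above) and ignores all other stalls.

```c
#include <assert.h>

/* Simplified lower bound, in cycles, for `chains` independent dependency
 * chains of `ops_per_chain` multiplies each, all issued on one port.
 *   latency      - cycles before a result can feed the next multiply
 *   issue_cycles - cycles the port is occupied per multiply
 * Hypothetical helper for illustration only; real hardware has more limits. */
long bound_cycles(long chains, long ops_per_chain,
                  long latency, long issue_cycles)
{
    assert(chains > 0 && ops_per_chain > 0 && latency > 0 && issue_cycles > 0);

    /* One chain cannot finish faster than its multiplies executed back to back. */
    long latency_bound = ops_per_chain * latency;

    /* The port cannot issue all multiplies faster than its occupancy allows. */
    long throughput_bound = chains * ops_per_chain * issue_cycles;

    return latency_bound > throughput_bound ? latency_bound : throughput_bound;
}
```

Under this model, bound_cycles(1, 100, 4, 1) gives 400 cycles with the hybrid model's 4-cycle latency, while bound_cycles(1, 100, 5, 1) gives 500 cycles with the documented 5-cycle latency. It also takes 5 independent chains rather than 4 before the one-per-cycle port throughput becomes the limit instead of the latency.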

Patrick Konsor (Intel)

We will release a version of IACA with proper Sandy Bridge support in the near future.

zhangxiuxia

If you want to know the latency and throughput of an instruction, you can measure them yourself.

See www.agner.org/ to learn how to test them.
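
As a minimal sketch of the idea (not the test method from agner.org), latency can be estimated by timing a long chain of operations in which each one needs the previous result. The function name chain_ns_per_mul is made up for illustration. Note the assumptions: this times whatever scalar double multiply the compiler emits, not vmulpd specifically (that would need intrinsics or inline assembly), it should be built with optimization (for example -O2) and without -ffast-math, and the result is in nanoseconds, so converting to cycles requires knowing the actual core clock.

```c
#include <assert.h>
#include <time.h>

/* Times a serial chain of `iterations` double-precision multiplies and
 * returns the average CPU time per multiply in nanoseconds.
 * The final product is written to *sink so the loop is not optimized away.
 * Hypothetical helper for illustration only. */
double chain_ns_per_mul(long iterations, double *sink)
{
    assert(iterations > 0 && sink != 0);

    /* Volatile read so the compiler cannot fold the chain at compile time. */
    volatile double seed = 1.0000001;
    double factor = seed;
    double x = seed;

    clock_t t0 = clock();
    for (long i = 0; i < iterations; ++i)
        x *= factor;            /* each multiply depends on the previous one */
    clock_t t1 = clock();

    *sink = x;
    double seconds = (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
    return seconds * 1e9 / (double)iterations;
}
```

Calling it with something like 100 million iterations and multiplying the result by the clock frequency in GHz gives a rough cycles-per-multiply figure. Running several independent chains in the same loop instead would measure throughput rather than latency.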
