Intel® Architecture Code Analyzer

Version 3.0 is out! This is a rewrite of the tool with some improved features.

Product Overview

Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.

Features and Benefits

For a given binary, Intel Architecture Code Analyzer:

  • Performs static analysis of kernel throughput and latency under ideal front-end, out-of-order engine and memory hierarchy conditions.
  • Identifies the binding of the kernel instructions to the processor ports.
  • Identifies kernel critical path.

Intel Architecture Code Analyzer lets you make a first-order estimate of relative kernel performance on different microarchitectures. It does not provide absolute performance numbers.

Intel Architecture Code Analyzer is a command-line tool with ASCII output. It handles one or more kernels that are marked for analysis within an executable, a shared library, or an object file.

Throughput Analysis

The Throughput Analysis treats the kernel as the body of an infinite loop. It computes the kernel throughput and highlights its bottlenecks.

The Throughput Analysis report contains the following whole kernel information:

  • Throughput of the analyzed kernel, counted in cycles.
  • The kernel bottleneck: front-end, port #, divider unit or inter-iteration dependency.
  • Total number of cycles each processor port was bound with micro-ops.

The Throughput Analysis also provides the following information per instruction:

  • Number of instruction micro-ops.
  • Average number of cycles the instruction was bound to each processor port, per loop iteration.
  • An indication whether the instruction is on the critical path of the analyzed kernel.
  • Instruction disassembly in Intel® Software Developer’s Manual (MASM) style.
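
The relationship between the per-instruction port binding and the whole-kernel port totals can be illustrated with a minimal sketch. It is not the analyzer's algorithm: it only sums hypothetical per-instruction port cycles and reports the busiest port, and it ignores the front-end, the divider unit and dependency chains. The instruction names, port labels and cycle values below are made up for illustration.

```python
from collections import defaultdict


def port_pressure(kernel):
    """Sum per-instruction port cycles into per-port totals for one iteration."""
    totals = defaultdict(float)
    for _name, ports in kernel:
        for port, cycles in ports.items():
            totals[port] += cycles
    return dict(totals)


def busiest_port(kernel):
    """Return (port, cycles) for the port with the highest total binding."""
    totals = port_pressure(kernel)
    port = max(totals, key=totals.get)
    return port, totals[port]


# Hypothetical kernel: each entry is (instruction, {port: average cycles bound}).
kernel = [
    ("load",  {"2": 0.5, "3": 0.5}),
    ("add",   {"0": 0.5, "1": 0.5}),
    ("mul",   {"0": 1.0}),
    ("store", {"4": 1.0, "2": 0.5, "3": 0.5}),
]

totals = port_pressure(kernel)
port, cycles = busiest_port(kernel)
print(totals)
print("busiest port:", port, "with", cycles, "cycles per iteration")
```

In this simplified model, the busiest port sets a lower bound on the cycles per loop iteration; the real report may name a different bottleneck (for example the front-end or a dependency chain).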

Technical Requirements

Intel Architecture Code Analyzer is a command-line utility that analyzes a kernel delimited with special markers inside a binary file. The tool can analyze Intel® 64 code, including Intel® AVX, AVX2 and AVX-512 instructions.

Intel Architecture Code Analyzer is available on Windows*, Linux*, and MacOS* operating systems. Only Intel® 64 operating systems are supported.

Release Notes for 3.0

  • Version 3.0 is a rewrite of Intel Architecture Code Analyzer. No new microarchitectures are added, but the UI changed. For example, port columns in the output are now large enough to accommodate up to 99.9 cycles per port. This may affect tools that automatically process Intel Architecture Code Analyzer output.
  • The tool now accepts the -trace <file> switch which generates an Intel Architecture Code Analyzer trace directly to a file without the need for post-processing. A separate switch (-trace-cycle-count) can be used to control how many cycles to trace.
  • Various switches were deprecated; see the user guide.

Release Notes for 2.3

  • Added support for Intel® microarchitecture code name Skylake (client and server).
  • Added support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
  • Added support for tracing the execution (see user guide).
  • Dropped the -no_interiteration flag.

Release Notes for 2.2

  • Added support for Intel® microarchitecture code name Broadwell.
  • Better support for Intel® Advanced Vector Extensions (Intel® AVX) Gather operations.
  • Replaced the "InterIteration" throughput bottleneck indication with a more general "long dependency chains" indication.
  • Added an indication when front end bubbles occur (see user guide).
  • Numerous improvements in modelling supported processors.
  • Unsupported instructions are now marked with 'X' instead of '!' for better readability.
  • The NHM and WSM microarchitectures are no longer actively supported.
  • Removed support for running Intel Architecture Code Analyzer on 32 bit operating systems and for analyzing 32 bit programs.
  • Dropped latency analysis support.
  • Added support for Windows* OS.

Release Notes for 2.1

  • Added support for Intel® microarchitecture codenamed Haswell.
  • Added support for Microsoft Visual Studio* 64 compiler.
  • Added 64-bit binaries.

Release Notes for 2.0.1

  • Fixed a bug where the -graph option failed to produce a graph file.

Release Notes for 2.0

  • Added support for Intel® microarchitecture codenamed Sandy Bridge. This replaces the Intel® AVX microarchitecture previously in Intel Architecture Code Analyzer.
  • Added support for Intel® microarchitecture codenamed Ivy Bridge.
  • Added support for MacOS*.
  • Improved the analyzer algorithm for throughput analysis (new analysis output; see the User Manual for details).
  • Improved the analyzer algorithm for latency analysis. The output now also includes microarchitecture events that affect the latency (new analysis output; see the User Manual for details).
  • Added support for graphic output of the dependency graph.

Release Notes for 1.1.3

  • Fixed a bug where using the -o option produced truncated output.
  • Fixed the IACA_UD_BYTES definition in iacaMarks.h to include {}.

Release Notes for 1.1.2

  • Intel Architecture Code Analyzer now supports adding START and END marks in code compiled with Microsoft Visual C++ Compiler* (64-bit). See iacaMarks.h.
  • Intel Architecture Code Analyzer now supports multiple block analysis. You can direct the tool to analyze the nth block that is delimited with analyzer marks. When used with n=0, all surrounded blocks in the file are analyzed and the output contains separate reports per block.

Release Notes for 1.1.1

  • Fixed incorrect identification of Intel AVX zero idiom instructions.
  • Fixed a crash in the analyzer on empty code blocks (blocks containing only zero idiom instructions or unsupported instructions).
  • Fixed the analyzer's arch nehalem option to treat AES and PCLMUL instructions as illegal. These are not supported on Intel® microarchitecture codename Nehalem.
  • Changed the analyzer marks to abort if the binary is executed. To deactivate the marks when building for execution, #define IACA_MARKS_OFF or use the -DIACA_MARKS_OFF option on the compiler command line. Binaries with active marks should be used for analysis only.

Release Notes for 1.1

  • Intel Architecture Code Analyzer is now hosted on Linux* operating systems, in addition to Windows* operating systems. Both IA-32 and Intel® 64 operating systems are supported.
  • Intel Architecture Code Analyzer now supports two existing Intel® microarchitectures, codenamed Nehalem and Westmere.
  • Two critical path types are detected:
    • DATA_DEPENDENCY critical path (similar to previous releases - reflects instruction data dependencies only)
    • PERFORMANCE critical path (new - reflects port conflicts and front-end pressure, as well)

Release Notes for 1.0.2

  • The analyzer now ignores the pop ebx / push ebx that the Intel Architecture Code Analyzer markers add to IA-32 code.
  • Fixed the misclassification of rcp / rsqrt as divider operations.

Release Notes for 1.0.1

  • Unsupported instructions are now handled gracefully: they are quietly ignored in the analysis of the block and do not affect the throughput and latency calculations.
  • A few previously unsupported instructions are now supported, e.g. the CMOV instruction family.
  • Added detection of Intel AVX to Intel® SSE code switches. The performance penalty associated with such a code switch is noted but not accounted for.

38 comments

Andreas A.'s picture

I found a number of differences between the port usage that IACA reports for several instructions (for the SKL microarchitecture) and the port usage observed when executing the same instructions on an actual Skylake system (Core i7-6500U):

  • "CMPPS XMM1, XMM2, 1" can use ports 0,1,5 in IACA, but only 0,1 on the actual hardware
  • same for CMPPD, CMPSD
  • "CVTPI2PS XMM0, MM0" only uses port 0 in IACA, but ports 0,1 on the actual hardware
  • "CVTPS2PI MM0, XMM0" can use ports 0,1,5 in IACA, but only 0,5 on the actual hardware
  • "CVTSI2SS XMM0, RAX" one of the 3 uops of this instruction can use ports 0,1,5 in IACA, but only ports 0,1 on the actual hardware
  • "CVTSS2SI EAX, XMM0" can use port 5 in IACA, but not on the actual hardware
  • "CVTSS2SI RAX, XMM0" has 2 uops in IACA, but 3 uops on the actual hardware
  • similar problems exist for the memory variants of these instructions and the CVTT* variants
  • "MAXPS XMM0, XMM1" can use port 5 in IACA, but not on the actual hardware
  • same for MAXSS, MAXPD, MAXSD, MINPS, MINSS, MINPD, MINSD, PMADDWD, PMULHW, PMULLW, PMULUDQ
  • "MOVSS XMM0, XMM1" can use ports 0,1,5 in IACA, but only port 5 on the actual hardware
  • "MOVDQ2Q MM0, XMM0" one of the uops can only use port 5 on the actual hardware, but port 0 and 5 in IACA
  • "MOVQ2DQ XMM0, MM0" can only use port 5 in IACA, but port 0,1,5 on the actual hardware
  • "PMADDWD XMM0, XMM1" can use port 5 in IACA, but not on the actual hardware
  • "BLENDVPD XMM1, XMM2" has 2 uops in IACA, but only 1 on the actual hardware
  • same for BLENDVPS, PBLENDVB
  • "DPPD XMM0, XMM1, 2" in IACA, all 3 uops can use port 5; on the actual hardware, 2 of the uops can only use ports 0,1
  • "DPPS XMM0, XMM1, 2" in IACA, all 4 uops can use port 5; on the actual hardware, 3 of the uops can only use ports 0,1
  • "PHMINPOSUW XMM0, XMM1" can use ports 0,1,5 in IACA, but only port 0 on the actual hardware
  • "PMULDQ XMM0, XMM1" can use ports 0,1,5 in IACA, but only ports 0,1 on the actual hardware
  • same for PMULLD, ROUNDPD, ROUNDPS, ROUNDSD, ROUNDSS, PMADDUBSW, PMULHRSW
  • "PABSB MM0, MM1" can use ports 0,5 in IACA, but only port 0 on the actual hardware
  • same for PABSD, PABSW
  • "PSHUFB MM0, MM1" has 1 uop in IACA, but 2 on the actual hardware
  • "PSIGNB MM0, MM1" can use ports 0,5 in IACA, but only port 0 on the actual hardware
  • same for PSIGND, PSIGNW
  • "MOVBE [R13], AX" can use ports 1,5 in IACA, but 0,6 on the actual hardware
  • "ADC AX, 256" (with 66 15 encoding) has 1 uop in IACA, but 2 uops on the actual hardware
  • "BSWAP EAX" has 2 uops in IACA, but only 1 on the actual hardware
  • "LAHF" can use ports 0,1,5,6 in IACA, but only ports 0,6 on the actual hardware
  • same for SAHF
  • "LEA AX, [R13]" has 1 uop in IACA, but 2 on the actual hardware
  • "ROL RAX, 2" has 2 uops in IACA, but only 1 on the actual hardware
  • same for ROR

 

F L.'s picture

Gideon,

1. Latency: I think I need to think about this some more.

2. Bypass Delay: I just did not see it mentioned anywhere, so I made the wrong assumption. I will let you know if I see otherwise.

Thanks much

Gideon S. (Intel)'s picture

Hi F.L., thank you for using IACA!

1. Could you possibly expand a bit on how you intend to use the latency? We removed it because IACA was created to help optimize the performance of tight inner loops and we found the latency of a single iteration not very useful.

2. IACA should account for vector bypasses as described in Table 2-3 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. If you see otherwise, please share your code.

Thanks, Gideon. 

  

 

F L.'s picture

Hi,

Thanks for a great tool! I have a couple of questions:

1. Since latency analysis was dropped, can it be estimated by looking at the throughput report?

2. It looks like the tool does not account for Bypass Delays. Is there a plan to add support for this?

Thanks

 

Yakir G. (Intel)'s picture

Hi Boming L,

I have compiled and successfully analyzed your code example with Visual Studio 2017. It is possible that the Visual Studio 2015 compiler somehow interferes with the markers. In any case, I recommend upgrading to Visual Studio 2017 if possible, or compiling with another compiler.

Thanks, Yakir.

Boming L.'s picture

Hi Yakir,

Thanks for the reply. The code I tried to analyze is a simple one, as follows:

#include <intrin.h>
#include "iacaMarks.h"

void simply_add(int* src, int* dst) {
    IACA_VC64_START
    __m128i a = _mm_stream_load_si128((__m128i*)src);
    __m128i res = _mm_add_epi32(a, a);
    _mm_stream_si128((__m128i*)dst, res);
    IACA_VC64_END
}

I passed the object file to iaca, but it ended up with the error message I mentioned.

Additionally, the latest IACA package for win64 does not seem to contain a header file, so the "iacaMarks.h" I use is actually taken from the Linux version.

Yakir G. (Intel)'s picture

Hi Boming L,

Could you please share the code that produced this error?

Thanks Yakir.

Gideon S. (Intel)'s picture

Hi Craig,

We are going to release a new version in a few weeks in which IACA emits the trace directly for all micro-archs without depending on Python. That should solve all these issues. Stay tuned!

Thanks, Gideon.

Craig R.'s picture

I think I've found some bugs in 2.3 particularly w.r.t. trace/pt.py.  The forum is "archived" and can't be posted to, so I'll post here in hopes that someone at Intel can fix things or explain what I'm doing wrong.  (Sorry for the formatting, this commenting mechanism is really not suited for this).  Anyway, here are the two issues I've run across:

1) the manual says Python 3.6.1 should work for pt.py but there seem to be some problems:

   a) Line 35 starts with spaces whereas the others start with tabs.  This may not be a python3-ism, but I had problems until I replaced the spaces with tabs.

   b) Lines 102, 106 and 147 use '/' for division.  Under python2 this is integer division; under python3 it is floating-point division (the latter causing problems).  If these are changed to '//' (integer division in both python2 and 3), things work.
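
The division difference described in b) can be reproduced in isolation. The sketch below is not taken from pt.py; the helper name and the values are hypothetical, and it only shows one way a float result can cause trouble under Python 3 (here, when it is used as a list index).

```python
def column_index(offset, width):
    # '//' is floor division: for int operands it yields an int
    # under both Python 2 and Python 3.
    return offset // width


cells = ["a", "b", "c", "d"]

true_div = 6 / 4    # a float under Python 3 (an int under Python 2)
floor_div = 6 // 4  # an int under both

print(cells[column_index(6, 4)])

try:
    cells[true_div]  # under Python 3, a float is not a valid list index
except TypeError as exc:
    print("float index rejected:", exc)
```

Using '//' wherever an integer result is required keeps a script behaving the same under both interpreter versions.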

2) I generated some nonsense code to see what the iaca tools can do:

#include <immintrin.h>
#include <iacaMarks.h>
#include <stdio.h>

typedef __m256 m256;
#define N 16
m256 fn(m256 *xin)
{
  m256 t = xin[0];
  t = _mm256_and_ps(t, xin[1]);
  t = _mm256_or_ps(t, xin[2]);
  t = _mm256_xor_ps(t, xin[3]);
  t = _mm256_add_ps(t, xin[4]);
IACA_START
  t = _mm256_and_ps(t, xin[5]);
  t = _mm256_or_ps(t, xin[6]);
  t = _mm256_xor_ps(t, xin[7]);
  t = _mm256_add_ps(t, xin[8]);
IACA_END
  t = _mm256_and_ps(t, xin[9]);
  t = _mm256_or_ps(t, xin[10]);
  t = _mm256_xor_ps(t, xin[11]);
  t = _mm256_add_ps(t, xin[12]);
  return t; 
}
int
main(int argc, char *argv[])
{
  m256 xin[N];
  int i;
  for (i=0; i<N; i++) xin[i] = _mm256_set1_ps(i);
  m256 xout = fn(xin);
  printf("%8X\n", *(unsigned int *)&xout);
}

Compile with "gcc -O3 -mavx avx.c -Iiaca-2.3/iaca-lin64/include". (gcc is 4.4.7 if that matters).

I then generate trace files using iaca for IVB, HSW, BDW, SKX using "iaca-2.3/iaca-lin64/bin/iaca -trace ivb -arch IVB a.out", etc...

Using a (fixed) version of pt.py, it looks like there are problems with all but the SKX version (I've narrowed the lines to try to make things presentable):

===== bdw.iacatrace
                                                         00000000001111111111222222222233333333333
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
0 |0 |    OP (1 uop)                                    :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |         
0 |1 |    LOAD (1 uop)                                  :s---deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |    OP (1 uop)                                    :A+++++++++|sdeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |          
0 |2 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |    OP (1 uop)                                    :A+++++++++|+sdeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |          
0 |3 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |    OP (1 uop)                                    :A+++++++++|++sdeeeee|eeeeeeeee|eeeeeeeeee
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
1 |0 |    LOAD (1 uop)                                  : s---cdeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
===== hsw.iacatrace
                                                         00000000001111111111222222222233333333335
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
0 |0 |    OP (1 uop)                                    :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |          
0 |1 |    LOAD (1 uop)                                  :s---deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |    OP (1 uop)                                    :A+++++++++|sdeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |          
0 |2 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |    OP (1 uop)                                    :A+++++++++|+sdeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |          
0 |3 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |    OP (1 uop)                                    :A+++++++++|++sdeeeee|eeeeeeeee|eeeeeeeeee
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
1 |0 |    LOAD (1 uop)                                  : s---cdeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
===== ivb.iacatrace
                                                         00000000001111111111222222222233333333335
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
0 |0 |    OP (1 uop)                                    :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |          
0 |1 |    LOAD (1 uop)                                  :s---deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |    OP (1 uop)                                    :A+++++++++|sdeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |          
0 |2 |    LOAD (1 uop)                                  :s+++ccdeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |    OP (1 uop)                                    :A+++++++++|+sdeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |         
0 |3 |    LOAD (1 uop)                                  :s---ccdeee|eeeeeeeee|eeeeeeeee|eeeeeeeee
0 |3 |    OP (1 uop)                                    :A+++++++++|++sdeeeee|eeeeeeeee|eeeeeeeee
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |         
1 |0 |    LOAD (1 uop)                                  : s---cccde|eeeeeeeee|eeeeeeeee|eeeeeeeee
===== skx.iacatrace
                                                         0000000000111111111122222222223333333333
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |         
0 |0 |    OP (1 uop)                                    :A+++++++++|dw    R  |         |         
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|w     R  |         |         
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |         
0 |1 |    LOAD (1 uop)                                  :s---deeeee|w     R  |         |         
0 |1 |    OP (1 uop)                                    :A+++++++++|+dw    R |         |         
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |         
0 |2 |    LOAD (1 uop)                                  :s---cdeeee|ew     R |         |         
0 |2 |    OP (1 uop)                                    :A+++++++++|++dw    R|         |         
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |         
0 |3 |    LOAD (1 uop)                                  : s---deeee|ew      R|         |         
0 |3 |    OP (1 uop)                                    : A++++++++|++++deeew|   R     |         
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |         
1 |0 |    LOAD (1 uop)                                  : s---cdeee|eew      |   R     |         

Looking at the actual iacatrace files, it looks like IVB/HSW/BDW do not generate any records to close off the execution part of an instruction (i.e. it's not a problem with pt.py not constructing the right output).
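
A rough way to flag such rows automatically is sketched below. It works on the rendered text shown above, not on the iacatrace file format, and the interpretation of the characters is an assumption read off the SKX output: a row counts as unterminated when its timeline ends in a run of 'e' and contains neither 'w' nor 'R'. The function names are hypothetical, and the two sample rows are copied from the output above with the column padding shortened.

```python
def timeline(row):
    """Return the part of a rendered trace row after the ':' separator."""
    return row.split(":", 1)[1]


def looks_unterminated(row):
    """Guess whether the execution part of a uop row is never closed off.

    Assumption: 'e' marks execution and 'w' / 'R' appear only once
    execution has finished, as in the SKX rows.
    """
    cells = timeline(row).replace("|", "").rstrip()
    return cells.endswith("e") and "w" not in cells and "R" not in cells


bdw_row = "0 |0 |    OP (1 uop)   :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee"
skx_row = "0 |0 |    OP (1 uop)   :A+++++++++|dw    R  |         |         "

print(looks_unterminated(bdw_row))
print(looks_unterminated(skx_row))
```

Instruction header rows have an empty timeline and are therefore never flagged.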

Unfortunately I do have a need to do things on HSW/BDW so I would like to get things working properly.  If someone can tell if this is a tools problem or a user problem (i.e. my error) please let me know.

Oren K. (Intel)'s picture

Hi Ciro,

No, we only support client & server CPUs.

Thanks, Oren.
