Intel® Architecture Code Analyzer


IACA 2.3 is out! The new version adds support for the Intel® microarchitecture code name Skylake (client and server).

Product Overview

Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.

Features and Benefits

For a given binary, Intel® Architecture Code Analyzer:

  • Performs static analysis of kernel throughput and latency under ideal front-end, out-of-order engine and memory hierarchy conditions.
  • Identifies the binding of the kernel instructions to the processor ports.
  • Identifies the kernel critical path.

The Intel® Architecture Code Analyzer enables you to make a first-order estimate of relative kernel performance on different microarchitectures. It does not provide absolute performance numbers.

Intel® Architecture Code Analyzer is a command-line tool with ASCII output. It handles one or more kernels that are marked for analysis within an executable, a shared library, or an object file.

Throughput Analysis

The Throughput Analysis treats the kernel as the body of an infinite loop. It computes the kernel throughput and highlights its bottlenecks.

The Throughput Analysis report contains the following information about the whole kernel:

  • Throughput of the analyzed kernel, counted in cycles.
  • The kernel bottleneck: front-end, port #, divider unit or inter-iteration dependency.
  • Total number of cycles each processor port was bound with micro-ops.
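
The two C functions below illustrate the inter-iteration dependency bottleneck listed above. This is a minimal sketch written for this page and is not taken from the IACA documentation. In sum_one_chain, every addition waits for the previous one. Its throughput would therefore be expected to be limited by the loop-carried chain of floating-point adds rather than by any port. sum_four_chains splits the sum across four independent accumulators so that the chains can overlap. Marking each loop body and comparing the reported bottlenecks is one way to check this. The expected reports are an assumption, not a measured result.

```c
#include <stddef.h>

/* One accumulator: each s += x[i] depends on the previous iteration's s,
 * so all iterations form a single chain bounded by FP-add latency. */
float sum_one_chain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: four shorter chains that the out-of-order
 * engine can overlap. The tail loop handles n not divisible by 4. */
float sum_four_chains(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the accumulator reorders floating-point additions, so the two functions can differ in the last bits for general inputs. For this reason, compilers generally do not make this change themselves unless flags such as -ffast-math are used.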

The Throughput Analysis also provides the following information per instruction:

  • Number of micro-ops in the instruction.
  • Average number of cycles per loop iteration that the instruction was bound to each processor port.
  • An indication whether the instruction is on the critical path of the analyzed kernel.
  • Instruction disassembly in Intel® Software Developer’s Manual (MASM) style.
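
The per-instruction and whole-kernel port numbers relate in a simple way. In a simplified model (an assumption, not IACA's full algorithm, which also accounts for the front end, the divider and dependency chains), summing each instruction's average per-port cycles should give the per-port totals. The largest total is then a lower bound on cycles per iteration. In the sketch below, InstrBinding, port_bound and NUM_PORTS are hypothetical names. The value of eight ports matches Haswell-class cores but not every microarchitecture.

```c
#include <stddef.h>

#define NUM_PORTS 8   /* assumption: ports 0-7, as on Haswell-class cores */

/* One row of a hypothetical per-instruction report: the average number of
 * cycles per iteration the instruction is bound to each port. */
typedef struct {
    const char *disasm;
    double port_cycles[NUM_PORTS];
} InstrBinding;

/* Sums the rows into per-port totals and returns the largest total, a
 * port-pressure lower bound on cycles per iteration. The index of that
 * port (the lowest one on a tie) is written to *bottleneck_port. */
double port_bound(const InstrBinding *instrs, size_t n, int *bottleneck_port)
{
    double total[NUM_PORTS] = {0};
    int best = 0;

    for (size_t i = 0; i < n; i++)
        for (int p = 0; p < NUM_PORTS; p++)
            total[p] += instrs[i].port_cycles[p];

    for (int p = 1; p < NUM_PORTS; p++)
        if (total[p] > total[best])
            best = p;

    *bottleneck_port = best;
    return total[best];
}
```

For example, suppose the rows put 1.5 cycles in total on port 0 and 2.0 on port 1. The bound is then 2.0 cycles per iteration, with port 1 as the bottleneck. The bound holds only if micro-ops are spread across ports as the rows assume.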

Technical Requirements

Intel® Architecture Code Analyzer is a command-line utility that analyzes a kernel delimited by special markers within a binary file. The tool can analyze Intel® 64 code, including Intel® AVX, AVX2 and AVX-512 instructions.
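
Below is a minimal sketch of marking a loop kernel in C. It assumes the iacaMarks.h header shipped with the tool and its IACA_START/IACA_END macros for GCC-style compilers. A common convention, assumed here, is to place IACA_START at the top of the loop body and IACA_END right after the loop, so that the analyzed block is one iteration including the loop branch. USE_IACA_MARKS is a hypothetical switch for this example. The 1.1.1 release notes describe IACA_MARKS_OFF as the header's own way to deactivate the marks. A binary built with active marks is for analysis only and should not be executed.

```c
#include <stddef.h>

/* USE_IACA_MARKS is a hypothetical switch for this sketch. Define it only
 * when building the binary to be analyzed; iacaMarks.h must then be on the
 * include path. Otherwise the marks expand to nothing and the code can run. */
#ifdef USE_IACA_MARKS
#include "iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif

/* y[i] += a * x[i]: the loop body is the kernel handed to the analyzer. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        IACA_START              /* start mark: top of the loop body */
        y[i] += a * x[i];
    }
    IACA_END                    /* end mark: immediately after the loop */
}
```

To analyze the kernel, build it with the marks active, for example gcc -O2 -c -DUSE_IACA_MARKS -I<iaca>/include kernel.c (the include path is a placeholder). Then pass the object file to iaca with the -arch option for the target, e.g. iaca -arch HSW kernel.o.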

Intel® Architecture Code Analyzer is available on Windows*, Linux*, and Mac OS X* operating systems. Only Intel® 64 operating systems are supported.

Release Notes for 2.3

  • Added support for Intel® microarchitecture code name Skylake (client and server).
  • Added support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
  • Added support for tracing the execution (see user guide).
  • Dropped the -no_interiteration flag.

Release Notes for 2.2

  • Added support for Intel® microarchitecture code name Broadwell.
  • Better support for Intel® Advanced Vector Extensions (Intel® AVX) Gather operations.
  • Replaced the "InterIteration" throughput bottleneck indication with a more general "long dependency chains" indication.
  • Added an indication when front end bubbles occur (see user guide).
  • Numerous improvements in modelling supported processors.
  • Unsupported instructions are now marked with 'X' instead of '!' for better readability.
  • The Nehalem (NHM) and Westmere (WSM) microarchitectures are no longer actively supported.
  • Removed support for running IACA on 32-bit operating systems and for analyzing 32-bit programs.
  • Dropped latency analysis support.
  • Windows* OS support will follow in the next release (now available).

Release Notes for 2.1

  • Added support for Intel® microarchitecture codenamed Haswell.
  • Added support for MSVS64 compiler.
  • Added 64-bit binaries.

Release Notes for 2.0.1

  • Fixed a bug where the -graph option failed to produce a graph file.

Release Notes for 2.0

  • Added support for Intel® microarchitecture codenamed Sandy Bridge. This replaces the Intel® AVX microarchitecture previously supported in Intel® Architecture Code Analyzer.
  • Added support for Intel® microarchitecture codenamed Ivy Bridge.
  • Added support for Mac OS X.
  • Improved analyzer algorithm for throughput analysis (new analysis output, see more details in the User Manual).
  • Improved analyzer algorithm for latency analysis, output also includes microarchitecture events that will affect the latency. (new analysis output, see more details in the User Manual)
  • Added support for graphic output of the dependency graph

Release Notes for 1.1.3

  • Fixed a bug where using -o option produced truncated output
  • Fixed IACA_UD_BYTES definition in iacaMarks.h to include {}.

Release Notes for 1.1.2

  • Intel® Architecture Code Analyzer now supports adding START and END marks in code compiled with the Visual C++ compiler (64-bit). See iacaMarks.h.
  • Intel® Architecture Code Analyzer now supports multiple block analysis. You can direct the tool to analyze the n'th block that is delimited with analyzer marks. When used with n=0, all surrounded blocks in the file are analyzed and the output contains separate reports per block.

Release Notes for 1.1.1

  • Fixed wrong identification of Intel® AVX zero-idiom instructions.
  • Fixed analyzer crashes on empty code blocks (containing only zero-idiom instructions or unsupported instructions).
  • Fixed the analyzer's arch nehalem option to treat AES and PCLMUL instructions as illegal, since these are not supported on Intel® microarchitecture codename Nehalem.
  • Changed the analyzer marks to abort if the binary is executed. To deactivate the marks when building for execution, #define IACA_MARKS_OFF or pass -DIACA_MARKS_OFF on the compiler command line. Binaries with active marks should be used for analysis only.

Release Notes for 1.1

  • Intel® Architecture Code Analyzer is now hosted on Linux* operating systems, in addition to Windows* operating systems. Both IA-32 and Intel® 64 operating systems are supported.
  • Intel® Architecture Code Analyzer now supports two existing Intel® microarchitectures, codenamed Nehalem and Westmere.
  • Two critical path types are detected:
    • DATA_DEPENDENCY critical path (similar to previous releases - reflects instruction data dependencies only)
    • PERFORMANCE critical path (new - reflects port conflicts and front-end pressure, as well)
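
The DATA_DEPENDENCY critical path can be pictured as the longest latency-weighted path through the kernel's data-dependency graph. The sketch below computes such a path for a small, straight-line kernel. The Instr struct, the latencies and the four-source limit are hypothetical. The sketch models only data dependencies, not the port conflicts and front-end pressure that the PERFORMANCE path adds, and it is not IACA's actual algorithm.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSTRS 64   /* arbitrary size limit for this sketch */

/* Hypothetical instruction model: a latency in cycles and up to four
 * source instructions it depends on (indices of earlier instructions). */
typedef struct {
    int lat;
    int ndeps;
    int deps[4];
} Instr;

/* Returns the length of the longest latency-weighted dependency path and sets
 * on_path[i] to 1 for the instructions on one such path, 0 otherwise.
 * Program order is assumed to be a valid topological order. */
int critical_path(const Instr *in, size_t n, int on_path[])
{
    int finish[MAX_INSTRS]; /* cycle at which each result becomes available */
    int pred[MAX_INSTRS];   /* dependency that determined the start cycle */
    size_t last = 0;        /* instruction with the latest finish time */

    assert(n <= MAX_INSTRS);
    if (n == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        int start = 0;
        pred[i] = -1;
        assert(in[i].ndeps >= 0 && in[i].ndeps <= 4);
        for (int k = 0; k < in[i].ndeps; k++) {
            int d = in[i].deps[k];
            assert(d >= 0 && (size_t)d < i); /* only earlier instructions */
            if (finish[d] > start) {
                start = finish[d];
                pred[i] = d;
            }
        }
        finish[i] = start + in[i].lat;
        on_path[i] = 0;
        if (finish[i] > finish[last])
            last = i;
    }

    /* Walk back from the latest-finishing instruction along pred links. */
    for (int i = (int)last; i >= 0; i = pred[i])
        on_path[i] = 1;
    return finish[last];
}
```

As a hypothetical example, take a load (latency 5) that feeds an add (latency 3), whose result is combined with an independent multiply (latency 4) by a final add (latency 3). The path load, add, add has length 11, and the multiply is off the path.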

Release Notes for 1.0.2

  • The pop ebx / push ebx instructions that the Intel® Architecture Code Analyzer markers add to IA-32 code are now ignored.
  • Fixed misclassifying rcp / rsqrt as divider operations

Release Notes for 1.0.1

  • Graceful handling of unsupported instructions: they are quietly ignored in the analyzed block and do not affect the throughput and latency calculations.
  • A few previously unsupported instructions are now supported, e.g. the CMOV instruction family.
  • Intel® AVX to Intel® SSE code switch detection. The performance penalty associated with such code switch is noted but not accounted for.
For more complete information about compiler optimizations, see our Optimization Notice.

34 comments

Yakir G. (Intel):

Hi Boming L,

I compiled and successfully analyzed your code example with Visual Studio 2017.

It is possible that the Visual Studio 2015 compiler somehow interferes with the markers.

In any case, I recommend upgrading to Visual Studio 2017 if possible, or compiling with another compiler.

Thanks, Yakir.

Boming L.:

Hi Yakir,

Thanks for the reply. The code I tried to analyze was a simple one:

#include <intrin.h>
#include "iacaMarks.h"

void simply_add(int* src, int* dst) {
    IACA_VC64_START
    __m128i a = _mm_stream_load_si128((__m128i*)src);
    __m128i res = _mm_add_epi32(a, a);
    _mm_stream_si128((__m128i*)dst, res);
    IACA_VC64_END
}

I passed the object file to iaca, but it ended up with the error message I mentioned.

Additionally, the latest IACA for Win64 does not seem to include a header file, so the "iacaMarks.h" I use is actually ported from the Linux version.

Yakir G. (Intel):

Hi Boming L,

Could you please share the code that produced this error?

Thanks, Yakir.

Gideon S. (Intel):

Hi Craig,

We are going to release a new version in a few weeks in which IACA emits the trace directly for all micro-archs without depending on Python. That should solve all these issues. Stay tuned!

Thanks, Gideon.

Craig R.:

I think I've found some bugs in 2.3, particularly w.r.t. trace/pt.py. The forum is "archived" and can't be posted to, so I'll post here in the hope that someone at Intel can fix things or explain what I'm doing wrong. (Sorry for the formatting; this commenting mechanism is really not suited for this.) Anyway, here are the two issues I've run across:

1) The manual says Python 3.6.1 should work for pt.py, but there seem to be some problems:

   a) Line 35 starts with spaces whereas the others start with tabs. This may not be a Python 3-ism, but I had problems until I replaced them with tabs.

   b) Lines 102, 106 and 147 use '/' for division. With integer operands, this is integer division under Python 2 but floating-point division under Python 3 (the latter causes problems). If these are changed to '//' (integer division in both Python 2 and 3), things work.

2) I generated some nonsense code to see what the iaca tools can do:

#include <immintrin.h>
#include <iacaMarks.h>
#include <stdio.h>

typedef __m256 m256;
#define N 16
m256 fn(m256 *xin)
{
  m256 t = xin[0];
  t = _mm256_and_ps(t, xin[1]);
  t = _mm256_or_ps(t, xin[2]);
  t = _mm256_xor_ps(t, xin[3]);
  t = _mm256_add_ps(t, xin[4]);
IACA_START
  t = _mm256_and_ps(t, xin[5]);
  t = _mm256_or_ps(t, xin[6]);
  t = _mm256_xor_ps(t, xin[7]);
  t = _mm256_add_ps(t, xin[8]);
IACA_END
  t = _mm256_and_ps(t, xin[9]);
  t = _mm256_or_ps(t, xin[10]);
  t = _mm256_xor_ps(t, xin[11]);
  t = _mm256_add_ps(t, xin[12]);
  return t; 
}
int
main(int argc, char *argv[])
{
  m256 xin[N];
  int i;
  for (i=0; i<N; i++) xin[i] = _mm256_set1_ps(i);
  m256 xout = fn(xin);
  printf("%8X\n", *(unsigned int *)&xout);
  return 0;
}

Compile with "gcc -O3 -mavx avx.c -Iiaca-2.3/iaca-lin64/include". (gcc is 4.4.7 if that matters).

I then generate trace files using iaca for IVB, HSW, BDW, SKX using "iaca-2.3/iaca-lin64/bin/iaca -trace ivb -arch IVB a.out", etc...

Using a fixed version of pt.py, it looks like there are problems with all but the SKX version (I've narrowed the lines to try to make things presentable):

===== bdw.iacatrace
                                                         00000000001111111111222222222233333333333
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
0 |0 |    OP (1 uop)                                    :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |         
0 |1 |    LOAD (1 uop)                                  :s---deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |    OP (1 uop)                                    :A+++++++++|sdeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |          
0 |2 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |    OP (1 uop)                                    :A+++++++++|+sdeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |          
0 |3 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |    OP (1 uop)                                    :A+++++++++|++sdeeeee|eeeeeeeee|eeeeeeeeee
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
1 |0 |    LOAD (1 uop)                                  : s---cdeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
===== hsw.iacatrace
                                                         00000000001111111111222222222233333333335
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
0 |0 |    OP (1 uop)                                    :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |          
0 |1 |    LOAD (1 uop)                                  :s---deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |    OP (1 uop)                                    :A+++++++++|sdeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |          
0 |2 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |    OP (1 uop)                                    :A+++++++++|+sdeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |          
0 |3 |    LOAD (1 uop)                                  :s---cdeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |    OP (1 uop)                                    :A+++++++++|++sdeeeee|eeeeeeeee|eeeeeeeeee
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
1 |0 |    LOAD (1 uop)                                  : s---cdeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
===== ivb.iacatrace
                                                         00000000001111111111222222222233333333335
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |          
0 |0 |    OP (1 uop)                                    :A+++++++++sdeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |          
0 |1 |    LOAD (1 uop)                                  :s---deeeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |1 |    OP (1 uop)                                    :A+++++++++|sdeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |          
0 |2 |    LOAD (1 uop)                                  :s+++ccdeee|eeeeeeeee|eeeeeeeee|eeeeeeeeee
0 |2 |    OP (1 uop)                                    :A+++++++++|+sdeeeeee|eeeeeeeee|eeeeeeeeee
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |         
0 |3 |    LOAD (1 uop)                                  :s---ccdeee|eeeeeeeee|eeeeeeeee|eeeeeeeee
0 |3 |    OP (1 uop)                                    :A+++++++++|++sdeeeee|eeeeeeeee|eeeeeeeee
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |         
1 |0 |    LOAD (1 uop)                                  : s---cccde|eeeeeeeee|eeeeeeeee|eeeeeeeee
===== skx.iacatrace
                                                         0000000000111111111122222222223333333333
0 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |         
0 |0 |    OP (1 uop)                                    :A+++++++++|dw    R  |         |         
0 |0 |    LOAD (1 uop)                                  :s+++deeeee|w     R  |         |         
0 |1 |vorps ymm0, ymm0, ymmword ptr [rdi+0xc0]          :          |         |         |         
0 |1 |    LOAD (1 uop)                                  :s---deeeee|w     R  |         |         
0 |1 |    OP (1 uop)                                    :A+++++++++|+dw    R |         |         
0 |2 |vxorps ymm0, ymm0, ymmword ptr [rdi+0xe0]         :          |         |         |         
0 |2 |    LOAD (1 uop)                                  :s---cdeeee|ew     R |         |         
0 |2 |    OP (1 uop)                                    :A+++++++++|++dw    R|         |         
0 |3 |vaddps ymm0, ymm0, ymmword ptr [rdi+0x100]        :          |         |         |         
0 |3 |    LOAD (1 uop)                                  : s---deeee|ew      R|         |         
0 |3 |    OP (1 uop)                                    : A++++++++|++++deeew|   R     |         
1 |0 |vandps ymm0, ymm0, ymmword ptr [rdi+0xa0]         :          |         |         |         
1 |0 |    LOAD (1 uop)                                  : s---cdeee|eew      |   R     |         

Looking at the actual iacatrace files, it looks like IVB/HSW/BDW do not generate any records to close off the execution part of an instruction (i.e. it's not a problem with pt.py not constructing the right output).

Unfortunately I do have a need to do things on HSW/BDW so I would like to get things working properly.  If someone can tell if this is a tools problem or a user problem (i.e. my error) please let me know.

Oren K. (Intel):

Hi Ciro,

No, we only support client & server CPUs.

Thanks, Oren.

Ciro Viscovo:

Excuse me, does a tool exist that supports the Silvermont microarchitecture?

Thanks

Ciro Viscovo (Hitachi Group)

Boming L.:

I'm still getting the error "COULD NOT FIND START_MARKER NUMBER 1" even with IACA 2.3 when analyzing code compiled with Visual Studio 2015 Update 3. Is the problem really fixed in the latest version?

Gideon S. (Intel):

IACA Version 2.1 does not work well with VS 2015. This will be fixed with the support for Windows OS in Version 2.2, expected at the end of Q1 or early Q2 ’17.

Thanks, /g

Alexander L.:

Unfortunately, this tool (v. 2.1 x64) does not work for me either with a 64-bit DLL compiled with Visual Studio 2015 Update 3 :(

COULD NOT FIND START_MARKER NUMBER 1

Could you, please, fix this?

Is it possible that the reason is that the assembly contains managed code as a bridge between .NET and native methods?
