Intel® Architecture Code Analyzer


Product Overview

Intel® Architecture Code Analyzer helps you statically analyze the data dependency, throughput and latency of code snippets on Intel® microarchitectures. The term kernel is used throughout the rest of this document instead of code snippet.

Features and Benefits

For a given binary, Intel® Architecture Code Analyzer:

  • Performs static analysis of kernel throughput and latency under ideal front-end, out-of-order engine and memory hierarchy conditions.
  • Identifies the binding of the kernel instructions to the processor ports.
  • Identifies kernel critical path.

The Intel® Architecture Code Analyzer enables you to make a first-order estimate of relative kernel performance on different microarchitectures. It does not provide absolute performance numbers.

Intel® Architecture Code Analyzer is a command-line tool with ASCII output. It handles one or more kernels that are marked for analysis within an executable, a shared library, or an object file.

Throughput Analysis

The Throughput Analysis treats the kernel as the body of an infinite loop. It computes the kernel throughput and highlights its bottlenecks.

The Throughput Analysis report contains the following whole kernel information:

  • Throughput of the analyzed kernel, counted in cycles.
  • The kernel bottleneck: front-end, port #, divider unit or inter-iteration dependency.
  • Total number of cycles each processor port was bound with micro-ops.

The Throughput Analysis also provides the following information per instruction:

  • Number of instruction micro-ops.
  • Average number of cycles the instruction was bound to each processor port, per loop iteration.
  • An indication whether the instruction is on the critical path of the analyzed kernel.
  • Instruction disassembly in Intel® Software Developer’s Manual (MASM) style.
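
A much simplified way to picture the port-binding part of this report: sum the cycles each instruction is bound to each port, then take the most heavily loaded port as a lower bound on cycles per iteration. The sketch below does only that. It is not the analyzer's algorithm; it ignores the front-end, the divider unit and inter-iteration dependencies, and the names PortBound, port_pressure_bound and N_PORTS (eight ports, as on Haswell) are assumptions made for illustration.

```c
#include <stddef.h>

#define N_PORTS 8   /* assumption: ports 0..7, as on Haswell */

typedef struct {
    size_t port;    /* index of the most heavily loaded port */
    double cycles;  /* cycles per iteration bound to that port */
} PortBound;

/* bind[i][p] = average cycles instruction i is bound to port p per iteration
   (hypothetical input; in practice read from a throughput report). */
PortBound port_pressure_bound(const double bind[][N_PORTS], size_t n_instr)
{
    double total[N_PORTS] = {0};

    /* Accumulate the per-port totals over all instructions. */
    for (size_t i = 0; i < n_instr; ++i)
        for (size_t p = 0; p < N_PORTS; ++p)
            total[p] += bind[i][p];

    /* The busiest port bounds the throughput from below. */
    PortBound best = {0, total[0]};
    for (size_t p = 1; p < N_PORTS; ++p) {
        if (total[p] > best.cycles) {
            best.port = p;
            best.cycles = total[p];
        }
    }
    return best;
}
```

Calling port_pressure_bound on a small table of bindings returns the port that would be reported as the bottleneck in this port-only model, together with its cycle count.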

Latency Analysis

The Latency Analysis treats the kernel as a sequence of instructions that is executed once. It computes the latency of the kernel execution from its first to its last instruction and identifies all resource conflicts within the kernel.

The Latency Analysis report contains the following information:

  • Latency of the analyzed kernel, counted in cycles.
  • Instructions that were delayed due to resource conflicts.
  • The instructions on the critical path (CP) resulting from data dependencies and resource conflicts.
  • Total resource conflict delay for each execution unit.
  • Dependencies between instructions due to resource conflicts.
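
To make the idea of a critical path concrete, the sketch below computes the longest chain of data dependencies through a kernel that executes once. It is a simplified model under stated assumptions, not the analyzer's method: it considers data dependencies only (no resource conflicts), the latencies passed in are hypothetical values, and the names Instr, MAX_DEPS and critical_path_latency are made up for this example.

```c
#include <stddef.h>

#define MAX_DEPS 3  /* assumption: at most three source operands */

typedef struct {
    int latency;         /* cycles from start of execution to result */
    int deps[MAX_DEPS];  /* indices of earlier instructions this one reads */
    int n_deps;          /* number of valid entries in deps */
} Instr;

/* Instructions must be in program order, so every dependency index is
   smaller than the index of the instruction that uses it.
   finish[i] receives the cycle in which instruction i produces its result.
   Returns the latency of the whole kernel in this model. */
int critical_path_latency(const Instr *k, size_t n, int *finish)
{
    int longest = 0;
    for (size_t i = 0; i < n; ++i) {
        /* An instruction can start once all of its inputs are ready. */
        int ready = 0;
        for (int d = 0; d < k[i].n_deps; ++d) {
            int f = finish[k[i].deps[d]];
            if (f > ready)
                ready = f;
        }
        finish[i] = ready + k[i].latency;
        if (finish[i] > longest)
            longest = finish[i];
    }
    return longest;
}
```

For a three-instruction chain with latencies 4, 1 and 3, where each instruction depends on the previous one, the function returns 8; the instructions whose finish times form that chain are the critical path.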

Technical Requirements

Intel® Architecture Code Analyzer is a command-line utility that analyzes a kernel contained in a binary file and delimited with special markers. The tool is capable of analyzing both IA-32 and Intel® 64 code, including Intel® AVX and AVX2 instructions.

Intel® Architecture Code Analyzer is available on Windows*, Linux*, and Mac OS X* operating systems. Both IA-32 and Intel® 64 operating systems are supported. Intel® 64 code can be analyzed on IA-32 operating systems and vice versa.
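
As a rough illustration of the markers mentioned above, the sketch below brackets a loop with the start and end marks from iacaMarks.h. The macro names IACA_START and IACA_END and their placement (start mark at the top of the loop body, end mark after the loop) are assumptions based on the header shipped with the tool, so check your copy. scale_add is a hypothetical kernel, and the fallback definitions exist only so the sketch builds where the header is not installed.

```c
#include <stddef.h>

/* Pull in the real marks when the header is available. */
#if defined(__has_include)
#  if __has_include("iacaMarks.h")
#    include "iacaMarks.h"
#  endif
#endif

/* Fallback (assumption): without the header, let the marks expand to
   nothing, which mirrors a build with IACA_MARKS_OFF defined. */
#ifndef IACA_START
#  define IACA_START
#  define IACA_END
#endif

/* Hypothetical kernel: dst[i] = a * src[i] + dst[i]. */
void scale_add(float *dst, const float *src, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        IACA_START  /* start of the analyzed block (top of the loop body) */
        dst[i] = a * src[i] + dst[i];
    }
    IACA_END        /* end of the analyzed block (after the loop) */
}
```

Per the 1.1.1 release notes, a binary with active marks aborts when executed, so build with -DIACA_MARKS_OFF when the binary is meant to run and without it when the binary is meant for analysis only.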

Release Notes for 2.1

  • Added support for Intel® microarchitecture codenamed Haswell.
  • Added support for MSVS64 compiler.
  • Added 64-bit binaries.

Release Notes for 2.0.1

  • Fixed a bug where the -graph option failed to produce a graph file.

Release Notes for 2.0

  • Added support for Intel® microarchitecture codenamed Sandy Bridge. This replaces the Intel® AVX microarchitecture previously in Intel® Architecture Code Analyzer.
  • Added support for Intel® microarchitecture codenamed Ivy Bridge.
  • Added support for Mac OS X.
  • Improved analyzer algorithm for throughput analysis (new analysis output, see more details in the User Manual).
  • Improved analyzer algorithm for latency analysis; the output also includes microarchitecture events that will affect the latency (new analysis output, see more details in the User Manual).
  • Added support for graphic output of the dependency graph.

Release Notes for 1.1.3

  • Fixed a bug where using the -o option produced truncated output.
  • Fixed IACA_UD_BYTES definition in iacaMarks.h to include {}.

Release Notes for 1.1.2

  • Intel® Architecture Code Analyzer now supports adding START and END marks in code compiled with Visual C++ compiler (64-bit). See iacaMarks.h
  • Intel® Architecture Code Analyzer now supports multiple block analysis. You can direct the tool to analyze the n'th block that is delimited with analyzer marks. When used with n=0, all surrounded blocks in the file are analyzed and the output contains separate reports per block.

Release Notes for 1.1.1

  • Fixed incorrect identification of Intel® AVX zero idiom instructions.
  • Fixed a crash when analyzing empty code blocks (containing only zero idiom instructions / unsupported instructions).
  • Fixed the analyzer's arch nehalem option to treat AES and PCLMUL instructions as illegal. These aren't supported on Intel® microarchitecture codenamed Nehalem.
  • Changed analyzer marks to abort if the binary is executed. To deactivate the marks when building for execution #define IACA_MARKS_OFF or use -DIACA_MARKS_OFF option in the compiler command line. Binaries with active marks should be used for analysis only.

Release Notes for 1.1

  • Intel® Architecture Code Analyzer is now hosted on Linux* operating systems, in addition to Windows* operating systems. Both IA-32 and Intel® 64 operating systems are supported.
  • Intel® Architecture Code Analyzer now supports two existing Intel® microarchitectures, codenamed Nehalem and Westmere.
  • Two critical path types are detected:
    • DATA_DEPENDENCY critical path (similar to previous releases - reflects instruction data dependencies only)
    • PERFORMANCE critical path (new - reflects port conflicts and front-end pressure, as well)

Release Notes for 1.0.2

  • The analyzer now ignores the pop ebx / push ebx that Intel® Architecture Code Analyzer markers add to IA-32 code.
  • Fixed misclassification of rcp / rsqrt as divider operations.

Release Notes for 1.0.1

  • Graceful handling of unsupported instructions: they are quietly ignored in the analysis of the block and do not impact the throughput and latency calculations.
  • A few previously unsupported instructions are now supported, e.g. the CMOV instruction family.
  • Intel® AVX to Intel® SSE code switch detection. The performance penalty associated with such code switch is noted but not accounted for.

Comments

Is there a gcc version of iacaMarks.h so that I can analyze gcc-compiled binaries?
I have an AVX-ready gcc, but no AVX-ready icc because I never got a response to my icc beta request.


I got a gcc version of iacaMarks.h from someone at Intel, and code compiled by gcc with that header is analyzed fine by iaca.exe. Thanks a lot.




>>...Can the software author make the tool opensource?

It is hard to believe that Intel will do that, that is, change the status of the 'Architecture Code Analyzer' to an Open Source project.


Hi, thanks for providing this great tool! I've got two remarks about version 2.1.

1. I noticed the 64 bit linux binary isn't working for me (tested on various systems: Xeon E5420 running Linux 3.0.82-0.7-default; Xeon E5-2660 v2 running 2.6.32-431.3.1.el6.x86_64; and Xeon E3-1240 v3 running 3.0.82-0.7-default). I get "Initialization failed" on all machines. The 32 bit binary is running fine.

2. I was trying to measure L1 throughput on Haswell (Xeon E3-1240 v3) with the STREAM sum (a[i]=b[i]+c[i]) benchmark. I was expecting to get 96B/c, corresponding to 326.4GB/s at a (fixed) frequency of 3.4GHz; however, I noticed that I couldn't get much more than 200GB/s. After some thinking I decided the problem was probably caused by additional cycles required for loop management (inc, cmp, jl), because the two loads, the add, and the store already hit the 4uop/c limit. However, loop unrolling didn't improve the bandwidth. I tried IACA, and to my surprise, it revealed that the new store AGU isn't used. Instead, the stores are split up between port 2 and 3 and thus these ports become the bottleneck.

|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovupd ymmword ptr [rdi+r9*8], ymm1
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovupd ymmword ptr [rdi+r9*8+0x20], ymm3
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovupd ymmword ptr [rdi+r9*8+0x40], ymm5
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovupd ymmword ptr [rdi+r9*8+0x60], ymm7

Is this a bug in IACA or is the new store AGU only used in special cases?

Thanks!


Hello Johannes,

The Linux version can’t identify whether a binary is 32-bit or 64-bit; you need to add -64 to the command line for the analyzer to work with a 64-bit binary (see the user manual for more details).

The Port 7 AGU can only work on stores with a simple memory address (no index register). This is why the above analysis doesn't show Port 7 utilization.

 

Thanks,

Tal


Hello Tal,

I wasn't aware of that. Replacing the code with an explicit (fast) lea to produce the base register on Port 1 or 5 and using that in the store improved the speed significantly. I wonder why the compiler didn't think of that when using -xHost. I even get ~94B/c in a simple load/store benchmark now. Thanks for sharing!