Intel® Developer Zone:
Intel Instruction Set Architecture Extensions

Intel’s Instruction Set Architecture (ISA) continues to evolve to improve functionality, performance, and the user experience. Featured below are extensions to the ISA that are new in current products as well as those planned for future generations of processors. By publishing these extensions early, Intel helps ensure that the software ecosystem has time to innovate and to bring enhanced and new products to market when the processors are launched.


Tools & Downloads

  • Intel® C++ Compiler

    The Intel® C++ Compiler is available for download from the Intel® Registration Center for all licensed customers. Evaluation versions of Intel® Software Development Products are also available for free download.

  • Intel Intrinsics Guide

    The Intel Intrinsics Guide is an interactive reference tool for Intel intrinsic instructions, which are C-style functions that provide access to many Intel instructions – including Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), and more – without the need to write assembly code.

Intel® Advanced Vector Extensions (Intel® AVX)

The need for greater computing performance continues to grow across industry segments. To support rising demand and evolving usage models, we continue our history of innovation with the Intel® Advanced Vector Extensions (Intel® AVX) in products today.

Intel® AVX is a 256-bit instruction set extension to Intel® SSE designed for floating-point (FP) intensive applications. It was released in early 2011 as part of the Intel® microarchitecture code name Sandy Bridge processor family and is present in platforms ranging from notebooks to servers. Intel AVX improves performance through wider vectors, a new extensible syntax, and rich functionality, benefiting data management and general-purpose applications such as image, audio, and video processing, scientific simulation, financial analytics, and 3D modeling and analysis.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

In the future, some new products will feature a significant leap to 512-bit SIMD support. Programs can pack eight double-precision or sixteen single-precision floating-point numbers within the 512-bit vectors, as well as eight 64-bit or sixteen 32-bit integers. This enables processing of twice the number of data elements that Intel AVX/AVX2 can process with a single instruction, and four times that of Intel SSE.

Intel AVX-512 instructions are important because they open up higher performance capabilities for the most demanding computational tasks. They also offer a high degree of compiler support thanks to the richness of their instruction design.

Intel AVX-512 features include 32 vector registers, each 512 bits wide, and eight dedicated mask registers. Intel AVX-512 is a flexible instruction set that includes support for broadcast, embedded masking to enable predication, embedded floating-point rounding control, embedded floating-point fault suppression, scatter instructions, high-speed math instructions, and compact representation of large displacement values.

Intel AVX-512 offers a level of compatibility with Intel AVX that is stronger than prior transitions to new widths for SIMD operations. Unlike Intel SSE and Intel AVX, which cannot be mixed without performance penalties, Intel AVX and Intel AVX-512 instructions can be mixed without penalty. Intel AVX registers YMM0–YMM15 map into the Intel AVX-512 registers ZMM0–ZMM15 (in x86-64 mode), very much like Intel SSE registers map into Intel AVX registers. Therefore, in processors with Intel AVX-512 support, Intel AVX and Intel AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers.

More details about Intel AVX-512 instructions can be found in the blog "AVX-512 Instructions". The instructions are documented in the Intel® Architecture Instruction Set Extensions Programming Reference (see the "Overview" tab on this page).

Intel® Integrated Performance Primitives (Intel® IPP) Functions Optimized for Intel® Advanced Vector Extensions (Intel® AVX)
By Naveen Gv (Intel), Posted 08/02/2012
This page offers the list of Intel IPP functions specially optimized for Intel AVX.
Intel® IPP 7.0 Release Notes
By Ying Song (Intel), Posted 08/02/2012
Summary of new features and changes in Intel IPP 7.0
Intel® AVX State Transitions: Migrating SSE Code to AVX
By Chris Kirkpatrick (Intel), Posted 08/02/2012
Introduction Intel® Advanced Vector Extensions (Intel® AVX) are the latest instruction set addition to the IA-32 and Intel® 64 architectures. They provide enhanced 256-bit SIMD operations on 8-wide floating-point vectors for 2nd Generation Intel® Core™ processors code named Sandy Bridge and later processor...
Intel® AVX Realization Of IIR Filter For Complex Float Data
By Igor Astakhov (Intel), Posted 08/02/2012
Download PDF Download Intel® AVX Realization Of IIR Filter For Complex Float Data [PDF 128KB] Introduction This paper describes complex Infinite Impulse Response (IIR) filter implementation with Intel® AVX Single Instruction Multiple Data (SIMD) instruction set. The main difficulty of such realiz...


Jeff's Notebook: 3D Vector Normalization Using 256-Bit Intel® Advanced Vector Extensions (Intel® AVX)
By Jeff Kataoka (Intel), Posted 02/11/2011
With the launch of our new 2nd Generation Intel® Core™ processors, there has been a lot of interest in the Intel® Advanced Vector Extensions (Intel® AVX). I decided to investigate more on how application developers targeting the 2nd Generation Intel Core processor for their application migh...
Visual Studio 2010 Built-in CPU Acceleration
By Asaf Shelly, Posted 12/20/2010
Writing the sample code for this post I was amazed myself to see how simple it was to reach over 20 times performance improvement with so little effort.    The motivation is a very heavy video processing algorithm created for HD TV. This means hi-resolution which means many pixels to compute and ...
New Parallel Studio: Intel Parallel Studio 2011
By James Reinders (Intel), Posted 09/14/2010
This month, we introduced Intel Parallel Studio 2011. It is a very worthy successor to the original Intel Parallel Studio by expanding both on the tooling and the parallel programming models it offers. On the tooling, we have the Intel Parallel Advisor tool. It is an exciting tool that is a joy t...
Parallel Programming Talk #62 - What Every Software Developer Should Know About Intel AVX
By aaron-tersteeg (Intel), Posted 02/03/2010
Welcome to Show 62 of Parallel Programming Talk. Today is February 2nd. Groundhog day in the United States. On this episode Clay and Aaron will be addressing recent questions about what every software developer should know about Intel® Advanced Vector Extensions (Intel® AVX) with Senior Software ...



    Intel® Software Guard Extensions (Intel® SGX)

    Intel Vision Statement

    Computing workloads today are increasing in complexity, with hundreds of software modules delivered by different teams scattered across the world. Isolation of workloads on open platforms has been an ongoing effort, beginning with protected-mode architecture to create a privilege separation between operating systems and applications. Recent malware attacks, however, have demonstrated the ability to penetrate highly privileged modes and gain control over all the software on a platform.

    Intel® Software Guard Extensions (Intel® SGX) is a name for Intel Architecture extensions designed to increase the security of software through an “inverse sandbox” mechanism. In this approach, rather than attempting to identify and isolate all the malware on the platform, legitimate software can be sealed inside an enclave and protected from attack by the malware, irrespective of the privilege level of the latter. This would complement the ongoing efforts in securing the platform from malware intrusion, similar to how we install safes in our homes to protect valuables even while introducing more sophisticated locking and alarm systems to prevent and catch intruders.


    Intel® Memory Protection Extensions (Intel® MPX)

    Computer systems face malicious attacks of increasing sophistication, and one of the more commonly observed forms is to cause or exploit buffer overruns (or overflows) in software applications.

    Intel® Memory Protection Extensions (Intel® MPX) is a name for Intel Architecture extensions designed to increase the robustness of software. Intel MPX will provide hardware features that can be used in conjunction with compiler changes to check that memory references do not stray at runtime outside the bounds intended at compile time. Two of the most important goals of Intel MPX are to provide this capability at low overhead for newly compiled code, and to provide compatibility mechanisms for legacy software components. Intel MPX will be available in a future Intel® processor.


      Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions)

      The Secure Hash Algorithm (SHA) is one of the most commonly employed cryptographic algorithms.  Primary usages of SHA include data integrity, message authentication, digital signatures, and data de-duplication.  As the pervasive use of security solutions continues to grow, SHA can be seen in more applications now than ever. The Intel® SHA Extensions are designed to improve the performance of these compute intensive algorithms on Intel® architecture-based processors.

      The Intel® SHA Extensions are a family of seven Intel® Streaming SIMD Extensions (Intel® SSE)-based instructions that are used together to accelerate the processing of SHA-1 and SHA-256 on Intel architecture-based processors.  Given the growing importance of SHA in our everyday computing devices, the new instructions are designed to provide a needed performance boost for hashing a single buffer of data. The performance benefits will not only help improve responsiveness and lower power consumption for a given application; they may also enable developers to adopt SHA in new applications to protect data while meeting their user experience goals. The instructions are defined in a way that simplifies their mapping into the algorithm processing flow of most software libraries, thus enabling easier development.

        SSE2 vectorized code seems to run slower than non-vectorized code
        By tomk@ap.com
        Hi everyone: This is my first time posting to the forum. I have a lot of experience designing and optimizing assembly language routines for signal processing. Until recently, this was on what you might call predictable architectures (Moto 56k DSP and PowerPC). I am now doing this on an x86, and having difficulty understanding where timing changes are occurring. I'm only interested in optimizing one loop: a second-order section (a basic building block of digital filters). The loop processes one sample, reading it in from memory, doing the math and updating the internal states, and writing the sample out again (on top of where it was read from). The loop executes N times, where N is the number of samples in the buffer. I began by writing the loop in C++, then turning on the SSE2 optimizations in the compiler (Visual Studio) and grabbing the disassembly. I then manually removed all the unnecessary re-loads of registers that never changed and whatnot, and ended up with something like a ...
        SSE4 Register-Handling
        By adrian s.
        I'm working on a stereo-algorithm to compute a disparity map. Therefore I need to calculate a lot of SAD-values. To improve the performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" instruction. I stumbled over this Intel document. In Section F "Intel® SSE4 – Optimized Function for 16x16 Blocks" is a SAD calculation example of a 16x16 Block. I used this snippet in my code and the performance improved a lot. But it is not enough. Is it possible to boost the performance by using all 16 SSE registers instead of 8, or is there any kind of constraint? Best Regards Jambalaja
        Haswell TLBs undefined in Intel cpu spec
        By perfwise
        I am currently upgrading my cpuid detection of Intel TLBs and have a Haswell 4770 cpu.  I note that in the 4 registers returned by cpuid test eax=2 I observe undefined descriptors of 0xc1 and 0xb6 being returned which are not defined in the Intel cpu spec for my Intel i7 4770 released cpu.   Can someone at Intel update the spec for tlb detection in leaf eax=2 and let me know what is missing "please".  I use this in my high perf code for tlb detection and currently don't detect any 2nd level TLB. Perfwise
        How to extract DWORD from upper half of 256-bit register?
        By Igor Levicki
        Congratulations to Intel CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why PEXTRD/PINSRD (among many others) were not promoted to 256 bits in AVX2? Any ideas/tricks to work around this engineering "oversight"?
        Almost-unit-stride stores
        By Fabio L.
        Hi all I have a AVX vector register reg containing 4 double values, let's call them (in order): 0 - 2 - 3 - 4These values have to be added to distinct locations of an array A, namely to positions A[0], A[2], A[3], A[4]In other words: A[0] += reg[0], A[2] += reg[2] and so on This is a quite recurrent situation in my program, i.e. sequences of load-add-stores that are "almost" unit-stride - but actually they are not.  At the beginning I thought I could have used some sort of shuffle instructions to shift values in reg, i.e. getting 0 - x - 2 - 3 (and maybe treating reg[4] as a scalar value), and then perfom standard 256-bit instructions. However, as far as I know, I can't reduce that kind of shifting to a single instruction, right?  Related to this question, let's say that now reg is 0 - 2 - 3 - 5. Should I treat all 4 values as scalar values or is there a way of efficiently (1/2 instructions?) extracting the two values in the middles (i.e. those crossing the two 128-bits lanes) into ...
        To use FPU
        By GHui
        The following code is to use FPU. I run it on E5-2620. It only upto 2 GFlops. If I want to 2*8 GFlops, how could I code program? Any help will be appreciated. void* test_pd_avx(){  double x[4]={12.02,14.34,34.23,234.34};  double y[4]={123.234,234.234,675.34,3453.345};  __m256d mx=_mm256_load_pd(x);  __m256d my=_mm256_load_pd(y);  for(;;)  {    __m256d mz=_mm256_mul_pd(mx,my);      }} The Compiler Option: icc test.c -O0
        Haswell GFLOPS
        By caosun
        Hi Intel Experts:     I cannot find the latest Intel Haswell CPU GFlops, could you please let me know that?     I want to understand the performance difference between Haswell and Ivy-bridge, for example, i7-4700HQ and i7-3630QM. From Intel website, I could know i7-3630QM's GFlops is 76.8 (Base). Could you please let me know that of i7-4700HQ?     I get some information from internet that:          Intel SandyBridge and Ivy-Bridge have the following floating-point performance: 16-SP FLOPS/cycle --> 8-wide AVX addition and 8-wide AVX multiplication.         Intel Haswell have the following floating-point performance: 32-SP FLOPS/cycle --> two 8-wide FMA (fused multiply-add) instructions     I have two questions here:     1. Take i7-3632QM as an example: 16 (SP FLOPS/cycle) X 4 (Quad-core) X 2.4G (Clock) = 153.6 GFLOPS = 76.8 X 2. Does it mean that one operation is a combined addition and multiplication operation?     2. Does Haswell have TWO FMA?      Thank you very much for an...
        Question about example on Optimization manual---AVX mask move to avoid branch penalty
        By Deyang Gu
        Hi all, I am trying to run an example introduced by optimization manual(June 2013) on page 11-23, example 11-14. I tried to use a separate .s file to write the function, and a main.c file to do the main func. The code will only run correctly in debug mode. Please see attachment for my code. The cond_loop.c is actually cond_loop.s but the forum won't accept this kind of extension.   icc main-2.c cond_loop.s -g          Everything works fine.  icc main-2.c cond_loop.s              Segmentation Fault with failure to access array members at the end of the code. After the function void cond_loop(const float *a, float *b, const float *c, const float *d, const float *e, const int length) returns, all the array pointers will be lost so I cannot access the old arrays anymore. This problem will only occur without -g compile option, meaning release code only bug. So I am not able to debug it. I did some research and it showed this is because in debug mode stack frame pointer will always be sav...

