Intel® AVX C/C++ Intrinsics Emulation

The Intel® AVX instruction set extension [1] will appear in the next-generation Intel microarchitecture, codenamed 'Sandy Bridge'. We chose to announce AVX early so that as many software vendors as possible could support it by the hardware launch. Most software development platforms now support Intel AVX, including compilers and assemblers from Intel, Microsoft, and GCC, as well as UNIX binutils.

For early adopters, we introduced AVX support in the Intel® Software Development Emulator (SDE) [2], which lets you run code containing actual AVX instructions and check its functional correctness before hardware is available.

Today we are adding another useful piece for developers who cannot yet use AVX-capable tools in their current development environment but plan to migrate in the future, or who use a software platform that Intel SDE does not support. These developers can still start programming with Intel AVX using intrinsics.

Here we provide a C and C++ header file that emulates the Intel AVX intrinsics. It is built on the intrinsics for earlier Intel instruction set extensions, up to Intel SSE4.2, so both your development environment and your hardware must support SSE4.2 in order to use it.
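As a rough illustration of how this kind of emulation can work, the sketch below represents a 256-bit float vector as two 128-bit SSE halves and applies each operation to both halves. The sketch_ names are hypothetical and chosen for this example only; the actual header's types and implementation may differ.

```c
#include <assert.h>     /* assert */
#include <stddef.h>     /* NULL */
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical 256-bit float vector built from two 128-bit SSE halves.
   This is an assumption for illustration, not the header's real type. */
typedef struct {
    __m128 lo;  /* elements 0..3 */
    __m128 hi;  /* elements 4..7 */
} sketch_m256;

/* Unaligned load of 8 floats, done as two 4-float SSE loads. */
static inline sketch_m256 sketch_mm256_loadu_ps(const float *p)
{
    sketch_m256 r;
    assert(p != NULL);
    r.lo = _mm_loadu_ps(p);
    r.hi = _mm_loadu_ps(p + 4);
    return r;
}

/* Element-wise add of 8 floats, done as two 4-float SSE adds. */
static inline sketch_m256 sketch_mm256_add_ps(sketch_m256 a, sketch_m256 b)
{
    sketch_m256 r;
    r.lo = _mm_add_ps(a.lo, b.lo);
    r.hi = _mm_add_ps(a.hi, b.hi);
    return r;
}

/* Unaligned store of 8 floats, done as two 4-float SSE stores. */
static inline void sketch_mm256_storeu_ps(float *p, sketch_m256 v)
{
    assert(p != NULL);
    _mm_storeu_ps(p, v.lo);
    _mm_storeu_ps(p + 4, v.hi);
}
```

Operations that work independently on each element map onto the two halves this directly; operations that move data across the 128-bit boundary would need more care than this sketch shows.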

To use it, simply include this file:

#include "avxintrin_emu.h"

instead of the usual:

#include <immintrin.h>

You can also create an alternative immintrin.h file (which in turn includes avxintrin_emu.h) to avoid intrusive changes to the source base, and then switch between real AVX code generation and emulation simply by changing the include directory path.

The emulation header primarily targets UNIX-type environments and was tested there with the GCC and Intel C/C++ compilers. On the Microsoft Windows platform, AVX is already well supported by other tools (compilers, assemblers, and SDE), but the header can still be used on Windows with the Intel Compiler if desired.

Note that the AVX emulation header file is designed for checking the functional correctness of an AVX implementation; it is not recommended for long-term use or for release in a final product. Once your development environment and hardware support AVX, we recommend that you switch to the real AVX intrinsics header file.

Although we did our best to debug it, this file must not be considered a reference functional implementation of AVX instructions, or even bug-free. Please see the current version's limitations and caveats at the beginning of the file, and let us know about any issues you encounter while using it.


#include <stddef.h>         // size_t
#include "avxintrin_emu.h"  // #include <immintrin.h>

void saxpy(float a, const float* x, const float* y, float* __restrict z, size_t len)
{
    size_t i = 0;
    __m256 a_ = _mm256_set1_ps(a);  // broadcast a to all 8 lanes

    // Main loop: 16 floats (two 256-bit vectors) per iteration.
    // Declaring len16_ inside the for statement requires C99 or C++.
    for (size_t len16_ = len & ~(size_t)15; i < len16_; i += 16)
    {
        __m256 x1_ = _mm256_loadu_ps(x + i);
        __m256 x2_ = _mm256_loadu_ps(x + i + 8);

        __m256 y1_ = _mm256_loadu_ps(y + i);
        __m256 y2_ = _mm256_loadu_ps(y + i + 8);

        x1_ = _mm256_mul_ps(x1_, a_);
        x2_ = _mm256_mul_ps(x2_, a_);

        x1_ = _mm256_add_ps(x1_, y1_);
        x2_ = _mm256_add_ps(x2_, y2_);

        _mm256_storeu_ps(z + i, x1_);
        _mm256_storeu_ps(z + i + 8, x2_);
    }

    // Scalar tail for the remaining len % 16 elements.
    for (; i < len; ++i)
        z[i] = x[i] * a + y[i];
}
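One possible way to check the functional correctness of a kernel like this, sketched here under our own assumptions, is to compare it against a plain scalar reference on inputs whose results are exactly representable in float. The names saxpy_fn, saxpy_ref, saxpy_check and CHECK_MAX_LEN are hypothetical and are not part of the emulation header.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* size_t */

enum { CHECK_MAX_LEN = 256 };  /* hypothetical upper bound on the test length */

/* Any routine with the same parameter list as saxpy can be checked. */
typedef void (*saxpy_fn)(float a, const float *x, const float *y, float *z, size_t len);

/* Plain scalar reference: z[i] = a * x[i] + y[i]. */
static void saxpy_ref(float a, const float *x, const float *y, float *z, size_t len)
{
    size_t i;
    for (i = 0; i < len; ++i)
        z[i] = x[i] * a + y[i];
}

/* Runs fn and the reference on the same inputs and returns the number of
   elements that differ. The inputs are small integers and a is 2, so every
   product and sum is exact and a bitwise-equal comparison is meaningful. */
static size_t saxpy_check(saxpy_fn fn, size_t len)
{
    static float x[CHECK_MAX_LEN], y[CHECK_MAX_LEN];
    static float got[CHECK_MAX_LEN], want[CHECK_MAX_LEN];
    size_t i, mismatches = 0;

    assert(len <= CHECK_MAX_LEN);
    for (i = 0; i < len; ++i) {
        x[i] = (float)(i % 7);
        y[i] = (float)(i % 5);
    }
    fn(2.0f, x, y, got, len);
    saxpy_ref(2.0f, x, y, want, len);
    for (i = 0; i < len; ++i)
        if (got[i] != want[i])
            ++mismatches;
    return mismatches;
}
```

Calling saxpy_check(saxpy, 37), for instance, would exercise both the 16-element main loop and the scalar tail of the routine above; a return value of 0 means every element matched the reference.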

[1] Intel AVX - /en-us/avx/

[2] Intel Software Development Emulator - /en-us/articles/intel-software-development-emulator

File: avxintrin-emu.h (44.16 KB)


Vadim M.

To fix the GCC warning about inline asm constraints in __emu_mm_cmp_ps, I used the following modified code:

static __emu_inline __m128 __emu_mm_cmp_ps(__m128 m1, __m128 m2, const int predicate)
{
    __m128 res;
    if (__builtin_constant_p(predicate) && predicate >= 0 && predicate <= 7 ) {
        res = m1;
        __asm__ ( "cmpps %[pred_], %[m2_], %[res_]" : [res_] "+x" (res) : [m2_] "xm" (m2), [pred_] "i" (predicate) );
    } else {
        res = _mm_setzero_ps();
        __asm__ __volatile__ ( "ud2" : : : "memory" ); /* not supported yet */
    }
    return ( res );
}

anonymous


Thanks for this header file. It allowed me to find a bug in my code with AVX support enabled, even though I don't have access to an AVX machine!

By the way, after looking at the code, I found a small potential problem that would become real only if the _mm_broadcast_sd(double const *) function became part of the standard (_mm_broadcast_ss(float const *) is part of it, but _mm_broadcast_sd is not). This function would return a vector of doubles, not a vector of floats, so it would return an __m128d, not an __m128. This is why I suggest the following patch:

Index: avxintrin_emu.h
--- avxintrin_emu.h (revision 5759)
+++ avxintrin_emu.h (working copy)
@@ -601,7 +601,7 @@
__emu_mm_broadcast_impl( __emu_mm_broadcast_ss, __m128, float )
__emu_mm_broadcast_impl( __emu_mm256_broadcast_ss, __emu__m256, float )

-__emu_mm_broadcast_impl( __emu_mm_broadcast_sd, __m128, double )
+__emu_mm_broadcast_impl( __emu_mm_broadcast_sd, __m128d, double )
__emu_mm_broadcast_impl( __emu_mm256_broadcast_sd, __emu__m256d, double )

__emu_mm_broadcast_impl( __emu_mm256_broadcast_ps, __emu__m256, __m128 )


anonymous

Great idea with the "Intrinsics Guide for Intel Advanced Vector Extensions v2.5.1". However, one very critical piece of information is missing that I would like to see: the number of cycles required to execute a specific instruction.

Thank you!


anonymous

I am still looking for an emulation of the _mm256_fmadd_pd(__m256d, __m256d, __m256d) intrinsic. Would anyone care to include it?


anonymous

Now that AVX hardware (Sandy Bridge) chips are shipping, I wanted to see how it performs with SAXPY on arrays of different sizes. With icc version 12.0.2 I tried to compile the code above and got:

~/src/intel> icc -v
Version 12.0.2
~/src/intel> icc -c i.c
i.c(9): error: type name is not allowed
for ( size_t len16_ = len & -16; i + 16 <= len16_; i += 16 )

i.c(9): error: expected a ";"
for ( size_t len16_ = len & -16; i + 16 <= len16_; i += 16 )

i.c(9): error: identifier "len16_" is undefined
for ( size_t len16_ = len & -16; i + 16 <= len16_; i += 16 )

Have the intrinsic names changed since this post? Do I need some AVX specific compiler flags to pick up the right includes?

anonymous

Thank you, this is awesome!

I was just going to start writing a file which implements all AVX intrinsics using two SSE intrinsics when I discovered this.

I'm developing on both Windows (Visual Studio 10) and QNX Neutrino (using GCC 4.4.2).

On Windows, the MS compiler has problems with stack/function argument alignment because most arguments are passed by value, which causes compiler error C2719. Since this issue is fairly common and requires things like modifying Microsoft's STL vector header, I can understand why you don't support it in this emulation header.

I've successfully compiled this in QNX Momentics with GCC, however. It hasn't crashed at runtime, but I need to verify the data to make sure everything works as expected, especially with the branching / blend functions.

Thanks again!
