From ARM NEON to Intel SSE: the automatic porting solution, tips and tricks

I love ARM. Yes, I do. So why do I work for Intel? Because I love Intel even more. That's why I'd like to help independent software vendors port their products from ARM to Intel Architecture. But if and only if they would like to do it.

Why would they like it? The answer is simple. Intel CPUs are now in smartphones and tablets, while Android, Windows 8, and some other operating systems support both ARM and x86, increasing developers' options enormously. In most cases porting to x86 is very easy: the work varies from zero (for managed code and generic native code running via the Intel Houdini binary translator) to a simple code rebuild with the corresponding compiler. But for some applications this is not true.

Modern ARM CPUs widely used in mobile devices (iPhone, iPad, Microsoft Surface, Samsung devices, and millions of others) have a 64/128-bit SIMD instruction set (known as NEON, or MPE, Media Processing Engine) defined as part of the ARM® Architecture, Version 7 (ARMv7). NEON is used by numerous developers for performance-critical tasks, either via assembler or via the NEON intrinsics set supported by modern compilers such as gcc, RVCT, and Microsoft's. NEON can be found in such famous open-source projects as FFmpeg, VP8, OpenCV, etc. For such projects, achieving maximal performance on x86 requires porting the ARM NEON instructions or intrinsics to Intel SIMD (SSE): namely, to Intel SSSE3 for the first generation of Intel Atom CPU based devices, and to Intel SSE4 for the second generation launched in 2013.

However, the x86 SIMD and NEON instruction sets, and therefore their intrinsic functions, are different; there is no one-to-one correspondence between them, so the porting task is not trivial. Or, to be more precise, it was not trivial before this post was published.

An automatic solution for porting intrinsics-based ARM NEON source code to Intel x86 SIMD

Attached is the automatic solution for porting intrinsics-based ARM NEON source code to Intel x86 SIMD (SSE up to 4.2). Why intrinsics? x86 SIMD intrinsic functions are supported by all widespread C/C++ compilers: the Intel compiler, the Microsoft compiler, gcc, etc. These intrinsics are very mature, and their performance is equal to or even greater than that of pure assembler, while the usability is much better.

Why only Intel SSE and not MMX? While using the Intel MMX (64-bit data processing) instruction set to substitute for 64-bit NEON instructions is possible, it is not recommended: MMX performance is commonly the same as or lower than that of the SSE instructions, and the MMX-specific problem of sharing floating-point registers with serial code can cause a lot of problems in software if not treated properly.
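The register-sharing hazard can be illustrated with a minimal sketch (the function name and values are purely illustrative): any MMX computation leaves the x87 register stack in MMX state, so _mm_empty() (the EMMS instruction) must run before any subsequent x87 floating-point code.

```c
#include <mmintrin.h>
#include <string.h>

/* Add four 16-bit lanes with MMX, then clear the MMX state with EMMS so
   that later x87 floating-point code in the caller is safe. */
static void add4_i16_mmx(short *dst, const short *a, const short *b)
{
    __m64 va, vb, vr;
    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));
    vr = _mm_add_pi16(va, vb);   /* MMX: executes in the x87/MMX register file */
    memcpy(dst, &vr, sizeof(vr));
    _mm_empty();                 /* EMMS: mandatory before any x87 FP code */
}
```

SSE code has its own xmm register file and needs no such bookkeeping, which is one more reason the header below prefers SSE even for 64-bit NEON data.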

By default, SSE up to SSSE3 is used for porting, but if you uncomment the corresponding "#define USE_SSE4" line, then SSE up to SSE4 is used instead.

Though the solution is targeted at intrinsics, it can also assist with pure assembler porting. Namely, for each NEON function the corresponding NEON asm instruction is provided, so the matching x86 intrinsics code can be used directly, and the asm code can be copied from the compiler's intermediate output.

The solution is shaped as a C/C++ language header to be included in the ported sources instead of the standard "arm_neon.h", providing fully automatic porting.
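For cross-platform sources, the include can be guarded so that the real NEON header is still used on ARM builds; this is only a sketch, using the attachment's file name from this post:

```c
/* Real NEON on ARM targets, the substitution header elsewhere. */
#ifdef __arm__
#include <arm_neon.h>
#else
/* #define USE_SSE4 */   /* uncomment to enable the SSE4 code paths */
#include "NEONvsSSE_5.h"
#endif
```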

The solution covers 100% of NEON functions (~1700 of them) and:

  1. Redefines ARM NEON 64- and 128-bit vectors as the corresponding x86 SIMD data types.
  2. Redefines some functions from ARM NEON to Intel SSE where a 1:1 correspondence exists (~50% of functions).
  3. Implements some ARM NEON functions using Intel SIMD where a performance-effective implementation is possible (~45% of functions).
  4. Implements the remaining NEON functions (<5% of functions, not widely used in applications) using a serial solution, issuing the corresponding "low performance" compiler warning.

 

//*******  definition sample *****************
int8x8_t vadd_s8(int8x8_t a, int8x8_t b);   // VADD.I8 d0,d0,d0
#define vadd_s8 _mm_add_epi8

//***** porting sample ***********
uint8x16_t vshrq_n_u8(uint8x16_t a, __constrange(1,8) int b) // VSHR.U8 q0,q0,#8
{   // no 8-bit shift is available, so a special trick is needed
    __declspec(align(16)) unsigned short mask10_16[9] = {0xffff, 0xff7f, 0xff3f, 0xff1f, 0xff0f, 0xff07, 0xff03, 0xff01, 0xff00};
    __m128i mask0 = _mm_set1_epi16(mask10_16[b]); // masks the bits "spoiled" by the 16-bit shift
    __m128i r = _mm_srli_epi16(a, b);
    return _mm_and_si128(r, mask0);
}

//***** “serial solution” sample *****
uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b); // VQADD.U64 d0,d0,d0
INLINE PERFORMANCE_WARNING(uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b), REASON_SLOW_SERIAL)
      ALIGN_16 uint64_t res, a64, b64;
      a64 = ui64(a);
      b64 = ui64(b);
      res = a64 + b64;
      if (res < a64) {res = ~(uint64_t)0;} // saturate on overflow
      return M128i(res);}

The implemented functions pass extensive correctness tests against the ARM NEON intrinsic functions.

Main ARM NEON to x86 SIMD porting challenges:

  • Some x86 intrinsic functions require immediate parameters rather than constants, resulting in a compile-time “catastrophic error” when called from a wrapper function. Fortunately, this happens only with some compilers and only in non-optimized (Debug) builds. The solution is to replace such functions, in Debug mode, with a switch over the immediate parameter using branches (cases).
  • Not all arithmetic operations are available for 8-bit data in x86 SIMD; in particular, there is no shift for such data. The common solution is to convert the 8-bit data to 16-bit, process it, and then pack it back to 8-bit. However, in some cases it is possible to use tricks like the one shown in the vector right-shift sample above (the vshrq_n_u8 function) to avoid such conversions.
  • For some functions whose x86 implementation contains more than one instruction, intermediate overflow is possible. The solution is to use an overflow-safe implementation of the algorithm even if it is slower. Say we need to calculate the average of a and b, i.e. (a+b)/2; the calculation should then be done as (a/2 + b/2).
  • For some NEON functions there exist corresponding x86 SIMD functions, but their behavior differs when the function parameters are “out of range”. Such cases need special processing, as in the following table lookup sample: while by the NEON specification out-of-range indexes return 0, for Intel SIMD we need to set the most significant bit of the index to 1 to get a zero returned:


uint8x8_t vtbl1_u8(uint8x8_t a, uint8x8_t b)
{
    __m64 c7 = _mm_set1_pi8(7);
    __m64 maskgt = _mm_cmpgt_pi8(b, c7);  // 0xff where the index is out of range
    __m64 bmask = _mm_or_si64(b, maskgt); // sets the MSB for out-of-range indexes
    return _mm_shuffle_pi8(a, bmask);     // pshufb returns 0 where the MSB is set
}

  • For some NEON functions there exist corresponding x86 SIMD functions, but their rounding rules differ, so we need to compensate by adding or subtracting 1 from the final result.
  • For some functions an x86 SIMD implementation is not possible or not efficient. Examples are the shift of a vector by another vector, and some arithmetic operations on 64-bit and 32-bit data. The only solution here is a serial code implementation.
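The widen-process-pack approach for 8-bit data described above can be sketched as follows; mul_lo_u8 is a hypothetical helper name (not a function from the attached header) for a per-lane unsigned 8-bit multiply, an operation x86 SIMD lacks:

```c
#include <emmintrin.h>

/* Per-lane unsigned 8-bit multiply keeping the low 8 bits of each product:
   unpack the 8-bit lanes to 16 bits, multiply, mask, and pack back. */
static __m128i mul_lo_u8(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    __m128i alo = _mm_unpacklo_epi8(a, zero);   /* widen the low 8 lanes  */
    __m128i ahi = _mm_unpackhi_epi8(a, zero);   /* widen the high 8 lanes */
    __m128i blo = _mm_unpacklo_epi8(b, zero);
    __m128i bhi = _mm_unpackhi_epi8(b, zero);
    __m128i mask = _mm_set1_epi16(0x00ff);      /* keep the low product byte */
    __m128i rlo = _mm_and_si128(_mm_mullo_epi16(alo, blo), mask);
    __m128i rhi = _mm_and_si128(_mm_mullo_epi16(ahi, bhi), mask);
    return _mm_packus_epi16(rlo, rhi);          /* pack back to 8-bit lanes */
}
```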
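For the overflow challenge described above, the unsigned byte average is a case where a dedicated x86 instruction sidesteps the problem entirely: _mm_avg_epu8 computes (a+b+1)>>1 in a 9-bit intermediate, so only the rounding bit has to be corrected to get a truncating (a+b)/2. This is a sketch with a hypothetical helper name:

```c
#include <emmintrin.h>

/* Truncating average (a+b)/2 of unsigned bytes with no intermediate
   overflow: _mm_avg_epu8 gives (a+b+1)>>1, then subtracting the rounding
   bit ((a^b)&1) converts round-up into truncation. */
static __m128i avg_down_u8(__m128i a, __m128i b)
{
    __m128i one = _mm_set1_epi8(1);
    __m128i rbit = _mm_and_si128(_mm_xor_si128(a, b), one);
    return _mm_sub_epi8(_mm_avg_epu8(a, b), rbit);
}
```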
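The rounding compensation mentioned above can be sketched for a NEON-style rounding right shift of unsigned 16-bit lanes, which SSE lacks (it only truncates); rshr_u16 is a hypothetical helper name. Computing (a >> n) + ((a >> (n-1)) & 1) instead of the naive (a + (1 << (n-1))) >> n also avoids an intermediate overflow:

```c
#include <emmintrin.h>

/* Rounding right shift of unsigned 16-bit lanes: round(a / 2^n), n in 1..16,
   as (a >> n) plus the rounding bit, with no overflowing intermediate. */
static __m128i rshr_u16(__m128i a, int n)
{
    __m128i half = _mm_srl_epi16(a, _mm_cvtsi32_si128(n));
    __m128i rbit = _mm_and_si128(_mm_srl_epi16(a, _mm_cvtsi32_si128(n - 1)),
                                 _mm_set1_epi16(1));
    return _mm_add_epi16(half, rbit);
}
```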

Performance

First, it is necessary to note that the exact porting solution selected for each function is based on common sense and on x86 SIMD latency and throughput data for the latest Intel CPUs. However, for some CPUs and conditions a better solution might be possible.

The solution's performance was tested on several projects, with very similar results. For example, let's take the “Hello NEON” Android NDK standard sample. It implements a FIR filter along with an integrated benchmark comparing NEON vs. serial execution performance.

The test was compiled as-is for ARM (gcc 4.7) and, with the proposed header, for x86 (Intel compiler), then executed on the corresponding CPUs. Our interest here is not the absolute performance numbers but only the SIMD performance improvement ratio; that's why we don't need to use the same or even a similar environment for the ARM and x86 testing. The testing gives the following results:

ARM Cortex-A9 (Samsung Galaxy Note): a 4.2x speedup for the NEON version vs. the serial C version.

Intel Atom Z3770 (Intel development platform): a 4.0x speedup for the SSE version vs. the serial C version.

That leads to the first porting tip:

  • In most cases, expect the x86 port's vectorized/serial performance-increase ratio to be similar to ARM NEON's.

Other porting considerations and best known methods are:

  • Use 16-byte data alignment for faster loads and stores.
  • Avoid NEON functions that work with constants. They give no gain, only a performance penalty from constant load/propagation. If constants are necessary, try to move constant initialization out of hot loops, and, if applicable, replace it with logical and compare operations.
  • Try to avoid functions marked as "serially implemented", because they need to store data from registers to memory, process it serially, and load it back again. You may be able to change the data type or algorithm used to keep the whole port vectorized rather than serial.
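The constant-hoisting tip can be sketched as follows; the function and its names are purely illustrative. The vector constant is built once before the loop rather than on every iteration:

```c
#include <emmintrin.h>

/* Add 17 to each unsigned 16-bit element of src; the vector constant is
   initialized once, outside the hot loop. */
static void add_17_u16(unsigned short *dst, const unsigned short *src, int n)
{
    __m128i k = _mm_set1_epi16(17);   /* hoisted out of the loop */
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(v, k));
    }
    for (; i < n; i++)                /* scalar tail */
        dst[i] = (unsigned short)(src[i] + 17);
}
```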

Once again: just include the file below in your project instead of the "arm_neon.h" header, and your code will be ported without any other changes necessary!

Upd. The NEONvsSSE.h file was updated (NEONvsSSE_5.h) on 03.04.2014 to fix various issues, including 100% gcc compatibility for C++ code.

Attachment: NEONvsSSE_5.h (712.26 KB)

Comments

Daniel D.'s picture

Hello, thanks for the great effort you spent in writing a NEON to SSE bridge, which is very helpful for me. Nevertheless, I found an issue with the implementation: vabd is not implemented correctly, as far as I can tell, because it doesn't handle an intermediate overflow. At least for ARM9 the results show differences. Best regards, slowdive_de

victoria-zhislina (Intel)'s picture

Daniel (slowdive_de), thanks a lot for letting me know about this bug - I'll fix it and post the updated version ASAP. Meanwhile, if further bugs are found or you have some feature requests etc., please feel free to contact me directly by mail = victoria.zhislina at intel.com

kind regards

Daniel D.'s picture

Hello, as I noticed, there was already an update since I ran the test. With the current version (from 18.2.2014) the issue seems to be already fixed! Thanks for your effort and best regards, Daniel.

kasun k.'s picture

Thank you for developing this library. It will come in handy to port apps across platforms. I am trying to compile this simple example and am getting a huge number of warnings and errors.

I am using gcc (GCC) 4.7.2, on SUSE Linux 11, on a quad Intel(R) Xeon(R) CPU W3550 @ 3.07GHz.

 

Command> gcc example.c

Tried the following options, -march=native and -msse3, with no luck.

Do I need to add any more compile options? I would greatly appreciate any help in this regard.

 

 

#include <stdio.h>
#include <stdint.h>
#include "NEON2SSE_0.h"

int main(void)
{
    uint8x8_t v1 = {1, 1, 1, 1, 1, 1, 1, 1};
    uint8x8_t v2 = {2, 2, 2, 2, 2, 2, 2, 2};
    uint8x8x2_t vd1, vd2;
    union {uint8x8_t v; uint8_t buf[8];} d1, d2, d3, d4;
    int i;

    vd1 = vzip_u8(v1, vdup_n_u8(0));
    vd2 = vzip_u8(v2, vdup_n_u8(0));

    vst1_u8(d1.buf, vd1.val[0]);
    vst1_u8(d2.buf, vd1.val[1]);
    vst1_u8(d3.buf, vd2.val[0]);
    vst1_u8(d4.buf, vd2.val[1]);

    printf("  d1  d2  d3  d4\n");
    for (i = 0; i < 8; i++) {
        printf("%4d%4d%4d%4d\n",
        d1.buf[i],
        d2.buf[i],
        d3.buf[i],
        d4.buf[i]);
    }

    return 0;
}

Sample Errors

NEON2SSE_0.h:13872:10: error: incompatible types when assigning to type __m128i from type int
NEON2SSE_0.h:13873:10: error: incompatible types when assigning to type __m128i from type int
example.c: In function main:
example.c:16:24: error: request for member val in something not a structure or union
example.c:17:24: error: request for member val in something not a structure or union
example.c:18:24: error: request for member val in something not a structure or union
example.c:19:24: error: request for member val in something not a structure or union

 

victoria-zhislina (Intel)'s picture

Hello, kasun k. Thanks a lot for your input! It helped me find the problem in my implementation: it was not possible to address .val explicitly for 64-bit data (it worked for 128-bit only). I plan to fix it by the end of this week - please subscribe to updates and take the latest version.

But there is one more problem here, unsolvable but I hope not too severe. Namely, it is not possible to use unions with x86 vector type members. It is not a restriction of my solution but of the x86 intrinsics implementation in compilers. Therefore your code should look like this:

int main(void)
{
    uint8x8_t v1 = {1, 1, 1, 1, 1, 1, 1, 1};
    uint8x8_t v2 = {2, 2, 2, 2, 2, 2, 2, 2};
    uint8x8x2_t vd1, vd2;
    uint8_t  d1[8], d2[8], d3[8], d4[8];
    int i;
 
    vd1 = vzip_u8(v1, vdup_n_u8(0));
    vd2 = vzip_u8(v2, vdup_n_u8(0));
 
    vst1_u8(d1, vd1.val[0]);
    vst1_u8(d2, vd1.val[1]);
    vst1_u8(d3, vd2.val[0]);
    vst1_u8(d4 , vd2.val[1]);
 
    printf("  d1  d2  d3  d4\n");
    for (i = 0; i < 8; i++) {
        printf("%4d%4d%4d%4d\n",
        d1[i],
        d2[i],
        d3[i],
        d4[i]);
    }
    return 0;
}

The code above compiles fine and gives the same result as on an ARM CPU with my new version - to be posted here very soon, as stated above.

And about the warnings: that is the designed behavior, to warn about the functions that are implemented serially.

Thanks again for your findings!

kasun k.'s picture

Fantastic. Works like a charm! I can work around not having unions and val.

Really appreciate all the hard work done to accomplish this herculean task.