From ARM NEON to Intel SSE: the automatic porting solution, tips and tricks

I love ARM. Yes, I do. - Why do I work for Intel then? Because I love Intel even more. That's why I'd like to help independent software vendors to port their products from ARM to Intel Architecture. But if and only if they would like to do it.

Why would they like it? The answer is simple. Intel CPUs are now found in smartphones and tablets, while Android, Windows 8, and some other operating systems support both ARM and x86, which expands developers' options enormously. In most cases porting to x86 is very easy: the work varies from zero (for managed code and for generic native code running via the Intel Houdini binary translator) to a simple rebuild with the corresponding compiler. But for some applications this is not the case.

Modern ARM CPUs widely used in mobile devices (iPhone, iPad, Microsoft Surface, Samsung devices, and millions of others) have a 64/128-bit SIMD instruction set known as NEON, or the "MPE" (Media Processing Engine), first defined as part of the ARM® Architecture, Version 7 (ARMv7). NEON is used by numerous developers for performance-critical tasks, either via assembler or via the NEON intrinsics set supported by modern compilers such as gcc, RVCT, and Microsoft's. NEON can be found in such famous open-source projects as FFmpeg, VP8, OpenCV, etc. For such projects, achieving maximal performance on x86 requires porting the ARM NEON instructions or intrinsics to Intel SIMD (SSE): namely, to Intel SSSE3 for the first generation of Intel Atom CPU-based devices and to Intel SSE4 for the second and later generations available since 2013.

However, the x86 SIMD and NEON instruction sets, and therefore their intrinsic functions, are different; there is no one-to-one correspondence between them, so the porting task is not trivial. Or, to be more precise, it was not trivial before this post was published.

Automatic porting solution for intrinsics-based ARM NEON source code to Intel x86 SIMD

Attached is the automatic solution for porting intrinsics-based ARM NEON source code to Intel x86 SIMD (SSE up to 4.2). Why intrinsics? Because x86 SIMD intrinsic functions are supported by all widespread C/C++ compilers - Intel Compiler, Microsoft Compiler, gcc, etc. These intrinsics are very mature and their performance is equal to or even greater than that of pure assembler, while the usability is much better.

Why Intel SSE only and not MMX? While the Intel MMX (64-bit data processing) instruction set could be used to substitute for 64-bit NEON instructions, it is not recommended: MMX performance is commonly the same as or lower than that of the SSE instructions, and the MMX-specific problem of sharing floating-point registers with the serial code can cause a lot of problems in software if not properly treated. Moreover, MMX is NOT supported on the 64-bit systems that are coming to mobile devices.

By default SSE up to SSSE3 is used for porting, unless you explicitly set the SSE4 code generation flag during compilation or uncomment the corresponding "#define USE_SSE4" line in the file provided. Then SSE up to SSE4 is used for porting.

Though the solution targets intrinsics, it can also be used as an aid in porting pure assembler. Namely, for each NEON function the corresponding NEON asm instruction is listed, while the corresponding x86 intrinsics code can be used directly and the asm code can be copied from the compiler's intermediate output.

The solution is shaped as a C/C++ language header to be included in the ported sources instead of the standard "arm_neon.h", providing fully automatic porting.
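For illustration, a minimal inclusion sketch (assuming the attached header name NEON_2_SSE.h; the conditional simply lets the same sources still build natively on ARM):

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    #include <arm_neon.h>     //native NEON intrinsics on ARM builds
#else
    #include "NEON_2_SSE.h"   //NEON intrinsics mapped to x86 SSE on Intel builds
#endif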

The solution covers 100% of the NEON functions (~1700 of them) except for 16-bit float processing, and it:

  1. Redefines the ARM NEON 64- and 128-bit vectors as the corresponding x86 SIMD data types (a rough sketch follows this list).
  2. Redefines some functions from ARM NEON to Intel SSE where a 1:1 correspondence exists (~50% of the 128-bit functions).
  3. Implements some ARM NEON functions using Intel SIMD where a performance-effective implementation is possible (~45% of functions).
  4. Implements the remaining NEON functions (<5% of functions, not widely used in applications) using a serial solution, issuing the corresponding "low performance" compiler warning.
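For illustration, the type redefinition of item 1 could look roughly like the following sketch; the actual definitions in the header are more involved and compiler-dependent, and the m64_u64 member name is taken from the serial sample below:

typedef __m128i int8x16_t;                          //128-bit NEON vector type mapped directly onto an SSE register type
typedef struct { uint64_t m64_u64[1]; } uint64x1_t; //64-bit NEON vector type kept as plain data, loaded into an xmm register on demand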

 

//*******  definition sample *****************
int8x16_t vaddq_s8(int8x16_t a, int8x16_t b); // VADD.I8 q0,q0,q0
#define vaddq_s8 _mm_add_epi8

//***** porting sample ***********
uint8x16_t vshrq_n_u8(uint8x16_t a, __constrange(1,8) int b) // VSHR.U8 q0,q0,#8
{   //no 8-bit shift available, need a special trick
    __declspec(align(16)) unsigned short mask10_16[9] = {0xffff, 0xff7f, 0xff3f, 0xff1f, 0xff0f, 0xff07, 0xff03, 0xff01, 0xff00};
    __m128i mask0 = _mm_set1_epi16(mask10_16[b]); //to mask the bits "spoiled" by the 16-bit shift
    __m128i r = _mm_srli_epi16 (a, b);
    return _mm_and_si128 (r, mask0);
}

//***** "serial solution" sample *****
uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b); // VQADD.U64 d0,d0,d0
_NEON2SSE_INLINE _NEON2SSE_PERFORMANCE_WARNING(uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b), _NEON2SSE_REASON_SLOW_SERIAL)
{
    _NEON2SSE_ALIGN_16 uint64_t a64, b64;
    uint64x1_t res;
    a64 = a.m64_u64[0];
    b64 = b.m64_u64[0];
    res.m64_u64[0] = a64 + b64;
    if (res.m64_u64[0] < a64) {
        res.m64_u64[0] = ~(uint64_t)0;
    }
    return res;
}

For the functions implemented, the solution passes extensive correctness tests against the ARM NEON intrinsic functions.

Main ARM NEON - x86 SIMD porting challenges:

  • 64-bit processing functions. Only 128-bit SSE (xmm) registers are used for x86 vector operations, which means that for each 64-bit processing function we need to somehow load the data into an xmm register and then store it back. This affects not only code quality but performance as well. Different load-store techniques are preferred for different compilers, and for some functions serial processing is faster.

  • Some x86 intrinsic functions require immediate (compile-time constant) parameters; calling them from a wrapper function, where the parameter is a variable, results in a compile-time "catastrophic error". Fortunately this happens only for some compilers and only in non-optimized (Debug) builds. The solution is to replace such calls with a switch over the immediate parameter, using branches (cases) in debug mode; see the sketch after this list.

  • Not all arithmetic operations are available for 8-bit data in x86 SIMD, and there is no shift for such data. The common solution is to convert the 8-bit data to 16-bit, process it, and then pack it back to 8 bits. However, in some cases it is possible to use tricks like the one shown in the vector right shift sample above (the vshrq_n_u8 function) to avoid such conversions.

  • For some functions whose x86 implementation contains more than one instruction, intermediate overflow is possible. The solution is to use an overflow-safe algorithm implementation even if it is slower. Say, if we need to calculate the average of a and b, i.e. (a+b)/2, the calculation should be done as (a/2 + b/2).

  • For some NEON functions there exist corresponding x86 SIMD functions, but their behavior differs when the function parameters are "out of range". Such cases need special processing, like the following table lookup sample. While per the NEON specification out-of-range indices return 0, for Intel SIMD we need to set the most significant bit of the index to 1 to get a zero returned:

    
        uint8x8_t vtbl1_u8(uint8x8_t a, uint8x8_t b)
        {
            uint8x8_t res64;
            __m128i c7, maskgt, bmask, b128;
            c7 = _mm_set1_epi8 (7);
            b128 = _pM128i(b);
            maskgt = _mm_cmpgt_epi8(b128, c7);
            bmask = _mm_or_si128(b128, maskgt);
            bmask = _mm_shuffle_epi8(_pM128i(a), bmask);
            return64(bmask);
        }

  • For some NEON functions there exist corresponding x86 SIMD functions, but their rounding rules are different, so we need to compensate by adding or subtracting 1 from the final result.

  • For some functions an x86 SIMD implementation is not possible or not effective. Examples are shifts of a vector by another vector and some arithmetic operations on 64-bit and 32-bit data. The only solution here is a serial code implementation.
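As mentioned in the immediate-parameters item above, here is a minimal sketch of the switch-based workaround; the helper name is hypothetical, and _mm_srli_si128 (which demands a compile-time immediate byte count) stands in for any such intrinsic:

__m128i shift_bytes_right(__m128i a, int count) //hypothetical helper; count expected in 0..3
{
    switch (count) { //every case passes a literal, so the immediate requirement is satisfied
        case 1:  return _mm_srli_si128(a, 1);
        case 2:  return _mm_srli_si128(a, 2);
        case 3:  return _mm_srli_si128(a, 3);
        default: return a; //count == 0 or out of range: leave the vector unchanged
    }
}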

Performance

First it is necessary to note that the exact porting solution selected for each function is based on common sense and on x86 SIMD latency and throughput data for the latest Intel Atom CPU. However, for some CPUs and conditions a better solution might be possible.

The solution's performance was tested on several projects, demonstrating very similar results that lead to the first and very important conclusion:

  • In most cases of x86 porting, if 128-bit processing NEON functions are used, expect a performance increase ratio similar to the ARM NEON vectorized/serial code ratio.

Unfortunately the situation is different for 64-bit processing NEON functions (even for those taking 64-bit input and returning 128 bits, or vice versa). For them the speedup is significantly lower.

So the second very important conclusion is:

  • Avoid 64-bit processing NEON functions; try to use the 128-bit versions even if your data are 64-bit. If you use 64-bit NEON functions, expect the corresponding performance penalty.

Other porting considerations and best known methods are:

  • Use 16-byte data alignment for faster loads and stores.

  • Avoid NEON functions working with constants. They give no gain, only a performance penalty for constant load/propagation. If constants are necessary, try to move the constant initialization out of hotspot loops (see the sketch after this list) and, if applicable, replace it with logical and compare operations.

  • Try to avoid functions marked as "serially implemented", because they need to store data from vector registers to memory, process it serially, and load it again. You might be able to change the data type or algorithm used to keep the whole port vectorized rather than serial.
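For illustration, a minimal sketch of moving constant initialization out of a hotspot loop (src, dst and n are assumed: a uint8_t source, a uint8_t destination and a length that is a multiple of 16):

//before: the constant vector is recreated on every iteration
for (int i = 0; i < n; i += 16) {
    uint8x16_t v = vld1q_u8(src + i);
    vst1q_u8(dst + i, vaddq_u8(v, vdupq_n_u8(10)));
}

//after: the constant is initialized once, outside the loop
uint8x16_t c10 = vdupq_n_u8(10);
for (int i = 0; i < n; i += 16) {
    uint8x16_t v = vld1q_u8(src + i);
    vst1q_u8(dst + i, vaddq_u8(v, c10));
}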

Once again - just include the file below in your project instead of the "arm_neon.h" header and your code will be ported without any other changes required!

Upd. The NEONvsSSE.h file was updated on May 23, 2017 for better LLVM compiler compatibility and some minor bug fixes.

Since December 5, 2016 the project is available on GitHub: https://github.com/intel/ARM_NEON_2_x86_SSE, but the latest version will be placed here as well.

Attachment: NEON_2_SSE.h (740.22 KB)

29 comments

Zihan H.:

about function vrshrn_n_s32

_NEON2SSE_INLINE int16x4_t vrshrn_n_s32(int32x4_t a, __constrange(1,16) int b) // VRSHRN.I32 d0,q0,#16

{
    int16x4_t res64;
    __m128i r32;
    r32  = vrshrq_n_s32(a,b);
    r32  =  _mm_shuffle_epi8 (r32, *(__m128i*) mask8_16_even_odd); //narrow, use low 64 bits only. Impossible to use _mm_packs because of negative saturation problems
    return64(r32);
}

I think mask8_32_even_odd is correct, not mask8_16_even_odd.

 

Ilya K.:

> And if possible could you please tell me in very generic terms - what kind of project do you use this header at?
Image processing for smartphone cameras. Much of the code is already ported to SSE4 intrinsics, so I just tested this solution on real code, not on simple artificial tests.

victoria-zhislina (Intel):

Hi, Ilya K. I confirm this bug (or a gcc compatibility issue, because it works fine with the Intel compiler I use) and even more bugs of this kind that I found. I'm preparing a new version with all of them fixed. Meanwhile, if more bugs are found, please feel free to contact me directly by mail: victoria.zhislina at intel.com. And if possible, could you please tell me in very generic terms what kind of project you use this header in?

Ilya K.:

Found more "lvalue required as unary ‘&’ operand" errors:

    {
        uint8x16_t q4 = vdupq_n_u8(0);
        uint16x8_t q0 = vmovl_u8(vget_low_u8(q4));
    }

    {
        int16x4_t c1 = vcreate_s16(0x0003000200010000L);
    }

victoria-zhislina (Intel):

Ilya K, megathanks for your bug report - the bugs mentioned will be fixed ASAP (it is strange they weren't found earlier).

And let me say that you are not right about SSE4.2 - I do use the function from this set: _mm_cmpgt_epi64.

And yes, the reciprocal difference is currently a feature - to be changed if any serious complaints appear.

Ilya K.:

Found two errors (tested with GCC 4.7 and 4.9).

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#include "NEONvsSSE.h"

int main(int argc, char **argv) {
    uint8_t buf[32] = { 0 };

    // bug 1 (";" in macro LOAD_SI128)
    {
        uint8x16_t v0 = vaddq_u8(vld1q_u8(buf), vld1q_u8(buf+16));
    }

    // bug 2 (vcombine_XX as macro, not inline function)
    {
        uint8x8_t d0 = vdup_n_u8(0);
        vcombine_u8(vadd_u8(d0, d0), vadd_u8(d0, d0));
    }

    return 0;
}

GCC output:

main.c: In function ‘main’:
main.c:14:28: error: expected ‘)’ before ‘;’ token
main.c:14:28: error: too few arguments to function ‘_mm_add_epi8’
main.c:20:3: error: lvalue required as unary ‘&’ operand
main.c:20:3: error: lvalue required as unary ‘&’ operand

Clang 3.8 also gives errors on this code.

Fixes:

(1)
#define LOAD_SI128(ptr) \
        ( ( ((unsigned long)(ptr) & 15) == 0 ) ? _mm_load_si128((__m128i*)(ptr)) : _mm_loadu_si128((__m128i*)(ptr)) )

(2)
#define M1 { return _mm_unpacklo_epi64 (_pM128i(low), _pM128i(high)); }
static inline int8x16_t vcombine_s8(int8x8_t low, int8x8_t high) M1
static inline int16x8_t vcombine_s16(int16x4_t low, int16x4_t high) M1
static inline int32x4_t vcombine_s32(int32x2_t low, int32x2_t high) M1
static inline int64x2_t vcombine_s64(int64x1_t low, int64x1_t high) M1
#undef M1

 

Also two flaws:

(1) SSE4.1 is enough here, SSE4.2 intrinsics not used.

#if defined(__SSE4_2__)
    #define USE_SSE4
#endif
...
#ifdef USE_SSE4
#include <smmintrin.h> //SSE4.1
#include <nmmintrin.h> //SSE4.2
#endif

(2) vrecpeq_f32 does not match the output from an ARM processor; it gives more precision. So results from emulation and from a real run on ARM do not exactly match.

Junaid S.:

@ victoria,

The simple fix to this bug I found is to comment out line 173 of your code. clang/llvm is strict in implementing some C standards which GCC does not care about, so I think the error was caused by that strictness. Plus, I think replacing __ with _ can also do the trick.

victoria-zhislina (Intel):

Junaid S, thanks for your comment. The library hasn't been tested with clang/llvm, so if you could test it and do some fixes there, it would be great.

Junaid S.:

Hi victoria,

Have your tried this library with clang/llvm?

I am trying, but I think clang is a strict compiler and it gives an error on line 173,

error: cannot combine with previous 'float' declaration specifier

typedef    float __fp16;


