From ARM NEON to Intel SSE- the automatic porting solution, tips and tricks

I love ARM. Yes, I do. - Why do I work for Intel then? Because I love Intel even more. That's why I'd like to help independent software vendors to port their products from ARM to Intel Architecture. But if and only if they would like to do it.

Why would they like it? The answer is simple. Currently, Intel CPUs are in smartphones and tablets while Android, Windows 8, and some other operating systems support both ARM and x86, increasing the  developers options enormously.  In most cases porting to x86  is very easy – the work varies from zero  (for managed code and generic native code running via Intel Houdini binary translator) to simple code rebuild with the corresponding compiler. But for some applications it is not true.

Modern ARM CPU widely used in mobile devices ( iPhone, iPad, Microsoft Surface, Samsung devices and millions of others) have the 64-128bit SIMD instruction set (aka NEON or "MPE" Media Processing Engine)  defined  first as a part of the ARM® Architecture, Version 7 (ARMv7).   NEON  is used by  numerous developers for performance critical tasks via assembler or  NEON intrinsics set supported by modern  compilers like gcc, rvct and Microsoft. NEON could be found in such famous open source projects as FFMPEG, VP8, OpenCV, etc.   For such projects  achieving  the maximal performance on x86 causes the need to port ARM NEON instructions  or intrinsics to Intel SIMD (SSE).  Namely to Intel  SSSE3 for the first generation of Intel Atom CPU based devices and to Intel SSE4 for the second and later generations available since 2013.

However x86 SIMD and NEON instructions sets and therefore intrinsic functions are different, there is no one-to-one correspondence between them, so the porting task is not trivial. Or,  to be more precise  it was not trivial before this post publication.

Automatic porting solution for intrinsic functions based ARM NEON source code port to Intel x86 SIMD.

Attached is the automatic solution for intrinsic functions based ARM NEON  source code  port to  Intel x86 SIMD (SSE up to 4.2).  Why intrinsics?  - X86 SIMD intrinsic functions are supported by all widespread C/C++ compilers – Intel Compiler, Microsoft Compiler, gcc etc.  These intrinsics are very mature and their performance is equal tor even greater than pure assembler  performance, while the usability is much better.

Why Intel SSE only  but not MMX? - While Intel MMX (64 bit data processing instructions)  instructions set usage is possible for 64 bit NEON instructions substitution, it is not recommended: MMX performance is  commonly the same or lower than for the SSE instructions but the specific MMX problem of floating point registers sharing with the serial code could cause a lot of problems in SW if not properly treated. Moreover, MMX is NOT supported on 64-bit systems that are coming to mobile devices.

By default SSE up to SSSE3 is used for porting but for gcc if SSE4 flag was ompilation with if uncomment the correspoding "#define USE_SSE4" line then the SSE up to SSE4 are used for porting.

Though the solution is targeted for intrinsics, it could be used for pure assembler porting assistance. Namely, for each NEON function the corresponding NEON asm instruction is provided, while the corresponding  x86 intrinsics  code could be used directly and the  asm code could be copied from the complier intermediate output.

The solution is shaped as C/C++ language header  to be included in the sources ported instead of the standard "arm_neon.h" and provide the fully automatic porting.

Solution covers 100% NEON functions (~1700 ones)  except for 16-bit floats processing and:

  1. Redefines ARM NEON 64 and 128 bit vectors  as the corresponding x86 SIMD data.
  2. Redefines  some  functions from ARM NEON to Intel SSE if 1:1 correspondence exists (~50% of  128 bit functions)   
  3.  Implements some ARM NEON functions  using Intel SIMD if the performance effective implementation is possible (~45% of functions)
  4. Implements the remaining NEON functions (for <5% of functions not widely used in applications) using the  serial solution and issuing the corresponding "low performance" compiler warning  

 

//*******  definition sample *****************

int8x16_t vaddq_s8(int8x16_t a, int8x16_t b); // VADD.I8 q0,q0,q0

#define vaddq_s8 _mm_add_epi8

//***** porting sample***********

uint8x16_t vshrq_n_u8(uint8x16_t a, __constrange(1,8) int b) // VSHR.U8 q0,q0,#8

{//no 8 bit shift available, need the special trick

      __declspec(align(16))unsigned short mask10_16[9] ={0xffff, 0xff7f, 0xff3f, 0xff1f, 0xff0f,  0xff07, 0xff03, 0xff01, 0xff00};

      __m128i mask0 = _mm_set1_epi16(mask10_16[b]);  //to mask the bits to be "spoiled"  by 16 bit shift

        __m128i r = _mm_srli_epi16 ( a, b);

      return _mm_and_si128 (r,  mask0);

}

//***** “serial solution” sample *****

uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b); // VQADD.U64 d0,d0,d0

_NEON2SSE_INLINE _NEON2SSE_PERFORMANCE_WARNING(uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b), _NEON2SSE_REASON_SLOW_SERIAL){

_NEON2SSE_ALIGN_16 uint64_t a64, b64;

uint64x1_t res;

a64 = a.m64_u64[0];

b64 = b.m64_u64[0];

res.m64_u64[0] = a64 + b64;

if (res.m64_u64[0] < a64) {

res.m64_u64[0] = ~(uint64_t)0;

}

return res;}

The solution for the functions implemented passes extensive correctness tests for the ARM Neon intrinsic functions.

Main ARM NEON - x86 SIMD porting challenges:

  • 64-bits processing functions. As  128 bit SSE registers only   are used for x86 vector operations. It means that for each 64-bit processing function we need to load data to SSE (xmm registers) somehow and then store it back. It impacts not code quality only but the performance as well. Various load-store techniques are preferred for different compilers, for some functions the serial processing is faster.

  • Some x86 intrinsic functions require  immediate parameters rather than constants resulting in compile time “catastrophic error” when called from a wrapper function.  Fortunately  it  happens not for all compilers and in non-optimized (Debug) build only . The solution is to replace such functions with a corresponding switch for immediate parameters using branches (cases)  in debug mode.

  • Not all arithmetic operations are available for 8-bit data in x86 SIMD. Also there is no shift for such data. The common solution used is to convert 8-bit data to 16-bit, process them and then pack to 8 bit again. However in some cases it is possible to use tricks like the one shown in vector right shift sample above  (vshrq_n_u8 function) to avoid such  conversions.

  • For some functions where x86 implementation contains  more than 1 instruction, the intermediate overflow  is possible. The solution is to use the overflow safe algorithm implementation even if it is slower. Say, if we need to calculate the average  of a and b  i.e.  (a+b)/2, the calculation should be done as (a/2 + b/2).

  • For some NEON functions there exist corresponding x86 SIMD functions, however their behavior differs when the functions parameters are “out of range”. Such cases need the special processing like the following Table lookup sample. While in NEON  specification indices out of range return 0, for Intel SIMD we need to set the most significant bit to 1 for zero return:

    
    		uint8x8_t vtbl1_u8(uint8x8_t a, uint8x8_t b)
    
    		{
    
    		 uint8x8_t res64;
    
    		__m128i c7, maskgt, bmask, b128;
    
    		c7 = _mm_set1_epi8 (7);
    
    		b128 = _pM128i(b);
    
    		maskgt = _mm_cmpgt_epi8(b128,c7);
    
    		bmask = _mm_or_si128(b128,maskgt);
    
    		bmask = _mm_shuffle_epi8(_pM128i(a),bmask);
    
    		return64(bmask);}
    
    		

  • For some NEON functions there exist corresponding x86 SIMD functions, however their rounding rules are different, so we need to compensate it by adding or subtracting 1 from the final result.

  • For some functions x86 SIMD implementation is not possible or not effective.  Such function samples are: shift of vector by another vector,  some arithmetic operations for 64 and 32 bit data. The only solution here is the serial code implementation.

     Performance

    First it is necessary to notice that the exact porting solution selection for each function is based on  common sense and x86 SIMD  latency and throughput data for latest Intel Atom CPU. However for some CPUs and conditions better solution might be possible.

    Solution performance was tested on several projects demonstrating very similar results that lead to the first and very important conclusion:

  •  For most cases of x86 porting expect the perfomance increase ratio similar to the ARM NEON for vectorized /serial code ration if  128 bit processing  NEON functions  used.

    Unfortunately the situation is different for 64 bit processing NEON functions (even for those taking 64 input and returning 128bits or vice-versa). For them the speedup is significantly lower.

    So the second very important conclustion is:

  • Avoid 64-bit processing NEON functions try to use 128-bit versions even if your data are 64-bit. If you use 64-bit NEON functions - expect the corresponding performance penalty.

     Other porting considerations and best known methods are:

  • Use 16-bit data alignment for faster load and store

  •  Avoid NEON functions working with constants. It gives not gain but performance penalty for constants load\propagation instead.  If constants usage is necessary try to move constants initialization out of hotspots loops  and if applicable replace it with logical and compare operations.

  • Try to avoid functions marked as "serialy implemented" because they need to store data from vector registers to memory, process them serialy and load them again. Probably you could change the data type or algorithm used to make the whole port vectorized not a serial one.

    Once again - just include the file below in your project instead of "arm_neon.h" header and your code will be ported without any other changes necessary!

    Upd. NEONvsSSE.h file has been updated  on June 16, 2015  for  a minor bugfix and better compilers compartibility.

Para obter informações mais completas sobre otimizações do compilador, consulte nosso aviso de otimização.