From ARM NEON to Intel MMX&SSE- the automatic porting solution, tips and tricks

I love ARM. Yes, I do. - Why do I work for Intel then? Because I love Intel even more. That's why I'd like to help independent software vendors to port their products from ARM to Intel Architecture. But if and only if they would like to do it.

Why would they like it? The answer is simple. Currently, Intel CPUs are in smartphones and tablets while Android, Windows 8, Meego and some other operating systems support both ARM and x86, increasing the  developers options enormously.  In most cases porting to x86  is very easy – the work varies from zero  (for managed code and generic native code running via Intel Houdini binary translator) to simple code rebuild with the corresponding compiler. But for some applications it is not true.

Modern ARM CPU widely used in mobile devices ( iPhone, iPad, Microsoft Surface, Samsung devices and millions of others) have the 64-128bit SIMD instruction set (aka NEON or "MPE" Media Processing Engine)  defined as part of the ARM® Architecture, Version 7 (ARMv7).   NEON  is used by  numerous developers for performance critical tasks via assembler or NEON intrinsics set supported by modern  compilers like gcc, rvct and Microsoft. NEON could be found in such famous open source projects as FFMPEG, VP8, OpenCV, etc.   For such projects  achieving  the maximal performance on x86 causes the need to port ARM NEON instructions  or intrinsics to Intel SIMD (MMX and SSE).  

However x86 SIMD and NEON instructions sets and therefore intrinsic functions are different, there is no one-to-one correspondence between them, so the porting task is not trivial. Or,  to be more precise  it was not trivial before this post publication.

Automatic porting solution for intrinsic functions based ARM NEON source code port to Intel x86 SIMD.

Attached is the automatic solution for intrinsic functions based ARM NEON  source code  port to  Intel x86 SIMD (MMX and SSE up to 4.1).  X86 SIMD intrinsic functions are supported by all widespread C/C++ compilers – Intel Compiler, Microsoft Compiler, gcc etc.  These intrinsics are very mature and their performance is equal to pure assembler  performance, while the usability is much better.

Though the solution is targeted for intrinsics, it could be used for pure assembler porting assistance. Namely, for each NEON function the corresponding NEON asm instruction is provided, while the corresponding  x86 asm code could be copied from the complier intermediate output.

The solution is shaped as C/C++ language header  to be included in the sources ported instead of the standard "arm_neon.h" and provide the fully automatic porting.

Solution covers 100% NEON functions (~1700 ones) and:

  1. Redefines ARM NEON 64 and 128 bit vectors  as the corresponding x86 SIMD data.
  2. Redefines  some  functions from ARM NEON to Intel MMX and\or SSE if 1:1 correspondence exists (~50% of functions).           For 64-bit data processing functions MMX or SSE usage is possible
  3.  Implements some ARM NEON functions  using Intel SIMD if the performance effective implementation is possible (~45% of functions)
  4. States that no effective Intel SIMD solution is available, the serial solution is required (for <5% of functions not widely used in applications)

 
//*******  definition sample ***************** 
int8x8_t    vadd_s8(int8x8_t a, int8x8_t b);         // VADD.I8 d0,d0,d0 
#ifdef USE_MMX
              #define vadd_s8 _mm_add_pi8   //MMX
#else
              #define vadd_s8 _mm_add_epi8
#endif 
 
//***** porting sample***********
uint8x16_t vshrq_n_u8(uint8x16_t a, __constrange(1,8) int b) // VSHR.U8 q0,q0,#8
{//no 8 bit shift available, need the special trick
      __declspec(align(16))unsigned short mask10_16[9] ={0xffff, 0xff7f, 0xff3f, 0xff1f, 0xff0f,  0xff07, 0xff03, 0xff01, 0xff00};
      __m128i mask0 = _mm_set1_epi16(mask10_16[b]);  //to mask the bits to be "spoiled"  by 16 bit shift
        __m128i r = _mm_srli_epi16 ( a, b);
      return _mm_and_si128 (r,  mask0);
}
 
//***** “not available” sample *****
int64x2_t vqaddq_s64(int64x2_t a, int64x2_t b); // VQADD.S64 q0,q0,q0 
 // no optimal SIMD solution available, use a serial solution instead

The solution for the functions implemented passes extensive correctness tests for the ARM Neon intrinsic functions.

Main ARM NEON - x86 SIMD porting challenges:

  • Some x86 intrinsic functions require  immediate parameters rather than constants resulting in compile time “catastrophic error” when called from a wrapper function.  Fortunately  it  happens in non-optimized (Debug) build only . The solution is to replace such functions with a corresponding switch for immediate parameters using branches (cases)  in debug mode. Great thanks to my colleague Thomas Wilhalm for his input.
  • Not all arithmetic operations are available for 8-bit data in x86 SIMD. Also there is no shift for such data. The common solution used is to convert 8-bit data to 16-bit, process them and then pack to 8 bit again. However in some cases it is possible to use tricks like the one shown in vector right shift sample above  (vshrq_n_u8 function) to avoid such  conversions.
  • For some functions where x86 implementation contains  more than 1 instruction, the intermediate overflow  is possible. The solution is to use the overflow safe algorithm implementation even if it is slower. Say, if we need to calculate the average  of a and b  i.e.  (a+b)/2, the calculation should be done as (a/2 + b/2).
  • For some NEON functions there exist corresponding x86 SIMD functions, however their behavior differs when the functions parameters are “out of range”. Such cases need the special processing like the following Table lookup sample. While in NEON  specification indexes out of range return 0, for Intel SIMD we need to set the most significant bit to 1 for zero return:

 
uint8x8_t vtbl1_u8(uint8x8_t a, uint8x8_t b)
{
      __m64 c7 = _mm_set1_pi8 (7);
      __m64 maskgt = _mm_cmpgt_pi8(b,c7);
      __m64 bmask = _mm_or_si64(b,maskgt);
       return _mm_shuffle_pi8(a,bmask);
}

  • For some NEON functions there exist corresponding x86 SIMD functions, however their rounding rules are different, so we need to compensate it by adding or subtracting 1 from the final result.
  • For some functions x86 SIMD implementation is not possible or not effective.  Such function samples are: shift of vector by another vector,  some arithmetic operations for 64 and 32 bit data. The only solution here is the serial code implementation.

 Performance

First it is necessary to notice that the exact porting solution selection for each function is based on  common sense and x86 SIMD  latency and throughput data for latest Intel CPU. However for some CPUs and conditions better solution might be possible.

Solution performance was tested on several projects demonstrating very similar results. For example let’s take “Hello NEON” Android NDK standard sample. It implements FIR filter along with the integrated performance benchmark for NEON vs serial execution performance comparison.

The test was compiled as is for ARM (gcc 4.7) and with the header proposed for x86 (Intel compiler) and executed on the corresponding CPUs. Here our interest is not the absolute performance numbers  but  the  SIMD performance improvement ratio only.  That’s why we don’t need to use the same or even close environment for ARM and x86 testing.  And the testing provides the following results:

 ARM Cortex A9 (Samsung Galaxy Note)  - the 4.2 times  speedup for NEON version vs serial C version.

Intel i5 (Intel Ultrabook) – the 4.0- 4.5 times  speedup for NEON version vs serial C version (depending on the particular CPU model)  if SSE was used and 2.6-3.0 times speedup if MMX was used.

 That leads to the first porting tip:

  •  For most cases (especially if  you have to mix  64-bit and 128-bit wide data) use SSE functions only, not MMX (undef  USE_MMX in the solution proposed)

Other porting considerations and best known methods are:

  •  Avoid NEON functions working with constants. It gives not gain but performance penalty for constants load\propagation instead.  If constants usage is necessary try to move constants initialization out of hotspots loops  and if applicable replace it with logical and compare operations.
  • Use 16-bit data alignment for faster load and store

PS. NEONvsSSE.h file was updated on 27.02.2013 

附件尺寸
下载 neonvssse.h998.17 KB
如需更全面地了解编译器优化,请参阅优化注意事项.