# Intel® AVX Realization Of IIR Filter For Complex Float Data

By Igor Astakhov (Intel), published on August 2, 2012

This is a computer translation of the original content. It is provided for general information only and should not be relied upon as complete or accurate.


### Introduction

This paper describes a complex Infinite Impulse Response (IIR) filter implementation based on the Intel® AVX Single Instruction Multiple Data (SIMD) instruction set. The main difficulty of such an implementation is the feedback dependency inherent in an IIR filter: each output point depends on the previous output points. This paper describes a general approach to IIR vectorization and its specific realization with the Intel AVX instruction set architecture (ISA). Performance aspects of the implementation are discussed at the end of the paper.

### Intel® AVX Realization of IIR Filter for Complex Float Data

Let us take a look at the well-known IIR filter formula:

**Equation 1**

$$y_n = \sum_{i=0}^{order} b_i\,x_{n-i} \;-\; \sum_{i=1}^{order} a_i\,y_{n-i}$$

**Figure 1. Structure of an Arbitrary Order IIR Filter**

Where “a” and “b” are the IIR coefficients, “x” is the input data vector, “y” is the output vector, and “order” is the filter order. Each output value strongly depends on a set of previous output values. All further transformations aim to remove this feedback, i.e., the dependency of each output point on the previous output points. Taking into account the float data type and the Intel AVX register width, we are interested in vectorizing the computation loop with a factor of 8 (or 4 when complex data is processed). The approach is general, so it can be applied to any other data type, for example, real or double data, with the corresponding vectorization factor. All further exposition deals with the complex float data type, so all additions, subtractions, and multiplications below are complex arithmetic operations. Our purpose, therefore, is to recalculate the “a” coefficients so that four “y” values can be computed at once (removing the feedback latency). Equation 1 may be represented as:

**Equation 2**

$$y_n = F_x(n) - \sum_{i=1}^{order} a_i\,y_{n-i}, \qquad F_x(n) = \sum_{i=0}^{order} b_i\,x_{n-i}$$

Where Fx(n) is a function that depends only on “x”; it is a simple Finite Impulse Response (FIR) filter. We perform two-step successive filtering: first by “x” (FIR), then by “y”. The FIR part for any 4 points can be represented with the corresponding formulas (assuming here for simplicity that the “b” taps are b0 … bk):

**Equation 3**

$$\begin{aligned}
F_x(n+0) &= b_0 x_{n+0} + b_1 x_{n-1} + \dots + b_k x_{n-k}\\
F_x(n+1) &= b_0 x_{n+1} + b_1 x_{n+0} + \dots + b_k x_{n-k+1}\\
F_x(n+2) &= b_0 x_{n+2} + b_1 x_{n+1} + \dots + b_k x_{n-k+2}\\
F_x(n+3) &= b_0 x_{n+3} + b_1 x_{n+2} + \dots + b_k x_{n-k+3}
\end{aligned}$$

The FIR part of the IIR filter has a relatively simple Intel AVX realization, unrolled by 8 floats (4 complex numbers) along the vector length. Eight copies of each “b.re” and “b.im” tap, aligned on a 32-byte boundary, are stored in the IIR state structure; this layout is common for Intel AVX algorithms (see Initialization Function):

**Figure 2. FIR–Part Coefficients Configuration**

```
 b0.re, b0.re, b0.re, b0.re,  b0.re, b0.re, b0.re, b0.re
-b0.im, b0.im,-b0.im, b0.im, -b0.im, b0.im,-b0.im, b0.im
 b1.re, b1.re, b1.re, b1.re,  b1.re, b1.re, b1.re, b1.re
-b1.im, b1.im,-b1.im, b1.im, -b1.im, b1.im,-b1.im, b1.im
...
 bk.re, bk.re, bk.re, bk.re,  bk.re, bk.re, bk.re, bk.re
-bk.im, bk.im,-bk.im, bk.im, -bk.im, bk.im,-bk.im, bk.im
```

Below is a block scheme of the FIR part of the IIR algorithm implemented with the new 256-bit registers (the full assembler code is presented in Assembler Function for B Coefficients):

**Figure 3. FIR–Part of Algorithm**

```
Load    YMM = {x3.im, x3.re, x2.im, x2.re, x1.im, x1.re, x0.im, x0.re}
vmulps        {bk.re, bk.re, bk.re, bk.re, bk.re, bk.re, bk.re, bk.re}      +
vshufps YMM = {x3.re, x3.im, x2.re, x2.im, x1.re, x1.im, x0.re, x0.im}
vmulps        {-bk.im, bk.im, -bk.im, bk.im, -bk.im, bk.im, -bk.im, bk.im}  +
...
Load    YMM = {x3+k.im, x3+k.re, x2+k.im, x2+k.re, x1+k.im, x1+k.re, xk.im, xk.re}
vmulps        {b0.re, b0.re, b0.re, b0.re, b0.re, b0.re, b0.re, b0.re}      +
vshufps YMM = {x3+k.re, x3+k.im, x2+k.re, x2+k.im, x1+k.re, x1+k.im, xk.re, xk.im}
vmulps        {-b0.im, b0.im, -b0.im, b0.im, -b0.im, b0.im, -b0.im, b0.im}
  = Fx(n+0).re,im, Fx(n+1).re,im, Fx(n+2).re,im, Fx(n+3).re,im
// temporarily store to the destination vector
```
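The shuffle-and-multiply pattern of Figure 3 can be checked with a scalar model. The sketch below (illustrative code with assumed names, not the assembler function itself) emulates one tap over four interleaved complex inputs: the alternating sign of the duplicated “b.im” row from Figure 2, combined with the re/im swap that vshufps performs, yields exactly a complex multiply-accumulate in every lane.

```c
/* One tap of the Figure 3 FIR pattern, emulated in scalar C.
   Lanes hold 4 complex numbers interleaved as re,im,re,im,...
   coef_re duplicates b.re in all 8 lanes; coef_im alternates -b.im
   (even lanes) and +b.im (odd lanes), as in Figure 2. */
static void fir_tap_lanes(const float x[8], float br, float bi, float acc[8])
{
    float coef_re[8], coef_im[8], xswap[8];
    for (int j = 0; j < 8; j++) {
        coef_re[j] = br;
        coef_im[j] = (j % 2 == 0) ? -bi : bi;
    }
    /* vshufps analogue: swap re/im within each complex pair */
    for (int j = 0; j < 8; j += 2) {
        xswap[j]     = x[j + 1];
        xswap[j + 1] = x[j];
    }
    /* two vmulps + add: even lane = xr*br - xi*bi, odd lane = xi*br + xr*bi */
    for (int j = 0; j < 8; j++)
        acc[j] += x[j] * coef_re[j] + xswap[j] * coef_im[j];
}
```

Each even/odd lane pair thus receives the real and imaginary parts of one complex product x·b, which is what lets a single vmulps pair process four complex samples at once.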

After the first step, all Fx(n) values are saved to the destination vector, and we have:

**Equation 4**

$$\begin{aligned}
y_{n+0} &= F_x(n+0) - a_1 y_{n-1} - a_2 y_{n-2} - \dots - a_k y_{n-k}\\
y_{n+1} &= F_x(n+1) - a_1 y_{n+0} - a_2 y_{n-1} - \dots - a_k y_{n-k+1}\\
y_{n+2} &= F_x(n+2) - a_1 y_{n+1} - a_2 y_{n+0} - a_3 y_{n-1} - \dots\\
y_{n+3} &= F_x(n+3) - a_1 y_{n+2} - a_2 y_{n+1} - a_3 y_{n+0} - a_4 y_{n-1} - \dots
\end{aligned}$$

To remove the feedback latency, let us substitute yn from the first line into the next three lines, then yn+1 from the second line into the next two lines, and yn+2 from the third line into line 4 of the equation above. For example, for point yn+2:

**Equation 5**

$$y_{n+2} = F_x(n+2) - a_1 F_x(n+1) + (a_1^2 - a_2)\,F_x(n+0) + (2a_1 a_2 - a_1^3 - a_3)\,y_{n-1} + \dots$$

On the left of the first four lines we have the new output points yn, yn+1, yn+2, yn+3; on the right are functions of the “old” points (yn-1 … yn-k) and of Fx(n+0) … Fx(n+3), with some new coefficients that are calculated during the pState structure initialization (see Initialization Function):

**Equation 6**

$$a0_i = -a_i, \qquad i = 1 \dots k$$

**Equation 7**

$$a1_i = a0_1\,a0_i + a0_{i+1}, \qquad a0_{k+1} \equiv 0$$

**Equation 8**

$$a2_i = a1_1\,a0_i + a1_{i+1}, \qquad a1_{k+1} \equiv 0$$

**Equation 9**

$$a3_i = a2_1\,a0_i + a2_{i+1}, \qquad a2_{k+1} \equiv 0$$

For the complex float data type and Intel AVX operations, the vectorization factor is 4. We now have 4 new sets of complex “a” coefficients (for “y”) for a bundle of size four:

- a0n for the first point of a bundle – Equation 6;
- a1n for the second point of a bundle – Equation 7;
- a2n for the third point of a bundle – Equation 8;
- a3n for the fourth point of a bundle – Equation 9.

We also have to do some extra calculations with Fx(n+0).re,im, Fx(n+1).re,im, Fx(n+2).re,im, Fx(n+3).re,im; the coefficients for each point of a bundle are as follows:

**Figure 4. The Extra Triangle of Coefficients for IIR-Part**

| | Fx(n+0) | Fx(n+1) | Fx(n+2) | Fx(n+3) |
|---|---|---|---|---|
| yn+0 | 1 | 0 | 0 | 0 |
| yn+1 | a01 | 1 | 0 | 0 |
| yn+2 | a11 | a01 | 1 | 0 |
| yn+3 | a21 | a11 | a01 | 1 |

As with the first (FIR) step, all new coefficients are calculated at the initialization stage (see Initialization Function) and stored in the order natural for Intel AVX, with the appropriate alignment (the part with all the “1” coefficients is not required and is omitted):

**Figure 5. IIR–Part Coefficients Configuration**

```
a01.re,im, a11.re,im, a21.re,im, a31.re,im
a02.re,im, a12.re,im, a22.re,im, a32.re,im
...
a0k.re,im, a1k.re,im, a2k.re,im, a3k.re,im
  0.re,im, a01.re,im, a11.re,im, a21.re,im
  0.re,im,   0.re,im, a01.re,im, a11.re,im
  0.re,im,   0.re,im,   0.re,im, a01.re,im
```

Therefore, for each 4 output points we need 10 extra complex additions and 6 extra complex multiplications; this is the algorithmic cost of removing the feedback dependency:

**Figure 6. IIR–Part of Algorithm**

```
vbroadcast YMM = {yn-1.re, yn-1.re, yn-1.re, yn-1.re, yn-1.re, yn-1.re, yn-1.re, yn-1.re}
vmulps           {a01.re, a01.im, a11.re, a11.im, a21.re, a21.im, a31.re, a31.im}  +
vbroadcast YMM = {yn-1.im, yn-1.im, yn-1.im, yn-1.im, yn-1.im, yn-1.im, yn-1.im, yn-1.im}
vmulps           {a01.im, a01.re, a11.im, a11.re, a21.im, a21.re, a31.im, a31.re}  +
...
vbroadcast YMM = {yn-k.re, yn-k.re, yn-k.re, yn-k.re, yn-k.re, yn-k.re, yn-k.re, yn-k.re}
vmulps           {a0k.re, a0k.im, a1k.re, a1k.im, a2k.re, a2k.im, a3k.re, a3k.im}  +
vbroadcast YMM = {yn-k.im, yn-k.im, yn-k.im, yn-k.im, yn-k.im, yn-k.im, yn-k.im, yn-k.im}
vmulps           {a0k.im, a0k.re, a1k.im, a1k.re, a2k.im, a2k.re, a3k.im, a3k.re}  +
vbroadcast YMM = {Fx(n+0), Fx(n+0), Fx(n+0), Fx(n+0)}re,im
vmulps           {0.re,im, a01.re,im, a11.re,im, a21.re,im}   + (extra)
vbroadcast YMM = {Fx(n+1), Fx(n+1), Fx(n+1), Fx(n+1)}re,im
vmulps           {0.re,im, 0.re,im, a01.re,im, a11.re,im}     + (extra)
vbroadcast YMM = {Fx(n+2), Fx(n+2), Fx(n+2), Fx(n+2)}re,im
vmulps           {0.re,im, 0.re,im, 0.re,im, a01.re,im}       + (extra)
vload  YMM = {Fx(n+0), Fx(n+1), Fx(n+2), Fx(n+3)}re,im
vstore YMM = {Yn+0, Yn+1, Yn+2, Yn+3}re,im to the destination vector
```

This approach removes the feedback within a whole bundle and improves IIR filter performance more than two times on all SIMD architectures.

Below is the full block scheme of the IIR algorithm described above, as used in the Intel® Integrated Performance Primitives (Intel® IPP) library.

**Figure 7. Full IIR Algorithm Block–Scheme**

```
IppStatus ippsIIRInitAlloc_32fc(IppsIIRState_32fc** ppState, const Ipp32fc* pTaps,
                                int order, const Ipp32fc* pDlyLine);
// Allocates memory and initializes an arbitrary IIR filter state.
// Order of taps: B0, B1, ..., Border, A0, A1, ..., Aorder.
// Taps are normalized: b0 = B0/A0, ..., border = Border/A0,
//                      a1 = A1/A0, ..., aorder = Aorder/A0.
// Taps are recalculated according to Equations 6-9 and stored by eights (Figure 2)
// and by fours (Figure 5) with the appropriate 32-byte alignment in the
// pState structure.

IppStatus ippsIIR_32fc(const Ipp32fc* pSrc, Ipp32fc* pDst, int len,
                       IppsIIRState_32fc* pState)
{
    // First stage: filtering "by x", as a simple FIR filter, with the bk
    // coefficients (Intel AVX code). The temporary output is written to the
    // pDst vector (Figure 2 and Assembler Function for B Coefficients).
    // Second stage: filtering "by y" in bundles of four (Intel AVX code) with
    // the recalculated ak coefficients (Figure 5 and Assembler Function for
    // A Coefficients).
}
```

### Conclusion

Because of the feedback dependency, the previous realizations of the IIR filter function for the IA-32 architecture in the Intel IPP library were point-wise, not vectorized: vector-function performance per output point was almost the same as for the one-point function. The vectorization was done “by filter order”, not by vector length, so a visible performance improvement over “C” code appeared only for filter orders of 32 and higher, and IIR filters of such large orders are rarely used in signal processing practice. For orders below 32, the usual Intel SSE implementation of an IIR filter provides no performance gain over the non-vectorized version, because instruction latencies accumulate in the common scheme with the feedback dependency. The presented approach (removing the feedback dependency) implemented with Intel AVX instructions gives an ~1.6x speedup for a filter of order 6 over the analogous Intel SSE code (note that this ratio was obtained on an architectural software simulator, and actual numbers may differ). The considered Intel AVX-optimized implementation of the filter will be included in a future version of the Intel IPP library.

### Appendix A: Function Code

**Initialization Function**

**Filtering Function**

**Assembler Function for B Coefficients**

**Assembler Function for A Coefficients**

### Reference

Intel® Integrated Performance Primitives for Intel® Architecture, Vol. 1: Signal Processing, “Filtering Functions,” pp. 6-103–6-130.

### About the Author

Igor Astakhov is a Department Software Engineering Manager with the Software Solutions Group (Visual Computing Software Division, CIP, and Intel IPP). He focuses on the development of the Intel IPP libraries for all existing and future Intel architectures. Igor has been at Intel for five years. His email is igor.astakhov@intel.com.
