Wiener Filtering Using Intel® Advanced Vector Extensions

1 Introduction
Intel® Advanced Vector Extensions (Intel® AVX) is a 256 bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE) and is designed for applications that are floating point intensive. These instructions provide a means to accelerate applications that rely heavily on floating-point operations, such as 3D geometry, video processing, image processing, and spatial (3D) audio. This application note discusses Wiener filtering, and includes an example of code that has been optimized using Intel® AVX. The source reference for the code and application note for this article has been AP-807 Wiener Filtering Using Streaming SIMD Extensions [4]. The original article included optimizations using streaming SIMD extensions and this whitepaper discusses migration of the code to 256 bit instruction set extension (i.e. Intel® AVX).


2 The Wiener Filter Algorithm
Wiener filtering (also known as Least Mean Square filtering) is a technique for removing unwanted noise from an image. The description of this algorithm is from The Pocket Handbook of Image Processing Algorithms in C, by Harley R Myler and Aruthur R. Weeks [1]. The algorithm has four (Fourier transformed) vector inputs, representing (one component of) the original image (Image), the degraded image (Guv), the noise image spectra (Noise) and the degradation function (Huv). Each input is a vector of row*col complex numbers. The complex numbers are represented as two contiguous floats for the real and imaginary parts of the number. An additional parameter, gamma, is included in the computation. When gamma is 1.0, the filter is known as non-parametric. The parameters to the filter can be adjusted until the filtered image is satisfactory.


2.1 Applications for Wiener Filters
Wiener filters are commonly used in image processing applications to remove noise from reconstructed images. Wiener filtering is often used to restore a blurry image. However, the Wiener filter has proved important in adaptive filtering, has been used for wavelet transforms, and has found application in communications and other DSP-related disciplines. The reader also should be aware that Fourier transformation is a key element in any signal processing discipline. Please refer to the Intel application note, Split-Radix FFT (AP-808) for further information on implementing a Fourier transform.


2.2 Implementing the Wiener Filter
As described in Section 2.1, the input to the function is four arrays of complex numbers. For each
Element of the image, the following operations are carried out using complex arithmetic. The complex variables D and Hs are intermediate variables used during the computation. The function
Complex_conj is used to take the complex conjugate of a complex number. When divides occur, a check (using an if statement) must be done to ensure that the denominator is non-zero. In the event that a denominator is zero, the result should be set to zero.
1. Complex Noise = gamma * (Noise * Complex_conj ( Noise ) )
2. Complex D = Image * Complex_conj ( Image )
3. Complex D = Noise / D
4. Complex Hs = Huv * Complex_conj ( Huv )
5. Complex Num = Complex_conj ( Huv ) * Guv
6. Complex Image = Num / (Hs + D)


3 Vectorizing the code with 128-bit SIMD
The reference source code for the article (both the C version and 128-bit SIMD version) is available in Intel application note AP-807. The following relates to the original port of the scalar code to 128-bit SIMD. First, the code can be optimized simply by observing that many of the operations involve multiplying a number by its complex conjugate. Since the result has no imaginary component, many of the operations specified in Section 2.2 can be simplified. The resultant C code is given in Section 5. Before the code can be optimized for Intel® SSE, it must be converted to a form suitable for SIMD execution. Four iterations of the original C code are gathered together and processed in a single iteration of the new loop. Each pass through the new loop does the work of four of the original iterations. The list of operations in Section 2.2 shows that three divides must be done per iteration: one divide in Step 3, (the imaginary component requires no divide since the imaginary part is zero), and two divides in Step 6. One of the latter two divides can be removed by noting that they both have the same denominator. The if statement required to check for a zero denominator can be removed by using a masking technique. Further improvements can be obtained by replacing all divides with reciprocal approximations. These techniques are described below. After the code has been converted to a SIMD format, the checks for zero denominators (if statements) can be removed by creating a mask for the non-zero denominator elements, and ANDing that mask to the result of the division to zero out the elements where a division by zero occurred. This technique assumes that QNANs are being generated rather than SNANs, by masking SIMD floating-point exceptions in the MXCSR register, so that no floating point divide by zero exceptions will occur. For example, assume that you want to compute the quantity ( N / D ), where N and D are floats. A typical code sequence is given below.

If ( D != 0 )
Result = N / D;
Else
Result = 0.0;

The result can be computed as (and ( div ( N, D ), cmp_neq ( D, 0 ) ) ), a
computation that does not require an if statement. Using intrinsics, this would be expressed as

mm_and_ps ( mm_div_ps ( N, D ), mm_cmpneq_ps ( D, zero ) ).

This technique is specific to vectorized code, and is implemented in the attached Intrinsics
and assembly language versions of the code.

The Newton-Raphson method is a classic technique for approximating functions. The initial “guess” at the reciprocal is computed using the rcpps instruction. Subsequently, the “guess” is improved using the Newton-Raphson method. The result is not as accurate as what is provided by the divide instruction; however, it can be obtained significantly faster. (Programmers must determine if their application allows a reduced precision answer.) Full details and discussion of this technique are available in the Newton-Raphson application note [2]. The specific code sequence employed in this Wiener filter is given below. (The denominator should be examined to be certain a division by zero is not occurring.)

RC = _mm_rcpps( D );
RECIP = _mm_sub( _mm_add( RC, RC ), _mm_mul( RC, _mm_mul( RC, D ) ) );


4 Vectorizing the code with 256-bit SIMD
The 128-bit vectorized code can be easily ported to 256-bit SIMD by simply migrating the changes. The 256-bit code will do 8 iterations in one loop (the 128-bit Intel® SSE will do 4 iterations in one loop). The load/store instructions as well rcp/mul/add/sub instructions will now operate on 8 data points. The branchless technique described above is retained in the Intel® AVX code as well. As can be seen from the code, using the corresponding Intel® AVX intrinsics (for example _mm256_mul_ps for 256-bit vs _mm_mul_ps for 128-bit) completes the port. Section 6.3 provides the complete 256-bit Intel® AVX code.


5 Grouping input/output arrays
Additionally, the 256-bit SIMD code performance can be improved by grouping input and output arrays in sequential order. Having sequential memory accesses through cache/memory decreases potential cache way conflicts. This provides a simpler accessing pattern for the CPU hardware prefetcher which leads to more accurate data.


6 Conclusion

The performance results (interms of CPU clock cycles) of the 128-bit SIMD code vs 256-bit SIMD code over large number of iterations are as below.

 

Intel® AVX (256-bit)

Intel® SSE (128-bit)

Intel® AVX vs. Intel® SSE

Wiener filter

45871

66933

1.46x

Wiener filter with grouped arrays

42464

64473

1.51x


The overall improvement of Intel® AVX over Intel® SSE with Wiener filtering is 1.46x. Intel® AVX can provide a significant performance improvement for the Wiener filter algorithm, compared to coding with 128-bit SIMD. By grouping the input/output arrays, the overall improvement of Intel® AVX over Intel® SSE with Wiener filtering is 1.51x. The improvements cited in this document are the result of several techniques. The techniques include using Intel® AVX (vectorizing the code), and removing conditional branch instructions (if statements) by using masking operations provided by both Intel® SSE and Intel® AVX. If it is acceptable to reduce the numerical precision of the result, then a further gain can be realized by replacing divide operations with reciprocal approximations (employing a Newton-Raphson technique). Further, the article also highlights the ease with which an existing floating point code can be ported to Intel® AVX.


7 Coding Example

/*
* Wiener Filter (also known as the Least Mean Square filter)
*
* Reference: The Pocket Handbook of Image Processing Algorithms in C
* by Harley R Myler & Arthur R. Weeks
* 1993 Prentice-Hall, ISBN 0-13-642240-3 p260-3.
*
* The data is several arrays of complex floats in row major order.
* The description for the algorithm from p260 states:
*
* The algorithm computes a parametric Wiener filter on the
* Fourier transform of a degraded image, Guv, with noise
* spectra N, degradation function Huv, and original image Img.
* The computation is in place, so that the filtered version of
* the input is returned in the original image variable. The
* original and noise images are either estimations form some
* predictive function or ad hoc approximations. If the noise
* image is zero, the process reduces to the inverse filter.
*
* The Weiner parameter gamma is passed to the algorithm.
* If this parameter is 1.0, the filter is non-parametric.
* Methods exist in the literature to derive the parameter value;
* however, it is sometimes determined from trial and error.
*
*NOTE!!!! The code on page 263 has an error. In cxml, the complex
* multiply routine, the imaginary part of the computation should be
* a*d + b*c, not a*d - b*c.
*
*NOTE! (another error) The *complex* array length is rows*cols, so the
* *float* array length should be 2*rows*cols. Also, note that the
* algorithm operates on one component of the pixel.
*/
void wiener_filter ( float *Img,
float *Huv,
float *No,
float *Guv,
float gamma,
int rows,
int cols)
{
int i, sz;
float numr, numi, dr, hsr;
sz = 2 * rows * cols;
for (i = 0; i < sz; i += 2)
{
/* Compute (in place) the noise spectral density with Wiener gamma*/
No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
No[i+1] = (float) 0.0;
/* Compute image spectral density */
dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
/* Compute denominator spectral density term */
if (dr != 0.0)
dr = (float) (No[i] / dr) ;
/* Compute degradation power spectrum */
hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
/* Compute numerator term */
numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
/* Final computation */
if ( (hsr + dr) != 0.0 )
{
Img[i] = (float) (numr / (hsr + dr));
Img[i+1] = (float) (numi / (hsr + dr));
}
else
{
Img[i] = (float) 0.0;
Img[i+1] = (float) 0.0;
}
}
} /* wiener_filter */



7.2 128-bit Intrinsics Code

/*
#include 
//#define MM_FUNCTIONALITY
#include 
#include 
void intrin_wiener_rcp_sse( float *Img,
float *Huv,
float *No,
float *Guv,
float gamma,
int rows,
int cols )
{
int i, sz;
__m128 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;
__m128 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;
__m128 rc, denom;
__m128 zero = _mm_set_ps1 (0.0);
sz = 2 * rows * cols;
assert( (sz > 3) & !(sz & 3) );
assert( !( ((int)Img) & 15 ) ); /* Assume alignment */
assert( !( ((int)Huv) & 15 ) );
assert( !( ((int)No) & 15 ) );
assert( !( ((int)Guv) & 15 ) );
for (i = 0; i < sz; i += 8)
{
* Compute (in place) the noise spectral density with Wiener gamma
*
* complex Noise = gamma * (Noise * complex conj Noise)
*
* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
* No[i+1] = (float) 0.0;
*/
first2 = _mm_load_ps ( &No[i] );
next2 = _mm_load_ps ( &No[i+4] );
nor4 = _mm_shuffle_ps( first2, next2, 0x88 );
noi4 = _mm_shuffle_ps( first2, next2, 0xdd );
nr4 = _mm_mul_ps ( _mm_set_ps1( gamma ) ,
_mm_add_ps ( _mm_mul_ps( nor4 , nor4 ),
_mm_mul_ps( noi4 , noi4 ) ) );
_mm_store_ps( &No[i ], _mm_unpacklo_ps ( nr4, zero ) );
_mm_store_ps( &No[i+4], _mm_unpackhi_ps ( nr4, zero ) );
/*
* Compute image spectral density
*
* Complex D = Image * complex conj Image
*
* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
*/
first2 = _mm_load_ps ( &Img[i] );
next2 = _mm_load_ps ( &Img[i+4] );
inr4 = _mm_shuffle_ps( first2, next2, 0x88 );
ini4 = _mm_shuffle_ps( first2, next2, 0xdd );
dr4 = _mm_add_ps ( _mm_mul_ps( inr4 , inr4),
_mm_mul_ps( ini4 , ini4) );
/*
* Compute denominator spectral density term
*
* Complex D = noise / D
*
* if (dr != 0.0)
* dr = (float) (No[i] / dr) ;
*
* Do that reciprical division thing!
*/
rc = _mm_rcp_ps(dr4);
rc = _mm_sub_ps( _mm_add_ps( rc, rc),
_mm_mul_ps( rc, _mm_mul_ps( rc, dr4) ) );
dr4 = _mm_and_ps ( _mm_mul_ps ( nr4 , rc ),
_mm_cmpneq_ps( dr4, zero ) );
/*
* Compute degradation power spectrum
*
* Complex Hs = Huv * complex conj Huv
*
* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
*/
first2 = _mm_load_ps ( &Huv[i] );
next2 = _mm_load_ps ( &Huv[i+4] );
hr4 = _mm_shuffle_ps( first2, next2, 0x88 );
hi4 = _mm_shuffle_ps( first2, next2, 0xdd );
hsr4 = _mm_add_ps ( _mm_mul_ps (hr4 , hr4 ),
_mm_mul_ps (hi4 , hi4 ) );
/*
* Compute numerator term
*
* Complex Num = complex conj Huv * Guv
*
* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
*/
first2 = _mm_load_ps ( &Guv[i] );
next2 = _mm_load_ps ( &Guv[i+4] );
gr4 = _mm_shuffle_ps( first2, next2, 0x88 );
gi4 = _mm_shuffle_ps( first2, next2, 0xdd );
numr4 = _mm_add_ps ( _mm_mul_ps (hr4 , gr4),
_mm_mul_ps (hi4 , gi4) );
numi4 = _mm_sub_ps ( _mm_mul_ps (hr4 , gi4),
_mm_mul_ps (hi4 , gr4) );
/*
* Final computation
*
* Complex Image = Num / (Hs + D)
*
* if ( (hsr + dr) != 0.0 )
* {
* Img[i] = (float) (numr / (hsr + dr));
* Img[i+1] = (float) (numi / (hsr + dr));
* }
* else
* {
* Img[i] = (float) 0.0;
* Img[i+1] = (float) 0.0;
* }
*
* Do the reciprical division thing
*/
denom = _mm_add_ps( hsr4, dr4 );
rc = _mm_rcp_ps(denom);
rc = _mm_sub_ps( _mm_add_ps( rc, rc),
_mm_mul_ps( rc, _mm_mul_ps( rc, denom) ) );
inr4 = _mm_and_ps( _mm_mul_ps ( numr4 , rc ) ,
_mm_cmpneq_ps( denom, zero ) );
ini4 = _mm_and_ps( _mm_mul_ps ( numi4 , rc ) ,
_mm_cmpneq_ps( denom, zero ) );
_mm_store_ps( &Img[i ], _mm_unpacklo_ps ( inr4, ini4 ) );
_mm_store_ps( &Img[i+4], _mm_unpackhi_ps ( inr4, ini4 ) );
}
} /* intrin_wiener_rcp */


    


7.3 256-bit Intrinsics Code

void intrin_wiener_rcp_avx( float *Img,
					   float *Huv,
					   float *No,
					   float *Guv,
					   float gamma,
					   int rows,
					   int cols )
{
	int i, sz;
	__m256 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;
	__m256 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;
	__m256 rc, denom;
	__m256 zero = _mm256_setzero_ps();
	sz = 2 * rows * cols;
	assert( (sz > 3) & !(sz & 3) );
	assert( !( ((int)Img) & 15 ) ); /* Assume alignment */
	assert( !( ((int)Huv) & 15 ) );
	assert( !( ((int)No) & 15 ) );
	assert( !( ((int)Guv) & 15 ) );
	for (i = 0; i < sz; i += 16)
	{
		* Compute (in place) the noise spectral density with Wiener gamma
		*
		* complex Noise = gamma * (Noise * complex conj Noise)
		*
		* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
		* No[i+1] = (float) 0.0;
		*/
		first2 = _mm256_load_ps ( &No[i] );
		next2 = _mm256_load_ps ( &No[i+4*2] );
		nor4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		noi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		nr4 = _mm256_mul_ps ( _mm256_set1_ps( gamma ) ,
		_mm256_add_ps ( _mm256_mul_ps( nor4 , nor4 ),
		_mm256_mul_ps( noi4 , noi4 ) ) );
		_mm256_store_ps( &No[i ], _mm256_unpacklo_ps ( nr4, zero ) );
		_mm256_store_ps( &No[i+4*2], _mm256_unpackhi_ps ( nr4, zero ) );
		
		/*
		* Compute image spectral density
		*
		* Complex D = Image * complex conj Image
		*
		* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
		*/
		first2 = _mm256_load_ps ( &Img[i] );
		next2 = _mm256_load_ps ( &Img[i+4*2] );
		inr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		ini4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		dr4 = _mm256_add_ps ( _mm256_mul_ps( inr4 , inr4),
		_mm256_mul_ps( ini4 , ini4) );
		/*
		* Compute denominator spectral density term
		*
		* Complex D = noise / D
		*
		* if (dr != 0.0)
		* dr = (float) (No[i] / dr) ;
		*
		* Do that reciprical division thing!
		*/
		rc = _mm256_rcp_ps(dr4);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, dr4) ) );
		dr4 = _mm256_and_ps ( _mm256_mul_ps ( nr4 , rc ),
		_mm256_cmpneq_ps( dr4, zero ) );
		/*
		* Compute degradation power spectrum
		*
		* Complex Hs = Huv * complex conj Huv
		*
		* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
		*/
		first2 = _mm256_load_ps ( &Huv[i] );
		next2 = _mm256_load_ps ( &Huv[i+4*2] );
		hr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		hi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		hsr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , hr4 ),
		_mm256_mul_ps (hi4 , hi4 ) );
		/*
		* Compute numerator term
		*
		* Complex Num = complex conj Huv * Guv
		*
		* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
		* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
		*/
		first2 = _mm256_load_ps ( &Guv[i] );
		next2 = _mm256_load_ps ( &Guv[i+4*2] );
		gr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		gi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		numr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , gr4),
		_mm256_mul_ps (hi4 , gi4) );
		numi4 = _mm256_sub_ps ( _mm256_mul_ps (hr4 , gi4),
		_mm256_mul_ps (hi4 , gr4) );
		/*
		* Final computation
		*
		* Complex Image = Num / (Hs + D)
		*
		* if ( (hsr + dr) != 0.0 )
		AP-807 Wiener Filtering Using Streaming SIMD Extensions
		01/28/99 15
		* {
		* Img[i] = (float) (numr / (hsr + dr));
		* Img[i+1] = (float) (numi / (hsr + dr));
		* }
		* else
		* {
		* Img[i] = (float) 0.0;
		* Img[i+1] = (float) 0.0;
		* }
		*
		* Do the reciprical division thing
		*/
		denom = _mm256_add_ps( hsr4, dr4 );
		rc = _mm256_rcp_ps(denom);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, denom) ) );
		inr4 = _mm256_and_ps( _mm256_mul_ps ( numr4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );
		ini4 = _mm256_and_ps( _mm256_mul_ps ( numi4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );
		_mm256_store_ps( &Img[i ], _mm256_unpacklo_ps ( inr4, ini4 ) );
		_mm256_store_ps( &Img[i+4*2], _mm256_unpackhi_ps ( inr4, ini4 ) );

	}
} /* intrin_wiener_rcp */


    


7.4 256-bit Intrinsics Code with grouped arrays

blockHNG Structure

Huv[0] Huv[15] No[0] No[15] Guv[0] Guv[15] Huv[16] Huv[31] No[16]
void intrin_wiener_rcp_avx ( float *Img,
					float *_blockHNG,
					float gamma,
					int rows,
					int cols)
{
	int sz;
	__m256 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;
	__m256 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;
	__m256 rc, denom;
	__m256 zero = _mm256_setzero_ps();
	sz = 2 * rows * cols;

	assert( (sz > 3) & !(sz & 3) );
	assert( !( ((int)Img) & 15 ) ); // Assume alignment 
	assert( !( ((int)_blockHNG) & 15 ) ); // Assume alignment 

	float *Huv;
	float *No;
	float *Guv;
	
	int j = 0;	// img index
	for (int _blockHNG_tracker = 0; _blockHNG_tracker < 2 * rows * cols * 3; _blockHNG_tracker += 48)
	{
		Huv = &(_blockHNG[_blockHNG_tracker]);
		No = &(_blockHNG[_blockHNG_tracker + 16]);
		Guv = &(_blockHNG[_blockHNG_tracker + 32]);

		/*
		* Compute (in place) the noise spectral density with Wiener gamma
		*
		* complex Noise = gamma * (Noise * complex conj Noise)
		*
		AP-807 Wiener Filtering Using Streaming SIMD Extensions
		01/28/99 13
		* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );
		* No[i+1] = (float) 0.0;
		*/
		first2 = _mm256_load_ps ( &No[0] );
		next2 = _mm256_load_ps ( &No[8] );
		nor4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		noi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		nr4 = _mm256_mul_ps ( _mm256_set1_ps( gamma ) ,
		_mm256_add_ps ( _mm256_mul_ps( nor4 , nor4 ),
		_mm256_mul_ps( noi4 , noi4 ) ) );

		_mm256_store_ps( &No[0], _mm256_unpacklo_ps ( nr4, zero ) );
		_mm256_store_ps( &No[8], _mm256_unpackhi_ps ( nr4, zero ) );

		/*
		* Compute image spectral density
		*
		* Complex D = Image * complex conj Image
		*
		* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );
		*/
		first2 = _mm256_load_ps ( &Img[j] );
		next2 = _mm256_load_ps ( &Img[j+8] );
		inr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		ini4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		dr4 = _mm256_add_ps ( _mm256_mul_ps( inr4 , inr4),
		_mm256_mul_ps( ini4 , ini4) );
		/*
		* Compute denominator spectral density term
		*
		* Complex D = noise / D
		*
		* if (dr != 0.0)
		* dr = (float) (No[i] / dr) ;
		*
		* Do that reciprical division thing!
		*/
		rc = _mm256_rcp_ps(dr4);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, dr4) ) );
		dr4 = _mm256_and_ps ( _mm256_mul_ps ( nr4 , rc ),
		_mm256_cmpneq_ps( dr4, zero ) );
		/*
		* Compute degradation power spectrum
		*
		* Complex Hs = Huv * complex conj Huv
		*
		* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );
		*/
		first2 = _mm256_load_ps ( &Huv[0] );
		next2 = _mm256_load_ps ( &Huv[8] );
		hr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		hi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		hsr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , hr4 ),
		_mm256_mul_ps (hi4 , hi4 ) );
		/*
		* Compute numerator term
		*
		* Complex Num = complex conj Huv * Guv
		*
		* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );
		* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );
		*/
		first2 = _mm256_load_ps ( &Guv[0] );
		next2 = _mm256_load_ps ( &Guv[8] );
		gr4 = _mm256_shuffle_ps( first2, next2, 0x88 );
		gi4 = _mm256_shuffle_ps( first2, next2, 0xdd );
		numr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , gr4),
		_mm256_mul_ps (hi4 , gi4) );
		numi4 = _mm256_sub_ps ( _mm256_mul_ps (hr4 , gi4),
		_mm256_mul_ps (hi4 , gr4) );
		/*
		* Final computation
		*
		* Complex Image = Num / (Hs + D)
		*
		* if ( (hsr + dr) != 0.0 )
		AP-807 Wiener Filtering Using Streaming SIMD Extensions
		01/28/99 15
		* {
		* Img[i] = (float) (numr / (hsr + dr));
		* Img[i+1] = (float) (numi / (hsr + dr));
		* }
		* else
		* {
		* Img[i] = (float) 0.0;
		* Img[i+1] = (float) 0.0;
		* }
		*
		* Do the reciprical division thing
		*/
		denom = _mm256_add_ps( hsr4, dr4 );
		rc = _mm256_rcp_ps(denom);
		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),
		_mm256_mul_ps( rc, _mm256_mul_ps( rc, denom) ) );
		inr4 = _mm256_and_ps( _mm256_mul_ps ( numr4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );
		ini4 = _mm256_and_ps( _mm256_mul_ps ( numi4 , rc ) ,
		_mm256_cmpneq_ps( denom, zero ) );

		_mm256_store_ps( &Img[j ], _mm256_unpacklo_ps ( inr4, ini4 ) );
		_mm256_store_ps( &Img[j+8], _mm256_unpackhi_ps ( inr4, ini4 ) );
		j+=16;
	}
} /* Intrin_wiener_rcp_avx */



Acknowledgements
The authors would like to thank Phil Kerly, Raghu Muthyalampalli and Justin Landon who assisted with the performance assessment of the code, providing performance recommendations and review of the whitepaper.

References
The following documents are referenced in this application note, and provide background or supporting information for understanding the topics presented in this document.

1. The Pocket Handbook of Image Processing Algorithms in C, by Harley R Myler and Aruthur R.
Weeks. ISBN 0-13-642240-3.
2. Increasing the Accuracy of the Results from the Reciprocal and Reciprocal Square Root
Instructions using the Newton-Raphson Method, Intel application note (AP-803, Order Number: 243637-001).
3. Split-Radix FFT, Intel application note (AP-808, Order Number: 243642-001).
4. Wiener Filtering Using Streaming SIMD Extensions, Intel application note (AP-807).

Per informazioni complete sulle ottimizzazioni del compilatore, consultare l'Avviso sull'ottimizzazione