<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Mon, 21 May 2012 04:47:26 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/avx/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/avx/type/technical-article/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Improving the Compute Performance of Video Processing Software Using AVX (Advanced Vector Extensions) Instructions (by Eli Hernandez and Larry Moore)</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/41302">Improving the Compute Performance of Video Processing Software Using AVX (Advanced Vector Extensions) Instructions</a> [PDF 311KB]<br /><br />
<h2 class="sectionHeading">Abstract</h2>
Modern x86 CPUs permit instruction-level parallelism (e.g., SIMD) on vector registers of at most 128 bits. Second Generation Intel® Core™ Processors include the first generation of AVX (256-bit operators), which permits increased parallel processing. This paper outlines a case study in which AVX instructions are used to improve the compute performance of a de-saturation algorithm. The paper also discusses how future integer-based AVX instructions might be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms.<br /><br />
<h2 class="sectionHeading">1. Introduction</h2>
Modern x86 CPUs permit Instruction Level Parallelism (ILP), such as Single Instruction Multiple Data (SIMD), on vector registers of at most 128 bits. These vector registers can be used to process multiple data elements with fewer instructions. Second Generation Intel® Core™ Processors (codenamed Sandy Bridge) included the first generation of AVX, which is a 256-bit instruction set extension to the Intel® Streaming SIMD Extensions (Intel® SSE).<br /><br />The first generation of AVX included a wide range of instructions designed primarily to accelerate compute-intensive algorithms performing arithmetic operations on floating-point data. However, even if an algorithm is integer based, using AVX instructions could potentially increase the algorithm’s performance without sacrificing accuracy of the results. In video processing algorithms, the pixel channels are often stored as 8-bit unsigned integers (bytes) and processed as 32-bit or larger integer values. Therefore, most video algorithms must convert pixels to a wider format before processing and back to bytes afterward. Wider bit widths are used for calculation accuracy and smaller formats are used to save space. Typically, floating-point units are not used because the extra conversion costs do not significantly improve accuracy. However, AVX is capable of greatly improving the runtime performance of video processing software and a vast number of other software applications through increased parallelism.<br /><br />This paper describes a case study in which AVX instructions are used to enhance the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD state to AVX-based SIMD. The paper also discusses how future generations of AVX may be able to further aid performance optimization and enable greater performance of video processing.<br /><br />
<h2 class="sectionHeading">2. Intel SIMD Overview</h2>
On Intel SIMD architectures, a vector register can store a group of data elements of a single data type (e.g. floats or integers). The vector registers of Sandy Bridge are 256 bits wide whereas all other processors since Intel® Pentium III were 128 bits wide. Each vector (called YMM in Sandy Bridge) register can store 8 floats, 8 32-bit integers, 32 chars, etc. AVX instructions operate on the full 256 bits, but SSE can only operate on 128 bits.<br /><br />A SIMD enabled-processor can execute a single operation on multiple data. An operation performed simultaneously on multiple data elements is a vector process. SIMD vectorization is the process of converting an algorithm from a scalar to a vector implementation. The multiply function in sample code below is used to illustrate the difference between the scalar and SIMD vector process.<br /><br /><img src="http://software.intel.com/file/41320" /><br /><br />
<p ><img src="http://software.intel.com/file/41321" /></p>
<div ><b>Figure 1:</b><i> This illustrates the difference between scalar and vector processes. The scalar version would have 16 loads, 8 multiplications and 8 stores. SSE can potentially have 4 loads, 2 vector multiplications and 2 stores. AVX would use 2 loads, 1 large vector multiplication and 1 store. The labels with VMUL were shortened to hide the distinction between various versions of vector multiplication instructions. VMUL performs multiplication on vectors A and B for each element pair and stores the results in another vector. Let us suppose for simplicity that loads and stores cost 3 cycles each, every multiplication costs 1 cycle, and pipelining is ignored. Then the scalar version spends (16 + 8) x 3 + 8 x 1 = 80 cycles to compute 8 elements while the AVX version spends (2 + 1) x 3 + 1 = 10 cycles, yielding a theoretical speedup of 8x. This illustrates why SIMD vectorization has become an important part of optimizing application performance. Given the observed performance benefits of SIMD, automatic SIMD vectorization has also become a keystone feature in advanced compilers.<br /></i></div>
<br />
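The scalar and vector multiply illustrated in Figure 1 can be sketched in C as follows. This is a minimal illustration, not the article's sample code (which appears only as an image): the function names <code>multiply_scalar</code> and <code>multiply_avx</code> are hypothetical, unaligned loads and stores are used for simplicity, and the AVX path is compiled only when the compiler defines <code>__AVX__</code> (for example with <code>-mavx</code>); otherwise the second function falls back to the scalar loop.<br /><br />

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Scalar version: two loads, one multiplication and one store per element. */
void multiply_scalar(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

/* Vector version: eight multiplications per iteration when AVX is enabled. */
void multiply_avx(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);             /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);             /* load 8 floats */
        _mm256_storeu_ps(c + i, _mm256_mul_ps(va, vb)); /* 8 products, 1 store */
    }
#endif
    for (; i < n; ++i) /* remainder, or every element when AVX is unavailable */
        c[i] = a[i] * b[i];
}
```

Both functions produce the same results; only the number of instructions issued per element differs.<br /><br />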
<h2 class="sectionHeading">3. Video Processing Code</h2>
Typical video processing algorithms calculate pixel values using a triple for-loop (for each frame, for each X, for each Y). This loop nest typically shows up as an area of high CPU utilization (i.e., a hotspot). Video processing application hotspots are excellent candidates to optimize with AVX.<br /><br />A simple approach to optimize using SIMD involves taking advantage of the latest processor technology features, such as AVX. The following sections describe the optimization process using AVX instructions to enhance the performance of a de-saturation algorithm. The serial code implementation is briefly discussed and AVX-based SIMD instructions are used to optimize the de-saturation algorithm. Finally, this chapter ends with our performance results for the optimized code.<br /><br /><br />
<h2 class="sectionHeading">3.1. Desaturation - Sample Code</h2>
The typical implementation of the Desaturation algorithm uses the incoming pixel values to compute a luminance value. The luminance value is applied to all outgoing pixels to de-saturate images as part of processing video for output.<br /><br />As the sample code below shows, the algorithm traverses the image row by row to get pixel data, whose channel values (blue, green, and red) are used to calculate the luminance value. In order to achieve high accuracy, the algorithm converts the one-byte channel values to single-precision floating point. The floating-point values are used in a dot-product type of operation to compute the luminance value. The Desaturation sample algorithm uses the fLuminance(…) function to convert pixel channel values from byte to float. The conversion is achieved by typecasting each channel value to float. With constant weights for Red, Green and Blue, the fLuminance(…) function then uses the float values to compute the luminance, which is applied to the video output.<br /><br /><img src="http://software.intel.com/file/41322" /><br /><img src="http://software.intel.com/file/41323" /><br /><br />Note that the conversion of channel data from byte to float occurs implicitly by typecasting to float. Although the scalar code looks simple and trivial, the assembly code generated by the compiler is much more complex. Analysis of the generated assembly code shows that the byte-to-float conversion can be performed with fewer instructions by using the more efficient AVX instructions. As we have observed, the serial code calculates one channel and one pixel at a time. Nothing is computed in parallel (ignoring pipelining and reordering). Refer to <b>Appendix A</b> for the assembly code.<br /><br /><br />
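The serial procedure described in this section can be approximated by the following C sketch. It is not the article's sample code, which is only available as an image. The weight constants are an assumption (the common Rec. 601 luma weights), as are the packed B, G, R byte layout, the rounding step, and the <code>desaturate_scalar</code> driver; only the <code>fLuminance</code> name comes from the text.<br /><br />

```c
#include <assert.h>
#include <stddef.h>

/* Assumed Rec. 601 luma weights; the article's constants are not shown in the text. */
#define WEIGHT_RED   0.299f
#define WEIGHT_GREEN 0.587f
#define WEIGHT_BLUE  0.114f

/* Each byte is converted to float by typecast, then combined as a dot product. */
float fLuminance(unsigned char blue, unsigned char green, unsigned char red)
{
    return (float)blue * WEIGHT_BLUE
         + (float)green * WEIGHT_GREEN
         + (float)red * WEIGHT_RED;
}

/* Hypothetical driver: pixels are packed as B, G, R bytes, processed row by row,
 * one pixel per iteration, with nothing computed in parallel. */
void desaturate_scalar(unsigned char *pixels, size_t width, size_t height)
{
    assert(pixels != NULL);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            unsigned char *p = pixels + 3 * (y * width + x);
            /* Round to nearest before converting back to a byte. */
            unsigned char lum = (unsigned char)(fLuminance(p[0], p[1], p[2]) + 0.5f);
            p[0] = lum;
            p[1] = lum;
            p[2] = lum;
        }
    }
}
```

Every channel of a pixel receives the same luminance value, which is what removes the color from the output.<br /><br />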
<h2 class="sectionHeading">3.2. Desaturation - Optimization with AVX</h2>
This section outlines the transformation of the serial code and describes how AVX, SSE4.1 and SSE2 instructions optimize the de-saturation algorithm. As illustrated in Chapter 2, SIMD lets us work on many items at once. Therefore, the load, store, conversion and math operations can be done in parallel. The algorithm below describes how we can use instruction-level parallelism (via AVX instructions) to significantly improve performance. Note that the algorithm is written with the restriction that we could only use currently available instructions, not the idealized future instructions discussed later. Therefore, lines 19, 20 and 21 involve an intermediate step to convert 32-bit integers back down to 8-bit unsigned integers.<br /><br /><img src="http://software.intel.com/file/41332" /><br /><img src="http://software.intel.com/file/41333" /><br /><br />
<div ><b>Figure 2. </b><i>De-saturation algorithm</i><br /><br /></div>
<br />With Figure 2 as the backbone of the de-saturation routine, we can implement the real code. The motivations for using a procedure similar to Figures 2 and 3 are that:<br /><br />
<ul>
<li>AVX provides greater throughput for parallel processing of single-precision floating-point units than any past Intel SIMD x86 extension (MMX, SSE, SSE2, SSE3, SSE4.1, SSE4.2).</li>
<li>The cost (SIMD) to cast byte (8-bit unsigned char) to integer (32-bit signed integer) to single precision floating point (32-bit float) and back is less than using multiple calls of the equivalent code (scalar) using just bytes or integers.</li>
<li>Using byte based SIMD with this procedure gives poor precision. Parallel performance is not considered.</li>
<li>Using integer based SIMD with this procedure gives acceptable precision. Current AVX instructions for integer arithmetic do not exist and therefore cannot take full advantage of the 256-bit registers.</li>
<li>Using float based SIMD with this procedure gives very good precision and offers higher performance than those described above.</li>
</ul>
<img src="http://software.intel.com/file/41334" /><br /><img src="http://software.intel.com/file/41335" /><br /><br /><b>Figure 3. </b><i>De-saturation code optimized with AVX<sup>1</sup></i><br /><br />The algorithm and AVX code shown in Figures 2 and 3 convey exactly the same process line for line. Notice that only lines 9 and 16 do the real work. These lines each perform 8 single-precision floating-point multiplications in parallel, totalling 16 multiplications in 2 instructions versus 16 individual multiplication instructions. Everything else is overhead needed to make use of the parallel instructions or to increase precision.<br /><br />Despite the overhead, this code still improves performance by 1.45x. If integer-based instructions existed with parallelism equivalent to that of single-precision floating point, we could further increase performance. In that case, lines 6, 8, 10, 11, 14, 15, 17 and 18 could be eliminated. Lines 9 and 16 would operate on integers instead. Lines 19, 20 and 21 could be replaced by a single pack instruction (integer to byte). Of course, there are other hypothetical instructions that could be introduced with future AVX generations. The potential performance gain is left as an exercise for the reader. For assembly instructions generated by intrinsic functions used in the inner [ix] loop, refer to <b>Appendix B</b>.<br /><br /><br />
<h2 class="sectionHeading">3.3 Desaturation - Performance Test Results<b><sup>3</sup></b></h2>
Performance assessment of the de-saturation algorithm optimized with AVX in this study observed a 1.45x speedup when compared to the serial code. To gather performance data, the de-saturation algorithm was applied to a 1440x1080 image and looped 100 times. Performance was measured as the elapsed time (in milliseconds) taken to de-saturate the image. The following performance numbers were consistently observed:<br /><br />
<table cellpadding="10" cellspacing="0" border="0">
<tbody>
<tr>
<td>Serial Code:</td>
<td>1264 milliseconds</td>
</tr>
<tr>
<td>Code with AVX:</td>
<td>873 milliseconds</td>
</tr>
<tr>
<td>Performance Scaling:</td>
<td>1.45x or 1264ms/873ms</td>
</tr>
</tbody>
</table>
<br />A kernel (small application program) was used to run the algorithm. A kernel with a 1.45x scaling typically translates to a performance improvement of 10% to 15% when measured at the workload level. However, for video processing this rule of thumb does not apply. Consider applying the de-saturation algorithm to a video clip of one minute or longer. In that case, there will be more than 100 frames (images) to process. In theory, since more data has to be processed, the performance boost potential could be more than 1.45x, especially when processing full High-Definition (i.e., 1920x1080) video.<br /><br /><br />
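The relationship between kernel scaling and workload-level improvement can be explored with a small calculation. The sketch below is an assumption of this note rather than something the article states: it uses Amdahl's law, where <code>kernel_fraction</code> is the hypothetical share of total run time spent in the optimized kernel. Both function names are hypothetical.<br /><br />

```c
#include <assert.h>

/* Kernel scaling as reported in the table: serial time divided by AVX time. */
double kernel_scaling(double serial_ms, double avx_ms)
{
    assert(serial_ms > 0.0 && avx_ms > 0.0);
    return serial_ms / avx_ms;
}

/* Amdahl's law: overall speedup when only a fraction of the run time is accelerated. */
double workload_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For example, <code>workload_speedup(f, kernel_scaling(1264.0, 873.0))</code> approaches the kernel scaling as <code>f</code> approaches 1, that is, as the de-saturation loop accounts for more of the total work.<br /><br />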
<h2 class="sectionHeading">4. Packed Integer Conversion Instructions</h2>
Since our optimized de-saturation algorithm uses one of the SSE4.1 instructions, we give an overview of SSE4.1 here because other SSE4.1 instructions may be applicable to the optimization of other video processing algorithms. The Packed Integer Conversion instruction set contains 12 instructions for packed integer bit width conversions, any of which can be used to optimize code where the bit width of integer data is to be increased.<br /><br />The table in <b>Figure 4</b> lists the SSE4.1 instructions for packed integer conversions. The instructions support sign extension and zero extension conversions of byte to word, byte to double word, byte to quad-word, word to double word, word to quad-word, and double word to quad-word. Additionally, the chart shows a comparison of the SSE2 and SSE4.1 instructions needed to convert four (4) one-byte integers to four (4) 32-bit integers.<br /><br />The <i>pmovzxbd (byte to double word)</i> instruction was used a total of four (4) times in the de-saturate optimization. If these instructions gain support for the full 256-bit registers, the use of this instruction in the optimized algorithm will be reduced to two (2), further improving loop performance.<br /><br /><img src="http://software.intel.com/file/41328" /><br /><b>Figure 4. </b><i>Instructions for bit width conversions of packed integers</i><br /><br />The source operand of a packed integer conversion instruction is either an XMM register or memory. The destination is always an XMM register. When accessing memory, no alignment is required unless alignment checking is enabled, in which case all conversions must be aligned to the width of the memory being referenced. The number of elements that can be converted and the width of the memory reference are illustrated in <b>Figure 5</b>. The alignment requirement is shown in parentheses.<br /><br /><img src="http://software.intel.com/file/41329" /><br /><b>Figure 5. Number of elements to process.</b> <i>P is Packed. MOV is Move (copy register). ZX is Zero Extend. SX is Sign Extend. B is Byte. W is Word. D is Double Word. Q is Quad-Word.</i><br /><br /><br />
<h2 class="sectionHeading">5. Conclusion</h2>
This paper has discussed how Second Generation Intel® Core™ Processors could increase parallel processing via AVX instructions and 256 bit registers. This paper outlined a case study in which AVX instructions were used to improve the compute performance of a de-saturation algorithm. The paper also discussed how future integer based AVX instructions could be used to further enhance SIMD optimizations and achieve even greater performance benefits on video processing algorithms. The procedure described demonstrated how AVX instructions or their intrinsic functions could be utilized to improve the runtime performance of video processing applications. The paper documented that despite some overhead incurred to setup for SIMD processing, the video de-saturation still achieved excellent performance benefits. <br /><br /><br />
<h2 class="sectionHeading">About the Authors</h2>
Eli Hernandez is an Application Engineer in the Consumer Client and Power Enabling Group at Intel Corporation where he works with customers to optimize their software for power efficiency and to run best on Intel hardware and software technologies. Eli joined Intel in August of 2007 with over 12 years of experience in software development for the telecom and the chemical industry. He received his B.S. in Electrical Engineering in 1989 and completed Master Studies in Computer Science in 1991-1992 from the DePaul University of Chicago.<br /><br />In 2008, Larry Moore graduated from Saint Petersburg College with Honors. He received a Who's Who Among Students Award and was a member of Phi Theta Kappa Honor Society. In 2011, he spent 8 months at Intel as an application engineer intern, in DuPont, Washington. Currently, he is attending the University of South Florida at Tampa, Florida in an accelerated graduate program, pursuing both a Bachelor of Science and Master of Science in Computer Engineering. His current research involves computer aided verification of real-time systems and model checking. Larry is also a member of IEEE Computer Society. <br /><br /><br />
<h2 class="sectionHeading">Appendix A: Inner loop equivalent assembly of the serial code</h2>
Roughly 45 instructions are needed to process one iteration of the algorithm's inner loop, with a throughput of 1 pixel per iteration.<br /><br /><img src="http://software.intel.com/file/41330" /><br /><br />
<h2 class="sectionHeading">Appendix B: Equivalent assembly of inner loop optimized with AVX</h2>
Roughly 30 instructions are needed to process one iteration of the algorithm's inner loop, with a throughput of 4 pixels per iteration.<br /><br /><img src="http://software.intel.com/file/41331" /><br /><br /><sup>1</sup> Load and store operations in the optimized code assume the data is aligned.<br /><sup>2</sup> Please see footnote 3 and section 3.3.<br /><sup>3</sup> The performance measurements in this section are the actual numbers from real tests. However, we do not guarantee that you will achieve the same level of performance.<br /><br />
<div id="vc-meta" >
<div id="vc-meta-author">
<div></div>
</div>
<div id="vc-meta-pubdate">02-08-2012</div>
<div id="vc-meta-modificationdate">02-08-2012</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div></div>
</div>
<div id="vc-meta-thumb">http://software.intel.com/file/41303</div>
<div id="vc-meta-abstract">This paper describes a case study in which AVX instructions are used to enhance the performance of a de-saturation algorithm (a common video filter). The case study takes the algorithm from a non-SIMD state to AVX based SIMD. The paper also discusses how future generations of AVX may be able to further aid performance optimization and enable greater performance of video processing.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/improving-the-compute-performance-of-video-processing-software-using-avx-advanced-vector-extensions-instructions/</link>
      <pubDate>Wed, 08 Feb 2012 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/improving-the-compute-performance-of-video-processing-software-using-avx-advanced-vector-extensions-instructions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/improving-the-compute-performance-of-video-processing-software-using-avx-advanced-vector-extensions-instructions/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Visual Computing Source</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Introduction to Intel® Advanced Vector Extensions</title>
      <description><![CDATA[ <b>By Chris Lomont</b><br /><br />
<h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/37205">Introduction to Intel® Advanced Vector Extensions</a> [PDF 1.4MB]<br /><br />Intel® Advanced Vector Extensions (Intel® AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel® architecture CPUs. These instructions extend previous SIMD offerings (MMX™ instructions and Intel® Streaming SIMD Extensions (Intel® SSE)) by adding the following new features:<br /> 
<ul>
<li>The 128-bit SIMD registers have been expanded to 256 bits. Intel® AVX is designed to support 512 or 1024 bits in the future.</li>
<li>Three-operand, nondestructive operations have been added. Previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand; the new operands can perform operations like A = B + C, leaving the original source operands unchanged.</li>
<li>A few instructions take four-register operands, allowing smaller and faster code by removing unnecessary instructions.</li>
<li>Memory alignment requirements for operands are relaxed.</li>
<li>A new extension coding scheme (VEX) has been designed to make future additions easier as well as making coding of instructions smaller and faster to execute.</li>
</ul>
Closely related to these advances are the new Fused-Multiply-Add (FMA) instructions, which allow faster and more accurate specialized operations such as single instruction A = A * B + C. The FMA instructions should be available in the second-generation Intel® Core™ CPU. Other features include new instructions for dealing with Advanced Encryption Standard (AES) encryption and decryption, a packed carry-less multiplication operation (PCLMULQDQ) useful for certain encryption primitives, and some reserved slots for future instructions, such as a hardware random number generator.<br /><br /><br />
<h2 class="sectionHeading">Instruction Set Overview</h2>
The new instructions are encoded using what Intel calls a <i>VEX prefix</i>, which is a two- or three-byte prefix designed to clean up the complexity of current and future x86/x64 instruction encoding. The two new VEX prefixes are formed from two obsolete 32-bit instructions, Load Pointer Using ES (LES, 0xC4, three-byte form) and Load Pointer Using DS (LDS, 0xC5, two-byte form), which load the ES and DS segment registers in 32-bit mode. In 64-bit mode, opcodes LDS and LES generate an invalid-opcode exception, but under Intel® AVX, these opcodes are repurposed for encoding new instruction prefixes. As a result, the VEX instructions can only be used when running in 64-bit mode. The prefixes allow encoding more registers than previous x86 instructions and are required for accessing the new 256-bit SIMD registers or using the three- and four-operand syntax. As a user, you do not need to worry about this (unless you're writing assemblers or disassemblers).<br /><br /><br /><b>Note:</b> <i>The rest of this article assumes operation in 64-bit mode.</i><br /><br /><br />SIMD instructions allow processing of multiple pieces of data in a single step, speeding up throughput for many tasks, from video encoding and decoding to image processing to data analysis to physics simulations. Intel® AVX instructions work on Institute of Electrical and Electronics Engineers (IEEE)-754 floating-point values in 32-bit length (called <i>single precision</i>) and in 64-bit length (called <i>double precision</i>). IEEE-754 is the standard defining reproducible, robust floating-point operation and is the standard for most mainstream numerical computations.<br /><br />The older, related Intel® SSE instructions also support various signed and unsigned integer sizes, including signed and unsigned byte (B, 8-bit), word (W, 16-bit), doubleword (DW, 32-bit), quadword (QW, 64-bit), and doublequadword (DQ, 128-bit) lengths. 
Not all instructions are available in all size combinations; for details, see the links provided in "For More Information." See Figure 2 later in this article for a graphical representation of the data types.<br /><br />The hardware supporting Intel® AVX (and FMA) consists of the 16 256-bit YMM registers YMM0-YMM15 and a 32-bit control/status register called <i>MXCSR</i>. The YMM registers are aliased over the older 128-bit XMM registers used for Intel SSE, treating the XMM registers as the lower half of the corresponding YMM register, as shown in Figure 1.<br /><br />Bits 0-5 of MXCSR indicate SIMD floating-point exceptions with "sticky" bits: after being set, they remain set until cleared using <code>LDMXCSR</code> or <code>FXRSTOR</code>. Bits 7-12 mask individual exceptions when set, and are initially set by a power-up or reset. Bits 0-5 represent invalid operation, denormal, divide by zero, overflow, underflow, and precision, respectively. For details, see the links in "For More Information."<br /><br />
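The MXCSR bit layout described above can be decoded with a few shifts. The sketch below works on a plain 32-bit value rather than reading the register, so it is portable; the enum and function names are hypothetical, and 0x1F80 is used as the power-up value (all six mask bits set, no flags raised).<br /><br />

```c
#include <assert.h>
#include <stdint.h>

/* Exception order in bits 0-5; the matching mask bits sit 7 positions higher. */
enum {
    MXCSR_INVALID = 0,
    MXCSR_DENORMAL = 1,
    MXCSR_DIV_ZERO = 2,
    MXCSR_OVERFLOW = 3,
    MXCSR_UNDERFLOW = 4,
    MXCSR_PRECISION = 5
};

/* Returns 1 if the sticky flag for the given exception is set. */
int mxcsr_flag_set(uint32_t mxcsr, int exception)
{
    assert(exception >= MXCSR_INVALID && exception <= MXCSR_PRECISION);
    return (int)((mxcsr >> exception) & 1u);
}

/* Returns 1 if the given exception is masked (bits 7-12). */
int mxcsr_masked(uint32_t mxcsr, int exception)
{
    assert(exception >= MXCSR_INVALID && exception <= MXCSR_PRECISION);
    return (int)((mxcsr >> (exception + 7)) & 1u);
}
```

On x86 compilers the live register value can be obtained with the <code>_mm_getcsr()</code> intrinsic and passed to these helpers.<br /><br />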
<p ><img src="http://software.intel.com/file/37206" /></p>
<div ><b>Figure 1.</b><i> XMM registers overlay the YMM registers.</i><br /></div>
<br />Figure 2 illustrates the data types used in the Intel® SSE and Intel® AVX instructions. Roughly, for Intel AVX, any multiple of 32-bit or 64-bit floating-point type that adds to 128 or 256 bits is allowed as well as multiples of any integer type that adds to 128 bits.<br /><br />
<p ><img src="http://software.intel.com/file/37207" /></p>
<div ><b>Figure 2.</b> <i>Intel® AVX and Intel® SSE data types</i><br /></div>
<br />Instructions often come in scalar and vector versions, as illustrated in Figure 3. Vector versions operate by treating data in the registers in parallel "SIMD" mode; the scalar version only operates on one entry in each register. This distinction allows less data movement for some algorithms, providing better overall throughput.<br /><br />
<p ><img src="http://software.intel.com/file/37208" /></p>
<div ><b>Figure 3.</b> SIMD versus scalar operations<br /></div>
<br />Data is <i>memory aligned</i> when the data to be operated upon as an n-byte chunk is stored on an n-byte memory boundary. For example, when loading 256-bit data into YMM registers, if the data source is 256-bit aligned, the data is called <i>aligned</i>.<br /><br />For Intel® SSE operations, memory alignment was required unless explicitly stated. For example, under Intel SSE, there were specific instructions for memory-aligned and memory-unaligned operations, such as the MOVAPD (move-aligned packed double) and <code>MOVUPD</code> (move-unaligned packed double) instructions. Instructions not split in two like this required aligned accesses.<br /><br />Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows unaligned access; however, this access may come at a performance slowdown, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the <i>Intel® Advanced Vector Extensions Programming Reference</i> (see "For More Information" for a link).<br /><br />Another performance concern besides unaligned data issues is that mixing legacy XMM-only instructions and newer Intel AVX instructions causes delays, so minimize transitions between VEX-encoded instructions and legacy Intel SSE code. Said another way, do not mix VEX-prefixed instructions and non-VEX-prefixed instructions for optimal throughput. If you must do so, minimize transitions between the two by grouping instructions of the same VEX/non-VEX class. 
Alternatively, there is no transition penalty if the upper YMM bits are set to zero via <code>VZEROUPPER</code> or <code>VZEROALL</code>, which compilers should automatically insert. This insertion requires an extra instruction, so profiling is recommended.<br /><br /><br />
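Whether a buffer meets the 16-byte or 32-byte alignment recommended above can be checked at run time. The helper below is a minimal sketch with a hypothetical name; it assumes the alignment is a power of two, which holds for the 16-byte and 32-byte cases discussed here.<br /><br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr lies on an alignment-byte boundary (alignment: power of two). */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

A buffer declared with C11 <code>_Alignas(32)</code> passes <code>is_aligned(buf, 32)</code> and is therefore suitable for aligned 256-bit loads and stores.<br /><br />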
<h2 class="sectionHeading">Intel® AVX Instruction Classes</h2>
As mentioned, Intel® AVX adds support for many new instructions and extends current Intel SSE instructions to the new 256-bit registers, with most old Intel SSE instructions having a V-prefixed Intel AVX version for accessing new register sizes and three-operand forms. Depending on how instructions are counted, there are up to a few hundred new Intel AVX instructions.<br /><br />For example, the old two-operand Intel SSE instruction <code>ADDPS xmm1, xmm2/m128</code> can now be expressed in three-operand syntax as <code>VADDPS xmm1, xmm2, xmm3/m128</code> or the 256-bit register using the form <code>VADDPS ymm1, ymm2, ymm3/m256</code>. A few instructions allow four operands, such as <code>VBLENDVPS ymm1, ymm2, ymm3/m256, ymm4,</code> which conditionally copies single-precision floating-point values from <code>ymm2</code> or <code>ymm3/m256</code> to <code>ymm1</code> based on masks in <code>ymm4</code>. This is an improvement on the previous form, where <code>xmm0</code> was implicitly needed, requiring compilers to free up <code>xmm0</code>. Now, with all registers explicit, there is more freedom for register allocation. Here, <code>m128</code> is a 128-bit memory location, <code>xmm1</code> is the 128-bit register, and so on.<br /><br />Some new instructions are VEX only (not Intel SSE extensions), including many ways to move data into and out of the YMM registers. Examples are the useful <code>VBROADCASTS[S/D]</code>, which loads a single value into all elements of an XMM or YMM register, and ways to shuffle data around in a register using <code>VPERMILP[S/D]</code>. (The bracket notation is explained in the Appendix A.)<br /><br />Intel® AVX adds arithmetic instructions for variants of add, subtract, multiply, divide, square root, compare, min, max, and round on single- and double-precision packed and scalar floating-point data. Many new conditional predicates are also useful for 128-bit Intel SSE, giving 32 comparison types. 
Intel® AVX also includes instructions promoted from previous SIMD covering logical, blend, convert, test, pack, unpack, shuffle, load, and store. The toolset adds new instructions, as well, including non-strided fetching (broadcast of single or multiple data into a 256-bit destination, masked-move primitives for conditional load and store), insert and extract multiple-SIMD data to and from 256-bit SIMD registers, permute primitives to manipulate data within a register, branch handling, and packed testing instructions.<br /><br /><br /><b><i>Future Additions</i></b><br />The Intel® AVX manual also lists some proposed future instructions, covered here for completeness. This is not a guarantee that these instructions will materialize as written.<br /><br />Two instructions (<code>VCVTPH2PS</code> and <code>VCVTPS2PH</code>) are reserved for supporting 16-bit floating-point conversions to and from single- and double-floating-point types. The 16-bit format is called <i>half-precision</i> and has a 10-bit mantissa (with an implied leading 1 for non-denormalized numbers, resulting in 11-bit precision), 5-bit exponent (biased by 15), and 1-bit sign.<br /><br />The proposed <code>RDRAND</code> instruction uses a cryptographically secure hardware digital random bit generator to generate random numbers for 16-, 32-, and 64-bit registers. On success, the carry flag is set to 1 (<code>CF=1</code>). If not enough entropy is available, the carry flag is cleared (<code>CF=0</code>).<br /><br />Finally, there are four instructions (<code>RDFSBASE, RDGSBASE, WRFSBASE, and WRGSBASE</code>) to read and write FS and GS registers at all privilege levels in 64-bit mode.<br /><br />Another future addition is the FMA instructions, which perform operations similar to A = + A * B + C, where either of the plus signs (+) on the right can be changed to a minus sign (-) and the three operands on the right can be in any order. There are also forms for interleaved addition and subtraction. 
Packed FMA instructions can perform eight single-precision FMA operations or four double-precision FMA operations with 256-bit vectors.<br /><br />FMA operations such as A = A * B + C are better than performing one step at a time, because intermediate results are treated as infinite precision, with rounding done on store, and thus are more accurate for computation. This single rounding is what gives the "fused" prefix. They are also faster than performing the computation in steps.<br /><br />Each instruction comes in three forms for the ordering of the operands A, B, and C, with the ordering corresponding to a three-digit extension: form <i>132</i> does A = AC + B, form <i>213</i> does A = BA + C, and form <i>231</i> does A = BC + A. The ordering number is just the order of the operands on the right side of the expression.<br /><br /><br />
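The operand orderings and the single-rounding behavior described above can be sketched in scalar C++. This is only an illustration under stated assumptions: it uses the standard library's <code>std::fma</code> as a stand-in for a fused instruction, and the function names (<code>fma132</code>, <code>fma213</code>, <code>fma231</code>, <code>fusedMinusStepwise</code>) are hypothetical, not part of any Intel API.<br /><br />

```cpp
#include <cmath>

// Operand orderings of the three FMA forms, written as scalar C++.
// Each function returns the new value of A. std::fma rounds only once,
// which is the property the "fused" instructions provide.
float fma132(float a, float b, float c) { return std::fma(a, c, b); } // A = A*C + B
float fma213(float a, float b, float c) { return std::fma(b, a, c); } // A = B*A + C
float fma231(float a, float b, float c) { return std::fma(b, c, a); } // A = B*C + A

// Difference between the fused result and the two-step result
// (multiply, round, then add). A nonzero value shows the precision
// that the intermediate rounding throws away.
double fusedMinusStepwise(double a, double b, double c)
{
    volatile double product = a * b;   // volatile forces the rounded intermediate
    double stepwise = product + c;
    double fused = std::fma(a, b, c);  // single rounding at the end
    return fused - stepwise;
}
```

For instance, with a = b = 1 + 2^-27 and c = -(1 + 2^-26), the two-step computation returns 0 because the 2^-54 term of the product is rounded away, whereas the fused computation keeps it.<br /><br />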
<h2 class="sectionHeading">Availability and Support</h2>
Detecting availability of the Intel® AVX features in hardware requires using the <code>CPUID</code> instruction to query support in the CPU and in the operating system, as detailed later. Second-generation Intel® Core™ processors (Intel® microarchitecture code name Sandy Bridge), released in Q1 2011, are the first from Intel supporting Intel® AVX technology. These processors will not have the new FMA instructions. For development and testing without hardware support, the free Intel® Software Development Emulator (see "For More Information" for a link) includes support for all these features, including Intel AVX, FMA, PCLMULQDQ, and AES instructions.<br /><br />To use the Intel AVX extensions reliably in most settings, the operating system must support saving and loading the new registers (with <code>XSAVE/XRSTOR</code>) on thread context switches to prevent data corruption. To help avoid such errors, operating systems supporting Intel AVX-aware context switches explicitly set a CPU bit enabling the new instructions; otherwise, an undefined opcode (<code>#UD</code>) exception is generated when Intel AVX instructions are used.<br /><br />Microsoft Windows* 7 with Service Pack 1 (SP1) and Microsoft Windows* Server 2008 R2 with SP1 (both 32- and 64-bit versions), and later versions of Windows*, support Intel AVX save and restore in thread and process switches. Linux* kernels from 2.6.30 (June 2009) and later support Intel AVX, as well.<br /><br /><br /><b><i>Detecting Availability and Support</i></b><br />Detecting support for the four areas (Intel® AVX, FMA, AES, and PCLMULQDQ) follows similar steps in each case: check for both hardware and operating system support for the desired feature (see Table 1). These steps are (counting bits starting at bit 0):<br /><br /><ol>
<li>Verify that the operating system supports <code>XGETBV</code> using <code>CPUID.1:ECX.OSXSAVE bit 27 = 1</code>.</li>
<li>At the same time, verify the hardware feature bits in <code>CPUID.1:ECX</code>: <code>bit 28=1</code> (Intel AVX supported) and/or <code>bit 25=1</code> (AES supported) and/or <code>bit 12=1</code> (FMA supported) and/or <code>bit 1=1</code> (PCLMULQDQ supported).</li>
<li>Issue <code>XGETBV</code>, and verify that bits 1 and 2 of the feature-enabled mask are both set (11b), meaning that XMM state and YMM state are enabled by the operating system.</li>
</ol><b>Table 1.</b> <i>Feature-detection Masks</i><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Feature</td>
<td >Bits to check</td>
<td >Constant</td>
</tr>
<tr >
<td >Intel® AVX</td>
<td >28, 27</td>
<td ><code>018000000H</code></td>
</tr>
<tr >
<td >VAES</td>
<td >28, 27, and 25</td>
<td ><code>01A000000H</code></td>
</tr>
<tr >
<td >VPCLMULQDQ</td>
<td >28, 27, and 1</td>
<td ><code>018000002H</code></td>
</tr>
<tr >
<td >FMA</td>
<td >28, 27, and 12</td>
<td ><code>018001000H</code></td>
</tr>
</tbody>
</table>
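<br />The constants in Table 1 follow directly from the bit positions in the middle column. As a quick check, the sketch below rebuilds them from those bit numbers; the names (<code>bit</code>, <code>kAvx</code>, and so on) are hypothetical and chosen only for this illustration.<br /><br />

```cpp
#include <cstdint>

// Build a CPUID.1:ECX mask from a bit position (bit 0 is the least significant).
constexpr std::uint32_t bit(unsigned n) { return std::uint32_t(1) << n; }

constexpr std::uint32_t kOsxsave    = bit(27);             // OS has enabled XSAVE/XGETBV
constexpr std::uint32_t kAvx        = bit(28) | kOsxsave;  // 018000000H
constexpr std::uint32_t kVaes       = bit(25) | kAvx;      // 01A000000H
constexpr std::uint32_t kVpclmulqdq = bit(1)  | kAvx;      // 018000002H
constexpr std::uint32_t kFma        = bit(12) | kAvx;      // 018001000H

// Compile-time confirmation that the bit positions reproduce Table 1.
static_assert(kAvx        == 0x18000000u, "AVX mask mismatch");
static_assert(kVaes       == 0x1A000000u, "VAES mask mismatch");
static_assert(kVpclmulqdq == 0x18000002u, "VPCLMULQDQ mask mismatch");
static_assert(kFma        == 0x18001000u, "FMA mask mismatch");
```

Every mask includes bits 27 and 28, so a single AND-and-compare against the mask checks the feature bit, AVX support, and OSXSAVE together.<br />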
<br />Example code implementing this process is provided in Listing 1, where the <code>CONSTANT</code> is the value from Table 1. A Microsoft* Visual Studio* C++ intrinsic version is given later.<br /><br /><b>Listing 1.</b> <i>Feature Detection</i><br />
<pre name="code" class="cpp">INT Supports_Feature()
   { 
   ; result returned in eax
   mov eax, 1
   cpuid
   and ecx, CONSTANT
   cmp ecx, CONSTANT   ; check desired feature flags
   jne not_supported 
   ; processor supports features
   mov ecx, 0          ; specify 0 for XFEATURE_ENABLED_MASK register
   xgetbv              ; result in EDX:EAX
   and eax, 06H
   cmp eax, 06H        ; check OS has enabled both XMM and YMM state support
   jne not_supported
   mov eax, 1          ; mark as supported
   jmp done
   not_supported:
   mov eax, 0          ; mark as not supported
   done:
   }
</pre>
<h2 class="sectionHeading">Usage</h2>
At the lowest programming level, most common x86 assemblers now support Intel® AVX, FMA, AES, and the VPCLMULQDQ instructions, including Microsoft MASM* (Microsoft Visual Studio* 2010 version), NASM*, FASM*, and YASM*. See their respective documentation for details.<br /><br />For language compilers, Intel® C++ Compiler version 11.1 and later and Intel® Fortran Compilers support Intel® AVX through compiler switches, and both compilers support automatic vectorization of floating-point loops. The Intel C++ Compiler supports Intel AVX intrinsics (use <code>#include &lt;immintrin.h&gt;</code> to access intrinsics) and inline assembly and even supports Intel AVX intrinsics emulation using <code>#include "avxintrin_emu.h"</code>.<br /><br />Microsoft Visual Studio* C++ 2010 with SP1 and later has support for Intel AVX (see "For More Information") when compiling 64-bit code (use the <code>/arch:AVX</code> compiler switch). It supports intrinsics using the <code>&lt;immintrin.h&gt;</code> header but not inline assembly. Intel AVX support is also in MASM*, the disassembly view of code, and the debugger views of registers (giving full YMM support).<br /><br />In the GNU Compiler Collection* (GCC*), version 4.4 supports Intel AVX intrinsics through the same header, <code>&lt;immintrin.h&gt;</code>. Other GNU toolchain support is found in Binutils 2.20.51.0.1 and later, gdb 6.8.50.20090915 and later, recent GNU Assembler (GAS) versions, and <code>objdump</code>. If your compiler does not support Intel AVX, you can emit the required bytes under many circumstances, but first-class support makes your life easier.<br /><br />Each of the three C++ compilers mentioned supports the same intrinsic operations to simplify using Intel® AVX from C or C++ code. <i>Intrinsics</i> are functions that the compiler replaces with the proper assembly instructions. Most Intel AVX intrinsic names use the following format:<br /><br />
<pre name="code" class="cpp">_mm256_op_suffix(data_type param1, data_type param2, data_type param3)
</pre>
where <code>_mm256</code> is the prefix for working on the new 256-bit registers; <code>_op</code> is the operation, like <code>add</code> for addition or <code>sub</code> for subtraction; and <code>_suffix</code> denotes the type of data to operate on, with the first letters denoting packed (p), extended packed (ep), or scalar (s). The remaining letters are the types in Table 2.<br /><br /><br /><b>Table 2.</b> <i>Intel® AVX Suffix Markings</i><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Marking</td>
<td >Meaning</td>
</tr>
<tr >
<td ><code>[s/d]</code></td>
<td >Single- or double-precision floating point</td>
</tr>
<tr >
<td ><code>[i/u]nnn</code></td>
<td >Signed or unsigned integer of bit size <i>nnn</i>, where <i>nnn</i> is 128, 64, 32, 16, or 8</td>
</tr>
<tr >
<td ><code>[ps/pd/sd]</code></td>
<td >Packed single, packed double, or scalar double</td>
</tr>
<tr >
<td ><code>epi32</code></td>
<td >Extended packed 32-bit signed integer</td>
</tr>
<tr >
<td ><code>si256</code></td>
<td >Scalar 256-bit integer</td>
</tr>
</tbody>
</table>
<br />Data types are in Table 3. The first two parameters are source registers, and the third parameter (when present) is an integer mask, selector, or offset value.<br /><br /><b>Table 3.</b> <i>Intel® AVX Intrinsics Data Types</i><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Type</td>
<td >Meaning</td>
</tr>
<tr >
<td ><code>__m256</code></td>
<td >256-bit as eight single-precision floating-point values, representing a YMM register or memory location</td>
</tr>
<tr >
<td ><code>__m256d</code></td>
<td >256-bit as four double-precision floating-point values, representing a YMM register or memory location</td>
</tr>
<tr >
<td ><code>__m256i</code></td>
<td >256-bit as integers, (bytes, words, etc.)</td>
</tr>
<tr >
<td ><code>__m128</code></td>
<td >128-bit single precision floating-point (32 bits each)</td>
</tr>
<tr >
<td ><code>__m128d</code></td>
<td >128-bit double precision floating-point (64 bits each)</td>
</tr>
</tbody>
</table>
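<br />To make the naming scheme concrete, here is a minimal sketch that adds two float arrays with <code>_mm256_loadu_ps</code>, <code>_mm256_add_ps</code>, and <code>_mm256_storeu_ps</code>. The function name <code>addFloats</code> is hypothetical. The sketch assumes the compiler defines <code>__AVX__</code> when Intel AVX code generation is enabled (for example with <code>/arch:AVX</code>) and falls back to a scalar loop otherwise, so it builds either way.<br /><br />

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Adds two float arrays element by element. When the compiler targets AVX,
// eight lanes are processed per iteration; the scalar loop handles the tail
// (or the whole array when AVX code generation is not enabled).
void addFloats(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8)
    {
        __m256 va  = _mm256_loadu_ps(a + i);  // prefix _mm256, op loadu, suffix ps
        __m256 vb  = _mm256_loadu_ps(b + i);
        __m256 sum = _mm256_add_ps(va, vb);   // eight packed single-precision adds
        _mm256_storeu_ps(out + i, sum);       // unaligned store of all eight lanes
    }
#endif
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store forms are used so that the caller's arrays need no particular alignment; the aligned forms (<code>_mm256_load_ps</code>, <code>_mm256_store_ps</code>) require 32-byte-aligned addresses.<br />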
<br />Some intrinsics are in other headers, such as the AES and PCLMULQDQ being in <code>&lt;wmmintrin.h&gt;</code>. Consult your compiler documentation or the web to track down where various intrinsics live.<br /><br /><br /><i><b>Microsoft Visual Studio* 2010</b></i><br />For conciseness, the rest of this article uses Microsoft Visual Studio* 2010 with SP1; similar code should work on the Intel® compiler or GCC*. Microsoft Visual Studio* 2010 with SP1 can automatically generate Intel® AVX code if you click <b>Project Properties &gt; Configuration &gt; Code Generation</b>, select <b>Not Set</b> under <b>Enable Enhanced Instruction Set</b>, and then manually add <code>/arch:AVX</code> to the command line under the <b>Command Line</b> entry. As an example of using intrinsics, Listing 2 offers an intrinsic-based Intel AVX feature-detection routine.<br /><br /><br /><b>Listing 2.</b> <i>Intrinsic-based Feature Detection</i><br />
<pre name="code" class="cpp">// get AVX intrinsics
#include &lt;immintrin.h&gt;
// get CPUID capability
#include &lt;intrin.h&gt;

// written for clarity, not conciseness
#define OSXSAVEFlag (1UL&lt;&lt;27)
#define AVXFlag     ((1UL&lt;&lt;28)|OSXSAVEFlag)
#define VAESFlag    ((1UL&lt;&lt;25)|AVXFlag|OSXSAVEFlag)
#define FMAFlag     ((1UL&lt;&lt;12)|AVXFlag|OSXSAVEFlag)
#define CLMULFlag   ((1UL&lt;&lt; 1)|AVXFlag|OSXSAVEFlag)
 
bool DetectFeature(unsigned int feature)
	{
	int CPUInfo[4];
	__cpuid(CPUInfo, 1);               // CPUID leaf 1: feature information
	unsigned int ECX = CPUInfo[2];     // the output of CPUID in the ECX register
	if ((ECX &amp; feature) != feature)    // missing feature
		return false; 
	unsigned __int64 val = _xgetbv(0); // read XFEATURE_ENABLED_MASK register
	if ((val&amp;6) != 6)                  // check OS has enabled both XMM and YMM support
		return false; 
	return true;
	}
</pre>
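The decision made in Listing 2 can also be separated from the hardware queries, which makes it testable on any machine. The sketch below is an assumption-laden illustration: <code>featureUsable</code> is a hypothetical helper that takes the raw <code>CPUID.1:ECX</code> value and the <code>XGETBV</code> result as parameters instead of reading them itself.<br /><br />

```cpp
#include <cstdint>

// Hypothetical helper: the same decision as Listing 2, with the raw register
// values passed in so the logic can run without executing CPUID or XGETBV.
bool featureUsable(std::uint32_t cpuidEcx, std::uint64_t xcr0, std::uint32_t featureMask)
{
    // All requested bits (feature, AVX, OSXSAVE) must be present in ECX.
    if ((cpuidEcx & featureMask) != featureMask)
        return false;
    // Bits 1 and 2 of the feature-enabled mask: XMM and YMM state enabled by the OS.
    const std::uint64_t xmmAndYmm = 0x6;
    return (xcr0 & xmmAndYmm) == xmmAndYmm;
}
```

A caller would pass <code>CPUInfo[2]</code> and the value returned by <code>_xgetbv(0)</code>; note that <code>XGETBV</code> should be issued only after the OSXSAVE bit has been confirmed, as in Listing 2.<br /><br />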
<h2 class="sectionHeading">Mandelbrot Example</h2>
To demonstrate using the new instructions, compute Mandelbrot set images using straight C/C++ code (checking to ensure that the compiler did not convert the code to Intel® AVX instructions!) and the new Intel AVX instructions as intrinsics, comparing their performance. A Mandelbrot set is a computationally intensive operation on complex numbers, defined in pseudocode as shown in Listing 3.<br /><br /><br /><b>Listing 3.</b> <i>Mandelbrot Pseudocode</i><br />
<pre name="code" class="cpp">z,p are complex numbers
for each point p on the complex plane
	z = 0
	for count = 0 to max_iterations
		if abs(z) &gt; 2.0
			break
		z = z*z+p
	set color at p based on count reached
</pre>
The usual image is over the portion of the complex plane in the rectangle (<code>-2,-1</code>) to (<code>1,1</code>). Coloring can be done in many ways (not covered here). Raise the maximum iteration count to zoom into portions and determine whether a value "escapes" over time.<br /><br />To really stress the CPU, zoom in and draw the box (<code>0.29768, 0.48364</code>) to (<code>0.29778, 0.48354</code>), computing the grid of counts at multiple sizes and using a max iteration of 4096. The resulting grid of counts, when colored appropriately, is shown in Figure 4.<br /><br />
<p ><img src="http://software.intel.com/file/37209" /></p>
<div ><b>Figure 4.</b> <i>Mandelbrot set (<b>0.29768, 0.48364</b>) to (<b>0.29778, 0.48354</b>), with max iterations of <b>4096</b></i><br /><br /></div>
<br />A basic C++ implementation to compute the iteration counts is provided in Listing 4. The absolute value of the complex number compared to 2 is replaced with the norm compared to 4.0, almost doubling the speed by removing a square root. For all versions, use single-precision floats to pack as many elements into the YMM registers as possible, which is faster but loses precision compared to doubles when zooming in further.<br /><br /><br /><b>Listing 4.</b> <i>Simple Mandelbrot C++ Code</i><br />
<pre name="code" class="cpp">// simple code to compute Mandelbrot in C++
#include &lt;complex&gt;
void MandelbrotCPU(float x1, float y1, float x2, float y2, 
                   int width, int height, int maxIters, unsigned short * image)
{
	float dx = (x2-x1)/width, dy = (y2-y1)/height;
	for (int j = 0; j &lt; height; ++j)
		for (int i = 0; i &lt; width; ++i)
		{
			std::complex&lt;float&gt; c (x1+dx*i, y1+dy*j), z(0,0);
			int count = -1;
			while ((++count &lt; maxIters) &amp;&amp; (std::norm(z) &lt; 4.0))
				z = z*z+c;
			*image++ = count;
		}
}
</pre>
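The complex arithmetic in Listing 4 can also be written out in plain floats, which makes the norm-versus-absolute-value trick explicit. The following is a minimal sketch for a single point, assuming the same loop semantics as Listing 4; the function name <code>mandelbrotCount</code> is hypothetical and this is not necessarily the code used for the measurements.<br /><br />

```cpp
// Iteration count for one point c = (cx, cy), matching the loop in Listing 4
// but with z = x + iy expanded into two floats.
int mandelbrotCount(float cx, float cy, int maxIters)
{
    float x = 0.0f, y = 0.0f;
    int count = 0;
    // Continue while |z|^2 < 4, which is the same test as |z| < 2
    // without the square root.
    while (count < maxIters && x * x + y * y < 4.0f)
    {
        float xNew = x * x - y * y + cx;  // real part of z*z + c
        y = 2.0f * x * y + cy;            // imaginary part of z*z + c
        x = xNew;
        ++count;
    }
    return count;
}
```

A point that never escapes, such as c = 0, returns <code>maxIters</code>; a point such as c = 2 leaves the radius-2 circle after a single iteration.<br /><br />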
Test multiple versions for performance: the basic one in Listing 4, a similar CPU version made by expanding the complex types with floats, an intrinsic-based SSE version, and an intrinsic-based Intel® AVX version shown in Listing 5. Each version is tested on image sizes of 128×128, 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096. With more work, the performance of each implementation could likely be improved while retaining its underlying instruction set constraints, but the results should be representative of what you can obtain.<br /><br />The Intel AVX version has been carefully crafted to fit as much as possible into the 16 YMM registers. To help track how you want them to be allocated, the variables are named <code>ymm0</code> through <code>ymm15</code>. Of course, the compiler allocates registers as it sees fit, but by being careful, you can try to make all computations stay in registers this way. (Actually, from looking at the disassembly, the compiler does not allocate them nicely, and recasting this in assembly code would be a good exercise for anyone learning Intel AVX.)<br /><br /><br /><b>Listing 5.</b> <i>Intel® AVX-intrinsic Mandelbrot Implementation</i><br />
<pre name="code" class="cpp">float dx = (x2-x1)/width;
float dy = (y2-y1)/height;
// round up width to next multiple of 8
int roundedWidth = (width+7) &amp; ~7UL; 
 
float constants[] = {dx, dy, x1, y1, 1.0f, 4.0f};
__m256 ymm0 = _mm256_broadcast_ss(constants);   // all dx
__m256 ymm1 = _mm256_broadcast_ss(constants+1); // all dy
__m256 ymm2 = _mm256_broadcast_ss(constants+2); // all x1
__m256 ymm3 = _mm256_broadcast_ss(constants+3); // all y1
__m256 ymm4 = _mm256_broadcast_ss(constants+4); // all 1's (iter increments)
__m256 ymm5 = _mm256_broadcast_ss(constants+5); // all 4's (comparisons)
 
float incr[8]={0.0f,1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f}; // used to reset the i position when j increases
__m256 ymm6 = _mm256_xor_ps(ymm0,ymm0); // zero out j counter (ymm0 is just a dummy)
 
for (int j = 0; j &lt; height; j+=1)
{
	__m256 ymm7  = _mm256_loadu_ps(incr);  // i counter set to 0,1,2,..,7 (unaligned load: incr is not declared 32-byte aligned)
	for (int i = 0; i &lt; roundedWidth; i+=8)
	{
		__m256 ymm8 = _mm256_mul_ps(ymm7, ymm0);  // x0 = (i+k)*dx 
		ymm8 = _mm256_add_ps(ymm8, ymm2);         // x0 = x1+(i+k)*dx
		__m256 ymm9 = _mm256_mul_ps(ymm6, ymm1);  // y0 = j*dy
		ymm9 = _mm256_add_ps(ymm9, ymm3);         // y0 = y1+j*dy
		__m256 ymm10 = _mm256_xor_ps(ymm0,ymm0);  // zero out iteration counter
		__m256 ymm11 = ymm10, ymm12 = ymm10;        // set initial xi=0, yi=0
 
		unsigned int test = 0;
		int iter = 0;
		do
		{
			__m256 ymm13 = _mm256_mul_ps(ymm11,ymm11); // xi*xi
			__m256 ymm14 = _mm256_mul_ps(ymm12,ymm12); // yi*yi
			__m256 ymm15 = _mm256_add_ps(ymm13,ymm14); // xi*xi+yi*yi
 			
			// xi*xi+yi*yi &lt; 4 in each slot
			ymm15 = _mm256_cmp_ps(ymm15,ymm5, _CMP_LT_OQ);        
			// now ymm15 has all 1s in the non overflowed locations
			test = _mm256_movemask_ps(ymm15)&amp;255;      // lower 8 bits are comparisons
			ymm15 = _mm256_and_ps(ymm15,ymm4);
			// get 1.0f or 0.0f in each field as counters
			// counters for each pixel iteration
			ymm10 = _mm256_add_ps(ymm10,ymm15);        
 
			ymm15 = _mm256_mul_ps(ymm11,ymm12);        // xi*yi 
			ymm11 = _mm256_sub_ps(ymm13,ymm14);        // xi*xi-yi*yi
			ymm11 = _mm256_add_ps(ymm11,ymm8);         // xi &lt;- xi*xi-yi*yi+x0 done!
			ymm12 = _mm256_add_ps(ymm15,ymm15);        // 2*xi*yi
			ymm12 = _mm256_add_ps(ymm12,ymm9);         // yi &lt;- 2*xi*yi+y0	
 
			++iter;
		} while ((test != 0) &amp;&amp; (iter &lt; maxIters));
 
		// convert iterations to output values
		__m256i ymm10i = _mm256_cvtps_epi32(ymm10);
 
		// write only where needed
		int top = (i+7) &lt; width? 8: width&amp;7;
		for (int k = 0; k &lt; top; ++k)
			image[i+k+j*width] = ymm10i.m256i_i16[2*k];
 
		// next i position - increment each slot by 8 (ymm5 holds 4.0f, so it is added twice)
		ymm7 = _mm256_add_ps(ymm7, ymm5);
		ymm7 = _mm256_add_ps(ymm7, ymm5);
	}
	ymm6 = _mm256_add_ps(ymm6,ymm4); // increment j counter
}
</pre>
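The compare, AND, and add sequence in the inner loop of Listing 5 is what lets every lane keep its own iteration count. The scalar emulation below is a sketch of that idiom for illustration only; <code>laneIncrement</code> and <code>activeLanes</code> are hypothetical names, and the code models the lane-wise behavior of the compare, AND, and move-mask steps rather than calling the intrinsics.<br /><br />

```cpp
#include <cstdint>
#include <cstring>

// One lane of the compare/AND idiom from Listing 5, emulated in scalar code.
// Returns 1.0f while the point is inside the radius-2 circle, else 0.0f.
float laneIncrement(float normSquared)
{
    // The packed compare writes all ones to a lane where the predicate holds
    // and all zeros otherwise.
    std::uint32_t mask = (normSquared < 4.0f) ? 0xFFFFFFFFu : 0u;
    const float one = 1.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &one, sizeof bits);   // bit pattern of 1.0f
    bits &= mask;                            // bitwise AND keeps 1.0f or yields +0.0f
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// Emulates the move-mask step for eight lanes: bit k is set while lane k
// has not escaped (the sign bit of an all-ones lane is 1).
unsigned activeLanes(const float normSquared[8])
{
    unsigned test = 0;
    for (int k = 0; k < 8; ++k)
        if (normSquared[k] < 4.0f)
            test |= 1u << k;
    return test;
}
```

Adding the per-lane 1.0f or 0.0f to the counter register advances only the lanes still iterating, and the loop can stop as soon as the mask of active lanes reaches zero.<br /><br />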
The full code for all versions and a Microsoft Visual Studio* 2010 with SP1 project, including a testing harness, is available from the links in the "For More Information" section.<br /><br />The results are shown in Figures 5 and 6. To prevent tying numbers too much to a specific CPU speed, Figure 5 shows performance of each version relative to the CPU version, which represents a straightforward non-SIMD C/C++ implementation of the algorithm. As expected, the Intel® SSE version performs almost 4 times as well, because it is doing 4 pixels per pass, and the Intel® AVX version performs almost 8 times as well as the CPU version. Because there is overhead from loops, memory access, less-than-perfect instruction ordering, and other factors, 4- and 8-fold improvements should be about the best possible, so this is pretty good for a first try.<br /><br /><img src="http://software.intel.com/file/37210" /><br /><b>Figure 5.</b> <i>Relative performance across sizes</i><br /><br />The second graph, Figure 6, shows that the pixels computed per millisecond are fairly constant over each size; again, the algorithms show an almost fourfold gain in performance from the CPU to the Intel® SSE version and another doubling from the Intel SSE to the Intel® AVX version.<br /><br /><img src="http://software.intel.com/file/37211" /><br /><b>Figure 6.</b> <i>Absolute performance across sizes</i><br /><br />
<h2 class="sectionHeading">Conclusion</h2>
This article provided a mid-level overview of the new Intel® Advanced Vector Extensions (Intel® AVX). These extensions are similar to previous Intel® SSE instructions but offer a much larger register space and add some new instructions. The Mandelbrot example shows performance gains over previous technology in the amount expected. For full details, be sure to check out the Intel Advanced Vector Extensions Programming Reference (see "For More Information" for a link).<br /><br />Happy hacking!<br /><br /><br />
<h2 class="sectionHeading">For More Information</h2>
Intel® Advanced Vector Extensions Programming Reference at <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a><br /><br />Federal Information Processing Standards Publication 197, "Announcing the Advanced Encryption Standard," at <a href="http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf" target="_blank">http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf</a><br /><br />The IEEE 754-2008 floating-point format standard at <a href="http://en.wikipedia.org/wiki/IEEE_754-2008" target="_blank">http://en.wikipedia.org/wiki/IEEE_754-2008</a><br /><br />Floating-Point Support for 64-Bit Drivers at <a href="http://msdn.microsoft.com/en-us/library/ff545910.aspx" target="_blank">http://msdn.microsoft.com/en-us/library/ff545910.aspx</a><br /><br />Wikipedia's entry on the Mandelbrot set at <a href="http://en.wikipedia.org/wiki/Mandelbrot_set" target="_blank">http://en.wikipedia.org/wiki/Mandelbrot_set</a><br /><br />Intel® Software Development Emulator at <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator">http://software.intel.com/en-us/articles/intel-software-development-emulator</a><br /><br />The complete Mandelbrot Intel® AVX implementation for download at <a href="http://www.lomont.org" target="_blank">http://www.lomont.org</a><br /><br /><br />
<h2 class="sectionHeading">About the Author</h2>
Chris Lomont works as a research engineer at Cybernet Systems, working on projects as diverse as quantum computing algorithms, image processing for NASA, developing security hardware for United States Homeland Security, and computer forensics. Before that, he obtained a PhD in math from Purdue and three bachelor's degrees in physics, math, and computer science; worked as a game programmer; and did brief stints in financial modeling, robotics, and various consulting roles. The rest of his time is spent hiking with his wife, watching movies, giving talks, recreational programming, doing math research, learning more physics, playing music, and performing various experiments. Visit his website www.lomont.org or his electronic gadget site <a href="http://www.hypnocube.com" target="_blank">www.hypnocube.com</a>.<br /><br /><br />
<h2 class="sectionHeading">Appendix A: Instruction Set Reference</h2>
Many instructions come in packed or scalar form, meaning that they work on multiple parallel elements or on a single element in the register, a distinction marked as <i>[P/S]</i>. Entry lengths come in double or single precision for floating-point (<i>doubles</i> and <i>singles</i>, for brevity), marked <i>[D/S]</i>, and in the integer forms byte, word, doubleword, and quadword, marked <i>[B/W/D/Q]</i>. Integer forms also sometimes come in signed or unsigned forms, marked <i>[S/U]</i>. Some instructions work on high or low portions of registers, marked as <i>[H/L]</i>; other optional components are in the tables. Instructions coming in Intel® SSE form and Intel® AVX form are prefixed with a (<i>V</i>) for the Intel® AVX form, allowing three operands and 256-bit register support. Entries in square brackets (<i>[]</i>) are required; entries in parentheses (<i>()</i>) are optional.<br /><br />Examples:<br /> 
<ul>
<li><code>(V)ADD[P/S][D/S]</code> is the addition of packed or scalar, double or single, with eight possible forms: <code>VADDPD, VADDPS, VADDSD, VADDSS</code>, and versions without the leading <code>V</code>.</li>
<li><code>(V)[MIN/MAX][P/S][D/S]</code> represents 16 different instructions for a <code>min</code> or <code>max</code> of packed or scalar of double or single precision.</li>
</ul>
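As a sanity check on the notation, the short sketch below expands <code>(V)ADD[P/S][D/S]</code> mechanically into its eight mnemonics. The function name <code>expandAdd</code> is hypothetical and exists only for this illustration.<br /><br />

```cpp
#include <string>
#include <vector>

// Expands the bracket notation (V)ADD[P/S][D/S] into concrete mnemonics.
std::vector<std::string> expandAdd()
{
    const char* prefixes[] = {"", "V"};   // (V): optional Intel AVX prefix
    const char* widths[]   = {"P", "S"};  // [P/S]: packed or scalar
    const char* types[]    = {"D", "S"};  // [D/S]: double or single precision
    std::vector<std::string> names;
    for (const char* v : prefixes)
        for (const char* w : widths)
            for (const char* t : types)
                names.push_back(std::string(v) + "ADD" + w + t);
    return names;
}
```

The result lists the four Intel SSE forms first (<code>ADDPD</code> through <code>ADDSS</code>), followed by the four V-prefixed Intel AVX forms.<br /><br />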
The next table lists the comparison types. VEX-prefixed instructions have 32 comparison types; non-VEX-prefixed comparisons allow only the eight flavors shown in parentheses. Each comparison type comes in multiple flavors, where <code>O</code> = ordered, <code>U</code> = unordered, <code>S</code> = signaling, and <code>Q</code> = non-signaling. Ordered/unordered tells whether the comparison is false or true if one operand is NaN (<i>Not-a-Number</i> in floating point, which results when a computation has no meaningful answer, such as 0 divided by 0 or the square root of a negative number). Signaling/non-signaling states whether an exception is fired when at least one operand is QNaN (<i>Quiet Not-a-Number</i>), which is useful for error trapping.<br /><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Type</td>
<td >Flavors</td>
<td >Meaning</td>
</tr>
<tr >
<td ><code>EQ</code></td>
<td ><code>(OQ), UQ, OS, US</code></td>
<td >Equal</td>
</tr>
<tr >
<td ><code>LT </code></td>
<td ><code>(OS), OQ</code></td>
<td >Less than</td>
</tr>
<tr >
<td ><code>LE </code></td>
<td ><code>(OS), OQ</code></td>
<td >Less than or equal to</td>
</tr>
<tr >
<td ><code>UNORD </code></td>
<td ><code>(Q), S</code></td>
<td >Tests for unordered (NaN)</td>
</tr>
<tr >
<td ><code>NEQ </code></td>
<td ><code>(UQ), US, OQ, OS</code></td>
<td >Not equal</td>
</tr>
<tr >
<td ><code>NLT </code></td>
<td ><code>(US), UQ</code></td>
<td >Not less than</td>
</tr>
<tr >
<td ><code>NLE </code></td>
<td ><code>(US), UQ</code></td>
<td >Not less than or equal to</td>
</tr>
<tr >
<td ><code>ORD </code></td>
<td ><code>(Q), S</code></td>
<td >Tests for ordered (not NaN)</td>
</tr>
<tr >
<td ><code>NGE </code></td>
<td ><code>US, UQ</code></td>
<td >Not greater than or equal to</td>
</tr>
<tr >
<td ><code>NGT </code></td>
<td ><code>US, UQ</code></td>
<td >Not greater than</td>
</tr>
<tr >
<td ><code>FALSE </code></td>
<td ><code>OQ, OS</code></td>
<td >Comparison is always false</td>
</tr>
<tr >
<td ><code>GE </code></td>
<td ><code>OS, OQ</code></td>
<td >Greater than or equal to</td>
</tr>
<tr >
<td ><code>GT </code></td>
<td ><code>OS, OQ</code></td>
<td >Greater than</td>
</tr>
<tr >
<td ><code>TRUE</code></td>
<td ><code>UQ, US</code></td>
<td >Comparison is always true</td>
</tr>
</tbody>
</table>
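<br />The ordered/unordered distinction is easiest to see with a NaN operand. The scalar functions below are a sketch that models only the true/false result of three predicates from the table, not their exception behavior; the function names are hypothetical.<br /><br />

```cpp
#include <cmath>
#include <limits>

// LT, ordered: false whenever either operand is NaN.
bool cmpLtOrdered(float a, float b) { return a < b; }

// NLT, unordered: true whenever either operand is NaN.
bool cmpNltUnordered(float a, float b) { return !(a < b); }

// UNORD: true exactly when at least one operand is NaN.
bool cmpUnord(float a, float b) { return std::isnan(a) || std::isnan(b); }

// Convenience accessor for a quiet NaN value.
float quietNaN() { return std::numeric_limits<float>::quiet_NaN(); }
```

For ordinary numbers, LT and NLT are exact opposites; the NaN case is where the two flavors differ, with the ordered form reporting false and the unordered form reporting true.<br />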
<br />Finally, here are all the Intel® AVX instructions:<br /><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Arithmetic</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)[ADD/SUB/MUL/DIV][P/S][D/S]</code></td>
<td >Add/subtract/multiply/divide packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)ADDSUBP[D/S]</code></td>
<td >Packed double/single add and subtract alternating indices</td>
</tr>
<tr >
<td ><code>(V)DPP[D/S]</code></td>
<td >Dot product, based on immediate mask</td>
</tr>
<tr >
<td ><code>(V)HADDP[D/S]</code></td>
<td >Horizontally add</td>
</tr>
<tr >
<td ><code>(V)[MIN/MAX][P/S][D/S]</code></td>
<td >Min/max packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)MOVMSKP[D/S]</code></td>
<td >Extract double/single sign mask</td>
</tr>
<tr >
<td ><code>(V)PMOVMSKB</code></td>
<td >Make a mask consisting of the most significant bits</td>
</tr>
<tr >
<td ><code>(V)MPSADBW</code></td>
<td >Multiple sum of absolute differences</td>
</tr>
<tr >
<td ><code>(V)PABS[B/W/D]</code></td>
<td >Packed absolute value on bytes/words/doublewords</td>
</tr>
<tr >
<td ><code>(V)P[ADD/SUB][B/W/D/Q]</code></td>
<td >Add/subtract packed bytes/words/doublewords/quadwords</td>
</tr>
<tr >
<td ><code>(V)PADD(U)S[B/W]</code></td>
<td >Add packed (un)signed bytes/words with (un)signed saturation</td>
</tr>
<tr >
<td ><code>(V)PAVG[B/W]</code></td>
<td >Average packed bytes/words</td>
</tr>
<tr >
<td ><code>(V)PCLMULQDQ</code></td>
<td >Carry-less multiplication quadword</td>
</tr>
<tr >
<td ><code>(V)PH[ADD/SUB][W/D]</code></td>
<td >Packed horizontal add/subtract word/doubleword</td>
</tr>
<tr >
<td ><code>(V)PH[ADD/SUB]SW</code></td>
<td >Packed horizontal add/subtract with saturation</td>
</tr>
<tr >
<td ><code>(V)PHMINPOSUW</code></td>
<td >Min horizontal unsigned word and position</td>
</tr>
<tr >
<td ><code>(V)PMADDWD</code></td>
<td >Multiply and add packed integers</td>
</tr>
<tr >
<td ><code>(V)PMADDUBSW</code></td>
<td >Multiply unsigned bytes and signed bytes into signed words</td>
</tr>
<tr >
<td ><code>(V)P[MIN/MAX][S/U][B/W/D]</code></td>
<td >Min/max of packed signed/unsigned integers</td>
</tr>
<tr >
<td ><code>(V)PMUL[H/L][S/U]W</code></td>
<td >Multiply packed signed/unsigned integers and store high/low result</td>
</tr>
<tr >
<td ><code>(V)PMULHRSW</code></td>
<td >Multiply packed signed words with round and scale, storing the high result</td>
</tr>
<tr >
<td ><code>(V)PMULHW</code></td>
<td >Multiply packed integers and store high result</td>
</tr>
<tr >
<td ><code>(V)PMULL[W/D]</code></td>
<td >Multiply packed integers and store low result</td>
</tr>
<tr >
<td ><code>(V)PMUL(U)DQ</code></td>
<td >Multiply packed (un)signed doubleword integers and store quadwords</td>
</tr>
<tr >
<td ><code>(V)PSADBW</code></td>
<td >Compute sum of absolute differences of unsigned bytes</td>
</tr>
<tr >
<td ><code>(V)PSIGN[B/W/D]</code></td>
<td >Change the sign on each element in one operand based on the sign in the other operand</td>
</tr>
<tr >
<td ><code>(V)PS[L/R]LDQ</code></td>
<td >Byte shift left/right amount in operand</td>
</tr>
<tr >
<td ><code>(V)PS[LL/RA/RL][W/D/Q]</code></td>
<td >Bit shift left/arithmetic right/logical right (the arithmetic right shift has no quadword form)</td>
</tr>
<tr >
<td ><code>(V)PSUB(U)S[B/W]</code></td>
<td >Packed (un)signed subtract with (un)signed saturation</td>
</tr>
<tr >
<td ><code>(V)RCP[P/S]S</code></td>
<td >Compute approximate reciprocal of packed/scalar single precision</td>
</tr>
<tr >
<td ><code>(V)RSQRT[P/S]S</code></td>
<td >Compute approximate reciprocal of square root of packed/scalar single precision</td>
</tr>
<tr >
<td ><code>(V)ROUND[P/S][D/S]</code></td>
<td >Round packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)SQRT[P/S][D/S]</code></td>
<td >Square root of packed/scalar double/single</td>
</tr>
<tr >
<td ><code>VZERO[ALL/UPPER]</code></td>
<td >Zero all/upper half of YMM registers</td>
</tr>
</tbody>
</table>
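<br />Saturating arithmetic matters for pixel data, where wrapping around would turn a bright value dark. As an illustration of what "with saturation" means in the table above, the sketch below models a single lane of the unsigned and signed saturating byte adds (<code>PADDUSB</code> and <code>PADDSB</code>) in scalar code; the function names are hypothetical.<br /><br />

```cpp
#include <algorithm>
#include <cstdint>

// One lane of an unsigned saturating byte add: clamps at 255 instead of wrapping.
std::uint8_t addSaturateU8(std::uint8_t a, std::uint8_t b)
{
    int sum = int(a) + int(b);
    return static_cast<std::uint8_t>(std::min(sum, 255));
}

// One lane of a signed saturating byte add: clamps to the range [-128, 127].
std::int8_t addSaturateS8(std::int8_t a, std::int8_t b)
{
    int sum = int(a) + int(b);
    return static_cast<std::int8_t>(std::max(-128, std::min(sum, 127)));
}
```

With wrapping arithmetic, 200 + 100 in an unsigned byte would give 44; the saturating form gives 255. The packed instructions apply the same rule to every byte lane at once.<br />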
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Comparison</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)CMP[P/S][D/S]</code></td>
<td >Compare packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)COMIS[S/D]</code></td>
<td >Compare scalar single/double, set EFLAGS</td>
</tr>
<tr >
<td ><code>(V)PCMP[EQ/GT][B/W/D/Q]</code></td>
<td >Compare packed integers for equality/greater than</td>
</tr>
<tr >
<td ><code>(V)PCMP[E/I]STR[I/M]</code></td>
<td >Compare explicit/implicit length strings, return index/mask</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Control</td>
<td >Description</td>
</tr>
<tr >
<td ><code>V[LD/ST]MXCSR</code></td>
<td >Load/store MXCSR control/status register</td>
</tr>
<tr >
<td ><code>XSAVEOPT</code></td>
<td >Save processor extended states optimized</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Conversion</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)CVTx2y</code></td>
<td >Convert type <i>x</i> to type <i>y</i>, where <i>x</i> and <i>y</i> are chosen from<br /><code>DQ and P[D/S],</code><br /><code>[P/S]S and [P/S]D, or</code><br /><code>S[D/S] and SI.</code></td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Load/store</td>
<td >Description</td>
</tr>
<tr >
<td ><code>VBROADCAST[SS/SD/F128]</code></td>
<td >Load with broadcast (loads single value into multiple locations)</td>
</tr>
<tr >
<td ><code>VEXTRACTF128</code></td>
<td >Extract 128-bit floating-point values</td>
</tr>
<tr >
<td ><code>(V)EXTRACTPS</code></td>
<td >Extract packed single precision</td>
</tr>
<tr >
<td ><code>VINSERTF128</code></td>
<td >Insert packed floating-point values</td>
</tr>
<tr >
<td ><code>(V)INSERTPS</code></td>
<td >Insert packed single-precision values</td>
</tr>
<tr >
<td ><code>(V)PINSR[B/W/D/Q]</code></td>
<td >Insert integer</td>
</tr>
<tr >
<td ><code>(V)LDDQU</code></td>
<td >Load unaligned 128-bit integer data</td>
</tr>
<tr >
<td ><code>(V)MASKMOVDQU</code></td>
<td >Store selected bytes of double quadword with <code>NT Hint</code></td>
</tr>
<tr >
<td ><code>VMASKMOVP[D/S]</code></td>
<td >Conditional SIMD packed load/store</td>
</tr>
<tr >
<td ><code>(V)MOV[A/U]P[D/S]</code></td>
<td >Move aligned/unaligned packed double/single</td>
</tr>
<tr >
<td ><code>(V)MOV[D/Q]</code></td>
<td >Move doubleword/quadword</td>
</tr>
<tr >
<td ><code>(V)MOVDQ[A/U]</code></td>
<td >Move double quadword aligned/unaligned</td>
</tr>
<tr >
<td ><code>(V)MOV[HL/LH]P[D/S]</code></td>
<td >Move high-to-low/low-to-high packed double/single</td>
</tr>
<tr >
<td ><code>(V)MOV[H/L]P[D/S]</code></td>
<td >Move high/low packed double/single</td>
</tr>
<tr >
<td ><code>(V)MOVNT[DQ/PD/PS]</code></td>
<td >Move packed integers/doubles/singles using a non-temporal hint</td>
</tr>
<tr >
<td ><code>(V)MOVNTDQA</code></td>
<td >Move packed integers using a non-temporal hint, aligned</td>
</tr>
<tr >
<td ><code>(V)MOVS[D/S]</code></td>
<td >Move or merge scalar double/single</td>
</tr>
<tr >
<td ><code>(V)MOVS[H/L]DUP</code></td>
<td >Move and duplicate odd/even indexed singles</td>
</tr>
<tr >
<td ><code>(V)PACK[U/S]S[WB/DW]</code></td>
<td >Pack words into bytes/doublewords into words with unsigned/signed saturation</td>
</tr>
<tr >
<td ><code>(V)PALIGNR</code></td>
<td >Byte align</td>
</tr>
<tr >
<td ><code>(V)PEXTR[B/W/D/Q]</code></td>
<td >Extract integer</td>
</tr>
<tr >
<td ><code>(V)PMOV[S/Z]X[B/W/D][W/D/Q]</code></td>
<td >Packed move with sign/zero extend (only up in length; <code>DD</code>, <code>DW</code>, etc. disallowed)</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Logical</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)[AND/ANDN/OR]P[D/S]</code></td>
<td >Bitwise logical <code>AND/AND NOT/OR</code> of packed double/single values</td>
</tr>
<tr >
<td ><code>(V)PAND(N)</code></td>
<td >Logical <code>AND (NOT)</code></td>
</tr>
<tr >
<td ><code>(V)P[OR/XOR]</code></td>
<td >Bitwise logical <code>OR</code>/exclusive <code>OR</code></td>
</tr>
<tr >
<td ><code>(V)PTEST</code></td>
<td >Packed bit test, set zero flag if bitwise <code>AND</code> is all <code>0</code></td>
</tr>
<tr >
<td ><code>(V)UCOMIS[D/S]</code></td>
<td >Unordered compare scalar doubles/singles and set <code>EFLAGS</code></td>
</tr>
<tr >
<td ><code>(V)XORP[D/S]</code></td>
<td >Bitwise logical <code>XOR</code> of packed double/single</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Shuffle</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)BLENDP[D/S]</code></td>
<td >Blend packed double/single; selects elements based on mask</td>
</tr>
<tr >
<td ><code>(V)BLENDVP[D/S]</code></td>
<td >Blend values</td>
</tr>
<tr >
<td ><code>(V)MOVDDUP</code></td>
<td >Copies even values to all values</td>
</tr>
<tr >
<td ><code>(V)PBLENDVB</code></td>
<td >Variable blend packed bytes</td>
</tr>
<tr >
<td ><code>(V)PBLENDW</code></td>
<td >Blend packed words</td>
</tr>
<tr >
<td ><code>VPERMILP[D/S]</code></td>
<td >Permute double/single values</td>
</tr>
<tr >
<td ><code>VPERM2F128</code></td>
<td >Permute floating-point values</td>
</tr>
<tr >
<td ><code>(V)PSHUF[B/D]</code></td>
<td >Shuffle packed bytes based on a control mask/doublewords based on immediate value</td>
</tr>
<tr >
<td ><code>(V)PSHUF[H/L]W</code></td>
<td >Shuffle packed high/low words</td>
</tr>
<tr >
<td ><code>(V)PUNPCK[H/L][BW/WD/DQ/QDQ]</code></td>
<td >Unpack high/low data</td>
</tr>
<tr >
<td ><code>(V)SHUFP[D/S]</code></td>
<td >Shuffle packed double/single</td>
</tr>
<tr >
<td ><code>(V)UNPCK[H/L]P[D/S]</code></td>
<td >Unpack and interleave packed/scalar doubles/singles</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >AES</td>
<td >Description</td>
</tr>
<tr >
<td ><code>AESENC/AESENCLAST</code></td>
<td >Perform one round/the last round of AES encryption</td>
</tr>
<tr >
<td ><code>AESDEC/AESDECLAST</code></td>
<td >Perform one round/the last round of AES decryption</td>
</tr>
<tr >
<td ><code>AESIMC</code></td>
<td >Perform the AES <code>InvMixColumn</code> transformation</td>
</tr>
<tr >
<td ><code>AESKEYGENASSIST</code></td>
<td >AES Round Key Generation Assist</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Future Instructions</td>
<td >Description</td>
</tr>
<tr >
<td ><code>[RD/WR][F/G]SBASE</code></td>
<td >Read/write FS/GS segment base register</td>
</tr>
<tr >
<td ><code>RDRAND</code></td>
<td >Read random number (into r16, r32, r64)</td>
</tr>
<tr >
<td ><code>VCVTPH2PS</code></td>
<td >Convert 16-bit floats to single precision floating-point values</td>
</tr>
<tr >
<td ><code>VCVTPS2PH</code></td>
<td >Convert single-precision values to 16-bit floating-point values</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >FMA</td>
<td >Each [z] is the string 132 or 213 or 231, giving the order the operands A,B,C are used in:<br />132 is A=AC+B<br />213 is A=AB+C<br />231 is A=BC+A</td>
</tr>
<tr >
<td ><code>VFMADD[z][P/S][D/S]</code></td>
<td >Fused multiply add A = r1 * r2 + r3 for packed/scalar of double/single</td>
</tr>
<tr >
<td ><code>VFMADDSUB[z]P[D/S]</code></td>
<td >Fused multiply alternating add/subtract of packed double/single A = r1 * r2 + r3 for odd index, A = r1 * r2-r3 for even</td>
</tr>
<tr >
<td ><code>VFMSUBADD[z]P[D/S]</code></td>
<td >Fused multiply alternating subtract/add of packed double/single A = r1 * r2-r3 for odd index, A = r1 * r2+r3 for even</td>
</tr>
<tr >
<td ><code>VFMSUB[z][P/S][D/S]</code></td>
<td >Fused multiply subtract A = r1 * r2-r3 of packed/scalar double/single</td>
</tr>
<tr >
<td ><code>VFNMADD[z][P/S][D/S]</code></td>
<td >Fused negative multiply add of packed/scalar double/single A = -r1 * r2+r3</td>
</tr>
<tr >
<td ><code>VFNMSUB[z][P/S][D/S]</code></td>
<td >Fused negative multiply subtract of packed/scalar double/single A = -r1 * r2-r3</td>
</tr>
</tbody>
</table>
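<br />The operand orderings listed in the FMA table can be illustrated with ordinary C arithmetic. The sketch below is only a model of the 132/213/231 naming: the function names are hypothetical, and it uses a plain multiply and add, so it does not guarantee the single rounding step of a real fused operation.<br /><br />

```c
/* Model of the three FMA operand orderings from the table above.
   Each function returns the new value of A.
   Note: this is ordinary C arithmetic; whether the multiply and add
   are fused is left to the compiler, so rounding may differ from a
   real FMA instruction. All names here are illustrative only. */

/* 132: A = A*C + B */
double fma132_model(double a, double b, double c) { return a * c + b; }

/* 213: A = A*B + C */
double fma213_model(double a, double b, double c) { return a * b + c; }

/* 231: A = B*C + A */
double fma231_model(double a, double b, double c) { return b * c + a; }
```

With A = 2, B = 3 and C = 4, the three orderings give 11, 10 and 14 respectively, which makes it easy to see which operand is multiplied and which is added.<br />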
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Chris Lomont</div>
</div>
<div id="vc-meta-pubdate">06-21-2011</div>
<div id="vc-meta-modificationdate">06-21-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Intel® SSE</div>
<div>Intel® AVX</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Intel® AVX improves upon Intel® SSE by adding 256-bit registers and additional instructions. This article shows how to detect Intel AVX, and deconstructs Intel AVX function names and explains Intel AVX intrinsics. It showcases a fractal calculation that achieves nearly 2x performance speedup relative to the Intel SSE version.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/</link>
      <pubDate>Tue, 21 Jun 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
    </item>
    <item>
      <title>Practical Intel® AVX Optimization on 2nd generation Intel® Core™ Processors</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/36196">Practical Intel® AVX Optimization on 2nd generation Intel® Core™ Processors</a> [PDF 672KB]<br /><br />
<h2 class="sectionHeading">Introduction</h2>
<i>The 2nd generation Intel® Core™ processor family (Intel® microarchitecture code name Sandy Bridge) offers a tremendous opportunity to boost application performance. To many developers, that opportunity may be hidden under the hood, because they might not know how to tap this performance potential. The first effective way to boost performance is to utilize multi-core CPUs. Another is Intel® Advanced Vector Extensions (Intel® AVX), the new vectorization capability for floating-point operations. Intel AVX is a new 256-bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE). It was released as part of the Intel microarchitecture code name Sandy Bridge and will be present in platforms ranging from notebooks to servers. The ability to work with 256-bit vectors enables operating on 8 float elements per iteration instead of a single element. This data parallelism can give a substantial performance boost to applications whose programmers know how to use it. This article describes a simple methodology for utilizing Intel AVX on the 2nd generation Intel Core processor family. It shares insights and methods gained during a joint project by Waves (<a href="http://www.waves.com/" target="_blank">http://www.waves.com/</a>) and the Intel Software and Services Group on optimizing a Linear Phase FIR Filter using Intel AVX. It focuses on how Intel AVX technology can be exploited to achieve a significant performance boost, and includes concrete code examples to demonstrate the optimization.</i><br /><br /><br />
<h2 class="sectionHeading">Identify your hotspots</h2>
A hotspot is a place in an application where the CPU spends a significant amount of time. The first step of optimization work must be to find the hotspots. Don't skip this stage; otherwise you might optimize parts of the code that have almost no impact on total application performance. Shortly before the introduction of the 2nd generation Intel® Core™ microprocessor, Intel released <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">Intel® VTune™ Amplifier XE 2011</a> as part of the <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-xe/">Intel® Parallel Studio XE 2011</a> tool suite. Intel Parallel Studio XE includes all the capabilities of Intel® Parallel Amplifier, plus a number of advanced capabilities for developers who need to dig deeper. Intel VTune Amplifier XE can pinpoint your hotspots, provide insights on CPU utilization, and show you where your parallelism is poor. There is no longer any need to guess where the performance bottlenecks in the application are.<br /><br />One important way to optimize a hotspot is multi-threading. If your desktop CPU has 4 cores (Quad Core) and supports Intel® Hyper-Threading Technology, you can run 8 simultaneous threads. If your application is still single-threaded, you have a theoretical opportunity to boost your performance by up to 8x. Learn more about parallel programming at <a href="http://software.intel.com/en-us/parallel/">http://software.intel.com/en-us/parallel/</a>.<br /><br />A new capability introduced with the 2nd generation Intel® Core™ processor family is <i>Intel® Advanced Vector Extensions</i> (Intel® AVX). This article describes how to use Intel AVX to improve performance.<br /><br /><br />
<h2 class="sectionHeading">SIMD (Single Instruction, Multiple Data) with Intel® AVX implementation</h2>
Before Intel microarchitecture code name Sandy Bridge, SIMD vectorization was provided by Intel® Streaming SIMD Extensions (Intel® SSE). Intel SSE provides fine-grained data-level parallelism on the x86 architecture. Intel SSE instructions use eight 128-bit registers, XMM0 through XMM7 (sixteen, XMM0 through XMM15, in 64-bit mode), into which data of a uniform type can be packed. Intel SSE instructions operate on all packed data items in parallel. With 128-bit vectors, Intel SSE enables operating on 4 float elements per iteration instead of a single element.<br /><br /><img src="http://software.intel.com/file/36028" /><br /><b>Figure 1:</b> <i>Going from Scalar to Intel® SSE can provide up to 4x faster performance</i><br /><br /><br />Intel® AVX brings new vectorization capabilities for floating-point operations. The new ability to work with 256-bit vectors enables operating on 8 float elements per iteration instead of a single element.<br /><br /><img src="http://software.intel.com/file/36029" /><br /><b>Figure 2:</b> <i>Going from Scalar to Intel® AVX can provide up to 8x faster performance</i><br /><br /><br />An example makes this easier to understand. The following is a representative inner loop from a Linear Phase FIR Filter, which is often a hotspot. Let's implement it with Intel SSE first and then with Intel AVX. In this case, it was possible to obtain up to a 2.1x increase in speed when using Intel AVX over Intel SSE.<br /><br /><br />
<h2 class="sectionHeading">Implementing A Linear Phase FIR Filter using Intel® AVX</h2>
The following code example demonstrates a specific linear phase audio FIR filter. A FIR (Finite Impulse Response) filter is a vector of filter tap values that is multiplied and accumulated (convolved) with the history of input signal samples; it may be used, for example, to achieve an EQ effect on an audio signal.<br /><br />A linear phase filter is characterized by a symmetric shape of the filter tap values. Since the filter length is odd, the symmetry is around a single center tap. In this example, we use a double-precision floating-point implementation. Because a floating-point sum loses accuracy when a large value is added to a much smaller one, it is more accurate to sum the values in order of expected magnitude, from the smallest to the largest. In the filter implementation, this is approximately what happens when we sum the values from the edges towards the center. Given the symmetric nature of the filter, it is better to read only the first half of the filter taps, since the second half mirrors the first.<br /><br />Filtering an audio signal with a FIR filter requires the convolution of the audio signal with the FIR filter.<br />The convolution is implemented by iteration, one iteration per input sample: for each input sample, N filter taps are multiplied element by element with the last N input signal samples, and the products are summed into an accumulator. Processors of input signals of unbounded length (such as in real-time processing) must use a FIFO memory buffer that retains old input samples.<br /><br />The FIFO (First-In-First-Out) buffer is implemented using an array and a mask. The mask has the value 2^M-1 (e.g., 0x00FFF) and is used to compute the FIFO access index with a bitwise AND operation rather than a conditional test, because the bitwise AND is faster.<br /><br />The basic algorithm works like this:<br /><br />
<div >For each input signal sample:<br />
<div >A. Insert the sample into the buffer and advance the read/write offsets using a mask operation.<br />B. Loop over half of the filter length:<br />
<div >1. Load 4 filter taps.<br />2. Load 4 input samples from the buffer.<br />3. Multiply (1) by (2) and accumulate.<br />4. Mirror the 4 filter taps.<br />5. Load 4 new input samples from the buffer.<br />6. Multiply (4) by (5) and accumulate.<br />7. Advance the memory buffer access indexes.</div>
<br />C. Multiply and accumulate center tap.<br /></div>
</div>
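<br />The steps above can be sketched in scalar C as follows. This is a minimal illustration under stated assumptions, not the Waves implementation: the names <code>Fifo</code>, <code>fifo_push</code>, <code>fifo_get</code> and <code>fir_linear_phase</code> and the tiny buffer size are hypothetical, and the loop handles one tap per iteration instead of four.<br /><br />

```c
#define FIFO_SIZE 8u                /* hypothetical size; must be 2^M      */
#define FIFO_MASK (FIFO_SIZE - 1u)  /* 2^M - 1, here 0x7                   */

typedef struct {
    double   samples[FIFO_SIZE];
    unsigned write;                 /* index of the next slot to write     */
} Fifo;

/* Step A: store the new sample and advance the index with a mask
   instead of an "if (index == size) index = 0" test. */
void fifo_push(Fifo *f, double x)
{
    f->samples[f->write] = x;
    f->write = (f->write + 1u) & FIFO_MASK;
}

/* Sample that was pushed `age` pushes ago; age 0 is the newest one.
   Unsigned wrap-around plus the mask keeps the index in range. */
double fifo_get(const Fifo *f, unsigned age)
{
    return f->samples[(f->write - 1u - age) & FIFO_MASK];
}

/* Steps B and C for a symmetric filter of length 2*half + 1
   (which must not exceed FIFO_SIZE). taps[0..half-1] holds the
   first half of the filter; `center` is the middle tap. */
double fir_linear_phase(const Fifo *f, const double *taps,
                        unsigned half, double center)
{
    const unsigned n = 2u * half + 1u;
    double acc1 = 0.0, acc2 = 0.0;

    for (unsigned j = 0; j < half; j++) {
        acc1 += fifo_get(f, j) * taps[j];          /* forward half        */
        acc2 += fifo_get(f, n - 1u - j) * taps[j]; /* mirrored half       */
    }
    return acc1 + acc2 + center * fifo_get(f, half); /* step C           */
}
```

Because the second half of the taps mirrors the first, each coefficient is read once and applied to two samples, one from each end of the history window.<br />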
<br />Note: <i>Data type is <b>Double (64-bit)</b></i><br /><br /><b><span >Original C implementation</span></b><br />The code example below shows only the implementation of the internal loop (B). It loads one filter tap per iteration.<br /><br /><img src="http://software.intel.com/file/36030" /><br /><br /><code>
<pre><span > for (int j=0 ; j&lt;firHalfLength; j++)  // firHalfLength is 1023
 {
	dFirCoefs = pFIRBuf[j]; //load one filter tap
	acc1 += pDllBuf[lFirIndex]*dFirCoefs; //acc1 is accumulator for Index
	acc2 += pDllBuf[lFirIndexRev]*dFirCoefs; //acc2 is accumulator for IndexRev
	lFirIndex =(lFirIndex-1)&amp;lMask; //dec backward index (modulo operation)
	lFirIndexRev = (lFirIndexRev+1)&amp;lMask;
 }
</span></pre>
</code><i>Note: In this example, firHalfLength equals 1023, which sets the number of iterations of the internal loop. In addition, the internal loop was nested in an external loop of 1,000,000 iterations in order to get meaningful numbers in the standalone measurements.</i><br /><br /><br />The original C code above runs in ~0.63 sec on an Intel® Core™ i7-2820QM CPU @2.3 GHz with 4 GB of RAM, running Windows* 7 Ultimate with Service Pack 1.<br /><br /><br /><b><span >Intel® SSE 128-bit implementation</span></b><br />Converted to Intel SSE, the code loads two filter taps per iteration. The new code looks like this:<br /><br /><img src="http://software.intel.com/file/36031" /><br /><br /><code>
<pre><span >__m128d DllVal, FIRCoef, mulVal;

for (int i = 0; i &lt; firHalfLength; <span >i += 2</span>) //Operate on 2 elements at a time 
{
	FIRCoef = _mm_load_pd(pFIRBuf+i);
			
	//acc1
	DllVal = _mm_load_pd(pDllBuf+lFIRIndexRev);
	mulVal = _mm_mul_pd(FIRCoef, DllVal);
	acc1 = _mm_add_pd(acc1, mulVal);

	//acc2
	DllVal = _mm_load_pd(pDllBuf+lFIRIndex);
	DllVal = _mm_shuffle_pd(DllVal, DllVal, 0x1);
	mulVal = _mm_mul_pd(FIRCoef, DllVal);
	acc2 = _mm_add_pd(acc2, mulVal);
		
	lFIRIndex -= <span >2</span>;
	lFIRIndex = (lFIRIndex &amp; lMask);
	lFIRIndexRev += <span >2</span>;
	lFIRIndexRev = (lFIRIndexRev &amp; lMask);
}
</span></pre>
</code>The SSE 128-bit code runs in ~0.40 sec on the exact same hardware as used to run the original code. This is already a speedup of 1.58x over the original code. However, Intel AVX can provide an additional speedup!<br /><br /><b><span >Intel® AVX Implementation</span></b><br /><br /><img src="http://software.intel.com/file/36032" /><br /><b>Figure 3:</b> <i>Benefits of Intel® AVX.</i><br /><br />This implementation uses basic Intel AVX intrinsics. Intel AVX 256-bit vectors double the number of elements processed per loop iteration, so the loop operates on 4 elements at a time. The main difference between the Intel SSE intrinsics and the new Intel AVX intrinsics is visible in the names, e.g., _mm_load_pd → _mm256_load_pd. Unlike Intel SSE instructions, Intel AVX instructions can take up to 4 operands. In addition, there is a different register type, __m128 → __m256, with a different width (128-bit → 256-bit). The example uses basic Intel AVX intrinsics such as _mm256_mul_pd and _mm256_add_pd.<br /><br /><b><span >Intel® AVX - Arranging the Data</span></b><br /><br />In this example we also use the "cross-lane" shuffle "<span >_mm256_permute2f128_pd</span>" to arrange the data and reduce the number of loads. The "lane" concept is discussed in more detail below. Since the index is decremented by 4 in each iteration, we need to arrange each group of 4 loaded elements in reverse order, as described below:<br /><br /><img src="http://software.intel.com/file/36033" /><br /><b><span ><br /><br />Intel® AVX - Implementation</span></b><br /><br /><code>
<pre><span >__m256d DllVal, FIRCoef, mulVal;
__m128d tmph,tmpl,tmplsh,tmphsh;

for (int i = 0; i &lt; firHalfLength; <span >i += 4</span>) //Operate on 4 elements at a time
{
	
	FIRCoef = _mm256_load_pd(pFIRBuf+i);

	//acc1
	DllVal = _mm256_load_pd(pDllBuf+lFIRIndexRev);
	mulVal = _mm256_mul_pd(FIRCoef, DllVal);
	acc1 = _mm256_add_pd(acc1, mulVal);

	//acc2
	DllVal = _mm256_load_pd(pDllBuf+lFIRIndex);
	DllVal = _mm256_permute2f128_pd(DllVal, DllVal, 0x1); // Cross-lane shuffle: swap the two 128-bit lanes
	DllVal = _mm256_permute_pd(DllVal, 0x5); // In-lane shuffle: swap the two doubles within each lane
	mulVal = _mm256_mul_pd(FIRCoef, DllVal);
	acc2 = _mm256_add_pd(acc2, mulVal);

	lFIRIndex -= <span >4</span>;
	lFIRIndex = (lFIRIndex &amp; lMask);
	lFIRIndexRev += <span >4</span>;
	lFIRIndexRev = (lFIRIndexRev &amp; lMask);
}
</span></pre>
</code>The Intel® AVX 256-bit code runs in ~0.19 sec on the exact same hardware as used to run the original code.<br />Thus, the Intel AVX code shows a 2.1x improvement over the Intel SSE code.<br /><br /><b><span >Intel® AVX Lanes</span></b><br />Intel AVX implements 256-bit registers, divided into two lanes of 128 bits: a low lane and a high lane. Most of the 256-bit Intel AVX instructions are defined as in-lane: the destination elements in each lane are calculated using source elements only from the same lane. There are only a few cross-lane instructions, such as the "cross-lane" shuffle "<span >_mm256_permute2f128_pd</span>" we used in the example.<br /><br /><img src="http://software.intel.com/file/36034" /><br /><b>Figure 4:</b> <i>Use of Intel® AVX lanes</i><br /><br /><br /><b><span >Strided Load</span></b><br />The strided load technique is a programming method that uses Intel AVX instructions and is useful for algorithms that would otherwise need unsupported cross-lane shuffles. It arranges the data so that cross-lane shuffles are avoided. The main idea is to use 128-bit loads in a way that mimics the corresponding Intel SSE algorithm, so that the 256-bit Intel AVX instructions execute iteration i of the loop in the low lanes and iteration i+1 in the high lanes. <b>This method eases the porting of Intel SSE algorithms, since 128 bits is still a natural data type for many algorithms, such as RGBA in single precision and Real and Imaginary in double precision complex.</b> Intrinsics such as _mm_shuffle_pd, _mm256_insertf128_pd and _mm256_castpd128_pd256 can be used to arrange the data and to convert between 128-bit and 256-bit widths. See examples of strided load in the presentation: <a href="http://software.intel.com/file/29515">http://software.intel.com/file/29515</a> slide #59.<br /><br /><br />
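The lane behavior and the strided-load idea described above can be sketched with two small helpers. This is a hedged illustration only: the function names <code>reverse4</code> and <code>load_two_lanes</code> are hypothetical, unaligned loads are used so that any pointer is accepted, and a scalar fallback is included (an assumption of this sketch) so that the code also builds when the compiler is not targeting Intel AVX.<br /><br />

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Writes src[3], src[2], src[1], src[0] to dst (dst must not overlap src).
   Mirrors the two-step reversal used in the AVX FIR loop. */
void reverse4(const double *src, double *dst)
{
#if defined(__AVX__)
    __m256d v = _mm256_loadu_pd(src);        /* [s0 s1 | s2 s3]             */
    v = _mm256_permute2f128_pd(v, v, 0x1);   /* cross-lane: [s2 s3 | s0 s1] */
    v = _mm256_permute_pd(v, 0x5);           /* in-lane:    [s3 s2 | s1 s0] */
    _mm256_storeu_pd(dst, v);
#else
    for (size_t i = 0; i < 4; i++)           /* scalar fallback             */
        dst[i] = src[3 - i];
#endif
}

/* Strided-load style: build one 256-bit vector from two 128-bit loads,
   `lo` into the low lane and `hi` into the high lane, so that no
   cross-lane shuffle is needed afterwards. */
void load_two_lanes(const double *lo, const double *hi, double *dst)
{
#if defined(__AVX__)
    __m256d v = _mm256_castpd128_pd256(_mm_loadu_pd(lo)); /* low lane  */
    v = _mm256_insertf128_pd(v, _mm_loadu_pd(hi), 1);     /* high lane */
    _mm256_storeu_pd(dst, v);
#else
    dst[0] = lo[0]; dst[1] = lo[1];          /* scalar fallback             */
    dst[2] = hi[0]; dst[3] = hi[1];
#endif
}
```

In a real loop, <code>lo</code> and <code>hi</code> would point at the data for iteration i and iteration i+1, and the result would stay in a register rather than being stored back to memory.<br /><br /><br />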
<h2 class="sectionHeading">Cache alignment</h2>
Aligning data to the vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve the best results, use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. Using Intel AVX instructions on unaligned 32-byte vectors means that every second load will straddle a cache-line boundary, since the cache line is 64 bytes. This doubles the cache-line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to <b>align the data to 32 bytes</b> for use with Intel AVX.<br /><br /><img src="http://software.intel.com/file/36035" /><br /><b>Figure 5:</b> <i>Cache Line Split When Address is not 32-byte Aligned</i><br /><br />It is important to make every effort to align the data to 32 bytes in advance to avoid potential performance degradation. However, there are instances where the programmer is not able to force the data structure to be 32-byte aligned. For example, this can happen when working with third-party code. In cases where 32-byte alignment is not possible, change the load instruction to the unaligned load "_mm256_loadu_pd". This works, but it is clearly a compromise on performance. When we ran the sample Intel® AVX implementation of the linear phase audio FIR filter, we noticed a performance degradation of more than 20% when working with misaligned data and using the <code>loadu</code> and <code>storeu</code> instructions.<br /><br /><br />
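The statement that unaligned 32-byte loads split a cache line twice as often as unaligned 16-byte loads can be checked with simple address arithmetic. The following is a minimal sketch assuming the 64-byte cache line mentioned above; the helper names are hypothetical and nothing here touches real memory.<br /><br />

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* cache-line size in bytes, as stated in the text */

/* Returns 1 if a load of `width` bytes starting at `addr`
   touches two cache lines, 0 otherwise. */
int splits_cache_line(uintptr_t addr, size_t width)
{
    return (addr % CACHE_LINE) + width > CACHE_LINE;
}

/* Counts the split loads among `count` back-to-back loads of
   `width` bytes, the first one starting at address `start`. */
size_t count_splits(uintptr_t start, size_t width, size_t count)
{
    size_t splits = 0;
    for (size_t i = 0; i < count; i++)
        splits += (size_t)splits_cache_line(start + i * width, width);
    return splits;
}

/* Returns 1 if `p` is 32-byte aligned, i.e. suitable for
   _mm256_load_pd rather than _mm256_loadu_pd. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}
```

For eight consecutive loads, <code>count_splits(16, 32, 8)</code> reports 4 splits (every second 32-byte load), while <code>count_splits(8, 16, 8)</code> reports 2 (every fourth 16-byte load), and a 32-byte aligned start address gives none. A check like <code>is_aligned_32</code> could be used to choose between the aligned and unaligned load paths at run time.<br /><br /><br />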
<h2 class="sectionHeading">Intel® Compilers</h2>
The Intel® C++ Compiler version 11.1 supports Intel® AVX on both Windows* and Linux*. In many cases (if there are not too many dependencies) it can generate Intel AVX instructions automatically. The Intel Composer XE (Intel® C++ Compiler version 12), part of Intel® Parallel Studio XE, has additional improvements for Intel AVX. On Windows*, use the command line switch /QxAVX. On Linux*, use -xavx. The switches /QaxAVX (Windows*) and -axavx (Linux*) may be used to build applications that take advantage of Intel AVX instructions on Intel® processor-based systems that support them, but use only Intel SSE instructions on other systems.<br /><br />The Intel® C++ Compiler supports Intel AVX-based intrinsics via the header file immintrin.h. To see which loops have been vectorized, use the switch /Qvec-report1 (Windows*) or -vec-report1 (Linux*).<br /><br /><br />
<h2 class="sectionHeading">Important notice for using Intel® AVX:</h2>
<ol>
<li>In order to take advantage of Intel AVX on Windows* 7, Service Pack 1 must have been installed. Microsoft* Windows* 7 SP1 Beta is available from Microsoft at: <a href="http://technet.microsoft.com/hi-in/evalcenter/ff183870.aspx" target="_blank">http://technet.microsoft.com/hi-in/evalcenter/ff183870.aspx</a> Upon upgrade to Windows* 7 SP1, Intel AVX optimized software will automatically "opt-in" and take advantage of Intel AVX.</li>
<li>The Intel® Integrated Performance Primitives (Intel® IPP) library has been optimized for a variety of SIMD instruction sets. Automatic "dispatching" detects the SIMD instruction set that is available on the running processor and selects the optimal SIMD instructions for that processor. It is recommended to review Intel® IPP functions which are already optimized for Intel AVX. This can speed up the optimization and the development cycle. Below is a link for Intel IPP version 7 supporting Intel AVX that may be useful for optimization: <a href="http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/">http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/</a></li>
<li>Download the Intrinsics Guide for Intel AVX from <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a>. It is an interactive reference tool that allows programmers to search, sort and reference all of Intel's SIMD instructions. The guide also provides descriptions for each instruction, data type support as well as the intrinsic mnemonics.</li>
<li>See additional Intel AVX papers at <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a>.</li>
</ol><br />
<h2 class="sectionHeading">Summary and Results</h2>
Intel® AVX provides an opportunity to boost application performance on 2nd generation Intel® Core™ processor-based platforms. First, the main hotspots of the application must be identified. Next, Intel AVX vectorization provides a per-core optimization opportunity. This optimization has the greatest impact when Intel AVX is used on all cores/threads. Before coding directly with Intel AVX, it is recommended to review the wide variety of relevant Intel® IPP version 7 functions, which are already optimized for many existing processors and can speed up the development cycle.<br /><br />Waves' Director of Research and Development, Itai Neoran, said of AVX: "<i>Similar double-precision FIR filters are widely used by Waves in real-time sound processing applications, in both music production studios and live-concert sound-processing. Since extensive sound processing is needed on each and every instrumental/vocal track, and since typical applications run hundreds such tracks in real-time, AVX parallelization allows sound engineers worldwide to use more simultaneous tracks, which until now could only be achieved with dedicated DSP hardware</i>."<br /><br />In this article we showed how to use direct Intel AVX coding on a simple loop that was identified as a hotspot. Moving from the original C implementation through Intel® SSE 128-bit code to the 256-bit Intel AVX implementation on Intel® microarchitecture code name Sandy Bridge reduced the run time from ~0.63 sec to ~0.19 sec, about 3.3x overall. The step from Intel SSE 128-bit code to Intel AVX 256-bit code alone gave a 2.1x improvement for the hotspot, which enabled the speedup of the entire FIR Filter application from Waves.<br /><br /> 
<table cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody>
<tr>
<td >Original C code</td>
<td >SSE 128 bit code</td>
<td >AVX 256 bit code</td>
</tr>
<tr>
<td >~0.63 sec</td>
<td >~0.40 sec</td>
<td >~0.19 sec</td>
</tr>
</tbody>
</table>
<b>Table 1:</b><i> Intel® AVX code vs. Intel® SSE: 2.1x Improvement</i><br /><br /><br />
<h2 class="sectionHeading">About the Authors</h2>
<b><span >Guy Ben-Haim:</span></b><br />Guy Ben-Haim is a senior application engineer at Intel Corporation in the EMEA Software and Services Group (SSG). Guy works on optimizing applications to take advantage of the latest Intel software and hardware innovations. Guy provides technical training, consultation, and hands-on assistance to software developers in the areas of software optimization and parallel programming. Guy has been working at Intel for the last 10 years as a senior CPU designer, where he participated in the development of the first dual core processor and other Centrino mobile processors. Prior to Intel, Guy worked for several startup companies in the fields of communication, computer hardware and semiconductors. Guy holds a B.Sc. from Tel Aviv University and an M.B.A. from Technion - Israel Institute of Technology.<br /><br /><br /><b><span >Itai Neoran:</span></b><br />Itai Neoran is the director of research and development at Waves Audio Ltd. (Waves), Israel, a Grammy-awarded maker of software and hardware tools for digital audio processing in the music industry. For the last 17 years, Itai has led a group of researchers and engineers who develop audio processing algorithms for professional and home users, and implement them in real-time, particularly on Intel CPUs. Itai also develops methods for accelerating audio algorithms using parallel computation such as SSE and AVX, while keeping high bit accuracy and low input-to-output latency. Prior to Waves, Itai worked as a researcher in several companies employing signal processing and image processing algorithms. Itai holds an M.Sc. in Electrical Engineering from the Technion Institute, Israel, and a B.Sc. in Computer Science from The Hebrew University of Jerusalem.<br /><br /><br /><b><span >Ishay Tubi:</span></b><br />Ishay Tubi is a senior DSP engineer at Waves Audio Ltd., Israel. For the past 5 years, Ishay has been developing professional high-end audio signal processing algorithms. 
His command of FIR filters and Intel's parallel programming facilitates acceleration of various Waves Audio Plugins. Ishay holds a B.Sc. degree in computer science from Tel-Hai Academic College and has 15 years of experience in computer programming. Ishay is also the CTO and co-founder of MixBox Audio, an iOS interactive music applications software company whose goal is to enable consumers without a musical background to mix and play music. <br />
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Guy Ben Haim</div>
</div>
<div id="vc-meta-pubdate">05-09-2011</div>
<div id="vc-meta-modificationdate">05-09-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div>Intel® VTune™ Amplifier XE</div>
</div>
<div id="vc-meta-category">
<div>Performance Analysis</div>
<div>Intel® AVX</div>
<div>Intel® SSE</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">With the release of the 2nd generation Intel® Core™ processors (Intel® microarchitecture code name Sandy Bridge), there is a tremendous opportunity to boost application performance. This article discusses how one can identify hotspots in their applications and optimize them using SIMD with Intel® AVX implementation.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors/</link>
      <pubDate>Mon, 09 May 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Media</category>
    </item>
    <item>
      <title>Developing Intel® AVX Optimized  Microsoft* Real-Time Audio (MSRTA) Codec using Intel® IPP</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/35651">Developing Intel® AVX Optimized Microsoft* Real-Time Audio (MSRTA) Codec using Intel® IPP</a> [PDF 284KB]<br /><br />
<h2 class="sectionHeading">Introduction</h2>
This white paper describes how Intel® Integrated Performance Primitives (Intel® IPP) can provide the building blocks to develop the Microsoft* Real Time Audio (MSRTA) codec on the latest Intel® microarchitecture, code name Sandy Bridge. MSRTA is specifically designed for real-time two-way Voice over IP (VoIP) applications. We developed a speech codec that is fully bitstream compliant with the Microsoft RTAudio codec with comparable quality. It has been implemented using Intel® IPP.<br /><br />We describe how to use Intel® IPP to build the Intel® Advanced Vector Extensions (Intel® AVX) optimized Microsoft* Real Time Audio Codec for VoIP applications. We provide performance results for Intel® microarchitecture code name Sandy Bridge.<br /><br />
<h2 class="sectionHeading">Unified Speech Component</h2>
The Unified Speech Component (USC) interface is a C language framework designed for implementation of speech codecs, echo cancellers, and other voice processing modules using the Intel® IPP library. Most of the speech codec standards mentioned in the above section utilize this extensible USC interface.<br /><br />The purpose of the USC interface is to provide unified access to an algorithm module, independent of the algorithm internals. The USC interface also enables binaries to be easily integrated into existing software applications. Decoupling the interface from the algorithm details makes the development of system components independent of the algorithm implementation.<br /><br />For more information, refer to the Unified Speech Component Interface manual (uscmanual.pdf) in the <a href="http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-code-samples/">speech coding sample</a> (part of the Intel® IPP main sample).<br /><br />
<h2 class="sectionHeading">Intel® IPP functions optimized for Intel® Advanced Vector Extensions</h2>
<a href="http://www.intel.com/software/avx">Intel® Advanced Vector Extensions (Intel® AVX)</a> is a 256-bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE), designed to provide even higher performance for applications that are floating-point intensive. Intel AVX adds new functionality to the existing Intel SIMD instruction set (based on Intel® SSE), and includes a more compact SIMD encoding format.<br /><br />The Intel® IPP library has been optimized for a variety of SIMD instruction sets. Automatic "dispatching" detects the SIMD instruction set that is available on the running processor and selects the optimal SIMD instructions for that processor. Refer to <a href="http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-understanding-cpu-optimized-code-used-in-intel-ipp/">Understanding CPU Dispatching in the Intel® IPP Library</a> for more information regarding dispatching.<br /><br />Intel® AVX optimization in the Intel® IPP library consists of "hand-optimized" and "compiler-tuned" functions: code that has been directly optimized for the Intel® AVX instruction set. <br /><br />For more information on Intel® IPP functions optimized for Intel® AVX, refer to the article <a href="http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/">Intel® IPP Functions Optimized for Intel® AVX</a>.<br /><br />
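The dispatching idea described above can be sketched in a few lines of C++. This is only an illustration of the general technique, not how Intel® IPP implements dispatching internally. The names scale_generic, scale_avx_placeholder, cpu_has_avx and select_scale are hypothetical, the "AVX" variant is a placeholder that reuses the portable loop, and the CPU check assumes the GCC/Clang builtin __builtin_cpu_supports (other compilers fall back to the generic path).<br /><br />

```cpp
#include <cstddef>

// Common signature shared by every variant of the (hypothetical) kernel.
using ScaleFn = void (*)(float* data, std::size_t n, float gain);

// Portable baseline: valid on any CPU.
void scale_generic(float* data, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= gain;
}

// Placeholder for an AVX-tuned variant. A real one would be compiled with
// AVX enabled; here it reuses the portable loop so the sketch runs anywhere.
void scale_avx_placeholder(float* data, std::size_t n, float gain) {
    scale_generic(data, n, gain);
}

// Assumption: GCC/Clang on x86. Other toolchains report "no AVX".
bool cpu_has_avx() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

// Detect once, then hand back the same function pointer on every call.
ScaleFn select_scale() {
    static const ScaleFn chosen = cpu_has_avx() ? scale_avx_placeholder : scale_generic;
    return chosen;
}
```

A caller would write select_scale()(buffer, count, gain) and never needs to know which variant was picked; with Intel® IPP the library makes this selection on the application's behalf.<br /><br />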
<h2 class="sectionHeading">Microsoft* Real-Time Audio Codec (MSRTA)</h2>
RTAudio* is the preferred Microsoft® Real-Time audio codec, and is used by <a href="http://lync.microsoft.com/en-us/pages/default.aspx">Microsoft Lync Server</a>* (formerly Microsoft Office Communications Server*) and other communications applications like <a href="http://lync.microsoft.com/en-us/pages/default.aspx">Microsoft Lync</a>* (formerly Microsoft Office Communicator*) and <a href="http://office.microsoft.com/en-us/live-meeting/">Microsoft LiveMeeting* Console</a>.<br /><br />To get more information about the Microsoft RTAudio Codec, refer to <a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=5D79B584-79C9-42A8-90C4-4AB3F03D19C4&amp;displaylang=en">Overview of the Microsoft RTAudio* Speech Codec</a>.<br /><br />
<h2 class="sectionHeading">Intel® IPP Real Time Audio functions</h2>
By combining Intel® IPP RT audio functions (refer to <b>Appendix B</b>), it is possible to construct a speech codec compliant with the Microsoft RTAudio* codec. The primitives are primarily designed to implement the well-defined, computationally expensive core operations that comprise the codec portion of the RTAudio system. <br /><br />
<h2 class="sectionHeading">USC MSRTA Codec</h2>
The USC MSRTA codec supports compression and decompression of 16-bit PCM mono signals, narrowband at 8000 Hz and wideband at 16000 Hz, with 20 ms frame lengths at bitrates of 8800 bps and 18000 bps respectively.<br /><br />To understand the usage of Intel® IPP in developing and measuring the USC MSRTA codec, please download the following free <b>code samples</b>: <a href="http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-code-samples/">Code Samples for the Intel® Integrated Performance Primitives (Intel® IPP) Library</a><br /><br />Extract all files in <b><code>w_ipp -sample_*.zip</code></b> to the desired folder. Make sure to preserve the directory structure. The files for the USC Speech Codec, USC Echo Cancellation, UMC Speech RTP codec, and USC Filter and tones samples will be found in the<br /><br /><b><code>../ ipp_samples/speech-codecs</code></b> folder.<br /><br /><b>How to build the source code</b><br /><ol>
<li>Set the system environment variable <b>IPPROOT</b></li>
<li>Open the solution/project file in the corresponding version of <b>Microsoft* Visual Studio</b></li>
<li>Select the configuration/platform you need </li>
<li>Build all projects in Microsoft Visual Studio* </li>
<li>Run the codec</li>
</ol>To run the sample for <b>encode</b> or for <b>decode</b>, use the following command line:<br /><br /><code>usc_speech_rtp_codec.exe [options] &lt;infile&gt; &lt;outfile&gt;</code><br /><br />The operation depends on the format of the input file <b>&lt;infile&gt;</b>: a WAVE file is encoded, and an RTPDump file is decoded. For an encode operation, the output file <b>&lt;outfile&gt;</b> is stored in RTPDump format; for a decode operation, the output file is stored in WAVE format. Option list:<br /><br /> 
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td><b><i>-format &lt;codecname&gt; </i></b></td>
<td>- codec option</td>
</tr>
<tr>
<td><b><i>-r&lt;bitrate&gt;</i></b></td>
<td>- bitrate option (mandatory)</td>
</tr>
<tr>
<td><b><i>-v</i></b></td>
<td>- <b>Voice Activity Detector</b> (VAD) enabled. Default: VAD disabled</td>
</tr>
</tbody>
</table>
<br /> 
<table class="tableformat1" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td >Codec name</td>
<td >Supported bitrate, in bps</td>
<td >Codec description</td>
</tr>
<tr>
<td><b>IPP_MSRTAnb_FP</b></td>
<td>8800</td>
<td>Narrowband 8000 Hz MSRTA codec, floating-point implementation</td>
</tr>
<tr>
<td><b>IPP_MSRTAwb_FP</b></td>
<td>18000</td>
<td>Wideband 16000 Hz MSRTA codec, floating-point implementation</td>
</tr>
</tbody>
</table>
<br />To enable the RT Audio codecs <b>IPP_MSRTAnb_FP</b> and <b>IPP_MSRTAwb_FP</b>, compile the sample source with the <b>_USC_MSRTA</b> definition and link with the IPP RTA static library (<b><i>ipp_rta.lib</i></b>), which is provided in binary form.<br /><br />For information on building the sample, see the readme file in the <i>ipp_samples\ speech-codecs</i> directory. <br /><br />Refer to <b>Appendix C</b> for the USC MSRTA codec description.<br /><br /><br /><b>Performance of USC MSRTA Codec</b><br /><br /> 
<table class="tableformat1" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td >System Configuration</td>
<td >Intel® microarchitecture code name Nehalem (NHM)</td>
<td >Intel® microarchitecture code name Sandy Bridge (SNB)</td>
</tr>
<tr>
<td>CPU</td>
<td>Intel® Xeon® processor X5570 @ 2.93GHz</td>
<td>Genuine Intel® CPU 0 @ 3.00GHz</td>
</tr>
<tr>
<td>Operating System</td>
<td>Microsoft Windows* 2003</td>
<td>Microsoft Windows* x64 with SNB patch</td>
</tr>
</tbody>
</table>
<br />We used Intel® IPP 7.0.2 to measure the performance of USC MSRTA codec on Intel® microarchitecture code name Sandy Bridge with frequency of 3.0 GHz and Intel® microarchitecture code name Nehalem at 2.93 GHz. Both systems had Microsoft Windows* 64-bit installed. To measure the performance of the codec, different data sets were used to satisfy the requirement of the MSRTA codec. The two different types of data streams were 16-bit with 16000 Hz wideband data stream (<b><i>s_16000_16.wav</i></b>) and 16-bit with 8000 Hz narrowband (<b><i>s_8000_16.wav</i></b>).<br /><br /><b><span >IA32:</span></b><br /><br /><img src="http://software.intel.com/file/35649" /><br /><br /><b><span >Intel® 64:</span></b><br /><br /><img src="http://software.intel.com/file/35650" /><br /><br />As mentioned in the performance table above, encode and decode performances are measured in MHz. By multiplying the performance number in seconds by CPU frequency, you will get performance numbers in MHz. The duration of narrowband data is 1070 sec, and wideband is 1090 sec. To measure the performance improvement of the USC MSRTA codec on Intel® microarchitecture code name Sandy Bridge compared to Intel® microarchitecture code name Nehalem, divide the performance number of Intel microarchitecture code name Nehalem by the performance number of Intel microarchitecture code name Sandy Bridge. <br /><br />For example, on 32-bit Intel microarchitecture code name Nehalem, narrowband encode performance of USC MSRTA without VAD is 20.42 sec. Performance of the same data set on Intel microarchitecture code name Sandy Bridge is 16.65 sec. To measure the speedup, divide 20.42 by 16.65 to get 1.23x. USC MSRTA decode performance exhibits similar benefits.<br /><br /><br />
<h2 class="sectionHeading">Summary</h2>
Intel® IPP is a highly optimized library for the latest Intel architecture, including Intel® microarchitecture code name Sandy Bridge. By using Intel IPP functions, Intel® AVX optimized audio and speech codecs can be developed, including the Microsoft RTAudio* (real time audio) codec. Intel IPP offers sample code to demonstrate the development and usage of the MSRTA codec.<br /><br /><br />
<h2 class="sectionHeading">Appendix A: Introduction to Intel® IPP</h2>
Intel® Integrated Performance Primitives (Intel® IPP) is an extensive library of multi-core ready, highly optimized software functions for digital media and data processing applications. Intel® IPP offers thousands of optimized functions covering frequently-used fundamental algorithms. Intel® IPP functions are designed to deliver performance beyond what optimized compilers alone can deliver. Intel® IPP is a software library that offers various highly optimized functions including multimedia and speech codecs. For advanced performance and greater value, Intel® IPP is also available with Intel® Parallel Studio XE 2011. For more information, go to <a href="http://software.intel.com/en-us/articles/intel-ipp/">http://software.intel.com/en-us/articles/intel-ipp/</a>.<br /><br /><br />
<h2 class="sectionHeading">Intel® IPP Speech Codec</h2>
Intel® IPP includes functions that can be used for implementing speech codecs. These codecs follow the <a href="http://www.itu.int/" target="_blank">International Telecommunication Union (ITU)</a>* recommendations for G.711 (companding functions), G.722, G.722.1, G.723.1, G.726, G.728, G.729.1 and G.729 codecs, G.167, G.168 for Echo Canceller, G.169 for Audio Level control, <a href="http://www.etsi.org/WebSite/homepage.aspx" target="_blank">European Telecommunications Standards Institute (ETSI)</a>* specifications for GSM-AMR and GSM-FR codecs, as well as <a href="http://www.3gpp.org/" target="_blank">3GPP</a>* specification for AMRWB and AMRWB+ codecs, and <b>Microsoft* Real-Time Audio codec</b>.<br /><br /><i><b>Note: </b>Implementations of these standards or the standard-enabled platforms may require licenses from various entities, including Intel Corporation. </i><br /><br /><br />
<h2 class="sectionHeading">Appendix B: Intel® AVX-optimized Intel® IPP RT Audio Functions</h2>
<table class="tableformat1" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td >Function Base Name</td>
<td >Operation</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_AdaptiveCodebookSearch_RTA.html">AdaptiveCodebookSearch_RTA</a></td>
<td>Searches for the adaptive codebook index and the lag, and computes the adaptive vector</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_FixedCodebookSearch_RTA.html">FixedCodebookSearch_RTA</a>, <a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_FixedCodebookSearch_RTA.html">FixedCodebookSearchRandom_RTA</a></td>
<td>Searches for the fixed codebook vector</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_HighPassFilter_RTA.html">HighPassFilter_RTA</a></td>
<td>Performs high-pass filtering</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_LSPQuant_RTA.html">LSPQuant_RTA</a></td>
<td>Performs quantization of LSP coefficients</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_LSPToLPC_RTA.html">LSPToLPC_RTA</a></td>
<td>Converts LSP coefficients to LP coefficients</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_QMFDecode_RTA.html">QMFDecode_RTA</a></td>
<td>Performs QMF synthesis</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_PostFilter_RTA.html">PostFilter_RTA</a></td>
<td>Restores speech signal from the residual</td>
</tr>
</tbody>
</table>
<br />Other functions required by the RT Audio codec are listed below:<br /><br /> 
<table class="tableformat1" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td >Function Base Name</td>
<td >Operation</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_LPCToLSP_RTA.html">LPCToLSP_RTA</a></td>
<td>Converts LP coefficients to LSP coefficients</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_LevinsonDurbin_RTA.html">LevinsonDurbin_RTA</a></td>
<td>Calculates LP coefficients from the autocorrelation coefficients</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_QMFGetStateSize_RTA.html">QMFGetStateSize_RTA</a></td>
<td>Calculates the size of the QMF filter state memory</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_QMFInit_RTA.html">QMFInit_RTA</a></td>
<td>Initializes the QMF filter state memory</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_QMFEncode_RTA.html">QMFEncode_RTA</a></td>
<td>Performs QMF analysis</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_PostFilterGetStateSize_RTA.html">PostFilterGetStateSize_RTA</a></td>
<td>Calculates the size of the post filter state memory</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_PostFilterGetStateSize_RTA.html">PostFilterInit_RTA</a></td>
<td>Initializes the post filter state memory</td>
</tr>
<tr>
<td><a href="http://software.intel.com/sites/products/documentation/hpc/ipp/ipps/ipps_ch9/functn_BandPassFilter_RTA.html">BandPassFilter_RTA</a></td>
<td>Performs band pass filtering</td>
</tr>
</tbody>
</table>
<br />*LSP - Line spectral pairs<br />*LP - Linear prediction<br />*QMF - Quadrature mirror filter<br /><br /><br /><br />
<h2 class="sectionHeading">Appendix C: USC MSRTA Codec Description</h2>
For a description of the USC interface, refer to the USC manual in the sample directory. Note that, for the codec API function Decode, when out-&gt;pBuffer is zero a fake decode is performed and no PCM output stream is written; out-&gt;nBytes is set to the bitstream length in bytes, without FEC.<br /><br />The USC MSRTA codec is supported with the following parameters:<br /><br /> 
<table class="tableformat1" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td>Codec names</td>
<td>: IPP_MSRTAnb_FP</td>
</tr>
<tr>
<td>Compression algorithms</td>
<td>: QMF, LPC, Adaptive and Fixed codebook for low band, Unvoiced Fixed codebook for high band, bandwidth control for variable bitrate mode</td>
</tr>
<tr>
<td>Linkage</td>
<td>: USC_MSRTAFP_Fxns</td>
</tr>
<tr>
<td>Signal</td>
<td>: 16bit linear PCM</td>
</tr>
<tr>
<td>Sampling</td>
<td>: 8000 Hz</td>
</tr>
<tr>
<td>Frame</td>
<td>: 20ms</td>
</tr>
<tr>
<td>Bitrates</td>
<td>: 8800 bps (176 bpf, 22 bytes)</td>
</tr>
<tr>
<td>Voice Activity Detection</td>
<td>: variable rate support</td>
</tr>
<tr>
<td>Packet Loss Concealment</td>
<td>: PLC supported</td>
</tr>
<tr>
<td>Frame type (value)</td>
<td>: default 0, not required</td>
</tr>
<tr>
<td>Standard</td>
<td>: Microsoft Real-Time Audio.</td>
</tr>
<tr>
<td>Codec names</td>
<td>: IPP_MSRTAwb_FP</td>
</tr>
<tr>
<td>Compression algorithms</td>
<td>: QMF, LPC, Adaptive and Fixed codebook for low band, Unvoiced Fixed codebook for high band, bandwidth control for variable bitrate mode.</td>
</tr>
<tr>
<td>Linkage</td>
<td>: USC_MSRTAFP_Fxns</td>
</tr>
<tr>
<td>Signal</td>
<td>: 16bit linear PCM</td>
</tr>
<tr>
<td>Sampling</td>
<td>: 16000 Hz</td>
</tr>
<tr>
<td>Frame</td>
<td>: 20ms</td>
</tr>
<tr>
<td>Bitrates</td>
<td>: 18000 bps (360 bpf, 45 bytes)</td>
</tr>
<tr>
<td>Voice Activity Detection</td>
<td>: variable rate support</td>
</tr>
<tr>
<td>Packet Loss Concealment</td>
<td>: PLC supported</td>
</tr>
<tr>
<td>Frame type (value)</td>
<td>: default 0, not required</td>
</tr>
<tr>
<td>Standard</td>
<td>: Microsoft Real-Time Audio.</td>
</tr>
</tbody>
</table>
<br /><br />
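The frame sizes in the table follow from the bitrate and the 20 ms frame length. The sketch below reproduces the "bpf" (bits per frame) and byte counts listed above. The helper names are hypothetical, and samples_per_frame is derived from the sampling rates rather than quoted from the table.<br /><br />

```cpp
// Bits produced per frame: bitrate (bits/s) * frame length (ms) / 1000.
constexpr int bits_per_frame(int bitrate_bps, int frame_ms) {
    return bitrate_bps * frame_ms / 1000;
}

// Payload bytes per frame, rounded up to a whole byte.
constexpr int bytes_per_frame(int bitrate_bps, int frame_ms) {
    return (bits_per_frame(bitrate_bps, frame_ms) + 7) / 8;
}

// PCM samples consumed per frame (derived, not listed in the table).
constexpr int samples_per_frame(int sample_rate_hz, int frame_ms) {
    return sample_rate_hz * frame_ms / 1000;
}

// Narrowband: 8800 bps, 20 ms frames -> 176 bits, 22 bytes.
static_assert(bits_per_frame(8800, 20) == 176, "narrowband bits per frame");
static_assert(bytes_per_frame(8800, 20) == 22, "narrowband bytes per frame");

// Wideband: 18000 bps, 20 ms frames -> 360 bits, 45 bytes.
static_assert(bits_per_frame(18000, 20) == 360, "wideband bits per frame");
static_assert(bytes_per_frame(18000, 20) == 45, "wideband bytes per frame");
```

At 16 bits per sample, a 20 ms narrowband frame holds 160 input samples (320 bytes of PCM) and a wideband frame holds 320 samples (640 bytes), against 22 and 45 bytes of compressed output.<br /><br />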
<h2 class="sectionHeading">Additional Resources</h2>
<ul>
<li><a href="http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/">Intel® Integrated Performance Primitives (Intel® IPP) Functions Optimized for Intel® Advanced Vector Extensions (Intel® AVX) </a></li>
<li><a href="http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/">Intel® AVX: New Frontiers in Performance Improvements and Energy Efficiency </a></li>
<li><a href="http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/">Intel® AVX and CPU Instructions Forum </a></li>
<li><a href="http://software.intel.com/en-us/forums/intel-integrated-performance-primitives/">Intel® Integrated Performance Primitives Forum</a></li>
</ul>
<p><strong class="sectionHeadingText"><span >About the Author</span></strong></p>
<p ><b>Naveen Gv</b> is a Technical Consulting Engineer (TCE) in the performance library team. At Intel he has specialized in multi-core programming, Intel Performance Primitives and Intel Math Kernel Library. His professional interests are teaching multi-core programming methodology to the software community and implementing digital signal processing algorithms on the x86 platform. Naveen has worked with several universities across Asia Pacific to implement multi-core programming in academia. His e-mail address is <a href="mailto:naveen.gv@intel.com">naveen.gv at intel.com</a></p>
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Naveen Gv</div>
</div>
<div id="vc-meta-pubdate">04-21-2011</div>
<div id="vc-meta-modificationdate">04-21-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Intel® AVX</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">This article describes how Intel® Integrated Performance Primitives can provide the building blocks to develop an Intel® AVX optimized Microsoft* Real Time Audio codec for VoIP applications.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/developing-intel-avx-optimized-microsoft-real-time-audio-msrta-codec-using-intel-ipp/</link>
      <pubDate>Thu, 21 Apr 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/developing-intel-avx-optimized-microsoft-real-time-audio-msrta-codec-using-intel-ipp/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/developing-intel-avx-optimized-microsoft-real-time-audio-msrta-codec-using-intel-ipp/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Media</category>
    </item>
    <item>
      <title>Don&#39;t Spill That Register - Ensuring Optimal Performance From Intrinsics</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/34693">Don't Spill That Register - Ensuring Optimal Performance From Intrinsics</a> [PDF 79KB]<br /><br /><br />
<h2 class="sectionHeading">Objective</h2>
The goal of this article is to help developers ensure their C/C++ code with intrinsics produces the optimal assembly, and to show how to spot unnecessary register spilling. <br /><br /><br />
<h2 class="sectionHeading">Abstract</h2>
Programming with intrinsics can be as optimal as implementing code directly in assembly. Compared to assembly code, C/C++ code using intrinsics is subject to more compilation steps to generate the final code. Compilation in Debug mode, and possibly in Release mode with improperly set compilation flags, may generate code with seemingly unnecessary instructions for copying registers to and from the stack. Register copying can also result from the source code using more __m256 or __m128 variables than the number of corresponding registers available in hardware. From a simple example using intrinsics, this short article shows good and bad assembly produced and then explains what happened and how to avoid it. <br /><br /><br />
<h2 class="sectionHeading">Introduction</h2>
The topics of x86 intrinsics, Intel® Streaming SIMD Extensions (Intel® SSE), and the Intel® Advanced Vector Extensions (Intel® AVX) are discussed in detail online on the Intel Software Network site (<a href="http://software.intel.com/">http://software.intel.com</a>). An intrinsic looks just like a function call in C/C++ code, but the compiler recognizes it and typically turns it into a single assembly instruction. For example, consider the following code: <br /><br />
<pre name="code" class="cpp">         __m128 a = _mm_rsqrt_ss(b);  // a = 1.0f/sqrt(b) approx<br /></pre>
This line of code will cause the compiler to emit an RSQRTSS instruction at this spot. <br /><br />Intrinsics let the programmer do instruction level optimization directly, but without the burden of dealing with register allocation, loop syntax, etc. Developers sometimes ask, "Are intrinsics as optimal as assembly?" The answer is usually yes, or at least close to optimal. Furthermore, code with intrinsics is more future-proof, since code initially written for Intel SSE can be recompiled using Intel AVX. Intel AVX versions normally run faster than their Intel SSE counterparts on the same hardware. Given the ease of use and forward compatibility, intrinsics are the logical choice for optimizing to the hardware. <br /><br />To use intrinsics with confidence that the program is optimal, it is worthwhile to know how code gets compiled and to be aware of what to look out for. We look at a short intrinsic example and see the corresponding assembly that should result, as well as a compilation that generated suboptimal results.<br /><br /><br />
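As a small illustration of the RSQRTSS example above, the following sketch wraps the intrinsic in a scalar helper and adds one Newton-Raphson step. The refinement is a common technique but is not part of this article, and the names rsqrt_fast and rsqrt_refined are hypothetical. On targets without SSE the sketch falls back to an exact division so that it still compiles.<br /><br />

```cpp
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#else
#define HAVE_SSE_RSQRT 0
#endif

// Approximate 1/sqrt(b). With SSE this maps to a single RSQRTSS
// instruction, which is fast but only accurate to roughly 12 bits.
inline float rsqrt_fast(float b) {
#if HAVE_SSE_RSQRT
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(b)));
#else
    return 1.0f / std::sqrt(b);  // fallback for non-SSE targets
#endif
}

// One Newton-Raphson step, y' = y * (1.5 - 0.5 * b * y * y),
// which roughly doubles the number of correct bits.
inline float rsqrt_refined(float b) {
    const float y = rsqrt_fast(b);
    return y * (1.5f - 0.5f * b * y * y);
}
```

Whether the refinement step is worth its extra multiplies depends on how much accuracy the surrounding algorithm needs.<br /><br />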
<h2 class="sectionHeading">Generated Assembly</h2>
For this example, we use an intrinsic implementation of the loop s[i]=a*x[i]+y[i], commonly known as "saxpy", and show the code generated. <br /><br />
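<pre name="code" class="cpp">  #include &lt;xmmintrin.h&gt;  // SSE intrinsics: __m128, _mm_*_ps<br /><br />  // Assumes S, X and Y are 16-byte aligned; any tail of n % 4 elements is not processed.<br />  inline void saxpy_simd4(float* S,float _a,const float* X,const float *Y,int n)<br />  {<br />     __m128 a = _mm_set1_ps(_a);  // broadcast a to all 4 lanes<br />     for(int i=0; i+4 &lt;= n ;i+=4)  // process 4 elements at a time<br />     {<br />         __m128 x = _mm_load_ps(X+i);<br />         __m128 y = _mm_load_ps(Y+i);  <br />         __m128 s = _mm_add_ps(_mm_mul_ps(a,x),y);  // a*x + y <br />         _mm_store_ps(S+i, s );<br />     }<br />  }<br /></pre>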
When this x86 code gets compiled, it should look something like: <br /><br />
<pre name="code" class="cpp">    (AVX assembly instructions from loop listed only):<br /> 001B4FF0  vmulps      xmm1,xmm0,xmmword ptr xpoints (1B97A0h)[eax]  <br /> 001B4FF8  vaddps      xmm1,xmm1,xmmword ptr ypoints (1B9390h)[eax]  <br /> 001B5000  vmovaps     xmmword ptr dest (1C0440h)[eax],xmm1  <br /> 001B5008  add         eax,10h  <br /> 001B500B  cmp         eax,400h  <br /> 001B5010  jl          saxpy_128+20h (1B4FF0h)  <br /></pre>
This assembly sequence was generated by Microsoft Visual Studio* C++ Compiler 2010 with default release mode settings and /arch:AVX added to the command line. Only the instructions inside the loop, which are repeated many times, are shown. Register xmm0 is loaded with the constant a before the loop starts. The first 3 assembly instructions map directly to the intrinsics in the C++ code. The assembly is actually shorter than the corresponding C++ code, since the load intrinsics have been folded into the memory operands of the vmulps and vaddps instructions. The last 3 instructions implement the for loop over i. <br /><br />Compiling this small function in another project without the optimization flag set resulted in the following assembly:<br /><br />
<pre name="code" class="cpp"> 00B5F7C2  mov         eax,dword ptr [i]  <br /> 00B5F7C5  add         eax,4  <br /> 00B5F7C8  mov         dword ptr [i],eax  <br /> 00B5F7CB  cmp         dword ptr [i],100h  <br /> 00B5F7D2  jge         saxpy_128+122h (0B5F872h)  <br /> 00B5F7D8  mov         eax,dword ptr [i]  <br /> 00B5F7DB  vmovaps     xmm0,xmmword ptr xpoints (0B715A0h)[eax*4]  <br /> 00B5F7E4  vmovaps     xmmword ptr [ebp-1D0h],xmm0  <br /> 00B5F7EC  vmovaps     xmm0,xmmword ptr [ebp-1D0h]  <br /> 00B5F7F4  vmovaps     xmmword ptr [x],xmm0  <br /> 00B5F7F9  mov         eax,dword ptr [i]  <br /> 00B5F7FC  vmovaps     xmm0,xmmword ptr ypoints (0B71180h)[eax*4]  <br /> 00B5F805  vmovaps     xmmword ptr [ebp-1B0h],xmm0  <br /> 00B5F80D  vmovaps     xmm0,xmmword ptr [ebp-1B0h]  <br /> 00B5F815  vmovaps     xmmword ptr [y],xmm0  <br /> 00B5F81A  vmovaps     xmm0,xmmword ptr [x]  <br /> 00B5F81F  vmovaps     xmm1,xmmword ptr [a]  <br /> 00B5F824  vmulps      xmm0,xmm1,xmm0  <br /> 00B5F828  vmovaps     xmmword ptr [ebp-190h],xmm0  <br /> 00B5F830  vmovaps     xmm0,xmmword ptr [y]  <br /> 00B5F835  vmovaps     xmm1,xmmword ptr [ebp-190h]  <br /> 00B5F83D  vaddps      xmm0,xmm1,xmm0  <br /> 00B5F841  vmovaps     xmmword ptr [ebp-170h],xmm0  <br /> 00B5F849  vmovaps     xmm0,xmmword ptr [ebp-170h]  <br /> 00B5F851  vmovaps     xmmword ptr [s],xmm0  <br /> 00B5F859  vmovaps     xmm0,xmmword ptr [s]  <br /> 00B5F861  mov         eax,dword ptr [i]  <br /> 00B5F864  vmovaps     xmmword ptr dest (0B78240h)[eax*4],xmm0  <br /> 00B5F86D  jmp         saxpy_128+72h (0B5F7C2h)  <br /></pre>
Here we see that the compiler issued additional instructions that do not correspond to loop management or to intrinsics in the original source code. After registers are loaded from memory, they are copied to and from the stack. The explanation is that, from the C language perspective, the __m128 variables reside on the stack, and the compiler is simply storing the data in the place where it was declared. It is the /O2 optimization step, not the fact that we used intrinsics, that is normally responsible for removing such unnecessary copying. The extra copying will likely happen in Debug mode, but may also happen in Release mode if the project's Optimization setting is not Maximum Speed (/O2). The example shown here compiles to Intel AVX, but the same thing happens with Intel SSE as well. <br /><br /><br />
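The saxpy routine in this section relies on aligned pointers and a length that is a multiple of 4. A variant without those preconditions could look like the sketch below. It is not from the original article: the name saxpy_any is hypothetical, it uses unaligned loads and stores, and it finishes any leftover elements with a scalar loop. On targets without SSE only the scalar loop is compiled.<br /><br />

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SAXPY_HAVE_SSE 1
#else
#define SAXPY_HAVE_SSE 0
#endif

// s[i] = a * x[i] + y[i] for any n, with no alignment requirement.
inline void saxpy_any(float* s, float a, const float* x, const float* y, std::size_t n)
{
    std::size_t i = 0;
#if SAXPY_HAVE_SSE
    const __m128 va = _mm_set1_ps(a);          // broadcast a to all 4 lanes
    for (; i + 4 <= n; i += 4) {               // full groups of 4 floats
        const __m128 vx = _mm_loadu_ps(x + i); // unaligned loads
        const __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(s + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
#endif
    for (; i < n; ++i)                         // scalar tail (0 to 3 elements)
        s[i] = a * x[i] + y[i];
}
```

As with the original routine, this should be built with optimization enabled; otherwise the same stack copies shown above are likely to appear around every intrinsic.<br /><br />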
<h2 class="sectionHeading">Register Shortage</h2>
When using temporary __m128 or __m256 variables for single instruction multiple data (SIMD) programming, the optimizing compiler usually does a good job of keeping these in registers. Even with optimizations, the compiler may still sometimes generate assembly code that copies temporary values to the stack. Consider for example a 3D spring (distance constraint) update written using hybrid structure of arrays (SOA) style programming. The following example is based on code from the AVX cloth sample available on Intel's website: <br /><br />
<pre name="code" class="cpp">  #include &lt;immintrin.h&gt;  // AVX intrinsics: __m256, _mm256_*_ps<br /><br />  // N (the number of 8-wide constraint groups) is assumed to be defined elsewhere<br />  void springupdate(__m256 A[][3], __m256 B[][3],__m256 &amp;restlen)<br />  {<br />    __m256 half = _mm256_set1_ps(0.5f);<br />    for(int i=0;i != N ; i++)  // 8*N constraints in total<br />    {<br />      // each a and b contain the xyz endpoints for 8 pseudo-springs<br />      __m256 *a=A[i];<br />      __m256 *b=B[i];<br />      __m256 vx  = _mm256_sub_ps(b[0],a[0]); // v.x=b.x-a.x<br />      __m256 vy  = _mm256_sub_ps(b[1],a[1]); // v.y=b.y-a.y<br />      __m256 vz  = _mm256_sub_ps(b[2],a[2]); // v.z=b.z-a.z<br />      __m256 dp  = vx*vx+vy*vy+vz*vz;        // assume operator overloads for add and mul <br />      __m256 imag= _mm256_rsqrt_ps(dp);      // inverse magnitude<br />      // normalize v<br />      vx = _mm256_mul_ps(vx,imag); // vx *= inverse magnitude <br />      vy = _mm256_mul_ps(vy,imag); // vy *= imag<br />      vz = _mm256_mul_ps(vz,imag); // vz *= imag <br />      __m256 half_stretch = ( dp*imag - restlen) * half;<br />      // move endpoints a and b together <br />      a[0]=a[0]+ vx * half_stretch;    <br />      a[1]=a[1]+ vy * half_stretch;    <br />      a[2]=a[2]+ vz * half_stretch;    <br />      b[0]=b[0]- vx * half_stretch;    <br />      b[1]=b[1]- vy * half_stretch;    <br />      b[2]=b[2]- vz * half_stretch;   <br />    }   <br />  }<br /></pre>
For brevity, the above code assumes the obvious operator overloads are implemented and inlined. Even if all intrinsic calls were written out by hand, there is a good chance that compiling a routine like this will produce assembly code that will copy data to and from the stack and/or load the same data from arrays A and B multiple times. While there is no limit to how many variables of this type a programmer uses, there are a limited number of hardware registers available. When Intel AVX code is compiled into a 32-bit executable, the compiler has only 8 YMM registers available. Even with this oversimplified distance constraint equation, the code uses 6 registers for the endpoints, another 3 for the vector v between them, as well as registers for the inverse magnitude, rest length, magnitude minus rest length, half constant, and half stretch amount. Clearly, the compiler must use the same register for more than one variable in this loop. Therefore, it will have to reload values from (and possibly copy values to) the stack. In this situation, the solution to avoid register spilling is to compile the code for 64-bit. Then, instead of just 8, the compiler has 16 YMM (256-bit) registers at its disposal, which is more than enough for this particular simulation. <br /><br /><br />
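The register tally in the paragraph above can be written down as a small compile-time check. The counts are taken from the text; the constant and function names are hypothetical. The total is an upper bound on register pressure, since a real register allocator works from how long each value stays live rather than from a simple count of variables.<br /><br />

```cpp
// Values named in the text for one iteration of the spring update.
constexpr int kEndpointValues = 6;  // a[0..2] and b[0..2]
constexpr int kVectorValues   = 3;  // vx, vy, vz
constexpr int kOtherValues    = 5;  // inverse magnitude, rest length,
                                    // magnitude minus rest length,
                                    // half constant, half stretch
constexpr int kTotalValues = kEndpointValues + kVectorValues + kOtherValues;

// YMM registers available to the compiler in each mode.
constexpr int kYmmRegs32Bit = 8;
constexpr int kYmmRegs64Bit = 16;

// True when every value could sit in its own register at the same time.
constexpr bool fits_in_registers(int values, int registers) {
    return values <= registers;
}

static_assert(kTotalValues == 14, "6 + 3 + 5 values");
static_assert(!fits_in_registers(kTotalValues, kYmmRegs32Bit), "32-bit build must spill or reload");
static_assert(fits_in_registers(kTotalValues, kYmmRegs64Bit), "64-bit build has room to spare");
```

With 14 values against 8 registers, a 32-bit build has to spill or reload; with 16 registers, a 64-bit build does not.<br /><br />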
<h2 class="sectionHeading">Conclusion</h2>
Starting from a C/C++ intrinsics sample, we've shown the good and the bad of what sort of assembly code can be generated. The suboptimal extra register copying code generation can result from compiler settings such as not using fast-code optimization (O2), or from not using 64-bit when more than 8 registers are needed. There may be other reasons why a compiler might not generate the assembly the programmer expects. Therefore, while intrinsics are often the preferred choice for code optimization, it is still a good idea to inspect the generated assembly to ensure the compiled result is as expected.
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Stan Melax</div>
</div>
<div id="vc-meta-pubdate">03-06-2011</div>
<div id="vc-meta-modificationdate">03-06-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product">
<div></div>
</div>
<div id="vc-meta-category">
<div>Intel® AVX</div>
<div>Intel® SSE</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Compared to assembly code, C/C++ code using intrinsics is subject to more compilation steps to generate the final code. Improper compilation settings may result in assembly instructions that copy registers unnecessarily and reduce performance. From a simple example using intrinsics, this short article shows good and bad assembly produced and then explains what happened and how to avoid it.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics/</link>
      <pubDate>Mon, 07 Mar 2011 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/dont-spill-that-register-ensuring-optimal-performance-from-intrinsics/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>3D Vector Normalization Using 256-Bit Intel® Advanced Vector Extensions (Intel® AVX)</title>
      <description><![CDATA[ <h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/30814" target="_blank">3D Vector Normalization Using 256-Bit Intel® Advanced Vector Extensions (Intel® AVX)</a> [PDF 202KB]<br /><br />
<h2 class="sectionHeading">Objective</h2>
Show developers how to improve the performance of geometry computations by transposing packed 3D data on-the-fly to take advantage of 8-wide SIMD processing.<br /><br /><br />
<h2 class="sectionHeading">Abstract</h2>
This article shows how to exploit 256-bit Intel® Advanced Vector Extensions (Intel® AVX) to normalize an array of 3D vectors. We describe a shuffle approach to convert between AOS and SOA on-the-fly in order to make data ready for up to 8-wide single instruction multiple data (SIMD) processing. The 8x3 to 3x8 transpose can be done in 5 shuffles, and the 3x8 to 8x3 transpose back takes 6 shuffles. Results demonstrate the benefit of wider SIMD on the normalization computation with 2.3 and 2.9 times speedups on 128-bit and 256-bit respectively. The round trip cost, 11 extra instructions, of enabling SOA processing is low enough to justify its usage on this small computation.<br /><br /><br />
<h2 class="sectionHeading">Introduction</h2>
Many interactive applications do a lot of geometry processing of one form or another on the CPU. While a structure of arrays (SOA) memory layout is the most efficient for processing, in some applications it may not be practical to store data this way. Quite often there is performance-critical code that operates on a regular array of 3D vectors - a very common data structure found in 3D applications. A common example is scaling every vector in an array of 3D normals to unit length. In this article, we begin with this function and optimize the code to exploit the capabilities of the x86 processor, including new 256-bit instructions from the Intel® Advanced Vector Extensions (Intel® AVX), which are part of the new microarchitecture of the second-generation Intel® Core™ processor family, codenamed Sandy Bridge.<br /><br />The C code for the loop being optimized is shown below:<br />
<pre name="code" class="cpp">  void Normalize(float V[][3],int N)
  {
    for(int i=0;i != N;i++)
    {
        float *v=V[i];
        float invmag = 1.0f/sqrtf(v[0]*v[0]+v[1]*v[1]+v[2]*v[2]); 
        v[0] *= invmag;
        v[1] *= invmag;
        v[2] *= invmag;
    }
  }
</pre>
In practice, such a routine may be implemented using a C++ 3D vector class with overloaded operators, but the underlying operations and data layout would be the same.<br /><br />Usage of single instruction multiple data (SIMD) processing with 4-wide 32-bit precision floating point has become widespread since the introduction of the Intel® Streaming SIMD Extensions 2 (SSE2) in 2001. Because the 128-bit register size is a natural fit for 4D data, many 3D interactive applications will use homogeneous vectors, or pad 3D vectors with an extra 32 bits. While this is an easy way to exploit SIMD processing, the gains are rarely a 4x improvement due to other performance bottlenecks. Applying this style of SIMD to our normalization example would only use 3/4 of a 128-bit register. While the multiplications can be done in parallel, summing the squares and the inverse square root do not benefit from SIMD. Most importantly, this programming pattern does not scale to wider SIMD such as the 256-bit Intel® Advanced Vector Extensions (Intel® AVX), which support parallel operations on 8 floating point numbers.<br /><br />Another approach to exploit SIMD is to process 8 iterations of the loop at a time, i.e., normalize 8 vectors at once. This requires the data to be rearranged so all the X components of the 8 vectors fit into one register, the Y's in the next, and Z's in another. Loading the registers would be easy if our data happened to be stored this way in a structure of arrays (SOA). However, some applications do require the data storage to be a packed sequence of 3D vectors - an array of structures (AOS). Therefore, to utilize 8-wide SIMD processing, we need to transpose the data on-the-fly, do the computation, and transpose back. 
This article describes how to do the transpose using shuffles on the x86, and then provides performance results, covering both optimized serial and shuffle transpose implementations, that show the speedup that can be obtained using Intel AVX for normalizing an (AOS) array of 3D vectors. <br /><br /><br />
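As a rough illustration of the SOA idea described above, the scalar sketch below normalizes one block of 8 vectors whose X, Y and Z components are stored in separate arrays. The Soa8 type and the normalize_soa8 function are hypothetical names used only for this illustration and are not part of the article's source code. Each statement in the loop body corresponds to what a single 8-wide instruction would do across all lanes at once.<br />
<pre name="code" class="cpp">  #include &lt;cassert&gt;
  #include &lt;cmath&gt;

  // Hypothetical SOA block (illustration only): 8 vectors, one array per component.
  struct Soa8
  {
      float x[8];
      float y[8];
      float z[8];
  };

  // Normalizes all 8 vectors of the block in place.
  // Assumption: no vector in the block has zero length.
  void normalize_soa8(Soa8 &amp;v)
  {
      for (int lane = 0; lane != 8; lane++)
      {
          float dp = v.x[lane]*v.x[lane] + v.y[lane]*v.y[lane] + v.z[lane]*v.z[lane];
          assert(dp &gt; 0.0f);                    // zero-length vectors are not handled
          float invmag = 1.0f / std::sqrt(dp);  // exact inverse magnitude, not an approximation
          v.x[lane] *= invmag;
          v.y[lane] *= invmag;
          v.z[lane] *= invmag;
      }
  }
</pre>
With this layout no lane ever needs data from another lane, which is why the computation maps directly onto wide SIMD once the data has been transposed.<br /><br /><br />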
<h2 class="sectionHeading">The 128-bit and 256-bit AOS to SOA shuffle</h2>
The most efficient way to move data from the first level cache into registers is to load 128 bits (or more) at a time. This is a different stride than our packed array of 3D values. However, the pattern of 128-bit alignment repeats for every fourth vector (or 12 floats). Therefore, three aligned 128-bit loads pull the next four 3D vectors into three 128-bit registers. <br /><br />The order of the vector elements is not very useful right after the three loads. Next we need to get the data from four 3D vectors into a usable form by grouping the X's, Y's and Z's into separate registers. The following figure shows the 4x3 to 3x4 transpose using five Intel AVX 128-bit shuffles. <br /><br /><img src="http://software.intel.com/file/30811" /><br /><br />With the data now in SOA form, the computation is done with the same steps as the serial implementation but using the Intel AVX instructions to normalize 4 vectors at a time. <br /><br />The result of the computation will be in SOA form and needs to be transposed back into AOS. Our conversion operations are not symmetrical. In other words, the transpose back from SOA to AOS cannot be done with the same code. In fact, it takes one more shuffle to convert back. The technique is illustrated in the following figure.<br /><br /><img src="http://software.intel.com/file/30812" /><br /><br />So far we have described 4 float or 128-bit SIMD usage. We can extend this to 8x3 by using the full 256-bit registers and instructions provided by Intel AVX. When shuffles are involved, the way to think about 256-bit Intel AVX is that there are two lanes that are 128-bit each. In other words, AVX provides two lanes of 4 wide SIMD units. As before, we load the first 12 floats into three 128-bit registers which actually are the lower halves of three 256-bit registers. All we have to do now is load the next 12 floats (or next four 3D vectors) into the upper halves of those same three 256-bit registers. 
Then the shuffle computation is implemented exactly the same way, but using the 256-bit versions of each of the instructions/intrinsics.<br /><br /><img src="http://software.intel.com/file/30813" /><br /><br />The 256-bit conversion back to AOS, with its 6 shuffles and subsequent stores, is similarly an extension of the 128-bit version above. With the ability to transpose data into SOA and back, we are now ready to apply this technique to normalizing an array of 3D vectors.<br /><br /><br />
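To make the 128-bit shuffle sequences easier to follow without the figures, here is a scalar model of them. It is only a sketch: Vec4, Soa4, shuffle, aos_to_soa, soa_to_aos and check_round_trip are hypothetical names, and shuffle is a plain C++ stand-in that mimics how _mm_shuffle_ps selects elements, so the code runs without any SIMD support. The index patterns are the ones listed in the Appendix.<br />
<pre name="code" class="cpp">  #include &lt;array&gt;
  #include &lt;cassert&gt;

  typedef std::array&lt;float, 4&gt; Vec4;   // scalar stand-in for one 128-bit register
  struct Soa4 { Vec4 x, y, z; };

  // Models _mm_shuffle_ps(a, b, _MM_SHUFFLE(i3, i2, i1, i0)):
  // result elements 0 and 1 are picked from a, elements 2 and 3 from b.
  Vec4 shuffle(Vec4 a, Vec4 b, int i3, int i2, int i1, int i0)
  {
      Vec4 r = {{ a[i0], a[i1], b[i2], b[i3] }};
      return r;
  }

  // 4x3 to 3x4 transpose in five shuffles; p points at 12 packed floats.
  Soa4 aos_to_soa(const float *p)
  {
      Vec4 x0y0z0x1 = {{ p[0], p[1], p[2],  p[3]  }};
      Vec4 y1z1x2y2 = {{ p[4], p[5], p[6],  p[7]  }};
      Vec4 z2x3y3z3 = {{ p[8], p[9], p[10], p[11] }};
      Vec4 x2y2x3y3 = shuffle(y1z1x2y2, z2x3y3z3, 2,1,3,2);
      Vec4 y0z0y1z1 = shuffle(x0y0z0x1, y1z1x2y2, 1,0,2,1);
      Soa4 s;
      s.x = shuffle(x0y0z0x1, x2y2x3y3, 2,0,3,0);   // x0x1x2x3
      s.y = shuffle(y0z0y1z1, x2y2x3y3, 3,1,2,0);   // y0y1y2y3
      s.z = shuffle(y0z0y1z1, z2x3y3z3, 3,0,3,1);   // z0z1z2z3
      return s;
  }

  // 3x4 to 4x3 transpose back in six shuffles; p receives 12 packed floats.
  void soa_to_aos(Soa4 s, float *p)
  {
      Vec4 x0x2y0y2 = shuffle(s.x, s.y, 2,0,2,0);
      Vec4 y1y3z1z3 = shuffle(s.y, s.z, 3,1,3,1);
      Vec4 z0z2x1x3 = shuffle(s.z, s.x, 3,1,2,0);
      Vec4 r0 = shuffle(x0x2y0y2, z0z2x1x3, 2,0,2,0);   // x0y0z0x1
      Vec4 r1 = shuffle(y1y3z1z3, x0x2y0y2, 3,1,2,0);   // y1z1x2y2
      Vec4 r2 = shuffle(z0z2x1x3, y1y3z1z3, 3,1,3,1);   // z2x3y3z3
      for (int i = 0; i != 4; i++)
      {
          p[i]     = r0[i];
          p[4 + i] = r1[i];
          p[8 + i] = r2[i];
      }
  }

  // Self-check: transposing four vectors and transposing back must reproduce the input.
  void check_round_trip()
  {
      float in[12], out[12];
      for (int i = 0; i != 12; i++) in[i] = float(i);
      soa_to_aos(aos_to_soa(in), out);
      for (int i = 0; i != 12; i++) assert(in[i] == out[i]);
  }
</pre>
The 256-bit versions apply the same index patterns independently to each 128-bit lane, so the model above covers one lane of the 8x3 case.<br /><br /><br />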
<h2 class="sectionHeading">Testing and Results</h2>
For our testing, we used a single core from an Intel® processor-based system with the microarchitecture codename Sandy Bridge that supports Intel AVX. To focus on instruction costs, L1 caches were primed with the data for 1024 3D vectors. The RDTSC instruction was called before and after each loop to measure the total number of cycles to process the entire dataset. In the tables below, we divide by the size of the array and show throughput results in CPU cycles per 3D vector processed. In addition to the various normalization implementations, we also measure loops with parts removed to gauge the cost of various steps in isolation.<br /><br />The following table shows the cost per 3D vector of the round trip from AOS to SOA and back to AOS for 3D data on SIMD 128-bit (4 float) and 256-bit (8 float). <br /><br /> 
<table width="700" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody>
<tr>
<td><b>Transpose Only Round Trip</b></td>
<td><b>Cycles Per Vector Processed</b></td>
<td><b>Description</b></td>
</tr>
<tr>
<td>shuf_trans128:</td>
<td>3.0</td>
<td>128-bit AOS-&gt;SOA-&gt;AOS via shuffles</td>
</tr>
<tr>
<td>shuf_trans256:</td>
<td>1.5</td>
<td>256-bit transpose and back</td>
</tr>
</tbody>
</table>
<br />The values in the table are cycles per 3D vector. Note that there is more than one vector being transposed per loop iteration. In both cases, the loop iteration takes 12 cycles which is reasonable since there are 11 shuffles per loop. The meaningful result is that the cost is three or fewer cycles per vector for the round trip conversion from our packed array of structures to a structure of arrays and back. <br /><br />For the next test, we wanted to gauge what the cost would be if there was no transpose required, so we made another data structure where the data was already prearranged in SOA format, i.e., three separate arrays for all the X's, Y's, and Z's. The following table shows the cost of normalizing 3D vectors already in SOA format. <br /><br /> 
<table width="700" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody>
<tr>
<td><b>SIMD Math Only Tests</b></td>
<td><b>Cycles Per Vector Normalized</b></td>
<td><b>Description</b></td>
</tr>
<tr>
<td>norm_soa_data4:</td>
<td>1.8</td>
<td>128-bit intrinsics norm on SOA data</td>
</tr>
<tr>
<td>norm_soa_data8:</td>
<td>1.1</td>
<td>256-bit intrinsics norm on SOA data</td>
</tr>
</tbody>
</table>
<br />Results are presented for both 128-bit and 256-bit SIMD normalization implementations on preformatted SOA data. Because the cost of shuffling data is not free, it is doubtful that normalizing the AOS data could be as fast as these times. This is meant for reference and provides an indication of the absolute performance limit for just the mathematical computation for our loop.<br /><br />Note that normalization is using the low level of numerical accuracy. The result of the approximate inverse square root assembly routine is not refined with Newton-Raphson. For some applications, such as generating mesh normals for lighting calculations, low accuracy may be considered sufficient. The goal of these tests is to evaluate the potential benefit for the transpose on-the-fly technique for very small computations.<br /><br />The cost of normalizing our regular array of 3D vectors (AOS dataset) is shown in the following table:<br /><br /> 
<table width="700" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody>
<tr>
<td><b>3D Vector Normalization</b></td>
<td><b>Cycles Per Vector Normalized</b></td>
<td><b>Description</b></td>
</tr>
<tr>
<td>x87:</td>
<td>45.0</td>
<td>serial function without any Intel® SSE or Intel® AVX</td>
</tr>
<tr>
<td>arch_optim:</td>
<td>24.3</td>
<td>serial normalization (no SIMD) with -arch optimization</td>
</tr>
<tr>
<td>rsqrt_ss:</td>
<td>8.0</td>
<td>serial vector normalization using RSQRT assembly</td>
</tr>
<tr>
<td>mask_trans:</td>
<td>9.4</td>
<td>AOS-&gt;SOA-&gt;AOS uses masking for 4x3 trans</td>
</tr>
<tr>
<td>shuf_trans4:</td>
<td>3.5</td>
<td>4 float (128 bit) SOA AOS conversion with shuffles</td>
</tr>
<tr>
<td>shuf_trans8:</td>
<td>2.7</td>
<td>8 float (256 bit) 8x3 transpose with shuffles</td>
</tr>
</tbody>
</table>
<br />The results are shown for a variety of implementations. To get a fair basis for comparison, we wrote, using hand-picked assembly instructions, an optimal serial implementation, with less accuracy, that averages 8 cycles per loop or 8 cycles per vector normalization. Also for interest, we compared with another 128-bit AOS-&gt;SOA-&gt;AOS programming pattern that uses masks and bitwise logical operations to implement the transpose. On current hardware, this method is slower than the serial implementation. <br /><br />Using Intel AVX's 256-bit SIMD, a regular array of 3D vectors is normalized at a throughput of 2.7 cycles per vector. In other words, the SIMD implementation using shuffles performs better than the best serial implementation. Note that the cost is not exactly the sum of the computation and the shuffle from the previous two tables, because Intel's x86 CPU has independent execution ports for processing shuffles, multiplies, additions, and moves. Overall, the shuffle-based transpose to enable an SOA implementation results in a respectable speedup. <br /><br /> 
<table width="700" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody>
<tr>
<td><b>SIMD Benefit</b></td>
<td><b>Best Serial</b></td>
<td><b>128 bit</b></td>
<td><b>256 bit</b></td>
</tr>
<tr>
<td><b>Speedup</b></td>
<td>1 (baseline)</td>
<td>2.3</td>
<td>2.9</td>
</tr>
</tbody>
</table>
<br /><br />
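The normalization tests above use the approximate inverse square root without refinement. Where more accuracy is needed, one Newton-Raphson step is the usual remedy. The sketch below shows the step in scalar form; refine_rsqrt and rough_rsqrt are hypothetical names, and rough_rsqrt merely fakes a low-precision estimate so that the example runs without SIMD hardware. The extra multiplies would add to the cycle counts reported above, which were measured without this step.<br />
<pre name="code" class="cpp">  #include &lt;cassert&gt;
  #include &lt;cmath&gt;

  // One Newton-Raphson step for y ~ 1/sqrt(x):  y' = y * (1.5 - 0.5 * x * y * y)
  float refine_rsqrt(float x, float y)
  {
      assert(x &gt; 0.0f);   // the inverse square root is only defined for positive x
      return y * (1.5f - 0.5f * x * y * y);
  }

  // Stand-in for a hardware estimate (illustration only): the exact value
  // with a deliberate relative error of about 0.1%.
  float rough_rsqrt(float x)
  {
      return (1.0f / std::sqrt(x)) * 1.001f;
  }
</pre>
Each step roughly squares the relative error of the estimate, so a single step is often enough for single precision work such as lighting normals.<br /><br />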
<h2 class="sectionHeading">Conclusion</h2>
From a vanilla C/C++ implementation, there are a number of incremental steps that a developer can take to improve code performance, including setting compiler flags to use the available architecture/instructions, picking the faster instructions within the implementation, and exploiting SIMD. Starting from our humble non-tuned Release mode compilation and progressing all the way to our 256-bit Intel® AVX version, the performance improves by over an order of magnitude.<br /><br />Working with data structures that are already in SOA layout would certainly be the most efficient, but this is not practical in many applications. Shuffling 3D data between SOA and AOS on the fly is a worthwhile way to get better utilization of the Intel® processor and potentially make 3D applications run faster. The amount of computation per vector in this example is quite small. When there is more work to be done in the innermost loop, the speedup attained by this approach will likely be even greater. <br /><br /><br />
<h2 class="sectionHeading">About The Author</h2>
Stan Melax received his Bachelor's and Master's degrees in Computing Science from the University of Alberta in Canada in the early 90s. He has spent much of his career in the game industry including BioWare, EA, and Ageia. Stan is now a Graphics Software Engineer at Intel Corporation where he gets to help developers write faster code and make better applications. <br /><br /><br />
<h2 class="sectionHeading">Appendix: Transpose Source Code</h2>
<b>AOS to SOA 128-bit</b><br /><br />C source code with 128-bit Intel® Advanced Vector Extensions (Intel® AVX) intrinsics for converting AOS to SOA: <br />
<pre name="code" class="cpp"> float *p;  // address of first vector
 __m128 x0y0z0x1 = _mm_load_ps(p+0);
 __m128 y1z1x2y2 = _mm_load_ps(p+4);  
 __m128 z2x3y3z3 = _mm_load_ps(p+8);
 __m128 x2y2x3y3 = _mm_shuffle_ps(y1z1x2y2,z2x3y3z3,_MM_SHUFFLE( 2,1,3,2));  
 __m128 y0z0y1z1 = _mm_shuffle_ps(x0y0z0x1,y1z1x2y2,_MM_SHUFFLE( 1,0,2,1)); 
 __m128 x        = _mm_shuffle_ps(x0y0z0x1,x2y2x3y3,_MM_SHUFFLE( 2,0,3,0)); // x0x1x2x3
 __m128 y        = _mm_shuffle_ps(y0z0y1z1,x2y2x3y3,_MM_SHUFFLE( 3,1,2,0)); // y0y1y2y3
 __m128 z        = _mm_shuffle_ps(y0z0y1z1,z2x3y3z3,_MM_SHUFFLE( 3,0,3,1)); // z0z1z2z3
</pre>
The output is found in __m128 registers x,y and z. <br /><br /><b>SOA to AOS 128-bit.</b><br /><br />C source code with 128-bit Intel® Advanced Vector Extensions (Intel® AVX) intrinsics for converting SOA to AOS: <br />
<pre name="code" class="cpp">  __m128 x,y,z;  // Starting SOA data
  __m128 x0x2y0y2 = _mm_shuffle_ps(x,y, _MM_SHUFFLE(2,0,2,0)); 
  __m128 y1y3z1z3 = _mm_shuffle_ps(y,z, _MM_SHUFFLE(3,1,3,1)); 
  __m128 z0z2x1x3 = _mm_shuffle_ps(z,x, _MM_SHUFFLE(3,1,2,0)); 

  __m128 rx0y0z0x1= _mm_shuffle_ps(x0x2y0y2,z0z2x1x3, _MM_SHUFFLE(2,0,2,0));  
  __m128 ry1z1x2y2= _mm_shuffle_ps(y1y3z1z3,x0x2y0y2, _MM_SHUFFLE(3,1,2,0));  
  __m128 rz2x3y3z3= _mm_shuffle_ps(z0z2x1x3,y1y3z1z3, _MM_SHUFFLE(3,1,3,1));  

  _mm_store_ps(p+0, rx0y0z0x1 );
  _mm_store_ps(p+4, ry1z1x2y2 );
  _mm_store_ps(p+8, rz2x3y3z3 );
</pre>
Registers x,y,z containing the data for 4 vectors are shuffled and stored into the packed array starting at pointer p. <br /><br /><b>AOS to SOA 256-bit</b><br /><br />C source code with 256-bit Intel® Advanced Vector Extensions (Intel® AVX) intrinsics for converting AOS to SOA: <br />
<pre name="code" class="cpp">  float *p;  // address of first vector
 __m128 *m = (__m128*) p;
 __m256 m03;
 __m256 m14; 
 __m256 m25; 
 m03  = _mm256_castps128_ps256(m[0]); // load lower halves
 m14  = _mm256_castps128_ps256(m[1]);
 m25  = _mm256_castps128_ps256(m[2]);
 m03  = _mm256_insertf128_ps(m03 ,m[3],1);  // load upper halves
 m14  = _mm256_insertf128_ps(m14 ,m[4],1);
 m25  = _mm256_insertf128_ps(m25 ,m[5],1);

 __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE( 2,1,3,2)); // upper x's and y's 
 __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE( 1,0,2,1)); // lower y's and z's
 __m256 x  = _mm256_shuffle_ps(m03, xy , _MM_SHUFFLE( 2,0,3,0)); 
 __m256 y  = _mm256_shuffle_ps(yz , xy , _MM_SHUFFLE( 3,1,2,0)); 
 __m256 z  = _mm256_shuffle_ps(yz , m25, _MM_SHUFFLE( 3,0,3,1)); 
</pre>
Eight 3D vectors are loaded from address p and the output is found in __m256 registers x,y and z. Although this may appear intimidating, it is a natural extension of the 128 bit version. <br /><br /><b>SOA to AOS 256-bit</b><br /><br />C source code with 256-bit Intel® Advanced Vector Extensions (Intel® AVX) intrinsics for converting SOA to AOS: <br />
<pre name="code" class="cpp">  __m256 x,y,z;  // Starting SOA data
  float *p; // output pointer
__m128 *m = (__m128*) p;

  __m256 rxy = _mm256_shuffle_ps(x,y, _MM_SHUFFLE(2,0,2,0)); 
  __m256 ryz = _mm256_shuffle_ps(y,z, _MM_SHUFFLE(3,1,3,1)); 
  __m256 rzx = _mm256_shuffle_ps(z,x, _MM_SHUFFLE(3,1,2,0)); 

  __m256 r03 = _mm256_shuffle_ps(rxy, rzx, _MM_SHUFFLE(2,0,2,0));  
  __m256 r14 = _mm256_shuffle_ps(ryz, rxy, _MM_SHUFFLE(3,1,2,0)); 
  __m256 r25 = _mm256_shuffle_ps(rzx, ryz, _MM_SHUFFLE(3,1,3,1));  

  m[0] = _mm256_castps256_ps128( r03 );
  m[1] = _mm256_castps256_ps128( r14 );
  m[2] = _mm256_castps256_ps128( r25 );
  m[3] = _mm256_extractf128_ps( r03 ,1);
  m[4] = _mm256_extractf128_ps( r14 ,1);
  m[5] = _mm256_extractf128_ps( r25 ,1);
</pre>
Registers x,y,z containing the data for eight 3D vectors are shuffled and stored into the packed array starting at pointer p.
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Stan Melax</div>
</div>
<div id="vc-meta-pubdate">09-27-2010</div>
<div id="vc-meta-modificationdate">09-27-2010</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Intel® AVX</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">This article shows how to exploit 256-bit Intel® Advanced Vector Extensions (Intel® AVX) to normalize an array of 3D vectors. A shuffle approach is described to convert between an Array of Structures (AOS) and a Structure of Arrays (SOA) on-the-fly in order to make data ready for up to 8-wide single instruction multiple data (SIMD) processing. The round trip cost, 11 extra instructions, of enabling SOA processing is low enough to justify its usage on this small computation.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx/</link>
      <pubDate>Mon, 27 Sep 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Game Development</category>
    </item>
    <item>
      <title>Intel® AVX C/C++ Intrinsics Emulation</title>
      <description><![CDATA[ <p>The Intel® AVX instruction set extension <a target="_blank" href="http://software.intel.com/en-us/avx/">[1]</a> will appear in the next generation Intel microarchitecture codename 'Sandy Bridge'. We chose to announce AVX early to get as much support from software vendors as possible by the hardware launch time. Most software development platforms now support Intel AVX; examples include compilers and assemblers from Intel, Microsoft, and GCC, as well as UNIX binutils.</p>
<p>For early adopters we introduced support of AVX in the Intel® Software Development Emulator <a target="_blank" href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">[2]</a>. It allows you to run and check the functional correctness of code that uses the actual AVX instructions before hardware is available.</p>
<p>Today we are adding another useful piece to help those who may not be able to use new tools supporting AVX in their current development environment but plan to migrate in the future or are using a software platform which is not supported by Intel SDE. These software developers can still start programming with Intel AVX using intrinsics.</p>
<p>Here we are providing the C and C++ header file which emulates Intel AVX intrinsics. The AVX emulation header file uses intrinsics for the prior Intel instruction set extensions up to Intel SSE4.2. SSE4.2 support in your development environment as well as hardware is required in order to use the AVX emulation header file. <br /><br />To use it, simply include this file:</p>
<p>#include "avxintrin_emu.h"</p>
<p>Instead of the usual:</p>
<p>#include &lt;immintrin.h&gt;</p>
<p><br />One can also create an alternative immintrin.h file (which in turn includes avxintrin_emu.h) to avoid an intrusive change to the source base, and then simply switch between real AVX code generation and emulation by changing the include directory path.</p>
<p>The emulation header primarily targets UNIX-type environments and was tested there with the GCC and Intel C/C++ compilers. Other tools (compilers, assemblers and SDE) provide strong support on the Microsoft Windows platform, but this header file can still be used on Windows, if desired, with the Intel Compiler.</p>
<p>Note that the AVX emulation header file is designed for checking the functional correctness of an AVX implementation and is not recommended for long-term usage or release in a final product. Once your development environment and hardware support AVX, we recommend that you switch to the real AVX intrinsic header file.<br /><br />Although we did our best to debug it, this file must <em>not</em> be considered a reference functional implementation of AVX instructions or even bug-free. Please see the current version's limitations and caveats in the beginning of the file. Please let us know about any issues you encounter while using it.</p>
<p><b><br />Example</b></p>
<pre name="code" class="cpp:nogutter:nocontrols">#include &lt;stddef.h&gt;         // size_t
#include "avxintrin_emu.h"  // #include &lt;immintrin.h&gt;

void saxpy( float a, const float* x, const float* y, float* __restrict z, size_t len )
{
    size_t i = 0;
    __m256 a_ = _mm256_set1_ps( a );

    for ( size_t len16_ = len &amp; -16; i + 16 &lt;= len16_; i += 16 )
    {
        __m256 x1_ = _mm256_loadu_ps( x + i );
        __m256 x2_ = _mm256_loadu_ps( x + i + 8 );

        __m256 y1_ = _mm256_loadu_ps( y + i );
        __m256 y2_ = _mm256_loadu_ps( y + i + 8 );

        x1_ = _mm256_mul_ps( x1_, a_ );
        x2_ = _mm256_mul_ps( x2_, a_ );

        x1_ = _mm256_add_ps( x1_, y1_ );
        x2_ = _mm256_add_ps( x2_, y2_ );

        _mm256_storeu_ps( z + i     , x1_ );
        _mm256_storeu_ps( z + i + 8 , x2_ );
    }

    for ( ; i &lt; len; ++i )
        z[i] = x[i] * a + y[i];
}</pre>
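<p>The loop bounds in the example can be read as follows: the vector loop handles the largest multiple of 16 elements, and the scalar loop finishes the remainder. The sketch below isolates that arithmetic. The helper names main_loop_count and tail_count are hypothetical and do not appear in the emulation header or in the example.</p>
<pre name="code" class="cpp:nogutter:nocontrols">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;

// Elements covered by the 16-wide vector loop: len rounded down to a multiple of 16.
// Clearing the low four bits gives the same value as "len &amp; -16" in the example,
// because -16 converted to size_t has every bit set except the low four.
size_t main_loop_count( size_t len )
{
    size_t count = len &amp; ~(size_t)15;
    assert( count % 16 == 0 );
    return count;
}

// Elements left over for the scalar loop (always fewer than 16).
size_t tail_count( size_t len )
{
    return len - main_loop_count( len );
}</pre>
<p>For len = 37, for instance, the vector loop covers 32 elements in two iterations and the scalar loop handles the remaining 5.</p>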
<p><br /><strong><br />References </strong></p>
<p>[1] Intel AVX - <a target="_blank" href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a></p>
<p>[2] Intel Software Development Emulator - <a target="_blank" href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">http://software.intel.com/en-us/articles/intel-software-development-emulator/</a></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/avx-emulation-header-file/</link>
      <pubDate>Wed, 23 Jun 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/avx-emulation-header-file/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/avx-emulation-header-file/</guid>
      <category>Parallel Programming</category>
      <category>Open Source</category>
      <category>What If Experimental Software</category>
      <category>Tools</category>
      <category>Intel® AVX</category>
      <category>Software News</category>
      <category>Code &amp; Downloads</category>
    </item>
    <item>
      <title>Using Intel® Advanced Vector Extensions to Implement an Inverse Discrete Cosine Transform</title>
      <description><![CDATA[ <h1 class="sectionHeading">Download Article</h1>
Download <a href="http://software.intel.com/file/30921">Using Intel® Advanced Vector Extensions to Implement an Inverse Discrete Cosine Transform</a> [PDF 335KB]<br />Download <a href="http://software.intel.com/file/29048">IDCT source code</a> [Zip 22KB]<br /><br />
<h1 class="sectionHeading">Introduction</h1>
Transform coding is an important step of image and video processing applications. Pixels in an image have a level of correlation with their neighboring pixels. Adjacent pixels in successive frames show a very high correlation. These correlations can be used to predict the value of a pixel from its neighbor. The highly correlated spatial data is transformed into uncorrelated coefficients in the frequency domain. The transformed data, represented as coefficients, are independent, allowing them to be manipulated separately. The human eye is more perceptive to low-frequency changes than high-frequency changes. Encoders can achieve compression after setting the high-frequency coefficients to zero. An inverse transform is used in the decoder pipeline to reconstruct the source data.<br /><br /> Figure 1 and Figure 2 depict the components of a typical image encoder/decoder system.<br /><br />
<p ><img src="http://software.intel.com/file/28098" /></p>
<br />
<div ><b>Figure 1 - Source Encoder Block Diagram</b><br /></div>
<br />
<p ><img src="http://software.intel.com/file/28099" /></p>
<br />
<div ><b>Figure 2 - Source Decoder Block Diagram</b><br /></div>
<br /> The Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) are widely used in the encoder and decoder pipelines of image processing systems such as MPEG, JPEG, and other standards. A fast and accurate IDCT transformation is crucial for the best user experience.<br /><br /> Intel® Advanced Vector Extensions (Intel® AVX), introduced with the new Intel® processor microarchitecture codenamed Sandy Bridge, extends the capabilities of Intel® Streaming SIMD Extensions (Intel® SSE) for floating point data and operations such as Inverse Discrete Cosine Transforms. Intel® AVX essentially doubles the width of the current XMM registers and adds new extensions that can operate on the wider data width. Intel® AVX significantly increases the floating-point performance density with improved power efficiency over previous 128-bit SIMD instruction set extensions. This document specifically examines how Intel® AVX and Sandy Bridge microarchitecture features such as wider 256-bit SIMD registers, non-destructive source operands, new data manipulation and arithmetic primitives, dual 128-bit load ports, and doubling of computational execution width can benefit the Inverse Discrete Cosine Transform (IDCT) operation. Intel® AVX improves the computational throughput of high performance precision-sensitive floating point transform applications. Integer implementations may also benefit from features such as non-destructive source operands. <br /><br /><br /> Intel also provides a set of Intel® AVX software development tools like the Intel® AVX-enabled Intel® C++ Compiler, the Intel® Software Development Emulator (SDE), and the Intel® Architecture Code Analyzer.  All of these tools were effectively used during the development of this kernel, and can be downloaded from the Intel® AVX website: <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a><br /><br />
<h1 class="sectionHeading">Testing Environment</h1>
The performance speedups stated in this paper are based on runs on actual Sandy Bridge microarchitecture-based pre-production silicon.  It can also be assumed that the test data is already in the first level processor cache prior to the computation of the IDCT algorithm.  Performance comparisons are made based on the relative performance of Intel® AVX versus corresponding Intel® SSE implementations using C intrinsic instructions, both run on the Sandy Bridge microarchitecture-based silicon. The code was compiled using the 64-bit Intel® C++ Compiler, version 11.1.038. The applications were compiled using the following command line options:<br /><br /> 
<ul>
<li>Intel® SSE: /QxSSE4.1 /O3</li>
<li>Intel® AVX: /QxAVX /O3</li>
</ul>
A comparison of short integer performance was performed by compiling the Intel<sup>®</sup> SSE implementation for different architectures. The compiler will generate VEX-encoded Intel<sup>®</sup> SSE instructions when the /QxAVX switch is specified. The compiler will generate Intel<sup>®</sup> SSE 4.1 instructions when the /QxSSE4.1 switch is specified. Non-destructive source operands are introduced with the VEX-encoded instructions. <br /><br /> The test application follows the 8x8 IDCT accuracy requirements as documented in section 3 of the IEEE standard 1180-1990 (Reference [5]).  Randomly generated input data is processed by a reference DCT. The output of the DCT is rounded to the nearest integer and clipped such that -2048 &lt;= value &lt;= 2047. The clipped values are inputs to both the reference IDCT and the IDCT under test. Their outputs are rounded to the nearest integer and clipped such that -256 &lt;= output &lt;= 255. The two results are then compared and accuracy measurements are made.<br /><br /> In this application, the references are implemented using double precision floating point scalar C code. Several proposed IDCT implementations are tested: a short integer Intel® SSE version, a single precision floating point Intel® SSE version, and a single precision floating point Intel® AVX version. All the vector versions are implemented with C intrinsics.<br /><br />
<p ><img src="http://software.intel.com/file/28086" /></p>
<br />
<div ><b>Figure 3 - Test Application</b><br /></div>
<br />
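The rounding, clipping, and comparison steps described above can be sketched in scalar C++. This is a minimal illustration, not the test application's code: the function names are hypothetical, both clipping bounds are treated as inclusive (the 12-bit and 9-bit signed ranges), and the mean error is assumed to be the signed average of the differences between the two clipped outputs.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;cmath&gt;
#include &lt;cstddef&gt;

//Round to the nearest integer, then clip to the inclusive range [lo, hi]
inline long roundAndClip(double value, long lo, long hi) {
    assert(lo &lt;= hi);
    const long rounded = std::lround(value);
    if (rounded &lt; lo) return lo;
    if (rounded &gt; hi) return hi;
    return rounded;
}

//DCT coefficients that feed both the reference IDCT and the IDCT under test
inline long clipCoefficient(double c) { return roundAndClip(c, -2048, 2047); }

//IDCT outputs that are compared against each other
inline long clipPixel(double p) { return roundAndClip(p, -256, 255); }

//Hypothetical helper: mean of (test - reference) over all clipped outputs
inline double overallMeanError(const double *reference, const double *underTest, std::size_t count) {
    assert(count &gt; 0);
    double sum = 0.0;
    for (std::size_t i = 0; i &lt; count; i++) {
        sum += static_cast&lt;double&gt;(clipPixel(underTest[i]) - clipPixel(reference[i]));
    }
    return sum / static_cast&lt;double&gt;(count);
}
</pre>
Calling overallMeanError on the outputs of the reference IDCT and of one candidate IDCT yields a single figure of the kind reported per input range in the results.<br /><br />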
<h1 class="sectionHeading">Discrete Cosine Transforms</h1>
The Discrete Cosine Transform (DCT) converts the spatial data of an image into the frequency domain. The mathematical operations are described in detail in Reference [1].<br /><br /> The two-dimensional (2D) DCT transforms 64 pixel values, an 8x8 block of pixels, producing 64 coefficients. The image's pixel values are transformed into coefficients derived from amplitudes of cosine basis functions.<br /><br />
<p ><img src="http://software.intel.com/file/28087" /></p>
<br /><br /> Where:<br /><br />
<p ><img src="http://software.intel.com/file/28088" /></p>
<br /> Reference [2] summarizes the number of mathematical operations; the numbers are worth repeating.  Comparing the DCT and IDCT equations, it can be shown that the transforms require the same number of operations.  An analysis of Equation 2 shows that the 2D DCT requires 64 multiplications and 63 additions for each coefficient. Therefore 4096 multiplications and 4032 additions are required to transform an 8x8 block. Replacing the 2D DCT with 8 1D DCTs for the 8 rows and 8 1D DCTs for the 8 columns reduces the number of operations. Equation 1 shows that the 1D DCT requires 64 multiplications and 56 additions to produce 8 coefficients. Transforming the 8x8 block with 16 1D DCTs therefore requires 1024 multiplications and 896 additions. References [1] and [4] discuss other DCT and IDCT algorithms.<br /><br /><br />
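The operation counts can be checked with a small scalar sketch. The direct 2D form needs 64 x 64 = 4096 multiplications and 64 x 63 = 4032 additions; the row-column form below counts its own arithmetic as it runs. The code assumes the usual JPEG normalization of the DCT (the equations above are not reproduced here), and the names OpCount, dct1d, and dct2dSeparable are hypothetical. Only arithmetic on the data is counted, since the cosine terms would normally come from a precomputed table.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;cmath&gt;

//Counts only the arithmetic performed on the data
struct OpCount {
    long mults;
    long adds;
};

//1D DCT of 8 values: each of the 8 coefficients is a sum of 8 products
inline void dct1d(const double *in, int inStride, double *out, int outStride, OpCount &amp;ops) {
    assert(inStride &gt; 0 &amp;&amp; outStride &gt; 0);
    const double pi = std::acos(-1.0);
    for (int u = 0; u &lt; 8; u++) {
        //Assumed normalization: C(u)/2 with C(0) = 1/sqrt(2)
        const double scale = (u == 0) ? std::sqrt(0.125) : 0.5;
        double sum = 0.0;
        for (int x = 0; x &lt; 8; x++) {
            //Stand-in for a table lookup of the scaled cosine term
            const double basis = scale * std::cos((2 * x + 1) * u * pi / 16.0);
            const double product = in[x * inStride] * basis;
            ops.mults++;
            if (x == 0) {
                sum = product;
            } else {
                sum += product;
                ops.adds++;
            }
        }
        out[u * outStride] = sum;
    }
}

//2D DCT as 8 row transforms followed by 8 column transforms
inline OpCount dct2dSeparable(const double *in, double *out) {
    OpCount ops = {0, 0};
    double rowsDone[64];
    for (int r = 0; r &lt; 8; r++) dct1d(&amp;in[r * 8], 1, &amp;rowsDone[r * 8], 1, ops);
    for (int c = 0; c &lt; 8; c++) dct1d(&amp;rowsDone[c], 8, &amp;out[c], 8, ops);
    return ops;
}
</pre>
Transforming any 8x8 block with dct2dSeparable reports 1024 multiplications and 896 additions, which is 16 transforms at 64 multiplications and 56 additions each.<br /><br /><br />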
<h1 class="sectionHeading">Previous Work and Specifications</h1>
The following previous works and specifications should be consulted for background information:<br /> 
<ul>
<li>Reference [1] for the JPEG specification.</li>
<li>The IDCT algorithm used in the Intel® AVX implementation discussed in this whitepaper follows the optimized algorithm discussed in Reference [2].</li>
<li>Reference [3] provides additional background information.</li>
</ul>
<br />
<h1 class="sectionHeading">Intel® AVX Implementation of IDCT</h1>
As in the previous works, the algorithm in this whitepaper performs 8 1D IDCT transforms on the rows of an 8x8 block. Then 8 1D IDCT transforms are performed on the columns of those results. Two different 1D IDCT transforms are used, and neither transform requires a transpose.<br /><br /> The AVX implementation operates on two rows of data simultaneously. It begins by loading four floats from each row with separate 128-bit loads. The reason for performing 128-bit loads will become clearer in a moment. The four floats from the first row land in the lower 128 bits of a YMM register, because the 128-bit XMM registers overlay the lower 128 bits of the corresponding 256-bit YMM registers. The upper 128 bits of the YMM register are then loaded with the same four float positions from the second row via the _mm256_insertf128_ps instruction.  The resultant YMM register contains the first set of four floats from one row in the lower 128 bits, and the first set of four floats from the second row in the upper 128 bits. This is known as a strided load (because the next load reads from an area of memory that is more than a unit stride away from the current memory location). This seems like a lot of work to load eight floats, so why take this approach?<br /><br /> The strided load is effective in this application for the following reasons.<br /><br /> 
<ul>
<li>Eight products need to be summed in the IDCT and it is more straightforward to sum the products if they are in the same 128-bit lane of two YMM registers. Two registers are built with the contents from one row in the bottom 128 bits of both registers, and the contents from the second row in the upper 128 bits of both registers. This is depicted in Figure 4 through Figure 6.</li>
<li>Another benefit of the strided load approach is that data movement, or placement of the 128-bit memory operand into the upper 128 bits, can take place on a different execution port than the shuffle port. This reduces the utilization of the port that performs shuffles, and allocates the work more evenly across the execution ports.</li>
<li>One last benefit is that placing floats 4-7 of two rows into one register reduces the number of shuffles needed near the end of row processing.</li>
</ul>
<p ><img src="http://software.intel.com/file/28089" /></p>
<br />
<div ><b>Figure 4 - Loading Four Floats from Row 0</b><br /></div>
<br /> The combination of the <span ><i>_mm_load_ps</i></span> and <span ><i>_mm256_castps128_ps256</i></span> instructions loads 128-bits into the least significant 128-bits of a ymm register.<br /><br /> Figure 5 depicts the strided load of Row 4 data, and the insertion into the most significant 128-bits of ymm0.<br /><br />
<p ><img src="http://software.intel.com/file/28090" /></p>
<br />
<div ><b>Figure 5 - Strided Load of Second Row</b><br /></div>
<br /> A second YMM register is populated in a similar manner; it contains the second set of four floats (floats 4-7) from both rows.  The results are shown in Figure 6.<br /><br />
<p ><img src="http://software.intel.com/file/28091" /></p>
<br />
<div ><b>Figure 6 - Second Set of Floats from Two Rows</b><br /></div>
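<br /> The loads depicted in Figure 4 through Figure 6 can be sketched as one small function. This is an illustration under stated assumptions, not the paper's listing: loadRowPair and AVX_TARGET are hypothetical names, the block is assumed to be 64 row-major floats aligned to at least 16 bytes, and the results are written back to plain arrays so they can be inspected. The AVX_TARGET macro only lets GCC and Clang accept AVX intrinsics without a command-line switch; with the Intel compiler the /QxAVX option plays that role.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;cstdint&gt;
#include &lt;immintrin.h&gt;

#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

//Builds the two registers for one pair of rows (for example rows 0 and 4):
//first4  = floats 0-3 of lowRow | floats 0-3 of highRow
//second4 = floats 4-7 of lowRow | floats 4-7 of highRow
AVX_TARGET void loadRowPair(const float *block, int lowRow, int highRow,
                            float *first4, float *second4) {
    //_mm_load_ps requires 16-byte alignment
    assert(reinterpret_cast&lt;std::uintptr_t&gt;(block) % 16 == 0);

    //128-bit loads from the low row fill the lower lanes
    __m256 a = _mm256_castps128_ps256(_mm_load_ps(&amp;block[lowRow * 8]));
    __m256 b = _mm256_castps128_ps256(_mm_load_ps(&amp;block[lowRow * 8 + 4]));

    //Strided 128-bit loads from the high row are inserted into the upper lanes
    a = _mm256_insertf128_ps(a, _mm_load_ps(&amp;block[highRow * 8]), 1);
    b = _mm256_insertf128_ps(b, _mm_load_ps(&amp;block[highRow * 8 + 4]), 1);

    _mm256_storeu_ps(first4, a);
    _mm256_storeu_ps(second4, b);
}
</pre>
With a block whose element i holds the value i, loadRowPair(block, 0, 4, first4, second4) leaves 0, 1, 2, 3, 32, 33, 34, 35 in first4 and 4, 5, 6, 7, 36, 37, 38, 39 in second4.<br />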
<br /> The input data must be multiplied by different cosine terms.  A single float from each row is broadcast via the <span ><i>_mm256_shuffle_ps</i></span> intrinsic to prepare for the multiplication, as shown in Figure 7. In this case float 0 is duplicated.<br /><br />
<p ><img src="http://software.intel.com/file/28092" /></p>
<br />
<div ><b>Figure 7 - Preparing for Cosine Multiplication</b><br /></div>
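<br /> The broadcast in Figure 7, together with the multiply and add that consume it, can be sketched as follows. The function name sumOfProducts02 and the AVX_TARGET macro are hypothetical, the coefficient arrays stand in for the paper's cosine tables, and unaligned loads are used only to keep the sketch free of alignment requirements.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;immintrin.h&gt;

#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

//rowPair holds four floats of one row in the lower lane and four floats
//of a second row in the upper lane. The result is, for each lane,
//float0 * coeffs0[i] + float2 * coeffs2[i].
AVX_TARGET void sumOfProducts02(const float *rowPair, const float *coeffs0,
                                const float *coeffs2, float *out) {
    assert(rowPair != 0 &amp;&amp; coeffs0 != 0 &amp;&amp; coeffs2 != 0 &amp;&amp; out != 0);
    const __m256 v = _mm256_loadu_ps(rowPair);

    //0x00 selects element 0 of each 128-bit lane for all four positions
    const __m256 float0 = _mm256_shuffle_ps(v, v, 0x00);
    //0xaa selects element 2 of each 128-bit lane for all four positions
    const __m256 float2 = _mm256_shuffle_ps(v, v, 0xaa);

    const __m256 p0 = _mm256_mul_ps(float0, _mm256_loadu_ps(coeffs0));
    const __m256 p2 = _mm256_mul_ps(float2, _mm256_loadu_ps(coeffs2));
    _mm256_storeu_ps(out, _mm256_add_ps(p0, p2));
}
</pre>
The shuffle never crosses the 128-bit lane boundary, which is why each lane ends up with a broadcast of its own row's value. The immediates 0x55 and 0xff select elements 1 and 3 in the same way.<br />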
<br /> Two <span ><i>_mm256_mul_ps</i></span> instructions perform the necessary cosine multiplications, and one <span ><i>_mm256_add_ps</i></span> instruction adds the two products.  This completes one of the seven sums of product terms.<br /><br /> This sequence of instructions is repeated three times, operating on a different pair of input values and cosine terms each time. At this point there are four registers containing sums of products: one for floats 0 and 2, one for floats 1 and 3, one for floats 4 and 6, and one for floats 5 and 7.<br /><br /> Three more sums are necessary to complete the sum of products. The sums of products of all even terms are added, and the sums of products of all odd terms are added. That is only two of the three sums.<br /><br />
<pre name="code" class="cpp">ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br /></pre>
The seventh and final sum produces the least significant four floats of the output. A subtraction is necessary to produce the most significant four floats; however, they are not in the correct order. <br /><br />
<pre name="code" class="cpp">ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /></pre>
The order of the most significant floats of two rows can be properly adjusted using a single <span ><i>_mm256_shuffle_ps</i></span> instruction, as shown in Figure 8. This is where the benefit of placing the most significant four floats from each input row in the same register becomes apparent.<br /><br />
<p ><img src="http://software.intel.com/file/28093" /></p>
<br />
<div ><b>Figure 8 - Reversing the Order of Differences</b><br /></div>
<br /> The final step for the 1D IDCT for these two rows is to recombine the results for each row into the same 256-bit register using two <span ><i>_mm256_permute2f128_ps</i></span> instructions. <br /><br />
<p ><img src="http://software.intel.com/file/28094" /></p>
<br />
<div ><b>Figure 9 - Recombining Row 0</b><br /></div>
<br />
<p ><img src="http://software.intel.com/file/28095" /></p>
<br />
<div ><b>Figure 10 - Recombining Row 4</b><br /></div>
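<br /> The reversal in Figure 8 and the recombination in Figures 9 and 10 can be sketched together. The name finishRowPair and the AVX_TARGET macro are hypothetical; the inputs stand for the ymm_sum and ymm_diff registers written out to memory, and the immediates are the ones used in the source listing.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;immintrin.h&gt;

#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

//sum:  outputs 0-3 of the low row | outputs 0-3 of the high row
//diff: outputs 7-4 of the low row | outputs 7-4 of the high row (reversed)
AVX_TARGET void finishRowPair(const float *sum, const float *diff,
                              float *lowRowOut, float *highRowOut) {
    assert(sum != 0 &amp;&amp; diff != 0 &amp;&amp; lowRowOut != 0 &amp;&amp; highRowOut != 0);
    const __m256 s = _mm256_loadu_ps(sum);
    __m256 d = _mm256_loadu_ps(diff);

    //0x1b reverses the four floats inside each 128-bit lane
    d = _mm256_shuffle_ps(d, d, 0x1b);

    //0x20: lower lane of s, then lower lane of d (the complete low row)
    _mm256_storeu_ps(lowRowOut, _mm256_permute2f128_ps(s, d, 0x20));
    //0x31: upper lane of s, then upper lane of d (the complete high row)
    _mm256_storeu_ps(highRowOut, _mm256_permute2f128_ps(s, d, 0x31));
}
</pre>
One shuffle fixes the order for both rows at once because the reversed halves of the two rows share a register, and the two permutes then gather each row's halves into a single register.<br />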
<br /> That completes the 1D IDCT for one pair of rows. The sequence is repeated for rows 1 and 7, 2 and 6, and finally 3 and 5 to complete the 1D IDCT for eight rows.<br /><br /> Now that the 1D IDCT of the eight rows is completed, the 1D IDCT of the eight columns can begin. The algorithm is similar, yet simpler because the sums of products are already in the proper position for the addition. The 1D IDCT of the columns does not require any shuffles.<br /><br /> It was possible to eliminate four add instructions in the floating point IDCT of the columns. The straightforward form multiplies a row by a tangent term and then adds the row back:<br /><br />
<pre name="code" class="cpp">      //row5*tangent + row5<br />r_ymm0 = _mm256_mul_ps(row5, tangent);<br />r_ymm0 = _mm256_add_ps(row5, r_ymm0);<br /></pre>
A single multiply of the row times the tangent term plus 1 produces identical results in the floating point implementation.<br /><br />
<pre name="code" class="cpp">tangent_p1 = tan(x) + 1<br /></pre>
<pre name="code" class="cpp">//row5*(tangent+1)<br />r_ymm0 = _mm256_mul_ps(row5, tangent_p1);<br />//row3*(tangent+1)<br />r_ymm1 = _mm256_mul_ps(row3, tangent_p1);<br /></pre>
This optimization could not be applied to the short integer implementation due to the scaling of numbers in the short algorithm.<br /><br /><br />
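The effect of folding the +1 into the constant can be checked with a scalar sketch. The function names are hypothetical. Because the two forms round at different points, the sketch compares them within single-precision rounding rather than bit for bit.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;cmath&gt;

//Original form: one multiply and one add per row
inline float timesTangentPlusRow(float row, float tangent) {
    return row * tangent + row;
}

//Optimized form: the +1 is folded into the constant ahead of time
inline float timesTangentPlusOne(float row, float tangentPlusOne) {
    return row * tangentPlusOne;
}

//Largest absolute difference between the two forms over some sample rows
inline float maxDifference(const float *rows, int count, float tangent) {
    assert(count &gt; 0);
    const float tangentPlusOne = tangent + 1.0f;
    float worst = 0.0f;
    for (int i = 0; i &lt; count; i++) {
        const float a = timesTangentPlusRow(rows[i], tangent);
        const float b = timesTangentPlusOne(rows[i], tangentPlusOne);
        worst = std::fmax(worst, std::fabs(a - b));
    }
    return worst;
}
</pre>
For sample rows in the pixel range and a tangent such as tan(3*pi/16) (an assumed value, suggested by the constant names in the listing), the difference stays at the level of single-precision rounding.<br /><br /><br />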
<h1 class="sectionHeading">Using Intel® Architecture Code Analyzer</h1>
The Intel® Architecture Code Analyzer is an excellent tool that identifies the critical path in a basic block, execution port utilization, and instruction usage alternatives. There is much more Intel® Architecture Code Analyzer information available at the <a href="http://software.intel.com/en-us/avx/">Intel® AVX developer site</a>.<br /><br /> The basic block analysis in Figure 11 shows that the vshufps (invoked via the <span ><i>_mm256_shuffle_ps</i></span> intrinsic) executes on port 5. The analysis also shows that the vinsertf128 instruction (<span ><i>_mm256_insertf128_ps</i></span> intrinsic) offers the programmer choices for port execution. The "X" in the Intel® Architecture Code Analyzer report indicates that an instruction can be executed on the port, but in this particular algorithm it was executed elsewhere. In this case, the vinsertf128 was executed on port 0 and port 3 rather than port 3 and port 5 because the second source operand was loaded from memory. The insert takes place on port 0, thus reducing the pressure on port 5.<br /><br />
<p ><img src="http://software.intel.com/file/28096" /></p>
<br />
<div ><b>Figure 11 - Basic Block Analysis</b><br /></div>
<br /> The Intel® Architecture Code Analyzer is an effective optimization tool for software developers who want to fine tune and improve their application's performance.<br /><br /><br />
<h1 class="sectionHeading">Results</h1>
The 128-bit code was compiled for the Intel® microarchitecture codenamed Nehalem to generate Intel® SSE code, and executed on the Sandy Bridge microarchitecture-based silicon.  The corresponding 256-bit Intel® AVX-enabled code was compiled for the Sandy Bridge microarchitecture and executed on the Sandy Bridge microarchitecture-based silicon. Data was aligned on 16-byte boundaries for the Intel® SSE code and 32-byte boundaries for the Intel® AVX code. Both applications were compiled using the 64-bit version of the Intel® C++ Compiler Professional Edition, version 11.1.038.  The speedups listed here are for 256-bit code relative to the 128-bit code.<br /><br /> <b>Table 1 - Performance Results</b><br /><br /> 
<table class="tableFormat1" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody >
<tr >
<td  width="33%"><b>Algorithm</b></td>
<td  width="33%"><b>Speedup</b></td>
<td  width="33%"><b>Parameters</b></td>
</tr>
<tr >
<td >Intel® AVX floating point implementation<br /></td>
<td >0.94x</td>
<td >Compared to Intel® SSE short integer implementation</td>
</tr>
<tr >
<td >Intel® SSE short integer implementation compiled with /QxAVX<br /></td>
<td >1.07x</td>
<td >Compared to Intel® SSE short integer implementation compiled with /QxSSE4.1<br /></td>
</tr>
<tr >
<td >Intel® AVX floating point implementation<br /></td>
<td >1.78x</td>
<td >Compared to Intel® SSE floating point implementation</td>
</tr>
</tbody>
</table>
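<br /> The 16-byte and 32-byte data alignment used for these measurements can be expressed directly in the type of a block. The sketch below assumes a C++11 compiler for alignas; the type names and the isAligned helper are hypothetical and are not taken from the test application.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

//One 8x8 block of floats for the 128-bit code (aligned 128-bit loads need 16 bytes)
struct alignas(16) SseBlock {
    float data[64];
};

//One 8x8 block of floats for the 256-bit code (aligned 256-bit stores need 32 bytes)
struct alignas(32) AvxBlock {
    float data[64];
};

//True when the pointer sits on the given power-of-two byte boundary
inline bool isAligned(const void *p, std::size_t boundary) {
    assert(boundary != 0 &amp;&amp; (boundary &amp; (boundary - 1)) == 0);
    return (reinterpret_cast&lt;std::uintptr_t&gt;(p) &amp; (boundary - 1)) == 0;
}
</pre>
Because each block is 256 bytes, every element of an array of AvxBlock stays on a 32-byte boundary, so aligned stores such as _mm256_store_ps can be used on each row.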
<br /> Table 2 summarizes the overall mean error results for each implementation.<br /><br /> <b>Table 2 - Overall Mean Error Results</b><br /><br /> 
<table class="tableFormat1" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody >
<tr >
<td  width="25%">L, H (Reference 5)</td>
<td  width="25%">Intel® SSE Short</td>
<td  width="25%">Intel® SSE Single Precision Floating Point</td>
<td  width="25%">Intel® AVX Single Precision Floating Point</td>
</tr>
<tr >
<td >256, 255</td>
<td >3.44e-5</td>
<td >6.25e-6</td>
<td >6.25e-6</td>
</tr>
<tr >
<td >5, 5</td>
<td >2.58e-4</td>
<td >1.56e-6</td>
<td >1.56e-6</td>
</tr>
<tr >
<td >300, 300</td>
<td >4.69e-6</td>
<td >6.25e-6</td>
<td >6.25e-6</td>
</tr>
<tr >
<td >-255, 256</td>
<td >7.53e-4</td>
<td >3.13e-6</td>
<td >3.13e-6</td>
</tr>
<tr >
<td >-5, 5</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
<tr >
<td >-300, 300</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
<tr >
<td >Input = 0</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
</tbody>
</table>
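<br /> Both the short integer and single precision floating point implementations meet the error requirements of IEEE Std 1180-1990 (Reference 5). The floating point implementations produce lower overall mean error in three of the tests: roughly 5x, 165x, and 240x lower than the short integer implementation for the (256, 255), (5, 5), and (-255, 256) ranges respectively. In the (300, 300) test the short integer implementation has the slightly lower error.<br /><br /><br />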
<h1 class="sectionHeading">Conclusion</h1>
The results for the IDCT of 10,000 8x8 blocks show that the Intel® AVX version outperformed the Intel® SSE single precision floating point implementation by 1.78x. The accuracy is excellent when compared to the reference IDCT.<br /><br /> The results also show that the Intel<sup>®</sup> SSE short integer version compiled with the /QxAVX option outperformed the exact same code compiled with the /QxSSE4.1 option by 1.07x. The assembly language produced when compiling with /QxSSE4.1 has 22 register-to-register moves. The code produced with the /QxAVX switch did not have any register-to-register moves. The Intel<sup>®</sup> AVX non-destructive source instructions reduce the need for register copies in this application. There can be benefits to using Intel<sup>®</sup> AVX for integer-based algorithms today.<br /><br /> Although the Intel® AVX floating point implementation is slightly slower than the Intel® SSE short integer version (0.94x), it is more accurate in most of the tests. The Intel® SSE short version required several adjustments to improve the accuracy and minimize rounding errors.  Those adjustments were not necessary in either floating point implementation, resulting in a cleaner and more accurate implementation.<br /><br /><br />
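The register-copy effect can be illustrated with a short integer fragment in which both inputs are needed twice. This is a sketch, not code from the IDCT: the function name is hypothetical, and the instruction sequences in the comments describe typical code generation, which depends on the compiler and its register allocation.<br /><br />
<pre name="code" class="cpp">#include &lt;cassert&gt;
#include &lt;emmintrin.h&gt;

//Eight 16-bit sums and differences; va and vb are each used twice
inline void sumAndDifference16(const short *a, const short *b, short *sum, short *diff) {
    assert(a != 0 &amp;&amp; b != 0 &amp;&amp; sum != 0 &amp;&amp; diff != 0);
    const __m128i va = _mm_loadu_si128(reinterpret_cast&lt;const __m128i *&gt;(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast&lt;const __m128i *&gt;(b));

    //Legacy SSE encoding: paddw xmm0, xmm1 overwrites xmm0, so va must be
    //copied (movdqa) before the add if it is still needed for the subtract.
    //VEX encoding: vpaddw xmm2, xmm0, xmm1 writes a third register and
    //leaves both sources intact, so no copy is required.
    const __m128i vsum = _mm_add_epi16(va, vb);
    const __m128i vdiff = _mm_sub_epi16(va, vb);

    _mm_storeu_si128(reinterpret_cast&lt;__m128i *&gt;(sum), vsum);
    _mm_storeu_si128(reinterpret_cast&lt;__m128i *&gt;(diff), vdiff);
}
</pre>
The source is identical for both builds; only the encoding chosen by the compiler switch changes, which matches the way the short integer comparison in this paper was made.<br /><br /><br />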
<h1 class="sectionHeading">Source Code for IDCT</h1>
The source code for the IDCT can be downloaded <a href="http://software.intel.com/file/29048">here</a>.<br /><br /> A partial source code listing follows.<br /><br />
<pre name="code" class="cpp">void idctAVX(void)	{<br />__m128 r_xmm0, r_xmm2, r_xmm1;<br />__m256 r_ymm02, r_ymm46, r_ymm13, r_ymm57;<br />__m256 r_ymm0, r_ymm1, r_ymm2, r_ymm3, r_ymm4, r_ymm5, r_ymm6, r_ymm7;<br />__m256 row0, row1, row2, row3, row4, row5, row6, row7;<br />__m256 ymm_even, ymm_odd, ymm_sum, ymm_diff;<br />__m256 temp3, temp7;<br />__m256 tangent_1, tangent_2, tangent_3, cos_4;<br /><br />tangent_1 = AVX_tg_1_16;<br />tangent_2 = AVX_tg_2_16;<br />tangent_3 = AVX_tg_3p1_16;<br />cos_4 = AVX_cos_4p1_16;<br /><br />const float * pInput;<br />float * pOutput;<br />float * pFTab_i_04 = float_tab_i_04;<br />float * pFTab_i_26 = float_tab_i_26;<br />float * pFTab_i_17 = float_tab_i_17;<br />float * pFTab_i_35 = float_tab_i_35;<br />const int blockSize = 8*8;<br /><br />//Transform all the blocks N times<br />//The iteration count is a command line option<br />const int maxLoopCount = g_loopCount;<br />long startTime = getTimestamp();<br /><br />for(int loopCount = 0; loopCount &lt; maxLoopCount; loopCount++)	{<br />	<br />   //Operate on all the blocks<br />   for(int i = 0; i &lt; g_blockCount; i++)	{<br /><br />      //Get pointers for this input and output<br />      pInput = &amp;dctData[i*blockSize];				<br />      pOutput = &amp;kernelResults[i*blockSize];<br />      pFTab_i_04 = float_tab_i_04;<br />      pFTab_i_26 = float_tab_i_26;<br />      pFTab_i_17 = float_tab_i_17;<br />      pFTab_i_35 = float_tab_i_35;<br />//    IACA_START<br /><br />      //Rows 0 and 4<br />      //Process the first four floats of these two rows<br />      //Read input data from row 0, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(pInput));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[4]));<br /> <br />      //Insert data from row 4 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[4*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, 
_mm_load_ps(&amp;pInput[4*8+4]), 1);<br /><br />      //Broadcast float 0 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />   <br />      //Multiply by the coefficients<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_04));<br /><br />      //Broadcast float 2 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[8]));<br /><br />      //Add<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Broadcast float 1 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[16]));<br /><br />      //Broadcast float 3 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[24]));<br />   <br />      //Add<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Broadcast float 4 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br /><br />      //Multiply by the coefficients<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[32]));<br /><br />      //Broadcast float 6 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[40]));<br /><br />      //Add<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Broadcast float 5 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[48]));<br /><br />      //Broadcast float 7 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) 
&amp;pFTab_i_04[56]));<br /><br />      //Add to create this partial sum, <br />      //then create final sums and differences.<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /><br />      //Reverse the order of the differences, then build outputs 0 and 4<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row0 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row4 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 0 and 4 processing<br /><br />      //Rows 1 and 7<br />      //Process the first four floats of these two rows<br />      //Read input data from row 1, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[8+4]));<br /><br />      //Insert data from row 7 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[7*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[7*8+4]), 1);<br />      <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_17));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />       //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[24]));<br />      r_ymm13 = 
_mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create final sums and differences<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br />    <br />      //Reverse the order of the differences, then build outputs 1 and 7<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row1 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row7 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 1 and 7 processing<br /><br />      //Rows 2 and 6<br />      //Process the first four floats of these two rows<br />      //Read input data from row 2, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[2*8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[2*8+4]));<br /><br />      //Insert data from row 6 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[6*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[6*8+4]), 
1);   <br />   <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_26));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[24]));<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create the final sum and difference<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br />   <br />      //Reverse the order of the differences, then build outputs 2 and 6<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row2 = 
_mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row6 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 2 and 6 processing<br /><br />      //Rows 3 and 5<br />      //Process the first four floats of these two rows<br />      //Read input data from row 3, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[3*8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[3*8+4]));<br /><br />      //Insert data from row 5 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[5*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[5*8+4]), 1);<br />    <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_35));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[24]));<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br /> 
     r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create the final sum and difference<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /><br />      //Reverse the order of the differences, then build outputs 3 and 5<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row3 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row5 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 3 and 5 processing<br /><br />      //******************************<br />      //perform 1D IDCT on the columns<br />   <br />      //Multiply several rows by the appropriate tangent value<br />      //row5*(tangent3+1)<br />      r_ymm0 = _mm256_mul_ps(row5, tangent_3);<br />      //row3*(tangent3+1)<br />      r_ymm1 = _mm256_mul_ps(row3, tangent_3);<br />      //row7*tangent1<br />      r_ymm4 = _mm256_mul_ps(row7, tangent_1);<br />      //row1*tangent1<br />      r_ymm5 = _mm256_mul_ps(row1, tangent_1);<br /><br />      //Begin to create results<br />      //[row5*tangent3 + row5] + row3<br />      r_ymm0 = _mm256_add_ps(r_ymm0, row3);<br />      //row5 - [row3*tangent3 + row3]<br />      r_ymm2 = _mm256_sub_ps(row5, r_ymm1);<br />      //row6*tangent2<br />      r_ymm7 = _mm256_mul_ps(row6, tangent_2);<br />      //row2*tangent2<br />      r_ymm3 = _mm256_mul_ps(row2, tangent_2);<br />      //row1*tangent1 - row7<br />      r_ymm5 = _mm256_sub_ps(r_ymm5, row7);<br />      //row7*tangent1 + row1<br />      r_ymm4 = _mm256_add_ps(r_ymm4, row1);<br /><br />
      //Save intermediate row 7 results, used as an input later<br />      //[row7*tangent1 + row1] + [row5*tangent3 + row3]<br />      temp7 = _mm256_add_ps(r_ymm4, r_ymm0);<br /><br />      //Save intermediate row 3 results, used as an input later<br />      //[row1*tangent1 - row7] + [row5 - [row3*tangent3 + row3]]<br />      temp3 = _mm256_add_ps(r_ymm5, r_ymm2);<br /><br />      //[row7*tangent1 + row1] - [row5*tangent3 + row3]<br />      r_ymm4 = _mm256_sub_ps(r_ymm4, r_ymm0);<br />      //[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]<br />      r_ymm5 = _mm256_sub_ps(r_ymm5, r_ymm2);<br />      //{[row7*tangent1 + row1] - [row5*tangent3 + row3]} - <br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]}<br />      r_ymm1 = _mm256_sub_ps(r_ymm4, r_ymm5);<br />      //{[row7*tangent1 + row1] - [row5*tangent3 + row3]} + <br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]}<br />      r_ymm4 = _mm256_add_ps(r_ymm4, r_ymm5);<br />      //multiply by cos_4+1<br />      r_ymm4 = _mm256_mul_ps(r_ymm4, cos_4);<br />      //row6*tangent2 + row2<br />      r_ymm7 = _mm256_add_ps(r_ymm7, row2);<br />      //row2*tangent2 - row6<br />      r_ymm3 = _mm256_sub_ps(r_ymm3, row6);<br />      //multiply by cos_4+1<br />      r_ymm0 = _mm256_mul_ps(r_ymm1, cos_4);<br />      //row0 + row4<br />      r_ymm5 = _mm256_add_ps(row0, row4);<br />      //row0 - row4<br />      r_ymm6 = _mm256_sub_ps(row0, row4);<br />      //[row0 + row4] - [row6*tangent2 + row2]<br />      r_ymm2 = _mm256_sub_ps(r_ymm5, r_ymm7);<br />      //[row0 + row4] + [row6*tangent2 + row2]<br />      r_ymm5 = _mm256_add_ps(r_ymm5, r_ymm7);<br />      //[row0 - row4] - [row2*tangent2 - row6]<br />      r_ymm1 = _mm256_sub_ps(r_ymm6, r_ymm3);<br />      //[row0 - row4] + [row2*tangent2 - row6]<br />      r_ymm6 = _mm256_add_ps(r_ymm6, r_ymm3);<br />      //[[row7*tangent1 + row1] + [row5*tangent3 + row3]] + <br />      //[[row0 + row4] + [row6*tangent2 + row2]]<br />      r_ymm7 = _mm256_add_ps(temp7, r_ymm5);<br /><br />
      //Store row 0 results (store 1 of 8)<br />      _mm256_store_ps(pOutput, r_ymm7);<br />      //[[row0 - row4] + [row2*tangent2 - row6]] - <br />      //cos4*{[row7*tangent1 + row1] - [row5*tangent3 + row3]} +<br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]}<br />      r_ymm3 = _mm256_sub_ps(r_ymm6, r_ymm4);<br />      //[[row0 - row4] + [row2*tangent2 - row6]] + <br />      //cos4*{[row7*tangent1 + row1] - [row5*tangent3 + row3]} + <br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]}<br />      r_ymm6 = _mm256_add_ps(r_ymm6, r_ymm4);<br /><br />      //Store row 1 results (store 2 of 8)<br />      _mm256_store_ps(&amp;pOutput[1*8], r_ymm6);<br /><br />      r_ymm7 = _mm256_sub_ps(r_ymm1, r_ymm0);<br />      r_ymm1 = _mm256_add_ps(r_ymm1, r_ymm0);<br /><br />      r_ymm6 = _mm256_add_ps(r_ymm2, temp3);<br />      r_ymm2 = _mm256_sub_ps(r_ymm2, temp3);<br /><br />      r_ymm5 = _mm256_sub_ps(r_ymm5, temp7);<br /><br />      //Store final results<br />      _mm256_store_ps(&amp;pOutput[2*8], r_ymm1);<br />      _mm256_store_ps(&amp;pOutput[3*8], r_ymm6);<br />      _mm256_store_ps(&amp;pOutput[4*8], r_ymm2);<br />      _mm256_store_ps(&amp;pOutput[5*8], r_ymm7);<br />      _mm256_store_ps(&amp;pOutput[6*8], r_ymm3);<br />      _mm256_store_ps(&amp;pOutput[7*8], r_ymm5);<br /><br />//      IACA_END<br />      }<br />   }<br />   long duration = getTimestamp() - startTime;<br />   cout &lt;&lt; "AVX Timestamp = " &lt;&lt; duration &lt;&lt; endl;<br />}<br /><br /><br /></pre>
<h1 class="sectionHeading">References</h1>
The following documents are referenced in this application note and provide background or supporting information for the topics presented in this document.<br /><br /> 1. Pennebaker and Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993, pp. 29-64.<br /> 2. A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX™ Instructions, Intel Application Note, AP-922, Copyright 1999.<br /> 3. Using Streaming SIMD Extensions 2 (SSE2) to Implement an Inverse Discrete Cosine Transform, Intel Application Note, AP-945, Copyright 2000.<br /> 4. Rao and Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Inc., Boston, 1990, Appendix A.2.<br /> 5. IEEE Standard Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform, IEEE Std 1180-1990.<br /><br /><br />
<h1 class="sectionHeading">About the Author</h1>
Richard Hubbard is a Senior Software Engineer and member of the SSG Apple enabling team, working on optimizing Mac OS X* applications for power and performance. Richard holds a Master's degree in Electrical Engineering from Stevens Institute of Technology and a Bachelor's degree in Computer Engineering from New Jersey Institute of Technology. ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/</link>
      <pubDate>Mon, 07 Jun 2010 21:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/</guid>
      <category>Intel® AVX</category>
    </item>
    <item>
      <title>Wiener Filtering Using Intel® Advanced Vector Extensions</title>
      <description><![CDATA[ <p><b>1 Introduction<br /></b>Intel® Advanced Vector Extensions (Intel® AVX) is a 256 bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE) and is designed for applications that are floating point intensive. These instructions provide a means to accelerate applications that rely heavily on floating-point operations, such as 3D geometry, video processing, image processing, and spatial (3D) audio. This application note discusses Wiener filtering<i>, </i>and includes an example of code that has been optimized using Intel® AVX. The source reference for the code and application note for this article has been AP-807 Wiener Filtering Using Streaming SIMD Extensions [4]. The original article included optimizations using streaming SIMD extensions and this whitepaper discusses migration of the code to 256 bit instruction set extension (i.e. Intel® AVX).</p>
<p><b><br />2 The Wiener Filter Algorithm<br /></b>Wiener filtering (also known as Least Mean Square filtering) is a technique for removing unwanted noise from an image. The description of this algorithm is taken from <i>The Pocket Handbook of Image Processing Algorithms in C</i>, by Harley R. Myler and Arthur R. Weeks [1]. The algorithm has four (Fourier-transformed) vector inputs, representing (one component of) the original image (Image), the degraded image (Guv), the noise image spectra (Noise), and the degradation function (Huv). Each input is a vector of rows*cols complex numbers. Each complex number is represented as two contiguous floats holding its real and imaginary parts. An additional parameter, gamma, is included in the computation. When gamma is 1.0, the filter is known as non-parametric. The parameters to the filter can be adjusted until the filtered image is satisfactory.</p>
<p><b><br />2.1 Applications for Wiener Filters<br /></b>Wiener filters are commonly used in image processing applications to remove noise from reconstructed images, and Wiener filtering is often used to restore a blurry image. Beyond image restoration, the Wiener filter has proved important in adaptive filtering, has been used for wavelet transforms, and has found application in communications and other DSP-related disciplines. The reader should also be aware that Fourier transformation is a key element in any signal processing discipline. Please refer to the Intel application note Split-Radix FFT (AP-808) for further information on implementing a Fourier transform.</p>
<p><b><br />2.2 Implementing the Wiener Filter<br /></b>As described in Section 2, the input to the function is four arrays of complex numbers. For each element of the image, the following operations are carried out using complex arithmetic. The complex variables D and Hs are intermediate variables used during the computation. The function Complex_conj takes the complex conjugate of a complex number. When a divide occurs, a check (using an if statement) must be done to ensure that the denominator is non-zero. If a denominator is zero, the result should be set to zero.<br />1. Complex Noise = gamma * (Noise * Complex_conj ( Noise ) )<br />2. Complex D = Image * Complex_conj ( Image )<br />3. Complex D = Noise / D<br />4. Complex Hs = Huv * Complex_conj ( Huv )<br />5. Complex Num = Complex_conj ( Huv ) * Guv<br />6. Complex Image = Num / (Hs + D)</p>
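<p>The six steps above can be sketched for a single element using the C++ standard complex type. This is a minimal illustration rather than the code from AP-807: the names cfloat, safe_div, and wiener_element are hypothetical, and the zero-denominator rule is implemented as the text describes it (the result is set to zero).</p>

```cpp
#include <complex>

// Hypothetical alias and helpers, introduced only for this sketch.
using cfloat = std::complex<float>;

// Division that yields zero when the denominator is zero, as Section 2.2 requires.
inline cfloat safe_div(cfloat num, cfloat den) {
    return (den == cfloat(0.0f, 0.0f)) ? cfloat(0.0f, 0.0f) : num / den;
}

// Applies steps 1-6 to one frequency-domain element and returns the new Image value.
inline cfloat wiener_element(cfloat image, cfloat huv, cfloat noise,
                             cfloat guv, float gamma) {
    cfloat n   = gamma * (noise * std::conj(noise)); // step 1: scaled noise power
    cfloat d   = image * std::conj(image);           // step 2: image power
    d          = safe_div(n, d);                     // step 3: Noise / D
    cfloat hs  = huv * std::conj(huv);               // step 4: degradation power
    cfloat num = std::conj(huv) * guv;               // step 5: numerator
    return safe_div(num, hs + d);                    // step 6: filtered element
}
```

<p>A full filter would call such a routine once for each of the rows*cols elements. Written this way, every product of a value with its own conjugate still carries an imaginary part that is always zero, which is the redundancy the optimized versions remove.</p>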
<p><b><br />3 Vectorizing the code with 128-bit SIMD<br /></b>The reference source code for the article (both the C version and the 128-bit SIMD version) is available in Intel application note AP-807. The following describes the original port of the scalar code to 128-bit SIMD. First, the code can be optimized simply by observing that many of the operations involve multiplying a number by its complex conjugate. Since the result has no imaginary component, many of the operations specified in Section 2.2 can be simplified. The resultant C code is given in Section 7. Before the code can be optimized for Intel® SSE, it must be converted to a form suitable for SIMD execution. Four iterations of the original C code are gathered together and processed in a single iteration of the new loop, so each pass through the new loop does the work of four of the original iterations.</p>
<p>The list of operations in Section 2.2 shows that three divides must be done per iteration: one divide in Step 3 (the imaginary component requires no divide since the imaginary part is zero), and two divides in Step 6. One of the latter two divides can be removed by noting that they both have the same denominator. The if statement required to check for a zero denominator can be removed by using a masking technique. Further improvements can be obtained by replacing all divides with reciprocal approximations. These techniques are described below.</p>
<p>After the code has been converted to a SIMD format, the checks for zero denominators (if statements) can be removed by creating a mask for the non-zero denominator elements and ANDing that mask with the result of the division, which zeroes out the elements where a division by zero occurred. This technique assumes that SIMD floating-point exceptions are masked in the MXCSR register, so that QNaNs are generated rather than SNaNs and no floating-point divide-by-zero exception occurs. For example, assume that you want to compute the quantity ( N / D ), where N and D are floats. A typical code sequence is given below.</p>
<p >if ( D != 0 )<br />Result = N / D;<br />else<br />Result = 0.0;</p>
<p>The result can be computed as (and ( div ( N, D ), cmp_neq ( D, 0 ) ) ), a computation that does not require an if statement. Using intrinsics, this would be expressed as</p>
<p >_mm_and_ps ( _mm_div_ps ( N, D ), _mm_cmpneq_ps ( D, zero ) ).</p>
<p>This technique is specific to vectorized code, and is implemented in the attached intrinsics and assembly language versions of the code.</p>
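<p>As a minimal, self-contained sketch of the masking technique, the function below divides four floats at once and returns zero wherever the denominator is zero. It assumes an x86 compiler that provides xmmintrin.h and the default MXCSR state (exceptions masked). The name masked_div4 is hypothetical, and unaligned loads and stores are used only to keep the example simple.</p>

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper for illustration: result[k] = n[k] / d[k] for k = 0..3,
// with result[k] = 0.0f in every lane where d[k] == 0.0f. No branch is used.
// Assumes SIMD floating-point exceptions are masked (the MXCSR default).
inline void masked_div4(const float* n, const float* d, float* result) {
    const __m128 zero = _mm_setzero_ps();
    const __m128 vn   = _mm_loadu_ps(n);
    const __m128 vd   = _mm_loadu_ps(d);

    // All-ones bit pattern in lanes where d != 0, all-zeros where d == 0.
    const __m128 mask = _mm_cmpneq_ps(vd, zero);

    // The quotient is Inf or NaN in the zero-denominator lanes; ANDing with
    // the mask clears those lanes to +0.0f and leaves the others unchanged.
    const __m128 q = _mm_and_ps(_mm_div_ps(vn, vd), mask);

    _mm_storeu_ps(result, q);
}
```

<p>The division still executes in every lane. The mask only discards the unwanted results afterwards, which is why the exceptions must be masked for the approach to be safe.</p>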
<p>The Newton-Raphson method is a classic technique for approximating functions. The initial “guess” at the reciprocal is computed using the rcpps instruction. Subsequently, the “guess” is improved using the Newton-Raphson method. The result is not as accurate as what is provided by the divide instruction; however, it can be obtained significantly faster. (Programmers must determine if their application allows a reduced precision answer.) Full details and discussion of this technique are available in the Newton-Raphson application note [2]. The specific code sequence employed in this Wiener filter is given below. (The denominator should be examined to be certain a division by zero is not occurring.)</p>
<p >RC = _mm_rcp_ps( D );<br />RECIP = _mm_sub_ps( _mm_add_ps( RC, RC ), _mm_mul_ps( RC, _mm_mul_ps( RC, D ) ) );</p>
<p><b><br />4 Vectorizing the code with 256-bit SIMD<br /></b>The 128-bit vectorized code can be ported to 256-bit SIMD largely by replacing each 128-bit operation with its 256-bit counterpart. The 256-bit code does 8 iterations of the original loop in one pass (the 128-bit Intel® SSE code does 4). The load/store instructions, as well as the rcp/mul/add/sub instructions, now operate on 8 data points. The branchless technique described above is retained in the Intel® AVX code as well. As can be seen from the code, using the corresponding Intel® AVX intrinsics (for example, _mm256_mul_ps for 256-bit vs. _mm_mul_ps for 128-bit) completes the port. Section 7 provides the complete 256-bit Intel® AVX code.</p>
<p><b><br />5 Grouping input/output arrays<br /></b>Additionally, the performance of the 256-bit SIMD code can be improved by grouping the input and output arrays in sequential order. Sequential memory accesses through the cache and memory decrease potential cache way conflicts. They also give the CPU hardware prefetcher a simpler access pattern, which leads to more accurate prefetching.</p>
<p><b><br />6 Conclusion</b><br />The performance results (in terms of CPU clock cycles) of the 128-bit SIMD code vs. the 256-bit SIMD code over a large number of iterations are shown below.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="104">
<p> </p>
</td>
<td valign="top" width="101">
<p>Intel® AVX (256-bit)</p>
</td>
<td valign="top" width="101">
<p>Intel® SSE (128-bit)</p>
</td>
<td valign="top" width="95">
<p>Intel® AVX vs. Intel® SSE</p>
</td>
</tr>
<tr>
<td valign="top" width="104">
<p>Wiener filter</p>
</td>
<td valign="top" width="101">
<p>45871</p>
</td>
<td valign="top" width="101">
<p>66933</p>
</td>
<td valign="top" width="95">
<p>1.46x</p>
</td>
</tr>
<tr>
<td valign="top" width="104">
<p>Wiener filter with grouped arrays</p>
</td>
<td valign="top" width="101">
<p>42464</p>
</td>
<td valign="top" width="101">
<p>64473</p>
</td>
<td valign="top" width="95">
<p>1.51x</p>
</td>
</tr>
</tbody>
</table>
<p><br />The overall improvement of Intel® AVX over Intel® SSE with Wiener filtering is 1.46x, so Intel® AVX can provide a significant performance improvement for the Wiener filter algorithm compared to coding with 128-bit SIMD. With the input/output arrays grouped, the overall improvement of Intel® AVX over Intel® SSE is 1.51x. The improvements cited in this document are the result of several techniques: using Intel® AVX (vectorizing the code), and removing conditional branch instructions (if statements) by using masking operations provided by both Intel® SSE and Intel® AVX. If it is acceptable to reduce the numerical precision of the result, then a further gain can be realized by replacing divide operations with reciprocal approximations (employing a Newton-Raphson technique). The article also highlights the ease with which existing floating-point code can be ported to Intel® AVX.</p>
<p><b><br />7 Coding Example</b></p>
<pre name="code" class="cpp">/*<br /> * Wiener Filter (also known as the Least Mean Square filter)<br /> *<br /> * Reference: The Pocket Handbook of Image Processing Algorithms in C<br /> * by Harley R Myler &amp; Arthur R. Weeks<br /> * 1993 Prentice-Hall, ISBN 0-13-642240-3 p260-3.<br /> *<br /> * The data is several arrays of complex floats in row major order.<br /> * The description for the algorithm from p260 states:<br /> *<br /> * The algorithm computes a parametric Wiener filter on the<br /> * Fourier transform of a degraded image, Guv, with noise<br /> * spectra N, degradation function Huv, and original image Img.<br /> * The computation is in place, so that the filtered version of<br /> * the input is returned in the original image variable. The<br /> * original and noise images are either estimations from some<br /> * predictive function or ad hoc approximations. If the noise<br /> * image is zero, the process reduces to the inverse filter.<br /> *<br /> * The Wiener parameter gamma is passed to the algorithm.<br /> * If this parameter is 1.0, the filter is non-parametric.<br /> * Methods exist in the literature to derive the parameter value;<br /> * however, it is sometimes determined from trial and error.<br /> *<br /> * NOTE: The code on page 263 has an error. In cxml, the complex<br /> * multiply routine, the imaginary part of the computation should be<br /> * a*d + b*c, not a*d - b*c.<br /> *<br /> * NOTE: (another error) The *complex* array length is rows*cols, so the<br /> * *float* array length should be 2*rows*cols. Also, note that the<br /> * algorithm operates on one component of the pixel.<br /> */<br />void wiener_filter ( float *Img,<br />                     float *Huv,<br />                     float *No,<br />                     float *Guv,<br />                     float gamma,<br />                     int rows,<br />                     int cols)<br />{<br />    int i, sz;<br />    float numr, numi, dr, hsr;<br />    sz = 2 * rows * cols;<br />    for (i = 0; i &lt; sz; i += 2)<br />    {<br />        /* Compute (in place) the noise spectral density with Wiener gamma */<br />        No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );<br />        No[i+1] = (float) 0.0;<br />        /* Compute image spectral density */<br />        dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );<br />        /* Compute denominator spectral density term */<br />        if (dr != 0.0)<br />            dr = (float) (No[i] / dr) ;<br />        /* Compute degradation power spectrum */<br />        hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );<br />        /* Compute numerator term */<br />        numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );<br />        numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );<br />        /* Final computation */<br />        if ( (hsr + dr) != 0.0 )<br />        {<br />            Img[i] = (float) (numr / (hsr + dr));<br />            Img[i+1] = (float) (numi / (hsr + dr));<br />        }<br />        else<br />        {<br />            Img[i] = (float) 0.0;<br />            Img[i+1] = (float) 0.0;<br />        }<br />    }<br />} /* wiener_filter */<br /><br /></pre>
<p><b><br /><br />7.2 128-bit Intrinsics Code</b></p>
<pre name="code" class="cpp">//#define MM_FUNCTIONALITY<br />#include &lt;stddef.h&gt;<br />#include &lt;xmmintrin.h&gt;<br />#include &lt;assert.h&gt;<br />void intrin_wiener_rcp_sse( float *Img,<br />                            float *Huv,<br />                            float *No,<br />                            float *Guv,<br />                            float gamma,<br />                            int rows,<br />                            int cols )<br />{<br />    int i, sz;<br />    __m128 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;<br />    __m128 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;<br />    __m128 rc, denom;<br />    __m128 zero = _mm_set_ps1 (0.0);<br />    sz = 2 * rows * cols;<br />    /* Each pass of the loop consumes 8 floats from every array */<br />    assert( (sz &gt; 7) &amp;&amp; !(sz &amp; 7) );<br />    assert( !( ((size_t)Img) &amp; 15 ) ); /* Assume alignment */<br />    assert( !( ((size_t)Huv) &amp; 15 ) );<br />    assert( !( ((size_t)No) &amp; 15 ) );<br />    assert( !( ((size_t)Guv) &amp; 15 ) );<br />    for (i = 0; i &lt; sz; i += 8)<br />    {<br />        /*<br />         * Compute (in place) the noise spectral density with Wiener gamma<br />         *<br />         * complex Noise = gamma * (Noise * complex conj Noise)<br />         *<br />         * No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );<br />         * No[i+1] = (float) 0.0;<br />         */<br />        first2 = _mm_load_ps ( &amp;No[i] );<br />        next2 = _mm_load_ps ( &amp;No[i+4] );<br />        nor4 = _mm_shuffle_ps( first2, next2, 0x88 );<br />        noi4 = _mm_shuffle_ps( first2, next2, 0xdd );<br />        nr4 = _mm_mul_ps ( _mm_set_ps1( gamma ) ,<br />                           _mm_add_ps ( _mm_mul_ps( nor4 , nor4 ),<br />                                        _mm_mul_ps( noi4 , noi4 ) ) );<br />        _mm_store_ps( &amp;No[i ], _mm_unpacklo_ps ( nr4, zero ) );<br />        _mm_store_ps( &amp;No[i+4], _mm_unpackhi_ps ( nr4, zero ) );<br />        /*<br />         * Compute image spectral density<br />         *<br />         * Complex D = Image * complex conj Image<br />         *<br />         * dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );<br />         */<br />        first2 = _mm_load_ps ( &amp;Img[i] );<br />        next2 = _mm_load_ps ( &amp;Img[i+4] );<br />        inr4 = _mm_shuffle_ps( first2, next2, 0x88 );<br />        ini4 = _mm_shuffle_ps( first2, next2, 0xdd );<br />        dr4 = _mm_add_ps ( _mm_mul_ps( inr4 , inr4),<br />                           _mm_mul_ps( ini4 , ini4) );<br />        /*<br />         * Compute denominator spectral density term<br />         *<br />         * Complex D = noise / D<br />         *<br />         * if (dr != 0.0)<br />         * dr = (float) (No[i] / dr) ;<br />         *<br />         * Divide by multiplying with a refined reciprocal approximation<br />         */<br />        rc = _mm_rcp_ps(dr4);<br />        rc = _mm_sub_ps( _mm_add_ps( rc, rc),<br />                         _mm_mul_ps( rc, _mm_mul_ps( rc, dr4) ) );<br />        dr4 = _mm_and_ps ( _mm_mul_ps ( nr4 , rc ),<br />                           _mm_cmpneq_ps( dr4, zero ) );<br />        /*<br />         * Compute degradation power spectrum<br />         *<br />         * Complex Hs = Huv * complex conj Huv<br />         *<br />         * hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );<br />         */<br />        first2 = _mm_load_ps ( &amp;Huv[i] );<br />        next2 = _mm_load_ps ( &amp;Huv[i+4] );<br />        hr4 = _mm_shuffle_ps( first2, next2, 0x88 );<br />        hi4 = _mm_shuffle_ps( first2, next2, 0xdd );<br />        hsr4 = _mm_add_ps ( _mm_mul_ps (hr4 , hr4 ),<br />                            _mm_mul_ps (hi4 , hi4 ) );<br />        /*<br />         * Compute numerator term<br />         *<br />         * Complex Num = complex conj Huv * Guv<br />         *<br />         * numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );<br />         * numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );<br />         */<br />        first2 = _mm_load_ps ( &amp;Guv[i] );<br />        next2 = _mm_load_ps ( &amp;Guv[i+4] );<br />        gr4 = _mm_shuffle_ps( first2, next2, 0x88 );<br />        gi4 = _mm_shuffle_ps( first2, next2, 0xdd );<br />        numr4 = _mm_add_ps ( _mm_mul_ps (hr4 , gr4),<br />                             _mm_mul_ps (hi4 , gi4) );<br />        numi4 = _mm_sub_ps ( _mm_mul_ps (hr4 , gi4),<br />                             _mm_mul_ps (hi4 , gr4) );<br />        /*<br />         * Final computation<br />         *<br />         * Complex Image = Num / (Hs + D)<br />         *<br />         * if ( (hsr + dr) != 0.0 )<br />         * {<br />         * Img[i] = (float) (numr / (hsr + dr));<br />         * Img[i+1] = (float) (numi / (hsr + dr));<br />         * }<br />         * else<br />         * {<br />         * Img[i] = (float) 0.0;<br />         * Img[i+1] = (float) 0.0;<br />         * }<br />         *<br />         * Divide by multiplying with a refined reciprocal approximation<br />         */<br />        denom = _mm_add_ps( hsr4, dr4 );<br />        rc = _mm_rcp_ps(denom);<br />        rc = _mm_sub_ps( _mm_add_ps( rc, rc),<br />                         _mm_mul_ps( rc, _mm_mul_ps( rc, denom) ) );<br />        inr4 = _mm_and_ps( _mm_mul_ps ( numr4 , rc ) ,<br />                           _mm_cmpneq_ps( denom, zero ) );<br />        ini4 = _mm_and_ps( _mm_mul_ps ( numi4 , rc ) ,<br />                           _mm_cmpneq_ps( denom, zero ) );<br />        _mm_store_ps( &amp;Img[i ], _mm_unpacklo_ps ( inr4, ini4 ) );<br />        _mm_store_ps( &amp;Img[i+4], _mm_unpackhi_ps ( inr4, ini4 ) );<br />    }<br />} /* intrin_wiener_rcp_sse */<br /><br /><br />    <iostream
.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream
.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h><iostream.h></iostream.h></pre>
<p><b><br />7.3 256-bit Intrinsics Code</b></p>
<pre name="code" class="cpp">void intrin_wiener_rcp_avx( float *Img,<br />					   float *Huv,<br />					   float *No,<br />					   float *Guv,<br />					   float gamma,<br />					   int rows,<br />					   int cols )<br />{<br />	int i, sz;<br />	__m256 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;<br />	__m256 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;<br />	__m256 rc, denom;<br />	__m256 zero = _mm256_setzero_ps();<br />	sz = 2 * rows * cols;<br />	/* Each iteration consumes 16 floats (8 complex values) */<br />	assert( (sz &gt; 15) &amp;&amp; !(sz &amp; 15) );<br />	/* _mm256_load_ps and _mm256_store_ps require 32-byte alignment */<br />	assert( !( ((size_t)Img) &amp; 31 ) );<br />	assert( !( ((size_t)Huv) &amp; 31 ) );<br />	assert( !( ((size_t)No) &amp; 31 ) );<br />	assert( !( ((size_t)Guv) &amp; 31 ) );<br />	for (i = 0; i &lt; sz; i += 16)<br />	{<br />		/*<br />		* Compute (in place) the noise spectral density with Wiener gamma<br />		*<br />		* complex Noise = gamma * (Noise * complex conj Noise)<br />		*<br />		* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );<br />		* No[i+1] = (float) 0.0;<br />		*/<br />		first2 = _mm256_load_ps ( &amp;No[i] );<br />		next2 = _mm256_load_ps ( &amp;No[i+4*2] );<br />		/* Split interleaved data into real (0x88) and imaginary (0xdd) parts */<br />		nor4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		noi4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		nr4 = _mm256_mul_ps ( _mm256_set1_ps( gamma ) ,<br />		_mm256_add_ps ( _mm256_mul_ps( nor4 , nor4 ),<br />		_mm256_mul_ps( noi4 , noi4 ) ) );<br />		_mm256_store_ps( &amp;No[i ], _mm256_unpacklo_ps ( nr4, zero ) );<br />		_mm256_store_ps( &amp;No[i+4*2], _mm256_unpackhi_ps ( nr4, zero ) );<br />		<br />		/*<br />		* Compute image spectral density<br />		*<br />		* Complex D = Image * complex conj Image<br />		*<br />		* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );<br />		*/<br />		first2 = _mm256_load_ps ( &amp;Img[i] );<br />		next2 = _mm256_load_ps ( &amp;Img[i+4*2] );<br />		inr4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		ini4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		dr4 = _mm256_add_ps ( _mm256_mul_ps( inr4 , inr4),<br />		_mm256_mul_ps( ini4 , ini4) );<br 
/>		/*<br />		* Compute denominator spectral density term<br />		*<br />		* Complex D = noise / D<br />		*<br />		* if (dr != 0.0)<br />		* dr = (float) (No[i] / dr) ;<br />		*<br />		* Replace the division with a reciprocal approximation refined<br />		* by one Newton-Raphson step; zero denominators are masked to 0.<br />		*/<br />		rc = _mm256_rcp_ps(dr4);<br />		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),<br />		_mm256_mul_ps( rc, _mm256_mul_ps( rc, dr4) ) );<br />		dr4 = _mm256_and_ps ( _mm256_mul_ps ( nr4 , rc ),<br />		_mm256_cmp_ps( dr4, zero, _CMP_NEQ_UQ ) );<br />		/*<br />		* Compute degradation power spectrum<br />		*<br />		* Complex Hs = Huv * complex conj Huv<br />		*<br />		* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );<br />		*/<br />		first2 = _mm256_load_ps ( &amp;Huv[i] );<br />		next2 = _mm256_load_ps ( &amp;Huv[i+4*2] );<br />		hr4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		hi4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		hsr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , hr4 ),<br />		_mm256_mul_ps (hi4 , hi4 ) );<br />		/*<br />		* Compute numerator term<br />		*<br />		* Complex Num = complex conj Huv * Guv<br />		*<br />		* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );<br />		* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );<br />		*/<br />		first2 = _mm256_load_ps ( &amp;Guv[i] );<br />		next2 = _mm256_load_ps ( &amp;Guv[i+4*2] );<br />		gr4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		gi4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		numr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , gr4),<br />		_mm256_mul_ps (hi4 , gi4) );<br />		numi4 = _mm256_sub_ps ( _mm256_mul_ps (hr4 , gi4),<br />		_mm256_mul_ps (hi4 , gr4) );<br />		/*<br />		* Final computation<br />		*<br />		* Complex Image = Num / (Hs + D)<br />		*<br />		* if ( (hsr + dr) != 0.0 )<br />		* {<br />		* Img[i] = (float) (numr / (hsr + dr));<br />		* Img[i+1] = (float) (numi / (hsr + dr));<br />		* }<br />		* else<br />		* {<br />		* Img[i] = (float) 0.0;<br 
/>		* Img[i+1] = (float) 0.0;<br />		* }<br />		*<br />		* Again use the refined reciprocal instead of a division<br />		*/<br />		denom = _mm256_add_ps( hsr4, dr4 );<br />		rc = _mm256_rcp_ps(denom);<br />		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),<br />		_mm256_mul_ps( rc, _mm256_mul_ps( rc, denom) ) );<br />		inr4 = _mm256_and_ps( _mm256_mul_ps ( numr4 , rc ) ,<br />		_mm256_cmp_ps( denom, zero, _CMP_NEQ_UQ ) );<br />		ini4 = _mm256_and_ps( _mm256_mul_ps ( numi4 , rc ) ,<br />		_mm256_cmp_ps( denom, zero, _CMP_NEQ_UQ ) );<br />		/* Re-interleave real and imaginary parts and write back */<br />		_mm256_store_ps( &amp;Img[i ], _mm256_unpacklo_ps ( inr4, ini4 ) );<br />		_mm256_store_ps( &amp;Img[i+4*2], _mm256_unpackhi_ps ( inr4, ini4 ) );<br /><br />	}<br />} /* intrin_wiener_rcp_avx */<br /><br /><br />    </pre>
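<p>The listing above relies on <code>_mm256_shuffle_ps</code>, <code>_mm256_unpacklo_ps</code> and <code>_mm256_unpackhi_ps</code> operating independently on each 128-bit lane. The following is a minimal scalar sketch of that behavior, not the intrinsics themselves: the <code>model_*</code> functions and the <code>Vec8</code> type are hypothetical names introduced here for illustration. It is meant to show why splitting with 0x88/0xdd and then unpacking restores the original interleaved order, even though the intermediate real and imaginary vectors are not in sequential element order.</p>

```cpp
#include <array>

// Eight floats standing in for one __m256 register (illustrative type).
using Vec8 = std::array<float, 8>;

// Scalar model of _mm256_shuffle_ps: every 128-bit lane (4 floats) is
// shuffled on its own; the two low results come from a, the two high from b.
Vec8 model_shuffle_ps(const Vec8& a, const Vec8& b, unsigned imm) {
    Vec8 r{};
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + ((imm >> 0) & 3)];
        r[lane + 1] = a[lane + ((imm >> 2) & 3)];
        r[lane + 2] = b[lane + ((imm >> 4) & 3)];
        r[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
    return r;
}

// Scalar model of _mm256_unpacklo_ps: interleave the low halves of each lane.
Vec8 model_unpacklo_ps(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + 0];
        r[lane + 1] = b[lane + 0];
        r[lane + 2] = a[lane + 1];
        r[lane + 3] = b[lane + 1];
    }
    return r;
}

// Scalar model of _mm256_unpackhi_ps: interleave the high halves of each lane.
Vec8 model_unpackhi_ps(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + 2];
        r[lane + 1] = b[lane + 2];
        r[lane + 2] = a[lane + 3];
        r[lane + 3] = b[lane + 3];
    }
    return r;
}
```

<p>In this model, the real-part vector produced from sixteen interleaved floats holds complex elements 0, 1, 4, 5 in its low lane and 2, 3, 6, 7 in its high lane. Because every array in the filter is permuted the same way and the arithmetic is element-wise, the permutation is harmless, and the unpack step undoes it before the store.</p>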
<p><b><br />7.4 256-bit Intrinsics Code with grouped arrays</b></p>
<p>blockHNG Structure</p>
<table border="1" cellspacing="0" cellpadding="0" width="92%">
<tbody>
<tr>
<td valign="top" width="8%">Huv[0]</td>
<td valign="top" width="4%">…</td>
<td valign="top" width="8%">Huv[15]</td>
<td valign="top" width="7%">No[0]</td>
<td valign="top" width="4%">…</td>
<td valign="top" width="8%">No[15]</td>
<td valign="top" width="8%">Guv[0]</td>
<td valign="top" width="4%">…</td>
<td valign="top" width="9%">Guv[15]</td>
<td valign="top" width="8%">Huv[16]</td>
<td valign="top" width="4%">…</td>
<td valign="top" width="8%">Huv[31]</td>
<td valign="top" width="8%">No[16]</td>
<td valign="top" width="8%">…</td>
</tr>
</tbody>
</table>
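<p>The layout in the table above can be produced by copying the three source arrays into one buffer in chunks of 16 floats. The sketch below is one possible way to do this, under stated assumptions: <code>pack_block_hng</code> is a hypothetical helper that does not appear in the article, the element count is assumed to be a multiple of 16, and <code>std::vector</code> is used only for brevity.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: groups Huv, No and Guv into the blockHNG layout
//   Huv[0..15] No[0..15] Guv[0..15] Huv[16..31] No[16..31] Guv[16..31] ...
// Assumes sz (number of floats per source array) is a multiple of 16.
std::vector<float> pack_block_hng(const float* Huv,
                                  const float* No,
                                  const float* Guv,
                                  std::size_t sz) {
    const std::size_t chunk = 16;            // floats per array per block
    std::vector<float> block(3 * sz);
    std::size_t out = 0;                      // write position in block
    for (std::size_t i = 0; i < sz; i += chunk, out += 3 * chunk) {
        for (std::size_t k = 0; k < chunk; ++k) {
            block[out + k]             = Huv[i + k];
            block[out + chunk + k]     = No[i + k];
            block[out + 2 * chunk + k] = Guv[i + k];
        }
    }
    return block;
}
```

<p>Note that <code>std::vector</code> does not guarantee the 32-byte alignment that <code>_mm256_load_ps</code> requires, so a buffer passed to the AVX routine would need an aligned allocation instead. The point of the grouping is that the three operands for one loop iteration sit next to each other in memory.</p>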
<pre name="code" class="cpp">void intrin_wiener_rcp_avx ( float *Img,<br />					float *_blockHNG,<br />					float gamma,<br />					int rows,<br />					int cols)<br />{<br />	int sz;<br />	__m256 first2, next2, nor4, noi4, nr4, inr4, ini4, dr4;<br />	__m256 hr4, hi4, hsr4, gr4, gi4, numr4, numi4;<br />	__m256 rc, denom;<br />	__m256 zero = _mm256_setzero_ps();<br />	sz = 2 * rows * cols;<br /><br />	// Each iteration consumes 16 floats (8 complex values) per array<br />	assert( (sz &gt; 15) &amp;&amp; !(sz &amp; 15) );<br />	// _mm256_load_ps and _mm256_store_ps require 32-byte alignment<br />	assert( !( ((size_t)Img) &amp; 31 ) );<br />	assert( !( ((size_t)_blockHNG) &amp; 31 ) );<br /><br />	float *Huv;<br />	float *No;<br />	float *Guv;<br />	<br />	int j = 0;	// img index<br />	for (int _blockHNG_tracker = 0; _blockHNG_tracker &lt; sz * 3; _blockHNG_tracker += 48)<br />	{<br />		// Each 48-float block holds 16 floats of Huv, then No, then Guv<br />		Huv = &amp;(_blockHNG[_blockHNG_tracker]);<br />		No = &amp;(_blockHNG[_blockHNG_tracker + 16]);<br />		Guv = &amp;(_blockHNG[_blockHNG_tracker + 32]);<br /><br />		/*<br />		* Compute (in place) the noise spectral density with Wiener gamma<br />		*<br />		* complex Noise = gamma * (Noise * complex conj Noise)<br />		*<br />		* No[i] = (float) ( gamma * ( No[i]*No[i] + No[i+1]*No[i+1] ) );<br />		* No[i+1] = (float) 0.0;<br />		*/<br />		first2 = _mm256_load_ps ( &amp;No[0] );<br />		next2 = _mm256_load_ps ( &amp;No[8] );<br />		nor4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		noi4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		nr4 = _mm256_mul_ps ( _mm256_set1_ps( gamma ) ,<br />		_mm256_add_ps ( _mm256_mul_ps( nor4 , nor4 ),<br />		_mm256_mul_ps( noi4 , noi4 ) ) );<br /><br />		_mm256_store_ps( &amp;No[0], _mm256_unpacklo_ps ( nr4, zero ) );<br />		_mm256_store_ps( &amp;No[8], _mm256_unpackhi_ps ( nr4, zero ) );<br /><br />		/*<br />		* Compute image spectral density<br />		*<br />		* Complex D = Image * complex conj Image<br />		*<br />		* dr = (float) ( Img[i]*Img[i] + Img[i+1]*Img[i+1] );<br 
/>		*/<br />		first2 = _mm256_load_ps ( &amp;Img[j] );<br />		next2 = _mm256_load_ps ( &amp;Img[j+8] );<br />		inr4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		ini4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		dr4 = _mm256_add_ps ( _mm256_mul_ps( inr4 , inr4),<br />		_mm256_mul_ps( ini4 , ini4) );<br />		/*<br />		* Compute denominator spectral density term<br />		*<br />		* Complex D = noise / D<br />		*<br />		* if (dr != 0.0)<br />		* dr = (float) (No[i] / dr) ;<br />		*<br />		* Replace the division with a reciprocal approximation refined<br />		* by one Newton-Raphson step; zero denominators are masked to 0.<br />		*/<br />		rc = _mm256_rcp_ps(dr4);<br />		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),<br />		_mm256_mul_ps( rc, _mm256_mul_ps( rc, dr4) ) );<br />		dr4 = _mm256_and_ps ( _mm256_mul_ps ( nr4 , rc ),<br />		_mm256_cmp_ps( dr4, zero, _CMP_NEQ_UQ ) );<br />		/*<br />		* Compute degradation power spectrum<br />		*<br />		* Complex Hs = Huv * complex conj Huv<br />		*<br />		* hsr = (float) ( Huv[i]*Huv[i] + Huv[i+1]*Huv[i+1] );<br />		*/<br />		first2 = _mm256_load_ps ( &amp;Huv[0] );<br />		next2 = _mm256_load_ps ( &amp;Huv[8] );<br />		hr4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		hi4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		hsr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , hr4 ),<br />		_mm256_mul_ps (hi4 , hi4 ) );<br />		/*<br />		* Compute numerator term<br />		*<br />		* Complex Num = complex conj Huv * Guv<br />		*<br />		* numr = (float) ( Huv[i]*Guv[i] + Huv[i+1]*Guv[i+1] );<br />		* numi = (float) ( Huv[i]*Guv[i+1] - Huv[i+1]*Guv[i ] );<br />		*/<br />		first2 = _mm256_load_ps ( &amp;Guv[0] );<br />		next2 = _mm256_load_ps ( &amp;Guv[8] );<br />		gr4 = _mm256_shuffle_ps( first2, next2, 0x88 );<br />		gi4 = _mm256_shuffle_ps( first2, next2, 0xdd );<br />		numr4 = _mm256_add_ps ( _mm256_mul_ps (hr4 , gr4),<br />		_mm256_mul_ps (hi4 , gi4) );<br />		numi4 = _mm256_sub_ps ( _mm256_mul_ps (hr4 , gi4),<br />		_mm256_mul_ps (hi4 , gr4) );<br />		/*<br />		* Final computation<br />		*<br />		* Complex Image = Num / (Hs + D)<br 
/>		*<br />		* if ( (hsr + dr) != 0.0 )<br />		* {<br />		* Img[i] = (float) (numr / (hsr + dr));<br />		* Img[i+1] = (float) (numi / (hsr + dr));<br />		* }<br />		* else<br />		* {<br />		* Img[i] = (float) 0.0;<br />		* Img[i+1] = (float) 0.0;<br />		* }<br />		*<br />		* Again use the refined reciprocal instead of a division<br />		*/<br />		denom = _mm256_add_ps( hsr4, dr4 );<br />		rc = _mm256_rcp_ps(denom);<br />		rc = _mm256_sub_ps( _mm256_add_ps( rc, rc),<br />		_mm256_mul_ps( rc, _mm256_mul_ps( rc, denom) ) );<br />		inr4 = _mm256_and_ps( _mm256_mul_ps ( numr4 , rc ) ,<br />		_mm256_cmp_ps( denom, zero, _CMP_NEQ_UQ ) );<br />		ini4 = _mm256_and_ps( _mm256_mul_ps ( numi4 , rc ) ,<br />		_mm256_cmp_ps( denom, zero, _CMP_NEQ_UQ ) );<br /><br />		_mm256_store_ps( &amp;Img[j ], _mm256_unpacklo_ps ( inr4, ini4 ) );<br />		_mm256_store_ps( &amp;Img[j+8], _mm256_unpackhi_ps ( inr4, ini4 ) );<br />		j+=16;<br />	}<br />} /* intrin_wiener_rcp_avx */<br /><br /></pre>
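<p>Both AVX listings replace each division with <code>_mm256_rcp_ps</code> followed by the expression <code>(rc + rc) - rc * (rc * d)</code>, which is one Newton-Raphson step for the reciprocal (see reference 2 below). The scalar sketch that follows illustrates that step in isolation. The function names are hypothetical, and the coarse starting value is an assumption standing in for the hardware approximation, which is not modeled here.</p>

```cpp
#include <cmath>

// One Newton-Raphson step for 1/d, written the same way as in the listings:
//   rc' = 2*rc - rc*rc*d
// If rc has relative error e, rc' has relative error of about e*e.
float refine_reciprocal(float d, float rc) {
    return (rc + rc) - rc * (rc * d);
}

// Relative error of rc as an approximation of 1/d (illustrative helper).
float reciprocal_error(float d, float rc) {
    return std::fabs(d * rc - 1.0f);
}
```

<p>For example, starting from the coarse guess 0.333 for 1/3 (relative error about 1e-3), one step brings the relative error down to roughly 1e-6. The step cannot repair a zero denominator, since the approximate reciprocal of zero is infinite and the refinement then yields a non-finite value. This is presumably why the listings mask those elements to zero with a not-equal comparison.</p>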
<p><b><br /><br />Acknowledgements<br /></b>The authors would like to thank Phil Kerly, Raghu Muthyalampalli and Justin Landon, who assisted with the performance assessment of the code, provided performance recommendations, and reviewed the whitepaper.</p>
<p><b>References<br /></b>The following documents are referenced in this application note, and provide background or supporting information for understanding the topics presented in this document.</p>
<p>1. <i>The Pocket Handbook of Image Processing Algorithms in C</i>, by Harley R. Myler and Arthur R. Weeks. ISBN 0-13-642240-3.<br />2. <i>Increasing the Accuracy of the Results from the Reciprocal and Reciprocal Square Root Instructions using the Newton-Raphson Method</i>, Intel application note (AP-803, Order Number: 243637-001).<br />3. <i>Split-Radix FFT</i>, Intel application note (AP-808, Order Number: 243642-001).<br />4. <i>Wiener Filtering Using Streaming SIMD Extensions</i>, Intel application note (AP-807).</p>
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Kit Chung</div>
<div>Karthikeyan Krishnan</div>
</div>
<div id="vc-meta-pubdate">01-25-2010</div>
<div id="vc-meta-modificationdate">01-25-2010</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Intel® SSE</div>
<div>Intel® AVX</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Wiener filtering is a technique for removing unwanted noise from an image. Wiener filtering is often used to restore a blurry image. Intel® AVX is a SIMD 256-bit instruction set that improves upon the existing Intel® SSE 128-bit instruction set. This article shows an Intel AVX Wiener filter that has 1.46x performance relative to the Intel SSE Wiener filter.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/wiener-filtering-using-intel-advanced-vector-extensions/</link>
      <pubDate>Tue, 26 Jan 2010 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/wiener-filtering-using-intel-advanced-vector-extensions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/wiener-filtering-using-intel-advanced-vector-extensions/</guid>
      <category>Parallel Programming</category>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
      <category>Media</category>
    </item>
  </channel></rss>
