<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Mon, 21 May 2012 04:47:47 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/avx/type/tutorials/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/avx/type/tutorials/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Introduction to Intel® Advanced Vector Extensions</title>
      <description><![CDATA[ <b>By Chris Lomont</b><br /><br />
<h2 class="sectionHeading">Download Article</h2>
Download <a href="http://software.intel.com/file/37205">Introduction to Intel® Advanced Vector Extensions</a> [PDF 1.4MB]<br /><br />Intel® Advanced Vector Extensions (Intel® AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel® architecture CPUs. These instructions extend previous SIMD offerings (MMX™ instructions and Intel® Streaming SIMD Extensions (Intel® SSE)) by adding the following new features:<br /> 
<ul>
<li>The 128-bit SIMD registers have been expanded to 256 bits. Intel® AVX is designed to support 512 or 1024 bits in the future.</li>
<li>Three-operand, nondestructive operations have been added. Previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand; the new instructions can perform operations like A = B + C, leaving the original source operands unchanged.</li>
<li>A few instructions take four-register operands, allowing smaller and faster code by removing unnecessary instructions.</li>
<li>Memory alignment requirements for operands are relaxed.</li>
<li>A new extension coding scheme (VEX) has been designed to make future additions easier and to make instruction encodings smaller and faster to execute.</li>
</ul>
Closely related to these advances are the new Fused-Multiply-Add (FMA) instructions, which allow faster and more accurate specialized operations such as the single-instruction A = A * B + C. The FMA instructions are not available in the second-generation Intel® Core™ CPU; they are planned for later processors. Other features include new instructions for dealing with Advanced Encryption Standard (AES) encryption and decryption, a packed carry-less multiplication operation (PCLMULQDQ) useful for certain encryption primitives, and some reserved slots for future instructions, such as a hardware random number generator.<br /><br /><br />
<h2 class="sectionHeading">Instruction Set Overview</h2>
The new instructions are encoded using what Intel calls a <i>VEX prefix</i>, which is a two- or three-byte prefix designed to clean up the complexity of current and future x86/x64 instruction encoding. The two new VEX prefixes are formed from two obsolete 32-bit instructions, Load Pointer Using ES (LES, 0xC4, three-byte form) and Load Pointer Using DS (LDS, 0xC5, two-byte form), which load the ES and DS segment registers in 32-bit mode. In 64-bit mode, opcodes LDS and LES generate an invalid-opcode exception, but under Intel® AVX, these opcodes are repurposed for encoding new instruction prefixes. In 32-bit mode, a VEX prefix is distinguished from LDS and LES by a bit pattern in the following byte that would be illegal for those instructions, so VEX instructions are usable in both 32-bit and 64-bit modes, although only 64-bit mode gives access to all 16 YMM registers. The prefixes allow encoding more registers than previous x86 instructions and are required for accessing the new 256-bit SIMD registers or using the three- and four-operand syntax. As a user, you do not need to worry about this (unless you're writing assemblers or disassemblers).<br /><br /><br /><b>Note:</b> <i>The rest of this article assumes operation in 64-bit mode.</i><br /><br /><br />SIMD instructions allow processing of multiple pieces of data in a single step, speeding up throughput for many tasks, from video encoding and decoding to image processing to data analysis to physics simulations. Intel® AVX instructions work on Institute of Electrical and Electronics Engineers (IEEE)-754 floating-point values in 32-bit length (called <i>single precision</i>) and in 64-bit length (called <i>double precision</i>). IEEE-754 is the standard defining reproducible, robust floating-point operation and is the standard for most mainstream numerical computations.<br /><br />The older, related Intel® SSE instructions also support various signed and unsigned integer sizes, including signed and unsigned byte (B, 8-bit), word (W, 16-bit), doubleword (DW, 32-bit), quadword (QW, 64-bit), and doublequadword (DQ, 128-bit) lengths. 
Not all instructions are available in all size combinations; for details, see the links provided in "For More Information." See Figure 2 later in this article for a graphical representation of the data types.<br /><br />The hardware supporting Intel® AVX (and FMA) consists of sixteen 256-bit YMM registers, YMM0-YMM15, and a 32-bit control/status register called <i>MXCSR</i>. The YMM registers are aliased over the older 128-bit XMM registers used for Intel SSE, treating the XMM registers as the lower half of the corresponding YMM register, as shown in Figure 1.<br /><br />Bits 0-5 of MXCSR indicate SIMD floating-point exceptions with "sticky" bits: after being set, they remain set until cleared using <code>LDMXCSR</code> or <code>FXRSTOR</code>. Bits 0-5 represent invalid operation, denormal, divide by zero, overflow, underflow, and precision, respectively. Bits 7-12 mask the corresponding individual exceptions when set; they are set initially at power-up or reset. For details, see the links in "For More Information."<br /><br />
<p ><img src="http://software.intel.com/file/37206" /></p>
<div ><b>Figure 1.</b><i> XMM registers overlay the YMM registers.</i><br /></div>
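The MXCSR layout described above (sticky flags in bits 0-5, the matching masks in bits 7-12) can be sketched as a small decoder. This is a minimal illustration that works on a plain 32-bit value rather than reading the register itself; the names RaisedExceptions, IsMasked, and kMxcsrExceptionNames are hypothetical and not part of any Intel header.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Names of the six SIMD floating-point exceptions, in MXCSR bit order (bit 0 first).
static const char* const kMxcsrExceptionNames[6] = {
    "invalid operation", "denormal", "divide by zero",
    "overflow", "underflow", "precision"};

// Returns the names of the sticky exception flags (bits 0-5) that are set in mxcsr.
std::vector<std::string> RaisedExceptions(std::uint32_t mxcsr) {
    std::vector<std::string> raised;
    for (int bit = 0; bit < 6; ++bit) {
        if (mxcsr & (1u << bit)) {
            raised.push_back(kMxcsrExceptionNames[bit]);
        }
    }
    return raised;
}

// Returns true when the exception whose flag is bit flagBit (0-5) is masked.
// The mask bit for flag bit n sits at bit n + 7, which gives the 7-12 range.
bool IsMasked(std::uint32_t mxcsr, int flagBit) {
    return (mxcsr & (1u << (flagBit + 7))) != 0;
}
```

For instance, the value 0x1F80 has exactly bits 7-12 set, which corresponds to the state described above where every exception is masked and no flag has been raised yet.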
<br />Figure 2 illustrates the data types used in the Intel® SSE and Intel® AVX instructions. Roughly, Intel AVX allows any number of 32-bit or 64-bit floating-point values that add up to 128 or 256 bits, as well as any number of values of a single integer type that add up to 128 bits.<br /><br />
<p ><img src="http://software.intel.com/file/37207" /></p>
<div ><b>Figure 2.</b> <i>Intel® AVX and Intel® SSE data types</i><br /></div>
<br />Instructions often come in scalar and vector versions, as illustrated in Figure 3. Vector versions operate by treating data in the registers in parallel "SIMD" mode; the scalar version only operates on one entry in each register. This distinction allows less data movement for some algorithms, providing better overall throughput.<br /><br />
<p ><img src="http://software.intel.com/file/37208" /></p>
<div ><b>Figure 3.</b> SIMD versus scalar operations<br /></div>
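The packed-versus-scalar distinction in Figure 3 can be modeled in plain C++ without any SIMD hardware. The sketch below treats a four-element array as one 128-bit register of single-precision values; Vec4, AddPacked, and AddScalar are illustrative names chosen here, and the lane behavior follows the ADDPS and ADDSS pattern (all lanes versus lane 0 only, with the remaining lanes taken from the first source).

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // models one 128-bit register holding four floats

// Packed add (the ADDPS pattern): every lane of a is added to the same lane of b.
Vec4 AddPacked(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = a[i] + b[i];
    }
    return r;
}

// Scalar add (the ADDSS pattern): only lane 0 is added;
// lanes 1-3 are copied unchanged from the first source.
Vec4 AddScalar(const Vec4& a, const Vec4& b) {
    Vec4 r = a;
    r[0] = a[0] + b[0];
    return r;
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, AddPacked yields {11, 22, 33, 44}, while AddScalar yields {11, 2, 3, 4}.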
<br />Data is <i>memory aligned</i> when the data to be operated upon as an n-byte chunk is stored on an n-byte memory boundary. For example, when loading 256-bit data into YMM registers, if the data source is 256-bit aligned, the data is called <i>aligned</i>.<br /><br />For Intel® SSE operations, memory alignment was required unless explicitly stated. For example, under Intel SSE, there were specific instructions for memory-aligned and memory-unaligned operations, such as the <code>MOVAPD</code> (move-aligned packed double) and <code>MOVUPD</code> (move-unaligned packed double) instructions. Instructions not split in two like this required aligned accesses.<br /><br />Intel® AVX has relaxed some memory alignment requirements, so Intel AVX by default allows unaligned access; however, this access may come with a performance penalty, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the <i>Intel® Advanced Vector Extensions Programming Reference</i> (see "For More Information" for a link).<br /><br />Another performance concern besides unaligned data is that mixing legacy XMM-only instructions and newer Intel AVX instructions causes delays. For optimal throughput, do not mix VEX-prefixed and non-VEX-prefixed instructions. If you must do so, minimize transitions between the two by grouping instructions of the same VEX/non-VEX class. 
Alternatively, there is no transition penalty if the upper YMM bits are set to zero via <code>VZEROUPPER</code> or <code>VZEROALL</code>, which compilers should automatically insert. This insertion requires an extra instruction, so profiling is recommended.<br /><br /><br />
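The alignment rule above (16-byte boundaries for 128-bit access, 32-byte boundaries for 256-bit access) can be checked and enforced in standard C++. This is a minimal sketch, assuming a C++11 or later compiler for alignas; IsAligned and YmmBlock are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when p sits on an alignment-byte boundary.
// alignment must be a power of two (16 for XMM-sized data, 32 for YMM-sized data).
bool IsAligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Eight floats (256 bits) forced onto a 32-byte boundary,
// so the whole block is suitable for an aligned 256-bit access.
struct alignas(32) YmmBlock {
    float lanes[8];
};
```

A check such as IsAligned(block.lanes, 32) can guard a code path that uses aligned loads, with an unaligned load as the fallback when it fails.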
<h2 class="sectionHeading">Intel® AVX Instruction Classes</h2>
As mentioned, Intel® AVX adds support for many new instructions and extends current Intel SSE instructions to the new 256-bit registers, with most old Intel SSE instructions having a V-prefixed Intel AVX version for accessing new register sizes and three-operand forms. Depending on how instructions are counted, there are up to a few hundred new Intel AVX instructions.<br /><br />For example, the old two-operand Intel SSE instruction <code>ADDPS xmm1, xmm2/m128</code> can now be expressed in three-operand syntax as <code>VADDPS xmm1, xmm2, xmm3/m128</code> or, for the 256-bit registers, as <code>VADDPS ymm1, ymm2, ymm3/m256</code>. A few instructions allow four operands, such as <code>VBLENDVPS ymm1, ymm2, ymm3/m256, ymm4</code>, which conditionally copies single-precision floating-point values from <code>ymm2</code> or <code>ymm3/m256</code> to <code>ymm1</code> based on masks in <code>ymm4</code>. This is an improvement on the previous form, where <code>xmm0</code> was implicitly needed, requiring compilers to free up <code>xmm0</code>. Now, with all registers explicit, there is more freedom for register allocation. Here, <code>m128</code> is a 128-bit memory location, <code>xmm1</code> is the 128-bit register, and so on.<br /><br />Some new instructions are VEX only (not Intel SSE extensions), including many ways to move data into and out of the YMM registers. Examples are the useful <code>VBROADCASTS[S/D]</code>, which loads a single value into all elements of an XMM or YMM register, and ways to shuffle data around in a register using <code>VPERMILP[S/D]</code>. (The bracket notation is explained in Appendix A.)<br /><br />Intel® AVX adds arithmetic instructions for variants of add, subtract, multiply, divide, square root, compare, min, max, and round on single- and double-precision packed and scalar floating-point data. Many new conditional predicates are also useful for 128-bit Intel SSE, giving 32 comparison types. 
Intel® AVX also includes instructions promoted from previous SIMD extensions covering logical, blend, convert, test, pack, unpack, shuffle, load, and store. The toolset adds new instructions, as well, including non-strided fetching (broadcast of single or multiple data into a 256-bit destination, masked-move primitives for conditional load and store), insert and extract multiple-SIMD data to and from 256-bit SIMD registers, permute primitives to manipulate data within a register, branch handling, and packed testing instructions.<br /><br /><br /><b><i>Future Additions</i></b><br />The Intel® AVX manual also lists some proposed future instructions, covered here for completeness. This is not a guarantee that these instructions will materialize as written.<br /><br />Two instructions (<code>VCVTPH2PS</code> and <code>VCVTPS2PH</code>) are reserved for supporting 16-bit floating-point conversions to and from single-precision floating-point types. The 16-bit format is called <i>half-precision</i> and has a 10-bit mantissa (with an implied leading 1 for non-denormalized numbers, resulting in 11-bit precision), 5-bit exponent (biased by 15), and 1-bit sign.<br /><br />The proposed <code>RDRAND</code> instruction uses a cryptographically secure hardware digital random bit generator to generate random numbers for 16-, 32-, and 64-bit registers. On success, the carry flag is set to 1 (<code>CF=1</code>). If not enough entropy is available, the carry flag is cleared (<code>CF=0</code>).<br /><br />Finally, there are four instructions (<code>RDFSBASE, RDGSBASE, WRFSBASE, and WRGSBASE</code>) to read and write the FS and GS base registers at all privilege levels in 64-bit mode.<br /><br />Another future addition is the FMA instructions, which perform operations similar to A = + A * B + C, where either of the plus signs (+) on the right can be changed to a minus sign (-) and the three operands on the right can be in any order. There are also forms for interleaved addition and subtraction. 
Packed FMA instructions can perform eight single-precision FMA operations or four double-precision FMA operations with 256-bit vectors.<br /><br />FMA operations such as A = A * B + C are better than performing one step at a time, because intermediate results are treated as infinite precision, with rounding done on store, and thus are more accurate for computation. This single rounding is what gives the "fused" prefix. They are also faster than performing the computation in steps.<br /><br />Each instruction comes in three forms for the ordering of the operands A, B, and C, with the ordering corresponding to a three-digit extension: form <i>132</i> does A = AC + B, form <i>213</i> does A = BA + C, and form <i>231</i> does A = BC + A. The ordering number is just the order of the operands on the right side of the expression.<br /><br /><br />
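The three operand orderings and the single-rounding behavior described above can be tried on any platform with the standard library function std::fma, which computes a fused multiply-add with one rounding whether or not the CPU has FMA hardware. This is only a sketch of the arithmetic; Fma132, Fma213, Fma231, and MulThenAdd are illustrative names, not Intel intrinsics.

```cpp
#include <cmath>

// The digits name the operand positions on the right-hand side; the result
// always replaces operand 1 (A) in the instruction forms.
float Fma132(float a, float b, float c) { return std::fma(a, c, b); }  // A*C + B
float Fma213(float a, float b, float c) { return std::fma(b, a, c); }  // B*A + C
float Fma231(float a, float b, float c) { return std::fma(b, c, a); }  // B*C + A

// Unfused reference: the product is rounded to single precision before the add.
// The volatile store keeps the compiler from contracting the two steps into one.
float MulThenAdd(float a, float b, float c) {
    volatile float product = a * b;  // first rounding happens here
    return product + c;              // second rounding happens here
}
```

As a worked case, take a = 1 + 2^-23 and c = -(1 + 2^-22). The exact product a*a is 1 + 2^-22 + 2^-46; the unfused version rounds away the last term and returns 0, while std::fma(a, a, c) keeps it and returns 2^-46.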
<h2 class="sectionHeading">Availability and Support</h2>
Detecting availability of the Intel® AVX features in hardware requires using the <code>CPUID</code> instruction to query support in the CPU and in the operating system, as detailed later. Second-generation Intel® Core™ processors (Intel® microarchitecture code name Sandy Bridge), released in Q1, 2011, are the first from Intel supporting Intel® AVX technology. These processors do not have the new FMA instructions. For development and testing without hardware support, the free Intel® Software Development Emulator (see "For More Information" for a link) includes support for all these features, including Intel AVX, FMA, PCLMULQDQ, and AES instructions.<br /><br />To use the Intel AVX extensions reliably in most settings, the operating system must support saving and loading the new registers (with <code>XSAVE/XRSTOR</code>) on thread context switches to prevent data corruption. To help avoid such errors, operating systems supporting Intel AVX-aware context switches explicitly set a CPU bit enabling the new instructions; otherwise, an undefined opcode (<code>#UD</code>) exception is generated when Intel AVX instructions are used.<br /><br />Microsoft Windows* 7 with Service Pack 1 (SP1) and Microsoft Windows* Server 2008 R2 with SP1 (both 32- and 64-bit versions), as well as later versions of Windows*, support Intel AVX save and restore in thread and process switches. Linux* kernels from 2.6.30 (June 2009) and later support Intel AVX, as well.<br /><br /><br /><b><i>Detecting Availability and Support</i></b><br />Detection of support for the four areas (Intel® AVX, FMA, AES, and PCLMULQDQ) is similar for each and consists of checking for hardware and operating system support for the desired feature (see Table 1). These steps are (counting bits starting at bit 0):<br /><br /><ol>
<li>Verify that the operating system supports <code>XGETBV</code> using <code>CPUID.1:ECX.OSXSAVE bit 27 = 1</code>.</li>
<li>At the same time, verify that <code>CPUID.1:ECX bit 28=1</code> (Intel AVX supported) and/or <code>bit 25=1</code> (AES supported) and/or <code>bit 12=1</code> (FMA supported) and/or <code>bit 1=1</code> (PCLMULQDQ supported).</li>
<li>Issue <code>XGETBV</code>, and verify that bits 1 and 2 of the feature-enabled mask are 11b (XMM state and YMM state enabled by the operating system).</li>
</ol><b>Table 1.</b> <i>Feature-detection Masks</i><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Feature</td>
<td >Bits to check</td>
<td >Constant</td>
</tr>
<tr >
<td >Intel® AVX</td>
<td >28, 27</td>
<td ><code>018000000H</code></td>
</tr>
<tr >
<td >VAES</td>
<td >28, 27, and 25</td>
<td ><code>01A000000H</code></td>
</tr>
<tr >
<td >VPCLMULQDQ</td>
<td >28, 27, and 1</td>
<td ><code>018000002H</code></td>
</tr>
<tr >
<td >FMA</td>
<td >28, 27, and 12</td>
<td ><code>018001000H</code></td>
</tr>
</tbody>
</table>
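The constants in Table 1 follow directly from the bit positions listed in the steps above. As a quick check, the sketch below rebuilds each mask from its bits and verifies it against the table at compile time; Bit and the k-prefixed names are illustrative and assume a C++11 compiler.

```cpp
#include <cstdint>

// A single-bit mask for bit n of a 32-bit register (bits counted from 0).
constexpr std::uint32_t Bit(int n) { return std::uint32_t{1} << n; }

// CPUID.1:ECX feature bits named in the detection steps.
constexpr std::uint32_t kOsxsave   = Bit(27);
constexpr std::uint32_t kAvx       = Bit(28);
constexpr std::uint32_t kAes       = Bit(25);
constexpr std::uint32_t kFma       = Bit(12);
constexpr std::uint32_t kPclmulqdq = Bit(1);

// Every feature needs AVX and OSXSAVE in addition to its own bit.
constexpr std::uint32_t kAvxMask        = kAvx | kOsxsave;
constexpr std::uint32_t kVaesMask       = kAvxMask | kAes;
constexpr std::uint32_t kVpclmulqdqMask = kAvxMask | kPclmulqdq;
constexpr std::uint32_t kFmaMask        = kAvxMask | kFma;

// These must agree with the Constant column of Table 1.
static_assert(kAvxMask        == 0x18000000u, "AVX mask mismatch");
static_assert(kVaesMask       == 0x1A000000u, "VAES mask mismatch");
static_assert(kVpclmulqdqMask == 0x18000002u, "VPCLMULQDQ mask mismatch");
static_assert(kFmaMask        == 0x18001000u, "FMA mask mismatch");
```

Building the masks from named bits rather than typing the hexadecimal constants makes it harder to drop a zero, and the static_assert lines document the link back to the table.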
<br />Example code implementing this process is provided in Listing 1, where the <code>CONSTANT</code> is the value from Table 1. A Microsoft* Visual Studio* C++ intrinsic version is given later.<br /><br /><b>Listing 1.</b> <i>Feature Detection</i><br />
<pre name="code" class="cpp">INT Supports_Feature()
   { 
   ; result returned in eax
   mov eax, 1
   cpuid
   and ecx, CONSTANT
   cmp ecx, CONSTANT; check desired feature flags
   jne not_supported 
   ; processor supports features
   mov ecx, 0; specify 0 for XFEATURE_ENABLED_MASK register
   XGETBV; result in EDX:EAX
   and eax, 06H
   cmp eax, 06H; check OS has enabled both XMM and YMM state support
   jne not_supported
   mov eax, 1; mark as supported
   jmp done
   not_supported:
   mov eax, 0; mark as not supported
   done:
   }
</pre>
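The decision logic of Listing 1 can also be separated from the hardware queries, which makes it easy to test on any machine. In the sketch below, FeatureUsable is a hypothetical helper: the caller supplies the ECX value returned by CPUID leaf 1, the low bits of the value returned by XGETBV with ECX = 0, and a constant from Table 1.

```cpp
#include <cstdint>

// cpuidEcx:    ECX after executing CPUID with EAX = 1
// xcr0:        value returned by XGETBV with ECX = 0 (XFEATURE_ENABLED_MASK)
// featureMask: one of the constants from Table 1
bool FeatureUsable(std::uint32_t cpuidEcx, std::uint64_t xcr0,
                   std::uint32_t featureMask) {
    // Steps 1 and 2: every requested CPUID bit, including OSXSAVE, must be set.
    if ((cpuidEcx & featureMask) != featureMask) {
        return false;
    }
    // Step 3: the operating system must have enabled XMM state (bit 1)
    // and YMM state (bit 2).
    return (xcr0 & 0x6u) == 0x6u;
}
```

Real detection code should still execute XGETBV only after the OSXSAVE check has passed, as Listing 1 does; this helper merely assumes both values were obtained safely.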
<h2 class="sectionHeading">Usage</h2>
At the lowest programming level, most common x86 assemblers now support Intel® AVX, FMA, AES, and the VPCLMULQDQ instructions, including Microsoft MASM* (Microsoft Visual Studio* 2010 version), NASM*, FASM*, and YASM*. See their respective documentation for details.<br /><br />For language compilers, Intel® C++ Compiler version 11.1 and later and Intel® Fortran Compilers support Intel® AVX through compiler switches, and both compilers support automatic vectorization of floating-point loops. The Intel C++ Compiler supports Intel AVX intrinsics (use <code>#include &lt;immintrin.h&gt;</code> to access intrinsics) and inline assembly and even supports Intel AVX intrinsics emulation using <code>#include "avxintrin_emu.h"</code>.<br /><br />Microsoft Visual Studio* C++ 2010 with SP1 and later has support for Intel AVX (see "For More Information") when compiling 64-bit code (use the <code>/arch:AVX</code> compiler switch). It supports intrinsics using the <code>&lt;immintrin.h&gt;</code> header but not inline assembly. Intel AVX support is also in MASM*, the disassembly view of code, and the debugger views of registers (giving full YMM support).<br /><br />The GNU Compiler Collection* (GCC*) version 4.4 supports Intel AVX intrinsics through the same header, <code>&lt;immintrin.h&gt;</code>. Other GNU toolchain support is found in Binutils 2.20.51.0.1 and later, gdb 6.8.50.20090915 and later, recent GNU Assembler (GAS) versions, and <code>objdump</code>. If your compiler does not support Intel AVX, you can emit the required bytes under many circumstances, but first-class support makes your life easier.<br /><br />Each of the three C++ compilers mentioned supports the same intrinsic operations to simplify using Intel® AVX from C or C++ code. <i>Intrinsics</i> are functions that the compiler replaces with the proper assembly instructions. Most Intel AVX intrinsic names use the following format:<br /><br />
<pre name="code" class="cpp">_mm256_op_suffix(data_type param1, data_type param2, data_type param3)
</pre>
where <code>_mm256</code> is the prefix for working on the new 256-bit registers; <code>_op</code> is the operation, like <code>add</code> for addition or <code>sub</code> for subtraction; and <code>_suffix</code> denotes the type of data to operate on, with the first letters denoting packed (p), extended packed (ep), or scalar (s). The remaining letters are the types in Table 2.<br /><br /><br /><b>Table 2.</b> <i>Intel® AVX Suffix Markings</i><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Marking</td>
<td >Meaning</td>
</tr>
<tr >
<td ><code>[s/d]</code></td>
<td >Single- or double-precision floating point</td>
</tr>
<tr >
<td ><code>[i/u]nnn</code></td>
<td >Signed or unsigned integer of bit size <i>nnn</i>, where <i>nnn</i> is 128, 64, 32, 16, or 8</td>
</tr>
<tr >
<td ><code>[ps/pd/sd]</code></td>
<td >Packed single, packed double, or scalar double</td>
</tr>
<tr >
<td ><code>epi32</code></td>
<td >Extended packed 32-bit signed integer</td>
</tr>
<tr >
<td ><code>si256</code></td>
<td >Scalar 256-bit integer</td>
</tr>
</tbody>
</table>
<br />Data types are in Table 3. The first two parameters are source registers, and the third parameter (when present) is an integer mask, selector, or offset value.<br /><br /><b>Table 3.</b> <i>Intel® AVX Intrinsics Data Types</i><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Type</td>
<td >Meaning</td>
</tr>
<tr >
<td ><code>__m256</code></td>
<td >256-bit as eight single-precision floating-point values, representing a YMM register or memory location</td>
</tr>
<tr >
<td ><code>__m256d</code></td>
<td >256-bit as four double-precision floating-point values, representing a YMM register or memory location</td>
</tr>
<tr >
<td ><code>__m256i</code></td>
<td >256-bit as integers (bytes, words, etc.)</td>
</tr>
<tr >
<td ><code>__m128</code></td>
<td >128-bit single precision floating-point (32 bits each)</td>
</tr>
<tr >
<td ><code>__m128d</code></td>
<td >128-bit double precision floating-point (64 bits each)</td>
</tr>
</tbody>
</table>
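To make the naming scheme and the data types concrete, here is a minimal sketch that adds eight single-precision values. The function name Add8 is hypothetical. The intrinsics path is compiled only when the compiler defines __AVX__ (for example with /arch:AVX or -mavx), which is a compile-time switch and not a substitute for the run-time detection described earlier; otherwise a scalar loop with the same result is used.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Adds the eight floats at a to the eight floats at b and stores them at out.
void Add8(const float* a, const float* b, float* out) {
#if defined(__AVX__)
    // _mm256 = 256-bit prefix, loadu = operation (unaligned load), ps = packed single
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    // op = add, suffix = ps: eight single-precision additions at once
    __m256 sum = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(out, sum);
#else
    // Scalar fallback producing the same values one lane at a time.
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

The unaligned load and store forms are used so that callers do not need 32-byte aligned buffers; with aligned data, _mm256_load_ps and _mm256_store_ps could be substituted.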
<br />Some intrinsics are in other headers, such as the AES and PCLMULQDQ intrinsics in <code>&lt;wmmintrin.h&gt;</code>. Consult your compiler documentation or the web to track down where various intrinsics live.<br /><br /><br /><i><b>Microsoft Visual Studio* 2010</b></i><br />For conciseness, the rest of this article uses Microsoft Visual Studio* 2010 with SP1; similar code should work on the Intel® compiler or GCC*. Microsoft Visual Studio* 2010 with SP1 can automatically generate Intel® AVX code if you click <b>Project Properties &gt; Configuration &gt; Code Generation</b>, select <b>Not Set</b> under <b>Enable Enhanced Instruction Set</b>, and then manually add <code>/arch:AVX</code> to the command line under the <b>Command Line</b> entry. As an example of using intrinsics, Listing 2 offers an intrinsic-based Intel AVX feature-detection routine.<br /><br /><br /><b>Listing 2.</b> <i>Intrinsic-based Feature Detection</i><br />
<pre name="code" class="cpp">// get AVX intrinsics
#include &lt;immintrin.h&gt;
// get CPUID capability
#include &lt;intrin.h&gt;

// written for clarity, not conciseness
#define OSXSAVEFlag (1UL&lt;&lt;27)
#define AVXFlag     ((1UL&lt;&lt;28)|OSXSAVEFlag)
#define VAESFlag    ((1UL&lt;&lt;25)|AVXFlag|OSXSAVEFlag)
#define FMAFlag     ((1UL&lt;&lt;12)|AVXFlag|OSXSAVEFlag)
#define CLMULFlag   ((1UL&lt;&lt; 1)|AVXFlag|OSXSAVEFlag)
 
bool DetectFeature(unsigned int feature)
	{
	int CPUInfo[4];
	__cpuidex(CPUInfo, 1, 0);       // query CPUID leaf 1 (feature flags), subleaf 0
	unsigned int ECX = CPUInfo[2];  // the output of CPUID in the ECX register. 
	if ((ECX &amp; feature) != feature) // Missing feature 
		return false; 
	__int64 val = _xgetbv(0);       // read XFEATURE_ENABLED_MASK register
	if ((val&amp;6) != 6)               // check OS has enabled both XMM and YMM support.
		return false; 
	return true;
	}
</pre>
<h2 class="sectionHeading">Mandelbrot Example</h2>
To demonstrate using the new instructions, compute Mandelbrot set images using straight C/C++ code (checking to ensure that the compiler did not convert the code to Intel® AVX instructions!) and the new Intel AVX instructions as intrinsics, comparing their performance. Computing a Mandelbrot set is a computationally intensive operation on complex numbers, defined in pseudocode as shown in Listing 3.<br /><br /><br /><b>Listing 3.</b> <i>Mandelbrot Pseudocode</i><br />
<pre name="code" class="cpp">z,p are complex numbers
for each point p on the complex plane
	z = 0
	for count = 0 to max_iterations
		if abs(z) &gt; 2.0
			break
		z = z*z+p
	set color at p based on count reached
</pre>
The usual image is over the portion of the complex plane in the rectangle (<code>-2,-1</code>) to (<code>1,1</code>). Coloring can be done in many ways (not covered here). When zooming into portions of the set, raise the maximum iteration count so that there are enough iterations to determine whether a value "escapes."<br /><br />To really stress the CPU, zoom in and draw the box (<code>0.29768, 0.48364</code>) to (<code>0.29778, 0.48354</code>), computing the grid of counts at multiple sizes and using a max iteration of 4096. The resulting grid of counts, when colored appropriately, is shown in Figure 4.<br /><br />
<p ><img src="http://software.intel.com/file/37209" /></p>
<div ><b>Figure 4.</b> <i>Mandelbrot set (<b>0.29768, 0.48364</b>) to (<b>0.29778, 0.48354</b>), with max iterations of <b>4096</b></i><br /><br /></div>
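Because the zoom box above spans only 0.0001 in each direction, it can be useful to compare the spacing between pixel columns with the resolution of a single-precision float near those coordinates. The sketch below is a rough check under that assumption; PixelStep and UlpAbove are illustrative names and do not appear in the article's code.

```cpp
#include <cmath>
#include <limits>

// Distance between adjacent pixel columns when [x1, x2] is divided into width steps.
float PixelStep(float x1, float x2, int width) {
    return (x2 - x1) / width;
}

// Gap between x and the next representable float above it.
float UlpAbove(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For x1 = 0.29768 the float spacing is 2^-25 (about 3.0e-8). At a width of 128 the pixel step is about 7.8e-7, comfortably larger, but at a width of 4096 it is about 2.4e-8, which is below the float spacing, so some adjacent columns may round to the same coordinate at the largest sizes. This affects image fidelity rather than the amount of arithmetic being timed.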
<br />A basic C++ implementation to compute the iteration counts is provided in Listing 4. The absolute value of the complex number compared to 2 is replaced with the norm compared to 4.0, almost doubling the speed by removing a square root. For all versions, use single-precision floats to pack as many elements into the YMM registers as possible, which is faster but loses precision compared to doubles when zooming in further.<br /><br /><br /><b>Listing 4.</b> <i>Simple Mandelbrot C++ Code</i><br />
<pre name="code" class="cpp">// simple code to compute Mandelbrot in C++
#include &lt;complex&gt;
void MandelbrotCPU(float x1, float y1, float x2, float y2, 
                   int width, int height, int maxIters, unsigned short * image)
{
	float dx = (x2-x1)/width, dy = (y2-y1)/height;
	for (int j = 0; j &lt; height; ++j)
		for (int i = 0; i &lt; width; ++i)
		{
			std::complex&lt;float&gt; c (x1+dx*i, y1+dy*j), z(0,0);
			int count = -1;
			while ((++count &lt; maxIters) &amp;&amp; (norm(z) &lt; 4.0))
				z = z*z+c;
			*image++ = count;
		}
}
</pre>
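The norm-versus-4.0 test in Listing 4 becomes explicit when the complex arithmetic is expanded into plain floats, using (x + iy)^2 = (x^2 - y^2) + i(2xy). The following is a minimal sketch of the per-point loop under that expansion; MandelbrotCount is a hypothetical name, and this is not necessarily identical to the expanded version used in the timing tests.

```cpp
// Iteration count for one point c = (cx, cy), following the loop shape of Listing 4.
int MandelbrotCount(float cx, float cy, int maxIters) {
    float x = 0.0f, y = 0.0f;  // z = 0
    int count = -1;
    // x*x + y*y is the norm of z; comparing it to 4 avoids the square root in abs(z) > 2.
    while ((++count < maxIters) && (x * x + y * y < 4.0f)) {
        float xNew = x * x - y * y + cx;  // real part of z*z + c
        y = 2.0f * x * y + cy;            // imaginary part of z*z + c
        x = xNew;
    }
    return count;
}
```

For example, the point (0, 0) never escapes and returns maxIters, while the point (1, 0) follows 0, 1, 2 and stops with a count of 2 once the norm reaches 4.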
Test multiple versions for performance: the basic one in Listing 4, a similar CPU version made by expanding the complex types with floats, an intrinsic-based SSE version, and an intrinsic-based Intel® AVX version shown in Listing 5. Each version is tested on image sizes of 128×128, 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096. The performance of each implementation could likely be improved while retaining its underlying instruction set constraints with more work, but they should be representative of what you can obtain.<br /><br />The Intel AVX version has been carefully crafted to fit as much as possible into the 16 YMM registers. To help track how you want them to be allocated, the variables are named <code>ymm0</code> through <code>ymm15</code>. Of course, the compiler allocates registers as it sees fit, but by being careful, you can try to make all computations stay in registers this way. (Actually, from looking at the disassembly, the compiler does not allocate them nicely, and recasting this in assembly code would be a good exercise for anyone learning Intel AVX.)<br /><br /><br /><b>Listing 5.</b> <i>Intel® AVX-intrinsic Mandelbrot Implementation</i><br />
<pre name="code" class="cpp">float dx = (x2-x1)/width;
float dy = (y2-y1)/height;
// round up width to next multiple of 8
int roundedWidth = (width+7) &amp; ~7UL; 
 
float constants[] = {dx, dy, x1, y1, 1.0f, 4.0f};
__m256 ymm0 = _mm256_broadcast_ss(constants);   // all dx
__m256 ymm1 = _mm256_broadcast_ss(constants+1); // all dy
__m256 ymm2 = _mm256_broadcast_ss(constants+2); // all x1
__m256 ymm3 = _mm256_broadcast_ss(constants+3); // all y1
__m256 ymm4 = _mm256_broadcast_ss(constants+4); // all 1's (iter increments)
__m256 ymm5 = _mm256_broadcast_ss(constants+5); // all 4's (comparisons)
 
float incr[8]={0.0f,1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f}; // used to reset the i position when j increases
__m256 ymm6 = _mm256_xor_ps(ymm0,ymm0); // zero out j counter (ymm0 is just a dummy)
 
for (int j = 0; j &lt; height; j+=1)
{
	__m256 ymm7  = _mm256_loadu_ps(incr); // i counter set to 0,1,2,..,7 (unaligned load: incr is not guaranteed 32-byte aligned)
	for (int i = 0; i &lt; roundedWidth; i+=8)
	{
		__m256 ymm8 = _mm256_mul_ps(ymm7, ymm0);  // x0 = (i+k)*dx 
		ymm8 = _mm256_add_ps(ymm8, ymm2);         // x0 = x1+(i+k)*dx
		__m256 ymm9 = _mm256_mul_ps(ymm6, ymm1);  // y0 = j*dy
		ymm9 = _mm256_add_ps(ymm9, ymm3);         // y0 = y1+j*dy
		__m256 ymm10 = _mm256_xor_ps(ymm0,ymm0);  // zero out iteration counter
		__m256 ymm11 = ymm10, ymm12 = ymm10;        // set initial xi=0, yi=0
 
		unsigned int test = 0;
		int iter = 0;
		do
		{
			__m256 ymm13 = _mm256_mul_ps(ymm11,ymm11); // xi*xi
			__m256 ymm14 = _mm256_mul_ps(ymm12,ymm12); // yi*yi
			__m256 ymm15 = _mm256_add_ps(ymm13,ymm14); // xi*xi+yi*yi
 			
			// xi*xi+yi*yi &lt; 4 in each slot
			ymm15 = _mm256_cmp_ps(ymm15,ymm5, _CMP_LT_OQ);        
			// now ymm15 has all 1s in the non overflowed locations
			test = _mm256_movemask_ps(ymm15)&amp;255;      // lower 8 bits are comparisons
			ymm15 = _mm256_and_ps(ymm15,ymm4);
			// get 1.0f or 0.0f in each field as counters
			// counters for each pixel iteration
			ymm10 = _mm256_add_ps(ymm10,ymm15);        
 
			ymm15 = _mm256_mul_ps(ymm11,ymm12);        // xi*yi 
			ymm11 = _mm256_sub_ps(ymm13,ymm14);        // xi*xi-yi*yi
			ymm11 = _mm256_add_ps(ymm11,ymm8);         // xi &lt;- xi*xi-yi*yi+x0 done!
			ymm12 = _mm256_add_ps(ymm15,ymm15);        // 2*xi*yi
			ymm12 = _mm256_add_ps(ymm12,ymm9);         // yi &lt;- 2*xi*yi+y0	
 
			++iter;
		} while ((test != 0) &amp;&amp; (iter &lt; maxIters));
 
		// convert iterations to output values
		__m256i ymm10i = _mm256_cvtps_epi32(ymm10);
 
		// write only where needed
		int top = (i+7) &lt; width? 8: width&amp;7;
		for (int k = 0; k &lt; top; ++k)
			image[i+k+j*width] = ymm10i.m256i_i16[2*k];
 
		// next i position - increment each slot by 8 (ymm5 holds 4.0f, so it is added twice)
		ymm7 = _mm256_add_ps(ymm7, ymm5);
		ymm7 = _mm256_add_ps(ymm7, ymm5);
	}
	ymm6 = _mm256_add_ps(ymm6,ymm4); // increment j counter
}
</pre>
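The per-pixel counters in Listing 5 rely on a common SIMD idiom: the comparison produces all-one bits in lanes where the test is true and all-zero bits elsewhere, and ANDing that mask with the bit pattern of 1.0f yields 1.0f or 0.0f per lane. The sketch below models a single lane in portable C++ so the bit manipulation can be followed without AVX hardware; LaneIncrement is an illustrative name.

```cpp
#include <cstdint>
#include <cstring>

// Models one 32-bit lane of the compare-then-AND step from Listing 5.
float LaneIncrement(float normSquared) {
    const float one = 1.0f;
    // Result of the less-than compare for this lane: all ones if true, all zeros if false.
    std::uint32_t mask = (normSquared < 4.0f) ? 0xFFFFFFFFu : 0u;
    // Reinterpret 1.0f as raw bits so it can be ANDed with the mask.
    std::uint32_t oneBits;
    std::memcpy(&oneBits, &one, sizeof oneBits);
    std::uint32_t kept = mask & oneBits;
    // Reinterpret the result as a float again: 1.0f or 0.0f.
    float increment;
    std::memcpy(&increment, &kept, sizeof increment);
    return increment;
}
```

Adding this value to a running counter on every pass increments only the lanes that are still iterating, which lets all eight pixels share one loop without a per-lane branch.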
The full code for all versions and a Microsoft Visual Studio* 2010 with SP1 project, including a testing harness, is available from the links in the "For More Information" section.<br /><br />The results are shown in Figures 5 and 6. To prevent tying numbers too much to a specific CPU speed, Figure 5 shows performance of each version relative to the CPU version, which represents a straightforward non-SIMD C/C++ implementation of the algorithm. As expected, the Intel® SSE version performs almost 4 times as well, because it is doing 4 pixels per pass, and the Intel® AVX version performs almost 8 times as well as the CPU version. Because there is overhead from loops, memory access, less-than-perfect instruction ordering, and other factors, 4- and 8-fold improvements should be about the best possible, so this is pretty good for a first try.<br /><br /><img src="http://software.intel.com/file/37210" /><br /><b>Figure 5.</b> <i>Relative performance across sizes</i><br /><br />The second graph, in Figure 6, shows that the pixels computed per millisecond are fairly constant over each size; again, the algorithms show almost a quadrupling of performance from the CPU to Intel® SSE version and another doubling from the Intel SSE to Intel® AVX version.<br /><br /><img src="http://software.intel.com/file/37211" /><br /><b>Figure 6.</b> <i>Absolute performance across sizes</i><br /><br />
<h2 class="sectionHeading">Conclusion</h2>
This article provided a mid-level overview of the new Intel® Advanced Vector Extensions (Intel® AVX). These extensions are similar to previous Intel® SSE instructions but offer a much larger register space and add some new instructions. The Mandelbrot example shows performance gains over previous technology in the amount expected. For full details, be sure to check out the Intel Advanced Vector Extensions Programming Reference (see "For More Information" for a link).<br /><br />Happy hacking!<br /><br /><br />
<h2 class="sectionHeading">For More Information</h2>
Intel® Advanced Vector Extensions Programming Reference at <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a><br /><br />Federal Information Processing Standards Publication 197, "Announcing the Advanced Encryption Standard," at <a href="http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf" target="_blank">http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf</a><br /><br />The IEEE 754-2008 floating-point format standard at <a href="http://en.wikipedia.org/wiki/IEEE_754-2008" target="_blank">http://en.wikipedia.org/wiki/IEEE_754-2008</a><br /><br />Floating-Point Support for 64-Bit Drivers at <a href="http://msdn.microsoft.com/en-us/library/ff545910.aspx" target="_blank">http://msdn.microsoft.com/en-us/library/ff545910.aspx</a><br /><br />Wikipedia's entry on the Mandelbrot set at <a href="http://en.wikipedia.org/wiki/Mandelbrot_set" target="_blank">http://en.wikipedia.org/wiki/Mandelbrot_set</a><br /><br />Intel® Software Development Emulator at <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator">http://software.intel.com/en-us/articles/intel-software-development-emulator</a><br /><br />The complete Mandelbrot Intel® AVX implementation for download at <a href="http://www.lomont.org" target="_blank">http://www.lomont.org</a><br /><br /><br />
<h2 class="sectionHeading">About the Author</h2>
Chris Lomont works as a research engineer at Cybernet Systems, working on projects as diverse as quantum computing algorithms, image processing for NASA, developing security hardware for United States Homeland Security, and computer forensics. Before that he obtained a PhD in math from Purdue and three bachelor's degrees in physics, math, and computer science, worked as a game programmer, and did brief stints in financial modeling, robotics, and various consulting roles. The rest of his time is spent hiking with his wife, watching movies, giving talks, recreational programming, doing math research, learning more physics, playing music, and performing various experiments. Visit his website www.lomont.org or his electronic gadget site <a href="http://www.hypnocube.com" target="_blank">www.hypnocube.com</a>.<br /><br /><br />
<h2 class="sectionHeading">Appendix A: Instruction Set Reference</h2>
Many instructions come in packed or scalar form, meaning that they work on multiple parallel elements or on a single element in the register, a distinction marked as <i>[P/S]</i>. Floating-point entries come in double or single precision (<i>doubles</i> and <i>singles</i>, for brevity), marked <i>[D/S]</i>; integer entries come in byte, word, doubleword, and quadword forms, marked <i>[B/W/D/Q]</i>. Integer forms also sometimes come in signed or unsigned variants, marked <i>[S/U]</i>. Some instructions work on the high or low portions of registers, marked as <i>[H/L]</i>; other optional components are listed in the tables. Instructions that come in both Intel® SSE and Intel® AVX forms are prefixed with a (<i>V</i>) for the Intel® AVX form, which allows three operands and 256-bit register support. Entries in square brackets (<i>[]</i>) are required; entries in parentheses (<i>()</i>) are optional.<br /><br />Examples:<br /> 
<ul>
<li><code>(V)ADD[P/S][D/S]</code> is the addition of packed or scalar, double or single, with eight possible forms: <code>VADDPD, VADDPS, VADDSD, VADDSS</code>, and versions without the leading <code>V</code>.</li>
<li><code>(V)[MIN/MAX][P/S][D/S]</code> represents 16 different instructions for a <code>min</code> or <code>max</code> of packed or scalar of double or single precision.</li>
</ul>
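As a minimal sketch of how this notation expands, the following C++ helper enumerates every mnemonic a pattern denotes. The function name <code>expand_mnemonics</code> is hypothetical and is not part of any Intel tool; it only assumes the bracket and parenthesis conventions described above.<br /><br />
<pre name="code" class="cpp">
#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Hypothetical helper: expand a pattern such as "(V)ADD[P/S][D/S]" into every
// mnemonic it denotes. "(x)" is an optional group; "[a/b]" is a required choice.
std::vector&lt;std::string&gt; expand_mnemonics(const std::string&amp; pattern) {
    std::vector&lt;std::string&gt; results(1);                 // start with one empty prefix
    std::size_t i = 0;
    while (i &lt; pattern.size()) {
        std::vector&lt;std::string&gt; choices;
        const char c = pattern[i];
        if (c == '(' || c == '[') {
            std::size_t end = pattern.find(c == '(' ? ')' : ']', i);
            if (end == std::string::npos) end = pattern.size();
            const std::string body = pattern.substr(i + 1, end - i - 1);
            std::size_t start = 0;
            for (;;) {                                   // split the group on '/'
                const std::size_t slash = body.find('/', start);
                choices.push_back(body.substr(start, slash - start));
                if (slash == std::string::npos) break;
                start = slash + 1;
            }
            if (c == '(') choices.push_back("");         // optional group may be absent
            i = end + 1;
        } else {
            choices.push_back(std::string(1, c));        // literal character
            ++i;
        }
        std::vector&lt;std::string&gt; next;                   // cross every prefix with every choice
        for (const std::string&amp; prefix : results)
            for (const std::string&amp; choice : choices)
                next.push_back(prefix + choice);
        results.swap(next);
    }
    return results;
}
</pre>
Calling <code>expand_mnemonics("(V)ADD[P/S][D/S]")</code> yields the eight forms listed in the first example, and the <code>MIN/MAX</code> pattern yields sixteen.<br /><br />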
The next table lists the comparison types. VEX-prefixed instructions have 32 comparison types; non-VEX-prefixed comparisons allow only the eight flavors shown in parentheses. Each comparison type comes in multiple flavors, where <code>O</code> = ordered, <code>U</code> = unordered, <code>S</code> = signaling, and <code>Q</code> = non-signaling (quiet). Ordered/unordered determines whether the comparison is false or true when at least one operand is NaN (<i>Not-a-Number</i> in floating point, which results from an invalid operation such as dividing 0 by 0 or taking the square root of a negative number). Signaling/non-signaling states whether an exception is raised when at least one operand is QNaN (<i>Quiet Not-a-Number</i>), which is useful for error trapping.<br /><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Type</td>
<td >Flavors</td>
<td >Meaning</td>
</tr>
<tr >
<td ><code>EQ</code></td>
<td ><code>(OQ), UQ, OS, US</code></td>
<td >Equal</td>
</tr>
<tr >
<td ><code>LT </code></td>
<td ><code>(OS), OQ</code></td>
<td >Less than</td>
</tr>
<tr >
<td ><code>LE </code></td>
<td ><code>(OS), OQ</code></td>
<td >Less than or equal to</td>
</tr>
<tr >
<td ><code>UNORD </code></td>
<td ><code>(Q), S</code></td>
<td >Tests for unordered (NaN)</td>
</tr>
<tr >
<td ><code>NEQ </code></td>
<td ><code>(UQ), US, OQ, OS</code></td>
<td >Not equal</td>
</tr>
<tr >
<td ><code>NLT </code></td>
<td ><code>(US), UQ</code></td>
<td >Not less than</td>
</tr>
<tr >
<td ><code>NLE </code></td>
<td ><code>(US), UQ</code></td>
<td >Not less than or equal to</td>
</tr>
<tr >
<td ><code>ORD </code></td>
<td ><code>(Q), S</code></td>
<td >Tests for ordered (not NaN)</td>
</tr>
<tr >
<td ><code>NGE </code></td>
<td ><code>US, UQ</code></td>
<td >Not greater than or equal to</td>
</tr>
<tr >
<td ><code>NGT </code></td>
<td ><code>US, UQ</code></td>
<td >Not greater than</td>
</tr>
<tr >
<td ><code>FALSE </code></td>
<td ><code>OQ, OS</code></td>
<td >Comparison is always false</td>
</tr>
<tr >
<td ><code>GE </code></td>
<td ><code>OS, OQ</code></td>
<td >Greater than or equal to</td>
</tr>
<tr >
<td ><code>GT </code></td>
<td ><code>OS, OQ</code></td>
<td >Greater than</td>
</tr>
<tr >
<td ><code>TRUE</code></td>
<td ><code>UQ, US</code></td>
<td >Comparison is always true</td>
</tr>
</tbody>
</table>
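<br />The ordered/unordered distinction can be illustrated with a scalar model. The sketch below is not the hardware instruction: the function names are hypothetical, and the signaling behavior (whether an exception is raised) is deliberately not modeled. It only shows how the result differs when an operand is NaN.<br /><br />
<pre name="code" class="cpp">
#include &lt;cmath&gt;

// Scalar model of a few comparison flavors (names are illustrative only).
// Ordered flavors are false if either operand is NaN;
// unordered flavors are true if either operand is NaN.
bool is_unordered(float a, float b) { return std::isnan(a) || std::isnan(b); }

bool cmp_eq_oq(float a, float b)  { return !is_unordered(a, b) &amp;&amp; a == b; }
bool cmp_neq_uq(float a, float b) { return  is_unordered(a, b) || a != b; }
bool cmp_lt_oq(float a, float b)  { return !is_unordered(a, b) &amp;&amp; a &lt; b; }
bool cmp_ge_oq(float a, float b)  { return !is_unordered(a, b) &amp;&amp; a &gt;= b; }

// "Not less than" is unordered, so it differs from "greater than or equal"
// exactly when a NaN is involved.
bool cmp_nlt_uq(float a, float b) { return  is_unordered(a, b) || !(a &lt; b); }
</pre>
With a NaN operand, <code>cmp_nlt_uq</code> returns true while <code>cmp_ge_oq</code> returns false, which is why the table lists <code>NLT</code> and <code>GE</code> as separate types.<br />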
<br />Finally, here are all the Intel® AVX instructions:<br /><br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Arithmetic</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)[ADD/SUB/MUL/DIV][P/S][D/S]</code></td>
<td >Add/subtract/multiply/divide packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)ADDSUBP[D/S]</code></td>
<td >Packed double/single add and subtract alternating indices</td>
</tr>
<tr >
<td ><code>(V)DPP[D/S]</code></td>
<td >Dot product, based on immediate mask</td>
</tr>
<tr >
<td ><code>(V)HADDP[D/S]</code></td>
<td >Horizontally add</td>
</tr>
<tr >
<td ><code>(V)[MIN/MAX][P/S][D/S]</code></td>
<td >Min/max packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)MOVMSKP[D/S]</code></td>
<td >Extract double/single sign mask</td>
</tr>
<tr >
<td ><code>(V)PMOVMSKB</code></td>
<td >Make a mask consisting of the most significant bits</td>
</tr>
<tr >
<td ><code>(V)MPSADBW</code></td>
<td >Multiple sum of absolute differences</td>
</tr>
<tr >
<td ><code>(V)PABS[B/W/D]</code></td>
<td >Packed absolute value on bytes/words/doublewords</td>
</tr>
<tr >
<td ><code>(V)P[ADD/SUB][B/W/D/Q]</code></td>
<td >Add/subtract packed bytes/words/doublewords/quadwords</td>
</tr>
<tr >
<td ><code>(V)PADD[S/U]S[B/W]</code></td>
<td >Add packed signed/unsigned with saturation bytes/words</td>
</tr>
<tr >
<td ><code>(V)PAVG[B/W]</code></td>
<td >Average packed bytes/words</td>
</tr>
<tr >
<td ><code>(V)PCLMULQDQ</code></td>
<td >Carry-less multiplication quadword</td>
</tr>
<tr >
<td ><code>(V)PH[ADD/SUB][W/D]</code></td>
<td >Packed horizontal add/subtract word/doubleword</td>
</tr>
<tr >
<td ><code>(V)PH[ADD/SUB]SW</code></td>
<td >Packed horizontal add/subtract with saturation</td>
</tr>
<tr >
<td ><code>(V)PHMINPOSUW</code></td>
<td >Min horizontal unsigned word and position</td>
</tr>
<tr >
<td ><code>(V)PMADDWD</code></td>
<td >Multiply and add packed integers</td>
</tr>
<tr >
<td ><code>(V)PMADDUBSW</code></td>
<td >Multiply unsigned bytes and signed bytes into signed words</td>
</tr>
<tr >
<td ><code>(V)P[MIN/MAX][S/U][B/W/D]</code></td>
<td >Min/max of packed signed/unsigned integers</td>
</tr>
<tr >
<td ><code>(V)PMUL[H/L][S/U]W</code></td>
<td >Multiply packed signed/unsigned integers and store high/low result</td>
</tr>
<tr >
<td ><code>(V)PMULHRSW</code></td>
<td >Multiply packed signed words with round and shift, store high result</td>
</tr>
<tr >
<td ><code>(V)PMULHW</code></td>
<td >Multiply packed integers and store high result</td>
</tr>
<tr >
<td ><code>(V)PMULL[W/D]</code></td>
<td >Multiply packed integers and store low result</td>
</tr>
<tr >
<td ><code>(V)PMUL(U)DQ</code></td>
<td >Multiply packed (un)signed doubleword integers and store quadwords</td>
</tr>
<tr >
<td ><code>(V)PSADBW</code></td>
<td >Compute sum of absolute differences of unsigned bytes</td>
</tr>
<tr >
<td ><code>(V)PSIGN[B/W/D]</code></td>
<td >Change the sign on each element in one operand based on the sign in the other operand</td>
</tr>
<tr >
<td ><code>(V)PS[L/R]LDQ</code></td>
<td >Byte shift left/right amount in operand</td>
</tr>
<tr >
<td ><code>(V)PS[LL/RA/RL][W/D/Q]</code></td>
<td >Bit shift left/arithmetic right/logical right</td>
</tr>
<tr >
<td ><code>(V)PSUB(U)S[B/W]</code></td>
<td >Packed (un)signed subtract with (un)signed saturation</td>
</tr>
<tr >
<td ><code>(V)RCP[P/S]S</code></td>
<td >Compute approximate reciprocal of packed/scalar single precision</td>
</tr>
<tr >
<td ><code>(V)RSQRT[P/S]S</code></td>
<td >Compute approximate reciprocal of square root of packed/scalar single precision</td>
</tr>
<tr >
<td ><code>(V)ROUND[P/S][D/S]</code></td>
<td >Round packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)SQRT[P/S][D/S]</code></td>
<td >Square root of packed/scalar double/single</td>
</tr>
<tr >
<td ><code>VZERO[ALL/UPPER]</code></td>
<td >Zero all/upper half of YMM registers</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Comparison</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)CMP[P/S][D/S]</code></td>
<td >Compare packed/scalar double/single</td>
</tr>
<tr >
<td ><code>(V)COMIS[D/S]</code></td>
<td >Compare scalar double/single, set EFLAGS</td>
</tr>
<tr >
<td ><code>(V)PCMP[EQ/GT][B/W/D/Q]</code></td>
<td >Compare packed integers for equality/greater than</td>
</tr>
<tr >
<td ><code>(V)PCMP[E/I]STR[I/M]</code></td>
<td >Compare explicit/implicit length strings, return index/mask</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Control</td>
<td >Description</td>
</tr>
<tr >
<td ><code>V[LD/ST]MXCSR</code></td>
<td >Load/store MXCSR control/status register</td>
</tr>
<tr >
<td ><code>XSAVEOPT</code></td>
<td >Save processor extended states optimized</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Conversion</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)CVTx2y</code></td>
<td >Convert type <i>x</i> to type <i>y</i>, where <i>x</i> and <i>y</i> are chosen from<br /><code>DQ and P[D/S],</code><br /><code>[P/S]S and [P/S]D, or</code><br /><code>S[D/S] and SI.</code></td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Load/store</td>
<td >Description</td>
</tr>
<tr >
<td ><code>VBROADCAST[SS/SD/F128]</code></td>
<td >Load with broadcast (loads single value into multiple locations)</td>
</tr>
<tr >
<td ><code>VEXTRACTF128</code></td>
<td >Extract 128-bit floating-point values</td>
</tr>
<tr >
<td ><code>(V)EXTRACTPS</code></td>
<td >Extract packed single precision</td>
</tr>
<tr >
<td ><code>VINSERTF128</code></td>
<td >Insert packed floating-point values</td>
</tr>
<tr >
<td ><code>(V)INSERTPS</code></td>
<td >Insert packed single-precision values</td>
</tr>
<tr >
<td ><code>(V)PINSR[B/W/D/Q]</code></td>
<td >Insert integer</td>
</tr>
<tr >
<td ><code>(V)LDDQU</code></td>
<td >Load unaligned double quadword integer</td>
</tr>
<tr >
<td ><code>(V)MASKMOVDQU</code></td>
<td >Store selected bytes of double quadword with <code>NT Hint</code></td>
</tr>
<tr >
<td ><code>VMASKMOVP[D/S]</code></td>
<td >Conditional SIMD packed load/store</td>
</tr>
<tr >
<td ><code>(V)MOV[A/U]P[D/S]</code></td>
<td >Move aligned/unaligned packed double/single</td>
</tr>
<tr >
<td ><code>(V)MOV[D/Q]</code></td>
<td >Move doubleword/quadword</td>
</tr>
<tr >
<td ><code>(V)MOVDQ[A/U]</code></td>
<td >Move double quadword aligned/unaligned</td>
</tr>
<tr >
<td ><code>(V)MOV[HL/LH]P[D/S]</code></td>
<td >Move high-to-low/low-to-high packed double/single</td>
</tr>
<tr >
<td ><code>(V)MOV[H/L]P[D/S]</code></td>
<td >Move high/low packed double/single</td>
</tr>
<tr >
<td ><code>(V)MOVNT[DQ/PD/PS]</code></td>
<td >Move packed integers/doubles/singles using a non-temporal hint</td>
</tr>
<tr >
<td ><code>(V)MOVNTDQA</code></td>
<td >Move packed integers using a non-temporal hint, aligned</td>
</tr>
<tr >
<td ><code>(V)MOVS[D/S]</code></td>
<td >Move or merge scalar double/single</td>
</tr>
<tr >
<td ><code>(V)MOVS[H/L]DUP</code></td>
<td >Move single odd/even indexed singles</td>
</tr>
<tr >
<td ><code>(V)PACK[U/S]S[WB/DW]</code></td>
<td >Pack words into bytes/doublewords into words with unsigned/signed saturation</td>
</tr>
<tr >
<td ><code>(V)PALIGNR</code></td>
<td >Byte align</td>
</tr>
<tr >
<td ><code>(V)PEXTR[B/W/D/Q]</code></td>
<td >Extract integer</td>
</tr>
<tr >
<td ><code>(V)PMOV[S/Z]X[B/W/D][W/D/Q]</code></td>
<td >Packed move with sign/zero extend (only up in length; <code>DD</code>, <code>DW</code>, etc. are disallowed)</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Logical</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)[AND/ANDN/OR]P[D/S]</code></td>
<td >Bitwise logical <code>AND/AND NOT/OR</code> of packed double/single values</td>
</tr>
<tr >
<td ><code>(V)PAND(N)</code></td>
<td >Logical <code>AND (NOT)</code></td>
</tr>
<tr >
<td ><code>(V)P[OR/XOR]</code></td>
<td >Bitwise logical <code>OR</code>/exclusive <code>OR</code></td>
</tr>
<tr >
<td ><code>(V)PTEST</code></td>
<td >Packed bit test, set zero flag if bitwise <code>AND</code> is all <code>0</code></td>
</tr>
<tr >
<td ><code>(V)UCOMIS[D/S]</code></td>
<td >Unordered compare scalar doubles/singles and set <code>EFLAGS</code></td>
</tr>
<tr >
<td ><code>(V)XORP[D/S]</code></td>
<td >Bitwise logical <code>XOR</code> of packed double/single</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Shuffle</td>
<td >Description</td>
</tr>
<tr >
<td ><code>(V)BLENDP[D/S]</code></td>
<td >Blend packed double/single; selects elements based on mask</td>
</tr>
<tr >
<td ><code>(V)BLENDVP[D/S]</code></td>
<td >Blend values</td>
</tr>
<tr >
<td ><code>(V)MOVDDUP</code></td>
<td >Copies even values to all values</td>
</tr>
<tr >
<td ><code>(V)PBLENDVB</code></td>
<td >Variable blend packed bytes</td>
</tr>
<tr >
<td ><code>(V)PBLENDW</code></td>
<td >Blend packed words</td>
</tr>
<tr >
<td ><code>VPERMILP[D/S]</code></td>
<td >Permute double/single values</td>
</tr>
<tr >
<td ><code>VPERM2F128</code></td>
<td >Permute floating-point values</td>
</tr>
<tr >
<td ><code>(V)PSHUF[B/D]</code></td>
<td >Shuffle packed bytes/doublewords based on immediate value</td>
</tr>
<tr >
<td ><code>(V)PSHUF[H/L]W</code></td>
<td >Shuffle packed high/low words</td>
</tr>
<tr >
<td ><code>(V)PUNPCK[H/L][BW/WD/DQ/QDQ]</code></td>
<td >Unpack high/low data</td>
</tr>
<tr >
<td ><code>(V)SHUFP[D/S]</code></td>
<td >Shuffle packed double/single</td>
</tr>
<tr >
<td ><code>(V)UNPCK[H/L]P[D/S]</code></td>
<td >Unpack and interleave packed/scalar doubles/singles</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >AES</td>
<td >Description</td>
</tr>
<tr >
<td ><code>AESENC/AESENCLAST</code></td>
<td >Perform one round of AES encryption</td>
</tr>
<tr >
<td ><code>AESDEC/AESDECLAST</code></td>
<td >Perform one round of AES decryption</td>
</tr>
<tr >
<td ><code>AESIMC</code></td>
<td >Perform the AES <code>InvMixColumn</code> transformation</td>
</tr>
<tr >
<td ><code>AESKEYGENASSIST</code></td>
<td >AES Round Key Generation Assist</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >Future Instructions</td>
<td >Description</td>
</tr>
<tr >
<td ><code>[RD/WR][F/G]SBASE</code></td>
<td >Read/write FS/GS register</td>
</tr>
<tr >
<td ><code>RDRAND</code></td>
<td >Read random number (into r16, r32, r64)</td>
</tr>
<tr >
<td ><code>VCVTPH2PS</code></td>
<td >Convert 16-bit floats to single precision floating-point values</td>
</tr>
<tr >
<td ><code>VCVTPS2PH</code></td>
<td >Convert single-precision values to 16-bit floating-point values</td>
</tr>
</tbody>
</table>
<br /> 
<table width="100%" cellspacing="0" cellpadding="0" border="0" class="tableformat1">
<tbody >
<tr >
<td >FMA</td>
<td >Each [z] is the string 132 or 213 or 231, giving the order the operands A,B,C are used in:<br />132 is A=AC+B<br />213 is A=AB+C<br />231 is A=BC+A</td>
</tr>
<tr >
<td ><code>VFMADD[z][P/S][D/S]</code></td>
<td >Fused multiply add A = r1 * r2 + r3 for packed/scalar of double/single</td>
</tr>
<tr >
<td ><code>VFMADDSUB[z]P[D/S]</code></td>
<td >Fused multiply alternating add/subtract of packed double/single A = r1 * r2 + r3 for odd index, A = r1 * r2-r3 for even</td>
</tr>
<tr >
<td ><code>VFMSUBADD[z]P[D/S]</code></td>
<td >Fused multiply alternating subtract/add of packed double/single A = r1 * r2-r3 for odd index, A = r1 * r2+r3 for even</td>
</tr>
<tr >
<td ><code>VFMSUB[z][P/S][D/S]</code></td>
<td >Fused multiply subtract A = r1 * r2-r3 of packed/scalar double/single</td>
</tr>
<tr >
<td ><code>VFNMADD[z][P/S][D/S]</code></td>
<td >Fused negative multiply add of packed/scalar double/single A = -r1 * r2+r3</td>
</tr>
<tr >
<td ><code>VFNMSUB[z][P/S][D/S]</code></td>
<td >Fused negative multiply subtract of packed/scalar double/single A = -r1 * r2-r3</td>
</tr>
</tbody>
</table>
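<br />The three operand orders can be modeled in scalar C++ with <code>std::fma</code>, which also performs the multiply and add with a single rounding. This is only a sketch of the ordering convention in the table; the function names are hypothetical, and <code>a</code> stands for the destination operand A.<br /><br />
<pre name="code" class="cpp">
#include &lt;cmath&gt;

// Scalar model of the [z] operand orders; a is the destination operand A.
double fmadd132(double a, double b, double c) { return std::fma(a, c, b); } // A = A*C + B
double fmadd213(double a, double b, double c) { return std::fma(a, b, c); } // A = A*B + C
double fmadd231(double a, double b, double c) { return std::fma(b, c, a); } // A = B*C + A
</pre>
For example, with A = 2, B = 3, and C = 4 the three forms give 11, 10, and 14 respectively.<br />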
<div  id="vc-meta">
<div id="vc-meta-author">
<div>Chris Lomont</div>
</div>
<div id="vc-meta-pubdate">06-21-2011</div>
<div id="vc-meta-modificationdate">06-21-2011</div>
<div id="vc-meta-taxonomy">Tech Articles</div>
<div id="vc-meta-category-product"></div>
<div id="vc-meta-category">
<div>Intel® SSE</div>
<div>Intel® AVX</div>
</div>
<div id="vc-meta-thumb"></div>
<div id="vc-meta-thumb-tout"></div>
<div id="vc-meta-thumb-hero"></div>
<div id="vc-meta-tocenable">no</div>
<div id="vc-meta-abstract">Intel® AVX improves upon Intel® SSE by adding 256-bit registers and additional instructions. This article shows how to detect Intel AVX, and deconstructs Intel AVX function names and explains Intel AVX intrinsics. It showcases a fractal calculation that achieves nearly 2x performance speedup relative to the Intel SSE version.</div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/</link>
      <pubDate>Tue, 21 Jun 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/</guid>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
    </item>
    <item>
      <title>Using Intel® Advanced Vector Extensions to Implement an Inverse Discrete Cosine Transform</title>
      <description><![CDATA[ <h1 class="sectionHeading">Download Article</h1>
Download <a href="http://software.intel.com/file/30921">Using Intel® Advanced Vector Extensions to Implement an Inverse Discrete Cosine Transform</a> [PDF 335KB]<br />Download <a href="http://software.intel.com/file/29048">IDCT source code</a> [Zip 22KB]<br /><br />
<h1 class="sectionHeading">Introduction</h1>
Transform coding is an important step of image and video processing applications. Pixels in an image have a level of correlation with their neighboring pixels. Adjacent pixels in successive frames show a very high correlation. These correlations can be used to predict the value of a pixel from its neighbor. The highly correlated spatial data is transformed into uncorrelated coefficients in the frequency domain. The transformed data, represented as coefficients, are independent, allowing them to be manipulated separately. The human eye is more perceptive to low-frequency changes than high-frequency changes. Encoders can achieve compression after setting the high-frequency coefficients to zero. An inverse transform is used in the decoder pipeline to reconstruct the source data.<br /><br /> Figure 1 and Figure 2 depict the components of a typical image encoder/decoder system.<br /><br />
<p ><img src="http://software.intel.com/file/28098" /></p>
<br />
<div ><b>Figure 1 - Source Encoder Block Diagram</b><br /></div>
<br />
<p ><img src="http://software.intel.com/file/28099" /></p>
<br />
<div ><b>Figure 2 - Source Decoder Block Diagram</b><br /></div>
<br /> The Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) are widely used in the encoder and decoder pipelines of image processing systems such as MPEG, JPEG, and other standards. A fast and accurate IDCT transformation is crucial for the best user experience.<br /><br /> Intel® Advanced Vector Extensions (Intel® AVX), introduced with the new Intel® processor microarchitecture codenamed Sandy Bridge, extends the capabilities of Intel® Streaming SIMD Extensions (Intel® SSE) for floating point data and operations such as Inverse Discrete Cosine Transforms. Intel® AVX essentially doubles the width of the current XMM registers and adds new extensions that can operate on the wider data width. Intel® AVX significantly increases the floating-point performance density with improved power efficiency over previous 128-bit SIMD instruction set extensions. This document specifically examines how Intel® AVX and Sandy Bridge microarchitecture features such as wider 256-bit SIMD registers, non-destructive source operands, new data manipulation and arithmetic primitives, dual 128-bit load ports, and doubling of computational execution width can benefit the Inverse Discrete Cosine Transform (IDCT) operation. Intel® AVX improves the computational throughput of high performance precision-sensitive floating point transform applications. Integer implementations may also benefit from features such as non-destructive source operands. <br /><br /><br /> Intel also provides a set of Intel® AVX software development tools like the Intel® AVX-enabled Intel® C++ Compiler, the Intel® Software Development Emulator (SDE), and the Intel® Architecture Code Analyzer.  All of these tools were effectively used during the development of this kernel, and can be downloaded from the Intel® AVX website: <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a><br /><br />
<h1 class="sectionHeading">Testing Environment</h1>
The performance speedups stated in this paper are based on runs on actual Sandy Bridge microarchitecture-based pre-production silicon.  It can also be assumed that the test data is already in the first level processor cache prior to the computation of the IDCT algorithm.  Performance comparisons are made based on the relative performance of Intel® AVX versus corresponding Intel® SSE implementations using C intrinsic instructions, both run on the Sandy Bridge microarchitecture-based silicon. The code was compiled using the 64-bit Intel® C++ Compiler, version 11.1.038. The applications were compiled using the following command line options:<br /><br /> 
<ul>
<li>Intel® SSE: /QxSSE4.1 /O3</li>
<li>Intel® AVX: /QxAVX /O3</li>
</ul>
A comparison of short integer performance was made by compiling the Intel<sup>®</sup> SSE implementation for different architectures. The compiler generates VEX-encoded Intel<sup>®</sup> SSE instructions when the /QxAVX switch is specified, and Intel<sup>®</sup> SSE 4.1 instructions when the /QxSSE4.1 switch is specified. Non-destructive source operands are introduced with the VEX-encoded instructions. <br /><br /> The test application follows the 8x8 IDCT accuracy requirements as documented in section 3 of IEEE Standard 1180-1990 (Reference 5).  Randomly generated input data is processed by a reference DCT. The output of the DCT is rounded to the nearest integer and clipped such that -2048 &lt;= value &lt;= 2047. The clipped values are inputs to both the reference IDCT and the IDCT under test. Their outputs are rounded to the nearest integer and clipped such that -256 &lt;= output &lt;= 255. The two results are then compared and accuracy measurements are made.<br /><br /> In this application, the references are implemented using double precision floating point scalar C code. Several proposed IDCT implementations are tested: a short integer Intel® SSE version, a single precision floating point Intel® SSE version, and a single precision floating point Intel® AVX version. All the vector versions are implemented with C intrinsics.<br /><br />
<p ><img src="http://software.intel.com/file/28086" /></p>
<br />
<div ><b>Figure 3 - Test Application</b><br /></div>
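<br /> The round-and-clip steps of the test procedure can be sketched as follows. This is not the test application's code: the function names are hypothetical, the bounds are assumed to be inclusive (-2048 to 2047 for DCT coefficients, -256 to 255 for IDCT outputs), and <code>std::lround</code> is assumed to be an acceptable rounding rule.<br /><br />
<pre name="code" class="cpp">
#include &lt;cmath&gt;

// Round to the nearest integer, then clamp to the inclusive range [lo, hi].
int round_and_clip(double value, int lo, int hi) {
    const long rounded = std::lround(value);   // halfway cases round away from zero
    if (rounded &lt; lo) return lo;
    if (rounded &gt; hi) return hi;
    return static_cast&lt;int&gt;(rounded);
}

// Assumed range for reference DCT output fed to the IDCTs.
int clip_dct_coefficient(double v) { return round_and_clip(v, -2048, 2047); }

// Assumed range for IDCT output before the accuracy comparison.
int clip_idct_output(double v) { return round_and_clip(v, -256, 255); }
</pre>
Both the reference IDCT and the IDCT under test would pass their results through the same output clipping before the comparison, so that saturation differences do not count as accuracy errors.<br />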
<br />
<h1 class="sectionHeading">Discrete Cosine Transforms</h1>
The Discrete Cosine Transform (DCT) converts the spatial data of an image into the frequency domain. The mathematical operations are described in detail in Reference [1].<br /><br /> The two-dimensional (2D) DCT transforms 64 pixel values, an 8x8 block of pixels, producing 64 coefficients. The image's pixel values are transformed into coefficients derived from amplitudes of cosine basis functions.<br /><br />
<p ><img src="http://software.intel.com/file/28087" /></p>
<br /><br /> Where:<br /><br />
<p ><img src="http://software.intel.com/file/28088" /></p>
<br /> Reference [2] summarizes the number of mathematical operations; the numbers are worth repeating.  Comparing the DCT and IDCT equations shows that the two transforms require the same number of operations.  An analysis of Equation 2 shows that the 2D DCT requires 64 multiplications and 63 additions for each coefficient. Therefore 4096 multiplications and 4032 additions are required to transform an 8x8 block. Replacing the 2D DCT with eight 1D DCTs for the 8 rows and eight 1D DCTs for the 8 columns reduces the number of operations. Equation 1 shows that each 1D DCT requires 64 multiplications and 56 additions to produce 8 coefficients. Transforming the 8x8 block with sixteen 1D DCTs therefore requires 1024 multiplications and 896 additions. References [1] and [4] discuss other DCT and IDCT algorithms.<br /><br /><br />
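The arithmetic behind these counts can be checked with a few lines of C++. The type and function names below are hypothetical and serve only to reproduce the operation counts for an n x n block: a direct 2D transform sums n*n products per coefficient, while the row-column approach runs 2n one-dimensional transforms.<br /><br />
<pre name="code" class="cpp">
struct OpCount {
    int multiplications;
    int additions;
};

// Direct 2D transform: each of the n*n coefficients is a sum of n*n products.
constexpr OpCount direct_2d(int n) {
    return { (n * n) * (n * n), (n * n) * (n * n - 1) };
}

// Row-column method: 2n one-dimensional transforms, each producing
// n coefficients, each coefficient a sum of n products.
constexpr OpCount row_column(int n) {
    return { 2 * n * (n * n), 2 * n * (n * (n - 1)) };
}
</pre>
For n = 8, <code>direct_2d</code> gives 4096 multiplications and 64 x 63 = 4032 additions, and <code>row_column</code> gives 1024 multiplications and 896 additions.<br /><br />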
<h1 class="sectionHeading">Previous Work and Specifications</h1>
The following previous works and specifications should be consulted for background information:<br /> 
<ul>
<li>Reference [1] for the JPEG specification.</li>
<li>The IDCT algorithm used in the Intel® AVX implementation discussed in this whitepaper follows the optimized algorithm discussed in Reference [2].</li>
<li>Reference [3] provides additional background information.</li>
</ul>
<br />
<h1 class="sectionHeading">Intel® AVX Implementation of IDCT</h1>
As in the previous works, the algorithm in this whitepaper performs eight 1D IDCT transforms on the rows of an 8x8 block. Eight 1D IDCT transforms are then performed on the columns of those results. Two different 1D IDCT transforms are used, and neither transform requires a transpose.<br /><br /> The AVX implementation operates on two rows of data simultaneously. It begins by loading four floats from each row with separate 128-bit loads; the reason for performing 128-bit loads will become clearer in a moment. It is important to remember that the 128-bit XMM registers overlay the lower 128 bits of the corresponding 256-bit YMM registers. The upper 128 bits of the YMM register are loaded with the four floats from the second row via the _mm256_insertf128_ps instruction.  The resulting YMM register contains four floats from one row in the lower 128 bits and the corresponding four floats from the second row in the upper 128 bits. This is known as a strided load, because the next load reads from an area of memory that is more than a unit stride away from the current memory location. This seems like a lot of work to load eight floats, so why take this approach?<br /><br /> The strided load is effective in this application for several reasons.<br /><br /> 
<ul>
<li>Eight products need to be summed in the IDCT, and it is more straightforward to sum the products if they are in the same 128-bit lane of two YMM registers. Two registers are built with the contents from one row in the lower 128 bits of both registers, and the contents from the second row in the upper 128 bits of both registers. This is depicted in Figure 4 through Figure 6.</li>
<li>Another benefit of the strided load approach is that data movement, or placement of the 128-bit memory operand into the upper 128 bits, can take place on a different execution port than the shuffle port. This reduces the utilization of the port that performs shuffles, and allocates the work more evenly across the execution ports.</li>
<li>A final benefit is that placing floats 4-7 of two rows into one register reduces the number of shuffles needed near the end of row processing.</li>
</ul>
<p ><img src="http://software.intel.com/file/28089" /></p>
<br />
<div ><b>Figure 4 - Loading Four Floats from Row 0</b><br /></div>
<br /> The combination of the <span ><i>_mm_load_ps</i></span> and <span ><i>_mm256_castps128_ps256</i></span> instructions loads 128-bits into the least significant 128-bits of a ymm register.<br /><br /> Figure 5 depicts the strided load of Row 4 data, and the insertion into the most significant 128-bits of ymm0.<br /><br />
<p ><img src="http://software.intel.com/file/28090" /></p>
<br />
<div ><b>Figure 5 - Strided Load of Second Row</b><br /></div>
<br /> A second YMM register is populated in a similar manner; it contains the second set of four floats (floats 4-7) from both rows.  The results are shown in Figure 6.<br /><br />
<p ><img src="http://software.intel.com/file/28091" /></p>
<br />
<div ><b>Figure 6 - Second Set of Floats from Two Rows</b><br /></div>
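<br /> The loads depicted in Figures 4 through 6 can be sketched with intrinsics as follows. This is an illustration, not the downloadable IDCT source: the function name, the row-major 8x8 <code>block</code> layout, and the output arrays are assumptions, and the <code>AVX_FN</code> macro is only a convenience so that GCC or Clang can compile the function without a global AVX switch.<br /><br />
<pre name="code" class="cpp">
#include &lt;immintrin.h&gt;

// Assumption: let GCC/Clang build this function for AVX even if the rest of
// the file is not compiled with an AVX switch. Other compilers ignore it.
#if defined(__GNUC__) &amp;&amp; !defined(__AVX__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// block: 8x8 floats, row-major, 16-byte aligned (required by _mm_load_ps).
// first4 receives [row_a floats 0-3 | row_b floats 0-3],
// next4  receives [row_a floats 4-7 | row_b floats 4-7].
AVX_FN void load_row_pair(const float* block, int row_a, int row_b,
                          float* first4, float* next4) {
    const float* a = block + 8 * row_a;
    const float* b = block + 8 * row_b;

    // 128-bit load into the lower lane, then insert the other row above it.
    __m256 lo = _mm256_castps128_ps256(_mm_load_ps(a));
    lo = _mm256_insertf128_ps(lo, _mm_load_ps(b), 1);

    // Same pattern for the second set of four floats from both rows.
    __m256 hi = _mm256_castps128_ps256(_mm_load_ps(a + 4));
    hi = _mm256_insertf128_ps(hi, _mm_load_ps(b + 4), 1);

    _mm256_storeu_ps(first4, lo);   // stored only so the lane layout can be inspected
    _mm256_storeu_ps(next4, hi);
}
</pre>
Calling <code>load_row_pair(block, 0, 4, first4, next4)</code> pairs row 0 with row 4, matching the pairing shown in the figures. In a real kernel the two YMM values would stay in registers rather than being stored.<br />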
<br /> The input data must be multiplied by different cosine terms.  A single float from each row is broadcast via <span ><i>_mm256_shuffle_ps</i></span> to prepare for the multiplication, as shown in Figure 7. In this case float 0 is duplicated.<br /><br />
<p ><img src="http://software.intel.com/file/28092" /></p>
<br />
<div ><b>Figure 7 - Preparing for Cosine Multiplication</b><br /></div>
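<br /> A minimal sketch of the broadcast step, assuming an 8-float input holding one lane per row and a hypothetical array of eight cosine terms. An immediate of 0x00 makes <code>_mm256_shuffle_ps</code> select element 0 of each 128-bit lane four times; the function name and the <code>AVX_FN</code> macro are illustrative assumptions.<br /><br />
<pre name="code" class="cpp">
#include &lt;immintrin.h&gt;

// Assumption: let GCC/Clang build this function for AVX even if the rest of
// the file is not compiled with an AVX switch. Other compilers ignore it.
#if defined(__GNUC__) &amp;&amp; !defined(__AVX__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// in8:    [row_a floats | row_b floats], one row per 128-bit lane
// coeff8: eight cosine terms (hypothetical values supplied by the caller)
// out8:   float 0 of each lane, duplicated across that lane, times coeff8
AVX_FN void broadcast0_times(const float* in8, const float* coeff8, float* out8) {
    const __m256 v = _mm256_loadu_ps(in8);
    // Immediate 0x00: every output position picks element 0 of its own lane.
    const __m256 dup = _mm256_shuffle_ps(v, v, 0x00);
    const __m256 product = _mm256_mul_ps(dup, _mm256_loadu_ps(coeff8));
    _mm256_storeu_ps(out8, product);
}
</pre>
The shuffle works within each 128-bit lane independently, which is why each row's float 0 stays in its own half of the register.<br />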
<br /> Two <span ><i>_mm256_mul_ps</i></span> instructions perform the necessary cosine multiplications.  This completes one of the seven sums of product terms.<br /><br /> This sequence of instructions is repeated three times, operating on a different pair of input values and cosine terms each time. At this point there are four registers containing sums of products. One register contains sums of products for floats 0 and 2, another contains sums of products for floats 1 and 3, 4 and 6, and finally 5 and 7.<br /><br /> Three more sums are necessary to complete the sum of products. The sums of products of all even terms are added, and the sums of products of all odd terms are added. That is only two of the three sums.<br /><br />
<pre name="code" class="cpp">ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br /></pre>
The seventh and final sum produces the least significant four floats of the output. A subtraction is necessary to produce the most significant four floats; however, they are not in the correct order. <br /><br />
<pre name="code" class="cpp">ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /></pre>
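The way the four partial sums collapse into one sum and one difference can be sketched in scalar C++. This is a simplified model, not the article's implementation: <span ><i>Ymm</i></span>, <span ><i>SumDiff</i></span>, <span ><i>addPs</i></span>, <span ><i>subPs</i></span>, and <span ><i>combinePartialSums</i></span> are hypothetical names, and the element-wise loops stand in for <span ><i>_mm256_add_ps</i></span> and <span ><i>_mm256_sub_ps</i></span>.<br /><br />
<pre name="code" class="cpp">
//Hypothetical scalar model of a 256-bit register holding eight floats
struct Ymm { float f[8]; };

//Hypothetical result pair: the sums hold the least significant four outputs
//of each row, the differences hold the most significant four in reversed order
struct SumDiff { Ymm sum; Ymm diff; };

//Element-wise add, standing in for _mm256_add_ps
Ymm addPs(Ymm a, Ymm b) {
    Ymm r;
    for (int i = 0; i != 8; ++i) r.f[i] = a.f[i] + b.f[i];
    return r;
}

//Element-wise subtract, standing in for _mm256_sub_ps
Ymm subPs(Ymm a, Ymm b) {
    Ymm r;
    for (int i = 0; i != 8; ++i) r.f[i] = a.f[i] - b.f[i];
    return r;
}

//Combine the four partial sums of products (sums 1-4 of 7)
SumDiff combinePartialSums(Ymm s02, Ymm s13, Ymm s46, Ymm s57) {
    Ymm even = addPs(s02, s46);   //sum 5 of 7: all even terms
    Ymm odd = addPs(s13, s57);    //sum 6 of 7: all odd terms
    SumDiff out;
    out.sum = addPs(even, odd);   //sum 7 of 7
    out.diff = subPs(even, odd);  //the single subtraction
    return out;
}
</pre>
Counting the four partial sums passed in, the function performs the fifth, sixth, and seventh additions described above, plus the one subtraction.<br /><br />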
The order of the most significant floats of the two rows can be corrected using a single <span ><i>_mm256_shuffle_ps</i></span> instruction, as shown in Figure 8. The benefit of placing four floats from each of two input rows in the same register is now more apparent: one shuffle reorders the results for both rows.<br /><br />
<p ><img src="http://software.intel.com/file/28093" /></p>
<br />
<div ><b>Figure 8 - Reversing the Order of Differences</b><br /></div>
<br /> The final step for the 1D IDCT for these two rows is to recombine the results for each row into the same 256-bit register using two <span ><i>_mm256_permute2f128_ps</i></span> instructions, as shown in Figures 9 and 10.<br /><br />
<p ><img src="http://software.intel.com/file/28094" /></p>
<br />
<div ><b>Figure 9 - Recombining Row 0</b><br /></div>
<br />
<p ><img src="http://software.intel.com/file/28095" /></p>
<br />
<div ><b>Figure 10 - Recombining Row 4</b><br /></div>
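<br /> The reversal and recombination steps can be modelled in scalar C++ to show which lane ends up where. This is a sketch under stated assumptions: <span ><i>Ymm</i></span>, <span ><i>reverseWithinLanes</i></span>, and <span ><i>permute2f128</i></span> are hypothetical names, and the permute model covers only selectors whose zeroing bits are clear, such as the 0x20 and 0x31 values used in the source listing.<br /><br />
<pre name="code" class="cpp">
//Hypothetical scalar model of a 256-bit register holding eight floats
struct Ymm { float f[8]; };

//Scalar model of _mm256_shuffle_ps(v, v, 0x1b):
//reverse the four floats inside each 128-bit lane
Ymm reverseWithinLanes(Ymm v) {
    Ymm r;
    for (int lane = 0; lane != 8; lane += 4) {
        for (int i = 0; i != 4; ++i) {
            r.f[lane + i] = v.f[lane + 3 - i];
        }
    }
    return r;
}

//Scalar model of _mm256_permute2f128_ps(a, b, imm8), assuming the
//zeroing bits (bits 3 and 7) of imm8 are clear
Ymm permute2f128(Ymm a, Ymm b, unsigned imm8) {
    //Lane numbering: 0 = a low, 1 = a high, 2 = b low, 3 = b high
    const float* lanes[4] = { a.f, a.f + 4, b.f, b.f + 4 };
    const float* low = lanes[imm8 % 4];          //bits 1:0 pick the low lane
    const float* high = lanes[(imm8 / 16) % 4];  //bits 5:4 pick the high lane
    Ymm r;
    for (int i = 0; i != 4; ++i) {
        r.f[i] = low[i];
        r.f[i + 4] = high[i];
    }
    return r;
}
</pre>
In this model, 0x20 pairs the low lane of the sums with the low lane of the reversed differences, which rebuilds the first row of the pair, and 0x31 pairs the two high lanes, which rebuilds the second row.<br />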
<br /> That completes the 1D IDCT for one pair of rows. The sequence is repeated for rows 1 and 7, 2 and 6, and finally 3 and 5 to complete the 1D IDCT for all eight rows.<br /><br /> Now that the 1D IDCT of the eight rows is complete, the 1D IDCT of the eight columns can begin. The algorithm is similar, yet simpler, because the sums of products are already in the proper position for the addition. The 1D IDCT of the columns does not require any shuffles.<br /><br /> It was possible to eliminate four add instructions in the floating point IDCT of the columns. The straightforward form multiplies a row by the tangent term and then adds the row back:<br /><br />
<pre name="code" class="cpp">//row5*tangent + row5<br />r_ymm0 = _mm256_mul_ps(row5, tangent);<br />r_ymm0 = _mm256_add_ps(row5, r_ymm0);<br /></pre>
Because row*tangent + row equals row*(tangent + 1), a single multiply of the row by a precomputed tangent-plus-1 term produces the same results, up to floating point rounding, in the floating point implementation.<br /><br />
<pre name="code" class="cpp">tangent_p1 = tan(x) + 1<br /></pre>
<pre name="code" class="cpp">//row5*(tangent+1)<br />r_ymm0 = _mm256_mul_ps(row5, tangent_p1);<br />//row3*(tangent+1)<br />r_ymm1 = _mm256_mul_ps(row3, tangent_p1);<br /></pre>
This optimization could not be applied to the short integer implementation due to the scaling of numbers in the short algorithm.<br /><br /><br />
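The equivalence of the two forms can be checked with a scalar model. This is a minimal sketch, not the article's code: <span ><i>Ymm</i></span>, <span ><i>mulThenAdd</i></span>, and <span ><i>mulByTangentP1</i></span> are hypothetical names, and the tangent is passed as a single float rather than as a register with the constant replicated in every element.<br /><br />
<pre name="code" class="cpp">
//Hypothetical scalar model of a 256-bit register holding eight floats
struct Ymm { float f[8]; };

//Original form: a multiply followed by an add, row*tangent + row
Ymm mulThenAdd(Ymm row, float tangent) {
    Ymm r;
    for (int i = 0; i != 8; ++i) r.f[i] = row.f[i] * tangent + row.f[i];
    return r;
}

//Optimized form: one multiply by the precomputed constant tangent + 1
Ymm mulByTangentP1(Ymm row, float tangentP1) {
    Ymm r;
    for (int i = 0; i != 8; ++i) r.f[i] = row.f[i] * tangentP1;
    return r;
}
</pre>
For inputs that are exactly representable, such as a row of small integers with a tangent of 0.5, the two functions return the same values. For an arbitrary tangent the two forms round at different points, so the results should be expected to agree only to within floating point rounding.<br /><br /><br />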
<h1 class="sectionHeading">Using Intel® Architecture Code Analyzer</h1>
The Intel® Architecture Code Analyzer is an excellent tool that identifies the critical path in a basic block, execution port utilization, and instruction usage alternatives. There is much more Intel® Architecture Code Analyzer information available at the <a href="http://software.intel.com/en-us/avx/">Intel® AVX developer site</a>.<br /><br /> The basic block analysis in Figure 11 shows that the vshufps (invoked via the <span ><i>_mm256_shuffle_ps</i></span> intrinsic) executes on port 5. The analysis also shows that the vinsertf128 instruction (<span ><i>_mm256_insertf128_ps</i></span> intrinsic) offers the programmer choices for port execution. The "X" in the Intel® Architecture Code Analyzer report indicates that an instruction can be executed on the port, but in this particular algorithm it was executed elsewhere. In this case, the vinsertf128 was executed on port 0 and port 3 rather than port 3 and port 5 because the second source operand was loaded from memory. The insert takes place on port 0, thus reducing the pressure on port 5.<br /><br />
<p ><img src="http://software.intel.com/file/28096" /></p>
<br />
<div ><b>Figure 11 - Basic Block Analysis</b><br /></div>
<br /> The Intel® Architecture Code Analyzer is an effective optimization tool for software developers who want to fine tune and improve their application's performance.<br /><br /><br />
<h1 class="sectionHeading">Results</h1>
The 128-bit code was compiled for the Intel® microarchitecture codenamed Nehalem to generate Intel® SSE code, and executed on Sandy Bridge microarchitecture-based silicon.  The corresponding 256-bit Intel® AVX-enabled code was compiled for the Sandy Bridge microarchitecture and executed on Sandy Bridge microarchitecture-based silicon. Data was aligned on 16-byte boundaries for the Intel® SSE code and 32-byte boundaries for the Intel® AVX code. Both applications were compiled using the 64-bit version of the Intel® C++ Compiler Professional Edition, version 11.1.038.  The speedups listed here are for the 256-bit code relative to the 128-bit code.<br /><br /> <b>Table 1 - Performance Results</b><br /><br /> 
<table class="tableFormat1" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody >
<tr >
<td  width="33%"><b>Algorithm</b></td>
<td  width="33%"><b>Speedup</b></td>
<td  width="33%"><b>Parameters</b></td>
</tr>
<tr >
<td >Intel® AVX floating point implementation<br /></td>
<td >0.94x</td>
<td >Compared to Intel® SSE short integer implementation</td>
</tr>
<tr >
<td >Intel® SSE short integer implementation compiled with /QxAVX<br /></td>
<td >1.07x</td>
<td >Compared to Intel® SSE short integer implementation compiled with /QxSSE4.1<br /></td>
</tr>
<tr >
<td >Intel® AVX floating point implementation<br /></td>
<td >1.78x</td>
<td >Compared to Intel® SSE floating point implementation</td>
</tr>
</tbody>
</table>
<br /> Table 2 summarizes the overall mean error results for each implementation.<br /><br /> <b>Table 2 - Overall Mean Error Results</b><br /><br /> 
<table class="tableFormat1" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody >
<tr >
<td  width="25%"><b>Input range L, H (Reference 5)</b></td>
<td  width="25%"><b>Intel® SSE Short</b></td>
<td  width="25%"><b>Intel® SSE Single Precision Floating Point</b></td>
<td  width="25%"><b>Intel® AVX Single Precision Floating Point</b></td>
</tr>
<tr >
<td >256, 255</td>
<td >3.44e-5</td>
<td >6.25e-6</td>
<td >6.25e-6</td>
</tr>
<tr >
<td >5, 5</td>
<td >2.58e-4</td>
<td >1.56e-6</td>
<td >1.56e-6</td>
</tr>
<tr >
<td >300, 300</td>
<td >4.69e-6</td>
<td >6.25e-6</td>
<td >6.25e-6</td>
</tr>
<tr >
<td >-255, 256</td>
<td >7.53e-4</td>
<td >3.13e-6</td>
<td >3.13e-6</td>
</tr>
<tr >
<td >-5, 5</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
<tr >
<td >-300, 300</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
<tr >
<td >Input = 0</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
</tbody>
</table>
<br /> Both the short integer and single precision floating point implementations meet the error requirements of IEEE Std 1180-1990 (Reference 5). The floating point implementations produce lower overall mean error in many of the tests: roughly 5x, 165x, and 240x lower than the short integer implementation for the (256, 255), (5, 5), and (-255, 256) tests, respectively.<br /><br /><br />
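The error ratios quoted above follow directly from Table 2. The short sketch below recomputes them; the <span ><i>ErrorRow</i></span> struct, the <span ><i>kTable2</i></span> array, and the <span ><i>errorReduction</i></span> function are hypothetical names introduced for illustration, and a single floating point column is used because the Intel® SSE and Intel® AVX floating point columns of Table 2 are identical.<br /><br />
<pre name="code" class="cpp">
//One row of Table 2: overall mean error of the short integer version
//and of the floating point versions
struct ErrorRow {
    const char* range;
    double sseShort;
    double floatingPoint;
};

//The nonzero rows of Table 2, copied from the article
const ErrorRow kTable2[4] = {
    { "256, 255", 3.44e-5, 6.25e-6 },
    { "5, 5", 2.58e-4, 1.56e-6 },
    { "300, 300", 4.69e-6, 6.25e-6 },
    { "-255, 256", 7.53e-4, 3.13e-6 }
};

//How many times lower the floating point error is than the short integer error
double errorReduction(ErrorRow row) {
    return row.sseShort / row.floatingPoint;
}
</pre>
Evaluating <span ><i>errorReduction</i></span> for each row gives ratios of about 5.5, 165.4, and 240.6 for the (256, 255), (5, 5), and (-255, 256) rows. The (300, 300) row gives about 0.75, which means the short integer version has the lower error in that one test.<br /><br /><br />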
<h1 class="sectionHeading">Conclusion</h1>
The results for the IDCT of 10,000 8x8 blocks show that the Intel® AVX version outperformed the Intel® SSE single precision floating point implementation by 1.78x. The accuracy is excellent when compared to the reference IDCT.<br /><br /> The results also show that the Intel<sup>®</sup> SSE short integer version compiled with the /QxAVX option outperformed the exact same code compiled with the /QxSSE4.1 option by 1.07x. The assembly language produced when compiling with /QxSSE4.1 has 22 register-to-register moves. The code produced with the /QxAVX switch did not have any register-to-register moves. The Intel<sup>®</sup> AVX non-destructive source instructions reduce the need for register copies in this application. There can be benefits to using Intel® AVX for integer-based algorithms today.<br /><br /> Although the Intel® AVX implementation is slightly slower than the Intel® SSE short version, the Intel® AVX single precision floating point version is more accurate. The Intel® SSE short version required several adjustments to improve the accuracy and minimize rounding errors.  Those adjustments were not necessary in either floating point implementation, resulting in a cleaner and more accurate implementation.<br /><br /><br />
<h1 class="sectionHeading">Source Code for IDCT</h1>
The source code for the IDCT can be downloaded <a href="http://software.intel.com/file/29048">here</a>.<br /><br /> A partial source code listing follows.<br /><br />
<pre name="code" class="cpp">void idctAVX(void)	{<br />__m128 r_xmm0, r_xmm2, r_xmm1;<br />__m256 r_ymm02, r_ymm46, r_ymm13, r_ymm57;<br />__m256 r_ymm0, r_ymm1, r_ymm2, r_ymm3, r_ymm4, r_ymm5, r_ymm6, r_ymm7;<br />__m256 row0, row1, row2, row3, row4, row5, row6, row7;<br />__m256 ymm_even, ymm_odd, ymm_sum, ymm_diff;<br />__m256 temp3, temp7;<br />__m256 tangent_1, tangent_2, tangent_3, cos_4;<br /><br />tangent_1 = AVX_tg_1_16;<br />tangent_2 = AVX_tg_2_16;<br />tangent_3 = AVX_tg_3p1_16;<br />cos_4 = AVX_cos_4p1_16;<br /><br />const float * pInput;<br />float * pOutput;<br />float * pFTab_i_04 = float_tab_i_04;<br />float * pFTab_i_26 = float_tab_i_26;<br />float * pFTab_i_17 = float_tab_i_17;<br />float * pFTab_i_35 = float_tab_i_35;<br />const int blockSize = 8*8;<br /><br />//Transform all the blocks N times<br />//The iteration count is a command line option<br />const int maxLoopCount = g_loopCount;<br />long startTime = getTimestamp();<br /><br />for(int loopCount = 0; loopCount &lt; maxLoopCount; loopCount++)	{<br />	<br />   //Operate on all the blocks<br />   for(int i = 0; i &lt; g_blockCount; i++)	{<br /><br />      //Get pointers for this input and output<br />      pInput = &amp;dctData[i*blockSize];				<br />      pOutput = &amp;kernelResults[i*blockSize];<br />      pFTab_i_04 = float_tab_i_04;<br />      pFTab_i_26 = float_tab_i_26;<br />      pFTab_i_17 = float_tab_i_17;<br />      pFTab_i_35 = float_tab_i_35;<br />//    IACA_START<br /><br />      //Rows 0 and 4<br />      //Process the first four floats of these two rows<br />      //Read input data from row 0, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(pInput));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[4]));<br /> <br />      //Insert data from row 4 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[4*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, 
_mm_load_ps(&amp;pInput[4*8+4]), 1);<br /><br />      //Broadcast float 0 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />   <br />      //Multiply by the coefficients<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_04));<br /><br />      //Broadcast float 2 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[8]));<br /><br />      //Add<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Broadcast float 1 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[16]));<br /><br />      //Broadcast float 3 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[24]));<br />   <br />      //Add<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Broadcast float 4 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br /><br />      //Multiply by the coefficients<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[32]));<br /><br />      //Broadcast float 6 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[40]));<br /><br />      //Add<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Broadcast float 5 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[48]));<br /><br />      //Broadcast float 7 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) 
&amp;pFTab_i_04[56]));<br /><br />      //Add to create this partial sum, <br />      //then create final sums and differences.<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /><br />      //Reverse the order of the differences, then build outputs 0 and 4<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row0 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row4 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 0 and 4 processing<br /><br />      //Rows 1 and 7<br />      //Process the first four floats of these two rows<br />      //Read input data from row 1, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[8+4]));<br /><br />      //Insert data from row 7 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[7*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[7*8+4]), 1);<br />      <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_17));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />       //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[24]));<br />      r_ymm13 = 
_mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create final sums and differences<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br />    <br />      //Reverse the order of the differences, then build outputs 1 and 7<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row1 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row7 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 1 and 7 processing<br /><br />      //Rows 2 and 6<br />      //Process the first four floats of these two rows<br />      //Read input data from row 2, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[2*8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[2*8+4]));<br /><br />      //Insert data from row 6 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[6*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[6*8+4]), 
1);   <br />   <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_26));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[24]));<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create the final sum and difference<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br />   <br />      //Reverse the order of the differences, then build outputs 2 and 6<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row2 = 
_mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row6 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 2 and 6 processing<br /><br />      //Rows 3 and 5<br />      //Process the first four floats of these two rows<br />      //Read input data from row 3, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[3*8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[3*8+4]));<br /><br />      //Insert data from row 5 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[5*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[5*8+4]), 1);<br />    <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_35));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[24]));<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br /> 
     r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create the final sum and difference<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /><br />      //Reverse the order of the differences, then build outputs 3 and 5<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row3 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row5 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 3 and 5 processing<br /><br />      //******************************<br />      //perform 1D IDCT on the columns<br />   <br />      //Multiply several rows by the appropriate tangent value<br />      //row5*(tangent3+1)<br />      r_ymm0 = _mm256_mul_ps(row5, tangent_3);<br />      //row3*(tangent3+1)<br />      r_ymm1 = _mm256_mul_ps(row3, tangent_3);<br />      //row7*tangent1<br />      r_ymm4 = _mm256_mul_ps(row7, tangent_1);<br />      //row1*tangent1<br />      r_ymm5 = _mm256_mul_ps(row1, tangent_1);<br /><br />      //Begin to create results<br />      //[row5*tangent3 + row5] + row3<br />      r_ymm0 = _mm256_add_ps(r_ymm0, row3);<br />      //row5 - [row3*tangent3 + row3]<br />      r_ymm2 = _mm256_sub_ps(row5, r_ymm1);<br />      //row6*tangent2<br />      r_ymm7 = _mm256_mul_ps(row6, tangent_2);<br />      //row2*tangent2<br />      r_ymm3 = _mm256_mul_ps(row2, tangent_2);<br />      //row1*tangent1 - row7<br />      r_ymm5 = _mm256_sub_ps(r_ymm5, row7);<br />      //row7*tangent1 + row1<br />      r_ymm4 = _mm256_add_ps(r_ymm4, row1);<br /><br />      //Save intermediate row 7 results, used as an input later<br />      //[row7*tangent1 + row1] + [row5*tangent3 + row3]<br />      temp7 = _mm256_add_ps(r_ymm4, r_ymm0);<br /><br />      //Save intermediate row 3 results, used as an input later<br />      //[row1*tangent1 - row7] + [row5 - [row3*tangent3 + row3]]<br />      temp3 = _mm256_add_ps(r_ymm5, r_ymm2);<br /><br />      //[row7*tangent1 + row1] - [row5*tangent3 + row3]<br />      r_ymm4 = _mm256_sub_ps(r_ymm4, r_ymm0);<br />      //[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]<br />      r_ymm5 = _mm256_sub_ps(r_ymm5, r_ymm2);<br />      //{[row7*tangent1 + row1] - [row5*tangent3 + row3]} - <br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]}<br />      r_ymm1 = _mm256_sub_ps(r_ymm4, r_ymm5);<br />      //{[row7*tangent1 + row1] - [row5*tangent3 + row3]} + <br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]}<br />      r_ymm4 = _mm256_add_ps(r_ymm4, r_ymm5);<br />      //multiply by cos_4+1<br />      r_ymm4 = _mm256_mul_ps(r_ymm4, cos_4);<br />      //row6*tangent2 + row2<br />      r_ymm7 = _mm256_add_ps(r_ymm7, row2);<br />      //row2*tangent2 - row6<br />      r_ymm3 = _mm256_sub_ps(r_ymm3, row6);<br />      //multiply by cos_4+1<br />      r_ymm0 = _mm256_mul_ps(r_ymm1, cos_4);<br />      //row0 + row4<br />      r_ymm5 = _mm256_add_ps(row0, row4);<br />      //row0 - row4<br />      r_ymm6 = _mm256_sub_ps(row0, row4);<br />      //[row0 + row4] - [row6*tangent2 + row2]<br />      r_ymm2 = _mm256_sub_ps(r_ymm5, r_ymm7);<br />      //[row0 + row4] + [row6*tangent2 + row2]<br />      r_ymm5 = _mm256_add_ps(r_ymm5, r_ymm7);<br />      //[row0 - row4] - [row2*tangent2 - row6]<br />      r_ymm1 = _mm256_sub_ps(r_ymm6, r_ymm3);<br />      //[row0 - row4] + [row2*tangent2 - row6]<br />      r_ymm6 = _mm256_add_ps(r_ymm6, r_ymm3);<br />      //[[row7*tangent1 + row1] + [row5*tangent3 + row3]] + <br />      //[[row0 + row4] + [row6*tangent2 + row2]]<br />      r_ymm7 = _mm256_add_ps(temp7, r_ymm5);<br /><br />      //Store row 0 results (store 1 of 8)<br />      _mm256_store_ps(pOutput, r_ymm7);<br />      //[[row0 - row4] + [row2*tangent2 - row6]] - <br />      //cos4*({[row7*tangent1 + row1] - [row5*tangent3 + row3]} +<br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]})<br />      r_ymm3 = _mm256_sub_ps(r_ymm6, r_ymm4);<br />      //[[row0 - row4] + [row2*tangent2 - row6]] + <br />      //cos4*({[row7*tangent1 + row1] - [row5*tangent3 + row3]} + <br />      //{[row1*tangent1 - row7] - [row5 - [row3*tangent3 + row3]]})<br />      r_ymm6 = _mm256_add_ps(r_ymm6, r_ymm4);<br /><br />      //Store row 1 results (store 2 of 8)<br />      _mm256_store_ps(&amp;pOutput[1*8], r_ymm6);<br /><br />      r_ymm7 = _mm256_sub_ps(r_ymm1, r_ymm0);<br />      r_ymm1 = _mm256_add_ps(r_ymm1, r_ymm0);<br /><br />      r_ymm6 = _mm256_add_ps(r_ymm2, temp3);<br />      r_ymm2 = _mm256_sub_ps(r_ymm2, temp3);<br /><br />      r_ymm5 = _mm256_sub_ps(r_ymm5, temp7);<br /><br />      //Store final results<br />      _mm256_store_ps(&amp;pOutput[2*8], r_ymm1);<br />      _mm256_store_ps(&amp;pOutput[3*8], r_ymm6);<br />      _mm256_store_ps(&amp;pOutput[4*8], r_ymm2);<br />      _mm256_store_ps(&amp;pOutput[5*8], r_ymm7);<br />      _mm256_store_ps(&amp;pOutput[6*8], r_ymm3);<br />      _mm256_store_ps(&amp;pOutput[7*8], r_ymm5);<br /><br />//      IACA_END<br />      }<br />   }<br />   long duration = getTimestamp() - startTime;<br />   cout &lt;&lt; "AVX Timestamp = " &lt;&lt; duration &lt;&lt; endl;<br />}<br /><br /><br /></pre>
<h1 class="sectionHeading">References</h1>
The following documents are referenced in this application note and provide background or supporting information for understanding the topics presented in this document.<br /><br /> 1.	 Pennebaker and Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993, pp. 29-64.<br /> 2.	 A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX™ Instructions, Intel Application Note, AP-922, Copyright 1999.<br /> 3.	 Using Streaming SIMD Extensions 2 (SSE2) to Implement an Inverse Discrete Cosine Transform, Intel Application Note, AP-945, Copyright 2000.<br /> 4.	 Rao and Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Inc., Boston, 1990, Appendix A.2.<br /> 5.	 IEEE Standard Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform, IEEE Std 1180-1990.<br /><br /><br />
<h1 class="sectionHeading">About the Author</h1>
Richard Hubbard is a Senior Software Engineer and member of the SSG Apple enabling team, working on optimizing Mac OS X* applications for power and performance.  Richard holds a Master's degree in Electrical Engineering from Stevens Institute of Technology and a Bachelor's degree in Computer Engineering from New Jersey Institute of Technology. ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/</link>
      <pubDate>Mon, 07 Jun 2010 21:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/</guid>
      <category>Intel® AVX</category>
    </item>
  </channel></rss>
