<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Mon, 21 May 2012 04:46:43 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/avx/type/code/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/avx/type/code/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Intel® AVX C/C++ Intrinsics Emulation</title>
      <description><![CDATA[ <p>The Intel® AVX instruction set extension <a target="_blank" href="http://software.intel.com/en-us/avx/">[1]</a> will appear in the next-generation Intel microarchitecture codenamed 'Sandy Bridge'. We chose to announce AVX early to get as much support from software vendors as possible by hardware launch time. Most software development platforms now support Intel AVX, including compilers and assemblers from Intel and Microsoft, GCC, and UNIX binutils.</p>
<p>For early adopters we introduced AVX support in the Intel® Software Development Emulator <a target="_blank" href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">[2]</a>. It allows you to run code containing actual AVX instructions and check its functional correctness before hardware is available.</p>
<p>Today we are adding another useful piece for developers who cannot yet use AVX-capable tools in their current development environment but plan to migrate in the future, or who use a software platform that Intel SDE does not support. These developers can still start programming with Intel AVX using intrinsics.</p>
<p>Here we provide a C and C++ header file that emulates the Intel AVX intrinsics. The AVX emulation header file uses intrinsics for the prior Intel instruction set extensions up to Intel SSE4.2. SSE4.2 support in both your development environment and your hardware is required in order to use the AVX emulation header file. <br /><br />To use it, simply include this file:</p>
<p>#include "avxintrin_emu.h"</p>
<p>Instead of the usual:</p>
<p>#include &lt;immintrin.h&gt;</p>
<p><br />You can also create an alternative immintrin.h file (which in turn includes avxintrin_emu.h) to avoid an intrusive change to the source base, and then switch between real AVX code generation and emulation by changing the include directory path.</p>
<p>The emulation header primarily targets UNIX-type environments and was tested there with the GCC and Intel C/C++ compilers. On the Microsoft Windows platform, other tools (compilers, assemblers and SDE) already provide strong AVX support, but this header file can still be used on Windows with the Intel Compiler if desired.</p>
<p>Note that the AVX emulation header file is designed for checking the functional correctness of an AVX implementation and is not recommended for long-term use or for release in a final product. Once your development environment and hardware support AVX, we recommend that you switch to the real AVX intrinsic header file.<br /><br />Although we did our best to debug it, this file must <em>not</em> be considered a reference functional implementation of AVX instructions or even bug-free. Please see the current version's limitations and caveats at the beginning of the file, and let us know about any issues you encounter while using it.</p>
<p><b><br />Example</b></p>
<pre name="code" class="cpp:nogutter:nocontrols">#include &lt;stddef.h&gt;         // size_t
#include "avxintrin_emu.h"  // use #include &lt;immintrin.h&gt; instead on a real AVX toolchain

void saxpy( float a, const float* x, const float* y, float* __restrict z, size_t len )
{
    size_t i = 0;
    __m256 a_ = _mm256_set1_ps( a );

    for ( size_t len16_ = len &amp; -16; i + 16 &lt;= len16_; i += 16 )
    {
        __m256 x1_ = _mm256_loadu_ps( x + i );
        __m256 x2_ = _mm256_loadu_ps( x + i + 8 );

        __m256 y1_ = _mm256_loadu_ps( y + i );
        __m256 y2_ = _mm256_loadu_ps( y + i + 8 );

        x1_ = _mm256_mul_ps( x1_, a_ );
        x2_ = _mm256_mul_ps( x2_, a_ );

        x1_ = _mm256_add_ps( x1_, y1_ );
        x2_ = _mm256_add_ps( x2_, y2_ );

        _mm256_storeu_ps( z + i     , x1_ );
        _mm256_storeu_ps( z + i + 8 , x2_ );
    }

    for ( ; i &lt; len; ++i )
        z[i] = x[i] * a + y[i];
}</pre>
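<p>Because the emulation header exists to check functional correctness, it can help to compare the vector routine against a plain scalar version. The following is a minimal sketch of such a check; <i>saxpy_ref</i> and <i>max_abs_diff</i> are hypothetical helper names that are not part of the emulation header.</p>
<pre name="code" class="cpp:nogutter:nocontrols">#include &lt;cmath&gt;
#include &lt;cstddef&gt;

// Scalar reference for the saxpy example: z[i] = x[i] * a + y[i].
void saxpy_ref( float a, const float* x, const float* y, float* z, std::size_t len )
{
    for ( std::size_t i = 0; i &lt; len; ++i )
        z[i] = x[i] * a + y[i];
}

// Largest absolute difference between two result buffers (hypothetical helper).
float max_abs_diff( const float* p, const float* q, std::size_t len )
{
    float worst = 0.0f;
    for ( std::size_t i = 0; i &lt; len; ++i )
    {
        const float d = std::fabs( p[i] - q[i] );
        if ( d &gt; worst )
            worst = d;
    }
    return worst;
}</pre>
<p>Running both routines on the same inputs and checking that the reported difference stays within a small tolerance gives a quick sanity check of the emulated code path.</p>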
<p><br /><strong><br />References </strong></p>
<p>[1] Intel AVX - <a target="_blank" href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a></p>
<p>[2] Intel Software Development Emulator - <a target="_blank" href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">http://software.intel.com/en-us/articles/intel-software-development-emulator/</a></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/avx-emulation-header-file/</link>
      <pubDate>Wed, 23 Jun 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/avx-emulation-header-file/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/avx-emulation-header-file/</guid>
      <category>Parallel Programming</category>
      <category>Open Source</category>
      <category>What If Experimental Software</category>
      <category>Tools</category>
      <category>Intel® AVX</category>
      <category>Software News</category>
      <category>Code &amp; Downloads</category>
    </item>
    <item>
      <title>Using Intel® Advanced Vector Extensions to Implement an Inverse Discrete Cosine Transform</title>
      <description><![CDATA[ <h1 class="sectionHeading">Download Article</h1>
Download <a href="http://software.intel.com/file/30921">Using Intel® Advanced Vector Extensions to Implement an Inverse Discrete Cosine Transform</a> [PDF 335KB]<br />Download <a href="http://software.intel.com/file/29048">IDCT source code</a> [Zip 22KB]<br /><br />
<h1 class="sectionHeading">Introduction</h1>
Transform coding is an important step of image and video processing applications. Pixels in an image have a level of correlation with their neighboring pixels. Adjacent pixels in successive frames show a very high correlation. These correlations can be used to predict the value of a pixel from its neighbor. The highly correlated spatial data is transformed into uncorrelated coefficients in the frequency domain. The transformed data, represented as coefficients, are independent, allowing them to be manipulated separately. The human eye is more sensitive to low-frequency changes than to high-frequency changes, so encoders can achieve compression by setting the high-frequency coefficients to zero. An inverse transform is used in the decoder pipeline to reconstruct the source data.<br /><br /> Figure 1 and Figure 2 depict the components of a typical image encoder/decoder system.<br /><br />
<p ><img src="http://software.intel.com/file/28098" /></p>
<br />
<div ><b>Figure 1 - Source Encoder Block Diagram</b><br /></div>
<br />
<p ><img src="http://software.intel.com/file/28099" /></p>
<br />
<div ><b>Figure 2 - Source Decoder Block Diagram</b><br /></div>
<br /> The Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) are widely used in the encoder and decoder pipelines of image processing systems such as MPEG, JPEG, and other standards. A fast and accurate IDCT transformation is crucial for the best user experience.<br /><br /> Intel® Advanced Vector Extensions (Intel® AVX), introduced with the new Intel® processor microarchitecture codenamed Sandy Bridge, extends the capabilities of Intel® Streaming SIMD Extensions (Intel® SSE) for floating point data and operations such as Inverse Discrete Cosine Transforms. Intel® AVX essentially doubles the width of the current XMM registers and adds new extensions that can operate on the wider data width. Intel® AVX significantly increases the floating-point performance density with improved power efficiency over previous 128-bit SIMD instruction set extensions. This document specifically examines how Intel® AVX and Sandy Bridge microarchitecture features such as wider 256-bit SIMD registers, non-destructive source operands, new data manipulation and arithmetic primitives, dual 128-bit load ports, and doubling of computational execution width can benefit the Inverse Discrete Cosine Transform (IDCT) operation. Intel® AVX improves the computational throughput of high performance precision-sensitive floating point transform applications. Integer implementations may also benefit from features such as non-destructive source operands. <br /><br /><br /> Intel also provides a set of Intel® AVX software development tools like the Intel® AVX-enabled Intel® C++ Compiler, the Intel® Software Development Emulator (SDE), and the Intel® Architecture Code Analyzer.  All of these tools were effectively used during the development of this kernel, and can be downloaded from the Intel® AVX website: <a href="http://software.intel.com/en-us/avx/">http://software.intel.com/en-us/avx/</a><br /><br />
<h1 class="sectionHeading">Testing Environment</h1>
The performance speedups stated in this paper are based on runs on actual Sandy Bridge microarchitecture-based pre-production silicon.  It can also be assumed that the test data is already in the first level processor cache prior to the computation of the IDCT algorithm.  Performance comparisons are made based on the relative performance of Intel® AVX versus corresponding Intel® SSE implementations using C intrinsic instructions, both run on the Sandy Bridge microarchitecture-based silicon. The code was compiled using the 64-bit Intel® C++ Compiler, version 11.1.038. The applications were compiled using the following command line options:<br /><br /> 
<ul>
<li>Intel® SSE: /QxSSE4.1 /O3</li>
<li>Intel® AVX: /QxAVX /O3</li>
</ul>
A comparison of short integer performance was performed by compiling the Intel<sup>®</sup> SSE implementation for different architectures. The compiler will generate VEX-encoded Intel<sup>®</sup> SSE instructions when the /QxAVX switch is specified. The compiler will generate Intel<sup>® </sup>SSE 4.1 instructions when the /QxSSE4.1 switch is specified. Non-destructive source operands are introduced with the VEX-encoded instructions. <br /><br /> The test application follows the 8x8 IDCT accuracy requirements as documented in section 3 of IEEE Standard 1180-1990 (Reference 5).  Randomly generated input data is processed by a reference DCT. The output of the DCT is rounded to the nearest integer and clipped such that -2048 &lt;= value &lt;= 2047. The clipped values are inputs to both the reference IDCT and the IDCT under test. Their outputs are rounded to the nearest integer and clipped such that -256 &lt;= output &lt;= 255. The two results are then compared and accuracy measurements are made.<br /><br /> In this application, the references are implemented using double precision floating point scalar C code. Three proposed IDCT implementations are tested: a short integer Intel® SSE version, a single precision floating point Intel® SSE version, and a single precision floating point Intel® AVX version. All the vector versions are implemented with C intrinsics.<br /><br />
<p ><img src="http://software.intel.com/file/28086" /></p>
<br />
<div ><b>Figure 3 - Test Application</b><br /></div>
<br />
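The round-and-clip steps of the test flow can be sketched in scalar code. This is an illustration only, not the test application's source: the function names are hypothetical, both clipping bounds are treated as inclusive (an assumption of this sketch), and <i>mean_error</i> is a simple average of signed differences rather than the full set of measurements defined by the standard.<br /><br />
<pre name="code" class="cpp">#include &lt;cmath&gt;

// Round to the nearest integer, then clamp to [lo, hi] (both bounds inclusive).
long round_and_clip(double value, long lo, long hi)
{
    const long r = std::lround(value);
    if (r &lt; lo) return lo;
    if (r &gt; hi) return hi;
    return r;
}

// Mean of the signed differences between the IDCT under test and the
// reference IDCT over n output samples (illustrative name and signature).
double mean_error(const long* test, const long* ref, int n)
{
    double sum = 0.0;
    for (int i = 0; i &lt; n; ++i)
        sum += static_cast&lt;double&gt;(test[i] - ref[i]);
    return sum / n;
}</pre>
In this sketch, DCT outputs would be passed through round_and_clip(v, -2048, 2047) and IDCT outputs through round_and_clip(v, -256, 255) before the comparison.<br /><br />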
<h1 class="sectionHeading">Discrete Cosine Transforms</h1>
The Discrete Cosine Transform (DCT) converts the spatial data of an image into the frequency domain. The mathematical operations are described in detail in Reference [1].<br /><br /> The two-dimensional (2D) DCT transforms 64 pixel values, an 8x8 block of pixels, producing 64 coefficients. The image's pixel values are transformed into coefficients derived from amplitudes of cosine basis functions.<br /><br />
<p ><img src="http://software.intel.com/file/28087" /></p>
<br /><br /> Where:<br /><br />
<p ><img src="http://software.intel.com/file/28088" /></p>
<br /> Reference work [2] summarizes the number of mathematical operations; the numbers are worth repeating.  Comparing the DCT and IDCT equations, it can be shown that the transforms require the same number of operations.  An analysis of Equation 2 shows that the 2D DCT requires 64 multiplications and 63 additions for each coefficient. Therefore 4096 multiplications and 4032 additions are required to transform an 8x8 block. Replacing the 2D DCT with 8 1D DCTs for the 8 rows and 8 1D DCTs for the 8 columns can reduce the number of operations. Equation 1 shows that the 1D DCT requires 64 multiplications and 56 additions to produce 8 coefficients. Transforming the 8x8 block with 16 1D DCTs therefore requires 1024 multiplications and 896 additions. References [1] and [4] discuss other DCT and IDCT algorithms.<br /><br /><br />
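The row-column counts can be checked with a small instrumented scalar DCT. The sketch below is a naive implementation written for this illustration, not the optimized algorithm of Reference [2]; the <i>OpCount</i> type and function names are hypothetical, orthonormal scaling is assumed, and the basis values stand in for a precomputed table so that only the products and accumulations are counted.<br /><br />
<pre name="code" class="cpp">#include &lt;cmath&gt;

// Hypothetical counters used only to tally the arithmetic.
struct OpCount { long mul; long add; };

// Orthonormal DCT basis value c(u) * cos((2x + 1) * u * pi / 16).
// In a real kernel these values would come from a precomputed table.
static double basis(int u, int x)
{
    const double pi = 3.14159265358979323846;
    const double c = (u == 0) ? std::sqrt(0.125) : 0.5;
    return c * std::cos((2 * x + 1) * u * pi / 16.0);
}

// Naive 8-point 1D DCT over in[0], in[stride], ..., in[7 * stride].
// Each output costs 8 multiplications and 7 additions.
static void dct_1d(const double* in, double* out, int stride, OpCount* ops)
{
    for (int u = 0; u &lt; 8; ++u) {
        double acc = in[0] * basis(u, 0);
        ops-&gt;mul += 1;
        for (int x = 1; x &lt; 8; ++x) {
            acc += in[x * stride] * basis(u, x);
            ops-&gt;mul += 1;
            ops-&gt;add += 1;
        }
        out[u * stride] = acc;
    }
}

// 2D DCT of an 8x8 block: 8 row transforms followed by 8 column transforms.
OpCount dct_2d_row_column(const double block[64], double coeff[64])
{
    OpCount ops = {0, 0};
    double tmp[64];
    for (int r = 0; r &lt; 8; ++r)
        dct_1d(block + 8 * r, tmp + 8 * r, 1, &amp;ops);
    for (int c = 0; c &lt; 8; ++c)
        dct_1d(tmp + c, coeff + c, 8, &amp;ops);
    return ops;
}</pre>
Each of the 16 calls to dct_1d contributes 64 multiplications and 56 additions, which gives the 1024 and 896 totals quoted above.<br /><br />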
<h1 class="sectionHeading">Previous Work and Specifications</h1>
The following previous works and specifications should be consulted for background information:<br /> 
<ul>
<li>Reference [1] for the JPEG specification.</li>
<li>The IDCT algorithm used in the Intel® AVX implementation discussed in this whitepaper follows the optimized algorithm discussed in Reference [2].</li>
<li>Reference [3] provides additional background information.</li>
</ul>
<br />
<h1 class="sectionHeading">Intel® AVX Implementation of IDCT</h1>
As in the previous works, the algorithm in this whitepaper performs eight 1D IDCT transforms on the rows of an 8x8 block. Eight 1D IDCT transforms are then performed on the columns of those results. Two different 1D IDCT transforms are used, and neither transform requires a transpose.<br /><br /> The AVX implementation operates on two rows of data simultaneously. It begins by loading the first four floats of one row into a 128-bit register. The reason for performing 128-bit loads will become clearer in a moment. It is important to remember that the 128-bit XMM registers overlay the lower 128 bits of the corresponding 256-bit YMM register. The upper 128 bits of the YMM register are then loaded with the first four floats of the second row via the _mm256_insertf128_ps intrinsic.  The resultant YMM register contains floats 0-3 of the first row in the lower 128 bits, and floats 0-3 of the second row in the upper 128 bits. This is known as a strided load, because the second load reads from an area of memory that is more than a unit stride away from the first. This seems like a lot of work to load eight floats, so why take this approach?<br /><br /> The strided load has several benefits in this algorithm:<br /><br /> 
<ul>
<li>Eight products need to be summed in the IDCT, and it is more straightforward to sum the products if they are in the same 128-bit lane of two YMM registers. Two registers are built with the contents from one row in the lower 128 bits of both registers, and the contents from the second row in the upper 128 bits of both registers. This is depicted in Figure 4 through Figure 6.</li>
<li>Another benefit of the strided load approach is that data movement, or placement of the 128-bit memory operand into the upper 128 bits, can take place on a different execution port than the shuffle port. This reduces the utilization of the port that performs shuffles, and allocates the work more evenly across the execution ports.</li>
<li>A final benefit is that placing floats 4-7 of two rows into one register reduces the number of shuffles needed near the end of row processing.</li>
</ul>
<p ><img src="http://software.intel.com/file/28089" /></p>
<br />
<div ><b>Figure 4 - Loading Four Floats from Row 0</b><br /></div>
<br /> The combination of the <span ><i>_mm_load_ps</i></span> and <span ><i>_mm256_castps128_ps256</i></span> intrinsics loads 128 bits into the least significant 128 bits of a YMM register.<br /><br /> Figure 5 depicts the strided load of Row 4 data, and its insertion into the most significant 128 bits of ymm0.<br /><br />
<p ><img src="http://software.intel.com/file/28090" /></p>
<br />
<div ><b>Figure 5 - Strided Load of Second Row</b><br /></div>
<br /> A second YMM register is populated in a similar manner; it contains the second set of four floats (floats 4-7) from both rows.  The results are shown in Figure 6.<br /><br />
<p ><img src="http://software.intel.com/file/28091" /></p>
<br />
<div ><b>Figure 6 - Second Set of Floats from Two Rows</b><br /></div>
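<br /> The register layout built by these loads can be modeled in scalar code. The sketch below only mirrors where each float ends up; the <i>Ymm</i> struct and <i>load_row_pair</i> are hypothetical stand-ins for a YMM register and for the load, cast and insert sequence, not part of the IDCT source.<br /><br />
<pre name="code" class="cpp">// Scalar stand-in for a 256-bit YMM register: elements 0-3 model the
// lower 128-bit lane, elements 4-7 model the upper 128-bit lane.
struct Ymm { float f[8]; };

// Models the load sequence for one pair of rows. The lower lane receives
// four floats of rowA and the upper lane four floats of rowB, both taken
// starting at column `first` (0 for floats 0-3, 4 for floats 4-7).
Ymm load_row_pair(const float block[64], int rowA, int rowB, int first)
{
    Ymm r;
    for (int i = 0; i &lt; 4; ++i) {
        r.f[i]     = block[rowA * 8 + first + i];  // 128-bit load into the low lane
        r.f[4 + i] = block[rowB * 8 + first + i];  // insert into the high lane
    }
    return r;
}</pre>
Calling load_row_pair(block, 0, 4, 0) models the register shown in Figure 5, and load_row_pair(block, 0, 4, 4) models the second register shown in Figure 6.<br />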
<br /> The input data must be multiplied by different cosine terms.  A single float from each row is broadcast within its 128-bit lane via the <span ><i>_mm256_shuffle_ps</i></span> intrinsic to prepare for the multiplication, as shown in Figure 7. In this case float 0 is duplicated.<br /><br />
<p ><img src="http://software.intel.com/file/28092" /></p>
<br />
<div ><b>Figure 7 - Preparing for Cosine Multiplication</b><br /></div>
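<br /> The effect of the shuffle immediate can be illustrated with a scalar model of the intrinsic. The sketch below follows the documented per-lane behavior of <i>_mm256_shuffle_ps</i>; the <i>Ymm</i> struct and the function name are hypothetical and exist only for this illustration.<br /><br />
<pre name="code" class="cpp">// Scalar stand-in for a 256-bit YMM register (two 128-bit lanes of four floats).
struct Ymm { float f[8]; };

// Scalar model of _mm256_shuffle_ps(a, b, imm). Within each 128-bit lane,
// outputs 0 and 1 are picked from a and outputs 2 and 3 from b, each using
// one 2-bit field of the immediate (lowest field first).
Ymm shuffle_ps_model(Ymm a, Ymm b, int imm)
{
    Ymm r;
    for (int lane = 0; lane &lt; 8; lane += 4) {
        r.f[lane + 0] = a.f[lane + imm % 4];
        r.f[lane + 1] = a.f[lane + (imm / 4) % 4];
        r.f[lane + 2] = b.f[lane + (imm / 16) % 4];
        r.f[lane + 3] = b.f[lane + (imm / 64) % 4];
    }
    return r;
}</pre>
With both sources set to the same register, the immediates 0x00, 0x55, 0xaa and 0xff replicate float 0, 1, 2 or 3 of each lane, which matches the broadcasts used in the source listing.<br />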
<br /> Two <span ><i>_mm256_mul_ps</i></span> instructions perform the necessary cosine multiplications, and one <span ><i>_mm256_add_ps</i></span> instruction adds the two products.  This completes one of the seven sums of product terms.<br /><br /> This sequence of instructions is repeated three more times, operating on a different pair of input values and cosine terms each time. At this point there are four registers containing sums of products: one for floats 0 and 2, one for floats 1 and 3, one for floats 4 and 6, and one for floats 5 and 7.<br /><br /> Three more sums are necessary to complete the sum of products. The sums of products of all even terms are added, and the sums of products of all odd terms are added. That accounts for two of the three sums.<br /><br />
<pre name="code" class="cpp">ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br /></pre>
The seventh and final sum produces the least significant four floats of the output. A subtraction is necessary to produce the most significant four floats; however, they are not in the correct order. <br /><br />
<pre name="code" class="cpp">ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /></pre>
The order of the most significant floats of two rows can be properly adjusted using a single <span ><i>_mm256_shuffle_ps</i></span> instruction, as shown in Figure 8. The benefit of placing the most significant four floats from each input row into the same register is now more apparent.<br /><br />
<p ><img src="http://software.intel.com/file/28093" /></p>
<br />
<div ><b>Figure 8 - Reversing the Order of Differences</b><br /></div>
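<br /> The reason the differences come out in reverse order follows from the symmetry of the cosine basis: output 7-n equals the even-term sum minus the odd-term sum for output n. The scalar sketch below demonstrates this for a plain 8-point IDCT. It assumes orthonormal scaling and a directly computed basis, not the coefficient tables used by the optimized kernel, and the function names are hypothetical.<br /><br />
<pre name="code" class="cpp">#include &lt;cmath&gt;

// Orthonormal 8-point IDCT basis value c(u) * cos((2n + 1) * u * pi / 16).
static double basis(int u, int n)
{
    const double pi = 3.14159265358979323846;
    const double c = (u == 0) ? std::sqrt(0.125) : 0.5;
    return c * std::cos((2 * n + 1) * u * pi / 16.0);
}

// Direct 8-point 1D IDCT: every output sums all eight products.
void idct_1d_direct(const double in[8], double out[8])
{
    for (int n = 0; n &lt; 8; ++n) {
        double acc = 0.0;
        for (int u = 0; u &lt; 8; ++u)
            acc += in[u] * basis(u, n);
        out[n] = acc;
    }
}

// Same transform using the even/odd split. Outputs 0..3 are even + odd;
// even - odd yields outputs 7..4, that is, the upper half in reverse order.
void idct_1d_even_odd(const double in[8], double out[8])
{
    for (int n = 0; n &lt; 4; ++n) {
        double even = 0.0;
        double odd = 0.0;
        for (int u = 0; u &lt; 8; u += 2) {
            even += in[u] * basis(u, n);
            odd += in[u + 1] * basis(u + 1, n);
        }
        out[n] = even + odd;
        out[7 - n] = even - odd;
    }
}</pre>
In the vector code the four differences are computed in positions 0 through 3 of a lane, so a shuffle is needed to place them in output order 4 through 7.<br />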
<br /> The final step of the 1D IDCT for these two rows is to recombine the results for each row into the same 256-bit register using two <span ><i>_mm256_permute2f128_ps</i></span> instructions. <br /><br />
<p ><img src="http://software.intel.com/file/28094" /></p>
<br />
<div ><b>Figure 9 - Recombining Row 0</b><br /></div>
<br />
<p ><img src="http://software.intel.com/file/28095" /></p>
<br />
<div ><b>Figure 10 - Recombining Row 4</b><br /></div>
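<br /> The recombination step can also be modeled in scalar code. The sketch below covers only the lane-selection behavior needed for the two immediates that appear in the source listing (0x20 and 0x31); the zeroing bits of the real instruction are not modeled, and the <i>Ymm</i> struct and function name are hypothetical.<br /><br />
<pre name="code" class="cpp">// Scalar stand-in for a 256-bit YMM register (two 128-bit lanes of four floats).
struct Ymm { float f[8]; };

// Scalar model of _mm256_permute2f128_ps(a, b, imm) without the zeroing bits.
// Bits 1:0 of imm choose the source of the low lane and bits 5:4 the source
// of the high lane: 0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high.
Ymm permute2f128_model(Ymm a, Ymm b, int imm)
{
    Ymm r;
    const int sel[2] = { imm % 4, (imm / 16) % 4 };
    for (int half = 0; half &lt; 2; ++half) {
        const Ymm src = (sel[half] &lt; 2) ? a : b;
        const int base = (sel[half] % 2) * 4;
        for (int i = 0; i &lt; 4; ++i)
            r.f[half * 4 + i] = src.f[base + i];
    }
    return r;
}</pre>
With the sums as the first source and the reordered differences as the second, the immediate 0x20 gathers both low lanes into one output row and 0x31 gathers both high lanes into the other.<br />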
<br /> That completes the 1D IDCT for the first pair of rows (rows 0 and 4). The sequence is repeated for rows 1 and 7, 2 and 6, and finally 3 and 5 to complete the 1D IDCT for all eight rows.<br /><br /> Now that the 1D IDCT of the eight rows is completed, the 1D IDCT of the eight columns can begin. The algorithm is similar, yet simpler because the sums of products are already in the proper position for the addition. The 1D IDCT of the columns does not require any shuffles.<br /><br /> It was possible to eliminate four add instructions in the floating point IDCT of the columns. The straightforward sequence multiplies a row by a tangent term and then adds the row to the product:<br /><br />
<pre name="code" class="cpp">//row5*tangent + row5<br />r_ymm0 = _mm256_mul_ps(row5, tangent);<br />r_ymm0 = _mm256_add_ps(row5, r_ymm0);<br /></pre>
A single multiplication of the row by the precomputed term (tangent + 1) produces the same result in the floating point implementation.<br /><br />
<pre name="code" class="cpp">tangent_p1 = tan(x) + 1<br /></pre>
<pre name="code" class="cpp">//row5*(tangent+1)<br />r_ymm0 = _mm256_mul_ps(row5, tangent_p1);<br />//row3*(tangent+1)<br />r_ymm1 = _mm256_mul_ps(row3, tangent_p1);<br /></pre>
This optimization could not be applied to the short integer implementation due to the scaling of numbers in the short algorithm.<br /><br /><br />
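The equivalence of the two forms can be checked with a scalar sketch. The function names are hypothetical, and comparing through a relative difference rather than bit-for-bit equality is a choice made for this illustration, since the two forms round at different points.<br /><br />
<pre name="code" class="cpp">#include &lt;cmath&gt;

// Two-instruction form: row * tangent + row.
float mul_then_add(float row, float tangent)
{
    return row * tangent + row;
}

// One-instruction form: row * (tangent + 1), with tangent + 1 precomputed.
float mul_by_tangent_p1(float row, float tangent_p1)
{
    return row * tangent_p1;
}

// Relative difference between the two forms for one nonzero input value.
float relative_gap(float row, float tangent)
{
    const float a = mul_then_add(row, tangent);
    const float b = mul_by_tangent_p1(row, tangent + 1.0f);
    return std::fabs(a - b) / std::fabs(b);
}</pre>
For typical inputs the gap stays at the level of single precision rounding, which is consistent with the matching error figures reported for the floating point implementations.<br /><br />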
<h1 class="sectionHeading">Using Intel® Architecture Code Analyzer</h1>
The Intel® Architecture Code Analyzer is an excellent tool that identifies the critical path in a basic block, execution port utilization, and instruction usage alternatives. There is much more Intel® Architecture Code Analyzer information available at the <a href="http://software.intel.com/en-us/avx/">Intel® AVX developer site</a>.<br /><br /> The basic block analysis in Figure 11 shows that the vshufps (invoked via the <span ><i>_mm256_shuffle_ps</i></span> intrinsic) executes on port 5. The analysis also shows that the vinsertf128 instruction (<span ><i>_mm256_insertf128_ps</i></span> intrinsic) offers the programmer choices for port execution. The "X" in the Intel® Architecture Code Analyzer report indicates that an instruction can be executed on the port, but in this particular algorithm it was executed elsewhere. In this case, the vinsertf128 was executed on port 0 and port 3 rather than port 3 and port 5 because the second source operand was loaded from memory. The insert takes place on port 0, thus reducing the pressure on port 5.<br /><br />
<p ><img src="http://software.intel.com/file/28096" /></p>
<br />
<div ><b>Figure 11 - Basic Block Analysis</b><br /></div>
<br /> The Intel® Architecture Code Analyzer is an effective optimization tool for software developers who want to fine tune and improve their application's performance.<br /><br /><br />
<h1 class="sectionHeading">Results</h1>
The 128-bit code was compiled for the Intel® microarchitecture codenamed Nehalem to generate Intel® SSE code, and executed on the Sandy Bridge microarchitecture-based silicon.  The corresponding 256-bit Intel® AVX-enabled code was compiled for the Sandy Bridge microarchitecture and executed on the Sandy Bridge microarchitecture-based silicon. Data was aligned on 16-byte boundaries for the Intel® SSE code and 32-byte boundaries for the Intel® AVX code. Both applications were compiled using the 64-bit version of the Intel® C++ Compiler Professional Edition, version 11.1.038.  The speedups listed here are for the 256-bit code relative to the 128-bit code.<br /><br /> <b>Table 1 - Performance Results</b><br /><br /> 
<table class="tableFormat1" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody >
<tr >
<td  width="33%"><b>Algorithm</b></td>
<td  width="33%"><b>Speedup</b></td>
<td  width="33%"><b>Parameters</b></td>
</tr>
<tr >
<td >Intel® AVX floating point implementation<br /></td>
<td >0.94x</td>
<td >Compared to Intel® SSE short integer implementation</td>
</tr>
<tr >
<td >Intel® SSE short integer implementation compiled with /QxAVX<br /></td>
<td >1.07x</td>
<td >Compared to Intel® SSE short integer implementation compiled with /QxSSE4.1<br /></td>
</tr>
<tr >
<td >Intel® AVX floating point implementation<br /></td>
<td >1.78x</td>
<td >Compared to Intel® SSE floating point implementation</td>
</tr>
</tbody>
</table>
<br /> Table 2 summarizes the overall mean error results for each implementation.<br /><br /> <b>Table 2 - Overall Mean Error Results</b><br /><br /> 
<table class="tableFormat1" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody >
<tr >
<td  width="25%">L, H (Reference 5)</td>
<td  width="25%">Intel® SSE Short</td>
<td  width="25%">Intel® SSE Single Precision Floating Point</td>
<td  width="25%">Intel® AVX Single Precision Floating Point</td>
</tr>
<tr >
<td >256, 255</td>
<td >3.44e-5</td>
<td >6.25e-6</td>
<td >6.25e-6</td>
</tr>
<tr >
<td >5, 5</td>
<td >2.58e-4</td>
<td >1.56e-6</td>
<td >1.56e-6</td>
</tr>
<tr >
<td >300, 300</td>
<td >4.69e-6</td>
<td >6.25e-6</td>
<td >6.25e-6</td>
</tr>
<tr >
<td >-255, 256</td>
<td >7.53e-4</td>
<td >3.13e-6</td>
<td >3.13e-6</td>
</tr>
<tr >
<td >-5, 5</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
<tr >
<td >-300, 300</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
<tr >
<td >Input = 0</td>
<td >0</td>
<td >0</td>
<td >0</td>
</tr>
</tbody>
</table>
<br /> Both the short integer and single precision floating point implementations meet the error requirements of IEEE 1180-1990 (Reference 5). The floating point implementations produce lower overall mean error in three of the tests: roughly 5x, 165x, and 240x lower than the short integer implementation.<br /><br /><br />
<h1 class="sectionHeading">Conclusion</h1>
The results for the IDCT of 10,000 8x8 blocks show that the Intel® AVX version outperformed the Intel® SSE single precision floating point implementation by 1.78x. The accuracy is excellent when compared to the reference IDCT.<br /><br /> The results also show that the Intel<sup>®</sup> SSE short integer version compiled with the /QxAVX option outperformed the exact same code compiled with the /QxSSE4.1 option by 1.07x. The assembly language produced when compiling with /QxSSE4.1 has 22 register-to-register moves. The code produced with the /QxAVX switch did not have any register-to-register moves. The Intel<sup>®</sup> AVX non-destructive source instructions reduce the need for register copies in this application. There can be benefits to using Intel AVX for integer-based algorithms today.<br /><br /> Although the Intel® AVX implementation is slightly slower than the Intel® SSE short version, the Intel® AVX single precision floating point version is more accurate. The Intel® SSE short version required several adjustments to improve the accuracy and minimize rounding errors.  Those adjustments were not necessary in either floating point implementation, resulting in a cleaner and more accurate implementation.<br /><br /><br />
<h1 class="sectionHeading">Source Code for IDCT</h1>
The source code for the IDCT can be downloaded <a href="http://software.intel.com/file/29048">here</a>.<br /><br /> A partial source code listing follows.<br /><br />
<pre name="code" class="cpp">void idctAVX(void)	{<br />__m128 r_xmm0, r_xmm2, r_xmm1;<br />__m256 r_ymm02, r_ymm46, r_ymm13, r_ymm57;<br />__m256 r_ymm0, r_ymm1, r_ymm2, r_ymm3, r_ymm4, r_ymm5, r_ymm6, r_ymm7;<br />__m256 row0, row1, row2, row3, row4, row5, row6, row7;<br />__m256 ymm_even, ymm_odd, ymm_sum, ymm_diff;<br />__m256 temp3, temp7;<br />__m256 tangent_1, tangent_2, tangent_3, cos_4;<br /><br />tangent_1 = AVX_tg_1_16;<br />tangent_2 = AVX_tg_2_16;<br />tangent_3 = AVX_tg_3p1_16;<br />cos_4 = AVX_cos_4p1_16;<br /><br />const float * pInput;<br />float * pOutput;<br />float * pFTab_i_04 = float_tab_i_04;<br />float * pFTab_i_26 = float_tab_i_26;<br />float * pFTab_i_17 = float_tab_i_17;<br />float * pFTab_i_35 = float_tab_i_35;<br />const int blockSize = 8*8;<br /><br />//Transform all the blocks N times<br />//The iteration count is a command line option<br />const int maxLoopCount = g_loopCount;<br />long startTime = getTimestamp();<br /><br />for(int loopCount = 0; loopCount &lt; maxLoopCount; loopCount++)	{<br />	<br />   //Operate on all the blocks<br />   for(int i = 0; i &lt; g_blockCount; i++)	{<br /><br />      //Get pointers for this input and output<br />      pInput = &amp;dctData[i*blockSize];				<br />      pOutput = &amp;kernelResults[i*blockSize];<br />      pFTab_i_04 = float_tab_i_04;<br />      pFTab_i_26 = float_tab_i_26;<br />      pFTab_i_17 = float_tab_i_17;<br />      pFTab_i_35 = float_tab_i_35;<br />//    IACA_START<br /><br />      //Rows 0 and 4<br />      //Process the first four floats of these two rows<br />      //Read input data from row 0, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(pInput));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[4]));<br /> <br />      //Insert data from row 4 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[4*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, 
_mm_load_ps(&amp;pInput[4*8+4]), 1);<br /><br />      //Broadcast float 0 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />   <br />      //Multiply by the coefficients<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_04));<br /><br />      //Broadcast float 2 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[8]));<br /><br />      //Add<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Broadcast float 1 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[16]));<br /><br />      //Broadcast float 3 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[24]));<br />   <br />      //Add<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Broadcast float 4 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br /><br />      //Multiply by the coefficients<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[32]));<br /><br />      //Broadcast float 6 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[40]));<br /><br />      //Add<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Broadcast float 5 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_04[48]));<br /><br />      //Broadcast float 7 and multiply by coefficients<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) 
&amp;pFTab_i_04[56]));<br /><br />      //Add to create this partial sum, <br />      //then create final sums and differences.<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /><br />      //Reverse the order of the differences, then build outputs 0 and 4<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row0 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row4 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 0 and 4 processing<br /><br />      //Rows 1 and 7<br />      //Process the first four floats of these two rows<br />      //Read input data from row 1, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[8+4]));<br /><br />      //Insert data from row 7 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[7*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[7*8+4]), 1);<br />      <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_17));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />       //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[24]));<br />      r_ymm13 = 
_mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_17[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create final sums and differences<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br />    <br />      //Reverse the order of the differences, then build outputs 1 and 7<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row1 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row7 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 1 and 7 processing<br /><br />      //Rows 2 and 6<br />      //Process the first four floats of these two rows<br />      //Read input data from row 2, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[2*8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[2*8+4]));<br /><br />      //Insert data from row 6 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[6*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[6*8+4]), 
1);   <br />   <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_26));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[24]));<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_26[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create the final sum and difference<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br />   <br />      //Reverse the order of the differences, then build outputs 2 and 6<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row2 = 
_mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row6 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 2 and 6 processing<br /><br />      //Rows 3 and 5<br />      //Process the first four floats of these two rows<br />      //Read input data from row 3, read eight floats via two 128-bit loads<br />      r_ymm0 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[3*8]));<br />      r_ymm1 = _mm256_castps128_ps256(_mm_load_ps(&amp;pInput[3*8+4]));<br /><br />      //Insert data from row 5 into the upper lane<br />      r_ymm0 = _mm256_insertf128_ps(r_ymm0, _mm_load_ps(&amp;pInput[5*8]), 1);<br />      r_ymm4 = _mm256_insertf128_ps(r_ymm1, _mm_load_ps(&amp;pInput[5*8+4]), 1);<br />    <br />      //Process floats 0 and 2 <br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) pFTab_i_35));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[8]));<br />      r_ymm02 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 1 and 3<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0x55);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[16]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm0, r_ymm0, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[24]));<br />      r_ymm13 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process the second four floats of these two rows<br />      //Process floats 4 and 6<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x00);<br />      r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[32]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xaa);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[40]));<br />      r_ymm46 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Process floats 5 and 7<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0x55);<br /> 
     r_ymm2 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[48]));<br />      r_ymm1 = _mm256_shuffle_ps(r_ymm4, r_ymm4, 0xff);<br />      r_ymm3 = _mm256_mul_ps(r_ymm1, *((__m256 *) &amp;pFTab_i_35[56]));<br />      r_ymm57 = _mm256_add_ps(r_ymm2, r_ymm3);<br /><br />      //Create the final sum and difference<br />      ymm_even = _mm256_add_ps(r_ymm02, r_ymm46);<br />      ymm_odd = _mm256_add_ps(r_ymm13, r_ymm57);<br />      ymm_diff = _mm256_sub_ps(ymm_even, ymm_odd);<br />      ymm_sum = _mm256_add_ps(ymm_even, ymm_odd);<br /><br />      //Reverse the order of the differences, then build outputs 3 and 5<br />      ymm_diff = _mm256_shuffle_ps(ymm_diff, ymm_diff, 0x1b);<br />      row3 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x20);<br />      row5 = _mm256_permute2f128_ps(ymm_sum, ymm_diff, 0x31);<br />      //End of Rows 3 and 5 processing<br /><br />      //******************************<br />      //perform 1D IDCT on the columns<br />   <br />      //Multiply several rows by the appropriate tangent value<br />      //row5*(tangent3+1)<br />      r_ymm0 = _mm256_mul_ps(row5, tangent_3);<br />      //row3*(tangent3+1)<br />      r_ymm1 = _mm256_mul_ps(row3, tangent_3);	<br />      //row7*tangent1<br />      r_ymm4 = _mm256_mul_ps(row7, tangent_1);	<br />      //row1*tangent1<br />      r_ymm5 = _mm256_mul_ps(row1, tangent_1);<br /><br />      //Begin to create results	<br />      //[row5*tangent3 + row5] + row3<br />      r_ymm0 = _mm256_add_ps(r_ymm0, row3);<br />      //row5 - [row1*tangent3 + row1]		<br />      r_ymm2 = _mm256_sub_ps(row5, r_ymm1);<br />      //row6*tangent2<br />      r_ymm7 = _mm256_mul_ps(row6, tangent_2);	<br />      //row2*tangent2<br />      r_ymm3 = _mm256_mul_ps(row2, tangent_2);<br />      //row1*tangent1 - row7	<br />      r_ymm5 = _mm256_sub_ps(r_ymm5, row7);<br />      //row7*tangent1 + row1		<br />      r_ymm4 = _mm256_add_ps(r_ymm4, row1);		<br /><br />      //Save intermediate row 7 results, used as an input 
later<br />      //[row7*tangent1 + row1] + [row5*tangent3 + row3]<br />      temp7 = _mm256_add_ps(r_ymm4, r_ymm0);		<br /><br />      //Save intermediate row 3 results, used as an input later<br />      //[row1*tangent1 - row7] + [row5 - [row1*tangent3 + row1]]<br />      temp3 = _mm256_add_ps(r_ymm5, r_ymm2);		<br /><br />      //[row7*tangent1 + row1] - [row5*tangent3 + row3]			<br />      r_ymm4 = _mm256_sub_ps(r_ymm4, r_ymm0);<br />      //[row1*tangent1 - row7] - [row5 - [row1*tangent3 + row1]]<br />      r_ymm5 = _mm256_sub_ps(r_ymm5, r_ymm2);	<br />      //{[row7*tangent1 + row1] - [row5*tangent3 + row3]} - <br />      //{[row1*tangent1 - row7] - [row5 - [row1*tangent3 + row1]]}	<br />      r_ymm1 = _mm256_sub_ps(r_ymm4, r_ymm5);	<br />      //{[row7*tangent1 + row1] - [row5*tangent3 + row3]} + <br />      //{[row1*tangent1 - row7] - [row5 - [row1*tangent3 + row1]]}	<br />      r_ymm4 = _mm256_add_ps(r_ymm4, r_ymm5);<br />      //multiply by cos_4+1		<br />      r_ymm4 = _mm256_mul_ps(r_ymm4, cos_4);<br />      //row6*tangent2 + row2<br />      r_ymm7 = _mm256_add_ps(r_ymm7, row2);<br />      //row2*tangent2 - row6		<br />      r_ymm3 = _mm256_sub_ps(r_ymm3, row6);<br />      //multiply by cos_4+1		<br />      r_ymm0 = _mm256_mul_ps(r_ymm1, cos_4);<br />      //row0 + row4<br />      r_ymm5 = _mm256_add_ps(row0, row4);	<br />      //row0 - row4<br />      r_ymm6 = _mm256_sub_ps(row0, row4);	<br />      //[row0 + row4] - [row6*tangent2 + row2]		<br />      r_ymm2 = _mm256_sub_ps(r_ymm5, r_ymm7);<br />      //[row0 + row4] + [row6*tangent2 + row2]<br />      r_ymm5 = _mm256_add_ps(r_ymm5, r_ymm7);		<br />      //[row0 - row4] - [row2*tangent2 - row6]<br />      r_ymm1 = _mm256_sub_ps(r_ymm6, r_ymm3);<br />      //[row0 - row4] + [row2*tangent2 - row6]	<br />      r_ymm6 = _mm256_add_ps(r_ymm6, r_ymm3);		<br />      //[[row7*tangent1 + row1] + [row5*tangent3 + row3]] + <br />      //[[row0 + row4] + [row6*tangent2 + row2]]<br />      r_ymm7 = 
_mm256_add_ps(temp7, r_ymm5);		<br />			<br />      //Store row 0 results (store 1 of 8)<br />      _mm256_store_ps(pOutput, r_ymm7);<br />      //[[row0 - row4] + [row2*tangent2 - row6]] - <br />      //cos4*{[row7*tangent1 + row1] - [row5*tangent3 + row3]} +<br />      //{[row1*tangent1 - row7] - [row5 - [row1*tangent3 + row1]]}<br />      r_ymm3 = _mm256_sub_ps(r_ymm6, r_ymm4);<br />      //[[row0 - row4] + [row2*tangent2 - row6]] + <br />      //cos4*{[row7*tangent1 + row1] - [row5*tangent3 + row3]} + <br />      //{[row1*tangent1 - row7] - [row5 - [row1*tangent3 + row1]]}	<br />      r_ymm6 = _mm256_add_ps(r_ymm6, r_ymm4);		<br /><br />      //Store row 1 results (store 2 of 8)<br />      _mm256_store_ps(&amp;pOutput[1*8], r_ymm6); <br /><br />      r_ymm7 = _mm256_sub_ps(r_ymm1, r_ymm0);<br />      r_ymm1 = _mm256_add_ps(r_ymm1, r_ymm0);<br /><br />      r_ymm6 = _mm256_add_ps(r_ymm2, temp3);<br />      r_ymm2 = _mm256_sub_ps(r_ymm2, temp3);	 <br /><br />      r_ymm5 = _mm256_sub_ps(r_ymm5, temp7); <br /><br />      //Store final results<br />      _mm256_store_ps(&amp;pOutput[2*8], r_ymm1); <br />      _mm256_store_ps(&amp;pOutput[3*8], r_ymm6); <br />      _mm256_store_ps(&amp;pOutput[4*8], r_ymm2);<br />      _mm256_store_ps(&amp;pOutput[5*8], r_ymm7);<br />      _mm256_store_ps(&amp;pOutput[6*8], r_ymm3);<br />      _mm256_store_ps(&amp;pOutput[7*8], r_ymm5); <br /><br />//      IACA_END<br />      }	<br />   }<br />   long duration = getTimestamp() - startTime;<br />   cout &lt;&lt; "AVX Timestamp = " &lt;&lt; duration &lt;&lt; endl;<br />}<br /><br /><br /></pre>
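Each row pair in the listing above ends with the same three steps: reverse the differences with <code>_mm256_shuffle_ps(..., 0x1b)</code>, then assemble the two output rows with <code>_mm256_permute2f128_ps</code> using the immediates 0x20 and 0x31. The following is a minimal scalar sketch of what those immediates select. It is not part of the original listing: the names <code>Vec8</code>, <code>shuffle_ps_model</code>, <code>permute2f128_model</code> and <code>build_row_pair</code> are hypothetical, and the code models the documented lane behavior in plain C++ so that it can be compiled without AVX support.<br /><br />

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for __m256: elements 0-3 form the low 128-bit lane,
// elements 4-7 form the high 128-bit lane.
using Vec8 = std::array<float, 8>;

// Scalar model of _mm256_shuffle_ps(a, b, imm). The same 4-element selection
// is applied independently inside each 128-bit lane: the two low results come
// from a, the two high results come from b.
Vec8 shuffle_ps_model(const Vec8& a, const Vec8& b, unsigned imm) {
    assert(imm <= 0xFF);  // the immediate is an 8-bit value
    Vec8 dst{};
    for (int lane = 0; lane < 8; lane += 4) {
        dst[lane + 0] = a[lane + ((imm >> 0) & 3)];
        dst[lane + 1] = a[lane + ((imm >> 2) & 3)];
        dst[lane + 2] = b[lane + ((imm >> 4) & 3)];
        dst[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
    return dst;
}

// Scalar model of _mm256_permute2f128_ps(a, b, imm). Bits 3:0 control the low
// output lane and bits 7:4 the high output lane. Within each 4-bit field,
// bit 1 picks the source (a or b), bit 0 picks that source's low or high lane,
// and bit 3 zeroes the output lane.
Vec8 permute2f128_model(const Vec8& a, const Vec8& b, unsigned imm) {
    assert(imm <= 0xFF);  // the immediate is an 8-bit value
    Vec8 dst{};
    for (int half = 0; half < 2; ++half) {
        const unsigned ctrl = (imm >> (4 * half)) & 0xF;
        const Vec8& src = (ctrl & 2) ? b : a;
        const int base = (ctrl & 1) ? 4 : 0;
        for (int i = 0; i < 4; ++i) {
            dst[4 * half + i] = (ctrl & 8) ? 0.0f : src[base + i];
        }
    }
    return dst;
}

// Mirrors the tail of each row pair in the listing: 0x1b reverses the four
// differences inside each lane, 0x20 joins the two low lanes (first row of
// the pair), and 0x31 joins the two high lanes (second row of the pair).
void build_row_pair(const Vec8& sum, const Vec8& diff,
                    Vec8& row_lo, Vec8& row_hi) {
    const Vec8 reversed = shuffle_ps_model(diff, diff, 0x1b);
    row_lo = permute2f128_model(sum, reversed, 0x20);
    row_hi = permute2f128_model(sum, reversed, 0x31);
}
```

<br />In this model the low lanes of the sum and difference vectors carry the first row of a pair (for example row 0) and the high lanes carry the second (row 4), which is why one shuffle and two lane permutes are enough to produce two complete eight-float output rows.<br /><br />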
<h1 class="sectionHeading">References</h1>
The following documents are referenced in this application note and provide background or supporting information for the topics presented in this document.<br /><br /> 1. Pennebaker and Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993, pp. 29-64.<br /> 2. A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX™ Instructions, Intel Application Note, AP-922, Copyright 1999.<br /> 3. Using Streaming SIMD Extensions 2 (SSE2) to Implement an Inverse Discrete Cosine Transform, Intel Application Note, AP-945, Copyright 2000.<br /> 4. Rao and Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Inc., Boston, 1990, Appendix A.2.<br /> 5. IEEE Standard Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform, IEEE Std 1180-1990.<br /><br /><br />
<h1 class="sectionHeading">About the Author</h1>
Richard Hubbard is a Senior Software Engineer and member of the SSG Apple enabling team, working on optimizing Mac OS X* applications for power and performance. Richard holds a Master's degree in Electrical Engineering from Stevens Institute of Technology and a Bachelor's degree in Computer Engineering from New Jersey Institute of Technology. ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/</link>
      <pubDate>Mon, 07 Jun 2010 21:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform/</guid>
      <category>Intel® AVX</category>
    </item>
  </channel></rss>
