<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Fri, 25 May 2012 07:09:51 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network Comments Feed</title>
    <link>http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>By Rongqiu Yang</title>
      <description><![CDATA[ Good, I am also doing BLAS optimization, not on an Intel processor but on a Loongson processor. ]]></description>
      <link>http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-23675</link>
      <pubDate>Tue, 05 May 2009 01:25:46 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-23675</guid>
    </item>
    <item>
      <title>By Sarnath</title>
      <description><![CDATA[ In the code snippet below, I am assuming "vmm0" stands for "ymm0" (typo error, I guess).
But the article claims that the "mul" and "add" will finish in 1 cycle (2 separate pipelines).
However, there is a RAW dependency there (the add reads the result of the mul). Does it finish in 1 cycle in spite of the RAW dependency? Can someone shed some light? Thanks!

"vmovapd ymm0, [rax] 
vbroadcastsd ymm1, [rbx] 
vmulpd ymm0, ymm0, ymm1 
vaddpd ymm2, ymm2, vmm0 
"

Best Regards,
Sarnath, HCL Tech ]]></description>
      <link>http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-33555</link>
      <pubDate>Thu, 29 Oct 2009 23:54:17 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-33555</guid>
    </item>
    <item>
      <title>By heinz</title>
      <description><![CDATA[ Thank you for this interesting article. 
I linked it in our developer forums at Lunatics (http://lunatics.kwsn.net/index.php) under "Re: FFT's and other important formulas",
« Reply #10 on: 03 May 2009, 05:31:13 pm »,
in the closed area (still open for developers).

Regards ]]></description>
      <link>http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-33783</link>
      <pubDate>Sun, 01 Nov 2009 16:40:30 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-33783</guid>
    </item>
    <item>
      <title>By Nick</title>
      <description><![CDATA[ I've found register-level blocking to be very useful as well. Why do Intel processors provide such a small number of registers? I'm wondering what limits us to only 16 registers rather than a much larger number like 64 or 256. You said that the processors current as of this article can do 2 floating-point operations per cycle (I'm assuming you're referring to the superscalar nature of these processors), and I assume that in the future we will be able to do 4 or more operations per cycle. In that case, having such a limited amount of register "L0" cache is a huge limitation, because of the relative number of loads we'd have to do.

Also, the trend has been to create larger registers for computation (AVX vs. SSE). Is it possible to design instructions that operate on a range of registers instead? Something like "multiply all the registers from xmm0 to xmm7 with all the registers from xmm8 to xmm15 and store the results in xmm0 to xmm7". This seems like it would speed up certain computations dramatically. It would essentially be like a large SIMD instruction with the length specified by the user. ]]></description>
      <link>http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-71441</link>
      <pubDate>Mon, 12 Mar 2012 09:05:49 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine/#comment-71441</guid>
    </item>
  </channel></rss>
