<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Fri, 25 May 2012 11:17:58 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network Comments Feed</title>
    <link>http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>By thiamchunkoh</title>
      <description><![CDATA[ Some matrix multiplication algorithms use the cache more effectively than the algorithm that might be more intuitive to a mathematician or physicist. No doubt, the intuitive triply-nested loop is the preferred solution of many software engineers. Yet although the alternative algorithm accessed all of the same memory addresses the same number of times as the intuitive algorithm, it caused fewer cache misses. The original function caused the computer to spend more time loading and storing cache lines than executing the program.

A computer's cache is divided into lines. When your CPU accesses a memory address that isn't in the cache, it fetches a whole line from the next level out rather than a single word. This is a slow process, but if subsequent accesses to memory are nearby, there is a high probability that what the CPU needs is already in the cache. However, certain structures are unlikely to fit in the cache all at once, or at least they may be spread across many lines.

Of the two functions in question, the former is used to multiply a matrix by a vector, and the latter apparently was copied and pasted with the indices reversed to multiply a matrix transpose by a vector. Sequential accesses in the first function point to adjacent values in memory, so accesses are quick. The only time it misses the cache (on matrix accesses) is when it is done with the old line and will never load it again.
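
The two functions themselves are not reproduced in this comment, so the following is only a rough sketch of what such a pair might look like. The names mat_vec and mat_transpose_vec_strided are hypothetical, and the matrix is assumed to be stored row-major in one contiguous array:

```c
#include <stddef.h>

/* Hypothetical sketch. Assumes a row-major matrix in one contiguous
   array: element (i, j) lives at m[i * cols + j]. */

/* y = M * x. The inner loop walks along one row, so consecutive
   accesses to m are adjacent in memory. */
void mat_vec(const double *m, const double *x, double *y,
             size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; j++)
            sum += m[i * cols + j] * x[j];
        y[i] = sum;
    }
}

/* y = M^T * x, written by reversing the indices. The inner loop now
   walks down a column, so consecutive accesses to m are cols
   elements apart. */
void mat_transpose_vec_strided(const double *m, const double *x, double *y,
                               size_t rows, size_t cols)
{
    for (size_t j = 0; j < cols; j++) {
        double sum = 0.0;
        for (size_t i = 0; i < rows; i++)
            sum += m[i * cols + j] * x[i];
        y[j] = sum;
    }
}
```

Both functions touch every matrix element exactly once; only the order of the accesses differs.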

In the second function, rather than iterating through sequential addresses, every access is on a different line. In short, every access misses the cache. Furthermore, by the time the outer loop completes the first iteration, the second iteration's memory accesses are looking for lines that have long since been flushed. In other words, where the first function pulls each line into the cache once, the second pulls each line into the cache for every iteration of the outer loop.
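
One common way to avoid that strided walk is to interchange the loops so the matrix is again read in memory order. This is only a sketch under the same row-major assumption, with a hypothetical function name; it is not necessarily the change that was made to MikeNet:

```c
#include <stddef.h>

/* Hypothetical sketch: y = M^T * x with the loops interchanged.
   Assumes a row-major matrix, element (i, j) at m[i * cols + j].
   y must have room for cols entries. */
void mat_transpose_vec_rowwise(const double *m, const double *x, double *y,
                               size_t rows, size_t cols)
{
    /* Start every output entry at zero; each row adds its share. */
    for (size_t j = 0; j < cols; j++)
        y[j] = 0.0;

    for (size_t i = 0; i < rows; i++) {
        const double xi = x[i];
        const double *row = m + i * cols;
        /* The inner loop reads row[0], row[1], ... in memory order. */
        for (size_t j = 0; j < cols; j++)
            y[j] += row[j] * xi;
    }
}
```

The trade-off is that y is now revisited on every pass of the outer loop, but y is a single vector and is far more likely than the whole matrix to stay in the cache.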

With this in mind, a simple modification to MikeNet achieved a ~3.4x improvement to the overall program on a roughly 500x500 matrix without using Cilk++ at all!

Identifying parallelism is a key part of program analysis, but it isn't the only thing you should consider when thinking about performance. Especially for large data structures (e.g., matrices), taking some time to look critically at how they interact with your computer's cache is just as significant. ]]></description>
      <link>http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times/#comment-37656</link>
      <pubDate>Sun, 03 Jan 2010 16:48:44 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times/#comment-37656</guid>
    </item>
    <item>
      <title>By Twitter Trackbacks for Making Your Cache Go Further in These Troubled Times - Intel® Software Network [intel.com] on Topsy.com</title>
      <description><![CDATA[ n/a ]]></description>
      <link>http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times/#comment-53135</link>
      <pubDate>Wed, 08 Dec 2010 08:03:32 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/making-your-cache-go-further-in-these-troubled-times/#comment-53135</guid>
    </item>
  </channel></rss>
