Memory cache

Create Cache-Data Blocks


Challenge

Take advantage of data-cache locality with cache-data blocking. Loops that iterate frequently over large data arrays should be restructured so that the large array is subdivided into smaller blocks, or tiles, each small enough to fit within the data cache. Every data element is then reused within its block before the loop moves on to the next block or tile.
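As a minimal sketch of the idea, assuming square row-major matrices and a caller-chosen tile edge, a blocked matrix multiply might look like the following. The names `matmul_blocked`, `min_size`, and `block` are hypothetical and do not come from the original article; a suitable tile size depends on the cache of the target processor.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes c += a * b for n x n row-major matrices, tile by tile.
   `block` is a hypothetical tile edge; pick it so that the tiles of
   a, b, and c being worked on fit in the data cache together. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block)
        for (size_t kk = 0; kk < n; kk += block)
            for (size_t jj = 0; jj < n; jj += block)
                /* All accesses below stay inside one tile of each matrix. */
                for (size_t i = ii; i < min_size(ii + block, n); ++i)
                    for (size_t k = kk; k < min_size(kk + block, n); ++k) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < min_size(jj + block, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

The caller is expected to zero `c` first, since the routine accumulates into it. The three outer loops walk the tiles, and the three inner loops reuse the elements of the current tiles while they are still cached.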

  • Memory cache
  • performance optimization
  • How-To
  • Parallel computing
  • Align and Organize Data for Better Performance


    Challenge

    Minimize performance losses due to unaligned data. Unaligned data can cause serious performance problems. Focus on the data elements used in the most CPU-intensive parts of your program.


    Solution

    Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. For best performance, align data as follows:
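Separately from the article's own alignment list, the 16-byte case mentioned above can be sketched in C11 as follows. This assumes a toolchain that provides `alignas` and `aligned_alloc`; the names `VEC_ALIGN`, `coeffs`, `is_vec_aligned`, and `alloc_vec_floats` are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 16  /* the 16-byte boundary named in the text */

/* A statically allocated buffer placed on a 16-byte boundary. */
alignas(VEC_ALIGN) float coeffs[8];

/* Returns nonzero when p sits on a VEC_ALIGN-byte boundary. */
int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}

/* Allocates room for at least `count` floats on a 16-byte boundary.
   The size is rounded up because C11 aligned_alloc expects a multiple
   of the alignment. Release the result with free(). */
float *alloc_vec_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    bytes = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (bytes == 0)
        bytes = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, bytes);
}
```

A check such as `is_vec_aligned` is cheap enough to leave in debug builds as an assertion in front of code that uses aligned vector loads and stores.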

  • Memory cache
  • performance optimization
  • How-To
  • Parallel computing
  • Optimize Prefetch on 32-Bit Intel® Architecture


    Challenge

    Optimize the use of prefetches in code for the Intel® Pentium® 4 processor and the Pentium® M processor. The performance of most applications can be considerably improved if the data they require can be fetched from the processor caches, rather than from main memory.

  • Memory cache
  • performance optimization
  • How-To
  • Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture


    Challenge

    Improve memory utilization by means of strip mining. Strip mining, also known as loop sectioning, is a loop transformation that enables SIMD encodings of loops and also provides a means of improving memory performance. First introduced for vectorizers, the technique generates code in which each vector operation is performed on a size less than or equal to the maximum vector length of the given vector machine. Consider the following pseudocode, as it exists before strip mining:
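Independently of the article's own pseudocode, the transformation can be sketched in C as a plain loop next to its strip-mined form. The strip length `STRIP` (four floats, as in one 128-bit SSE register) and the function names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define STRIP 4  /* hypothetical vector length: four floats per strip */

/* Before strip mining: one loop over the whole array. */
void saxpy_plain(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* After strip mining: the outer loop steps strip by strip, and the
   inner loop never covers more than STRIP elements, so each inner
   loop maps onto a single vector operation. */
void saxpy_strip_mined(float a, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t is = 0; is < n; is += STRIP) {
        size_t len = (n - is < STRIP) ? n - is : STRIP;  /* short last strip */
        for (size_t i = is; i < is + len; ++i)
            y[i] += a * x[i];
    }
}
```

Both functions compute the same result; only the loop structure differs. The same outer/inner split is what allows a larger strip to be sized to the cache rather than to the vector register when the goal is memory performance.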

  • Memory cache
  • performance optimization
  • How-To
  • Solve Prefetch Performance Issues


    Challenge

    Avoid the performance penalties associated with excessive software prefetching. Prefetch instructions are not free: although each one needs few clock cycles and little memory bandwidth, it still consumes bus cycles, machine cycles, and other resources. Excessive prefetching may therefore cause issue penalties in the front end of the machine, resource contention in the memory subsystem, or both. The effect can be severe when the target loop is small or issue-bound.
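One common way to limit the number of prefetches, shown here only as a sketch, is to issue a single prefetch per cache line instead of one per element. The code assumes GCC or Clang (for `__builtin_prefetch`) and 64-byte cache lines; `LINE_FLOATS`, `AHEAD`, and `sum_with_prefetch` are hypothetical names, and the prefetch distance would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_FLOATS 16            /* assumes 64-byte cache lines */
#define AHEAD (4 * LINE_FLOATS)   /* hypothetical prefetch distance */

/* Sums n floats, issuing one prefetch per cache line of input. */
float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
    for (; i + LINE_FLOATS <= n; i += LINE_FLOATS) {
        if (i + AHEAD < n)
            __builtin_prefetch(&data[i + AHEAD], 0, 3);  /* read, keep in cache */
        /* The inner loop consumes one line without further prefetches. */
        for (size_t k = 0; k < LINE_FLOATS; ++k)
            total += data[i + k];
    }
    for (; i < n; ++i)  /* remaining elements */
        total += data[i];
    return total;
}
```

Compared with a prefetch inside the innermost loop, this issues one sixteenth as many prefetch instructions for the same data, which reduces pressure on the front end in small loops.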

  • Memory cache
  • performance optimization
  • How-To
  • Parallel computing
  • Deswizzle Data from SoA Format to AoS


    Challenge

    Rearrange (deswizzle) data from SoA (Structure of Arrays) format to AoS (Array of Structures) format. The goal of the deswizzle operation is to take data held as xxxx, yyyy, zzzz and store it in memory as xyz.


    Solution

    Use the unpcklps/unpckhps instructions to regenerate the xyxy layout and then store each half (xy) into its corresponding memory location using movlps/movhps followed by another movlps/movhps to store the z component. The following code illustrates the deswizzle function:
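The article's own listing is not reproduced here. As a rough stand-in, the xy part of the sequence can be sketched with SSE intrinsics: `_mm_unpacklo_ps`/`_mm_unpackhi_ps` correspond to unpcklps/unpckhps, and `_mm_storel_pi`/`_mm_storeh_pi` correspond to movlps/movhps. For brevity this sketch writes the z component with plain scalar stores, which differs from the movlps/movhps approach described above. The function name `deswizzle4` and the scalar fallback for builds without SSE are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Converts four points from SoA (x[4], y[4], z[4]) to AoS
   (out[12] = x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3). */
void deswizzle4(const float *x, const float *y, const float *z, float *out)
{
    assert(x != NULL && y != NULL && z != NULL && out != NULL);
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    __m128 xy_lo = _mm_unpacklo_ps(vx, vy);    /* unpcklps: x0 y0 x1 y1 */
    __m128 xy_hi = _mm_unpackhi_ps(vx, vy);    /* unpckhps: x2 y2 x3 y3 */
    _mm_storel_pi((__m64 *)(out + 0), xy_lo);  /* movlps: x0 y0 */
    _mm_storeh_pi((__m64 *)(out + 3), xy_lo);  /* movhps: x1 y1 */
    _mm_storel_pi((__m64 *)(out + 6), xy_hi);  /* movlps: x2 y2 */
    _mm_storeh_pi((__m64 *)(out + 9), xy_hi);  /* movhps: x3 y3 */
#else
    /* Scalar fallback for targets without SSE. */
    for (size_t i = 0; i < 4; ++i) {
        out[3 * i] = x[i];
        out[3 * i + 1] = y[i];
    }
#endif
    /* Simplification: z is stored with scalar writes. */
    for (size_t i = 0; i < 4; ++i)
        out[3 * i + 2] = z[i];
}
```

The 8-byte stores land at float offsets 0, 3, 6, and 9, so they never overlap the z slots at offsets 2, 5, 8, and 11.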

  • Memory cache
  • performance optimization
  • How-To
  • Parallel computing
  • Data-Access Pattern Alignment & Contiguity on 32-Bit Intel® Architecture


    Challenge

    Ensure alignment and contiguity of data-access patterns. The 64-bit packed data types defined by MMX™ technology and the 128-bit packed data types for Streaming SIMD Extensions and Streaming SIMD Extensions 2 create more potential for misaligned data accesses. The data-access patterns of many algorithms are inherently misaligned when using MMX technology and Streaming SIMD Extensions. Code that performs unaligned accesses is significantly slower than code whose accesses are aligned.

  • Memory cache
  • performance optimization
  • How-To
  • Choose between Hardware and Software Prefetch on 32-Bit Intel® Architecture


    Challenge

    Determine the effectiveness of software-controlled versus hardware-controlled data prefetch for memory optimization. The Pentium® 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch.

    Software-controlled prefetch is enabled using the four prefetch instructions introduced with the Streaming SIMD Extensions (SSE). These instructions are hints to bring a cache line of data into various levels and modes of the cache hierarchy.
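A hedged sketch of a software-controlled prefetch in C follows. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` takes a compile-time locality hint from 0 (no temporal locality) to 3 (keep in all cache levels); on x86 targets the compiler typically lowers these hints to the SSE prefetch instructions. The function name `scale_stream` and the distance `AHEAD` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scales a stream that is read once and not revisited, so the
   prefetch hint asks for no temporal locality (locality 0). */
void scale_stream(float *dst, const float *src, size_t n, float k)
{
    enum { AHEAD = 64 };  /* hypothetical prefetch distance, in elements */
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* One hint per 16 floats (64 bytes), and only within bounds. */
        if (i % 16 == 0 && i + AHEAD < n)
            __builtin_prefetch(&src[i + AHEAD], 0 /* read */, 0 /* locality */);
        dst[i] = k * src[i];
    }
}
```

Because prefetches are only hints, the result is the same with or without them; whether they help compared with the automatic hardware prefetcher has to be measured on the target processor.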

  • Memory cache
  • performance optimization
  • How-To