Performance Optimization

Solve Prefetch Performance Issues


Challenge

Avoid the performance penalties associated with excessive software prefetching. Prefetch instructions are not completely free in terms of bus cycles, machine cycles, and other resources, even though they require minimal clocks and memory bandwidth. Excessive prefetching can cause issue penalties in the front end of the machine and resource contention in the memory subsystem. The effect can be severe when the target loop is small or issue-bound.
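As an illustration, the overhead can be bounded by issuing one prefetch per cache line rather than one per iteration. A minimal sketch in C with SSE intrinsics, assuming an x86 target, a 64-byte cache line, and an arbitrarily chosen prefetch distance of 64 elements:

```c
#include <xmmintrin.h>  /* _mm_prefetch */

/* Sum an array, issuing at most one prefetch per 64-byte cache line
 * (16 floats) instead of one per element, limiting front-end issue
 * overhead.  Prefetch is a hint and never faults, so running past the
 * end of the array near the tail is architecturally safe. */
float sum_with_prefetch(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if ((i & 15) == 0)  /* first element of each cache line */
            _mm_prefetch((const char *)&a[i + 64], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```

The prefetch distance (64 elements here) should in practice be tuned to the loop's work per iteration and the memory latency being hidden.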

Deswizzle Data from SoA Format to AoS


Challenge

Rearrange (deswizzle) data from SoA (Structure of Arrays) format to AoS (Array of Structures) format. In the deswizzle operation, we want to take the separate xxxx, yyyy, and zzzz streams and store them in memory as interleaved xyz triplets.


Solution

Use the unpcklps/unpckhps instructions to regenerate the xyxy layout, then store each half (xy) into its corresponding memory location using movlps/movhps, followed by another movlps/movhps to store the z component. The following code illustrates the deswizzle function:
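A sketch of this sequence using C intrinsics rather than raw assembly (assuming an x86 target): _mm_unpacklo_ps/_mm_unpackhi_ps generate unpcklps/unpckhps, and _mm_storel_pi/_mm_storeh_pi generate movlps/movhps. Storing the z components with scalar writes is a simplification for clarity:

```c
#include <xmmintrin.h>

/* Deswizzle 4 vertices from SoA (x[4], y[4], z[4]) to AoS xyz triplets.
 * unpcklps/unpckhps rebuild the xy pairs; movlps/movhps store each
 * 8-byte xy half; z is stored with plain scalar writes here. */
void deswizzle(const float *x, const float *y, const float *z, float *out)
{
    __m128 vx = _mm_loadu_ps(x);             /* x0 x1 x2 x3 */
    __m128 vy = _mm_loadu_ps(y);             /* y0 y1 y2 y3 */

    __m128 xy_lo = _mm_unpacklo_ps(vx, vy);  /* x0 y0 x1 y1 (unpcklps) */
    __m128 xy_hi = _mm_unpackhi_ps(vx, vy);  /* x2 y2 x3 y3 (unpckhps) */

    _mm_storel_pi((__m64 *)&out[0], xy_lo);  /* x0 y0 (movlps) */
    out[2]  = z[0];
    _mm_storeh_pi((__m64 *)&out[3], xy_lo);  /* x1 y1 (movhps) */
    out[5]  = z[1];
    _mm_storel_pi((__m64 *)&out[6], xy_hi);  /* x2 y2 */
    out[8]  = z[2];
    _mm_storeh_pi((__m64 *)&out[9], xy_hi);  /* x3 y3 */
    out[11] = z[3];
}
```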

Data-Access Pattern Alignment & Contiguity on 32-Bit Intel® Architecture


Challenge

Ensure alignment and contiguity of data-access patterns. The new 64-bit packed data types defined by MMX™ technology and the 128-bit packed data types for Streaming SIMD Extensions and Streaming SIMD Extensions 2 create more potential for misaligned data accesses. The data-access patterns of many algorithms are inherently misaligned when using MMX technology and Streaming SIMD Extensions. Assembly code with an unaligned access is significantly slower than with an aligned access.
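For illustration, C11 alignas can guarantee the 16-byte alignment that lets SSE code use the aligned load (movaps) instead of the slower unaligned form. A minimal sketch, assuming an x86 target and a C11 compiler:

```c
#include <stdalign.h>    /* C11 alignas */
#include <xmmintrin.h>   /* SSE intrinsics */

/* 16-byte alignment makes the aligned movaps load (_mm_load_ps) legal;
 * without it, an aligned load from this buffer could fault. */
alignas(16) static float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Sum the first four lanes using an aligned load plus shuffles. */
float first_four_sum(void)
{
    __m128 v = _mm_load_ps(buf);                   /* safe: buf is 16-byte aligned */
    __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v)); /* [a+c, b+d, ..] */
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));    /* a+c + b+d in lane 0 */
    return _mm_cvtss_f32(t);
}
```

Heap allocations need the same care; `_mm_malloc`/`_mm_free` (or C11 `aligned_alloc`) provide aligned dynamic storage.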

Choose between Hardware and Software Prefetch on 32-Bit Intel® Architecture


Challenge

Determine the effectiveness of software-controlled versus hardware-controlled data prefetch for memory optimization. The Pentium® 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch.

Software-controlled prefetch is enabled using the four prefetch instructions introduced with Streaming SIMD Extensions (SSE): prefetcht0, prefetcht1, prefetcht2, and prefetchnta. These instructions are hints to bring a cache line of data into various levels and modes of the cache hierarchy.
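The four hints map directly to the _mm_prefetch intrinsic. A minimal sketch, assuming an x86 target:

```c
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* The four SSE prefetch hints.  Each is advisory: the processor may
 * drop it, and prefetching an invalid address never faults. */
void prefetch_hints(const void *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);  /* prefetcht0: all cache levels        */
    _mm_prefetch((const char *)p, _MM_HINT_T1);  /* prefetcht1: second level and higher */
    _mm_prefetch((const char *)p, _MM_HINT_T2);  /* prefetcht2: third level and higher  */
    _mm_prefetch((const char *)p, _MM_HINT_NTA); /* prefetchnta: non-temporal, minimize
                                                    cache pollution                     */
}
```

Because they are pure hints, the instructions have no architecturally visible effect on data; they only influence cache state.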

Cache Splits with Streaming SIMD Extensions 3 Instructions


Challenge

Avoid cache splits on 128-bit unaligned memory accesses with SSE3 instructions. Streaming SIMD Extensions 2 (SSE2) provides the MOVDQU instruction for loading memory from addresses that are not aligned on 16-byte boundaries. Code sequences that use MOVDQU frequently encounter situations where the source spans a 64-byte (cache-line) boundary. Loading from a memory address that spans a cache-line boundary causes a hardware stall and degrades software performance.
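SSE3 addresses this with the LDDQU instruction, which may load 32 bytes and shift internally so the access never splits a cache line. A sketch using the _mm_lddqu_si128 intrinsic, assuming an x86 target with SSE3 (the target attribute enables SSE3 code generation for this one function):

```c
#include <emmintrin.h>   /* SSE2: _mm_storeu_si128 */
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 (LDDQU) */

/* Copy 16 bytes from a potentially unaligned source.  LDDQU avoids the
 * cache-split stall that MOVDQU pays when the 16 bytes straddle a
 * 64-byte line. */
__attribute__((target("sse3")))
void copy16_lddqu(const void *src, void *dst)
{
    __m128i v = _mm_lddqu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}
```

Note that LDDQU is a load-only technique; it must not be used on memory that may be concurrently written, since the wider internal load can observe a torn view.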

Latency of Floating Point-to-Integer Conversions


Challenge

Minimize the latency associated with converting a floating-point number to a 32-bit integer on the Intel® Pentium® 4 and Intel Xeon® processors. This is a common task, which, according to the ANSI C/C++ definitions, should be handled by simply truncating the fractional portion of the number.
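On these processors the conversion can be done in a single SSE instruction, cvttss2si, which truncates toward zero directly and avoids the classic x87 sequence that changes the FPU rounding mode with fldcw, a costly operation. A minimal sketch via the _mm_cvtt_ss2si intrinsic, assuming an x86 target:

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_cvtt_ss2si */

/* Truncating float-to-int conversion (cvttss2si): matches the ANSI C
 * cast semantics without touching the x87 control word. */
int ftoi_trunc(float f)
{
    return _mm_cvtt_ss2si(_mm_set_ss(f));
}
```

With SSE enabled, modern compilers emit cvttss2si for a plain `(int)f` cast automatically; the intrinsic simply makes the choice explicit.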

Align Data Structures on Cache Boundaries


Challenge

Ensure that each synchronization variable is alone on a cache line. After padding synchronization structures to the size of a cache line, as discussed in a separate item on False Sharing, it is also necessary to ensure that those structures are aligned on a cache-line boundary.
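A minimal sketch in C11, assuming a 64-byte cache line: aligning the first member to 64 bytes both aligns the structure on a line boundary and pads its size to a full line, so an array of these places each variable on its own line:

```c
#include <stdalign.h>  /* C11 alignas */

/* One synchronization variable per 64-byte cache line.  The alignas on
 * the member forces both 64-byte alignment and a sizeof of 64, so no
 * two locks in an array ever share a line (no false sharing). */
typedef struct {
    alignas(64) volatile int lock;
} padded_lock_t;

static padded_lock_t locks[4];  /* each element starts on its own line */
```

For heap-allocated structures, C11 `aligned_alloc(64, sizeof(padded_lock_t))` (or `posix_memalign`) provides the same guarantee, since `malloc` alone does not promise cache-line alignment.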
