Increasing Memory Throughput With Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Streaming Load

by Ashish Jha and Darren Yee
Intel Software and Solutions Group
April 2007


Abstract

Intel® SSE4 is a new set of Single Instruction Multiple Data (SIMD) instructions that will be introduced in the 45nm Next Generation Intel® Core™2 Processor Family (Penryn) and will improve the performance and energy efficiency of a broad range of applications.

Intel SSE4 includes the MOVNTDQA instruction for "streaming loads" from devices mapped to Uncacheable Speculative Write Combining (USWC) memory to the CPU. Streaming loads enhance the ability to read from USWC-mapped I/O devices, providing speed-ups of over 7.5x compared to traditional loads (such as MOVDQA).

This white paper provides an overview of the streaming load instruction, describes performance benefits of streaming loads, and provides guidelines for utilizing streaming loads to optimize applications.


Introduction

Memory-mapped I/O devices such as graphics and video devices are typically mapped as Uncacheable Speculative Write Combining (USWC) memory. USWC memory is an extension of the Uncacheable (UC) memory type: like UC memory, its data is not stored in any of the processor caches, but it is intended for uncacheable data that is typically subject to sequential write operations, such as frame buffer memory. Write combining allows all writes to a cache line to be combined before being finally written out to memory. [1]

Streaming SIMD Extensions 2 (SSE2) introduced the MOVNTDQ instruction for "streaming writes", which improves write throughput by streaming non-cacheable writes to devices mapped to USWC memory. The streaming write instruction explicitly signals to the processor that the data is intended to be written directly to external memory and not cached in any level of the processor cache hierarchy. Streaming writes allow all writes to the same cache line (typically 64 bytes) to be combined before going out to memory, as opposed to writing to memory in 16-byte chunks. This allows streaming writes to substantially improve the write throughput to these devices.
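For reference, a streaming write sequence can also be expressed through the SSE2 compiler intrinsic _mm_stream_si128 (declared in emmintrin.h). The sketch below is illustrative, not code from this paper; dst is an assumed parameter pointing to a 16-byte aligned USWC region:

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_setzero_si128, _mm_sfence */

/* Sketch: stream one 64-byte cache line of zeros to USWC memory.
   dst is an illustrative parameter, assumed to be 16-byte aligned. */
void stream_write_line(__m128i *dst)
{
    __m128i zero = _mm_setzero_si128();
    _mm_stream_si128(dst + 0, zero);  /* writes to the same line combine */
    _mm_stream_si128(dst + 1, zero);  /* in the write-combining buffer   */
    _mm_stream_si128(dst + 2, zero);
    _mm_stream_si128(dst + 3, zero);
    _mm_sfence();  /* make the combined writes globally visible */
}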

Although the USWC memory type is primarily employed for memory subject to write operations (such as frame buffer memory), load operations from USWC memory also occur. Sophisticated graphics devices can render efficiently while the CPU is busy with other operations; once rendering is done, the CPU loads the data for further processing and passes it off for final display. However, SSE load instructions operate on at most 16-byte chunks and have limited throughput when accessing USWC memory: each 16-byte load requires two separate front-side bus transactions with a total duration of four bus clocks.

Intel® Streaming SIMD Extensions 4 (Intel® SSE4) introduces the MOVNTDQA instruction for "streaming loads", which improves read throughput by streaming non-cacheable reads from USWC memory. As with streaming writes, the streaming load instruction allows data to be moved in full cache line quantities, substantially improving bus transaction efficiency and the resulting data throughput. Compared to traditional 16-byte loads (such as MOVDQA), MOVNTDQA streaming loads enhance the ability to read from USWC-mapped I/O devices, providing over 7.5x speed-up.


Streaming Load Instruction

The streaming load (MOVNTDQA) instruction loads a 16-byte chunk from an aligned cache line region (a streaming cache line) of USWC memory. Rather than fetching only the requested 16-byte chunk, the instruction fetches the complete 64-byte cache line. This cache line is not stored in any of the processor caches; instead, it is held in a small set of temporary "streaming load buffers." Because the complete 64-byte line is fetched and temporarily held in a streaming load buffer, subsequent loads from the same line address can be supplied directly from the streaming load buffer. These subsequent loads occur much faster and improve throughput, as shown in the example below.

 

; This load retrieves a full cache line, which is stored in a temporary
; streaming load buffer.
; USWC_Memory is the start address of system-allocated memory of type USWC.
MOVNTDQA xmm0, USWC_Memory+0
; Subsequent 16-byte loads from the same cache line are supplied from the
; streaming load buffer and occur much faster.
MOVNTDQA xmm1, USWC_Memory+16
MOVNTDQA xmm2, USWC_Memory+32
MOVNTDQA xmm3, USWC_Memory+48


Figure 2-1. Loading a Full Cache Line using MOVNTDQA
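The same sequence can be written in C through the Intel SSE4 compiler intrinsic _mm_stream_load_si128 (declared in smmintrin.h). The sketch below is illustrative; uswc_mem and out are assumed names, and uswc_mem is assumed to point to 16-byte aligned USWC memory:

#include <smmintrin.h>  /* Intel SSE4: _mm_stream_load_si128 */

/* Sketch: load one full 64-byte cache line from USWC memory.
   The first intrinsic fills a streaming load buffer with the whole
   line; the next three are supplied directly from that buffer. */
void stream_load_line(__m128i *uswc_mem, __m128i out[4])
{
    out[0] = _mm_stream_load_si128(uswc_mem + 0);
    out[1] = _mm_stream_load_si128(uswc_mem + 1);
    out[2] = _mm_stream_load_si128(uswc_mem + 2);
    out[3] = _mm_stream_load_si128(uswc_mem + 3);
}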


For streaming across multiple line addresses, the loads of all four chunks of a given line should be batched together for better performance. The chunks (each a discrete MOVNTDQA instruction) do not have to be loaded in the order shown in the example above. However, a streaming load of a chunk allocates a new streaming load buffer if one does not already exist for its line. Because a micro-architectural implementation has only a finite number of streaming load buffers, grouping the loads of a line together improves overall buffer utilization.

Below are additional guidelines that should be followed when using the streaming load instruction:

  • Streaming loads must be 16-byte aligned.
  • Streaming loads from the same cache line should be grouped together and not be interleaved with:
    • Writes or non-streaming loads
    • Streaming loads from other cache lines (strided accesses)
  • Avoid using streaming load to re-read a given 16-byte chunk from the same cache line. This may cause the cache line to be re-fetched and can impact streaming load performance.

 

The streaming load instruction is intended to accelerate data transfers from the USWC memory type. For other memory types such as cacheable (WB) or Uncacheable (UC), the instruction behaves as a typical 16-byte MOVDQA load instruction. However, future processors may use the streaming load instruction for other memory types (such as WB) as a hint that the intended cache line should be streamed from memory directly to the core while minimizing cache pollution.


Programming Models of Streaming Loads

There are two common programming models for using streaming loads in an application. These models differ in when the data retrieved via streaming loads is operated on. Below is a summary of each programming model, followed by an example and implementation guidelines for the preferred programming model, Bulk Load and Operate.

  • Bulk Load and Operate
    In this model, the application would load the data using streaming loads and copy the data to a temporary cacheable WB buffer (i.e. bulk load). After all data has been loaded and copied, the CPU would operate on the temporary buffer and finally send the data back to memory.
  • Incremental Load and Operate
    In this model, the application would load a single cache line using streaming load, operate on the data, and then write it back to memory. This model operates on the data as it is loaded, as opposed to loading a large amount of data first (as in the Bulk Load and Operate model).
    The need for this programming model could arise when there is a producer-consumer relationship between CPU and the memory mapped I/O device. For example, for graphics devices mapped to USWC memory, the CPU could use the streaming write operation to pass data to the GPU for rendering, use the streaming load operation to get the rendered data from the GPU, and finally use the streaming write operation to pass the final data back to the GPU for display.

 

The Bulk Load and Operate programming model is preferred because it will generate more consistent performance gains. In the Incremental Load and Operate programming model, the data operations that are performed between streaming loads could potentially interfere with streaming load performance (due to contention for streaming load buffers and other processor resources). The Bulk Load and Operate programming model reduces this possibility by performing data loading and data operation in separate batches.


Bulk Load and Operate Implementation

Figure 3-1 shows a sample implementation of the Bulk Load and Operate programming model. This section will walk through the implementation and provide additional guidelines for implementing this model in your application.

 

; Initialize pointers to start of the USWC memory
mov esi, Pointer_to_USWC_memory
mov edx, Pointer_to_USWC_memory

; Initialize pointer to end of the USWC memory
add edx, Memory_Length

; Initialize pointer to start of the cacheable WB buffer
mov edi, Pointer_to_WB_memory

; Start of Bulk Load loop
inner_start:
; Load data from USWC Memory using Streaming Load
MOVNTDQA xmm0, xmmword ptr [esi]
MOVNTDQA xmm1, xmmword ptr [esi+16]
MOVNTDQA xmm2, xmmword ptr [esi+32]
MOVNTDQA xmm3, xmmword ptr [esi+48]

; Copy data to buffer
MOVDQA xmmword ptr [edi], xmm0
MOVDQA xmmword ptr [edi+16], xmm1
MOVDQA xmmword ptr [edi+32], xmm2
MOVDQA xmmword ptr [edi+48], xmm3

; Increment pointers by cache line size and test for end of loop
add esi, 040h
add edi, 040h
cmp esi, edx
jne inner_start

; End of Bulk Load loop

; Bulk load completed. Now operate on data in buffer . . . 


Figure 3-1. Sample Implementation of Bulk Load and Operate Programming Model


Initialization Before Bulk Load

The initialization block at the top of Figure 3-1 sets up the pointers used for the Bulk Load. ESI and EDX point to the beginning and end of the USWC memory that will be loaded. Memory_Length represents the amount of memory being copied, which should be a multiple of 64 bytes for optimal performance.

EDI points to the cacheable WB buffer where the data loaded from USWC memory will be copied. It is important that this buffer be stored in the first level cache of the processor. Using higher level caches can cause contention for streaming load buffers and other processor resources, which could impact streaming load performance.

To ensure that the WB buffer is stored in the first level cache, follow the steps below (a C sketch of both steps appears after the list):

  • Allocate an aligned WB buffer using aligned malloc. The size of the buffer should be at most 4KB and a multiple of 64 bytes:
    void * _aligned_malloc(size_t size, size_t alignment)
  • Clear the memory with memset, which will bring the buffer to the first level cache:
    void * memset (void* ptr, int value, size_t num)
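As an illustrative C sketch of these two steps (alloc_wb_buffer and WB_BUFFER_SIZE are hypothetical names; _aligned_malloc is the Microsoft CRT allocator declared in malloc.h):

#include <malloc.h>   /* _aligned_malloc (Microsoft CRT) */
#include <string.h>   /* memset */

#define WB_BUFFER_SIZE 4096  /* at most 4KB and a multiple of 64 bytes */

/* Allocate a 64-byte aligned WB buffer and warm it with memset so
   that it is resident in the first level cache. */
void *alloc_wb_buffer(void)
{
    void *buf = _aligned_malloc(WB_BUFFER_SIZE, 64);
    if (buf != NULL)
        memset(buf, 0, WB_BUFFER_SIZE);  /* brings the buffer into L1 */
    return buf;
}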

 

Bulk Load and Copy

The Bulk Load loop loads the data using streaming loads and copies it to the cacheable WB buffer. The four MOVNTDQA instructions perform the streaming loads: the first allocates a streaming load buffer and loads the complete 64-byte cache line into it, and the next three get their data directly from the streaming load buffer. Since each MOVNTDQA loads 16 bytes, four are needed to load one cache line.

This example loads just one cache line per loop iteration. However, if additional registers are available (i.e. xmm4–xmm7, or xmm8–xmm15 with Intel® Extended Memory 64 Technology), multiple cache lines could be loaded per loop iteration.

The streaming loads in this example are done sequentially by memory offset (i.e. +0, +16, +32, +48). Streaming loads do not have to be ordered in this manner. However, all streaming loads from a single cache line should be grouped together, and re-loads of a given chunk should be avoided. This ensures optimal utilization of streaming load buffers.

The four MOVDQA instructions copy the data to the cacheable WB buffer. Data is copied one cache line per loop iteration, and the copy happens only after the full cache line has been loaded. Loads and copies should be done separately and not interleaved. Interleaving (also known as striding) loads and copies (i.e. load and copy chunk 1, then load and copy chunk 2) is not recommended because it can cause contention for streaming load buffers and other processor resources, which could impact streaming load performance.

The instructions at the bottom of the loop increment the pointers and test whether the entire USWC memory segment has been loaded.
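For reference, the same Bulk Load loop can be sketched in C using compiler intrinsics. The function and parameter names below are illustrative, and this is not the code used for the measurements in the next section; len is assumed to be a multiple of 64 and both pointers 16-byte aligned:

#include <stddef.h>     /* size_t */
#include <smmintrin.h>  /* Intel SSE4: _mm_stream_load_si128 */

/* Sketch: stream each 64-byte line from USWC memory into a cacheable
   WB buffer; the application operates on the WB buffer afterwards. */
void bulk_load(__m128i *uswc_mem, __m128i *wb_buf, size_t len)
{
    size_t line;
    for (line = 0; line < len / 64; line++) {
        /* Group all four streaming loads of a line together; the first
           fills a streaming load buffer, the next three hit it */
        __m128i c0 = _mm_stream_load_si128(uswc_mem + 0);
        __m128i c1 = _mm_stream_load_si128(uswc_mem + 1);
        __m128i c2 = _mm_stream_load_si128(uswc_mem + 2);
        __m128i c3 = _mm_stream_load_si128(uswc_mem + 3);

        /* Copy the full line only after it has been loaded; loads and
           copies are not interleaved */
        _mm_store_si128(wb_buf + 0, c0);
        _mm_store_si128(wb_buf + 1, c1);
        _mm_store_si128(wb_buf + 2, c2);
        _mm_store_si128(wb_buf + 3, c3);

        uswc_mem += 4;  /* advance by one 64-byte cache line */
        wb_buf   += 4;
    }
}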


Streaming Load Performance

To measure the performance of streaming loads from local system memory, four implementations of the “Bulk Load and Operate” programming model were used to load 4KB of data from USWC memory and copy the data to a cacheable WB buffer. One implementation did not use streaming loads (i.e. memory loads were performed using the MOVDQA instruction). Another implementation used streaming loads (i.e. memory loads were performed using the MOVNTDQA instruction). The other two implementations were dual-threaded versions that split the USWC segment into two parts, and performed the load and copy in two separate threads (each thread bound to a separate core). See Appendix A for the source code listing.
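The dual-threaded versions are not listed in Appendix A; a minimal sketch of the splitting scheme using Win32 threads follows. The names are illustrative, bulk_load refers to the sketch in the previous section, and this is not the measured code:

#include <windows.h>
#include <emmintrin.h>  /* __m128i */

extern void bulk_load(__m128i *uswc_mem, __m128i *wb_buf, size_t len);

typedef struct { __m128i *src, *dst; size_t len; } half_t;

static DWORD WINAPI load_half(LPVOID arg)
{
    half_t *h = (half_t *)arg;
    bulk_load(h->src, h->dst, h->len);
    return 0;
}

/* Split the USWC segment in two and load each half on its own core.
   len is in bytes; len/32 is the element offset of the second half
   (len/2 bytes divided by 16 bytes per __m128i). */
void dual_threaded_load(__m128i *uswc, __m128i *wb, size_t len)
{
    half_t h0 = { uswc,            wb,            len / 2 };
    half_t h1 = { uswc + len / 32, wb + len / 32, len / 2 };
    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, load_half, &h0, 0, NULL);
    t[1] = CreateThread(NULL, 0, load_half, &h1, 0, NULL);
    SetThreadAffinityMask(t[0], 1);  /* bind each thread to one core */
    SetThreadAffinityMask(t[1], 2);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    CloseHandle(t[0]);
    CloseHandle(t[1]);
}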

For each implementation, the 4KB load-and-copy loop was executed for a large number (~10,000) of iterations, and the average memory throughput was calculated from the total time required to execute those iterations. [2]
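As a worked sketch of the throughput formula in [2] (the 3 GHz frequency shown is an illustrative value, not the test system's measured configuration):

/* Memory Throughput = (Processor Frequency * Number of Iterations *
   Data Copied Per Iteration) / Total Execution Time in clock cycles */
double throughput_bytes_per_sec(double total_clock_cycles)
{
    const double freq_hz        = 3.0e9;    /* illustrative frequency */
    const double iterations     = 10000.0;  /* ~10,000 loop iterations */
    const double bytes_per_iter = 4096.0;   /* 4KB copied per iteration */

    return (freq_hz * iterations * bytes_per_iter) / total_clock_cycles;
}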

Figure 4-1 shows the memory throughput that was achieved in each implementation. In the single-threaded implementation, utilizing streaming loads increased memory throughput by over 5x. In the dual-threaded implementation, utilizing streaming loads increased memory throughput by over 7.5x.


Figure 4-1. Memory Throughput from System Memory Using Streaming Load [3]
NOTE: Str Ld = Implementation uses Streaming Load

Figure 4-1 also shows the theoretical peak throughput of the test system (1067 MT/s front-side bus × 8 bytes per transfer = 8.53 GB/s). The dual-threaded implementation comes very close to the theoretical peak, while the single-threaded implementation reaches about half of it. This is due to the limited number of streaming load buffers available to each core. By utilizing both cores, the dual-threaded implementation has access to more streaming load buffers and achieves nearly double the throughput.


Streaming Load Performance with Contention for Streaming Load Buffers

Section 3.1 contained several recommendations to prevent contention for streaming load buffers and other processor resources, which could impact streaming load performance (e.g. storing the cacheable WB buffer in the first level cache, avoiding strided loads and copies). To demonstrate streaming load performance with contention for streaming load buffers and other processor resources, the single-threaded streaming load implementation used in the previous section was modified to store the cacheable WB buffer in the second level cache of the processor. [4]
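A sketch of the modification described in [4] follows; THRASH_SIZE and the function name are illustrative:

#include <stddef.h>

#define THRASH_SIZE (64 * 1024)  /* 64KB, larger than the L1 data cache */

/* Read one byte per cache line of a 64KB array so the WB buffer is
   evicted from the first level cache into the second level cache. */
unsigned char thrash_first_level_cache(const unsigned char *scratch)
{
    unsigned char sum = 0;
    size_t i;
    for (i = 0; i < THRASH_SIZE; i += 64)
        sum += scratch[i];
    return sum;
}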

Figure 4-2 compares the original single-threaded streaming load implementation with the modified version. By introducing contention for streaming load buffers and other processor resources, memory throughput dropped by over 30%.


Figure 4-2. Memory Throughput from System Memory using Streaming Load with Contention for Streaming Load Buffers and Other Processor Resources


Conclusion

Intel SSE4 includes the MOVNTDQA instruction for "streaming loads" from devices mapped to USWC memory to the CPU. Using streaming loads in conjunction with streaming writes provides tremendous opportunities for increasing data throughput between the CPU and memory mapped I/O devices.

To take advantage of streaming loads, memory mapped I/O devices should be mapped to USWC memory. Implementing the Bulk Load and Operate programming model and performing streaming loads across multiple threads helps ensure peak streaming load performance. The examples in this white paper show how streaming loads can provide over 7.5x speed-ups in read throughput from local system memory, coming very close to the theoretical maximum bus bandwidth.

Future generations of Intel processors may contain optimizations and enhancements for streaming loads, such as increased utilization of the streaming load buffers and support for additional memory types, creating even more opportunities for software developers to increase the performance and energy-efficiency of their applications.


About the Authors

Ashish Jha is a Senior Architecture Performance Engineer in the Software and Solutions Group. He has 10 years of industry experience, including over 6 years at Intel working on Java Virtual Machine performance optimization that integrates Intel's newest processor capabilities. His current areas of work include processor architecture analysis and performance optimizations across current and future processor generations. Ashish has a bachelor's degree in Electronics and Communications Engineering from Birla Institute of Technology in Mesra, India.

Darren Yee is a Senior Technical Marketing Engineer in the Software and Solutions Group. In his 9 years at Intel, Darren has held various positions in engineering and technical marketing, working on a variety of technologies including Java, e-commerce systems, networking, and Intel® Viiv™ technology. Darren holds a patent in the area of e-commerce and has a bachelor's degree in Computer Science from Cornell University.

 


References

 

[1] Because USWC memory is not cacheable, software cannot rely on cache coherency mechanisms to provide any data coherency with respect to other CPUs or bus agents. In order to ensure consistency of USWC memory accesses between producers and consumers, software should employ proper serialization mechanisms such as the memory fence instruction (i.e. MFENCE).

[2] Specifically, Memory Throughput = (Processor Frequency * Number of Iterations * Data Copied Per Iteration) / Total Execution Time (in clock cycles). For dual-threaded implementations, the memory throughput was calculated for each thread and then added together.

[3] The test system consisted of a 45nm Intel® dual core desktop processor (Wolfdale), Intel D975XBX2KR motherboard, 2 GB DDR2 RAM PC2-5300 (667 MHz), Windows* XP Professional with Service Pack 2.

[4] This was done by reading 64KB of data at the start of each streaming load loop iteration, causing the WB buffer to be evicted to the second level cache.


Appendix A: Code Listing for Streaming Load Performance Measurements

 

__declspec(noinline) void BulkLoadAndOperate (void) {

_asm {


; Get pointer to start of the USWC memory
mov esi, Pointer_to_USWC_memory
mov edx, Pointer_to_USWC_memory

; Get pointer to end of the USWC memory
add edx, Memory_Length

; Get pointer to start of the cacheable WB buffer
mov edi, Pointer_to_WB_memory

; Get the Loop Iteration Count
mov ebx, Loop_Iteration_Count

; Initialize the Iteration Counter
xor ecx, ecx

; Take Start Timestamp
rdtsc
mov DWORD PTR [StartClockticks], eax
mov DWORD PTR [StartClockticks+4], edx

; Start of outer Iteration Loop
align 16
iter_loop:
; Reset the pointers; EDX must be recomputed because RDTSC overwrote it
mov esi, Pointer_to_USWC_memory
mov edx, Pointer_to_USWC_memory
add edx, Memory_Length
mov edi, Pointer_to_WB_memory
; Start of Copy loop
align 16
copy_loop:
; Stream a cache line from USWC Memory
MOVNTDQA xmm0, xmmword ptr [esi]
MOVNTDQA xmm1, xmmword ptr [esi+16]
MOVNTDQA xmm2, xmmword ptr [esi+32]
MOVNTDQA xmm3, xmmword ptr [esi+48]

; Copy/store the streamed cache line to the cacheable WB buffer
MOVDQA xmmword ptr [edi], xmm0
MOVDQA xmmword ptr [edi+16], xmm1
MOVDQA xmmword ptr [edi+32], xmm2
MOVDQA xmmword ptr [edi+48], xmm3

; Increment by cache line size
; Test for end of copy loop
add esi, 040h
add edi, 040h
cmp esi, edx
jne copy_loop 
; Test for end for loop iteration
add ecx, 1
cmp ecx, ebx
jne iter_loop
; Take End Timestamp
rdtsc
mov DWORD PTR [EndClockticks], eax
mov DWORD PTR [EndClockticks+4], edx
}
return;
}

 

