A trivial and proven method to implement application performance optimizations with Intel® Streaming SIMD Extensions 4.1 instructions


July 20, 2009 1:00 AM PDT



Abstract

This paper shows how to improve the runtime performance of applications with intensive bit-width data conversion workloads. The Intel® Streaming SIMD Extensions 4.1 (Intel® SSE4.1) family of instructions, together with the procedures discussed here, could speed up conversion-heavy loops by up to 46% (18% at the application level in the case studied here) on top of gains achieved by vectorizing computation-intensive code. Data type conversion code is very common and found in many applications. For example, video codecs typically store data as 8-bit or 16-bit values. During computation, however, data values are usually converted to a wider type such as 32 bits for compression or decompression, and then scaled back to their native size afterward.

Twelve instructions in the Intel® SSE4.1 family were designed to speed up bit-width conversions of data values. There are several known code patterns in which these instructions can be utilized to improve the runtime performance of an application. This document presents a straightforward and proven method, with procedures that demonstrate how to find the known code patterns where these instructions can be used to optimize the performance of an application.

Introduction

The purpose of this document is to share with software developers a simple and proven method to achieve optimization wins with instructions from the Intel® SSE4.1 family. The PMOV<x> instruction set, which is part of the Intel® SSE4.1 family, consists of 12 new instructions that provide an array of bit-width conversion operations for packed integer data. Six of the instructions provide sign-extended bit-width conversions, while the other six provide zero-extended bit-width conversions. In general, the Intel® SSE4.1 instructions are available in Penryn and newer processors. The processor code-named Penryn belongs to the 45nm Intel® Core™ processor family, which is available in dual-core and quad-core packages.

Table 1 contains the instructions which are part of the PMOV<x> set and their intrinsic function equivalent.

 

PMOV<x> Instruction Set and Intrinsic Equivalent

Instruction    Intrinsic equivalent

PMOVSXBW       _mm_cvtepi8_epi16 (__m128i a);
PMOVSXBD       _mm_cvtepi8_epi32 (__m128i a);
PMOVSXBQ       _mm_cvtepi8_epi64 (__m128i a);
PMOVSXWD       _mm_cvtepi16_epi32 (__m128i a);
PMOVSXWQ       _mm_cvtepi16_epi64 (__m128i a);
PMOVSXDQ       _mm_cvtepi32_epi64 (__m128i a);

PMOVZXBW       _mm_cvtepu8_epi16 (__m128i a);
PMOVZXBD       _mm_cvtepu8_epi32 (__m128i a);
PMOVZXBQ       _mm_cvtepu8_epi64 (__m128i a);
PMOVZXWD       _mm_cvtepu16_epi32 (__m128i a);
PMOVZXWQ       _mm_cvtepu16_epi64 (__m128i a);
PMOVZXDQ       _mm_cvtepu32_epi64 (__m128i a);

Table 1: List of instructions in the PMOV<x> set

There are several known code patterns in which instructions from the PMOV<x> set can be used to improve the runtime performance of an application. This whitepaper describes those code patterns and how to find them. It also describes which of the PMOV<x> instructions were used, tested, and proven to improve the performance of an application, as well as the gains that were achieved.

Overview

Audio, video, and media processing applications with workloads that need fast and efficient bit-width conversions are excellent candidates for optimization with the method outlined in this document. Specifically, applications that convert 8-bit and/or 16-bit audio data to single- and/or double-precision floating-point values are known to see excellent runtime performance gains.

The method to optimize with PMOV<x> instructions can be done in three simple steps. These steps are:

  1. finding known code patterns in your application code
  2. assessing found code patterns functionality for optimization, and
  3. coding optimization and evaluating the application performance with optimized code.

 

The following are the principal topics that outline the information and procedure details for each step.

Step 1:

  • Describing and understanding known code patterns with sample code
  • Matching and finding known code patterns

 

Step 2:

  • Assessing pattern code for optimization
  • Description of the Intel® SSE4.1 « pmovsx » and « pmovzx » instruction set

 

Step 3:

  • Coding appropriate instruction (intrinsic or inline assembly)
  • Evaluation of performance improvement

Code Patterns

 

The following are some of the known code patterns that have been shown to benefit at run time when optimized with PMOV<x> instructions. These code patterns can be found in MMX™, Intel® SSE, or even Intel® SSE2 instruction code.

Pattern 1:

mov<*> mm#, <register>  // e.g., 4 samples (word wide; 16-bit)
punpck<*> mm#, mm#
psrad mm#, imm8

Pattern 2:

mov<*> mm#, <register>  // e.g., 4 samples (word wide; 16-bit)
punpck<*> mm#, mm#
pshuf<*> mm#, mm#

Pattern 3:

mov<*> xmm#, <register>  // e.g., 8 samples (word wide; 16-bit)
punpck<*> xmm#, xmm#
psrad xmm#, imm8

Pattern 4:

mov<*> xmm#, <register>  // e.g., 8 samples (word wide; 16-bit)
punpck<*> xmm#, xmm#
pshuf<*> xmm#, xmm#

<*> - Indicates that the pattern covers the different flavors of the pattern instructions (e.g., movdqu, punpcklwd, pshufd).

# - Indicates the MMX™ or XMM register number being used. The register number must correspond across the instructions of the code pattern for the code pattern to be a match.
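To make pattern 3 concrete, the following minimal sketch expresses it with intrinsics next to a single SSE4.1 conversion that produces the same result. The function names widen4_sse2 and widen4_sse41 and the TARGET_SSE41 macro are illustrative and do not come from the original code; the macro assumes a GCC- or Clang-style compiler and only spares the -msse4.1 build flag.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1; also pulls in the SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang-style target attribute; other compilers may need a flag. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Pattern 3 with SSE2 intrinsics: duplicate each 16-bit word into a 32-bit
   lane (punpcklwd), then shift right arithmetically by 16 (psrad) so the
   upper half of each lane is filled with copies of the sign bit. */
TARGET_SSE41
void widen4_sse2(const int16_t *src, int32_t *dst)
{
    assert(src != NULL && dst != NULL);
    __m128i v = _mm_loadl_epi64((const __m128i *)src); /* load 4 words  */
    v = _mm_unpacklo_epi16(v, v);                      /* punpcklwd     */
    v = _mm_srai_epi32(v, 16);                         /* psrad ..., 16 */
    _mm_storeu_si128((__m128i *)dst, v);
}

/* The same widening with one SSE4.1 conversion (pmovsxwd). */
TARGET_SSE41
void widen4_sse41(const int16_t *src, int32_t *dst)
{
    assert(src != NULL && dst != NULL);
    __m128i v = _mm_loadl_epi64((const __m128i *)src); /* load 4 words */
    v = _mm_cvtepi16_epi32(v);                         /* pmovsxwd     */
    _mm_storeu_si128((__m128i *)dst, v);
}
```

Both functions read four 16-bit samples and write four sign-extended 32-bit values, so either can be checked against a plain scalar cast.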

Another pattern which should be added to the list of known patterns to optimize with PMOV<x> instructions is:

PXOR XMM1, XMM1
PUNPCKLBW XMM0, XMM1

This code pattern is typically used to convert single-byte values to short (16-bit) values. The “pmovzxbw xmm0, m64” instruction from the set not only improves performance but also reduces the number of instructions and the register pressure.

This is because the second xmm register (xmm1) in the pattern is needed only to zero-extend the byte values. The pmovzxbw instruction zero-extends the 8 packed 8-bit integers in the low 8 bytes of m64 to 8 packed 16-bit integers in xmm0, thereby reducing the instruction count and using only one register instead of the two shown in the pattern.
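The byte-to-word case can be sketched with intrinsics as follows. This is an illustration under stated assumptions rather than code from the original application: the function names and the TARGET_SSE41 macro are hypothetical, and the macro assumes a GCC- or Clang-style compiler.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1; also pulls in the SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang-style target attribute; other compilers may need a flag. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* SSE2 form: a second register holding zero is interleaved with the data. */
TARGET_SSE41
void bytes_to_words_sse2(const uint8_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    __m128i zero = _mm_setzero_si128();                /* pxor xmm1, xmm1      */
    __m128i v = _mm_loadl_epi64((const __m128i *)src); /* load 8 bytes         */
    v = _mm_unpacklo_epi8(v, zero);                    /* punpcklbw xmm0, xmm1 */
    _mm_storeu_si128((__m128i *)dst, v);
}

/* SSE4.1 form: no zero register is needed. */
TARGET_SSE41
void bytes_to_words_sse41(const uint8_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    __m128i v = _mm_loadl_epi64((const __m128i *)src); /* load 8 bytes */
    v = _mm_cvtepu8_epi16(v);                          /* pmovzxbw     */
    _mm_storeu_si128((__m128i *)dst, v);
}
```

Each function converts eight unsigned bytes to eight 16-bit values; the SSE4.1 version drops the zeroed register entirely.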

Sample Code

The code patterns in which PMOV<x> instructions can be of use may be found in compiler-generated code, in code with intrinsic functions, and in hand-coded inline assembly code. This section discusses sample code to illustrate how the known code patterns can come to exist in SIMD-optimized code.

Scalar Code

The Foo1 function converts 16-bit data to float values using a simple scalar loop. Compilers with auto-vectorization capabilities may vectorize this type of loop, and the resulting code may contain some of the known code patterns. The patterns can also be found in hand-coded functions. To illustrate how they come to exist, the sample Foo1 function will be coded with Intel® SSE2 intrinsic functions. But first, here is Foo1 with a serial loop. The function converts each incoming 16-bit value to double by typecasting and multiplying by 1.0, scales it by the gain, and then typecasts the result to float before storing it to the target. When compiled, this simple loop generates not-so-simple assembly code: the compiler emits many instructions to carry out the loop's operations, making the loop several times less efficient than vectorized code.

 

#define NUM 4096
short shorts16Bit[NUM];     // identifiers may not begin with a digit
float fltStore[NUM];
typedef short *SRC_16BIT;
typedef float *WP_FLOAT;
.. …

pfTarget = &fltStore[0];
psOrig = &shorts16Bit[0];
long num2Copy = …;
float fGain = …;
.. …


// Converts shorts to floats with gain scaling
void Foo1(WP_FLOAT pfTarget, SRC_16BIT psOrig, long num2Copy, float fGain)
{
    WP_FLOAT    pTgt = pfTarget;
    SRC_16BIT   pOrg = psOrig;
    long        x;

    for (x = 0; x < num2Copy; x++)
    {
        double ss;
        ss = 1.0 * (double)(*pOrg++);   // widen the 16-bit sample to double
        ss *= fGain;                    // apply the gain
        pTgt[x] = (float)ss;            // narrow to float and store
    }
}

SIMD Code (Intel® SSE2)

 

The function below is the Intel® SSE2 equivalent of the Foo1 function. It is now programmed to process 8 data samples per loop iteration, effectively improving the throughput by more than 5 times (>5x). Converting serial loops to vector processing and optimizing with SIMD typically yields large performance gains. Software developers should always strive to optimize with SIMD whenever possible.

The objective of this document is to demonstrate how SIMD code can be further optimized with Intel® SSE4.1 PMOV<x> instructions; as such, serial-to-SIMD optimization is not discussed in detail. The Foo1_SSE2 function provides a sample conversion of a serial loop to SIMD code, presented solely to show how the code patterns can come to exist in SIMD code.

 

void Foo1_SSE2(WP_FLOAT pfTarget, SRC_16BIT psOrig, long num2Copy, float fGain)
{
    WP_FLOAT    pTgt = pfTarget;
    SRC_16BIT   pOrg = psOrig;
    long        x;
    __m128      xmmMul;

    xmmMul = _mm_set1_ps(fGain); // Sets 4 single-precision FPs with value of fGain

}

 

Function Foo1_SSE2 is the equivalent of the Foo1 sample function but optimized with Intel® SSE2 instructions. The table shows the function using Intel® SSE2 intrinsic functions on the left side and, on the right side, the equivalent loop in assembly code. Intrinsic functions are built into compilers that support them and usually map 1:1 to an assembly instruction.

Although the inline assembly code performs two load operations to load the same data, the loop performance is comparable to that of the loop with intrinsic calls. This distinction is being purposely made to further highlight how code patterns can be identified and where the PMOV<x> instructions can be used. And thus, the “Inline Assembly” code highlights two matches of pattern #3 in the list of the “Code Patterns” section.

Also note that the sample code should not be construed as complete; its main purpose is simply to illustrate the code patterns discussed in this document.
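For readers who want to see where pattern 3 shows up in such a loop, the following is a minimal sketch of an SSE2 conversion loop in the spirit of Foo1. It is not the author's original listing: the function name foo1_sse2_sketch, the plain pointer types, the scalar tail loop, and the TARGET_SSE2 macro (which assumes a GCC- or Clang-style compiler) are all assumptions made for illustration.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang-style target attribute; other compilers may need a flag. */
#if defined(__GNUC__)
#define TARGET_SSE2 __attribute__((target("sse2")))
#else
#define TARGET_SSE2
#endif

/* Converts 16-bit samples to floats with gain scaling, 8 samples per pass. */
TARGET_SSE2
void foo1_sse2_sketch(float *dst, const int16_t *src, long n, float gain)
{
    assert(dst != NULL && src != NULL && n >= 0);
    __m128 mul = _mm_set1_ps(gain);   /* 4 copies of the gain */
    long x;

    for (x = 0; x + 8 <= n; x += 8) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + x)); /* movdqu */
        /* Two occurrences of the unpack-then-shift idiom (pattern 3):
           sign-extend the low and the high four words to 32 bits. */
        __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(s, s), 16);
        __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(s, s), 16);
        _mm_storeu_ps(dst + x,     _mm_mul_ps(_mm_cvtepi32_ps(lo), mul));
        _mm_storeu_ps(dst + x + 4, _mm_mul_ps(_mm_cvtepi32_ps(hi), mul));
    }
    for (; x < n; x++)                /* scalar tail for leftover samples */
        dst[x] = (float)src[x] * gain;
}
```

The two unpack/shift pairs inside the loop are the places where a PMOV<x> instruction could be substituted.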

SIMD Code (Intel® SSE4.1)

Function Foo1_SSE4 is the equivalent of the Foo1_SSE2 function but has been further optimized with the pmovsxwd Intel® SSE4.1 instruction. The table shows Foo1_SSE4 using Intel® SSE2 and Intel® SSE4.1 intrinsic calls and assembly instructions. Note that both the intrinsic and the inline assembly loop implementations use fewer instructions to perform the same amount of work than their Intel® SSE2 counterparts. Fewer and more efficient instructions usually account for the performance boost achieved in loops with heavy computation workloads.

 

void Foo1_SSE4(WP_FLOAT pfTarget, SRC_16BIT psOrig, long num2Copy, float fGain)
{
    WP_FLOAT    pTgt = pfTarget;
    SRC_16BIT   pOrg = psOrig;
    long        x;
    __m128      xmmMul;

    xmmMul = _mm_set1_ps(fGain); // Sets 4 single-precision FPs with value of fGain

}

Currently, there are no intrinsic functions that load data from memory for the PMOV<x> instructions. All intrinsic equivalents of the PMOV<x> instructions expect an XMM register as an argument. Thus, a load operation (intrinsic function) must be used to load data into an XMM register before the PMOV<x> intrinsic functions can be used for bit-width conversion operations. The PMOV<x> instructions in inline assembly code do support loading data from a memory reference.
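The load-then-convert sequence described above can be sketched as follows for a short-to-float loop. This is an illustration only, not the original Foo1_SSE4 listing: the function name foo1_sse41_sketch, the plain pointer types, the scalar tail, and the TARGET_SSE41 macro (which assumes a GCC- or Clang-style compiler) are assumptions. Whether a compiler folds the explicit 64-bit load into the memory form of pmovsxwd depends on the compiler.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1; also pulls in the SSE/SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang-style target attribute; other compilers may need a flag. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Converts 16-bit samples to floats with gain scaling, 8 samples per pass. */
TARGET_SSE41
void foo1_sse41_sketch(float *dst, const int16_t *src, long n, float gain)
{
    assert(dst != NULL && src != NULL && n >= 0);
    __m128 mul = _mm_set1_ps(gain);   /* 4 copies of the gain */
    long x;

    for (x = 0; x + 8 <= n; x += 8) {
        /* Explicit 64-bit loads feed the register-only conversion intrinsic. */
        __m128i lo = _mm_loadl_epi64((const __m128i *)(src + x));
        __m128i hi = _mm_loadl_epi64((const __m128i *)(src + x + 4));
        lo = _mm_cvtepi16_epi32(lo);  /* pmovsxwd: 4 words -> 4 dwords */
        hi = _mm_cvtepi16_epi32(hi);
        _mm_storeu_ps(dst + x,     _mm_mul_ps(_mm_cvtepi32_ps(lo), mul));
        _mm_storeu_ps(dst + x + 4, _mm_mul_ps(_mm_cvtepi32_ps(hi), mul));
    }
    for (; x < n; x++)                /* scalar tail for leftover samples */
        dst[x] = (float)src[x] * gain;
}
```

Each conversion here takes one instruction where the unpack-and-shift idiom takes two.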

Matching Code Patterns

There are basically two key pieces of information which can be used to find and match the code patterns. The first key piece of information is the assembly instruction sequence itself. The second key piece of information is the operands with which the instructions of the pattern are invoked.

The instructions of the pattern do not have to be contiguous, but the pattern must exist within a 10-instruction block to be matched successfully. If the pattern is not matched within the 10-instruction block, the instruction count resets and pattern matching moves to the next mov<*> instruction. The 10-instruction bound begins at the location where the “mov<*>” instruction is found. The bound was selected solely by trial and error; going beyond 10 instructions could lead to false pattern matches. Also, as long as the instructions are invoked with the correct operands, the instructions in the pattern do not have to execute back to back.

For example, in the pattern:

movdqu xmm1, [esi]
punpcklwd xmm1, xmm1
psrad xmm1, 0x10

the destination operand of the “movdqu” instruction must be an MMX™ or an XMM register. Second, the source and destination operands of the “punpcklwd” instruction must be the same register as the one used in the destination operand of the “movdqu” instruction. Finally, the destination operand of the “psrad” instruction must be the same register as the one used in the destination operand of the “punpcklwd” instruction.

Closely inspect the instruction code to ensure that other instructions do not explicitly or implicitly change the contents of the operands used by the instructions in the pattern. The pattern match is invalid if the operand contents are changed, implicitly or explicitly, by other instructions. In addition, be extremely careful when matching patterns if the code has branch instructions (e.g., jz) within the pattern.

Finding Code Patterns

Finding the code patterns assumes access to the assembly code of the application or API. If the assembly code is not available, tools such as the dumpbin utility in Microsoft* Visual Studio can be used to disassemble object code.

An algorithm is included in Addendum A with a procedural flow which could be used as a base to write a utility tool to automate the finding of the code patterns.

Assessing whether to optimize the code where code patterns are found may seem daunting, but in practice it is straightforward. For instance, the Intel® VTune™ or Intel® Performance Tuning Utility tools can be used to gather performance sampling data. The sampling data can then be used to correlate the application hot spots with the code where code patterns are found. Matched patterns that correlate with application hot spots are the ideal candidates to optimize with PMOV<x> and are the most likely to achieve excellent performance wins.

Assessing Code for Optimization

The application or API debug symbols are instrumental in identifying the source code (file and function) where patterns are found. Without symbols to correlate the matched code with source file or function names, it is difficult to determine whether an Intel® SSE4.1 instruction would fit or benefit the algorithm. Application performance sampling data is therefore most useful for determining whether the code of a matched pattern is executed, how often, and how deep in the call path. Pattern matches that are heavy hitters and correlate with code hot spots should be the top priority and should achieve excellent performance wins. Code pattern matches inside for or while loops, or inside a heavily called function, are also good candidates for optimization with a PMOV<x> instruction.

In general, if the code where known code patterns are found is heavily exercised, the use of the Intel® SSE4.1 instructions could provide a performance benefit, for the simple reason that one highly efficient PMOV<x> instruction can replace the 3 instructions of the pattern while performing the same amount of work. Even if the code where patterns are found is not a hot spot, consider taking a closer look to explore the benefits, if any.

Intel® SSE4.1 Packed Integer Conversion Instructions

There are 12 instructions in the PMOV<x> family set, any of which can be utilized to optimize code where known code patterns are found. The PMOV<x> instructions provide the capability to convert from a smaller packed integer type to larger integer type, and typical operations of media (audio and video) applications usually involve bit width conversions of packed integers.

For example, pmov[z|s]xwd would be used to convert 16-bit shorts to 32-bit integers with zero or sign extension (Word to DWord).

Figure 1 below shows all of the PMOV<x> instructions as well as sample code where the pmovsxbd instruction is utilized. The pmovsxwd and/or pmovzxwd instructions are at the forefront in terms of applicability to optimizations of actual media-based applications.
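The difference between the sign-extending and zero-extending forms only shows for inputs whose top bit is set. The following minimal sketch converts the same four 16-bit words both ways; the function name widen_words and the TARGET_SSE41 macro (which assumes a GCC- or Clang-style compiler) are illustrative assumptions.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1; also pulls in the SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang-style target attribute; other compilers may need a flag. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Widens four 16-bit words to 32 bits, once as signed and once as unsigned. */
TARGET_SSE41
void widen_words(const uint16_t raw[4], int32_t sign_ext[4], int32_t zero_ext[4])
{
    assert(raw != NULL && sign_ext != NULL && zero_ext != NULL);
    __m128i v = _mm_loadl_epi64((const __m128i *)raw);            /* 4 words  */
    _mm_storeu_si128((__m128i *)sign_ext, _mm_cvtepi16_epi32(v)); /* pmovsxwd */
    _mm_storeu_si128((__m128i *)zero_ext, _mm_cvtepu16_epi32(v)); /* pmovzxwd */
}
```

For the input word 0x8000, the sign-extending form yields -32768 and the zero-extending form yields 32768; for inputs below 0x8000 the two forms agree.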

Figure 1: Instructions for bit width conversions of packed integers

 

The source operand of a PMOV<x> instruction is either an XMM register or memory; the destination is always an XMM register. The number of elements that can be converted and the width of the memory reference are shown in Table 2. The alignment requirement is shown in parentheses.

 

                    Source Type
Destination Type    Byte           Word           Dword

Word                8 (64 bits)
Dword               4 (32 bits)    4 (64 bits)
Qword               2 (16 bits)    2 (32 bits)    2 (64 bits)

Table 2: Number of elements to process and alignment necessary (when required)

Note: when accessing memory, if alignment checking is turned on, then all conversions must be aligned to the width of the memory being used.
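Because the intrinsics take a register argument, the memory widths in Table 2 translate into how many bytes the preceding load should read. The sketch below shows one possible way to read exactly 32 bits and exactly 16 bits before converting. The function names and the TARGET_SSE41 macro (which assumes a GCC- or Clang-style compiler) are hypothetical, and using memcpy for the narrow reads is a design choice made here to avoid alignment and aliasing concerns, not something taken from the original paper.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1; also pulls in the SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang-style target attribute; other compilers may need a flag. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Byte -> Dword: 4 elements, 32 bits of source data. */
TARGET_SSE41
void bytes4_to_dwords(const int8_t *src, int32_t dst[4])
{
    int32_t bits;
    assert(src != NULL && dst != NULL);
    memcpy(&bits, src, sizeof bits);              /* read exactly 32 bits */
    __m128i v = _mm_cvtsi32_si128(bits);          /* movd into lane 0     */
    _mm_storeu_si128((__m128i *)dst, _mm_cvtepi8_epi32(v));  /* pmovsxbd */
}

/* Byte -> Qword: 2 elements, 16 bits of source data. */
TARGET_SSE41
void bytes2_to_qwords(const int8_t *src, int64_t dst[2])
{
    uint16_t bits;
    assert(src != NULL && dst != NULL);
    memcpy(&bits, src, sizeof bits);              /* read exactly 16 bits */
    __m128i v = _mm_cvtsi32_si128(bits);          /* upper bits are zero  */
    _mm_storeu_si128((__m128i *)dst, _mm_cvtepi8_epi64(v));  /* pmovsxbq */
}
```

Reading only the bytes that the conversion consumes keeps the load from running past the end of a short buffer.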

Coding Intel® SSE4.1 Instructions

Suppose optimization opportunities have been found and you want to use the Intel® SSE4.1 instructions to optimize your code but do not want to add inline assembly. The PMOV<x> instruction set also comes with corresponding intrinsic functions. However, as of the time of this document, the intrinsic functions accept only an XMM register (not a memory reference) as the source argument. This means that an additional operation must be used to load the XMM register. Even so, code using the intrinsic functions shows similarly excellent performance gains.

The code samples in the following two tables illustrate how the intrinsic functions and inline assembly have been used in studies and in actual application optimizations which have shown excellent performance gains.

 

 

Table 3: Inline assembly sample code

The inline assembly code shown in Table 3 was used to convert 16 data (audio) samples per loop iteration. A test harness based on an actual application hot loop, in which thousands of 16-bit integers are converted to single-precision floating-point values, showed a 46% speed-up. The 46% speed-up at the loop level translated to an effective performance gain of 18% at the application level. Similar loops showed a 5.47x performance boost when their scalar/serial processing was converted to SIMD code that uses the pmovsxwd instruction.

Table 4: Sample code using intrinsic functions

Table 4 shows intrinsic code that was used to convert 4 data samples per loop iteration of an audio application. A test harness instrumented with Intel® SSE4.1 code observed a 40% speed up at the loop level, while the application showed a 14% performance benefit with this code.
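The loop-level and application-level figures can be related with Amdahl's law. The sketch below is an illustration and not part of the original measurements: it assumes that a "46% speed up" means the loop runs 1.46 times faster, that nothing else in the application changed, and the function name loop_fraction is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: S_app = 1 / ((1 - f) + f / S_loop), where f is the fraction
   of the original run time spent in the optimized loop. Solving for f gives
   f = (1 - 1 / S_app) / (1 - 1 / S_loop). Both speed-ups are ratios > 1. */
double loop_fraction(double loop_speedup, double app_speedup)
{
    assert(loop_speedup > 1.0 && app_speedup >= 1.0);
    return (1.0 - 1.0 / app_speedup) / (1.0 - 1.0 / loop_speedup);
}
```

Under these assumptions, loop_fraction(1.46, 1.18) is about 0.48 and loop_fraction(1.40, 1.14) is about 0.43, which would mean the converted loops accounted for a little under half of the original run time in both cases.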

Conclusion


In summary, this paper describes a procedure in which Intel® SSE4.1 instructions or their intrinsic equivalents can be used to improve the runtime performance of applications that perform a large number of bit-width conversions. It discussed the assembly instruction patterns where the use of Intel® SSE4.1 instructions can potentially improve application performance. It also outlined procedural steps to find the known instruction patterns and highlighted which Intel® SSE4.1 instructions from the PMOV<x> set to use to optimize the code for these patterns.

The Intel® SSE4.1 instruction set delivers new and powerful instructions that bring tremendous benefits to the runtime performance of media-based applications. However, performance benefits do not and will not automatically manifest themselves. Software developers need to proactively seek out software optimizations to take advantage of the benefits the new instruction set presents.

The method described in this document has helped the author identify and achieve optimizations with the PMOV<x> instructions.

This paper was compiled to share this information with software developers seeking to boost performance of their applications. Certainly, there are many other ways to optimize applications for performance. The instructions included in the PMOV<x> instruction set have proven to be very beneficial, especially to applications with heavy processing of bit width conversions.

 

Author:

Eli Hernandez, Application Engineer (Software Performance)


Addendum A

This is a procedural flow that might be used as a base for writing a utility tool to automate the finding of the code patterns. The tool would examine each assembly instruction in an effort to match known code patterns.

The tool would log and report the location of the code for the patterns matched. For example, when parsing assembly code that spans multiple functions, it is recommended to capture the function name of the assembly instructions being parsed. When a pattern is matched, the function name and assembly offset would be logged as the location of the code pattern. Such information is useful to further assess optimization feasibility.

  1. Open disassembly code file
  2. While not EOF
  3. Search for the mov<*> instruction and examine the destination argument
  4. If destination argument is NOT an MMX™ or XMM register
    • Move to next line, and go to step 3 to find next mov<*> instruction
  5. When/if a mov<*> instruction is matched, within the next 10 instructions bound
  6. Find the punpck<*> instruction and check if source argument of punpck<*> is same as the destination argument in the mov<*> instruction
  7. If source argument is NOT the same register
    • Continue to check for punpck<*> until bound limit-2 is overrun
    • If bound is overrun, move to next line and go to step 3 to find next mov<*> instruction
  8. Else check for psrad<*> or pshuf<*> instruction
  9. If psrad<*> or pshuf<*> instruction is NOT found
    • Continue to check for psrad<*> or pshuf<*> until bound limit is overrun
    • If bound limit is overrun, move to next line and go to step 3 to find next mov<*> instruction
  10. Else for the psrad<*> instruction pattern match
    • The destination argument of psrad<*> must be the same register as the one used in the destination argument of the punpck<*> instruction
    • If destination argument is not a match, move to next line and go to step 3 to continue
    • Else if destination argument is matched, a known SIMD code pattern has been found. Log the instruction RVA location for later examination for potential optimization with Intel® SSE4.1
    • Move to next line and go to step 3 to find next mov<*> instruction
  11. Else for the pshuf<*> instruction pattern match
    • The source 1 argument of pshuf<*> must be the same register as the one used in the destination argument of the mov<*> instruction
    • If source 1 argument is not a match, move to next line and go to step 3 to continue
    • Else if source 1 argument is matched, a known SIMD code pattern has been found. Log the instruction RVA location for later examination for potential optimization with Intel® SSE4.1
    • Move to next line and go to step 3 to find next mov<*> instruction
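The flow above could be sketched in C roughly as follows. This is a simplified illustration, not a finished tool: it assumes each input line is already reduced to lowercase "mnemonic op1, op2" text, treats "source 1" of pshuf<*> as the second operand, reports line indexes instead of RVAs, and does not check whether intervening instructions clobber the register or branch. All names (Insn, parse, find_patterns, WINDOW) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 10   /* instruction bound described in the text */

typedef struct { char mn[16], op1[16], op2[16]; int nops; } Insn;

/* Splits "mnemonic op1, op2" into its parts; nops counts operands found. */
static Insn parse(const char *line)
{
    Insn in = { "", "", "", 0 };
    int got = sscanf(line, "%15s %15[^, ] , %15[^, ]", in.mn, in.op1, in.op2);
    in.nops = got > 0 ? got - 1 : 0;
    return in;
}

static int starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* True for operands such as "mm3" or "xmm1". */
static int is_simd_reg(const char *op)
{
    if (starts_with(op, "xmm")) return isdigit((unsigned char)op[3]) != 0;
    if (starts_with(op, "mm"))  return isdigit((unsigned char)op[2]) != 0;
    return 0;
}

/* Stores the index of each matching mov<*> line in hits[]; returns the count. */
int find_patterns(const char *const lines[], int n, int hits[], int max_hits)
{
    int found = 0;
    assert(lines != NULL && hits != NULL);
    for (int i = 0; i < n && found < max_hits; i++) {
        Insn mov = parse(lines[i]);
        if (!starts_with(mov.mn, "mov") || mov.nops < 2 || !is_simd_reg(mov.op1))
            continue;                       /* steps 3-4: not a candidate mov */
        int unpacked = 0;
        for (int j = i + 1; j < n && j <= i + WINDOW; j++) {
            Insn in = parse(lines[j]);
            if (in.nops < 2)
                continue;
            if (!unpacked) {                /* steps 6-7: punpck on same reg  */
                unpacked = starts_with(in.mn, "punpck")
                        && strcmp(in.op1, mov.op1) == 0
                        && strcmp(in.op2, mov.op1) == 0;
            } else if ((strcmp(in.mn, "psrad") == 0 && strcmp(in.op1, mov.op1) == 0)
                    || (starts_with(in.mn, "pshuf") && strcmp(in.op2, mov.op1) == 0)) {
                hits[found++] = i;          /* steps 10-11: pattern matched   */
                break;
            }
        }
    }
    return found;
}
```

A caller would pass the disassembly lines of one function at a time and then review each reported location by hand, since the sketch performs none of the clobber or branch checks described under Matching Code Patterns.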