Intel® Streaming SIMD Extensions 4 (Intel® SSE4) is a new set of Single Instruction Multiple Data (SIMD) instructions designed to improve the performance of various applications, such as video encoders, image processing, 3D games, and string/text processing. Intel SSE4 builds upon the Intel® 64 and IA-32 instruction set, the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4 will be introduced in the 45nm Next Generation Intel® Core™2 processor family.
Intel SSE4.2 introduces four instructions for "string and text processing", which can accelerate many string and text related applications, typically, the functionalities of finding characters in a set, checking characters in a range set, comparing two strings and finding substring from a string.
XML related applications such as Intel XML parsing become more and more common and popular in current information system, especially web service and SOA environment. In such environments, string and text processing is critical performance bottleneck, since many operations such as sub-string searching and regular expression checking need many complex algorithms.
This white paper will describe how Intel XML parsing can benefit from the Intel SSE4 instructions, achieving great performance speedups using string and text processing instructions.
String and Text Processing Instructions in Intel® SSE4
String and text processing instructions in INTEL SSE4.2 allocates four instructions to provide a rich set of string and text processing capabilities that traditionally required many more instructions. These four instructions use XMM registers to process string or text elements of up to 128-bits (16 bytes or 8 words). Each instruction uses an immediate byte to support a rich set of programmable controls. A string-processing INTEL SSE4.2 instruction returns the result of processing each pair of string elements using either an index or a mask.
The capabilities of the string/text processing instructions include:
- Handling string/text fragments consisting of bytes or words, either signed or unsigned
- Support for partial string or fragments less than 16 bytes in length, using either explicit length or implicit null-termination
- Four types of string compare operations on word/byte elements
- Up to 256 compare operations performed in a single instruction on all string/text element pairs
- Built-in aggregation of intermediate results from comparisons
- Programmable control of processing on intermediate results
- Programmable control of output formats in terms of an index or mask
- Bi-directional support for the index format
- Support for two mask formats: bit or natural element width
- Not requiring 16-byte alignment for memory operand
The four INTEL SSE4.2 instructions that process text/string fragments are:
- PCMPESTRI - Packed compare explicit-length strings, return index in ECX/RCX
- PCMPESTRM - Packed compare explicit-length strings, return mask in XMM0
- PCMPISTRI - Packed compare implicit-length strings, return index in ECX/RCX
- PCMPISTRM - Packed compare implicit-length strings, return mask in XMM0
All four require the use of an immediate byte to control operation. The two source operands can be XMM registers or a combination of XMM register and memory address. The immediate byte provides programmable control with the following attributes:
- Input data format.
- Compare operation mode.
- Intermediate result processing.
- Output selection.
Depending on the output format associated with the instruction, the text/string processing instructions implicitly uses either a general-purpose register (ECX/RCX) or an XMM register (XMM0) to return the final result.
Two of the four text-string processing instructions specify string length explicitly. They use two general-purpose registers (EDX, EAX) to specify the number of valid data elements (either word or byte) in the source operands. The other two instructions specify valid string elements using null termination. A data element is considered valid only if it has a lower index than the least significant null data element.
There are four common programming models for using string and text processing in an application. These models differ in how the string comparison operation is. Below is a summary of each programming model, followed by an example and implementation guidelines for the preferred programming model, ranges.
- Equal any
In this model, the application would find characters from a set. Operand 1 is defined less than 16 bytes or eight words as the target set, while, operand 2 is the input bytes or words stream for checking. Each unit in operand 2 is checked whether equal with any unit in operand 1, then, the result is calculated according to the control byte.
In this model, the application would find characters from ranges. Operand 1 is defined less than 16 bytes or 8 words as the target range set, while, operand 2 is the input bytes or words stream for checking. Each unit in operand 2 is checked whether is located in any range defined in operand 1, then, the result is calculated according to the control byte.
- Equal each
In this model, the application would compare whether two strings are equal. Operand 1 is defined less than 16 bytes or eight words as one input string, while, operand 2 is defined as another input string. Each unit in operand 2 is checked whether is equal with the same location unit in operand 1, then, the result is calculated according to the control byte.
- Equal ordered
In this model, the application would search the substring from input string. Operand 1 is defined less than 16 bytes or eight words as the target substring, while, operand 2 is the input bytes or words stream for checking. Each sequence in operand 2 is checked whether is equal with the substring in operand 1, then, the result is calculated according to the control byte.
Character Range Checking using String and Text Processing Instructions
XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a serious of character range rules. The main work of Intel XML parsing is to recognize these entities and their logic structures.
From Intel XML Parsing Accelerator, we found that character checking loop occupies more than 60% CPU cycles of the whole parsing process, depending on the property of benchmark. There are two kinds of important behavior in this loop, read bytes and check whether it is legal for its corresponding enity type. Without any parallel instructions for string comparison, this process must be implemented in serializing mode.
STTNI provides a parallel solution for byte or word comparison. It matches the optimization of bottleneck in XML parser. Therefore, the basic idea is implementing the parallel character check using Range model. While the enity rules are generally very complex which are defined inside Unicode standard, we cannot apply all of these range rules with STTNI, considering the overhead of the STTNI initialization and execution. The best situation is the one byte encoding range (like ASCII), so one STTNI instruction can check all the character in XMM register, and the worst should be characters in the register compose an encoding sequence, each item in which has its own different range rule. In addition, STTNI instructions can define up to 16 bytes characters for parallel check, so we had better provide characters as many as possible to extremely utilize the STTNI power.
So the principal of accelerate Intel XML parsing using STTNI could be concluded as following:
Principle 1: The character to be checked should be the longer the better.
Principle 2: The character rule logic should be simple and cover most of the popular XML documents.
According to Principle 1, we choose the CharData, Comments, CDATA Sections, Element/ Attribute Name, and Attribute Value to apply the STTNI optimization, since normally their content length is not very short. According to Principle2, we simplify the entity rules according to the concrete type of rule characteristic, to ensure using the least STTNI instructions to check the most part of the rule.
Figure 1 shows an example of Character Data rules simplification. We selected full 1 and 2 bytes encoding part, major part of 3 bytes encoding part, and eliminate the complex and infrequently used 4 bytes encoding part. In the simplified 2 and 3 bytes encoding parts, we should well utilize the common range [x80-xBE], only when it fails we need to check the range [x80-xBF], which can extremely save the STTNI execution times. The detail is shown by the code in Figure 3.
Figure 1. Sample Character Data Rules Simplification
Figure 2 shows the sample implementation of the Character Data parsing acceleration using PCMPESTRI, PCMPESTRM instructions with Range/Equal Any Models. Firstly, we locate the Character Data part in the XMM register with the nonCharaData characters. Then we check whether they all hit the 1 byte encoding range (ASCII range), if so continue to process the next data, otherwise invoke recogUnicodeRange (Figure 3) function to check the non hint part with the 2 and 3 bytes encoding short-cut, if it still doesn’t hit, fall back to the normal checking logic. When checking the rules that more than 2 bytes, we should first ensure that characters in XMM register doesn’t contain the incomplete encoding character, as shown in Figure 4.
Figure 2. Sample Implementation of Character Data Rules Check with STTNI
Figure 3. Character Data Rules Check in UTF-8 Except ASCII Range
Figure 4. Incomplete Encoding Bytes Check in XMM Register
We use the Intel XML parsing benchmark to measure the performance of string and text processing for applications. The optimized character range checking implementation described in the previous section is used.
Intel XML parsing benchmark is composed of 31 XML files. Some of them are public available for benchmarking , some of them are typical customer cases. The file size varies from 0.5K to 5.9M. Most of them are composed of the characters in ASCII range.
After applying the STTNI optimization, Intel XML parser gets overall 25% performance improvement. For some cases, the performance gains can be achieved up to 70%. The longer the entity content which hit the STTNI shortcut is and the less STTNI execution times are, the better the performance will be.
The use of String and Text Processing Instructions in Intel SSE4 was shown to improve the performance of the character range checking in Intel XML parsing. For files ranging from 0.5k to 5.9M the STTNI optimization yielded an overall performance improvement of 1.25X with some optimizations getting a 1.7X improvement in performance.
About the Author
Zhai Lei is a Senior Software Engineer in the Software Solutions Group at Intel Corporation working for the XML Engineering team. His current focus is on Intel XML parsing acceleration, on both algorithm level and instruction level, such as Intel SSE optimizations for XML processing. His email is email@example.com.
 For more details on XML documents, see the W3C XML spec Extensible Markup Language (XML) 1.0 (Fourth Edition) and Namespaces in XML 1.0 (Second Edition).
 0x0A’, ‘xOD’ have special new Line processing, so we don’t put them in the common Character Data rules.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.