Intel® Streaming SIMD Extensions 4 (Intel® SSE4) is a new set of Single Instruction Multiple Data (SIMD) instructions designed to improve the performance of various applications, such as video encoders, image processing, 3D games, and string/text processing. Intel SSE4 builds upon the Intel® 64 and IA-32 instruction set, the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4 will be introduced in the 45nm Next Generation Intel® Core™2 processor family.
Intel SSE4.2 introduces four instructions for "string and text processing", which can accelerate many string and text related applications, typically, the functionalities of finding characters in a set, checking characters in a range set, comparing two strings and finding substring from a string.
XML related applications such as Intel XML parsing become more and more common and popular in current information system, especially web service and SOA environment. In such environments, string and text processing is critical performance bottleneck, since many operations such as sub-string searching and regular expression checking need many complex algorithms.
This white paper will describe how Intel XML parsing can benefit from the Intel SSE4 instructions, achieving great performance speedups using string and text processing instructions.
String and text processing instructions in INTEL SSE4.2 allocates four instructions to provide a rich set of string and text processing capabilities that traditionally required many more instructions. These four instructions use XMM registers to process string or text elements of up to 128-bits (16 bytes or 8 words). Each instruction uses an immediate byte to support a rich set of programmable controls. A string-processing INTEL SSE4.2 instruction returns the result of processing each pair of string elements using either an index or a mask.
The capabilities of the string/text processing instructions include:
The four INTEL SSE4.2 instructions that process text/string fragments are:
All four require the use of an immediate byte to control operation. The two source operands can be XMM registers or a combination of XMM register and memory address. The immediate byte provides programmable control with the following attributes:
Depending on the output format associated with the instruction, the text/string processing instructions implicitly uses either a general-purpose register (ECX/RCX) or an XMM register (XMM0) to return the final result.
Two of the four text-string processing instructions specify string length explicitly. They use two general-purpose registers (EDX, EAX) to specify the number of valid data elements (either word or byte) in the source operands. The other two instructions specify valid string elements using null termination. A data element is considered valid only if it has a lower index than the least significant null data element.
There are four common programming models for using string and text processing in an application. These models differ in how the string comparison operation is. Below is a summary of each programming model, followed by an example and implementation guidelines for the preferred programming model, ranges.
XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a serious of character range rules. The main work of Intel XML parsing is to recognize these entities and their logic structures.
From Intel XML Parsing Accelerator, we found that character checking loop occupies more than 60% CPU cycles of the whole parsing process, depending on the property of benchmark. There are two kinds of important behavior in this loop, read bytes and check whether it is legal for its corresponding enity type. Without any parallel instructions for string comparison, this process must be implemented in serializing mode.
STTNI provides a parallel solution for byte or word comparison. It matches the optimization of bottleneck in XML parser. Therefore, the basic idea is implementing the parallel character check using Range model. While the enity rules are generally very complex which are defined inside Unicode standard, we cannot apply all of these range rules with STTNI, considering the overhead of the STTNI initialization and execution. The best situation is the one byte encoding range (like ASCII), so one STTNI instruction can check all the character in XMM register, and the worst should be characters in the register compose an encoding sequence, each item in which has its own different range rule. In addition, STTNI instructions can define up to 16 bytes characters for parallel check, so we had better provide characters as many as possible to extremely utilize the STTNI power.
So the principal of accelerate Intel XML parsing using STTNI could be concluded as following:
Principle 1: The character to be checked should be the longer the better.
Principle 2: The character rule logic should be simple and cover most of the popular XML documents.
According to Principle 1, we choose the CharData, Comments, CDATA Sections, Element/ Attribute Name, and Attribute Value to apply the STTNI optimization, since normally their content length is not very short. According to Principle2, we simplify the entity rules according to the concrete type of rule characteristic, to ensure using the least STTNI instructions to check the most part of the rule.
Figure 1 shows an example of Character Data rules simplification. We selected full 1 and 2 bytes encoding part, major part of 3 bytes encoding part, and eliminate the complex and infrequently used 4 bytes encoding part. In the simplified 2 and 3 bytes encoding parts, we should well utilize the common range [x80-xBE], only when it fails we need to check the range [x80-xBF], which can extremely save the STTNI execution times. The detail is shown by the code in Figure 3.
Figure 1. Sample Character Data Rules Simplification
Figure 2 shows the sample implementation of the Character Data parsing acceleration using PCMPESTRI, PCMPESTRM instructions with Range/Equal Any Models. Firstly, we locate the Character Data part in the XMM register with the nonCharaData characters. Then we check whether they all hit the 1 byte encoding range (ASCII range), if so continue to process the next data, otherwise invoke recogUnicodeRange (Figure 3) function to check the non hint part with the 2 and 3 bytes encoding short-cut, if it still doesn’t hit, fall back to the normal checking logic. When checking the rules that more than 2 bytes, we should first ensure that characters in XMM register doesn’t contain the incomplete encoding character, as shown in Figure 4.
Figure 2. Sample Implementation of Character Data Rules Check with STTNI
Figure 3. Character Data Rules Check in UTF-8 Except ASCII Range
Figure 4. Incomplete Encoding Bytes Check in XMM Register
We use the Intel XML parsing benchmark to measure the performance of string and text processing for applications. The optimized character range checking implementation described in the previous section is used.
Intel XML parsing benchmark is composed of 31 XML files. Some of them are public available for benchmarking , some of them are typical customer cases. The file size varies from 0.5K to 5.9M. Most of them are composed of the characters in ASCII range.
After applying the STTNI optimization, Intel XML parser gets overall 25% performance improvement. For some cases, the performance gains can be achieved up to 70%. The longer the entity content which hit the STTNI shortcut is and the less STTNI execution times are, the better the performance will be.
The use of String and Text Processing Instructions in Intel SSE4 was shown to improve the performance of the character range checking in Intel XML parsing. For files ranging from 0.5k to 5.9M the STTNI optimization yielded an overall performance improvement of 1.25X with some optimizations getting a 1.7X improvement in performance.
Zhai Lei is a Senior Software Engineer in the Software Solutions Group at Intel Corporation working for the XML Engineering team. His current focus is on Intel XML parsing acceleration, on both algorithm level and instruction level, such as Intel SSE optimizations for XML processing. His email is email@example.com.
 For more details on XML documents, see the W3C XML spec Extensible Markup Language (XML) 1.0 (Fourth Edition) and Namespaces in XML 1.0 (Second Edition).
 0x0A’, ‘xOD’ have special new Line processing, so we don’t put them in the common Character Data rules.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserverd for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804