Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)


Introduction

Intel® Streaming SIMD Extensions 4 (Intel® SSE4) is a new set of Single Instruction Multiple Data (SIMD) instructions designed to improve the performance of various applications, such as video encoders, image processing, 3D games, and string/text processing. Intel SSE4 builds upon the Intel® 64 and IA-32 instruction set, the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4.2 is a new subset of Intel SSE4 instructions introduced with the 45nm next-generation Intel® microarchitecture (Nehalem).

Intel SSE4.2 offers unique string and text processing capabilities (STTNI) that operate simultaneously on 16 bytes in two strings. This white paper describes how Intel SSE4.2 instructions are used in the schema validation component of the Intel® XML Software Suite to achieve a significant performance speedup and a reduction in memory consumption.


Parallel TRIE with Intel® SSE4.2/STTNI

Schema validation reads an XML document and validates it against a schema, which asserts the structure and data types of the XML data. One common operation of schema validation is to compare a string in the input XML data against the strings pre-defined in the schema. One frequent example: if the input string is an element/attribute name, schema validation looks up all possible element/attribute names under the current validation context, asserts that the input string matches one of them, and fetches the corresponding declaration for further validation. Another example: when validating a string against a schema data type such as a string enumeration, the string is compared against a set of valid strings and must match one of them.

The schema is typically represented internally by hash tables or sorted string lists to support efficient lookup. A hash table associates each string with a hash value and organizes the strings according to those hash values. A string lookup in a hash table consists of hash value computation followed by string comparisons to pick the string out of those sharing the same hash value. For instance, all element names allowed under a specific structure might be organized as a hash table so that an input string can be validated as a legal or illegal element name via a simple hash table lookup. The alternative is to organize the strings as a sorted string list and use a two-dimensional binary search algorithm to look up a string among the pre-sorted strings.
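For illustration only, the two baseline lookups described above might be sketched with standard C++ containers as follows; the actual implementation uses its own hash table and a two-dimensional binary search over the sorted string list rather than the flat search shown here.

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hash table baseline: hash value computation plus string comparisons
// against the strings that share the same hash value (the bucket).
bool inHashTable(const std::unordered_set<std::string>& names, const std::string& s)
{
    return names.find(s) != names.end();
}

// Sorted string list baseline: binary search over the pre-sorted strings.
bool inSortedList(const std::vector<std::string>& sorted, const std::string& s)
{
    return std::binary_search(sorted.begin(), sorted.end(), s);
}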

The quality of schema validation is largely determined by its validation speed. Therefore, the string lookup operation is often the target of performance optimization to improve schema validation quality. In our original implementation of SAX-style schema validation, hash table lookups account for a significant amount of time, up to 30%-35% especially when the strings are long, and the two-dimensional binary search lookups account for up to 15%-20%.

An innovative string dictionary based on a parallel TRIE data structure is designed for efficient string lookup and low memory consumption in schema validation. This data structure fully utilizes the Intel SSE4.2/STTNI instructions, which offer unique capabilities for string and text processing to simultaneously operate on 16 bytes in two strings under a rich set of programmable modes, controlled by an immediate byte.

Figure 1. Original TRIE Organization

In a traditional TRIE (depicted in Figure 1), each TRIE layer is organized as a simple 256-slot sparse character array, assuming that all characters are within the ASCII range. According to the current character of the input string, the next TRIE layer can be found quickly by “NEXT(string, position) = sparse_array[string[position]]”. However, in practice, strings in the TRIE often share a common prefix or many characters, so only a few slots of the array are filled with meaningful pointers, and it typically takes several lookups between layers to finally locate the item with the associated string.
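As a rough sketch (type names are ours, not the product's), a traditional TRIE layer and its per-character lookup could be expressed as:

#include <cstddef>

// Each layer is a 256-slot sparse array indexed by the current ASCII character.
struct TrieLayer {
    TrieLayer*  next[256];   // mostly empty in practice
    const void* item;        // set when a string terminates at this layer
};

// NEXT(string, position) = sparse_array[string[position]]
inline TrieLayer* nextLayer(const TrieLayer* layer, const char* string, size_t position)
{
    return layer->next[static_cast<unsigned char>(string[position])];
}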

The parallel TRIE overcomes these weaknesses by (1) organizing each TRIE layer as a compact string instead of a big sparse array; (2) combining two adjacent characters (bytes) into one short atomic integer to reduce the likelihood of collision so that fewer TRIE layers are needed; and (3) organizing the parallel TRIE to fully utilize the parallel string comparison capability provided by Intel SSE4.2/STTNI for high-performance string lookup.

Figure 2. Example of Parallel TRIE Organization

Figure 2 shows an example of the parallel TRIE layer organization. For each slot in a parallel TRIE layer, there is a link pair that serves two different purposes (a structural sketch follows the list):

  • When there is no next-level data structure, one link points at the odd-length string and the other link points at the even-length string.
  • When there is a next-level data structure, one link points at the next-level TRIE data structure for the strings that have another 16 bytes to look up, and the other link points at the remaining strings, which have no further 16 bytes to look up.
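As a structural sketch (member names are illustrative assumptions, not the product's data layout), one parallel TRIE layer described above could be laid out as:

struct ParallelTrieLayer {
    unsigned short* compact;  // compact string: one 16-bit atomic integer per slot
    int             slots;    // number of occupied slots

    struct LinkPair {
        // Without a next level: link0 -> item for the odd-length string,
        //                       link1 -> item for the even-length string.
        // With a next level:    link0 -> next parallel TRIE layer (strings with
        //                       another 16 bytes to look up), link1 -> fail-through
        //                       strings (no further 16 bytes to look up).
        void* link0;
        void* link1;
        bool  hasNextLevel;
    };
    LinkPair* links;          // one link pair per slot of the compact string
};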

 

In the parallel TRIE, the number of layers (and of lookups between layers) can be reduced by careful selection of the short atomic integer from the input string. A traditional TRIE directly uses the character at the current string position to locate the next-level data structure. A parallel TRIE, however, uses a non-conflicting short integer within the selection window (from the current position up to 16 bytes ahead) of the input string to locate the next-level data structure. Here, a conflict means that the short integer chosen from the selection window is already present in the compact string of the current parallel TRIE layer.

If there is a conflict in the short atomic integer selection, another short integer is chosen from the selection window and is inserted just before the conflicting position. If no short integer is available to resolve the conflict, the string is either added to the fail-through strings (the number of which is generally small, so an appropriate data structure such as a list or hash table can be chosen) or used to construct the next parallel TRIE layer, depending on whether the input string is long enough (more than 16 bytes) for another lookup.
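A hedged sketch of this selection step, reusing the ParallelTrieLayer layout sketched earlier (helper and member names are illustrative), could be:

// True if the 16-bit candidate already appears in the layer's compact string.
static bool layerContains(const ParallelTrieLayer& layer, unsigned short value)
{
    for (int s = 0; s < layer.slots; ++s)
        if (layer.compact[s] == value) return true;
    return false;
}

// Return the word offset of the first non-conflicting short integer in the
// 16-byte selection window, or -1 if every candidate conflicts (the string
// then becomes a fail-through string or the root of another parallel TRIE layer).
int chooseAtomicInteger(const unsigned short* window, int windowWords,
                        const ParallelTrieLayer& layer)
{
    for (int i = 0; i < windowWords; ++i)
        if (!layerContains(layer, window[i])) return i;
    return -1;
}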

Figure 3 demonstrates how the parallel TRIE is constructed step by step. Assuming “DocumentTypeCode” is to be inserted after “DocumentType”, the short integer “Do” causes a conflict because it is already in the compact string after “DocumentType” was inserted. Therefore, another short integer, “Co”, is chosen from the selection window ranging from ‘D’ to ‘e’ and is inserted just before “Do”. In addition, after “OtherUnclassifiedType”, “OtherType1” and “OtherType2” are inserted one by one, “OtherType1” and “OtherType2” form the fail-through strings and “OtherUnclassifiedType” forms another parallel TRIE layer.

Figure 3. Parallel TRIE Construction by Appropriate Atomic Short Integer Selection

When an input string is searched in the constructed parallel TRIE, the selection window of the input string is compared with the compact string of the parallel TRIE layer. The left-most satisfied slot is selected for further action, depending on whether it points at fail-through strings, another parallel TRIE layer, or an item with an associated string. If it does not point at an item with an associated string (shown as dashed lines in Figure 3), it points either at fail-through strings, which have no further 16 bytes to look up, or at another parallel TRIE layer, which requires a recursive lookup after the selection window is shifted by fetching another 16 bytes. Finally, a string comparison verifies that the associated string of the located item is the same as the input string.

Figure 4. Dramatically Accelerate String Lookup by Intel® SSE4.2/STTNI

With the parallel string comparison capability provided by Intel SSE4.2/STTNI, the string lookup can be performed in a dramatically more efficient, parallel manner: the 16-byte selection window from the input string and 16 bytes from the compact string of a parallel TRIE layer are compared by just one hardware SIMD instruction. Figure 4 shows how a hardware SIMD instruction on the Nehalem platform looks up the string “DocumentTypeCode” in the parallel TRIE of Figure 3: it compares the selection window “DocumentTypeCode” with the compact string “TiCoDoPuOt” in one SIMD instruction, and the left-most satisfied slot, “Co”, is found, which directly points at the item with the associated string “DocumentTypeCode”.

In summary, Table 1 shows the major differences between the parallel TRIE and the traditional TRIE data structure with the help of STTNI.

Table 1. Summary of Parallel TRIE with Intel® SSE4.2/STTNI

  • Difference compared with traditional TRIE: Each parallel TRIE layer is organized as a compact string instead of a big sparse character array.
    Advantage compared with traditional TRIE: Small memory consumption with relatively fast access with the help of STTNI.
    How STTNI is used: STTNI finds the satisfied slot in the compact string of each parallel TRIE layer.

  • Difference compared with traditional TRIE: Two bytes (a 16-bit short integer) are used as one atomic unit in the compact string.
    Advantage compared with traditional TRIE: The likelihood of conflict is reduced.
    How STTNI is used: The word (16-bit short integer) granularity mode of STTNI is used.

  • Difference compared with traditional TRIE: String lookup compares 16 bytes of the input string with 16 bytes of the compact string in an efficient parallel manner.
    Advantage compared with traditional TRIE: The number of string lookups among layers is reduced and the item with the associated string is located quickly with the help of STTNI.
    How STTNI is used: STTNI compares the strings efficiently in a parallel manner.

 

Intrinsic functions _mm_cmpestri and _mm_cmpestrm, which wrap the Intel SSE4.2 instructions PCMPESTRI and PCMPESTRM, are provided by the Intel C/C++ Compiler 10.0. A constant control mask is set so that the instruction compares the input string with the compact string at 16-bit short integer granularity and returns the left-most matched index or the mask of matched slots in a parallel manner. On platforms without hardware Intel SSE4.2/STTNI support, or when the compiler cannot recognize these intrinsic functions, a software-optimized simulation function (inlined, returning immediately when the first matched slot is found) is provided to perform the same functionality.
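As a minimal standalone illustration of this mode (buffer names and sizes are assumptions for the example, not product code):

#include <nmmintrin.h>   // Intel SSE4.2 string/text intrinsics

// Return the index (0..7) of the left-most 16-bit word of 'compact' that equals
// any word of 'window', or 8 if no word of this fragment matches.
int findSlot(const unsigned short window[8], int windowWords,
             const unsigned short compact[8], int compactWords)
{
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(window));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(compact));
    return _mm_cmpestri(a, windowWords, b, compactWords,
                        _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);
}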

Figure 5 shows the code fragment of the string lookup that finds next-level information in the parallel TRIE, and Figure 6 shows the code fragment of the string lookup that resolves conflicts in the selection window during parallel TRIE construction.

Figure 5. Code Fragment of String Lookup to Find Next Level Information

#include <nmmintrin.h>  // Intel SSE4.2/STTNI intrinsics (_mm_cmpestri, _SIDD_* flags)

// Compare as unsigned 16-bit words, "equal any" aggregation, return the left-most index.
#define STTNI_MASK (_SIDD_CMP_EQUAL_ANY | _SIDD_UWORD_OPS | _SIDD_LEAST_SIGNIFICANT)

unsigned short * rules = curr_trie->getRules();   // compact string of the current layer
unsigned short * input = static_cast<unsigned short*>(
    static_cast<void*>(const_cast<char*>(string + position)));  // 16-byte selection window
unsigned short * pt = rules;
int index = 8, len = next_size;                   // len counts 16-bit slots; 8 means "no match"
// actlen: number of valid 16-bit words in the selection window (set by the caller)

#if defined(_STTNI)
if (usingSTTNI) {
    const __m128i ahead = _mm_loadu_si128((__m128i*)input);
    // One PCMPESTRI compares up to 8 slots of the compact string against the
    // whole selection window and returns the left-most satisfied slot.
    while (len > 0 &&
           (index = _mm_cmpestri(ahead, actlen,
                                 _mm_loadu_si128((__m128i*)pt), len, STTNI_MASK)) == 8) {
        pt += 8; len -= 8;
    }
}
else
#endif
{
    // Software simulation of PCMPESTRI when STTNI is not available.
    while (len > 0 &&
           (index = pcompstri_eqany(input, (actlen > 8 ? 8 : actlen),
                                    pt, (len > 8 ? 8 : len))) == 8) {
        pt += 8; len -= 8;
    }
}

if (index == 8) return NULL;     // no slot of the compact string matched
index += (pt - rules);           // index is what we need: the absolute slot position
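The software fallback pcompstri_eqany used above is not listed in this paper; a minimal sketch of a function with the semantics the fragment relies on might look like the following (parameter names are ours, and the product's simulation function may differ):

// Simulates PCMPESTRI in unsigned-word / equal-any / least-significant mode:
// return the index of the first 16-bit slot of 'rules' that equals any word of
// 'input', or 8 if no slot of this fragment matches.
static inline int pcompstri_eqany(const unsigned short* input, int inputlen,
                                  const unsigned short* rules, int ruleslen)
{
    if (inputlen > 8) inputlen = 8;   // one 128-bit fragment holds at most 8 words
    if (ruleslen > 8) ruleslen = 8;

    for (int i = 0; i < ruleslen; ++i)
        for (int j = 0; j < inputlen; ++j)
            if (rules[i] == input[j]) return i;   // return on the first matched slot
    return 8;
}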

 

Figure 6. Code Fragment of String Lookup to Resolve Conflict in Selection Window

// Same STTNI control byte as in Figure 5 (unsigned words, "equal any", left-most index).
#define STTNI_MASK (_SIDD_CMP_EQUAL_ANY | _SIDD_UWORD_OPS | _SIDD_LEAST_SIGNIFICANT)

for (int m = 0; m < size; m++) {
    Item listitem = static_cast<Item>(list_trie->getNextLink(m));
    int oldlength = Helper::length(listitem);
    // Length of this string's rule fragment in 16-bit words, capped at 8 (16 bytes).
    int actrulelen = (oldlength - position > 16) ? 8 : (oldlength - position) >> 1;
    unsigned short * rules = static_cast<unsigned short*>(
        static_cast<void*>(Helper::string(listitem) + position));

#if defined(_STTNI)
    if (usingSTTNI) {
        // PCMPESTRM returns a bit mask: bit j is set when word j of the selection
        // window (input) equals any word of this string's rules, i.e. conflicts.
        mask |= (int)_mm_cvtsi128_si32(
            _mm_cmpestrm(_mm_loadu_si128((__m128i*)rules), actrulelen,
                         _mm_loadu_si128((__m128i*)input), actinputlen, STTNI_MASK));
    }
    else
#endif
    {
        // Software simulation of PCMPESTRM when STTNI is not available.
        mask |= (int)pcompstrm_eqany(rules, actrulelen, input, actinputlen);
    }
}

// The first word of the selection window whose bit is still clear has no conflict.
for (int pos = 0; pos < actinputlen; pos++) {
    if ((mask & (1 << pos)) == 0) {
        nextrule = input[pos];   // next rule is what we need
        break;
    }
}
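Similarly, a hedged sketch of the software counterpart pcompstrm_eqany referenced in Figure 6 (again with our own parameter names; the product's simulation function may differ) could be:

// Simulates PCMPESTRM in unsigned-word / equal-any / bit-mask mode: bit j of the
// result is set when input[j] equals any 16-bit word of 'rules'.
static inline unsigned int pcompstrm_eqany(const unsigned short* rules, int ruleslen,
                                           const unsigned short* input, int inputlen)
{
    if (ruleslen > 8) ruleslen = 8;   // one 128-bit fragment holds at most 8 words
    if (inputlen > 8) inputlen = 8;

    unsigned int mask = 0;
    for (int j = 0; j < inputlen; ++j)
        for (int i = 0; i < ruleslen; ++i)
            if (input[j] == rules[i]) { mask |= (1u << j); break; }
    return mask;
}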

 


Results

Schema validation processing consists of two core components: SAX-style parsing and validation runtime processing by a series of event handlers. Each of them accounts for roughly half of the CPU clock ticks of overall core schema validation, according to profiling data collected with the VTune™ analyzer.

In the schema validation event handlers, the hot bottlenecks listed below, where core schema validation needs to look up associated information according to an input string, are all replaced with the string dictionary based on the parallel TRIE. When no hardware Intel SSE4.2/STTNI instructions are available, the software-optimized simulation functions (inlined, returning immediately when the first matched slot is found) perform the same functionality.

  • In cases containing string enumeration, verify whether the input string exists in a set of valid strings.
  • Look up attribute declaration according to attribute namespace and local name.
  • Look up element declaration according to element namespace and local name.
  • Look up next state in model group DFA according to current state, element namespace and local name.

 

Figure 7 shows the overall core schema validation performance speedup and memory reduction when all of the above hot bottlenecks are replaced with the string dictionary based on the parallel TRIE. The performance data was collected on a Nehalem platform with 1 GB of memory, using 14 schema validation cases for benchmarking. Without hardware Intel SSE4.2/STTNI support there are minor improvements, but with hardware Intel SSE4.2/STTNI support on the Nehalem platform there is an average performance speedup of about 21% compared with the original solution.

Figure 7. Schema Validation Performance Speedup and Memory Reduction


Conclusion

This white paper describes how Intel SSE4.2/STTNI instructions are used in the schema validation component of the Intel XML Software Suite to speed up performance and reduce memory consumption in attribute/element declaration lookup, string enumeration facet validation, and model group state transition, which are hot bottlenecks in core schema validation.


About the Author

Yongnian Le is a Senior Software Engineer in the Software Solutions Group at Intel Corporation working on the XML Engineering team. His current focus is on high-performance XML front-end products, such as the XML parser and schema validation. He is responsible for the Intel SSE4.2/STTNI-enabled schema validation project and is also involved with leading the development of the schema direct parsing project. Yongnian has been at Intel for 5 years and has experience in compiler code-generation optimization, debug capabilities, and platform-specific support in the Intel® C++ compiler and the IXP network processor compiler. His email is yongnian.le@intel.com.


For more complete information about compiler-based optimizations, see our Optimization Notice.