PCMPESTRI behaviour

PCMPESTRI behaviour

I got 15 (not 16) as a result of PCMPESTRI instruction for such arguments:BYTE pbmem[] = { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x00, 0xFF, 0xBA, 0x00 };
BYTE pbcmp[] = { 0x00, 0x01, 0xBA, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22 };

__m128i cmp = _mm_loadu_si128( reinterpret_cast< const __m128i* >( pbcmp ) );
__m128i mem = _mm_loadu_si128( reinterpret_cast< const __m128i* >( pbmem ) );

DWORD dwFound = _mm_cmpestri( cmp, 3, mem, 16, _SIDD_UBYTE_OPS|_SIDD_CMP_EQUAL_ORDERED|_SIDD_POSITIVE_POLARITY|_SIDD_LEAST_SIGNIFICANT );

Is it correct?

7 post / 0 nuovi
Ultimo contenuto
Per informazioni complete sulle ottimizzazioni del compilatore, consultare l'Avviso sull'ottimizzazione

Is here anyone?

It looks like a taboo topic for discussion... Is it really a bug of the processor?

Best Reply

_mm_cmpestri/PCMPESTRI, expecially when isused together with _SIDD_CMP_EQUAL_ORDERED/substring search, with might be quite complicated for straightforward understanding,
therefore "Intel 64 and IA-32 ArchitecturesSoftware Developers ManualVolume 2 (2A & 2B):Instruction Set Reference, A-Z" ( available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-2b-instruction-set-a-z-manual.html ) has a dedicated description as: "4.1 IMM8 CONTROL BYTE OPERATION FOR PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM"

And you can alwaysdouble check behaviour by using SDE emulation, from http://software.intel.com/en-us/articles/intel-software-development-emul...

if doesnt take much of the time but hopefully will give more confidence about the result, if in doubt(s)

Thank you.
Yes, PCMPXXX are quite complicated + I noticed that they are too slow (search last byte and compare all if found algorithm is much faster for my tasks).

In your code, it tries to perform an ordered compare of a 3-byte reference string (starting with a null byte) with the 16 byte sequence "cmp", which had a null byte at offset 15. So the instruction gave you what you asked as a partial match of the 1st bytein the3-byte reference string.

Since the ordered comparison operation done by hardware is forward, I think naturally using these STTNI for forward search algorithm will be more favorable to producehigher speed up than attempting reverse sub-string matches.

Although I haven't tried reverse sub-string match algorithm yet, but I can offer some general heuristic, based on the Glibc string library work, which had improved roughly an order of magnitude with forward sub-string searches. For character matching, significant improvement in either direction were delivered by STTNI as well.

At the heuristic level, the strong gain in forward sub-string matches come from these factors:
1. loads should not be in the dependency chain (this the common thesis of OOO processor), if the expression of advancing buffer address caused the OOO engine to expose latency of the instruction, you need to work on the address expression to facilitate the OOO to start executing the load early. You can consult the code in the Optimization manual or in Glibc.
2. expediting 1st-byte-not-a-match cases, this is the 1st partwhere STTNI sub-string match excels over previous byte-granular algorithm. When you do reverse searching with the buffer length known, unless your match is at the tail end of the buffer, where you can be clever and luck will be on your side with byte-granular technique. Otherwise, STTNI will still be able to help in other test scenario.
3. Once you found a partial match in the forward direction, STTNI will also help you with verifying the initial partial match leads to a full match or a false match.
4. doing reverse search with a long reference string (> 16 bytes) would pose a bit more challenge with STTNI's forward search preference.
5. As a general rule, your mileage will vary, depending how you choose your test data set and how your code take advantage with the heuristic. For example, a database engine adopted STTNI for relational query operators, it now reports the highest non-cluster TPC-H performance.

When I wrote about "search last byte", I had in mind a search algorithm of one byte (last/first/minor,what) of the desired sequence. Vectorized, of course. And then comparison of all bytes of desired sequence, if found (via xmm/ymm also).In my tasks that a single byte is rare-meet within the buffer searched (misses are infrequent), the length of the entire sequence - from 1 to 4 bytes, and the length of the buffer - from tens to tens of thousands bytes. So, such algorithm is much faster than "direct" PCMPXXX usage, checked.Thank you for your comment!

Lascia un commento

Eseguire l'accesso per aggiungere un commento. Non siete membri? Iscriviti oggi