Hi all, this is my first post.
The paper "Inside Intel Next Generation Nehalem Architecture" by Ronak Singhal (SP08_NGMS001_100r_eng.pdf) compares a strlen that uses the PCMPSTRx instruction with ordinary x86 code. The SSE4.2 code looks very nice, but what is the approximate speedup?
And why was scalar x86 code used for the comparison? strlen can also be written with SSE2 instructions; here is my implementation: http://wmula.republika.pl/proj/sse2string/src/strlen.S. I'm wondering how much faster the SSE4.2 code is compared to that.
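For anyone who doesn't want to read the assembly, the SSE2 idea can be sketched roughly like this with C intrinsics. This is only an illustration and not a transcription of the linked strlen.S: the name `strlen_sse2` is made up, `__builtin_ctz` assumes GCC or Clang, and the aligned loads deliberately touch bytes outside the string that lie in the same 16-byte block (they never cross a page boundary, but tools like AddressSanitizer may complain).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Hypothetical name; scans 16 bytes per iteration using SSE2 compares. */
size_t strlen_sse2(const char *s)
{
    assert(s != NULL);

    const __m128i zero = _mm_setzero_si128();

    /* Round the pointer down to a 16-byte boundary so that every load is
       aligned and therefore stays within a single page. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned offset = (unsigned)((uintptr_t)s & 15);

    /* First block: compare all 16 bytes with zero, then clear the mask
       bits that belong to bytes located before the start of the string. */
    __m128i chunk = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    mask &= 0xFFFFu << offset;

    /* Following blocks: keep going until some byte equals zero. */
    while (mask == 0) {
        p += 16;
        chunk = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    }

    /* The lowest set bit gives the index of the first zero byte in the block. */
    return (size_t)((p + __builtin_ctz(mask)) - s);
}
#else
/* Fallback for targets without SSE2, so the sketch still compiles. */
size_t strlen_sse2(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
#endif
```

Calling `strlen_sse2("hello")` should give the same result as the library `strlen`. The main loop is just load, compare, movemask and branch per 16 bytes, which is why I would expect it to be a fairer baseline than byte-by-byte scalar code.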
BTW, what are the latency and throughput of the PCMPSTRx instructions? Does the latency depend on the input data, or is it constant? I didn't find answers in the recent manuals.