Integer SIMD instructions weirdness -- needs fixing

I hope that Intel engineers are going to read this, and improve integer SIMD instructions on future CPUs.

1. PSRLB/PSLLB/PSRAB/PSLAB -- they do not exist. How about adding them?

2. All SIMD shifts -- it is not possible to shift each packed byte/word/dword/qword by a different amount.

Regarding #2:

- Why did you make another SIMD register the count parameter for those instructions if we cannot specify more than one shift value?!? You could simply have used a GPR for the count parameter!

PSRLW xmm0, eax would be more useful than PSRLW xmm0, xmm2, not to mention that you could use xmm2 to hold 8 different shift counts, one for each word in the destination.

Because of someone's mistake, we will now never have a proper SIMD shift instruction -- the behavior of PSRLW xmm0, xmm2 can never be changed. If we want a more useful shift we will need another instruction, and the current one will stay forever as dead weight in the x86 instruction set.

3. SIMD bit manipulation (PBSETB/W/D/Q, PBCLRB/W/D/Q, PBTSTB/W/D/Q) -- it would be nice to have instructions which can set, clear, or test a different bit in each packed byte, word, dword, or qword.


xmm0 = 0 0 0 0 (dword)
xmm1 = 4 3 2 1 (dword)

PBSETD xmm0, xmm1

xmm0 = 0x10, 0x08, 0x04, 0x02 (dword)

To be continued.


Fully agreed. SSE obviously was never designed for general purpose use, but to accelerate specific workloads. Then, as an afterthought, more generality was added. And now we have a mess. :(

AVX seemed like a good new start. But so far I don't have the feeling it will help at all - rather make everything even more complicated.

About the SIMD shifts: it took me more than a day, IIRC, to figure out why my shifts were not working. Until I realized that PSRLW xmm0, xmm1 behaves like you said. That was not clear from the documentation and completely against any consistency expectations of the SSE API. (I was hoping for a fast release of LRBni hardware back then. Those dreams were shattered...)

Yes, and it is still easy to misunderstand the PSRLW documentation today!

As I said, I will continue to list things I consider drawbacks in this thread:


This instruction (PBLENDVB) would be much more useful if you could specify the byte position where you want each byte loaded instead of just a load / don't load flag for each byte.

It would be even more useful if it did not require 16-byte alignment and if it did not touch memory beyond what you specify in the mask register.


esi = 55 AA C3 90 15 FF FF FF FF FF FF FF FF FF FF FF (FF means unreadable location)
xmm1 = empty

xmm0 = 80 80 80 80 80 80 80 80 80 80 80 07 06 05 04 03 (80 means do not read byte from source, 0-15 means put into that byte position in destination)

PBLENDVB xmm1, xmmword ptr [esi]

xmm1 = 00 00 00 00 00 00 00 00 55 AA C3 90 15 00 00 00


Insert and extract instructions accept only an immediate position parameter -- this should have been a GPR!

If you need to read 5 bytes from memory into a SIMD register and you are at the end of a valid memory page, the only option is to MOVZX into a GPR and then PINSRB five times. If you need a variable byte count, you need very complex and slow code, because PINSRB does not accept a GPR as the position parameter, which would at least make it possible to write a loop instead of using comparisons and branches.

Igor, SIMD shifts and many other integer SIMD issues are solved in the upcoming XOP instruction set. It would be great to have it in Intel chips too... (instead of enlarging the mess with yet another incompatible extension)

Igor, your assumption

If we want more useful shift we will need another instruction, and the current one will stay forever as a dead weight in the x86 instruction set.

is not strictly a foregone conclusion. By clearing some CPUID flags, and defining a reserved bit for a better use of these opcodes, Intel could correct that without breaking backward compatibility. Even more, for an intermediate processor generation, a flag in a status word could switch between the deprecated behavior and the corrected one.

But I am not sure whether a processor manufacturer would do something which might taste like admitting a mistake.

Once the instruction is defined, its behavior must not change or the old code will break.

Old code has no way of knowing about your proposed new flags, and it expects the instruction to work in the same way it did when the code was written.

Thus, the only way to change some instruction behavior is to introduce new instruction encoding -- in other words, a new instruction.

For example, if you had:

PSRLW xmm0, xmm2

Where xmm2 contains a single shift count in the low order word, and you wanted to have:

PSRLW xmm0, xmm2

Where xmm2 contains 8 different shift counts (one in each word) instead, it would be impossible to keep the same instruction name and encoding without breaking existing software.

That is why I said that even if they introduce such new instruction, the old one will remain as a dead weight forever. Just look what happened with LAHF in 64-bit mode, and you will understand what I am talking about.

I am amazed that Intel CPU engineers can afford the luxury of designing instructions without thoroughly thinking through what would be most useful to developers.

In case of PSRLW, it is blindingly obvious that:

1. PSRLW xmmdst, imm8 (where imm8 is a shift count value from 0 to 15)
2. PSRLW xmmdst, reg32 (where reg32 holds a single shift count value in the low word)
3. PSRLW xmmdst, xmmsrc/mem128 (where xmmsrc/mem128 holds 8 different shift count values)

was the way to go, and while Intel can still introduce PSRLW xmmdst, reg32, they cannot change the behavior of PSRLW xmmdst, xmmsrc/mem128 without breaking existing code.

While we are complaining about missing integer instructions, how about better support for bytes? Specifically, I am missing pmullb, pmulhb, psllb, psrlb. There are already word and dword versions of these, so it doesn't seem a stretch to have byte versions.

The motivation for these instructions is this computation:

(a * b + 0x7f) / 255

which is common when blending pixels with 8-bit channels. The division by 255 can be done like this: (t = (a * b) + 0x80, (t + (t >> 8)) >> 8), but for that to work with current SIMD instruction sets, you have to unpack the inputs to 16 bits, which means quite a bit of arithmetic and register pollution.

If there were pmul{lh}b and psrlb instructions, you could do this instead:

    // input uint8_t a, b

    uint8_t hi, lo, t;

    hi = ((a * b) & 0xff00) >> 8;        // pmulhb
    lo = ((a * b) & 0x00ff);             // pmullb

    t = lo >> 7;                         // psrlb
    hi += t;                             // paddb

    t = hi ^ 0x7f;                       // pxor
    if ((int8_t)lo > (int8_t)t)          // pcmpgtb
        lo = 0xff;
    else
        lo = 0x00;

    return hi - lo;                      // psubb

This is only seven instructions for sixteen byte lanes at a time and, importantly, it has only three live variables at any point.

It can easily be verified that it is equivalent to (a * b + 0x7f) / 255 for all inputs.

I agree that those instructions would be useful too, especially byte shifts.

We have PMOVMSKB. It compacts the sign bits of individual bytes to a GPR.

It would be nice if we had the opposite operation as well -- move bits from a GPR to the sign bits of individual bytes in a SIMD register. This would enable:

- parallel bit processing/testing
- more flexible blending with PBLENDVB

It looks like something you can do easily (for MMX code) with a PAND 0x7f7f7f7f7f7f7f7f plus a POR with a 256 x 64-bit lookup table (or just a MOVQ from the LUT for a lot of useful cases).

Yes, but PMOVMSKB works with all 16 bytes, so there are 16 sign bits that get packed into a GPR.

Therefore, you would need to be able to do the same in reverse (16 bits from GPR to 16 bytes' sign bits in SIMD). It is possible but impractical -- hardware implementation would certainly be faster.

Finally, I don't see how only 8 LUT entries would be enough?

You are right, I answered way too fast, sorry. I had in mind a 256-entry LUT (I mixed it up with the 8-bit index). I have kinda fixed the post now.

Now indeed, 16-bit indices will lead to a monster 1 MB LUT, which is impractical.

Actually, there is a way to do it with only a 2 KB LUT -- you make a 256-entry LUT with 8 bytes per entry and look it up twice (once with the high and once with the low byte).

Still, I think that even a 2 KB LUT is large enough to compete for L1 cache with the rest of the algorithm.

Of course, you can go down further and work on four nibbles, reducing the LUT to 16 entries of 4 bytes each (a 64-byte LUT), but that considerably increases code complexity, both in the number of arithmetic/logic operations and in the number of memory accesses required, to accomplish something hardware could (and should) do with a single instruction.


You can avoid the lookup table altogether by using a combination of pshufb and pcmpeqb:

__m128i v = _mm_shuffle_epi8(input, _mm_set_epi8(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0));

const __m128i mask = _mm_set_epi8(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
v = _mm_and_si128(v, mask);
v = _mm_cmpeq_epi8(v, mask);

First, the 16 bits are broadcast across the 16 bytes with the shuffle instruction. Then the relevant bit is masked out with "and" for each byte. Finally, all the bits are set with the comparison.

Kind regards


I want those 16 bits from EAX unpacked to sign bit of corresponding byte in the SIMD register. I only need sign bits to be set/cleared. Any ideas on how to do that?


You mean something like this:

unsigned short signs;
__m128i values, v;
v = _mm_insert_epi16(v, signs, 0);
v = _mm_shuffle_epi8(v, _mm_set_epi8(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0));
const __m128i mask = _mm_set_epi8(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
v = _mm_and_si128(v, mask);
v = _mm_cmpeq_epi8(v, mask);
v = _mm_and_si128(v, _mm_set1_epi8(128));
values = _mm_and_si128(values, _mm_set1_epi8(255));
values = _mm_or_si128(values, v);

I'm more fluent in intrinsics than assembler, but I assume that the translation is easy for you. Unfortunately, I don't know any more efficient way than the and-and-or sequence for blending the bits. It goes without saying that the constants should be precomputed and kept in memory or registers. If the variable "signs" comes from memory and is part of a loop, you can unroll the loop and load directly into v. By using different shuffles you can then reduce the loads and avoid the insertions.

By the way, I've learned this trick of using cmpeq to propagate bits across a byte from Nick, so credit is due to him.

Kind regards


Something like that, yes. However, in line 9 you need to have a 0x7F mask, not 0xFF.

The other option would be to insert the bits into a register which is already cleared -- in that case, the AND with 0x7F can be left out.

Still, this code requires 4 constants which you must keep in registers, which does not leave a lot of space for other things, and it is not exactly trivial. Having a hardware instruction that does this would be great.


you are right about the mask in line 9. Thanks for catching that.

I was not trying to argue that the proposed code completely solves your problem, but it was the best I could think of.

Kind regards


No problem, thanks for teaching me a new trick :)

I intend to keep posting my own ideas for new, potentially useful CPU instructions here. If others want to contribute, they are welcome.

If others want to contribute they are welcome

I would welcome a new branch prediction hint: RANDOMLY_TAKEN.

It will be useful for branches that are pathologically hard to predict (high branch misprediction rate) and when running 2 threads on a core (or 4 threads on future x86 targets like MIC).

The idea is that when a branch tagged with such a hint is encountered, its thread stalls until the branch is resolved (if other threads are running on the same core), letting the other thread(s) do useful work instead of wasting resources (and power) executing the wrong path nearly half the time.

Even with vectorized code, hard-to-predict branches are sometimes useful for providing speedups (for example, if they allow skipping several instructions, including loads); see for example this post of mine:

The 4th constant can be avoided by merging the bits using a ^ ((a^b) & mask) as suggested in this collection of bit twiddling hacks. This results in:

unsigned short signs;
__m128i values, v;
v = _mm_insert_epi16(v, signs, 0);
v = _mm_shuffle_epi8(v, _mm_set_epi8(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0));
const __m128i mask = _mm_set_epi8(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
v = _mm_and_si128(v, mask);
v = _mm_cmpeq_epi8(v, mask);
v = _mm_xor_si128(v, values);
v = _mm_and_si128(v, _mm_set1_epi8(128));
values = _mm_xor_si128(values, v);

For the sake of completeness, I would also like to mention that there is a psignb instruction that can replace the last 3 instructions if someone wants to negate the values rather than just copy the sign bits.


Thanks for improving the code further, now I am curious to test its performance.

Also, many thanks for the link, a lot of nice tricks to learn.

>>PSRLW xmm0, eax would be more useful than PSRLW xmm0, xmm2, not to mention that you could use xmm2 to have 8 different shift counts, one for each word in the destination.

I fully agree. Add

PSRLW xmm0, imm8

etc for L/R B/W/D/Q/QQ (QQ would use ymm...)

XMM registers are too scarce of a resource to be wasted for an 8-bit shift count.
Different shift values are interesting:

PSRL? xmm0,xmm1

where ? = B/W/D/Q/DQ/QQ (QQ would use ymm...)

and where the second register contains one count per ? element.
I am not sure how frequently the differing counts would appear in code.

Also, I agree with you that there should be orthogonality. We have "move sign bits of (b/w/d/q/qq) to a GP register"; there should also be a move the other way. Some thought should be given as to whether 0x80 or 0xFF gets moved in for bytes. I think 0xFF would align with the masks generated by the compares.

Jim Dempsey

Jim, there is already PSRLW xmm, imm8 if I am not mistaken :)

Different shift counts would enable really interesting bit manipulations, especially if a horizontal OR instruction were added. Perhaps even better if we had a packed bit shuffle instruction.

Regarding masks, I prefer 0x80.

Quoting Igor Levicki:

Regarding masks, I prefer 0x80.

Can you elaborate a little on why this is the case? My personal choice would be a mask. I can also understand if someone wants to negate or copy the sign, but I cannot see why you would like to set only the sign bit of an integer.


Please read previous post by Jim Dempsey for context regarding 0x80 or 0xFF.

The whole idea of being able to move sign bits from a GPR to a SIMD register is to do bit packing/unpacking, which is needed, for example, for planar to packed (a.k.a. chunky) conversion.

For example, PMOVMSKB can be used for packed to planar conversion because it can move the sign bits of 8 successive bytes from a SIMD register into a single byte in a GPR, which can then be stored as 8 successive bits of bitplane 7 in memory. Then you shift the whole SIMD register left by 1 bit and repeat PMOVMSKB, storing the next result into bitplane 6, etc.

The reverse would take a byte from plane 0 in memory into a GPR, move it to the sign bits in a SIMD register using the instruction that does the opposite of PMOVMSKB, then shift the whole SIMD register right by 1 bit, and repeat with a byte from plane 1, until you get to plane 7. In the end you would have 8 bytes in the SIMD register which you could write out to your packed image buffer.

Bitplane conversion (packed to planar and vice versa) is used in some image compression algorithms.

Thanks a lot for the explanation. Now, I understand.

You are welcome Thomas.
