How to convert three 8-bit 1-channel images to a 24-bit three channels image using SSE

3 images before change
image after change

Thank you in advance!

Thomas Willhalm (Intel):

I would go for a combination of pshufb and pblendvb (or _mm_shuffle_epi8 and _mm_blendv_epi8 if you are talking intrinsics).

  1. Start by loading three registers with the first 16 values of R, G, and B.
  2. pshufb with an appropriate mask can be used to move the values to the correct positions, i.e.
    1. R1 to position 1, R2 to position 4, R3 to position 7, R4 to position 10, R5 to position 13, R6 to position 16
    2. G1 to position 2, G2 to position 5, G3 to position 8, and so on
    3. B1 to position 3, and so on
  3. Using pblendvb, you can merge the results and store them.
  4. The next 16 values are then G6, B6, R7, G7, B7, ..., B10, R11, G11, which require different shuffle masks.
  5. Finally, you store the last 16 values following a similar scheme.

After processing these 16*3 values, the pattern repeats.

You can avoid the blendvb by using proper masks with pshufb (masks that zero the bytes that are not needed) and then just OR the three pshufb results.
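The steps above, in the OR variant, could be sketched roughly as follows. This is only an illustration, not code from the thread: the function names interleave16 and planar_to_rgb24 are made up, the byte order is assumed to be R, G, B as in the list above, and the comments count pixels and bytes from 0 rather than from 1. A mask byte of -1 has its high bit set, which makes pshufb write a zero to that position. The USE_SSSE3 macro is an assumption for GCC/Clang so the function compiles without -mssse3; with other compilers, enable SSSE3 through their own options.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 (pshufb) */

#if defined(__GNUC__)
/* Assumption: GCC/Clang. Compiles the marked functions for SSSE3. */
#define USE_SSSE3 __attribute__((target("ssse3")))
#else
#define USE_SSSE3
#endif

enum { Z = -1 }; /* high bit set: pshufb writes a zero byte here */

/* Shuffle each channel with its mask and OR the three results. */
USE_SSSE3
static __m128i mix3(__m128i vr, __m128i vg, __m128i vb,
                    __m128i mr, __m128i mg, __m128i mb)
{
    return _mm_or_si128(_mm_or_si128(_mm_shuffle_epi8(vr, mr),
                                     _mm_shuffle_epi8(vg, mg)),
                        _mm_shuffle_epi8(vb, mb));
}

/* Hypothetical helper: 16 R, 16 G, 16 B bytes in, 48 packed bytes out. */
USE_SSSE3
static void interleave16(const uint8_t *r, const uint8_t *g,
                         const uint8_t *b, uint8_t *dst)
{
    const __m128i vr = _mm_loadu_si128((const __m128i *)r);
    const __m128i vg = _mm_loadu_si128((const __m128i *)g);
    const __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Output bytes 0..15: R0 G0 B0 R1 ... B4 R5 */
    const __m128i r0 = _mm_setr_epi8(0,Z,Z,1,Z,Z,2,Z,Z,3,Z,Z,4,Z,Z,5);
    const __m128i g0 = _mm_setr_epi8(Z,0,Z,Z,1,Z,Z,2,Z,Z,3,Z,Z,4,Z,Z);
    const __m128i b0 = _mm_setr_epi8(Z,Z,0,Z,Z,1,Z,Z,2,Z,Z,3,Z,Z,4,Z);
    /* Output bytes 16..31: G5 B5 R6 ... B9 R10 G10 */
    const __m128i r1 = _mm_setr_epi8(Z,Z,6,Z,Z,7,Z,Z,8,Z,Z,9,Z,Z,10,Z);
    const __m128i g1 = _mm_setr_epi8(5,Z,Z,6,Z,Z,7,Z,Z,8,Z,Z,9,Z,Z,10);
    const __m128i b1 = _mm_setr_epi8(Z,5,Z,Z,6,Z,Z,7,Z,Z,8,Z,Z,9,Z,Z);
    /* Output bytes 32..47: B10 R11 G11 ... R15 G15 B15 */
    const __m128i r2 = _mm_setr_epi8(Z,11,Z,Z,12,Z,Z,13,Z,Z,14,Z,Z,15,Z,Z);
    const __m128i g2 = _mm_setr_epi8(Z,Z,11,Z,Z,12,Z,Z,13,Z,Z,14,Z,Z,15,Z);
    const __m128i b2 = _mm_setr_epi8(10,Z,Z,11,Z,Z,12,Z,Z,13,Z,Z,14,Z,Z,15);

    _mm_storeu_si128((__m128i *)(dst +  0), mix3(vr, vg, vb, r0, g0, b0));
    _mm_storeu_si128((__m128i *)(dst + 16), mix3(vr, vg, vb, r1, g1, b1));
    _mm_storeu_si128((__m128i *)(dst + 32), mix3(vr, vg, vb, r2, g2, b2));
}

/* Hypothetical entry point: n pixels of planar R/G/B to packed RGB. */
static void planar_to_rgb24(const uint8_t *r, const uint8_t *g,
                            const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16)          /* 16 pixels = 48 output bytes */
        interleave16(r + i, g + i, b + i, dst + 3 * i);
    for (; i < n; ++i) {                  /* scalar tail for the remainder */
        dst[3 * i + 0] = r[i];
        dst[3 * i + 1] = g[i];
        dst[3 * i + 2] = b[i];
    }
}
```

Calling planar_to_rgb24(r, g, b, dst, width * height) on three equally sized planes would fill dst with 3 * width * height bytes. Unaligned loads and stores are used so that no alignment of the buffers has to be assumed. If the target image needs B, G, R order or row padding, the masks and the row loop would have to be adapted.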

Thank you for your reply. I am new to SSE; could you post more exact instructions? As far as I know, there is no pshufb in SSE. How can I do it using only SSE?

Is it right to use the following mask?








But how can I do it using only SSE instructions?

Thomas Willhalm (Intel):

pshufb is part of the Supplemental Streaming SIMD Extensions 3 (SSSE3), which are supported by all current Intel processors down to Core 2 Duo and Atom. If you want to support older hardware, the Intel C++ compiler provides the necessary framework for a dispatcher that executes either an SSSE3 or a generic version. I don't see how you could work around pshufb.
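To illustrate the dispatcher idea, here is a minimal hand-written sketch. It does not use the Intel compiler's dispatch framework mentioned above; instead it assumes GCC or Clang and their __builtin_cpu_supports built-in. The names interleave_generic, interleave_ssse3, and pick_interleave are hypothetical. The SSSE3 version builds its shuffle masks from the rule that output byte j holds channel j % 3 of pixel j / 3 (counting from 0), with 0x80 telling pshufb to write a zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 (pshufb) */

typedef void (*interleave_fn)(const uint8_t *, const uint8_t *,
                              const uint8_t *, uint8_t *, size_t);

/* Generic version: plain C, runs on any CPU. */
static void interleave_generic(const uint8_t *r, const uint8_t *g,
                               const uint8_t *b, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[3 * i + 0] = r[i];
        dst[3 * i + 1] = g[i];
        dst[3 * i + 2] = b[i];
    }
}

/* SSSE3 version; the target attribute is a GCC/Clang assumption. */
__attribute__((target("ssse3")))
static void interleave_ssse3(const uint8_t *r, const uint8_t *g,
                             const uint8_t *b, uint8_t *dst, size_t n)
{
    /* m[channel][output register][byte]: source index or 0x80 (zero). */
    uint8_t m[3][3][16];
    for (int c = 0; c < 3; ++c)
        for (int j = 0; j < 48; ++j)
            m[c][j / 16][j % 16] = (j % 3 == c) ? (uint8_t)(j / 3) : 0x80;

    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v[3];
        v[0] = _mm_loadu_si128((const __m128i *)(r + i));
        v[1] = _mm_loadu_si128((const __m128i *)(g + i));
        v[2] = _mm_loadu_si128((const __m128i *)(b + i));
        for (int k = 0; k < 3; ++k) {      /* three 16-byte outputs */
            __m128i out = _mm_setzero_si128();
            for (int c = 0; c < 3; ++c) {  /* OR the shuffled channels */
                __m128i mask = _mm_loadu_si128((const __m128i *)m[c][k]);
                out = _mm_or_si128(out, _mm_shuffle_epi8(v[c], mask));
            }
            _mm_storeu_si128((__m128i *)(dst + 3 * i + 16 * k), out);
        }
    }
    /* Fewer than 16 pixels left: finish with the generic loop. */
    interleave_generic(r + i, g + i, b + i, dst + 3 * i, n - i);
}

/* Hypothetical dispatcher: choose an implementation once at run time. */
static interleave_fn pick_interleave(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? interleave_ssse3
                                           : interleave_generic;
}
```

A caller would store the result of pick_interleave() in a function pointer once and then call it for every image, so the CPU check is not repeated per call.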


This has been done with SSE2 in the SSEPlus project, which is released under the Apache License V2.0, but it should be less efficient than SSSE3 on Intel processors. It has the advantage of running on AMD processors where SSSE3 is not present.

Another way would be to use IPP, but this might not be an option for you.


It is so difficult for me to learn these instructions that I do not know how to start writing the program. Could you write out the program in full?
Thank you very much!
