Hi there.

This is the place to ask for new instructions, right? Here is something I have wanted for years: bit interleaving.

The idea is to take the lower 16 bits of an integer and spread them over every other bit of a 32-bit result (the first, third, fifth, ... bit, i.e. bit positions 0, 2, 4, ...), setting all the bits in between to zero.
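For reference, here is a minimal sketch of what such an instruction would compute, written in plain C with the usual shift-and-mask steps. It assumes input bit i lands on output bit 2*i, which is what the index formula further down needs. The name bit_interleave is just the one used in this post, not an existing intrinsic.

```c
#include <stdint.h>

/* Software stand-in for the proposed instruction (sketch, not an intrinsic).
   Input bit i of the low 16 bits ends up at output bit 2*i;
   every bit in between is zero. Bits 16..31 of the input are ignored. */
static uint32_t bit_interleave(uint32_t v)
{
    v &= 0x0000FFFFu;                  /* keep only the low 16 bits   */
    v = (v | (v << 8)) & 0x00FF00FFu;  /* spread into 2 groups of 8   */
    v = (v | (v << 4)) & 0x0F0F0F0Fu;  /* spread into 4 groups of 4   */
    v = (v | (v << 2)) & 0x33333333u;  /* spread into 8 groups of 2   */
    v = (v | (v << 1)) & 0x55555555u;  /* one zero bit after each bit */
    return v;
}
```

That is five shifts, five ORs and five ANDs for something that is just wiring in hardware.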

Such an instruction could give huge performance improvements for all kinds of two-dimensional array access:

Assume we have a big matrix to work on, big enough that it no longer fits into the L1D cache. Also, for simplicity, assume width and height are powers of two.

The addressing of such a matrix looks like this:

index = x * elements_per_line + y;

element = matrix_base[index];

Accessing this matrix is very fast if we access the data sequentially, e.g. in rows. This is not always possible, and there are access patterns that give near worst-case performance (matrix transpose is one of them).

With bit interleaving we could simply store the matrix in Morton order. This gets around the worst-case cache behaviour for matrix transpose and gives good results for other access patterns as well, as long as x and y have some locality.

index = bit_interleave(x) + bit_interleave(y)*2;

element = matrix_base[index];
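To make the layout concrete, here is a deliberately naive bit-by-bit sketch that builds the same index as the formula above: x goes into bit positions 0, 2, 4, ... and y into positions 1, 3, 5, .... The function name morton_index is made up for this example.

```c
#include <stdint.h>

/* Reference version: builds the Morton index one bit pair at a time.
   Equivalent to bit_interleave(x) + bit_interleave(y)*2 from the post. */
static uint32_t morton_index(uint16_t x, uint16_t y)
{
    uint32_t index = 0;
    for (int bit = 0; bit < 16; ++bit) {
        index |= (uint32_t)((x >> bit) & 1u) << (2 * bit);      /* x -> even positions */
        index |= (uint32_t)((y >> bit) & 1u) << (2 * bit + 1);  /* y -> odd positions  */
    }
    return index;
}
```

With this layout (0,0), (1,0), (0,1), (1,1) map to indices 0, 1, 2, 3, so every aligned 2x2 block is contiguous in memory, as is every aligned 4x4 block, and so on. That is where the locality in both directions comes from.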

In some cases one can work around the problem and do the Morton ordering incrementally, but unfortunately that is not always possible. Interleaving bits with ordinary CPU instructions, OTOH, is very slow (still worth it if it saves a cache miss, though). I know that matrix transpose is a bad example because there are better ways to do it, but I think it's OK for showing the principle.
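For the incremental case, one well-known trick is to step x or y directly inside the interleaved index: fill the other coordinate's bit positions with ones so the carry ripples across them, add one, then mask. A sketch, assuming x sits in bit positions 0, 2, 4, ... and y in 1, 3, 5, ... as in the formula above (the names are made up):

```c
#include <stdint.h>

#define X_MASK 0x55555555u  /* bit positions 0, 2, 4, ... hold x */
#define Y_MASK 0xAAAAAAAAu  /* bit positions 1, 3, 5, ... hold y */

/* Step to (x+1, y) without de-interleaving. x wraps around at 2^16. */
static uint32_t morton_inc_x(uint32_t index)
{
    /* Setting all y positions to 1 lets the carry skip over them. */
    uint32_t x = ((index | Y_MASK) + 1u) & X_MASK;
    return x | (index & Y_MASK);
}

/* Step to (x, y+1). Bit 0 is set by the OR, so the +1 carries into bit 1. */
static uint32_t morton_inc_y(uint32_t index)
{
    uint32_t y = ((index | X_MASK) + 1u) & Y_MASK;
    return y | (index & X_MASK);
}
```

This only helps when you walk the matrix neighbour by neighbour. For a random (x, y) you are back to doing the full interleave.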

Btw, bit interleaving has other uses as well. I use it to do arithmetic on multiple bitfields in a single integer, just like MMX, with the difference that my bitfields are arbitrary and not always of identical size. The zero-bit gaps between the fields can be used to stop the carry moving from one overflowing bitfield into the next during additions. In general it has great bit-twiddling potential.
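A small sketch of the guard-bit idea, with a made-up layout of three fields (5, 10 and 12 bits wide) separated by a single zero bit each. The layout and all names are only for illustration.

```c
#include <stdint.h>

/* Hypothetical layout, low to high:
   bits  0..4   field a (5 bits)    bit  5  guard
   bits  6..15  field b (10 bits)   bit 16  guard
   bits 17..28  field c (12 bits)   bit 29  guard */
#define FIELD_MASK 0x1FFEFFDFu  /* all field bits, guard bits cleared */

static uint32_t pack_fields(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0x1Fu) | ((b & 0x3FFu) << 6) | ((c & 0xFFFu) << 17);
}

/* Adds all three fields with one integer addition. A carry out of a field
   lands in the zero guard bit above it instead of in the next field;
   masking afterwards clears the guard bits, so each field wraps on its own. */
static uint32_t packed_add(uint32_t p, uint32_t q)
{
    return (p + q) & FIELD_MASK;
}
```

For example, packed_add(pack_fields(3, 100, 2000), pack_fields(4, 200, 1000)) gives pack_fields(7, 300, 3000). Both inputs must have their guard bits clear, which packed_add guarantees for its own results.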

The TI C64x+ DSP has such instructions, called deal and shfl. At first I thought these were special-purpose instructions to speed up some common DSP algorithm, but after a year of DSP assembler coding I found out how useful they are in general.

I really miss them on the x86.

Nils