# Cross lane operations, how?

Question #1

I have:
xmm0/mem128 = A3 A2 A1 A0

And I want to have:
ymm0 = A3 A3 A2 A2 A1 A1 A0 A0

Question #2

I have:
ymm0 = B3 A3 B2 A2 B1 A1 B0 A0

And I want to have:
xmm1/mem128 = A3 A2 A1 A0
xmm2/mem128 = B3 B2 B1 B0

Question #3

I have:
xmm1/mem128 = A3 A2 A1 A0

xmm2/mem128 = B3 B2 B1 B0

And I want to have:
ymm0 = B3 A3 B2 A2 B1 A1 B0 A0

How can I accomplish these seemingly trivial transformations, bearing in mind the AVX cross-lane limitations?

Here is a solution using AVX2 (working on 32-bit integer elements):

#1:
org: ymm0 = x x x x a3 a2 a1 a0
vpermq ymm0,ymm0,0x10 => ymm0 = x x a3 a2 x x a1 a0 / select qwords x1x0
vpunpckldq ymm0,ymm0,ymm0 => ymm0 = a3 a3 a2 a2 a1 a1 a0 a0 / interlace low dwords

#2:
org: ymm0 = b3 a3 b2 a2 b1 a1 b0 a0
vpshufd ymm1,ymm0,0x08 => ymm1 = x x a3 a2 x x a1 a0 / select dwords xx20
vpshufd ymm2,ymm0,0x0d => ymm2 = x x b3 b2 x x b1 b0 / select dwords xx31
vpermq ymm1,ymm1,0x08 => ymm1 = x x x x a3 a2 a1 a0 / select qwords xx20
vpermq ymm2,ymm2,0x08 => ymm2 = x x x x b3 b2 b1 b0 / select qwords xx20

#3:
org: ymm1 = x x x x a3 a2 a1 a0; ymm2 = x x x x b3 b2 b1 b0
vpermq ymm1,ymm1,0x10 => ymm1 = x x a3 a2 x x a1 a0 / select qwords x1x0
vpermq ymm2,ymm2,0x10 => ymm2 = x x b3 b2 x x b1 b0 / select qwords x1x0
vpunpckldq ymm0,ymm1,ymm2 => ymm0 = b3 a3 b2 a2 b1 a1 b0 a0 / interlace low dwords
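
For anyone who wants to check the immediates above without AVX2 hardware, here is a minimal scalar sketch in C of #1 and #3. It only models how `vpermq` and `vpunpckldq` move elements; it is not intrinsics code and says nothing about performance. The type `ymm_dwords` and all the `model_*` / helper names are made up for this illustration, and element index 0 is the lowest dword (the rightmost one in the notation above).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 256-bit register as 8 dwords; d[0] is the lowest. */
typedef struct { uint32_t d[8]; } ymm_dwords;

/* Models vpermq dst,src,imm8: each 2-bit field of imm8 selects one source
   qword (a pair of dwords) and may cross the 128-bit lane boundary. */
static ymm_dwords model_vpermq(ymm_dwords src, unsigned imm8) {
    ymm_dwords dst;
    assert(imm8 <= 0xFFu);
    for (int q = 0; q < 4; ++q) {
        unsigned sel = (imm8 >> (2 * q)) & 3u;
        dst.d[2 * q]     = src.d[2 * sel];
        dst.d[2 * q + 1] = src.d[2 * sel + 1];
    }
    return dst;
}

/* Models vpunpckldq dst,a,b: within each 128-bit lane, interleave the two
   low dwords of a with the two low dwords of b. Never crosses lanes. */
static ymm_dwords model_vpunpckldq(ymm_dwords a, ymm_dwords b) {
    ymm_dwords dst;
    for (int lane = 0; lane < 2; ++lane) {
        int base = 4 * lane;
        dst.d[base + 0] = a.d[base + 0];
        dst.d[base + 1] = b.d[base + 0];
        dst.d[base + 2] = a.d[base + 1];
        dst.d[base + 3] = b.d[base + 1];
    }
    return dst;
}

/* #1: a3 a2 a1 a0 in the low half -> a3 a3 a2 a2 a1 a1 a0 a0 */
static ymm_dwords duplicate_dwords(ymm_dwords x) {
    ymm_dwords t = model_vpermq(x, 0x10);   /* qwords x 1 x 0 */
    return model_vpunpckldq(t, t);
}

/* #3: a3..a0 and b3..b0 in the low halves -> b3 a3 b2 a2 b1 a1 b0 a0 */
static ymm_dwords interleave_dwords(ymm_dwords a, ymm_dwords b) {
    return model_vpunpckldq(model_vpermq(a, 0x10), model_vpermq(b, 0x10));
}
```

The point of the model is that `vpermq` is the only step that moves data across the lane boundary; the unpack afterwards works inside each lane, which is why the qwords have to be spread out first.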

Thanks, but:

1. That doesn't help a bit with AVX, where it is not possible to cross lanes.
2. It takes too many instructions even with AVX2.

I really don't know what Intel CPU engineers were thinking when they designed AVX.

Hi Igor, the best I can come up with using AVX is:

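```
#1:
vunpcklps xmm1,xmm0,xmm0        ; xmm1 = A1 A1 A0 A0
vunpckhps xmm2,xmm0,xmm0        ; xmm2 = A3 A3 A2 A2
vinsertf128 ymm0,ymm1,xmm2,1    ; ymm0 = A3 A3 A2 A2 A1 A1 A0 A0

#2:
vpermilps ymm0,ymm0,216         ; %11011000 => ymm0 = B3 B2 A3 A2 B1 B0 A1 A0
vextractf128 xmm3,ymm0,1        ; xmm3 = B3 B2 A3 A2
vunpcklpd xmm1,xmm0,xmm3        ; xmm1 = A3 A2 A1 A0
vunpckhpd xmm2,xmm0,xmm3        ; xmm2 = B3 B2 B1 B0

#3:
vunpcklps xmm3,xmm1,xmm2        ; xmm3 = B1 A1 B0 A0
vunpckhps xmm4,xmm1,xmm2        ; xmm4 = B3 A3 B2 A2
vinsertf128 ymm0,ymm3,xmm4,1    ; ymm0 = B3 A3 B2 A2 B1 A1 B0 A0
```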
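
To see why the immediate 216 works in #2, here is a small scalar sketch in C that models the element movement of that sequence. It is an illustration under assumptions, not intrinsics code: the types `xmm_model` and `ymm_lanes` and every function name are hypothetical, elements are treated as raw 32-bit values, and `d[0]` is the lowest element.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models: one 128-bit lane of 4 dwords, and a ymm as two lanes. */
typedef struct { uint32_t d[4]; } xmm_model;
typedef struct { xmm_model lane[2]; } ymm_lanes;

/* Models what vpermilps with an imm8 does to ONE lane: each 2-bit field
   selects a dword from the same lane. The real instruction applies the
   same immediate to both lanes independently. */
static xmm_model model_permil_lane(xmm_model s, unsigned imm8) {
    xmm_model r;
    assert(imm8 <= 0xFFu);
    for (int i = 0; i < 4; ++i)
        r.d[i] = s.d[(imm8 >> (2 * i)) & 3u];
    return r;
}

/* Models vunpcklpd: low qword of a, then low qword of b. */
static xmm_model model_vunpcklpd(xmm_model a, xmm_model b) {
    xmm_model r = {{a.d[0], a.d[1], b.d[0], b.d[1]}};
    return r;
}

/* Models vunpckhpd: high qword of a, then high qword of b. */
static xmm_model model_vunpckhpd(xmm_model a, xmm_model b) {
    xmm_model r = {{a.d[2], a.d[3], b.d[2], b.d[3]}};
    return r;
}

/* #2: B3 A3 B2 A2 B1 A1 B0 A0 -> A3 A2 A1 A0 and B3 B2 B1 B0 */
static void deinterleave(ymm_lanes in, xmm_model *a_out, xmm_model *b_out) {
    /* vpermilps ymm0,ymm0,216: each lane becomes Bhi Blo Ahi Alo */
    xmm_model lo = model_permil_lane(in.lane[0], 216);
    xmm_model hi = model_permil_lane(in.lane[1], 216);
    /* vextractf128 makes the high lane addressable as an xmm register;
       in this model that is simply reading `hi`. */
    *a_out = model_vunpcklpd(lo, hi);
    *b_out = model_vunpckhpd(lo, hi);
}
```

The in-lane permute groups the A pair and the B pair into separate qwords, so the two 64-bit unpacks can then collect them across what used to be the lane boundary.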

Hi sirrida,

with AVX2 you can enjoy tight code with the generic 8x32 permute; for example, for #1 simply writing:

vpermps ymm0,ymm1,ymm0

will do the trick. ymm1 should be initialized (typically as a loop invariant, set up once) with the proper indices, i.e. 3 3 2 2 1 1 0 0 in this case.
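
As a rough illustration of the index vector, here is a scalar C sketch of what `vpermps` does with it. The function `model_vpermps` and the table names are hypothetical, arrays list the lowest element first (so 3 3 2 2 1 1 0 0 above is written 0 0 1 1 2 2 3 3), and the second table, for #2, is my own assumption about how one might extend the idea rather than something taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Models vpermps dst,idx,src: every destination element picks any of the
   8 source elements, using the low 3 bits of its index. Unlike the real
   instruction, this model must not be called with dst aliasing src. */
static void model_vpermps(const uint32_t idx[8], const float src[8], float dst[8]) {
    assert(dst != src);
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i] & 7u];
}

/* #1: duplicate each of the four low elements. */
static const uint32_t DUP_IDX[8] = {0, 0, 1, 1, 2, 2, 3, 3};

/* Assumed variant for #2: A elements end up in the low 128 bits and
   B elements in the high 128 bits, which would still need an extract
   to get them into two xmm registers. */
static const uint32_t DEINTERLEAVE_IDX[8] = {0, 2, 4, 6, 1, 3, 5, 7};
```

Since the index vector lives in a register, the same single instruction covers any reordering of one source register; only the table changes.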

Quoting Igor Levicki
I really don't know what Intel CPU engineers were thinking when they designed AVX.

AVX is just a step toward AVX2. A lot of developers are skipping AVX because AVX2 is clearly a much more complete instruction set.

I think the Intel engineers had envisioned AVX2 from the start, but it wasn't feasible to implement it all in one go, so they had to choose which parts to implement first. I think extending the registers to 256 bits and implementing the floating-point instructions first (by making the integer SIMD stack capable of floating-point operations) was the best compromise they could have made. But even so, AVX is unfortunately only useful for a relatively small range of applications.

That said, AVX2 is intended to be a 'vertical' SIMD instruction set, to enable efficient SPMD programming. Think of OpenCL. Each lane executes the same operation on different data elements (i.e. different iterations of a loop). So you're not really supposed to do much, if any, cross-lane work.

It's pretty brilliant to bring such GPU technology within the CPU, but you have to let go of old 'horizontal' SIMD programming models to get the most out of it.

>>So you're not really supposed to do much if any cross lane work.

I have to disagree with this -- you need cross lane to get data in proper position for processing, especially if you are not in control of data layout in memory.

@bronxzy:

Thanks, I will take a look and try your suggestions in some code to see how it performs.

Quoting Igor Levicki
I have to disagree with this -- you need cross lane to get data in proper position for processing, especially if you are not in control of data layout in memory.

That's what gather is for.

And yes, I know it's not part of AVX. But that brings us back to AVX being an intermediate step toward AVX2. It's just not suited for all cases of SPMD programming. Having wide floating-point vectors but no gather limits its usability. You'll have to accept sticking with SSE (or AVX-128) in some situations. Besides, Sandy/Ivy Bridge don't have sufficient cache bandwidth for a large speedup anyway. Haswell is expected to double it.

If AVX naturally fits your use case, great, but otherwise just wait for AVX2 instead of messing around with cross lane operations.

You know, I am kind of sick of those "close but no cigar" products we are being increasingly flooded with these days.

Don't get me wrong, Sandy Bridge is an awesome CPU.

But shipping 256-bit vectors where the majority of opcodes operate only on the 128-bit halves, where there are no cross-lane operations, and where there isn't enough bandwidth for a 2x speedup over SSE except in synthetic benchmarks under the most contrived conditions is something I would personally call "beta", and I wouldn't even bother to release and sell it.

Can't wait for Haswell.

With that out of the way, I really don't understand why they didn't make the shuffle instructions take a GPR instead of an immediate to begin with. They would now have up to 64 bits (in x64 mode) for element reordering indices.

I'm not too happy with AVX either; at least you can use the VEX encoding and process twice as many floats per instruction compared to SSE 4.1.
For me AVX seems to be a bit like SSE (and partially AMD's 3DNow!): an appetizer and trial balloon allowing for some impressive benchmarks.
I simply skipped SSE because I almost exclusively work with integers; with SSE2 and especially SSSE3 (pshufb) things got much better.
Also, I don't like being forced to do almost all work in-lane, and I had already complained about that.

On the other hand, the decidedly non-orthogonal and mostly lane-wise MMX/SSE/AVX instruction sets almost always simply get the job done with reasonable effort.
Restricting most instructions to in-lane operation makes the CPUs much simpler, allowing e.g. future Atoms to act on YMM or even larger registers (e.g. Larrabee / Knights family) as well without making their dies much larger.
As you have probably noticed, assuming unlimited parallel processing and 1 cycle per instruction, the AVX2 solution costs 2/2/2 cycles for #1/#2/#3 and the AVX solution (bronxzy) 2/3/2 cycles. Using vpermps with preinitialized shuffle masks even gets the cycle count down to 1/1/1; however, the performance of vpermps on simple CPUs will probably be low.
BTW: My AVX2 integer solution is easily transformed into a float solution by replacing vpermq=>vpermpd, vpunpckldq=>vunpcklps, and vpshufd=>vpermilps (or vshufps with the same register as both sources). Unfortunately vpermpd shuffles every two singles as one double (type mismatch); I'm not sure whether this costs cycles.