Welcome to the Intel(R) AVX Forum!


Posted by aaron-tersteeg (Intel)

Please take a moment to read the papers and download the guide from the Intel AVX web site. If you have any questions about Intel AVX, AES, or SSE4.2, please ask your questions here and we will do our best to get you the information.

Posted by urvabara

Hi!

http://www.anandtech.com/showdoc.aspx?i=3073&p=3

How about the matrix multiplication with Sandy Bridge? How many instructions does it need to do that?

Henri.

Posted by knujohn4

Hi. I've quickly browsed the programming reference, but I'm a little confused about the VEX prefix and how it is encoded compared to the "normal" SSEx instructions. From chapter 4: "Elimination of escape opcode byte (0FH), SIMD Prefix byte (66H, F2H, F3H) ..." What I'm trying to figure out is how the instruction bytes will look if I'm looking at disassembly or debugging information of the AVX instructions. An example with one or two of the new instructions would be appreciated.


Knut Johnsen, Norway.

Posted by Mark Buxton (Intel)

Hi Henri,


I couldn't quite figure out what that other site was trying to do - it looks like just a fragment of the whole thing, and the baseline has unnecessary copies. Here's my first attempt at an AVX version. I coded up C = A*B and looped over C and B so we can look at the throughput.


The number of instructions doesn't really matter here (at least on Sandy Bridge) - it appears to be limited by the number of data rearrangements, or multiplies. There are half as many multiplies in the AVX version, but the extra broadcasts should reduce our performance scaling somewhat. (I'll be back in the States soon, so I can run it through the performance simulator and post the results here.)


Another neat thing about the AVX version is the extra state - you can imagine wanting to reuse the column broadcast on A and save all those computations - like if one had to compute C = A*B and D = A*E.


// AVX throughput test of 4x4 MMULs


void MMUL4x4_AVX()
{
__asm {

mov ecx, 1024
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
vmovaps ymm0, [rax]            // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00     // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55     // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA     // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF     // a13 a13 a13 a13 | a03 a03 a03 a03

vmovaps ymm0, [rax+32]         // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00     // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55     // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA     // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF     // a33 a33 a33 a33 | a23 a23 a23 a23

vbroadcastf128 ymm9, [rbx]     // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16] // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32] // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48] // b33 b32 b31 b30 | b33 b32 b31 b30

vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3

vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7

vmovaps [rdx], ymm1
vmovaps [rdx+32], ymm5

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}



// Baseline for comparison (can you beat this on SNB?)


void MMUL4x4_SSE()
{
__asm {

mov ecx, 1024
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rbx]    // b03 b02 b01 b00
movaps xmm6, [rbx+16] // b13 b12 b11 b10
movaps xmm7, [rbx+32] // b23 b22 b21 b20
movaps xmm8, [rbx+48] // b33 b32 b31 b30

mulps xmm1, xmm5 // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6 // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7 // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8 // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}

Posted by knujohn4

Hello again! Some more info about this was posted at the IDF website some time after my last entry! You can see the PDF here: https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf


But the follow-up question will be: how can you know (looking at the bytes in the code segment) whether an instruction starting with C4H / C5H is VEX-prefixed or is the LES / LDS instruction?


Knut

Posted by HONGJIU L. (Intel)

The disassembler in the Linux binutils 2.18.50.0.6 or above:


http://www.kernel.org/pub/linux/devel/binutils/


supports AVX:


[hjl@gnu-6 avx-2]$ objdump -Mintel -dw x.o


x.o: file format elf64-x86-64



Disassembly of section .text:


0000000000000000 :
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 ec 28 sub rsp,0x28
8: c5 fc 29 45 a0 vmovaps YMMWORD PTR [rbp-0x60],ymm0
d: c5 fc 29 4d 80 vmovaps YMMWORD PTR [rbp-0x80],ymm1
12: c5 fc 29 95 60 ff ff ff vmovaps YMMWORD PTR [rbp-0xa0],ymm2
1a: c5 fc 28 45 80 vmovaps ymm0,YMMWORD PTR [rbp-0x80]
1f: c5 fc 29 45 e0 vmovaps YMMWORD PTR [rbp-0x20],ymm0
24: c5 fc 28 85 60 ff ff ff vmovaps ymm0,YMMWORD PTR [rbp-0xa0]
2c: c5 fc 29 45 c0 vmovaps YMMWORD PTR [rbp-0x40],ymm0
31: c5 fc 28 4d c0 vmovaps ymm1,YMMWORD PTR [rbp-0x40]
36: c5 fc 28 45 e0 vmovaps ymm0,YMMWORD PTR [rbp-0x20]
3b: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
3f: c9 leave
40: c3 ret

Posted by Shih Kuo (Intel)

LES/LDS cannot be encoded in 64-bit mode, so that should make it easy to tell.


In 32-bit modes, both LDS and LES require a modR/M byte. A VEX-encoded instruction in 32-bit mode would have bits 7 and 6 of the equivalent modR/M byte equal to 11B (corresponding to a reserved form of modR/M encoding for LDS/LES, i.e. an illegal form of LDS/LES). You can infer this from the definition of VEX.R and VEX.vvvv in Figure 4-2 of the spec.

Posted by Mark Buxton (Intel)

Here's the performance data I promised. For the two versions below (small bug fix from the snippet above), looking at throughput for something like an inlined C=A*B, I get 19.3 cycles per 4x4 matrix multiply for the SSE2 version and 13.8 cycles per matrix multiply for the Intel AVX version, or 1.4X. That's with everything hitting in the first-level cache. (Disclaimers apply: it's a pre-silicon simulator and the product isn't out yet, so treat this with some skepticism.)


In this case the performance of both the AVX and SSE2 versions is limited by the shuffles (the broadcasts and perms below are all shuffle operations) - they all execute on the same port (along with the branch at the end and some fraction of the loop counter updates). And in this code I only do about 64 iterations of the loop, so there is some small overhead in the benchmark. So if you unroll, the performance of both versions increases slightly. Maybe more importantly, if you can reuse any of those shuffles, for example if you had to code up


C= A*B


F= A*E


You would get larger gains. In this case, our simulator shows the AVX version at 23.4 cycles (per two 4x4 matrix multiplies) while the SSE2 baseline is at 36.9, so 1.6X.


-----


This is the Intel AVX version of a simple inlined 4x4 matrix multiply. Per call, it does 64 iterations of


C= A*B


void MMUL4x4_AVX()
{
__asm {

mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
vbroadcastf128 ymm9, [rbx]     // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16] // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32] // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48] // b33 b32 b31 b30 | b33 b32 b31 b30

vmovaps ymm0, [rax]            // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00     // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55     // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA     // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF     // a13 a13 a13 a13 | a03 a03 a03 a03

vmovaps ymm0, [rax+32]         // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00     // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55     // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA     // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF     // a33 a33 a33 a33 | a23 a23 a23 a23

vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3

vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7

vmovaps [rdx], ymm1
vmovaps [rdx+32], ymm5

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}


This is the Intel SSE2 version of a simple inlined 4x4 matrix multiply. Per call, it does 64 iterations of


C= A*B


void MMUL4x4_SSE()
{
__asm {
; each iteration does one matrix mul (16 elements)
mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c

loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rbx]    // b03 b02 b01 b00
movaps xmm6, [rbx+16] // b13 b12 b11 b10
movaps xmm7, [rbx+32] // b23 b22 b21 b20
movaps xmm8, [rbx+48] // b33 b32 b31 b30

mulps xmm1, xmm5 // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6 // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7 // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8 // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1

add rbx, 64
add rdx, 64

sub ecx, 1
jg loop_a
}
}


This is the Intel AVX version assuming you want to reuse some of the reformatting associated with the left-hand-side matrix. Per call, it does 64 iterations of


C= A*B


F= A*E


void MMUL4x4_AVX_2()
{
__asm {

mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
lea rsi, e
lea rdi, f

loop_a:
vmovaps ymm0, [rax]            // a13 a12 a11 a10 | a03 a02 a01 a00
vpermilps ymm1, ymm0, 0x00     // a10 a10 a10 a10 | a00 a00 a00 a00
vpermilps ymm2, ymm0, 0x55     // a11 a11 a11 a11 | a01 a01 a01 a01
vpermilps ymm3, ymm0, 0xAA     // a12 a12 a12 a12 | a02 a02 a02 a02
vpermilps ymm4, ymm0, 0xFF     // a13 a13 a13 a13 | a03 a03 a03 a03

vmovaps ymm0, [rax+32]         // a33 a32 a31 a30 | a23 a22 a21 a20
vpermilps ymm5, ymm0, 0x00     // a30 a30 a30 a30 | a20 a20 a20 a20
vpermilps ymm6, ymm0, 0x55     // a31 a31 a31 a31 | a21 a21 a21 a21
vpermilps ymm7, ymm0, 0xAA     // a32 a32 a32 a32 | a22 a22 a22 a22
vpermilps ymm8, ymm0, 0xFF     // a33 a33 a33 a33 | a23 a23 a23 a23

vbroadcastf128 ymm9, [rbx]     // b03 b02 b01 b00 | b03 b02 b01 b00
vbroadcastf128 ymm10, [rbx+16] // b13 b12 b11 b10 | b13 b12 b11 b10
vbroadcastf128 ymm11, [rbx+32] // b23 b22 b21 b20 | b23 b22 b21 b20
vbroadcastf128 ymm12, [rbx+48] // b33 b32 b31 b30 | b33 b32 b31 b30

vmulps ymm0, ymm1, ymm9
vmulps ymm13, ymm2, ymm10
vaddps ymm0, ymm0, ymm13
vmulps ymm13, ymm3, ymm11
vmulps ymm14, ymm4, ymm12
vaddps ymm13, ymm13, ymm14
vaddps ymm0, ymm0, ymm13
vmovaps [rdx], ymm0

vmulps ymm0, ymm5, ymm9
vmulps ymm13, ymm6, ymm10
vaddps ymm0, ymm0, ymm13
vmulps ymm13, ymm7, ymm11
vmulps ymm14, ymm8, ymm12
vaddps ymm13, ymm13, ymm14
vaddps ymm0, ymm0, ymm13
vmovaps [rdx+32], ymm0

vbroadcastf128 ymm9, [rsi]     // e03 e02 e01 e00 | e03 e02 e01 e00
vbroadcastf128 ymm10, [rsi+16] // e13 e12 e11 e10 | e13 e12 e11 e10
vbroadcastf128 ymm11, [rsi+32] // e23 e22 e21 e20 | e23 e22 e21 e20
vbroadcastf128 ymm12, [rsi+48] // e33 e32 e31 e30 | e33 e32 e31 e30

vmulps ymm1, ymm1, ymm9
vmulps ymm2, ymm2, ymm10
vmulps ymm3, ymm3, ymm11
vmulps ymm4, ymm4, ymm12
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm4
vaddps ymm1, ymm1, ymm3

vmulps ymm5, ymm5, ymm9
vmulps ymm6, ymm6, ymm10
vmulps ymm7, ymm7, ymm11
vmulps ymm8, ymm8, ymm12
vaddps ymm5, ymm5, ymm6
vaddps ymm7, ymm7, ymm8
vaddps ymm5, ymm5, ymm7

vmovaps [rdi], ymm1
vmovaps [rdi+32], ymm5

add rbx, 64
add rdx, 64
add rsi, 64
add rdi, 64

sub ecx, 1
jg loop_a
}
}


This is the Intel SSE2 baseline version assuming you want to reuse some of the reformatting associated with the left-hand-side matrix. Per call, it does 64 iterations of


C= A*B


F= A*E


void MMUL4x4_SSE_2()
{
__asm {

mov ecx, 1024/16
lea rax, a
lea rbx, b
lea rdx, c
lea rsi, e
lea rdi, f

loop_a:
movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rbx]    // b03 b02 b01 b00
movaps xmm6, [rbx+16] // b13 b12 b11 b10
movaps xmm7, [rbx+32] // b23 b22 b21 b20
movaps xmm8, [rbx+48] // b33 b32 b31 b30

mulps xmm1, xmm5 // a00b03 a00b02 a00b01 a00b00
mulps xmm2, xmm6 // a01b13 a01b12 a01b11 a01b10
mulps xmm3, xmm7 // a02b23 a02b22 a02b21 a02b20
mulps xmm4, xmm8 // a03b33 a03b32 a03b31 a03b30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdx+48], xmm1

movaps xmm0, [rax]
pshufd xmm1, xmm0, 0x00 // a00 a00 a00 a00
pshufd xmm2, xmm0, 0x55 // a01 a01 a01 a01
pshufd xmm3, xmm0, 0xAA // a02 a02 a02 a02
pshufd xmm4, xmm0, 0xFF // a03 a03 a03 a03

movaps xmm5, [rsi]    // e03 e02 e01 e00
movaps xmm6, [rsi+16] // e13 e12 e11 e10
movaps xmm7, [rsi+32] // e23 e22 e21 e20
movaps xmm8, [rsi+48] // e33 e32 e31 e30

mulps xmm1, xmm5 // a00e03 a00e02 a00e01 a00e00
mulps xmm2, xmm6 // a01e13 a01e12 a01e11 a01e10
mulps xmm3, xmm7 // a02e23 a02e22 a02e21 a02e20
mulps xmm4, xmm8 // a03e33 a03e32 a03e31 a03e30
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi], xmm1

movaps xmm0, [rax+16]
pshufd xmm1, xmm0, 0x00 // a10 a10 a10 a10
pshufd xmm2, xmm0, 0x55 // a11 a11 a11 a11
pshufd xmm3, xmm0, 0xAA // a12 a12 a12 a12
pshufd xmm4, xmm0, 0xFF // a13 a13 a13 a13
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+16], xmm1

movaps xmm0, [rax+32]
pshufd xmm1, xmm0, 0x00 // a20 a20 a20 a20
pshufd xmm2, xmm0, 0x55 // a21 a21 a21 a21
pshufd xmm3, xmm0, 0xAA // a22 a22 a22 a22
pshufd xmm4, xmm0, 0xFF // a23 a23 a23 a23
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+32], xmm1

movaps xmm0, [rax+48]
pshufd xmm1, xmm0, 0x00 // a30 a30 a30 a30
pshufd xmm2, xmm0, 0x55 // a31 a31 a31 a31
pshufd xmm3, xmm0, 0xAA // a32 a32 a32 a32
pshufd xmm4, xmm0, 0xFF // a33 a33 a33 a33
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
mulps xmm4, xmm8
addps xmm1, xmm2
addps xmm3, xmm4
addps xmm1, xmm3
movaps [rdi+48], xmm1

add rbx, 64
add rdx, 64
add rsi, 64
add rdi, 64

sub ecx, 1
jg loop_a
}
}

Posted by Thai Le (Intel)

Hi Knut,



Please see the response from our engineering team below:



An example from the XED tool discussed at IDF (https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf) is included below. For example, take a look at the first few bytes of VCMPPS: C5 FC C2. The first byte is C4 or C5 (in 64-bit mode this is all that's required to know you're dealing with an AVX prefix); the second byte, FC, is the payload; and C2 is the CMPPS opcode (same as before; subsequent bytes are also unchanged).



xed-i _mm256_cmpunord_ps.opt.vec.exe > dis



SYM subb:
XDIS 400a86: PUSH BASE 55 push rbp
XDIS 400a87: DATAXFER BASE 4889E5 mov rbp, rsp
XDIS 400a8a: LOGICAL BASE 4883E4E0 and rsp, 0xe0
XDIS 400a8e: DATAXFER BASE B8FFFFFFFF mov eax, 0xffffffff
XDIS 400a93: DATAXFER BASE 89051F381000 mov dword ptr[rip+0x10381f], eax
XDIS 400a99: DATAXFER BASE 890525381000 mov dword ptr[rip+0x103825], eax
XDIS 400a9f: AVX AVX C5FC100511381000 vmovups ymm0, ymmword ptr[rip+0x103811]
XDIS 400aa7: DATAXFER BASE 89053F381000 mov dword ptr[rip+0x10383f], eax
XDIS 400aad: DATAXFER BASE 890541381000 mov dword ptr[rip+0x103841], eax
XDIS 400ab3: AVX AVX C5FCC20D1C38100003 vcmpps ymm1, ymm0, ymmword ptr[rip+0x10381c], 0x3
XDIS 400abc: AVX AVX C5FC110D34381000 vmovups ymmword ptr[rip+0x103834], ymm1
XDIS 400ac4: LOGICAL BASE 33C0 xor eax, eax
XDIS 400ac6: AVX AVX C5FA1080B8425000 vmovss xmm0, dword ptr[rax+0x5042b8]
XDIS 400ace: LOGICAL BASE 33D2 xor edx, edx


