How to assign a constant?


I do not think there are constant registers in x86. When I define a const array, x86 accesses these constants from memory rather than as a direct constant encoded in the instruction. Is there any instruction that can assign a 128-bit/256-bit constant to an SSE/AVX register?


Are you talking about C programming? About the facilities of some specific compiler?

How about a short example to make this specific?

Quote:

chang-li wrote:

I do not think there are constant registers in x86. When I define a const array, x86 accesses these constants from memory rather than as a direct constant encoded in the instruction. Is there any instruction that can assign a 128-bit/256-bit constant to an SSE/AVX register?

You can access the XMMn registers with the help of inline assembly. This is my preferred method of SSE-aware programming. In order to load an XMM register I use the align 16 directive on my typedef structure, which holds single-precision and double-precision floating-point scalar values arranged in a 1D array, and I use the movaps instruction to load the XMMn registers directly.

In C language

const unsigned char data[16] = {0xFF, 0x00, 0x01, 0xFF,0xFF, 0x00, 0x01, 0xFF,0xFF, 0x00, 0x01, 0xFF,0xFF, 0x00, 0x01, 0xFF};
__m128i xmm0;

xmm0 = _mm_loadu_si128((__m128i *)data); 

In ASM it becomes

movdqa xmm0, XMMWORD PTR [esi+4]

What I expected is 

movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF

I could not find this form in assembly. 
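For what it is worth, here is a hedged sketch of the usual intrinsic-based workaround (the byte values are just the ones from the example above): _mm_set_epi8 lets you write the 128-bit pattern in source, but most compilers still place it in a read-only constant pool and emit a movdqa/movaps load from memory, not an immediate operand.

#include <emmintrin.h>

/* Sketch: the same 16-byte pattern expressed with _mm_set_epi8.
   Note the arguments run from the most significant byte down to the
   least significant one, i.e. the reverse of the array's memory order. */
static inline __m128i make_mask(void)
{
    return _mm_set_epi8((char)0xFF, 0x01, 0x00, (char)0xFF,
                        (char)0xFF, 0x01, 0x00, (char)0xFF,
                        (char)0xFF, 0x01, 0x00, (char)0xFF,
                        (char)0xFF, 0x01, 0x00, (char)0xFF);
}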

>>>What I expected is

movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF

I could not find this form in assembly>>>

In MASM I can load an XMM register directly by using a declared primitive type with the DUP directive. When using inline assembly you can load an xmm register directly by using the array name only, without a pointer dereference operator.
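A minimal sketch of the inline-assembly case (MSVC-style _asm, 32-bit build assumed; the structure name and values are made up, mirroring the SinVector usage shown later in this thread):

struct __declspec(align(16)) Vec4 { float v[4]; };   // 16-byte aligned holder

static Vec4 coef = { { 2.0f, 2.0f, 2.0f, 2.0f } };

void load_by_name()
{
    // The structure name itself is the memory operand; no pointer dereference is needed.
    _asm movaps xmm0, coef
}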

Do you have the right inline-assembly expression for the form below?

movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF

>>Do you have the right inline-assembly expression for the form below?
>>
>>movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF

I checked the Instruction Set Reference ( Order Number: 325383-044US / August 2012 ) and I see that movdqa cannot be used with constants.

Please take a look at a page 572 of the manual. Here is a quote:
...
This instruction can be used to load an XMM register from a 128-bit memory location, to store the
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
...

Iliya,

>>...In MASM I can load XMM register directly by using declared primitive type with DUP...

We really would like to see how you do it, please. Thanks in advance.

Quote:

Sergey Kostrov wrote:

Iliya,

>>...In MASM I can load XMM register directly by using declared primitive type with DUP...

We really would like to see how you do it, please. Thanks in advance.

Here is code which calculates the cosine function by Taylor series expansion. Please bear in mind that this code is not optimized and runs slowly because of stack accesses which are not needed. I was able to load an XMM register directly without dereferencing a pointer. While reading the code please look at the "coef" variables, which are initialized to the cosine series factorial denominators.

.XMM
 .STACK 4096

 .DATA

  argument REAL4 0.0,0.0,0.0,0.0
 step REAL4 0.01,0.01,0.01,0.01
 hi_bound REAL4 1.0,1.0,1.0,1.0
 lo_bound REAL4 0.0,0.0,0.0,0.0
 up_range REAL4 1.0
 lo_range REAL4 0.0
 one REAL4 1.0,1.0,1.0,1.0
 counter BYTE 147
 coef1 REAL4 2.0,2.0,2.0,2.0 ;2!
 coef2 REAL4 24.0,24.0,24.0,24.0 ;4!
 coef3 REAL4 720.0,720.0,720.0,720.0 ;6!
 coef4 REAL4 40320.0,40320.0,40320.0,40320.0 ;8!
 coef5 REAL4 3628800.0,3628800.0,3628800.0,3628800.0 ;10!
 coef6 REAL4 479001600.0,479001600.0,479001600.0,479001600.0 ;12!
 coef7 REAL4 87178291200.0,87178291200.0,87178291200.0,87178291200.0 ;14!
 coef8 REAL4 20922789888000.0,20922789888000.0,20922789888000.0,20922789888000.0 ;16!
 coef9 REAL4 6402373705728000.0,6402373705728000.0,6402373705728000.0,6402373705728000.0 ;18!
 coef10 REAL4 2432902008176640000.0,2432902008176640000.0,2432902008176640000.0,2432902008176640000.0 ;20!
 coef11 REAL4 1124000727777607680000.0,1124000727777607680000.0,1124000727777607680000.0,1124000727777607680000.0 ;22!
 coef12 REAL4 620448401733239439360000.0,620448401733239439360000.0,620448401733239439360000.0,620448401733239439360000.0;24!
 coef13 REAL4 403291461126605635584000000.0,403291461126605635584000000.0,403291461126605635584000000.0,403291461126605635584000000.0 ;26!
 loop_counter BYTE 50
 loop_counter2 BYTE 25
 loop_compare REAL4 0.5,0.5,0.5,0.5

 .DATA?
 result REAL4 147 DUP(?)
 com_lo REAL4 4 DUP(?)
 com_hi REAL4 4 DUP(?)
 start_time DWORD ?
 end_time DWORD ?
 value REAL4 ?
 upper REAL4 1.0
 lower  REAL4 0.0
 counter_compare REAL4 4 DUP(?)

 .CODE
 main PROC

 push ebp
 mov ebp,esp
 sub esp,224
 mov cl,counter
 xor eax,eax
 xor ebx,ebx
 xorps xmm2,xmm2
 xorps xmm0,xmm0
 xorps xmm1,xmm1
 movups xmm5,argument
 movups xmm0,argument
 
 

 finit
 mWrite "Please enter a starting value for cosine calculation"
 call ReadFloat
 fst value
 call Crlf
 
 fld upper
 fcom value
 fnstsw ax
 sahf
 jnb error
 
 
 
 movss xmm5,value
 movss xmm5,value
 movss xmm5,value
 movss xmm5,value
L1:
   movups xmm4,loop_compare
   movups xmm3,xmm5
   cmpps xmm3,xmm4,6
   mov eax,OFFSET counter_compare
   movups [eax],xmm3
   mov ebx,[eax]
   cmp ebx,0
   jne L2
   mov ebx,[eax+4]
   cmp ebx,0
   jne L2
   mov ebx,[eax+8]
   cmp ebx,0
   jne L2
   mov ebx,[eax+12]
   cmp ebx,0
   jne L2
   movups xmm4,loop_compare
   movups xmm3,xmm5
   cmpps xmm3,xmm4,1
   mov eax,OFFSET counter_compare
   movups [eax],xmm3
   mov ebx,[eax]
   cmp ebx,11111111111111111111111111111111b
   je L4
   mov ebx,[eax+4]
   cmp ebx,11111111111111111111111111111111b
   je L4
   mov ebx,[eax+8]
   cmp ebx,11111111111111111111111111111111b
   je L4
   mov ebx,[eax+12]
   cmp ebx,11111111111111111111111111111111b
   je L4
   xor eax,eax
   jz L3
 L2:
 mov cl,loop_counter
 xor eax,eax
 jz L3
 L4:
 mov cl,loop_counter2
 xor eax,eax
 jz L3

 L3:
 mov edx,OFFSET step
 movups xmm4,[edx]
 addps xmm5,xmm4
 
 movups xmm7,xmm5
 movups xmm0,one
 mulps xmm7,xmm7 ;x^2
 movups xmm6,xmm7
 movups [ebp-16],xmm7 ;store x^2
 movups xmm2,coef1
 rcpps xmm1,xmm2 ;1/coef1
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2! xmm0 accumulator
 movups xmm7,[ebp-16]
 mulps xmm7,xmm6 ;x^4
 movups [ebp-32],xmm7 ;store x^4
 movups xmm2,coef2
 rcpps xmm1,xmm2 ;1/coef2
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-1x^2/2!+x^4/4!
 movups xmm7,[ebp-32]
 mulps xmm7,xmm6 ;x^6
 movups [ebp-48],xmm7 ;store x^6
 movups xmm2,coef3
 rcpps xmm1,xmm2 ;1/coef3
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!
 movups xmm7,[ebp-48]
 mulps xmm7,xmm6 ;x^8
 movups [ebp-64],xmm7 ;store x^8
 movups xmm2,coef4
 rcpps xmm1,xmm2 ;1/coef4
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!
 movups xmm7,[ebp-64]
 mulps xmm7,xmm6 ;x^10
 movups [ebp-80],xmm7 ;store x^10
 movups xmm2,coef5
 rcpps xmm1,xmm2 ;1/coef5
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!
 movups xmm7,[ebp-80]
 mulps xmm7,xmm6 ;x^12
 movups [ebp-96],xmm7 ;store x^12
 movups xmm2,coef6 ; <-- XMM register is loaded directly from the initialized coef array
 rcpps xmm1,xmm2 ;1/coef6
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!
 movups xmm7,[ebp-96]
 mulps xmm7,xmm6;x^14
 movups [ebp-112],xmm7 ;store x^14
 movups xmm2,coef7 ; <-- XMM register is loaded directly from the initialized coef array
 rcpps xmm1,xmm2 ;1/coef7
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!
 movups xmm7,[ebp-112]
 mulps xmm7,xmm6 ;x^16
 movups [ebp-128],xmm7 ;store x^16
 movups xmm2,coef8
 rcpps xmm1,xmm2 ;1/coef8
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!
 movups xmm7,[ebp-128]
 mulps xmm7,xmm6 ;x^18
 movups [ebp-144],xmm7;store x^18
 movups xmm2,coef9
 rcpps xmm1,xmm2 ;1/coef9
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!-x^18/18!
 movups xmm7,[ebp-144]
 mulps xmm7,xmm6 ;x^20
 movups [ebp-160],xmm7 ;store x^20
 movups xmm2,coef10
 rcpps xmm1,xmm2 ;1/coef10
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!-x^18/18!+x^20/20!
 movups xmm7,[ebp-160]
 mulps xmm7,xmm6 ;x^22
 movups [ebp-176],xmm7 ;store x^22
 movups xmm2,coef11
 rcpps xmm1,xmm2 ;1/coef11
 mulps xmm1,xmm7
 subps xmm0,xmm1;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!-x^18/18!+x^20/20!-x^22/22!
 movups xmm7,[ebp-176]
 mulps xmm7,xmm6 ;x^24
 movups [ebp-192],xmm7 ;store x^24
 movups xmm2,coef12
 rcpps xmm1,xmm2 ;1/coef12
 mulps xmm1,xmm7
 addps xmm0,xmm1 ; +x^24/24!
 movups xmm7,[ebp-192]
 mulps xmm7,xmm6 ;x^26
 movups xmm2,coef13
 rcpps xmm1,xmm2 ;1/coef13
 mulps xmm1,xmm7
 subps xmm0,xmm1

 mov ebx,OFFSET result
 
 movups [ebx],xmm0
 fld  DWORD PTR[ebx]
 call WriteFloat
 call Crlf
 sub cl,1
 jnz L3
 xor eax,eax
 jz L5

 error:
 movups xmm5,argument
 xor eax,eax
 jz L3

 L5:
 exit
 main ENDP
 END main
 

My code works with the movups[d] instruction, but movdqa has never been tested.

Thank you. I see that you're using a different movups instruction ( actually it is OK / page 623 in the manual ) and this is how the instruction is used in your code.

>>...
>>one REAL4 1.0,1.0,1.0,1.0......Note: memory is allocated here / It is Not a literal constant
>>...
>>movups xmm0, one
>>...

It means that the set of values 1.0,1.0,1.0,1.0 is moved from memory (!) to the xmm0 register. Once again, both instructions, that is movdqa and movups, cannot work with constants by design.

>>...both instructions, that is movdqa and movups, cannot work with constants by design...

Here is a really small test case with inline assembler in a C/C++ test application:
...
_asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF
...
[ Compilation output ]
...
..\prttests.cpp(8759) : error C2415: improper operand type
...

Here is another small test case:
...
__m128 mmValue = { 1.0L, 2.0L, 3.0L, 4.0L };

// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF // error C2415: improper operand type
// _asm MOVUPS xmm1, 0xFFFFFFFFFFFFFFFF // error C2415: improper operand type

_asm MOVDQA xmm0, [ mmValue ]
_asm MOVUPS xmm1, [ mmValue ]

_asm MOVDQA [ mmValue ], xmm2
_asm MOVUPS [ mmValue ], xmm3
...

>>>It means that the set of values 1.0,1.0,1.0,1.0 is moved from memory (!) to the xmm0 register. Once again, both instructions, that is movdqa and movups, cannot work with constants by design>>>

I misunderstood the problem. In my code the load is coming from memory, and this was not the thread starter's question. The problem is "how to load XMM register with the immediate value".

Exactly and I'd like to change my former statement to:

>>...both instructions, that is movdqa and movups, cannot work with literal constants by design...

>>>// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF // error C2415: improper operand type>>>

Now I remember I had the same situation when I tried to load XMM registers directly (with an immediate value). The second test is my preferred method of loading a 1D vector, represented by a 2- or 4-element array, into an XMM register. You can also load the registers when passing structure members. For this use the unaligned movups[d] instruction.

Quote:

Sergey Kostrov wrote:

Exactly and I'd like to change my former statement to:

>>...both instructions, that is movdqa and movups, cannot work with literal constants by design...

Sadly, Intel processor designers decided not to allow loading SSEn registers with immediate values. I would like to know the reason for such a decision.

>>one REAL4 1.0,1.0,1.0,1.0......Note: memory is allocated here / It is Not a literal constant
>>movups xmm0, one

It also assigns xmm0 the value of a vector with 4 components.

"how to load XMM register with the immediate value"

Yes, this is the exact question. It looks like the answer is no. So in the following code

movups xmm0,one

the constant one is loaded from memory, so the algorithm is not entirely register based and cache accesses come into play. The performance becomes unpredictable, and the SSE/AVX optimization may collapse.

>>...The performance becomes unpredictable, and the SSE/AVX optimization may collapse...

These are a different set of issues ( actually I don't see any problems here / this is how Intel designed these instructions ). The instruction formats are in the manual, so please take a look. Here is another test case ( compiled with the Intel C++ compiler ):
...
__m128 mmValue0 = { 1.0L, 2.0L, 3.0L, 4.0L };
__m128 mmValue1 = { 5.0L, 6.0L, 7.0L, 8.0L };

_asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF // Microsoft C++ compiler: Error C2415: improper operand type
_asm MOVUPS xmm1, 0xFFFFFFFFFFFFFFFF // Microsoft C++ compiler: Error C2415: improper operand type

_asm MOVDQA xmm0, xmmword ptr [ mmValue0 ]
_asm MOVUPS xmm1, xmmword ptr [ mmValue1 ]

_asm MOVDQA xmmword ptr [ mmValue0 ], xmm1
_asm MOVUPS xmmword ptr [ mmValue1 ], xmm0
...

[ Intel C++ compiler output ]
...
..\PrtTests.cpp(8762): (col. 8) error: Unsupported instruction form in asm instruction movdqa.
..\PrtTests.cpp(8763): (col. 8) error: Unsupported instruction form in asm instruction movups.
(0): catastrophic error: fatal error: compilation terminated
...

Because of the design of those SSEn load/store instructions there will be some performance penalty when the data needs to be loaded from memory for the first time.

Loading an XMM register from a pre-initialized structure member:

SinVector sinvec1 = {-0.1666666,-0.1666666,-0.1666666,-0.1666666}, *sinvec1ptr; sinvec1ptr = &sinvec1; // structure initialization

Loading the member:

movups xmm1,sinvec1

By using a custom typedef structure holding the array, aligned on a 16-byte boundary, I can use the movaps[d] instructions.
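A self-contained sketch of that idea (the SinVector definition and __declspec(align(16)) are my assumptions about the declaration; the value is the one shown above): with the alignment in place the aligned movaps form becomes legal alongside movups.

struct __declspec(align(16)) SinVector { float c[4]; };

static SinVector sinvec1 = { { -0.1666666f, -0.1666666f, -0.1666666f, -0.1666666f } };

void load_coefficient()
{
    _asm movups xmm1, sinvec1   // unaligned form: always safe
    _asm movaps xmm2, sinvec1   // aligned form: valid because of __declspec(align(16))
}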

 

 

 


>>"how to load XMM register with the immediate value"
>>
>>Yes, this is the exact question. It looks like the answer is no.

I think we're shifting our discussion to a different subject related to the quality of code generation. Are we really interested in that? Please take a look at a disassembled test case:

[ Intel C++ compiler ( it is more compact ) ]

...
// __m128 mmValue0 = { 1.0L, 2.0L, 3.0L, 4.0L };
0040471B movaps xmm0,xmmword ptr [dValue+60h (5D8320h)]
00404722 movaps xmmword ptr [mmValue0],xmm0
// __m128 mmValue1 = { 5.0L, 6.0L, 7.0L, 8.0L };
00404726 movaps xmm0,xmmword ptr [dValue+70h (5D8330h)]
0040472D movaps xmmword ptr [mmValue1],xmm0

// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF // Error C2415: improper operand type
// _asm MOVUPS xmm1, 0xFFFFFFFFFFFFFFFF // Error C2415: improper operand type

// _asm MOVDQA xmm0, xmmword ptr [ mmValue0 ]
00404731 movdqa xmm0,xmmword ptr [mmValue0]
// _asm MOVUPS xmm1, xmmword ptr [ mmValue1 ]
00404736 movups xmm1,xmmword ptr [mmValue1]

// _asm MOVDQA xmmword ptr [ mmValue0 ], xmm1
0040473A movdqa xmmword ptr [mmValue0],xmm1
// _asm MOVUPS xmmword ptr [ mmValue1 ], xmm0
0040473F movups xmmword ptr [mmValue1],xmm0
...

[ Microsoft C++ compiler ]

...
// __m128 mmValue0 = { 1.0L, 2.0L, 3.0L, 4.0L };
0043686D fld1
0043686F fstp dword ptr [mmValue0]
00436872 fld dword ptr [__real@40000000 (49AFDCh)]
00436878 fstp dword ptr [ebp-1Ch]
0043687B fld dword ptr [__real@40400000 (49AFD8h)]
00436881 fstp dword ptr [ebp-18h]
00436884 fld dword ptr [__real@40800000 (49B064h)]
0043688A fstp dword ptr [ebp-14h]
// __m128 mmValue1 = { 5.0L, 6.0L, 7.0L, 8.0L };
0043688D fld dword ptr [__real@40a00000 (49B060h)]
00436893 fstp dword ptr [mmValue1]
00436896 fld dword ptr [__real@40c00000 (49B05Ch)]
0043689C fstp dword ptr [ebp-3Ch]
0043689F fld dword ptr [__real@40e00000 (49B058h)]
004368A5 fstp dword ptr [ebp-38h]
004368A8 fld dword ptr [__real@41000000 (49B054h)]
004368AE fstp dword ptr [ebp-34h]

// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF // Error C2415: improper operand type
// _asm MOVUPS xmm1, 0xFFFFFFFFFFFFFFFF // Error C2415: improper operand type

// _asm MOVDQA xmm0, xmmword ptr [ mmValue0 ]
004368B1 movdqa xmm0,xmmword ptr [mmValue0]
// _asm MOVUPS xmm1, xmmword ptr [ mmValue1 ]
004368B6 movups xmm1,xmmword ptr [mmValue1]

// _asm MOVDQA xmmword ptr [ mmValue0 ], xmm1
004368BA movdqa xmmword ptr [mmValue0],xmm1
// _asm MOVUPS xmmword ptr [ mmValue1 ], xmm0
004368BF movups xmmword ptr [mmValue1],xmm0
...

I agree that the question was answered completely. Thanks.

Why not set /arch:SSE2 or /arch:AVX for the Microsoft compiler? Mixed x87 and SSE2 will never produce high-quality code generation.
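For reference, a hypothetical command line (the source file name is just the one that appears earlier in this thread):

cl /O2 /arch:SSE2 prttests.cpp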

>>>I agree that the question was answered completely. Thanks.>>>

I agree with you that the question has been answered completely.

Code generation is another problem. I do not think any C compiler can generate the best code. I am more concerned about the following asm code in a kernel that will be executed thousands and thousands of times. I want all memory accesses to be removed from the kernel function. For example,

init:

// _asm MOVDQA xmm, xmmword ptr [ mmValue0 ]
00404731 movdqa xmm8,xmmword ptr [mmValue0]

kernel:

use xmm8 as a constant

A solution requires using another group of registers, xmm8-xmm15 or ymm8-ymm15. Unfortunately this requires a 64-bit x64 platform. So code like the example from iliyapolak and the optimized AVX IDCT code from Intel  http://software.intel.com/en-us/articles/using-intel-advanced-vector-ext... can be faster.

If x86 had constant registers and allowed direct constant loading, there would be no need for further optimization.
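To illustrate the init/kernel split above, here is a hedged sketch in C++ with intrinsics (the function and variable names are made up): the constant is loaded from memory once before the loop, and in a 64-bit build the compiler is free to park it in one of xmm8-xmm15 for the whole kernel.

#include <immintrin.h>
#include <stddef.h>

// Sketch: the constant k is materialized once, outside the hot loop,
// so inside the loop it stays register-resident instead of being reloaded.
void scale_kernel(float *dst, const float *src, size_t n)
{
    const __m128 k = _mm_set1_ps(0.5f);            // "init": one load from the constant pool
    for (size_t i = 0; i + 4 <= n; i += 4) {       // "kernel": no constant reload per iteration
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, k));
    }
}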

>>>A solution requires using another group of registers, xmm8-xmm15 or ymm8-ymm15. Unfortunately this requires a 64-bit x64 platform. So code like the example from iliyapolak>>>

Even without using the additional XMM8-XMM15 and YMMn registers, my code which was presented as an example can be effectively optimized. If you are interested please follow this link: http://software.intel.com/en-us/forums/topic/347470

>>>I am more concerned about the following asm code in a kernel that will be executed thousands and thousands of times. I want all memory accesses to be removed from the kernel function>>>

What do you mean by "kernel" and "kernel function"? Are you referring to the kernel mode of operation, or maybe to some kind of mathematical kernel (like a Gaussian) which has to operate on some data set?

>>>If x86 had constant registers and allowed direct constant loading>>>

Afaik, so-called constant registers on some microarchitectures are read-only registers which hold constant values like zero or pi. I do not think that x86 GP and SSE registers can be classified as constant read-only registers.

"Even without the usage of additional XMM8-XMM15 and YMMn registers my code which was presented as an example can be effectively optimized.If you are intrested please follow this link http://software.intel.com/en-us/forums/topic/347470"

In your code all the constants are still loaded from memory. Suppose the sin function is called a million times: these constants will be loaded a million times. If instead you load these constants into registers once, you see the saving. Since you need to seal the sin function as a built-in, I do not know how to accelerate its constant access.

"What do you mean by "kernel" and "kernel function"?Are you referring to the kernel mode of operation or maybe  some kind of mathematical kernel (like a gaussian) which has to operate on some data set."

The term "kernel" is borrowed from OpenCL. It is not the kernel mode of the OS. All of this discussion is OS independent.

>>>Afaik, so-called constant registers on some microarchitectures are read-only registers which hold constant values like zero or pi. I do not think that x86 GP and SSE registers can be classified as constant read-only registers>>>

Forget constant registers and literal constants; they are not available for SSE/AVX. However, a literal constant can be applied to a 32-bit (and 64-bit?) general register. We see the boundary of SSE/AVX.

>>...Code generation is another problem. I do not think any C compiler can generate the best code.

Have you ever worked with Watcom C/C++ compiler ( older versions 9.x or 10.x )? I regret to see that it is almost forgotten. Sorry for a small deviation from the subject of the thread.

>>...I am more concerned about the following asm code in a kernel that will be executed thousands and thousands of times. I want all
>>memory accesses to be removed from the kernel function...

But the code is in memory anyway (!), already cached in L1 line, etc.

I simply would like to refer you to some performance numbers and please take a look at:

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual

Order Number: 248966-026
April 2012

APPENDIX C
INSTRUCTION LATENCY AND THROUGHPUT

Pages 746 and 748 for movdqa instruction
Pages 756 and 758 for movups instruction

"But the code is in memory anyway (!), already cached in L1 line, etc."

What does this mean? The instruction cache and data cache are separate. Below is the data from Appendix C.

The register-to-register forms of movdqa and movups are super fast on the Core architecture (latency 1) but slower on the P4 (latency 6).

1. Table C-9. Streaming SIMD Extension 2 128-bit Integer Instructions

0F_2H is the P4 Northwood and 0F_3H is the P4 Prescott

Instruction        Latency (0F_3H / 0F_2H)   Throughput (0F_3H / 0F_2H)   Execution Unit (0F_2H)
MOVDQA xmm, xmm    6 / 6                     1 / 1                        FP_MOVE
MOVDQU xmm, xmm    6 / 6                     1 / 1                        FP_MOVE

2. Table C-9a. Streaming SIMD Extension 2 128-bit Integer Instructions

Intel microarchitecture code name Westmere is represented by 06_25H, 06_2CH and 06_2FH. Intel microarchitecture code name Sandy Bridge is represented by 06_2AH.

Instruction       Latency                                       Throughput
CPUID             06_2A,  06_25/2C/1A/   06_17H,  06_0FH        06_2A,  06_25/2C/1A/   06_17H,  06_0FH
                  06_2D   1E/1F/2E/2F    06_1DH                 06_2D   1E/1F/2E/2F    06_1DH
MOVDQA xmm, xmm   1       1              1        1             0.33    0.33           0.33     0.33
MOVDQU xmm, xmm   1       1              1        1             0.33    0.33           0.33     0.5

While you can't load immediate values into xmm registers, you can load immediates into general-purpose 32/64-bit registers and then use movd/movq (and a shuffle/broadcast if needed) to initialize xmm/ymm registers with them. I wonder why compilers (at least gcc) don't do that when they generate floating-point code that involves constants. Perhaps because it turns out to be slower than a regular load from memory?
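A minimal sketch of that approach with intrinsics (the function name and the 32-bit pattern are arbitrary): the compiler encodes the constant as an immediate in a general-purpose mov, then movd and pshufd move and broadcast it, so no 16-byte data constant is needed.

#include <emmintrin.h>

/* Sketch: 32-bit immediate -> GP register -> xmm, then broadcast.
   _mm_cvtsi32_si128 maps to movd, _mm_shuffle_epi32 to pshufd. */
static inline __m128i splat_u32(void)
{
    __m128i x = _mm_cvtsi32_si128((int)0xFF0001FFu);  /* low 32 bits only */
    return _mm_shuffle_epi32(x, 0x00);                /* copy lane 0 to all four lanes */
}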

>>...Perhaps, because it turns out to be slower than a regular load from memory?..

It is a good point, but it is so far very speculative until it is proven in a small test case.

3-step operation A: CONSTANT -> load to a general-purpose register -> load to XMM register
vs.
3-step operation B: CONSTANT -> load to memory -> load to XMM register
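A hedged test-case sketch along those lines (loop counts and names are arbitrary; an optimizing compiler may hoist either variant out of the loop, so the generated assembly should be inspected before trusting the numbers):

#include <emmintrin.h>
#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h>     /* __rdtsc on MSVC / Intel compiler */
#else
#include <x86intrin.h>  /* __rdtsc on gcc */
#endif

int main(void)
{
    static const unsigned int mem[4] =
        { 0xFF0001FFu, 0xFF0001FFu, 0xFF0001FFu, 0xFF0001FFu };
    unsigned int sink = 0;
    unsigned long long t0, t1, t2;
    int i;

    t0 = __rdtsc();
    for (i = 0; i < 100000000; ++i) {        /* A: immediate -> GP register -> xmm */
        __m128i a = _mm_shuffle_epi32(_mm_cvtsi32_si128((int)0xFF0001FFu), 0);
        sink += (unsigned int)_mm_cvtsi128_si32(a);
    }
    t1 = __rdtsc();
    for (i = 0; i < 100000000; ++i) {        /* B: memory -> xmm */
        __m128i b = _mm_loadu_si128((const __m128i *)mem);
        sink += (unsigned int)_mm_cvtsi128_si32(b);
    }
    t2 = __rdtsc();

    printf("A: %llu cycles   B: %llu cycles   (sink=%u)\n",
           t1 - t0, t2 - t1, sink);
    return 0;
}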

>>>In your code all the constants are still loaded from memory. Suppose the sin function is called a million times: these constants will be loaded a million times. If instead you load these constants into registers once, you see the saving>>>

 

@chang-li

It seems that I have a problem with posting my message. This message is a reply to your quoted sentence in my previous post.

My intention was to optimize the sine function calculation. It was done by coefficient precalculation and a Horner scheme implementation. As you pointed out in your response, the problem lies in keeping and loading the constant coefficients from memory. One solution is to use the remaining XMM registers solely for storing the coefficients. For a single sine function call it could be a good solution, albeit at reduced accuracy, but for millions of sine function calls the executing thread can be preempted and other floating-point code can be scheduled to run, thus overwriting the XMM registers.
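As a rough illustration of the Horner idea (not the code from the linked thread; the coefficients below are just the first Taylor terms of sin(x), and the function name is made up), the coefficients are local __m128 values, so when the function is inlined into a loop the compiler can keep them register-resident across many calls:

#include <immintrin.h>

/* sin(x) ~ x + c3*x^3 + c5*x^5 + c7*x^7, evaluated by Horner's scheme. */
static inline __m128 sin_poly(__m128 x)
{
    const __m128 c3 = _mm_set1_ps(-1.6666666e-1f);   /* -1/3! */
    const __m128 c5 = _mm_set1_ps( 8.3333333e-3f);   /*  1/5! */
    const __m128 c7 = _mm_set1_ps(-1.9841270e-4f);   /* -1/7! */
    __m128 x2 = _mm_mul_ps(x, x);
    __m128 p  = _mm_add_ps(_mm_mul_ps(c7, x2), c5);  /* c7*x^2 + c5            */
    p = _mm_add_ps(_mm_mul_ps(p, x2), c3);           /* (..)*x^2 + c3          */
    p = _mm_mul_ps(_mm_mul_ps(p, x2), x);            /* (..)*x^2 * x           */
    return _mm_add_ps(x, p);                         /* + x                    */
}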

>>>Forget constant registers and literal constants; they are not available for SSE/AVX. However, a literal constant can be applied to a 32-bit (and 64-bit?) general register. We see the boundary of SSE/AVX>>>

Completely agree with you.

Haven't we discussed everything, guys? Good Luck!

informative post...:)

This is a short follow up. I just found that some GCC-like C++ compilers have an option:
...
-fforce-addr - Copy memory address constants into registers before use
...

Is the -fforce-addr option also related to SIMD registers?

>>...Is the -fforce-addr option also related to SIMD registers?

Command line help for the MinGW C++ compiler doesn't specify it. The manual provides a little bit more information:
...
`-fforce-addr'
Force memory address constants to be copied into registers before
doing arithmetic on them. This may produce better code just as
`-fforce-mem' may.
...
but it is still not clear which registers will be used.

Quote:

Sergey Kostrov wrote:

>>...Is the -fforce-addr option also related to SIMD registers?

Command line help for the MinGW C++ compiler doesn't specify it. The manual provides a little bit more information:
...
`-fforce-addr'
Force memory address constants to be copied into registers before
doing arithmetic on them. This may produce better code just as
`-fforce-mem' may.
...
but it is still not clear which registers will be used.

I guess it cannot be applied to the AVX registers because there are no such instructions.

Chang-li,

I understand what you are trying to achieve, but I fear that you would not gain much. Assuming there were a load instruction for YMM registers with an immediate, the encoding would be longer than 32 bytes. This would result in some major hiccups in the core. For example, the loop-stream detector processes the instructions in 32-byte chunks. Therefore, your instruction wouldn't even fit in one chunk!

On the other hand you have two load ports and can do up to two loads per cycle. Reading a constant from memory can be pipelined nicely with other loads as there are no dependencies. When you are absolutely limited by the number of loads, keeping at least some of the constants in a register might help as a last resort.

Kind regards

Thomas

>>>On the other hand you have two load ports and can do up to two loads per cycle.>>>

Will it stay the same on the Haswell architecture? I mean the load/store ports.

Quote:

Thomas Willhalm (Intel) wrote:

Chang-li,

I understand what you are trying to achieve, but I fear that you would not gain much. Assuming there were a load instruction for YMM registers with an immediate, the encoding would be longer than 32 bytes. This would result in some major hiccups in the core. For example, the loop-stream detector processes the instructions in 32-byte chunks. Therefore, your instruction wouldn't even fit in one chunk!

On the other hand you have two load ports and can do up to two loads per cycle. Reading a constant from memory can be pipelined nicely with other loads as there are no dependencies. When you are absolutely limited by the number of loads, keeping at least some of the constants in a register might help as a last resort.

Kind regards

Thomas

It is true for YMM* registers, which are 256-bit (32 bytes). But XMM* registers are 128-bit (16 bytes), so a direct-constant instruction could fit in one chunk.

Chang

>>...But XMM* registers are 128-bit (16 bytes), so a direct-constant instruction could fit in one chunk...

What about the throughput of instructions? For example, in the case of a General Purpose MOV instruction it is 3 instructions in one clock cycle. Take a look at the Intel Optimization Reference manual for more information.

Quote:

Sergey Kostrov wrote:

>>...But XMM* registers are 128-bit (16 bytes), so a direct-constant instruction could fit in one chunk...

What about the throughput of instructions? For example, in the case of a General Purpose MOV instruction it is 3 instructions in one clock cycle. Take a look at the Intel Optimization Reference manual for more information.

There is no XMM* direct-constant assignment instruction yet.
