SIMD Inline asm execution help

srimks wrote:

Hello All.

I have a piece of code in both C and inline assembly -

------c code----
#include

void add (float *a, float *b, float *c)
{
int i;

for (i = 0; i < 4; i++)
c[i] = a[i] + b[i];
}

int main()
{
return 0;
}
----

and the inline asm code as -

----inline asm----
#include

void add(float *a, float *b, float *c)
{
asm (".intel_syntax noprefix\n\t"
"mov eax, a"
"mov edx, b"
"mov ecx, c"
"moovaps xmm0, XMMWORD PTR [eax]"
"addps xmm0, XMMWORD PTR [edx]"
"movaps XMMWORD PTR [ecx], xmm0"
);
}

int main()
{
return 0;
}
----

The C code above is fine, no doubt about it, but the inline assembly code gave the error messages below -
---
$ icc add-simd.c
/tmp/iccW8zIXvas_.s: Assembler messages:
/tmp/iccW8zIXvas_.s:47: Error: too many memory references for 'mov'
--

I am not able to interpret the above error message.

I know I am missing something, as this is the first inline asm I have ever written, so I am looking for solutions and references for writing inline assembly with the Intel C++ Compiler v11.0 (ICC) on an x86_64 Linux machine.

~BR

srimks wrote:
Quoting - srimks (original post quoted above)

Is "Igor" around? Probably he can help with the above.

srimks wrote:
Quoting - gabest Maybe there aren't enough end-of-lines (\n) there.

I added the asm with "\n\t" as below -

--
#include

void add(float *a, float *b, float *c)
{
asm (".intel_syntax noprefix\n\t"
"mov eax, a\n\t"
"mov edx, b\n\t"
"mov ecx, c\n\t"
"movaps xmm0, XMMWORD PTR [eax]\n\t"
"addps xmm0, XMMWORD PTR [edx]\n\t"
"movaps XMMWORD PTR [ecx], xmm0\n\t"
);
}

int main()
{
return 0;
}
--

I still get the errors below -
--
$ icc add-simd.c
/tmp/iccstyfigas_.s: Assembler messages:
/tmp/iccstyfigas_.s:50: Error: Unknown operand modifier `XMMWORD'
/tmp/iccstyfigas_.s:51: Error: Unknown operand modifier `XMMWORD'
/tmp/iccstyfigas_.s:52: Error: Unknown operand modifier `XMMWORD'
--

Why is the "Unknown operand modifier `XMMWORD'" message being generated here? I am using ICC-v11.0 on Linux x86_64 to compile this add-simd.c inlined assembly code.

~BR

gabest wrote:

Hm, try getting rid of those PTRs; just use "movaps xmm0, [eax]\n" for example. But I have no other idea.

Tim Prince wrote:
Quoting - srimks Why is the "Unknown operand modifier `XMMWORD'" message being generated here?

The different default asm syntaxes on Linux and Windows are one of the reasons for preferring intrinsics (not to mention standard C). There's no "guess what I mean" facility for inline asm. But you said a few days ago you didn't want advice from many of us.

srimks wrote:
Quoting - tim18 (quoted above)

Timothy Prince (tim18)

Congrats on becoming a Black Belt. I think you misunderstood my statements; I never said anything like that, and if you or the community took it that way, I am really sorry.

For the above asm, am I missing something that would make the assembler recognise the XMMWORD keyword? This is actually my first attempt at asm; I am still in the learning phase of inline assembly.

Do I need to install something that recognises the Intel/MASM format of inline asm? I am using ICC-v11.0 to compile the above asm code.

~BR
Mukkaysh Srivastav

Tim Prince wrote:

Thanks for the congratulations.
If you can get your example working with gcc, with some option such as -masm=intel, and that isn't supported by icc, you could file an issue on premier.intel.com asking whether it is possible. You may also have to explain why standard C or the xmmintrin extension is not a better way.
I wouldn't hold my breath, as the main reason for supporting "intel" syntax is to support compatibility with VC. As you've already seen, that involves a different format. Besides, Microsoft formed a policy when first beginning to support 64-bit of discouraging asm and extending the mmintrin.h scheme, which is supported to a fair extent by the corresponding linux compilers. I did succeed once in getting one of the Microsoft intrinsics adopted in icc, by submitting the feature request on premier.intel.com.
I have an example of inline asm, where I have to use gcc, even though it's not a question of syntax choice. My (64-bit) example was supported by icc for a while, then not supported, as it was never supported by VC. So, it's not always enough to have a useful extension supported by one or the other of gcc or VC to get it in icc.

srimks wrote:
Quoting - gabest Hm, try getting rid of those PTRs; just use "movaps xmm0, [eax]\n" for example.

As suggested, but I get the error messages below -

---
$icc add-simd.c
/tmp/iccTlamjs.o(.text+0x33): In function `add':
: undefined reference to `a'
---

~BR

srimks wrote:
Quoting - tim18 (full reply quoted above)

I think if I need to run the same thing with the GNU compilers, then I have to rewrite the above sample asm in AT&T format. This asm is very elementary as far as its purpose goes: simply adding 4 numbers.

~BR

Tim Prince wrote:

Well, that's close to the position. As the gcc option to use Intel syntax is seldom used, and of no use for compatibility with Microsoft, it may not be supported by icc.
As you can add 4 numbers in supported ways, the example doesn't explain why you would want this.
You might argue that as PTU displays Intel syntax by default, and the option to change it is well hidden, there should be equivalent support for Intel syntax in icc. However, asking for such consistency among Intel tools hasn't gained much sympathy.

Igor Levicki wrote:
Quoting - srimks

Is "Igor" around? Probably he can help with the above.

Sorry for not seeing it earlier, man has to sleep sometimes.

I'll have to disappoint you -- I am no Linux inline assembler expert but I will try to help anyway.

If I remember correctly, mov eax, a is the same as mov eax, OFFSET a -- that is, the address of a, not its contents, which probably isn't what you want. I would write it this way:

mov	eax, dword ptr [a]
mov	edx, dword ptr [b]
mov	ecx, dword ptr [c]
movaps	xmm0, xmmword ptr [eax]
addps	xmm0, xmmword ptr [edx]
movaps	xmmword ptr [ecx], xmm0

I would also try with and without \n at each line end (\t should not be required). Replacing xmmword with oword might do the trick as well.

Bear in mind that to benefit from vectorization you have to have more iterations -- this addition should be structured as a loop instead of a function which gets called in a loop because calling a function adds considerable overhead.

Finally, when you have a loop with more iterations you can perform unrolling to the cache line size (64 bytes or 16 floats per iteration). Of course such a loop would require a scalar tail to process the remaining elements if the loop trip count isn't divisible by 16. Both versions would require that a, b, and c be 16-byte aligned.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
Igor Levicki wrote:

Just tested with ICC on Windows -- mov eax, a generates the same code as mov eax, [a]. Funny, I could swear there was a difference between those two at some point; perhaps MASM handled it differently, or it is just a 16-bit legacy in my head.

Tim Prince wrote:
Quoting - Igor Levicki

Finally, when you have a loop with more iterations you can perform unrolling to the cache line size (64 bytes or 16 floats per iteration). Of course such a loop would require a scalar tail to process the remaining elements if the loop trip count isn't divisible by 16. Both versions would require that a, b, and c be 16-byte aligned.

The full cache line unrolling is useful on processors like Core 2 and predecessors, where movups (unaligned loads) are desired inside the cache line, but not across the cache line boundary. The loop would be adjusted so that scalar loads occur only when the boundary is crossed. There is no need for full cache line unrolling on latest CPUs (Intel Core i7, AMD Barcelona); just the alignment adjustment so as to use movaps for as many stores as possible is sufficient.
Use of the mmintrin header intrinsics to avoid inline asm solves the Linux vs Windows syntax issues.

Igor Levicki wrote:
Quoting - tim18 (quoted above)

Tim,

I brought up cache line unrolling just to emphasize that the assembler code shown here is far from optimal on any CPU.

Furthermore, I was suggesting it with the assumption that the arrays (a, b, c) are at least cache-line aligned, so that you can use movaps without crossing boundaries, which is what I usually do in my code.

Tim Prince wrote:

The value of unrolling is a complicated question. In a loop which stores 3 or more data streams, unrolling beyond the minimum for vectorization is less likely to show value. When the compiler distributes (splits) large loops into small enough pieces that further unrolling is called for, the net undesirable effect of excessive code expansion becomes overwhelming.
Vectorizing compilers automatically peel loops so that at least one stored stream should use movaps with 16-byte alignment. In the most desirable situation, that is sufficient to permit movaps for all loads and stores, and there is no possibility of a cache line split.
I didn't want to open up the entire question here, only to point out that the main value of full cache line unrolling comes when some of the data aren't aligned, but, on the older CPU models, performance may be optimized by choosing movups within the cache line and scalar across the boundary.
Understanding of these questions is valuable when examining compiler generated code to evaluate its quality, and setting data alignments, even when not writing low level code.
