FYI: GCC and _mm_pause();

jimdempseyatthecove

I was following up on a behavioral difference in a program when compiled with GCC C++ versus Intel C++, and thought I would pass the information on to the forum in case it is important to its readers.

Intel C++ generates code containing the P4 "PAUSE" instruction, whereas GCC C++ generates code containing "rep; nop". This is due to code generation assuming 80386 compatibility, although in my test case I was generating a 64-bit application. PAUSE came in with P4.

The issue is that PAUSE is a low-power, short-duration stall, whereas "rep; nop" will be a short-duration, compute-intensive stall. The duration is likely much shorter than PAUSE, and the power consumption will be higher. This issue is observed in code like the following in a multi-threaded program:

volatile int flag = 0;
...
while(!flag)
_mm_pause();

Jim Dempsey

www.quickthreadprogramming.com
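
Editor's sketch of the spin-wait pattern described in the post above. It is not the poster's code. Two things are assumptions made here: the flag is a std::atomic<int> rather than a volatile int (a substitution, since volatile alone gives no inter-thread ordering guarantees in standard C++), and the names CPU_RELAX and spin_until_set are hypothetical. The non-x86 branch exists only so the sketch compiles elsewhere.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>              // declares _mm_pause()
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)       // assumption: no-op fallback on non-x86 targets
#endif

// Spins until `flag` becomes non-zero, issuing one pause hint per poll.
// Returns how many times the loop body ran (0 if the flag was already set).
inline unsigned long spin_until_set(const std::atomic<int>& flag) {
    unsigned long polls = 0;
    while (flag.load(std::memory_order_acquire) == 0) {
        CPU_RELAX();
        ++polls;
    }
    return polls;
}
```

In a real program another thread would eventually call flag.store(1, std::memory_order_release); without that, the loop never exits.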
bustaf

Hi
Use usleep() to solve this:
http://linux.die.net/man/3/usleep
Regards

jimdempseyatthecove

The point is not to get a low-power short interval wait.
The point is _mm_pause() is supposed to insert the PAUSE instruction.

If I did not want the PAUSE instruction, I would write the code without _mm_pause():

volatile int flag = 0;
...
while(!flag)
continue;
...

or

#if defined(USE_mm_pause)
#define PAUSE() _mm_pause()
#else
/* usleep() takes a microsecond count; 1 is only a placeholder value */
#define PAUSE() usleep(1)
#endif
...
while(!flag)
PAUSE();

Jim Dempsey

Sergey Kostrov
Quoting jimdempseyatthecove:
The point is not to get a low-power short interval wait.
The point is _mm_pause() is supposed to insert the PAUSE instruction.

If I did not want the PAUSE instruction, I would write the code without _mm_pause():
...
Jim Dempsey

Did you try to create your own replacement for the '_mm_pause' intrinsic function when compiling
for Linux? I've just done a quick test, and:
...
__asm__ ( "pause;" );
...
was easily compiled by a g++ compiler to:
...
LM5498:
/APP
pause;
/NO_APP
LBE971:
...

Note:
It is from a *.s file

PS1: I really don't like how the software developers of the GCC project implemented support for intrinsic
functions. I recently had a problem with '_mm_prefetch' on a Linux platform. Now another software
developer has issues with '_mm_pause'. There is a strange comment in the 'xmmintrin.h' header file:

...
/* Implemented from the specification included in the Intel C++ Compiler
User Guide and Reference, version 8.0. */
...

It would be nice to find that document and to investigate what Intel really recommends! And in my GCC
installation '_mm_pause' is 'rep-nop-ed' as well:

...
static __inline void
_mm_pause (void)
{
__asm__ __volatile__ ("rep; nop" : : );
}
...

PS2: My guess is that an old version of the GCC compiler couldn't compile 'pause', so a software
developer decided to use the 'rep; nop' instructions instead. Later, everybody forgot about it.

jimdempseyatthecove

Sergey,

Thanks for taking the time to comment on this. My code does have many __asm__ support routines, and I will likely insert _mm_pause_really(). The point I was making is that I got blind-sided by _mm_pause() not being implemented "properly".

By "properly" I mean: use PAUSE (pause), and, when compiling with -march=i386 (or some architecture that does not support pause), have the compiler generate an error unless the user supplies a (new) option to explicitly substitute something for PAUSE, or the user inserts their own code to use _mm_pause or their choice of something else.

You found similar issues with _mm_prefetch, how many other similar issues are there lurking out there?

Jim Dempsey
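
Editor's sketch of the policy described in the post above, written as a wrapper rather than a compiler change. It assumes a GCC-compatible compiler (GNU inline asm). The names mm_pause_really, ALLOW_PAUSE_SUBSTITUTE and wait_for_flag are hypothetical. Using __SSE2__ as the test for "the target has PAUSE" is a choice made for this sketch, based on PAUSE having been introduced alongside SSE2 on the Pentium 4.

```cpp
#include <cassert>

#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
// Target is known to have PAUSE: emit the mnemonic directly.
static inline void mm_pause_really() {
    __asm__ __volatile__("pause" ::: "memory");
}
#elif defined(__i386__) && defined(ALLOW_PAUSE_SUBSTITUTE)
// Hypothetical opt-in: the user explicitly accepts a substitute spelling.
static inline void mm_pause_really() {
    __asm__ __volatile__("rep; nop" ::: "memory");
}
#elif defined(__i386__)
#error "Target may lack PAUSE: build with -msse2 or define ALLOW_PAUSE_SUBSTITUTE"
#else
// Non-x86 targets are outside the scope of this thread: compiler barrier only.
static inline void mm_pause_really() {
    __asm__ __volatile__("" ::: "memory");
}
#endif

// Polls `flag` up to `max_polls` times, pausing between polls.
// Returns true if the flag was observed non-zero.
static inline bool wait_for_flag(const volatile int* flag, unsigned max_polls) {
    for (unsigned i = 0; i < max_polls; ++i) {
        if (*flag) return true;
        mm_pause_really();
    }
    return *flag != 0;
}
```

The bounded poll count is only there so the sketch can be exercised single-threaded; the loops in the thread spin without a limit.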

Sergey Kostrov
Quoting jimdempseyatthecove:
...
You found similar issues with _mm_prefetch, how many other similar issues are there lurking out there?

Jim Dempsey

Integration and portability problems existed, exist, and will exist as long as developers keep adding new
features to existing software products. It gets worse when compatibility with an older software product has
to be provided.

My recent "discovery" is in Visual Studio 98 (some companies are still using it!). For example, a
declaration like:

typedef union _RTALIGN16 tagRTm128i
{
RTint8 m128i_i8[16];
RTint16 m128i_i16[8];
RTint32 m128i_i32[4];
RTint64 m128i_i64[2];
RTuint8 m128i_u8[16];
RTuint16 m128i_u16[8];
RTuint32 m128i_u32[4];
RTuint64 m128i_u64[2];
} RTm128i;

could not be compiled by the Visual C++ 6.0 compiler from Visual Studio 98 because of _RTALIGN16 after
the keyword 'union'.

A declaration without _RTALIGN16 like:

typedef union tagRTm128i
{
...
} RTm128i;

will be successfully compiled.

Another "little" problem is that Visual Studio 98 doesn't have built-in types like m128 or m128i, etc., but
in some of Microsoft's internal DLLs, I mean DLLs from Visual Studio 98, these types are already used!

Best regards,
Sergey

jimdempseyatthecove

RE: _RTALIGN16

Aligned data is one of those implementation issues. If you must also compile with GCC C++, the alignment attribute goes on the tail end of the declaration.

BTW - you cannot (should not) simply remove the _RTALIGN16. You must rework the declaration so that RTm128i has a 16-byte alignment attribute. Without the attribute you rely on chance for alignment to 16-byte boundaries (or you are left with explicitly specifying the alignment on every instantiation of an RTm128i object).

Jim Dempsey
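
Editor's sketch of one way to keep the 16-byte alignment across compilers, as the post above recommends. The macro names RT_ALIGN16_PRE and RT_ALIGN16_POST and the helper is_aligned16 are hypothetical. The RTint*/RTuint* types from the earlier post are not defined anywhere in the thread, so the <cstdint> fixed-width types stand in for them here. This assumes a compiler that accepts either __declspec(align(16)) or __attribute__((aligned(16))); it says nothing about what Visual C++ 6.0 itself accepts.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER)
#define RT_ALIGN16_PRE  __declspec(align(16))          // goes after 'union'
#define RT_ALIGN16_POST
#elif defined(__GNUC__)
#define RT_ALIGN16_PRE
#define RT_ALIGN16_POST __attribute__((aligned(16)))   // goes after the closing brace
#else
#error "No known 16-byte alignment syntax for this compiler"
#endif

typedef union RT_ALIGN16_PRE tagRTm128i {
    std::int8_t   m128i_i8[16];
    std::int16_t  m128i_i16[8];
    std::int32_t  m128i_i32[4];
    std::int64_t  m128i_i64[2];
    std::uint8_t  m128i_u8[16];
    std::uint16_t m128i_u16[8];
    std::uint32_t m128i_u32[4];
    std::uint64_t m128i_u64[2];
} RT_ALIGN16_POST RTm128i;

// True if `p` sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With the attribute on the type, every RTm128i object (locals, globals, array elements) gets the alignment without annotating each instantiation.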

Sergey Kostrov
Quoting jimdempseyatthecove:
...
You found similar issues with _mm_prefetch, how many other similar issues are there lurking out there?
...
Jim Dempsey

I've found another one. Please take a look if interested:

http://software.intel.com/en-us/forums/showthread.php?t=101379&p=1#173169

It is related to inline assembler in the GCC or MinGW C/C++ compilers and the RDTSC instruction.

Best regards,
Sergey

Russell Selph

I realize this discussion is quite old at this point, but I thought I'd add a comment for those that find this page via google:

The PAUSE instruction assembles to the byte string 'F3 90', and REP; NOP assembles to 'F3 90'. No difference. This is according to the 'Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2B Instruction Set Reference, N-Z'.

Also check out the discussion on stack overflow. I'd put in the link, but the spam filter is unhappy with that. It's question #7086220, titled 'What does “rep; nop;” mean in x86 assembly?'
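
Editor's sketch of a way to check the encoding claim above with your own toolchain. It assumes a GCC-compatible compiler targeting x86 ELF (for example Linux). The labels enc_pause, enc_rep_nop and enc_end and the helper names are hypothetical. Both spellings are assembled into read-only data, so the bytes can be compared without being executed. On any other target the sketch falls back to the bytes quoted from the manual, which checks nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__GNUC__) && defined(__ELF__) && (defined(__x86_64__) || defined(__i386__))
// Let the assembler encode each spelling into .rodata.
__asm__(
    ".pushsection .rodata\n"
    "enc_pause:\n\tpause\n"
    "enc_rep_nop:\n\trep; nop\n"
    "enc_end:\n"
    ".popsection\n");
extern "C" const unsigned char enc_pause[], enc_rep_nop[], enc_end[];
#else
// Fallback (nothing is assembled): the bytes as quoted from the manual.
static const unsigned char enc_bytes[4] = {0xF3, 0x90, 0xF3, 0x90};
static const unsigned char* const enc_pause   = enc_bytes;
static const unsigned char* const enc_rep_nop = enc_bytes + 2;
static const unsigned char* const enc_end     = enc_bytes + 4;
#endif

static inline std::uintptr_t addr(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Encoded length of each spelling, measured between adjacent labels.
inline std::size_t pause_len()   { return addr(enc_rep_nop) - addr(enc_pause); }
inline std::size_t rep_nop_len() { return addr(enc_end) - addr(enc_rep_nop); }

// True if both spellings produced the same byte string.
inline bool same_encoding() {
    return pause_len() == rep_nop_len() &&
           std::memcmp(enc_pause, enc_rep_nop, pause_len()) == 0;
}
```

Disassembling an object file with objdump -d is the more usual way to confirm this; the sketch only makes the comparison scriptable.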

jimdempseyatthecove

On Pentium 4 and later processors this byte sequence introduces a random short stall in a low-power, low-cache-interaction state, without altering processor state. The one-byte instruction 90 is XCHG AX,AX, or EAX,EAX, or RAX,RAX, depending on data width (and/or an optional prefix). XCHG with self is effectively a "nop", and XCHG does not alter flags. NOP is an alias for XCHG eAX,eAX.
The REP prefix is used on string instructions, where eCX contains the iteration count for the following (string) instruction. For non-string instructions the behavior of the prefix is undefined...
Except for REP XCHG eAX,eAX, in which case you get a random short stall (low power, low cache interaction) without decrementing eCX.

On processors earlier than the P4, CX _may_ have been decremented and the instruction repeated that many times. Other processors may have faulted with an illegal instruction.

Jim Dempsey

Sergey Kostrov

>>...The PAUSE instruction assembles to the byte string 'F3 90', and REP; NOP assembles to 'F3 90'. No difference...

That is interesting and thanks for the note.
