AVX transition penalties and OS support

Hello,

I already have some experience with SSE-to-AVX transition penalties and have read the following article: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoidin...

It says that only vzeroall or vzeroupper puts the CPU into the clean state in which no penalties can occur.

Isn't this a problem with multithreading and multiprocessing? I mean, assume process A is running legacy SSE code, for example normal floating-point operations with scalar SSE instructions, and process B is using AVX and only executes a vzeroupper at the end of a function.

What if a context switch occurs in the middle of the AVX code? The OS will switch the context, including the YMM registers. But even if the upper halves are all zero, wouldn't the CPU remain in the other state? So context switches might lead to penalties for process A without any influence from the programmer. Or is there something I misunderstood?

This scenario just came to my mind, and I don't know how one could solve it. Or is there a way for the OS to avoid this problem?


An AVX-enabled OS is supposed to protect and hide all upper register contents during context switches. So, if the OS supports AVX properly, it can also run "legacy" SSE jobs without interactions with AVX jobs. If you run an OS which doesn't support AVX (Windows XP, Windows 7 without SP1, and Red Hat 5 come to mind), your concerns about AVX are well founded.

>>...What if context switch occurs in the middle of AVX code?..

It is a quite possible scenario. However, if your AVX code is small ( let's assume it is... ) and doesn't do too many calculations, then you could use a synchronization object, for example a Critical Section, to prevent the context switching. Does it make sense?

>>>you could use a synchronization object, for example a Critical Section, to prevent the context switching. Does it make sense?>>>

Can you prevent context switching when the scheduler code runs at DPC level, way above the normal thread execution level? Sometimes a hardware event will trigger an ISR, and all normal activity at or below DIRQL will be postponed.

IIRC the floating-point state (including the XMM and YMM register context) will be saved in the KPCR structure.

Ok, let's say the OS supports it; otherwise it does not make sense, as even calculation errors might appear.

But I do not really understand how the OS can hide this. I mean, the context switch is transparent to the software, and this also applies to the AVX registers. But the state of the CPU concerning AVX and SSE must be saved, too. How does the OS do this?

I know small real-time OSes. They, for example, save all registers and the CPU status register on the task stack and restore them from another task's stack; then processing time is passed to that task. But the internal AVX or SSE state of the CPU is not in a register, is it? So simply swapping register values for each task might not be enough.

Or can this state also be saved and restored by special CPU instructions? If so, then I agree that all conditions for a proper context switch are fulfilled.

>>>But I do not really understand how the OS can hide this. I mean, the context switch is transparent to the software, and this also applies to the AVX registers. But the state of the CPU concerning AVX and SSE must be saved, too. How does the OS do this?>>>

The Windows OS saves the processor context, which also includes the SSEn registers, into a special data structure called the Kernel Processor Control Block (KPRCB). If you know how to work with WinDbg, you can dump this structure with the commands "!pcr" and "dt nt!_KPCR <address of PCR>" and look for the pointer to the KPRCB structure. IIRC the KPRCB contains the saved SSE register context. There is also a special routine called KeSaveFloatingPointState, which is called by a driver and is used to store the volatile floating-point context.

Quote:

iliyapolak wrote:

>>>But I do not really understand how the OS can hide this. I mean, the context switch is transparent to the software, and this also applies to the AVX registers. But the state of the CPU concerning AVX and SSE must be saved, too. How does the OS do this?>>>

The Windows OS saves the processor context, which also includes the SSEn registers, into a special data structure called the Kernel Processor Control Block (KPRCB). If you know how to work with WinDbg, you can dump this structure with the commands "!pcr" and "dt nt!_KPCR <address of PCR>" and look for the pointer to the KPRCB structure. IIRC the KPRCB contains the saved SSE register context. There is also a special routine called KeSaveFloatingPointState, which is called by a driver and is used to store the volatile floating-point context.

I think that the SSE context is saved in the FXSAVE_FORMAT structure.

>>...Can you prevent the context switching when the Scheduler code runs at DPC level way above the normal thread execution level?

It is impossible to answer Yes or No. But if the priority of a thread is raised to Time Critical, all threads with lower priorities will be preempted. On Windows XP, for example, mouse and keyboard input, UI updates, and the Task Manager are preempted completely. It is not recommended in cases when calculations or processing take too much time; it should be done only for really critical, small pieces of code.

>>>It is impossible to answer Yes or No. But if the priority of a thread is raised to Time Critical, all threads with lower priorities will be preempted.>>>

Yes you are right.

>>>On Windows XP, for example, mouse and keyboard>>>

If you mean a driver's routine which is servicing a hardware mouse or keyboard event, that is not the same thing as a thread's priority. User-mode threads usually run at IRQL = PASSIVE_LEVEL and cannot block an ISR or DPC routine. Only a thread running in kernel mode can raise the IRQL above DPC level and preempt the execution of vital system code.

>>...If you mean a driver's routine which is servicing a hardware mouse or keyboard...

I think Yes. Unfortunately, we're moving away from AVX-related issues...

>>>But the state of the CPU concerning AVX, SSE must be saved, too. How should the OS do this?>>>

By saving the contents of the SSE and AVX registers to a special structure, probably the FXSAVE_FORMAT structure.

>>>Unfortunately, we're moving away from AVX related issues...>>>

Sometimes in the heat of a discussion small deviations from the main topic are unavoidable :)

Thanks for the inputs,

I looked in the manual you suggested in the other thread. Now I understand it much better.

Modern Intel CPUs are quite complex (yes, my comparison with an 8-bit AVR was not very apt). All of the state, including the SSE, AVX, and FPU state (and surely a lot of other things), can be stored. Thus, context switches that save and restore this information along with the registers handle everything correctly.

Sorry for the late answer, but I was away a few days and had to do some things for university.

I would recommend that you read the Windows Internals books. You can find a lot of deep technical information there.

Could you recommend anything special or a certain one?

>>>Could you recommend anything special or a certain one?>>>

Yes of course.

Please follow this link http://www.amazon.co.uk/Windows-Internals-PRO-Developer-Mark-Russinovich...

This book is filled with advanced technical information about the inner workings of the Windows OS. As expected from Microsoft, you won't find any code there, but the level of explanation goes very deep into the kernel.

One more question: I have been looking for exact transition penalty figures in cycles. Unfortunately I could not find any, although I have something like 60-80 cycles in mind. Where do I find this information?

Do you mean a transition from user mode to kernel mode?

Here is a very interesting post: ://forum.osdev.org/viewtopic.php?p=117933#p117933

>>... I have in mind something about 60-80 cycles...

There are some estimated values in the article you referenced in your first post of the thread. These numbers, I mean 60-80 cycles, look right.

iliyapolak,

I think I did not state my question clearly enough. I am talking about the AVX state transition caused by mixing AVX with legacy SSE.

Sergey,

thanks for that, I thought I had read it in one of the manuals and did not think of this article. But the manuals do not mention anything, is this right? I mean concrete cycle counts.

But can I find any concrete information about cycle counts for both directions?

Hi Christian,

>>...But can I find any concrete information about cycle counts for both directions?

Please try to look at the Intel manuals, or try to find all ( or as many as possible ) Intel articles related to that subject. Sorry, I can't suggest anything else right now.

>>...Sorry, I can't suggest anything else now...

Christian, are you interested in another small project related to measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems?

>>>I think I did not state my question clearly enough. I am talking about the AVX state transition caused by mixing AVX with legacy SSE>>>

It is ok :) Later I understood your intention. I have found this article about the transition penalty: ://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

Btw. I disabled the rich-text option and removed any occurrence of www and http protocol identifiers.

>>>It is ok :) Later I understood your intention. I have found this article about the transition penalty: ://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

Thanks for the article!


>>>Christian, are you interested in another small project related to measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems?

Yes, that would be interesting! Write me a PM or let's open a new thread, or is it related to this subject?

I looked in the manuals once again. So far I only have these two articles that mention concrete numbers. The manuals only provide hints on why and how to avoid transition penalties. If I find anything else, I will post it here.

>>>Thanks for the article!>>>

As always, you are welcome.

>>...Yes, that would be interesting! Write me a PM or let's open a new thread, or is it related to this subject?

Christian, we could proceed the same way as we did with the SqrtTestApp project. So, I'll prepare some proposals and let you know as soon as they are ready. Please also think about and prepare your own proposals for how you would measure these transitions.

A new thread "Measuring AVX-to-SSE and SSE-to-AVX transitions on Sandy Bridge and Ivy Bridge systems" could be created as soon as we have some numbers. It doesn't make sense to create it right now.

Ok, this is my first idea:

Let us code a simple loop that does, for example, an addition and a multiplication with AVX. Then we store only the lower half of the register with an SSE intrinsic. This should create both transition penalties. Compiling this code with and without /arch:AVX, we should get one version with the penalties and one without. This can be checked with Intel SDE, which reports the exact count of transitions. If we measure the time and compare the results, the difference should be the time spent on penalties. Knowing the number of transitions, we can convert this into concrete cycle numbers.

What do you think about it? Do you have another idea, or do you see any problems?

I am thinking about one thing: if we only use one half of the AVX result and store it with SSE, will the compiler realize it could use SSE instead of AVX, since we only need half of it? Then the test might fail.

The Intel compiler, when /arch:AVX is set so as to support AVX intrinsics, generates equivalent AVX-128 code from SSE intrinsics, so there should be no transition penalty.

>>...The Intel compiler, when /arch:AVX is set so as to support AVX intrinsics, generates equivalent AVX-128 code from SSE
>>intrinsics, so there should be no transition penalty.

Thanks for the comment, Tim. I think we shouldn't use /arch:AVX option.

Hi everybody,

>>...What do you think about it, do you have another idea or see some problem?

Christian's proposals are very good, and please take a look at my proposals ( they are similar ):

- Create a test application with support for 32-bit and 64-bit Windows platforms
- Disable all optimizations in the Release and Debug configurations of the test application
- Select two instructions to reproduce the AVX-to-SSE transition ( AVXI1 - AVX instruction 1 / SSEI1 - SSE instruction 1 )
- Select two instructions to reproduce the SSE-to-AVX transition ( SSEI2 - SSE instruction 2 / AVXI2 - AVX instruction 2 )
- Implement two test cases to reproduce the AVX-to-SSE and SSE-to-AVX transitions ( as simple as possible )
- Verify in the debugger that the compiler did not replace SSE instructions with AVX-128 instructions
- Verify with SDE that transitions are present in both cases ( AVX-to-SSE and SSE-to-AVX )
- Do all measurements with the priority of the process switched to Real-Time
- Use the rdtsc instruction to measure all time intervals in clock cycles ( cc )
- The number of iterations should be around 2^20, which will provide acceptable accuracy of +/- 5 ( cc )
- Measure the overhead of an empty for loop ( OEFL(cc) - Overhead of Empty For Loop )
- Measure the latency of AVXI1(cc)
- Measure the latency of SSEI1(cc)
- Measure the latency of AVXI2(cc)
- Measure the latency of SSEI2(cc)
- Measure the total time of the test for the AVX-to-SSE transition, TTT1(cc)
- Measure the total time of the test for the SSE-to-AVX transition, TTT2(cc)
- Calculate the time for the AVX-to-SSE transition: T-AVX2SSE(cc) = ( TTT1(cc) - OEFL(cc) - AVXI1(cc) - SSEI1(cc) ) / N
- Calculate the time for the SSE-to-AVX transition: T-SSE2AVX(cc) = ( TTT2(cc) - OEFL(cc) - AVXI2(cc) - SSEI2(cc) ) / N

It is actually a good idea to combine Christian's and Sergey's proposals, and their results should be consistent. If the results are not consistent, then there is a problem and additional investigation will be needed.

Note: I forgot to include TTT1(cc) and TTT2(cc)

>>- Select two instructions to reproduce AVX-to-SSE transition ( AVXI1 - AVX instruction 1 / SSEI1 - SSE instruction 1 )
>>- Select two instructions to reproduce SSE-to-AVX transition ( SSEI2 - SSE instruction 2 / AVXI2 - AVX instruction 2 )

Christian, could you select just two instructions, one for AVX and another for SSE, instead of four? I think it will simplify the tests. Thanks in advance.

>>>Measure overhead of an empty for loop ( OEFL(cc) - Overhead of Empty For Loop )>>>

Execution of the for-loop statements will be performed in parallel with the execution of the code contained inside the loop body. This will be done on Port 0 and/or Port 1.

Sergey, your idea looks quite good.

What do you think of a simple logical AND for SSE and AVX? Both should have a latency of 1 and a throughput of 1 on Sandy Bridge and Ivy Bridge. I think this is quite a good basis.

There is one thing: if we execute something in a loop, we always get both transitions, because if we do an SSE and then an AVX instruction in a loop, the next iteration will cause the opposite transition. So we might use a vzeroall to avoid this.

>>> Execution of the for-loop statements will be performed in parallel with the execution of the code contained inside the loop body. This will be done on Port 0 and/or Port 1.

This would make it hard to measure an empty for loop. But what about Intel SDE? You can let it analyze some statements regarding throughput and latency. Maybe this information might help us.

>>...There is one thing: if we execute something in a loop, we always get both transitions, because if we do an SSE and then
>>an AVX instruction in a loop, the next iteration will cause the opposite transition. So we might use a vzeroall to avoid this...

Yes, that is correct and I missed it.

>>...This would make it hard to measure an empty for loop...

I did a couple of tests in the past; take a look:
...
// Sub-Test 6.1 - Overhead of Empty For Statement
{
    CrtPrintf( RTU("Sub-Test 6.1 - [ Empty For Statement ]\n") );

    g_uiTicksStart = SysGetTickCount();
    for( RTint t = 0; t < 10000000; t++ )
    {
        ;
    }
    CrtPrintf( RTU("Sub-Test 6.1 - 10,000,000 iterations - %4ld ticks\n"),
               ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
}
// Sub-Test 6.2 - Overhead of Empty For Statement
{
    CrtPrintf( RTU("Sub-Test 6.2 - [ Empty For Statement ]\n") );

    RTclock_t ctClock1 = 0;
    RTclock_t ctClock2 = 0;

    ctClock1 = ( RTclock_t )CrtClock();
    for( RTint t = 0; t < 1000000; t++ )
    {
        ;
    }
    ctClock2 = ( RTclock_t )CrtClock();
    CrtPrintf( RTU("Sub-Test 6.2 - 1,000,000 iterations - %4ld clock cycles\n"),
               ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 1000000 ) );
}
...
and I'll run these tests on Ivy Bridge with a higher number of iterations.

Notes:
CrtPrintf = _tprintf
RTU = _T
SysGetTickCount = GetTickCount
RTclock_t = clock_t
CrtClock = __rdtsc

Christian, let's wait a couple of days for input from the community or Intel software engineers. Let's start in the middle of next week.

>>>I did a couple of tests in the past; take a look:>>>

I think that the loop overhead will have some influence on the execution speed only when floating-point values are used as loop control variables. When you use integer values as loop control variables, a modern processor will exploit instruction-level parallelism and execute both types of instructions in parallel.

>>...There is one thing: If we execute something in a loop we always get both transitions...

Christian, In that case our generic equations need to be changed to:

>>>>...
>>>>- Calculate time for AVX-to-SSE transition: T-AVX2SSE(cc) = ( TTT1(cc) - OEFL(cc) - AVXI1(cc) - SSE1(cc) ) / ( N * 2 )
>>>>- Calculate time for SSE-to-AVX transition: T-SSE2AVX(cc) = ( TTT2(cc) - OEFL(cc) - AVXI2(cc) - SSE2(cc) ) / ( N * 2 )
>>>>...

However, these equations are too generic.

We don't know whether the time for an AVX-to-SSE transition will equal the time for an SSE-to-AVX transition for ALL possible combinations of AVX and SSE instructions which cause transitions.

So, the vzeroall approach at the end of the for loop looks good, and I would stick with the generic case, assuming that some variation in transition time is possible but that we're not going to prove or disprove it. That could be a different R&D project...

Have you selected the AVX and SSE instructions you're most interested in?

>>>This would make it hard to measure an empty for loop>>>

IIRC Sergey demonstrated it in the "optimization of sine function" thread. Unfortunately the results were lost because of the forum transition.

P.S.

A very interesting test case; unfortunately I cannot contribute because of my old CPU. I will follow the tests.

Sergey,

I know it is only an assumption. But AVX to SSE means the CPU has to store something, and SSE to AVX means loading the register information. In my mind, to a first approximation, both should take nearly the same time. Maybe storing is faster if the data is written to cache, while for loading the data is already in memory or another cache.

Do you think the exact instruction has an influence? I suppose they always store and restore the same amount of data.

To start, we should select ADD or MUL instructions. They are commonly used and provide good performance.

// EDIT: Maybe you are right and there is a dependency. The article mentions 60-80 cycles. Either this is the uncertainty of the measurements, or there is some factor that has an influence on it.

>>...Do you think the exact instruction has an influence?..

I don't know. Even if there is a finite number of instructions, verifying the different combinations would be a time-consuming ( wasting? ) process.

>>... The article mentions 60-80 cycles...

I was also surprised to see that range, because a difference of 20 clock cycles looks like too much.

Yes, unfortunately this is something the manuals do not make precise statements about. You only read: please avoid these transitions, and here is how to do it.

I think it is quite hard to create real tests. Even if we separate the measurements for the two transitions, we still have the influence of the loop, and measuring the for-loop overhead is quite hard. If you do not do anything inside it, the compiler optimizes it away.

Do you have an idea? Can one read something like a CPU cycle count? This might be a solution if possible.

>>...Even if we separate the measurements for the two transitions, we still have the influence of the loop, and measuring the
>>for-loop overhead is quite hard. If you do not do anything inside it, the compiler optimizes it away...

I've measured the overhead of an empty for loop some time ago. Also, that is why the 2nd line in the specs is:
...
- Disable all optimizations in the Release and Debug configurations of the test application
...

>>...Can one read something like CPU cycle count?

Yes, the RDTSC instruction needs to be used ( or the unsigned __int64 __rdtsc(void) intrinsic declared in intrin.h ).

>>> we still have the influence of the loop, as measuring the for-loop overhead is quite hard. If you do not do anything inside it, the compiler optimizes it away.>>>

It will be executed in parallel with the SSE/AVX uop stream on Port 0 and Port 1. Only when an unrelated integer instruction is somehow scheduled for execution in between the for-loop uop stream could there be some overhead.

Quote:

Sergey Kostrov wrote:

>>...Can one read something like CPU cycle count?

Yes, the RDTSC instruction needs to be used ( or the unsigned __int64 __rdtsc(void) intrinsic declared in intrin.h ).

The built-in compiler intrinsic __rdtsc is available in the Microsoft and Intel C++ compilers, but not in gcc, where you would need to write out inline asm according to your choice of 32- or 64-bit mode. The Intel MKL library includes a dsecnd function based on the time stamp counter, with built-in translation to seconds. Both __rdtsc and dsecnd require the programmer to select the same thread for each use of the function, to take care of situations where the counter is not synchronized among CPUs at hardware reset (as it is on most motherboards with Intel CPUs). On Intel CPUs since Woodcrest, the counter actually counts bus clock cycles multiplied by the nominal CPU clock multiplier (independent of power settings or overclocking).

Christian, I don't know if you do any Linux programming with a GCC compiler...

>>... where you would need to write out inline asm according to your choice of 32- or 64-bit mode...

That is correct and the following code does the job:

inline uint64 GetClock( void )
{
    uint64 uiValue;
    __asm__ volatile ( "rdtsc" : "=A" ( uiValue ) );
    return uiValue;
}

Quote:

Sergey Kostrov wrote:

Christian, I don't know if you do any Linux programming with a GCC compiler...

>>... where you would need to write out inline asm according to your choice of 32- or 64-bit mode...

That is correct and the following code does the job:

inline uint64 GetClock( void )
{
    uint64 uiValue;
    __asm__ volatile ( "rdtsc" : "=A" ( uiValue ) );
    return uiValue;
}

That code is for 32-bit mode, with gcc or icc. The method to return the 64-bit result as a normal uint64_t in 64-bit mode is different, e.g.

   unsigned int _hi,_lo;
   asm volatile("rdtsc":"=a"(_lo),"=d"(_hi));
   return ((unsigned long long int)_hi << 32) | _lo;

For Windows historians I also have the X64 code from before the implementation of the __rdtsc built-in.

Thanks, Tim! I'll need to review that piece of code ( used currently on a project... ) in the case of 64-bit applications...

>>...For Windows historians I have also the X64 code from before the implemention of __rdtsc built-in.

That would be nice to see and please post it. Thanks in advance!
