Intel® Xeon Phi™ x200 series (KNL) Ring 3 Monitor/MWait

We are glad to announce a model specific feature of the Intel® Xeon Phi™ x200 series (formerly known as Knights Landing (KNL)) which allows the MONITOR and MWAIT instructions to be executed in user code at ring 3, as well as in kernel mode.

A description of how to enable this feature is available here. This information will be incorporated in a future edition of the Intel® 64 and IA-32 Architectures Software Developer Manuals.


Do these instructions cause the core to transition into and out of Core C1 state when executed in user mode?

Any hints on performance?  

Thoughts:

  • Traditionally Monitor/MWait have existed to support transitions in and out of Core C-states.  Since these transitions are not particularly fast, the Monitor/MWait functionality did not need to be fast either. 
  • Of course the definition of the Monitor/MWait instructions does not require that they do anything specific with regard to C states.  The instructions could certainly be used to implement a reduced-power C0 active wait state that avoids the need to understand the best way to implement reduced-power spin-waits on different processor implementations. 
  • By abstracting the semantics from the implementation, one could imagine such an approach could be used to interface with a hardware communication infrastructure able to provide both lower power consumption and lower latency than the traditional spin-wait approaches.

 

"Dr. Bandwidth"

Do these instructions cause the core to transition into and out of Core C1 state when executed in user mode?

They behave as described in the PRM, so, yes, they hint to the core that it should move to the requested C state. Of course, if the core has multiple HW threads executing it cannot move to a lower power state until all HW threads are idle.

The benefit to be expected is not wakeup latency, but the fact that you have told the core that this HW thread is now idle, which allows it to re-allocate OoO resources to other HW threads that still have (you hope) useful work to do. So even though your last-in to last-out barrier time is not improved (or even gets worse), the overall application performance can improve because stragglers get to run faster while they're holding everyone else up.

The other stuff sounds like things for discussion with beer in hand in Salt Lake :-)

It should be obvious that I am writing to bit 0 instead of bit 1 because I am old and tainted by my FORTRAN past.

I will try again with the correct bit.
 

"Dr. Bandwidth"

It might also be a good idea to add an extra warning to that page about the difference between bit 0 and bit 1 and the possible side effects of setting reserved bits. I don't think I am the only person who has trouble seeing "bit 1" and incorrectly translating it to "the first bit". (Because of the importance of "first" as a human mental concept, this seems to happen a lot more with "1" than with any other index.... At least that is my current excuse....)

"Dr. Bandwidth"

I also fell into that trap, but an architect corrected me before I had actually run the command :-).

Since the post includes an example showing exactly how to set the correct bit using wrmsr (with the value 2), and since this is really intended as "Here's what's coming in the next PRM" (so that we can start promoting kernel patches), I don't see the need to emphasize the point further.

 

James,

Do you know if Intel will incorporate this feature on KNL versions of OpenMP, TBB, etc...? IOW for use in standard barriers.

!$OMP BARRIER
!$OMP TASKWAIT
!$OMP END TEAMS ...
etc...

Jim Dempsey

Do you know if Intel will incorporate this feature on KNL versions of OpenMP, TBB, etc...? IOW for use in standard barriers.

Jim, I cannot comment on what will be in future software releases. However, as a general point, it's reasonable to notice that we optimize our software to use all the features of our hardware. 

Folks, Linux kernel patches to implement sensible support for KNL R3 monitor/mwait have recently been accepted into the mainline kernel. Once I've written it, I'll be posting another blog here to show how to use the (now standard) Linux interfaces to detect the feature from user code.

If you want to play with them now, the patches are

p.s. John, with these you won't need to set explicit bits in MSRs :-)

Hi, thanks for the post. Putting them into user space looks appealing, but I have 2 questions about these instructions.

(1) What would happen if multiple cores try to write to the same monitored memory address simultaneously?

I've seen a few works optimizing scalable locks using these two instructions, but most of them only let the lock holder modify the monitored memory location, which makes me curious about the concurrency of monitor/mwait. Suppose thread-A on core 0 and thread-B on core 1 both try to access a location monitored by thread-C on core 2; will thread-C receive 2 events (perhaps by mwait twice)? Will thread-C suffer from a non-repeatable read of the lock structure?

(2) Is there any real usage example, e.g., in Intel's own products/infrastructure, to illustrate the purpose of putting these 2 instructions into user space? I understand it is more convenient for users as they don't have to modify the kernel (I like it), and the strategies can be more flexible because kernel participation is avoided. But since it will lead the core into an optimized state, the syscall/sysret cost seems not very important compared with the wait time. What extra benefit do users get, for example, in thread synchronization?

Thanks

Actually I have one more little question. All threads share a common kernel address space, so in the kernel we can monitor a specific memory address. But in user space, different processes have different address spaces, so is monitor/mwait only useful for single-process, multi-threaded applications?

Separate processes can have shared memory areas as well.

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

Separate processes can have shared memory areas as well.

Jim Dempsey

Yes, thanks for pointing that out; I missed it, and it occurred to me after I posted. My main concern is still about concurrency, and I think I may need to write some simple kernel modules to explore the monitor/mwait behavior. But if I really need some kind of lock or atomic operation to deal with concurrent writes, then it seems some part of the low-latency advantage is lost in certain situations...

MONITOR/MWAIT was intended as a power saving technique, not as a low-latency means. Any low-latency effects are a side effect of the implementation. See #2 by McCalpin

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

MONITOR/MWAIT was intended as a power saving technique, not as a low-latency means. Any low-latency effects are a side effect of the implementation. See #2 by McCalpin

Jim Dempsey

I see, but does it purely save some power or does it improve other HW threads' performance at the same time? According to Cownie #3,

you have told the core that this HW thread is now idle, which allows it to re-allocate OoO resources to other HW threads that still have (you hope) useful work to do. So even though your last-in to last-out barrier time is not improved (or even gets worse), the overall application performance can improve because stragglers get to run faster while they're holding everyone else up.

However, I did a quick test with the latest stress-ng, and the performance did not improve (I created a new topic at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/783757).

Am I missing something, or am I using these instructions in the wrong way?

Thanks.

Quote:

Zihan Y. wrote:

(1) What would happen if multiple cores try to write to the same monitored memory address simultaneously?

I've seen a few works optimizing scalable locks using these two instructions, but most of them only let the lock holder modify the monitored memory location, which makes me curious about the concurrency of monitor/mwait. Suppose thread-A in core 0 and thread-B in core 1 both try to access a location monitored by thread-C on core 2, will thread-C receive 2 events (perhaps by mwait twice)? Will thread-C suffer from a non-repeatable read about the lock structure?

You are ascribing more semantics to the combination of monitor/mwait than they have. They do not tell you anything more than "Now might be a good time to check whatever condition you are waiting on". The mwait instruction can complete whenever the implementation finds it convenient; completion of the mwait does not guarantee that the monitored line has changed. The only guarantee it gives you is that it will complete when the line is written. So the thread will continue to execute soon after that, but it may continue to execute at any arbitrary time even though the line was not written. Thus the information that the mwait completed is only a hint that something interesting might have happened.

The monitor/mwait instructions can certainly not be used for synchronization on their own. Hence your questions about what memory states a thread sees are moot. It can see any of the same states it could have seen if it was polling without using monitor/mwait.

Quote:

Zihan Y. wrote:

(2) Is there any real usage example, e.g., in Intel's own products/infrastructure, to illustrate the purpose of putting these 2 instructions into user space? I understand it is more convenient for users as they don't have to modify the kernel (I like it), and the strategies can be more flexible because kernel participation is avoided. But since it will lead the core into an optimized state, the syscall/sysret cost seems not very important compared with the wait time. What extra benefit do users get, for example, in thread synchronization?

You can see them being used in the LLVM OpenMP runtime... http://openmp.llvm.org 

p.s. I should also have pointed you at the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference (Ref. #319433-034) which describes the "UMONITOR—User Level Set Up Monitor Address", "UMWAIT—User Level Monitor Wait" and "TPAUSE—Timed PAUSE" instructions which will appear in some future Intel processor.

Quote:

Cownie, James H (Intel) wrote:

You are ascribing more semantics to the combination of monitor/mwait than they have. They do not tell you anything more than "Now might be a good time to check whatever condition you are waiting on". The mwait instruction can complete whenever the implementation finds it convenient; completion of the mwait does not guarantee that the monitored line has changed. The only guarantee it gives you is that it will complete when the line is written. So the thread will continue to execute soon after that, but it may continue to execute at any arbitrary time even though the line was not written. Thus the information that the mwait completed is only a hint that something interesting might have happened.

The monitor/mwait instructions can certainly not be used for synchronization on their own. Hence your questions about what memory states a thread sees are moot. It can see any of the same states it could have seen if it was polling without using monitor/mwait.

I see, thanks for clarifying.

Quote:

Cownie, James H (Intel) wrote:

You can see them being used in the LLVM OpenMP runtime... http://openmp.llvm.org

Thanks for the link, but do you mean they use it in ring 3 on KNL, or the general monitor/mwait in kernel space? I just cloned the source, but I find that it only checks for the presence of monitor/mwait; I don't see the mwait instruction used anywhere.

Quote:

Cownie, James H (Intel) wrote:

p.s. I should also have pointed you at the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference (Ref. #319433-034) which describes the "UMONITOR—User Level Set Up Monitor Address", "UMWAIT—User Level Monitor Wait" and "TPAUSE—Timed PAUSE" instructions which will appear in some future Intel processor.

Yes, that is what I'm expecting in Snow Ridge, but it seems they won't be available soon, so I'll just explore the behavior of monitor/mwait in kernel space for now; I don't have a KNL at hand. I guess they are almost the same in terms of functionality, just at a different privilege level.

Thanks for the link, but do you mean they use it in ring 3 on KNL, or the general monitor/mwait in kernel space? I just cloned the source, but I find that it only checks for the presence of monitor/mwait; I don't see the mwait instruction used anywhere.

Sorry, my mistake. You are right, the R3 monitor support code clearly never got pushed out into the LLVM runtime. (I think the issue was that we wrote it before the OS patches existed and didn't want to push it until they did, then it got forgotten, and with the "real" umonitor/umwait/tpause coming there seems no point in pushing out KNL-specific code that won't exactly carry forward.)

Trivial code looks like this, though

static void mwaitNonZero(uint32_t volatile * target)
{
    while (*target == 0)
    {
        _mm_monitor ((void *)target, 0, 0);
        if (*target != 0)   // Avoid race if target updated after the while test but before the monitor starts.
            break;
        _mm_mwait (0, 0);
    }
}

 

Quote:

Cownie, James H (Intel) wrote:

Sorry, my mistake. You are right, the R3 monitor support code clearly never got pushed out into the LLVM runtime. (I think the issue was that we wrote it before the OS patches existed and didn't want to push it until they did, then it got forgotten, and with the "real" umonitor/umwait/tpause coming there seems no point in pushing out KNL-specific code that won't exactly carry forward.)

Trivial code looks like this, though

static void mwaitNonZero(uint32_t volatile * target)
{
    while (*target == 0)
    {
        _mm_monitor ((void *)target, 0, 0);
        if (*target != 0)   // Avoid race if target updated after the while test but before the monitor starts.
            break;
        _mm_mwait (0, 0);
    }
}

I see, then I'll wait for the release of new processors. Thanks for the code piece.
