-opt-streaming-stores

Is there an attribute I can attach to a pointer so that any store through that pointer will be a streaming store where possible (i.e. MOVNTQ, MOVNTPS, or VMOVNTPS)? If not, can that be added? It would be nice to control cache pollution on a fine-grained basis. It seems like you already have all the machinery for this in your compiler, given the -opt-streaming-stores flag. I just need a way to select the functionality at the pointer level.

-Jeff

Did you look into #pragma nontemporal (ptr) which appears to do what you ask (one for loop at a time)? I've run into cases where this pragma was ignored with certain architecture options; if that's your problem, you may have to choose an older architecture for the function in question, and submit an issue asking on premier.intel.com whether it might be fixed for the architecture of your choice. I don't know of any reason why this pragma shouldn't apply to generation of vmovntps, although I haven't seen that case work.
The compiler has to be able to apply an alignment adjustment for the stream which is to be nontemporal; if there is only one stream stored per loop that would be the normal action. That stream has to be one which is used only for stores (no other operations); you're probably aware of that.
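As a rough illustration of the loop-level approach described above, here is a minimal sketch. It assumes the Intel compiler's `#pragma vector nontemporal` spelling, guarded so that other compilers just build an ordinary loop; the function name `fill_scaled` and its parameters are hypothetical. The pragma is only a hint, so the generated code should be checked in the assembly output.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: dst is written only (never read) inside the loop,
// which is the condition described above for a nontemporal stream.
void fill_scaled(double* __restrict dst, const double* __restrict src,
                 std::size_t n, double scale)
{
    assert(dst != nullptr && src != nullptr);
#if defined(__INTEL_COMPILER)
    // Assumption: Intel classic compiler syntax. This is a request, not a
    // guarantee; the compiler may still emit ordinary stores.
#pragma vector nontemporal (dst)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = scale * src[i];  // the only store stream in the loop
    }
}
```

Note that the annotation sits on the loop, so it only helps when the loop and the pointer are visible in the same scope.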

Quoting TimP (Intel)

Hi Tim,

I had seen the nontemporal pragma, but I thought it could only be applied to a loop. The problem is that I am using a functor, a la TBB, and the functor only contains one iteration worth of work. The actual looping is done elsewhere, where there is no scope to annotate the pointer.

An advantage of allowing this as an attribute is that I could (in theory) do something like this:

typedef double * __restrict__ __attribute__((aligned(32))) __attribute__((streaming_store)) SSptr_t;

Thanks,
-Jeff

In order to use non-temporal stores, the data must be packed into 128- or 256-bit (for AVX-256) bundles by the application. This can be done only by auto-vectorization with pragma nontemporal, or by using the intrinsics or asm explicitly. You could save your results explicitly in a buffer just big enough to cause optimized memcpy() to shift into nontemporal, and then push them out by memcpy() (using one of the optimized libraries). You could submit a premier issue to ask the compiler team about it, but I think these are the only alternatives feasible to implement on the platform.
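A minimal sketch of the explicit-intrinsics route mentioned above, assuming SSE2 and a 16-byte-aligned destination. The helper name `stream_scale` is hypothetical; `_mm_stream_pd` and `_mm_sfence` are the standard SSE2 intrinsics. On targets without SSE2 the sketch falls back to ordinary stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_set1_pd, _mm_loadu_pd, _mm_mul_pd, _mm_stream_pd
#endif

// Hypothetical helper: dst[i] = scale * src[i], written with 128-bit
// nontemporal stores. dst must be 16-byte aligned and n must be even.
void stream_scale(double* dst, const double* src, std::size_t n, double scale)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(n % 2 == 0);
#if defined(__SSE2__)
    const __m128d s = _mm_set1_pd(scale);
    for (std::size_t i = 0; i < n; i += 2) {
        // Pack two doubles into one 128-bit bundle, then store it
        // with a nontemporal hint (MOVNTPD).
        const __m128d v = _mm_mul_pd(_mm_loadu_pd(src + i), s);
        _mm_stream_pd(dst + i, v);
    }
    _mm_sfence();  // order the streaming stores before later stores
#else
    // Fallback without SSE2: ordinary stores, same result.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = scale * src[i];
    }
#endif
}
```

The packing step is the part that a one-iteration functor cannot do by itself, which is why the staging-buffer idea above is suggested as an alternative.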

OK, thank you for the insight.

-Jeff

Hi Jeff,

I've created a feature request for you here to our code generator team. I'll update the thread when we have any updates on this. If you submit a Premier Support issue, make sure to reference this thread so that the engineer that takes the issue sees that a feature request has been submitted.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support: http://premier.intel.com

Software Product Support info: http://www.intel.com/software/support

Quoting Brandon Hewitt (Intel)

Hi, I've submitted a Premier request for the typedef form shown in #2 above. I talked face-to-face with someone on the Intel compiler vectorization team about this, and also with someone on the language team in Hillsboro. I'm looking forward to this being implemented, since it will allow me (and everyone else) to get performance out of the compiler without introducing the software maintenance issues associated with pragmas and align directives.

Quoting Brandon Hewitt (Intel)

Brandon, now that issue #672743 allows you to attach attributes to typedefs, would you still be willing to suggest that streaming stores be another type of supported attribute? Again, the nontemporal attribute needs to be attached to the data rather than applied via a pragma on the loop because, when using functors or lambdas, the loop construct can be declared elsewhere and may not know anything about the variables used in the loop body. There is no way to use a pragma to mark unknown variables, but attributes are perfect for this.

example:

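template <typename OP>
void IndexSet_forall(int begin, int end, OP& op)
{
    // The loop lives here, but the pointers being stored through
    // are hidden inside op, so a loop pragma cannot name them.
    for ( int ii = begin ; ii < end ; ++ii ) {
        op( ii );
    }
}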

-Jeff
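To make the constraint concrete, here is a hedged sketch of how such a loop template might be called. The functor `ScaleOp` and its members are hypothetical and not from the original post. The store target `out` is known only to the functor, while the loop body in `IndexSet_forall` sees nothing but `op(ii)`.

```cpp
#include <cassert>

// Generic loop: knows nothing about the data touched by op.
template <typename OP>
void IndexSet_forall(int begin, int end, OP& op)
{
    assert(begin <= end);
    for (int ii = begin; ii < end; ++ii) {
        op(ii);
    }
}

// Hypothetical functor holding one iteration's worth of work.
// The pointer `out` is where a per-pointer streaming-store
// attribute would have to be attached.
struct ScaleOp {
    double* out;
    const double* in;
    double scale;

    void operator()(int ii) const { out[ii] = scale * in[ii]; }
};
```

A call would look like `ScaleOp op{out, in, 3.0}; IndexSet_forall(0, n, op);`. Any loop-level pragma would have to be placed inside `IndexSet_forall`, where `out` is not in scope.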
