Guaranteed atomic operations on Xeon Phi

jimdempseyatthecove:

The IA-32 and Intel 64 (host) processors provide guaranteed atomic operations for loads and stores of:

a word aligned on a word (16-bit) boundary
a doubleword aligned on a doubleword (32-bit) boundary
P6 and later: aligned and unaligned word, dword, and qword accesses contained within a single cache line

What are the guaranteed atomic load and store operations on Xeon Phi?

The reason I ask this is that I am observing non-atomic stores of dword and qword (__int32 and __int64) values (within a cache line) where different threads are writing to different variables within the cache line. If I add inter-value padding so the variables span separate cache lines, the stores do not interfere; when the padding is removed, the stores interfere. I examined the disassembly to verify that GP-register-to-memory instructions are used (IOW, only mov of register to memory).

The code is store only, different threads writing to different locations. These are not Read/Modify/Write instructions.

Same code strategy works on host (Xeon E5).

Jim Dempsey
James Cownie (Intel):

The rules for Xeon Phi are supposed to be the same as for other x86 processors. Assuming that you are operating on naturally aligned memory, loads and stores should not tear when writing to cacheable memory.

If you are writing to memory that is shared over PCI and mapped non-cacheable, other rules may apply...

jimdempseyatthecove:

Hi Jim,

I know the rules are supposed to be at least similar to the P4; I am not sure about the P6. I could not find documentation for Xeon Phi similar to the Intel(R) 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, Section 8.1.1. The presumption is that it is the same. However, the cache system and the GDDR5 RAM may have subtle differences. The memory is local to the Xeon Phi.

A sketch of the code is

	int counts[nc];

	for(int i=0; i < nc; ++i) counts[i] = -1;

	#pragma omp parallel
	{
	  int myThread = omp_get_thread_num();

	  for(int myCount=0; myCount < loopCount; ++myCount) {

	    ... // work

	    counts[myThread] = myCount; // post progress

	  } // end inner loop

	} // end parallel


The actual code is more complex in that threads also observe the counts of adjacent threads. Any individual thread can get ahead by +1 count, but no more. I am assuming that the compiler naturally aligned the int counts array on the stack (they've done so as long as I can remember).
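A minimal sketch of the neighbor-watching described above, under my own assumptions (the names `wait_for_neighbors`, `left`, and `right` are illustrative, not from the actual code): a thread advances past iteration `myCount` only once its neighbors have posted at least `myCount`, so no thread gets more than one iteration ahead. `std::atomic` with acquire loads is used here just to make the sketch well-defined C++.

```cpp
#include <atomic>

// Illustrative sketch: spin until both adjacent threads have posted a
// progress count of at least myCount. 'counts' is the shared array of
// per-thread progress counters; 'left'/'right' index the neighbors.
void wait_for_neighbors(const std::atomic<int>* counts,
                        int left, int right, int myCount) {
    while (counts[left].load(std::memory_order_acquire)  < myCount ||
           counts[right].load(std::memory_order_acquire) < myCount)
        ;  // spin: neighbors have not yet reached myCount
}
```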

Using the above bungs up the adjacent counts; however, placing each count into a 64-byte struct and using an array of those structs, the counting progresses as expected (at least as far as I've tested).
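The 64-byte-struct work-around could be sketched like this (a hedged sketch; `PaddedCount` and `NC` are made-up names, and 64 bytes is assumed to be the cache line size on both host and Xeon Phi):

```cpp
#include <cstddef>

// Give each thread's counter its own 64-byte cache line so stores from
// different cores never land in the same line. alignas(64) forces both
// the alignment and the padded size of the struct to 64 bytes.
struct alignas(64) PaddedCount {
    volatile int count;   // per-thread progress counter
    // remaining 60 bytes are implicit padding
};

constexpr std::size_t NC = 8;   // assumed thread count for illustration
PaddedCount counts[NC];         // counts[i] and counts[i+1] never share a line
```

Each element now occupies exactly one cache line, so the adjacent-counter interference described above cannot occur by construction.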

An additional note: the bung-up is not always observed. In fact it is somewhat rare, showing up about once in every 10-20 runs of the program. The counted loop runs just over 6500 iterations, the error also appears to be inter-core, and the compute runtime for the loop is on the order of 8 seconds.

It could be a coding error on my part, it does happen from time to time. If I can make a simplified reproducer I will do so and send it in. For now, I will complete what I am doing using the work-around of the struct.

Jim Dempsey
James Cownie (Intel):

In your sketch there aren't any concurrent stores to the same data (just to the same cache-line), so there isn't really any atomicity of stores issue. (That applies in cases like *p = 0; in one thread and *p = -1; in another [where p is the same, of course], and guarantees that you see one or minus one, not 0xff00ff00, or some other similar mess).
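The guarantee being described could be illustrated like this (a sketch of my own, not code from the thread; `std::atomic<int64_t>` with relaxed ordering is used so the example is well-defined C++, and on x86 an aligned relaxed store compiles to the plain mov being discussed):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two threads store 0 and -1 to the SAME naturally aligned 64-bit
// location while we read it. The atomicity guarantee says the reader
// sees either 0 or -1 in full, never a torn mix of the two values.
std::atomic<int64_t> shared{0};

int64_t observe_race() {
    std::thread a([] { shared.store(0,  std::memory_order_relaxed); });
    std::thread b([] { shared.store(-1, std::memory_order_relaxed); });
    int64_t seen = shared.load(std::memory_order_relaxed);  // 0 or -1, never torn
    a.join();
    b.join();
    return seen;
}
```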

jimdempseyatthecove:

The atomicity issue does not arise from multiple threads storing into the same cell (an int in this case); rather, it comes from different threads storing into separate cells within the same cache line. In this situation, the conflicting writes to different subsections of a cache line must be properly merged (or those subsections invalidated) into the cache lines held in the other cores (if present there, or into RAM if not). The symptoms appear as if stale data is used for the merge. I am not privy to the internals of the memory/cache subsystem; my guess is that a byte mask is used for this purpose. I will see if I can make a simple reproducer.

Jim Dempsey
James Cownie (Intel):

The cache-coherency protocol does not work the way you think it does... What actually happens is that before a core can update a cache line it must obtain the line in the exclusive state, which means that no other cache can have a copy. There is therefore never any need to merge the data, and there are no masks anywhere.

In any case, you have kindly let me know privately that you found a bug, and the hardware is behaving correctly. So I'll close this thread.

John D. McCalpin:

It can be extremely difficult to understand ordering rules in current microprocessors, especially if your application code does not have a structure that looks like the examples that are used in the definitions of the ordering rules.

James Cownie's comment about how cache coherence works is correct in the most common cases, but there is actually one case in which byte masks are used -- when streaming stores are used and the buffer is flushed in a state in which not all elements have been written.   In this case the processor emits a different store instruction (or sequence of store instructions) so that the memory controller has the information needed to perform the merge of the updated bytes with the unmodified bytes from memory.  (The unmodified bytes are not available to the processor because the whole point of using a streaming store is to avoid needing to read the line from memory before writing to it.)    The protocol used for these stores is typically an undocumented internal bus protocol, but for AMD processors the transaction(s) are equivalent to the publicly documented SizedWriteByte command of the HyperTransport protocol.  This, not surprisingly, sends data and a byte mask for the memory controller to use in the merge operation.  Depending on the specific bytes that have been written either one or two SizedWriteByte transactions are required (handling a maximum of 32 Bytes each).   I don't have any specific information about how Intel processors perform this function, but given the historical notes (below) it seems very likely that the same approach is used.

I should note that this is not the desired mode of operation for streaming stores.  Normally you write entire lines and these lines are written to memory as full 64 Byte blocks.  The memory controller then simply needs to compute the ECC bits for the new data and overwrite the line in DRAM.  The more general merging functionality is needed for the beginning and end of transfers that are not integral numbers of aligned 64 byte blocks and it is needed for cases in which the streaming store buffers are flushed prematurely -- for example if the process gets interrupted in the middle of writing a block of streaming stores and the next process to run wants to use those partially filled buffers.

This mechanism is supported in large part because legacy IO devices sometimes need to be able to write to system memory. Since IO devices are not part of the coherence domain, they cannot participate in the standard mechanism (getting a copy of the line in "exclusive" state, modifying the target bytes, and writing the whole line back to memory), so some mechanism must exist to merge byte-granularity data. Later, processors were given the ability to generate these sorts of stores to support what is now known as "streaming stores" or "non-temporal stores" (or sometimes "non-globally-ordered stores"). These help the STREAM benchmark a lot more than they help anything else, but I suppose that could be said about a fair number of feature/benchmark pairings.
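The full-line "desired mode" described above can be sketched with the SSE non-temporal store intrinsics (my own illustrative code; the function name `fill_line_nt` is made up, and the pointer is assumed to be 64-byte aligned so the sixteen stores cover exactly one cache line):

```cpp
#include <immintrin.h>

// Write one entire 64-byte cache line with non-temporal stores, so the
// memory controller can overwrite the line in DRAM without first
// reading it (no read-for-ownership).
void fill_line_nt(int* line, int value) {
    for (int i = 0; i < 16; ++i)        // 16 x 4 bytes = one full 64-byte line
        _mm_stream_si32(line + i, value);
    _mm_sfence();  // NT stores are weakly ordered; fence before the data is reused
}
```

If the buffer holding these stores were flushed before all sixteen slots had been written, the partial line is exactly the case where the byte-mask merge described above would be needed.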


John D. McCalpin, PhD "Dr. Bandwidth"
jimdempseyatthecove:

Thanks John for the detailed response. As Cownie mentioned, I had replied to him off-line to inform him it was a coding error in my program where an assumed requirement of one part of the code was not followed by a different part of the code. The symptoms produced were as if the atomic stores were not faithfully performed. This was totally my fault. This morning, about 4:00AM, I awoke from sleep with the answer to my problem.

The code is performing a plesiochronous barrier technique that attains an additional 20% boost in performance over prior optimizations performed in the article. The code is now 44% faster than an optimized example published in Intel(R) Xeon Phi(TM) Coprocessor High-Performance Programming by Jim Jeffers and James Reinders, Chapter 4. I suspect some additional performance can be attained. This is a great book and a marvelous resource, all MIC programmers should have a copy.

If you want, I can send you a draft of an article I will be posting on IDZ. The last optimization technique is the one that used to exhibit the symptom, but it has since been corrected. I can send you the code too, if you want to comment, critique, and/or tweak it.

Jim Dempsey
