Word tearing on Intel Processors

Recently I learned about word tearing in threaded programs and that it can occur on Alpha processors. But what about Intel chips?

Is there a guarantee that word tearing cannot occur on Pentium SMP architectures?

I tried to force the error on my dual-processor Pentium III machine, but it did not occur with this little test program:
http://www.pontohonk.de/example/threadtest.C

Can anybody point me to appropriate documentation about this problem on Intel processors?

I'm no expert but it seems that a guarantee against word tearing for a specific processor would be difficult if not impossible to issue. It seems that it would not only depend on the processor but also the compiler. Here's a quote from an article I found on a Google search:

"...given a multithreaded program that has been compiled to have longword actual granularity, if any two of the program's threads can concurrently update different bytes or words in the same longword, then that program is, in theory, at risk for encountering a word-tearing race condition. However, in practice, language-defined restrictions on the alignments of data may limit the actual number of candidates for a word-tearing scenario..."

There's some good background information where I found this if others are as unfamiliar with this condition as I am.

I'll keep looking around to see if there is an architectural reason this can't occur on Intel processors.
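
The scenario in the quoted article (two threads updating different bytes of the same longword) can be sketched as follows. This is only an illustration under stated assumptions: it uses C++11 std::thread, which is newer than the compilers discussed here, and Flags and update_neighbours are hypothetical names. Under the C++11 memory model the two bytes are separate memory locations, so a conforming compiler must not let one thread's store overwrite the other's.

```cpp
#include <cassert>
#include <thread>

// Two adjacent one-byte fields that share one 32-bit longword.
struct Flags {
    unsigned char a;
    unsigned char b;
};

// Hypothetical helper: each thread writes only its own byte.
// With longword granularity (load 32 bits, change one byte, store 32 bits)
// one thread's store could silently undo the other's. With byte
// granularity both final values survive.
Flags update_neighbours(int iterations)
{
    assert(iterations > 0);
    Flags f = {0, 0};
    std::thread t1([&f, iterations] {
        for (int i = 0; i < iterations; ++i)
            f.a = static_cast<unsigned char>(i);
    });
    std::thread t2([&f, iterations] {
        for (int i = 0; i < iterations; ++i)
            f.b = static_cast<unsigned char>(i + 1);
    });
    t1.join();
    t2.join();
    return f;
}
```

After update_neighbours(1000), f.a holds the low byte of 999 and f.b the low byte of 1000. A mismatch would point at the granularity problem the article describes.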

For anyone unfamiliar with word tearing, here's a good explanation taken from David R. Butenhof's book, Programming with POSIX Threads:

If a variable crosses the boundary between memory units, which can happen if the machine supports unaligned memory access, the computer may have to send the data in two bus transactions. An unaligned 32-bit value, for example, may be sent by writing the two adjacent 32-bit memory units. If either memory unit involved in the transaction is simultaneously written from another processor, half of the value may be lost. This is called "word tearing."

My guess is that if the variable crosses a boundary between two cache lines, there is a higher probability that the problem occurs in practice.

|line A  |line B  |  32 byte lines (PIII)
|--------|--------|  
|       *|***     |  **** : 4-byte quantity

If CPU#1 issues a snoop read request for line A before CPU#2 has updated the variable, and a request for line B right after the update, word tearing will probably occur.

You should be able to force the problem by allocating with _mm_malloc(64,32) and adding a 31-byte offset (the value then starts 1 byte before the end of the first line).
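
To check that a chosen offset really puts the 4-byte value across a line boundary, a small helper can do the arithmetic. This is a sketch only: crosses_line, slot_address and g_buffer are hypothetical names, a statically aligned buffer stands in for _mm_malloc, and the code assumes that converting a pointer to std::uintptr_t preserves alignment (true on mainstream platforms).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the `size` bytes starting at `addr` span two cache lines.
// `line` is the cache-line size in bytes and must be a power of two.
bool crosses_line(std::uintptr_t addr, std::size_t size, std::size_t line)
{
    assert(size > 0 && line > 0 && (line & (line - 1)) == 0);
    std::uintptr_t first = addr / line;              // line of the first byte
    std::uintptr_t last = (addr + size - 1) / line;  // line of the last byte
    return first != last;
}

// Line-aligned scratch buffer (stand-in for an _mm_malloc allocation).
alignas(64) static unsigned char g_buffer[128];

// Address of a 4-byte slot placed `offset` bytes into the buffer.
std::uintptr_t slot_address(std::size_t offset)
{
    assert(offset + 4 <= sizeof g_buffer);
    return reinterpret_cast<std::uintptr_t>(g_buffer) + offset;
}
```

With 32-byte lines (PIII), offset 31 splits the value 1 + 3 bytes, as in the diagram above. With 64-byte lines the same offset stays inside one line, and offset 63 would be needed instead.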

The best way to avoid such nasty cases with 100% confidence is to use synchronization for all shared variables. As you probably know, critical sections have low overhead under WIN32 and are well suited to implementing simple guards.

Using a critical section is a good solution for performance, but not using unaligned access is even better for performance and allows not using a CS for 4-byte quantities.

> performance, but not using unaligned access even
> better for performance and allows not using cs for
> 4-byte quantities.

Well, not really for the code snippet of the original poster of this (forum) thread: operator ++ and operators += and -= are used, which typically lead to more than one instruction. In any case, no indivisible (atomic) operation can be guaranteed with these.

> operators += -= are used : typically lead to more
> than one instruction, in any case no cycle

Looks like I was plain wrong on this one: I compiled part of the code with icl 7.1, and one instruction per operator is used:


// Part of the original poster's test code: each iteration nets +1 on the shared counter.
void *work( void *arg, int count)
{
   int *pointer = static_cast<int *>(arg);
   volatile int &counter = *pointer;

   for (int i = 0; i < count; ++i)
   {
      ++counter;
      counter += 5;
      counter -= 4;
      counter -= 1;
   }

   return 0;
}

Part of the generated assembly:


  add       DWORD PTR [ebp], 1                            ;1231.9
  add       DWORD PTR [ebp], 5                            ;1232.7
  add       DWORD PTR [ebp], -4                           ;1233.7
  add       DWORD PTR [ebp], -1   


So, with properly aligned data, can we say for sure that it is safe practice to use this style of code without any critical section?

No! A single instruction is only atomic with respect to interrupts on that processor. A read-modify-write memory operation does not prevent another CPU from reading the data during the modify.
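
One way to make these read-modify-write operations safe across CPUs without a critical section is an atomic type. The following is a minimal sketch, assuming C++11 std::atomic and std::thread (both newer than the compilers discussed in this thread); count_atomically is a hypothetical helper that mirrors the loop body of the test code. On x86, compilers typically implement these operations with LOCK-prefixed instructions, so no other CPU can touch the location between the read and the write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: runs the test loop in `threads` threads on one
// shared counter and returns the final value.
int count_atomically(int threads, int iterations)
{
    assert(threads > 0 && iterations >= 0);
    std::atomic<int> counter(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                ++counter;     // each operation is an atomic read-modify-write
                counter += 5;
                counter -= 4;
                counter -= 1;  // net effect per iteration: +1
            }
        });
    }
    for (std::thread &th : pool)
        th.join();
    return counter.load();
}
```

Because every iteration nets +1, the result must equal threads * iterations. With a plain volatile int instead of std::atomic<int>, updates can be lost on a multiprocessor machine even though each operator compiles to a single instruction.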

> No! A single instruction is only atomic with respect
> to interrupts on that processor. A read-modify-write
> memory operation does not prevent another CPU from
> reading the data during the modify.

Yes, that was my understanding so far, before reading kdmitry's comment:

"(...) but not using unaligned access even better for performance and allows not using cs for 4-byte quantities. "

After all, I guess his statement is a wrong assumption that works most of the time but can lead to very nasty bugs.

Here are some extracts from the IA-32 Intel Architecture Software Developer's Manual Volume 3 : System Programming Guide

[start of quote]
7.1.1 Guaranteed Atomic Operations

The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors guarantee that the following basic memory operations will always be carried out atomically:

. Reading or writing a byte

. Reading or writing a word aligned on a 16-bit boundary

. Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium 4, Intel Xeon, and P6 family, and Pentium processors guarantee that the following additional memory operations will always be carried out atomically:

. Reading or writing a quadword aligned on a 64-bit boundary.

. 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:

. Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a 32-byte cache line.

Accesses to cacheable memory that are split across bus widths, cache lines and page boundaries are not guaranteed to be atomic by the Pentium 4, Intel Xeon, P6 family, Pentium and Intel486 processors. The Pentium 4, Intel Xeon and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
[end of quote]
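
The rules in this extract can be restated as a small classifier, which may help when reasoning about a particular address. This is a simplified sketch under stated assumptions: classify_access and the Atomicity enum are hypothetical names, the memory is assumed to be cached, the line size is fixed at the 32 bytes named in the quote, the quadword case assumes a Pentium or later, and the uncached and bus-width cases are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Atomicity {
    Guaranteed,      // byte, or naturally aligned word/doubleword/quadword
    P6FamilyCached,  // unaligned, but inside one 32-byte cache line
    NotGuaranteed    // split across two cache lines
};

// Classifies a plain read or write of `size` bytes at `addr`
// according to the quoted section 7.1.1 (simplified, see assumptions).
Atomicity classify_access(std::uintptr_t addr, std::size_t size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    const std::size_t line = 32;  // P6 cache-line size from the quote
    if (addr % size == 0)
        return Atomicity::Guaranteed;
    if (addr / line == (addr + size - 1) / line)
        return Atomicity::P6FamilyCached;
    return Atomicity::NotGuaranteed;
}
```

For example, a doubleword at 0x1000 is guaranteed, the same doubleword at 0x1001 is covered only by the P6 cached-memory rule, and at 0x101F it straddles two lines and is not guaranteed at all. Note that this covers plain reads and writes only, not read-modify-write instructions.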

I interpret this as meaning that on Pentium processors with reputable chipset etc, all simple reads and writes up to 64-bits are atomic (user programs work on cached memory).

But then in section 7.1.2.2 SOFTWARE CONTROLLED BUS LOCKING of the same document:

[start of quote]
Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access.
[end of quote]

So is it safer for unlocked operations than for bus-locked operations?

By the way, the cited document (excellent and downloadable free of charge from the Intel website) seems to change without any version control, so the sections may have moved.

Hope this helps
Dar

The documents you quote say nothing about "reputable chipset;" they give reasons why you must take care that your data are properly aligned (by your program). Alignment of 64-bit data isn't necessarily automatic on Windows. malloc() pointers must be adjusted, or the 128-bit aligned _mm_malloc() may be employed. I submitted an Intel Premier problem report on ICL 7.1 generation of mis-aligned local arrays of doubles; it was rejected, on the grounds that early Microsoft compilers don't necessarily align doubles. Microsoft compilers continue to generate 64-bit moves using pairs of 32-bit operations, in order to allow for mis-alignment. Other compilers, such as gcc, are clearly on the side of type-appropriate alignment.
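
Adjusting a malloc() pointer by hand could look like the sketch below. It is an illustration rather than a recommendation: AlignedDoubles and allocate_doubles are hypothetical names, and it relies on std::align from the C++11 <memory> header instead of _mm_malloc(). The block is over-allocated by 7 bytes so the pointer can always be moved forward to an 8-byte boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

static_assert(sizeof(double) == 8, "sketch assumes 64-bit doubles");

struct AlignedDoubles {
    void *raw;     // original malloc() pointer, pass this one to std::free
    double *data;  // 8-byte aligned view into the same allocation
};

// Hypothetical helper: allocates room for `count` doubles on an 8-byte boundary.
AlignedDoubles allocate_doubles(std::size_t count)
{
    assert(count > 0);
    const std::size_t alignment = 8;
    const std::size_t bytes = count * sizeof(double);
    std::size_t space = bytes + alignment - 1;  // slack for the adjustment
    void *raw = std::malloc(space);
    if (raw == nullptr)
        return AlignedDoubles{nullptr, nullptr};
    void *p = raw;
    // std::align advances p to the next 8-byte boundary that still
    // leaves `bytes` bytes available; it cannot fail with this slack.
    void *aligned = std::align(alignment, bytes, p, space);
    return AlignedDoubles{raw, static_cast<double *>(aligned)};
}
```

The caller uses data for all accesses and releases the block with std::free(raw), never with the adjusted pointer.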

I don't understand your question about the different operand lengths. That used to be a frequent practice in the Microsoft libraries. It incurs stalls when reading a longer operand than was written. Surely it doesn't make sense to do that. Even the C standard, where threading is not a consideration, makes it clear that this is a non-portable practice.

Quoting - daradder
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:

. Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a 32-byte cache line.

Accesses to cacheable memory that are split across bus widths, cache lines and page boundaries are not guaranteed to be atomic by the Pentium 4, Intel Xeon, P6 family, Pentium and Intel486 processors. The Pentium 4, Intel Xeon and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
[end of quote]

I interpret this as meaning that on Pentium processors with reputable chipset etc, all simple reads and writes up to 64-bits are atomic (user programs work on cached memory).

Dar

>>(user programs work on cached memory).

Do you mean cacheable memory?

The manual uses the term "cached memory". If that means memory that is actually in the cache at the time the ld/st instruction executes, then it does not offer much to user-level programs, does it?

Quoting - _ace_

>>(user programs work on cached memory).

Do you mean cacheable memory?

The manual uses the term "cached memory". If that means memory that is actually in the cache at the time the ld/st instruction executes, then it does not offer much to user-level programs, does it?

A new post after 6 years.
