How to solve bugs of simultaneously misaligned memory accesses

gangti:
  • First Question

Recently, I encountered a rarely occurring bug.

**Environment:**

1. The address of a pointer (call it pMemory) is misaligned.

2. Two threads simultaneously access pMemory.

3. Our program runs on a server with 8 CPUs

4. The original value of pMemory is 0xFFFF FFFF.

**Operation Sequence:**

1. One thread reads the value of pMemory while the other thread modifies pMemory;

the read and modify instructions are both plain MOV.

2. The first thread reads the lower part of pMemory, that is, 0xFFFF.

3. The second thread modifies pMemory from 0xFFFF FFFF to 0x02de 2c68.

4. The first thread then reads the higher part of pMemory, that is, 0x02de,

and so the first thread ends up reading pMemory as 0x02de ffff, which is an invalid pointer.

Currently we are discussing the way to solve the problem.

Do you have any suggestions?

I don't have much time, so would you please reply as soon as possible.

BTW, our program is a network program, so the memory is deliberately packed to one-byte alignment with compiler options such as /Zp1.
Given the risks and workload involved, it is impossible for us to change /Zp1 to natural alignment.

  • Second Question

Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1

8.1.1 Guaranteed Atomic Operations

The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

Would you please explain this mechanism in detail:

"provide bus control signals that permit external memory subsystems to make split accesses atomic"

Dmitry Vyukov:

There are 2 ways to solve the problem. The first is to make the data aligned. The second is to use LOCKed instructions (that is what "provide bus control signals that permit external memory subsystems to make split accesses atomic" is about). I think it's enough to modify the data with a LOCK XCHG instruction, and on the reader side you can leave a plain MOV. But keep in mind that on QPI-based systems (Core i7), LOCKed instructions on unaligned data can be *very* slow (on the order of 5000 cycles).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
jimdempseyatthecove:

In your pointer-fetch code, I will guess that you have a test for pointer == 0xFFFFFFFF as an indication that the pointer is not yet set.

You can change this test to check whether either the high word or the low word equals 0xFFFF.

Any valid pointer you insert into this DWORD will likely not have 0xFFFF in either word. Allocations are generally aligned to 8 bytes (or more), so the low word should never be 0xFFFF. Also, 0xFFFF in the high word points to the last few pages of virtual memory (the OS in Windows, or potentially stack addressing in *ux). You should look at what you place into this pointer to verify my assumption.

Aligning the pointer to a DWORD address (32-bit system) or QWORD (64-bit system), as Dmitriy suggests, would ensure that writes occur in one operation (except possibly when the address of the pointer resides in I/O space).

Jim Dempsey

www.quickthreadprogramming.com
gangti:

I really appreciate your suggestions.

Following your advice, I have worked out a solution to the problem. Is it right or not?

;---------------------------------------

**On Writer's side**

push eax

mov eax, XXXX ; XXXX means the value that I want to write to the unaligned memory

lock xchg [YYYY], eax ; write the value in eax to memory; YYYY means the unaligned address of the memory

pop eax

**On Reader's side**

mov eax, [YYYY]

;---------------------------------------

Still I have some questions:
1. "on reader side you can leave plain MOV...". In my case:
;---------------------------------------
a  Writer's side              | Reader's side
b                             | read lower part
c  lock the bus               |
d  write lower part           |
e  write higher part          |
f  atomically unlock the bus  |
g                             | read higher part
;---------------------------------------

What will happen in step c?

Will step c wait until step g of reader's side finishes?

Or, if step c succeeds immediately, will the reader's side run into the bug? I think it will.

2. Although the prefix "LOCK" has no effect on the "MOV" instruction, I think that the "MOV" instruction also needs to lock the bus.

From my point of view, all the instructions that read/write the same unaligned memory should lock the bus, in order to make them atomic.

Is that right or not?

Dmitry Vyukov:
Quoting gangti: 2. Although the prefix "LOCK" has no effect on the "MOV" instruction, I think that the "MOV" instruction also needs to lock the bus.

From my point of view, all the instructions that read/write the same unaligned memory should lock the bus, in order to make them atomic.

Is that right or not?

I think you are right, it was premature optimization on my side.
If correctness is the only concern, then use LOCK XCHG for both reader and writer. It should 100% work.

gangti:

Dear Jim Dempsey,
Thank you very much; that sounds like a good idea. I will take it as a possible solution and try to find the best one.

jimdempseyatthecove:

LOCK will affect MOV when the stored/retrieved data spans a cache line and/or natural boundary.

consider:

; coded for 32-bit pointers where the shared pointer may be unaligned
; shared pointer holds 0xFFFFFFFF prior to the store of a new pointer
; any address of the form 0xFFxxxxxx is invalid (IOW, a high byte of 0xFF
; indicates an invalid pointer)
; only one producer and one consumer of the pointer

mov edx, addressOfSharedPointer
mov eax, newPointer
mov [edx], ax  ; store lsw
rcr ax,16
mov [edx+2], al ; store 3rd byte
mov [edx+3], ah ; store 4th byte (overwrite 0xFF of 0xFFxxxxxx)

-----------------------------------------------------------

; read
mov edx, addressOfSharedPointer
loop:
mov eax, [edx]	; collect all 4 bytes
cmp eax, 0xFF000000
jae loop


===============================
or


; read
mov edx, addressOfSharedPointer
loop:
mov eax, [edx]	; collect all 4 bytes
cmp eax, 0xFF000000
jb toReturn
pause
jmp loop
toReturn:
ret


The LOCK, although functionally correct, is an expensive operation.
If this pointer manipulation is infrequent, then use the LOCK.
If the pointer manipulation is heavily used, then experiment with code similar to the above.
You can also modify the write of the pointer to test whether the pointer is DWORD- or WORD-aligned:
if on an odd byte address, use the code first listed above;
if DWORD-aligned, simply write the data as a DWORD;
if WORD-aligned (the only case left), write as WORDs (or as a low WORD followed by a DWORD).


Jim Dempsey



gangti:
Quoting Dmitriy Vyukov: I think you are right, it was premature optimization on my side.
If correctness is the only concern, then use LOCK XCHG for both reader and writer. It should 100% work.

Well, that is just what I want to ask you, HOW TO USE LOCK XCHG FOR READER?
thanks for your help

gangti:

Dear Jim Dempsey,
Your code will definitely solve the problem, thank you very much. It really helps.

jimdempseyatthecove:

** error in code sample

rcr ax, 16

should read

ror eax, 16

I assume you caught this typing error.

Jim

jimdempseyatthecove:

Gangti,

For others following this thread, would you be so kind as to run a performance test of your application using both the LOCK method and the method outlined in my sketch? Readers may find your report useful in deciding whether to go to a little extra effort to produce faster code.

Jim Dempsey

gangti:

If possible, I will post the performance test result on this thread.
:-( it's midnight now in China...

Dmitry Vyukov:
Quoting gangti: Well, that is just what I want to ask you, HOW TO USE LOCK XCHG FOR READER?
thanks for your help

Ah, sorry. For the reader you must use LOCK CMPXCHG. Try to change the variable from 0 to 0: if the current value is not 0 the exchange fails, and if it is 0 it is rewritten as 0, so in either case the variable is left physically unchanged and you get the current value.

Dmitry Vyukov:

Btw, can't it be that the variable is 1- or 3-byte aligned (or otherwise misaligned), and not just 2-byte aligned? If so, and if you decide to use plain MOVs (as Jim suggested), you must handle those cases too.

gangti:

Dmitriy Vyukov
thanks, that really works

gangti:

Dmitriy Vyukov,
In my case, the pointer is a member of a big one-byte-aligned struct.
Currently, the address of the pointer is 2-byte aligned.
Someday, when we add other members before the pointer in the struct,
the address of the pointer may become 1-byte aligned.

Well, the current condition may not be the worst case.

grant-haab (Intel):

Then, to be truly portable and not have nightmares maintaining your code, I strongly recommend always using "lock cmpxchg" (for read) with "lock xchg" (for write) instead of using the 0xFFFF or 0xFF tricks. (BTW, I'm not sure the 0xFFFF trick will work with all memory allocation schemes anyway, especially since members of structures are not naturally aligned in your application.)

- Grant

- Grant Haab OpenMP Language Committee Member TBB/OpenMP Software Engineer
Dmitry Vyukov:
Quoting Grant Haab (Intel)

Then, to be truly portable and not have nightmares maintaining your code, I strongly recommend always using "lock cmpxchg" (for read) with "lock xchg" (for write) instead of using the 0xFFFF or 0xFF tricks.

I would recommend either staying single-threaded and not bearing all the complexities of concurrent software, or at least getting some performance benefit from concurrent software. Plunging into concurrent software only to end up slower than the single-threaded version looks quite strange. A load via LOCK CMPXCHG can be 10000 times slower than a MOV.

jimdempseyatthecove:

Grant,

The problem arises when a cache line splits the DWORD. This may occur at the comma in:

0xFFFFFF,FF (least significant byte in the lower cache-aligned address)
0xFFFF,FFFF (least significant 2 bytes in the lower cache-aligned address)
0xFF,FFFFFF (least significant 3 bytes in the lower cache-aligned address)
0xFFFFFFFF (DWORD not split between cache lines)

The LOCK prefix, depending on the processor model, costs you 100x to 500x the overhead of a write of a naturally aligned variable (one also contained within 1 cache line).

If performance matters, then consider safe alternatives that bypass LOCK.

Note, the triple write:

write the word containing the 2 lowest bytes
write the byte containing byte 3 of the DWORD
write the byte containing byte 4 of the DWORD

may (depending on the processor model), I said may, occur as a single write to memory, due to processor write combining, when the DWORD is fully contained within one cache line, and as 2 writes when split across cache lines. In either of the two circumstances (split or not split) the high byte will be written in the last write (which may be the only write). Without testing the code, my estimate is 50x to 500x the performance of the LOCK-prefixed code.

Jim Dempsey

Dmitry Vyukov:
Quoting jimdempseyatthecove: The LOCK prefix, depending on the processor model, costs you 100x to 500x the overhead of a write of a naturally aligned variable (one also contained within 1 cache line).

Jim, the problem is not with writes; they are not scalable in either case.

The problem is with reads. A program can perform 1 read per 0.5 cycles per *thread* if implemented with MOV, or 1 read per 100-1000 cycles per *system* if implemented with CMPXCHG. The worst thing one can do in a concurrent program is to turn a perfectly scalable read operation into a completely non-scalable write operation (consider rw mutexes).

jimdempseyatthecove:
Quoting Dmitriy Vyukov: Jim, the problem is not with writes; they are not scalable in either case.

Depending on the cache architecture, the writes are somewhat scalable.

With the same 64-byte memory node (cache-line-sized memory node) resident in multiple caches of different cores, some cache designs will insert naturally aligned 1-, 2-, 4-, or 8-byte objects into each core's cache line without causing an eviction in the other cache(s). As long as another processor/core doesn't cause a cache-line eviction on your core, your writes will scale.

I think it is time to stop the speculative discussion and write some code. We can run this on various processor models and see what comes out.

Jim

grant-haab (Intel):

Jim,

We can all write microbenchmarks, but ultimately that won't help the author of this post one bit. What matters is the actual code being run. If the ratio of synchronized to non-synchronized memory accesses is low enough, then locked instructions won't affect scalability that much. If the ratio is high, then tricks may need to be played to avoid the locked instructions and recover scalability.

Each programmer has to make the portability/usability vs. performance tradeoff themselves, given their particular program. That is the bottom line for the author who initiated this post. Without the results of the author's experiments on their own code, there is nothing we can tell the author that is guaranteed to be the right decision, because we cannot evaluate the tradeoff without much more data.

- Grant

jimdempseyatthecove:

>>If the ratio of synchronized memory access to non-synchronized memory accesses is low enough, then locked instructions won't affect scalability that much. If the ratio is high, then tricks may need to be played to avoid the locked instructions to get scalability.

That is what I said in my earlier post.

FWIW, I ran a test on a Q6600.

Tests were run in both 32-bit and 64-bit mode, writing 32-bit and 64-bit pointers respectively.
A character buffer was aligned to a 4096-byte boundary and sized 8192 + sizeof(void*).
Timings were made for storing a void* at character indexes ranging from 0 to 8192-1.
For each of the 8192 offsets, 100000 stores were performed and an rdtsc count was taken for the loop.
This count was converted to double and divided by the number of stores (to get the average time in ticks per store).
5 test runs were made; the lowest value for each offset was kept.
Then the sum of the best samples over all 8192 offsets was reported.

Q6600, 32-bit pointers:

Test1 (simple store of pointer)  total = 16400.1
Test2 (w/LOCK)                   total = 374715   (22.8484x)
Test3 (multi-write)              total = 28346.4  (1.72844x)

Q6600, 64-bit pointers:

Test1 (simple store of pointer)  total = 16927.5
Test2 (w/LOCK)                   total = 674978   (39.8746x)
Test3 (multi-write)              total = 40752.6  (2.40748x)

Test3 on the 32-bit system performs:

store of a short (low 2 bytes)
store of a char (3rd byte)
store of a char (4th byte) - the char containing 0xFF, overwritten last

Test3 on the 64-bit system performs:

store of a long (low 4 bytes)
store of a short (bytes 5 and 6)
store of a char (byte 7)
store of a char (8th byte) - the char containing 0xFF, overwritten last

32-bit shows 13.22x improvement using multi-store technique over LOCKed technique
64-bit shows 16.56x improvement using multi-store technique over LOCKed technique

Note, the test code was all C++ (no ASM), so it is completely portable.

Jim Dempsey

