I am a newbie in Write Combine subject. I am measuring a burst IO write performance via mmap on 64 bit Linux and try to understand the WC issue on the IO memory write. I have several basic questions about WC use for this purpose.
The following is my example test setup for a burst write with Write Combine mode enabled:
1. The device driver set IO memory region using ioremap_wc (MTRR). This IO memory is the non prefetchable region. The PAT can be set with write combine or non cached flag.
2. The device driver provides mmap operation for the user space so that the user app can access IO memory, which is resided in the PCIe device, with _mm256_stream_si256.
3. The user program is keep writing 64/24 bytes data streams into IO memory and the data regions are well aligned in 64/24 byte boundary.
4. When the user app writes a 64 bytes burst data into IO memory, it will be called a non temporal instruction _mm256_stream_si256.
5. The CPU, i5 / i7, has 10 WC buffers that are 64 bytes size long per each WC buffer.
6. If possible, want to avoid the use of memory barriers that will degrade a performance of IO burst write operation.
Here are my basic questions regarding
1. Since _mm256_stream_si256 is a non temporal instruction, it cause a weak ordering. If this IO memory region is assigned as a write only, is this really cause a reordering on PCIe write transaction? When this can be happened? Can I have an example that cause a reordering in this scenario? Assume that the burst write IO memory address are incrementally changed in 64 bytes aligned value if 64 bytes burst write is called and in 24 bytes aligned value if 24 bytes burst write is called.
2. If two _mm256_stream_si256 instructions are used for 64 bytes burst write, what size of atomicity can it be guaranteed?
3. If memcpy or pointer operations is used instead of _mm256_stream_si256 function call, does this cause more chance to have a reordering and less chance of to have a write combining?
4. Is a memory barrier use a non avoidable in order to eliminate the out of order issue on PCIe transactions?
5. For PAT setup, does the IO burst write, on user app through mmap, give the same effect whether it is set a write combine or a non cached flag on?