P4 L1/L2 data cache write-through questions

1) Does anyone know exactly what happens during an aligned 128-bit (16-byte SSE/SSE2) write to an address that is already cached in L1-D?
2) More importantly, what are the maximum REAL, attainable L1/L2 write speeds?

The manuals describe: two uops (address/store), a read for ownership (RFO) by the write-combining buffer (WCB), a write to L1 with write-through to L2, and an L2 that can start a new operation every two cycles. The two uops are right, as is L2 starting a new operation every two cycles. However, when writing to only 4 cache lines (4 WCBs used in one place), the bandwidth (20 GBps on a 2.4 GHz P4, 9 B/c) suggests that writes CAN happen faster than once every two cycles. When writing to more than 4 lines, bandwidth drops to an average of 4.5 B/c (10 GBps on a 2.4 GHz P4), and neither preloading, changing the order of the writes, nor write-touching raises that number.
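For anyone who wants to check the arithmetic behind the B/c figures, here is a minimal sketch of the conversion. It assumes "GBps" means 2^30 bytes per second, which is the reading under which 20 GBps and 10 GBps at 2.4 GHz come out near 9 B/c and 4.5 B/c; with decimal gigabytes the results would be roughly 8.3 and 4.2 B/c. The function name `bytes_per_cycle` is only illustrative.

```c
#include <assert.h>

/* Convert a measured bandwidth into bytes per core clock cycle.
 * ASSUMPTION: gib_per_s is in units of 2^30 bytes per second,
 * and ghz is the core clock in units of 10^9 cycles per second. */
static double bytes_per_cycle(double gib_per_s, double ghz)
{
    assert(ghz > 0.0);  /* a zero or negative clock makes no sense */
    return gib_per_s * 1073741824.0 / (ghz * 1e9);
}
```

Calling `bytes_per_cycle(20.0, 2.4)` and `bytes_per_cycle(10.0, 2.4)` gives values just under 9 and 4.5 respectively.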

3) What is the RFO's latency?
4) Does it block the L2 for all this time?

5) Is there any way to force a WCB to wait for the remaining data to be written and only then commit the result? (I.e., can we do one 64-byte WCB write to L2 instead of four transactions that each change only 16 bytes?)

6) Why, if the WCBs have already done the RFO, does changing the order of the writes lower the throughput? (I.e., writes to 8 cache lines in sequence:
[eax], [eax+16], [eax+32], [eax+48], [eax+64], [eax+80], [eax+96], [eax+112], ... [eax+496]
are much faster than writes to
[eax], [eax+16], [eax+64], [eax+32], [eax+128], [eax+48], [eax+192], [eax+80], ... [eax+496]
?)
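To make the comparison reproducible, here is a minimal sketch of the two kinds of store pattern in C with SSE2 intrinsics. The sequential routine follows the first list above. The interleaved routine is NOT the exact second sequence (its full order is elided above); it is a hypothetical stand-in that visits the same 16-byte chunk of every cache line before moving to the next chunk. The names `store16`, `fill_sequential` and `fill_interleaved` are illustrative, the buffer is assumed to be 16-byte aligned, and the `memset` fallback exists only so the sketch builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

enum { CHUNK = 16, LINE = 64, BUF = 512 };  /* 512 bytes = 8 cache lines */

/* Store one 16-byte chunk; dst must be 16-byte aligned. */
static void store16(uint8_t *dst, uint8_t value)
{
#if defined(__SSE2__)
    _mm_store_si128((__m128i *)dst, _mm_set1_epi8((char)value));
#else
    memset(dst, value, CHUNK);  /* portable fallback, not an SSE store */
#endif
}

/* Sequential order: offsets 0, 16, 32, ..., 496. */
static void fill_sequential(uint8_t *buf, uint8_t value)
{
    for (size_t off = 0; off < BUF; off += CHUNK)
        store16(buf + off, value);
}

/* HYPOTHETICAL interleaved order: offsets 0, 64, ..., 448, then
 * 16, 80, ..., 464, and so on. Every line is touched before any
 * line receives its second chunk. */
static void fill_interleaved(uint8_t *buf, uint8_t value)
{
    for (size_t chunk = 0; chunk < LINE; chunk += CHUNK)
        for (size_t line = 0; line < BUF; line += LINE)
            store16(buf + line + chunk, value);
}
```

Both routines write the same 512 bytes, so any difference in timing (for example, measured with the time-stamp counter over many repetitions) comes from the order of the stores alone.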

Any help welcome...

Regards,

Anna

I understand that answering the above question may not be easy (for various reasons, technology-related or not). However, since this is a very important characteristic for code performance, any comments, and I mean any, are welcome. If someone knowledgeable would like to explain the problem, but not publicly, you can email me at this email.

Also, I understand that this forum is devoted to ICC and not to microarchitecture. However, the problem described makes programming much harder with ICC as well. If the hosts, or anyone else, know of another Intel microarchitecture forum where this thread would be more appropriate, please let me know.

Regards, Anna
