1) Does anyone know what exactly happens during an aligned 128-bit (16-byte SSE/SSE2) write to an address that is already cached in L1-D?
2) More importantly, what are the maximum real, attainable L1/L2 write bandwidths?
The manuals describe: two uops (store-address/store-data), a write-combining buffer (WCB) read-for-ownership (RFO), a write to L1 with write-through to L2, and an L2 operation able to start every two cycles. The two uops are right, as is the L2 starting a new operation every two cycles; however, when writing to only 4 cache lines (so 4 WCBs in use at once), the bandwidth (20 GB/s on a 2.4 GHz P4, i.e. ~9 B/cycle) suggests that writes CAN happen faster than every two cycles. When writing to more than 4 lines, bandwidth drops to an average of 4.5 B/cycle (10 GB/s on a 2.4 GHz P4), and neither preloading, changing the sequence of writes, nor write-touching raises that number.
3) What is the RFO's latency?
4) Does the RFO block the L2 for its whole duration?
5) Is there any method to force a WCB to wait for the rest of the line's data to be written and only then commit the result? (i.e. can we do one 64-byte WCB write to L2 instead of four transactions changing only 16 bytes each?)
6) Why, if the WCBs already hold ownership (the RFO is done), does changing the sequence of writes lower the throughput? For example, writes to 8 cache lines in the order
[eax], [eax+16], [eax+32], [eax+48], [eax+64], [eax+80], [eax+96], [eax+112], ... [eax+496]
are much faster than writes in the order
[eax], [eax+16], [eax+64], [eax+32], [eax+128], [eax+48], [eax+192], [eax+80], ... [eax+496].
Any help welcome...