P4 L1/L2 data cache write-through questions

1) Does anyone know exactly what happens during an aligned 128-bit (16B, SSE/SSE2) write to an address that is already cached in L1-D?
2) More importantly, what are the maximum REAL, attainable L1/L2 write speeds?

The manuals mention: two uops (address/store), a write-combining buffer (WCB) read for ownership (RFO), a write to L1 with write-through to L2, and that an L2 operation can start every two cycles. The two uops check out, as does L2 starting a new operation every two cycles; however, when writing to only 4 cache lines (4 WCBs used at once) the bandwidth (20GBps on a 2.4GHz P4, 9B/c) suggests that writes CAN happen faster than once every two cycles. When writing to more than 4 lines, bandwidth drops to an average of 4.5B/c (10GBps on a 2.4GHz P4), and neither preloading, changing the sequence of writes, nor write-touching can raise that number.
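
The arithmetic behind the B/c figures can be sketched in C as below. This is only a sanity check, and it rests on an assumption that is not stated in the post: that "GBps" means 2^30 bytes per second. Under that assumption 20GBps at 2.4GHz works out to roughly 8.95B/c, or one 16-byte store about every 1.8 cycles, and 10GBps to roughly 4.47B/c. The function names are made up for illustration.

```c
#include <assert.h>

/* Convert a measured bandwidth into bytes per core clock.
   ASSUMPTION (not from the manuals or the post): 1 GB = 2^30 bytes. */
double bytes_per_cycle(double gbytes_per_s, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return gbytes_per_s * 1073741824.0 / (clock_ghz * 1e9);
}

/* Average number of cycles between consecutive 16-byte stores
   at a given bytes-per-cycle rate. */
double cycles_per_16b_store(double b_per_cycle)
{
    assert(b_per_cycle > 0.0);
    return 16.0 / b_per_cycle;
}
```

With these, cycles_per_16b_store(bytes_per_cycle(20.0, 2.4)) comes out below 2, which is the basis for saying the 4-line case beats one store every two cycles.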

3) What is the RFO's latency?
4) Does it block the L2 for all this time?

5) Is there any way to force a WCB to wait for the rest of the data to be written and only then commit the result? (i.e. can we do one 64-byte WCB write to L2 instead of four transactions changing only 16 bytes each?)

6) Why, if the WCBs have already done their RFO, does changing the sequence of writes lower the throughput? (i.e. writes to 8 cache lines in sequence:
[eax], [eax+16], [eax+32], [eax+48], [eax+64], [eax+80], [eax+96], [eax+112], ... [eax+496]
are much faster than writes to
[eax], [eax+16], [eax+64], [eax+32], [eax+128], [eax+48], [eax+192], [eax+80], ... [eax+496])
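
For anyone who wants to reproduce the two orders, here is a minimal C sketch using SSE2 intrinsics. It is not the code that produced the numbers above. The helper names are hypothetical, base stands in for eax, and the second order is given only as the eight offsets actually listed, because the rest of that pattern is elided. Timing is left out on purpose.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Fill offsets[0..n-1] with 0, 16, 32, ... (the sequential order). */
void sequential_offsets(size_t *offsets, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        offsets[i] = i * 16;
}

/* Issue one aligned 16-byte store at base + offsets[i], in the order given.
   base plays the role of eax and must be 16-byte aligned, as must every
   offset. */
void store_sequence(void *base, const size_t *offsets, size_t n, char fill)
{
    assert(((uintptr_t)base & 15u) == 0);
    const __m128i v = _mm_set1_epi8(fill);
    for (size_t i = 0; i < n; ++i) {
        assert((offsets[i] & 15u) == 0);
        _mm_store_si128((__m128i *)((char *)base + offsets[i]), v);
    }
}

/* Only the offsets spelled out for the second order; the elided
   remainder of that sequence is deliberately not guessed at. */
const size_t interleaved_prefix[8] = { 0, 16, 64, 32, 128, 48, 192, 80 };
```

A real measurement would call store_sequence in a loop over a 512-byte, 64-byte-aligned buffer and time it with the cycle counter. It is worth checking the generated assembly first, since the compiler is free to emit something other than a plain movdqa per store.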

Any help welcome...




I understand that answering the above questions may not be easy (for various reasons, technology-related or not). However, as this is a very important characteristic for code performance, any, and I mean any, comments are welcome. If someone knowledgeable would like to explain the problem, but not publicly, you can email me instead.

Also, I understand that this forum is devoted to ICC and not to the microarchitecture. However, the problem described makes programming much harder with ICC as well. If the hosts, or anyone else, know of another Intel-microarchitecture-related forum where this thread would be more appropriate, please let me know.

Regards, Anna
