I have a very simple question. At least, I hope it is simple.
As I have read, and already tried myself, bulk (coupled) streaming reads/writes should give a small to significant speedup.
After some more profiling, I found a very small, older method in our software that takes too much time in my opinion. Most of the time is spent on the last instruction, the data write. To anticipate a likely follow-up question: the design does not guarantee that the destination memory fits in any cache, or that the cached data has not been overwritten in the meantime, so there really are some access penalties.