I just double-checked my program and data. It seems that I copied my program's read-phase timings to the wrong place in Excel. The actual numbers are: copy: 3.5 cycles/element, read: 1.1 cycles/element, write: 3.4 cycles/element. This is explainable, since some of the write overhead can be hidden by write buffers.
It seems that I need to look further into why my reading phase takes 5 cycles per element while the copy program is OK.
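For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of timing the read, write, and copy phases separately. It is not my actual program: the function names (`read_phase`, `write_phase`, `copy_phase`, `report_phases`), the `int` element type, and the use of `clock()` are all assumptions made for illustration. It reports nanoseconds per element; multiplying by the CPU clock in GHz gives an approximate cycles/element figure. An optimizing compiler may vectorize or hoist these loops, so the numbers depend heavily on the compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read phase: sum every element so the loads have a visible result. */
long read_phase(const int *src, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += src[i];
    return sum;
}

/* Write phase: store the same value into every element. */
void write_phase(int *dst, size_t n, int value) {
    for (size_t i = 0; i < n; i++) dst[i] = value;
}

/* Copy phase: one load and one store per element. */
void copy_phase(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i];
}

/* Convert a clock() interval to nanoseconds per element. */
static double ns_per_elem(clock_t start, clock_t end, size_t n, int reps) {
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds * 1e9 / ((double)n * reps);
}

/* Hypothetical driver: times each phase over `reps` passes, prints the
   results, and returns the checksum of the copied array (-1 on
   allocation failure). */
long report_phases(size_t n, int reps) {
    int *src = malloc(n * sizeof *src);
    int *dst = malloc(n * sizeof *dst);
    if (!src || !dst) { free(src); free(dst); return -1; }

    /* Touch both arrays first so page faults are not timed. */
    write_phase(src, n, 1);
    write_phase(dst, n, 0);

    volatile long sink = 0;  /* keeps the read results alive */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) sink += read_phase(src, n);
    clock_t t1 = clock();
    for (int r = 0; r < reps; r++) write_phase(dst, n, r);
    clock_t t2 = clock();
    for (int r = 0; r < reps; r++) copy_phase(dst, src, n);
    clock_t t3 = clock();

    printf("read:  %.3f ns/element\n", ns_per_elem(t0, t1, n, reps));
    printf("write: %.3f ns/element\n", ns_per_elem(t1, t2, n, reps));
    printf("copy:  %.3f ns/element\n", ns_per_elem(t2, t3, n, reps));

    /* After the copy, dst must hold the same data as src (all ones). */
    long checksum = read_phase(dst, n);
    assert(checksum == (long)n);

    free(src);
    free(dst);
    return checksum;
}
```

Calling something like `report_phases(1 << 20, 50)` from `main` prints the three figures. If copy comes out noticeably cheaper than read plus write, that is consistent with part of the store cost being overlapped with the loads.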