Throughout the memory hierarchy, data moves at cache line granularity: 64 bytes per line. Although a cache line is much larger than common data types such as integer, float, or double, an unaligned value of these or other types may span two cache lines. Recent Intel architectures have significantly improved the performance of such 'split loads' by introducing split registers to handle them, but split loads can still be problematic, especially when many consecutive split loads consume all available split registers.
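To make the definition concrete, here is a minimal sketch in C of how one might test whether a given access would span two cache lines. The helper name crosses_cache_line and the CACHE_LINE_SIZE macro are hypothetical and used only for illustration. The 64-byte line size is taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Line size as stated in the description; an assumption for other hardware. */
#define CACHE_LINE_SIZE 64u

/*
 * Hypothetical helper: returns 1 if an access of `size` bytes starting at
 * address `addr` touches more than one cache line, and 0 otherwise.
 */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0) {
        return 0; /* an empty access touches no line */
    }
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

For example, an 8-byte double at offset 60 within a line occupies bytes 60 through 67 and therefore crosses into the next line, while the same double at offset 56 does not.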
A significant proportion of cycles is spent handling split loads.
Consider aligning your data to the 64-byte cache line granularity. See the
Intel 64 and IA-32 Architectures Optimization Reference Manual for more details.
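As one possible way to follow this advice, the sketch below uses the standard C11 facilities alignas and aligned_alloc. The type struct sample and the function alloc_line_aligned_doubles are hypothetical names chosen for illustration. It assumes a C11 toolchain whose C library provides aligned_alloc, which is not the case everywhere (some platforms offer only their own aligned allocation routines).

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Line size as stated in the description; an assumption for other hardware. */
#define CACHE_LINE_SIZE 64

/* Hypothetical record type: alignas makes every instance start on a
 * cache line boundary, so its 32 bytes of data never span two lines. */
struct sample {
    alignas(CACHE_LINE_SIZE) double values[4];
};

/*
 * Hypothetical helper: allocates room for `count` doubles starting on a
 * cache line boundary. The byte count is rounded up to a multiple of the
 * alignment, which C11 requires of aligned_alloc. Overflow of
 * count * sizeof(double) is not checked in this sketch.
 * Returns NULL on failure; release the buffer with free().
 */
static double *alloc_line_aligned_doubles(size_t count)
{
    size_t bytes = count * sizeof(double);
    size_t rounded = (bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;
    if (rounded == 0) {
        rounded = CACHE_LINE_SIZE; /* avoid a zero-size request */
    }
    return aligned_alloc(CACHE_LINE_SIZE, rounded);
}
```

Note that a scalar stored at its natural alignment (for example, a double on an 8-byte boundary) already cannot cross a 64-byte line. Full 64-byte alignment mainly matters for buffers read with wide vector loads and for structures that should fit within a single line.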