I see two instructions that perform virtually the same operation, vmovups and vmovapd, as per the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3,3...), differing only in their memory-alignment requirements.
However, I am very interested in understanding the performance implications of using one of the above versus the other.
The Intel developer's guide doesn't give much information about this (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-...).
Basically, it only states:
"Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued."
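To make the quoted condition concrete, here is a minimal sketch of how I read it: an access is penalized when its first and last bytes fall in different cache lines. The 64-byte line size and the helper name crosses_cache_line are my own assumptions for illustration, not something taken from the Intel documents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, which is typical for current x86 CPUs. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical helper: returns 1 if an access of `width` bytes starting at
 * address `addr` touches two different cache lines, 0 otherwise. */
int crosses_cache_line(uintptr_t addr, size_t width)
{
    /* Only meaningful for accesses no wider than one cache line. */
    assert(width > 0 && width <= CACHE_LINE_BYTES);

    uintptr_t first_line = addr / CACHE_LINE_BYTES;               /* line of the first byte */
    uintptr_t last_line  = (addr + width - 1) / CACHE_LINE_BYTES; /* line of the last byte */
    return first_line != last_line;
}
```

With width = 32 (one 256-bit register), a 32-byte-aligned address such as 0, 32 or 64 never crosses a line, while an unaligned address such as 48 does. This only models the crossing condition itself and says nothing about how large the penalty is on a given CPU.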
Is there a resource someone could point to that covers this topic in depth?