_mm_lddqu_si128 and _mm_loadu_si128

Hi,

I would like to ask how much improvement I can get by replacing _mm_loadu_si128 with _mm_lddqu_si128 on a 64-bit machine. I wrote a simple program to compare these two load intrinsics, but I could not see any improvement at all. As I understand it, _mm_lddqu_si128 handles unaligned loads better than _mm_loadu_si128. The following is my test code. Any comments or advice are appreciated!

----------------------------------
time1 = get_time();
srand(time(0));
for (i = 0; i < 999999; i++) {
    k = rand();
    t1 = _mm_loadu_si128((__m128i*)(array+k)); // array is NOT 16-byte aligned
    //t1 = _mm_lddqu_si128((__m128i*)(array+k));
}
time2 = get_time();
printf("Total Time = %8.4lfms\n", (time2-time1)*1000);
-----------------------------------

Thanks,
Ivan

There used to be a difference on the Pentium 4; on modern processors there is no difference.
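If you want to check this on your own machine, here is a minimal sketch of a comparison, not a definitive benchmark. It differs from the test code in the question in two ways: the offsets are kept inside the buffer, and the loaded values are accumulated so the compiler cannot simply drop the loads. All names (sum_loadu, sum_lddqu, run_benchmark, BUF_SIZE, ITERS) are made up for illustration, and the target and aligned attributes assume GCC or Clang on x86. With this setup both loops should return the same sum, and on a recent CPU I would expect the timings to be about the same.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_add_epi32 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE 4096    /* hypothetical buffer size */
#define ITERS    999999  /* same iteration count as in the question */

/* Sum 16-byte blocks loaded with _mm_loadu_si128 (MOVDQU). */
__attribute__((target("sse3")))
static uint32_t sum_loadu(const uint8_t *buf, const uint32_t *off, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + off[i]));
        acc = _mm_add_epi32(acc, v);  /* keep the load live */
    }
    return (uint32_t)_mm_cvtsi128_si32(acc);  /* low 32-bit lane */
}

/* Same loop, but loading with _mm_lddqu_si128 (LDDQU). */
__attribute__((target("sse3")))
static uint32_t sum_lddqu(const uint8_t *buf, const uint32_t *off, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++) {
        __m128i v = _mm_lddqu_si128((const __m128i *)(buf + off[i]));
        acc = _mm_add_epi32(acc, v);
    }
    return (uint32_t)_mm_cvtsi128_si32(acc);
}

/* Times both variants; returns 0 if they produced the same sum. */
int run_benchmark(void)
{
    static uint8_t buf[BUF_SIZE + 16] __attribute__((aligned(16)));
    static uint32_t off[ITERS];
    size_t i;

    srand(12345);  /* fixed seed so runs are repeatable */
    for (i = 0; i < sizeof buf; i++)
        buf[i] = (uint8_t)rand();
    /* Odd offsets from a 16-byte aligned base are never 16-byte aligned,
       and offset + 16 always stays inside buf. */
    for (i = 0; i < ITERS; i++)
        off[i] = ((uint32_t)rand() % BUF_SIZE) | 1u;

    clock_t t0 = clock();
    uint32_t a = sum_loadu(buf, off, ITERS);
    clock_t t1 = clock();
    uint32_t b = sum_lddqu(buf, off, ITERS);
    clock_t t2 = clock();

    printf("loadu: %8.4f ms\n", (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
    printf("lddqu: %8.4f ms\n", (double)(t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
    return a == b ? 0 : 1;
}
```

Call run_benchmark() from main and run it a few times; a single run of this size is noisy, so small differences between the two lines should not be read as a real effect.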

