I declared the data this way:
__declspec(align(16)) int diff;
But when I use the data like this:
__m128i *d = (__m128i*) diff;
dl0 = _mm_load_si128(d);
dl3 = _mm_load_si128(d+3);
the program crashes. It only works with _mm_loadu_si128, but that function is noticeably slower than _mm_load_si128.
As a result, after I rewrote the algorithm with SSE2, it is hard to see any benefit from SSE2.
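
For comparison, here is a minimal sketch of the aligned-load pattern, assuming diff was meant to be an array of at least 16 ints (d+3 reads the fourth 128-bit vector, so a single int cannot back it, and casting a scalar int to a pointer uses its value as an address). The names diff_buf, ALIGN16, sum_first_and_fourth and is_aligned16 are hypothetical and only for illustration; the add and store steps are there just to make the loads observable.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Portable 16-byte alignment macro (assumption: MSVC or a GCC-compatible compiler). */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* Hypothetical buffer: 16 ints = four 128-bit vectors, so d + 3 stays in range. */
ALIGN16 static int diff_buf[16];

/* Returns nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Loads vectors 0 and 3 with aligned loads and returns the sum of their lanes. */
static int sum_first_and_fourth(void)
{
    /* The array name decays to its address; no integer-to-pointer cast is involved. */
    const __m128i *d = (const __m128i *)diff_buf;
    assert(is_aligned16(d));

    __m128i dl0 = _mm_load_si128(d);      /* ints 0..3   */
    __m128i dl3 = _mm_load_si128(d + 3);  /* ints 12..15 */
    __m128i s = _mm_add_epi32(dl0, dl3);

    ALIGN16 int out[4];
    _mm_store_si128((__m128i *)out, s);
    return out[0] + out[1] + out[2] + out[3];
}
```

If the real data comes from malloc or is a member of a struct that is itself not 16-byte aligned, the declaration-level alignment above does not apply, and an aligned allocator such as _mm_malloc would be needed instead.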