A couple of comments:
First, can we assume the proper amount of memory was allocated for a, b, c? (N doubles each)
Second, you might check the disassembly to see whether register pressure causes the pointers to one or more of aa, bb, cc to be refetched from memory rather than remaining cached in registers.
Jim Dempsey
Yes, the arrays were allocated with
a = (double *) malloc (N*sizeof(double));
likewise for b and c.
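For completeness, here is a minimal sketch of that allocation with failure checks added. The helper name alloc_vectors and its return convention are hypothetical and not part of the original benchmark; it only illustrates allocating N doubles for each of a, b and c and confirming that all three calls succeeded.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original benchmark):
   allocate three arrays of n doubles each.
   Returns 0 on success, -1 on failure. On failure nothing is leaked
   and all three output pointers are set to NULL. */
int alloc_vectors(size_t n, double **a, double **b, double **c)
{
    *a = *b = *c = NULL;

    /* Guard against overflow in n * sizeof(double). */
    if (n > SIZE_MAX / sizeof(double))
        return -1;

    *a = malloc(n * sizeof **a);
    *b = malloc(n * sizeof **b);
    *c = malloc(n * sizeof **c);

    if (*a == NULL || *b == NULL || *c == NULL) {
        /* free(NULL) is a no-op, so this is safe for partial failures. */
        free(*a);
        free(*b);
        free(*c);
        *a = *b = *c = NULL;
        return -1;
    }
    return 0;
}
```

With N=2000000 each array takes 16,000,000 bytes, so a failed malloc is unlikely on most systems, but the check rules it out as a cause.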
I ran the benchmark for N=2000000. Would it be reasonable to assume that there is sufficient work in the loop that whether the pointer is cached would not significantly affect performance?