How to Implement Efficient MADD Operations on 64-Bit Intel® Architecture
Challenge
Implement a multiply and add operation efficiently on 64-bit Intel® architecture. Integer matrix multiplication is a common procedure that is used generically to explore options for optimizing nested loops and specifically for computations with multimedia algorithms, such as JPEG image decoding, all the way to uses as diverse as oceanographic analysis. The essence of this algorithm is the assignment statement:
c[i][j] = c[i,j] + a[i][k] * b[k][j];
This innocuous looking line of code, which is typically known as "madd" (for multiply and add), is one of the more difficult processing tasks that a CPU must perform. Such computations break down into a series of multiply-accumulate operations that cannot efficiently be built out of simple, separate functions. The problem here is that multiplication takes several clock cycles and addition only one. Therefore, a critical scheduling problem exists. If the addition is done too early, the CPU stalls. If it is done too late, there is wasted time, because the multiplication units are not kept busy.
The following algorithm performs the madd algorithm, but not as efficiently as possible:
/***********************************/
#include /* matmul_cont.c */
#include
#include
/***********************************/
int
main (int argc, char *argv[])
{
int i, j, k;
int N; // command line arg for no. of iterations
int *a, *b, *c; // pointers to matrices a, b, and c
int *am, *bm, *cm; // row pointers for matrices a, b, and c
clock_t start; // from 'time.h'
if (argc != 2)
{
printf ("Usage: %s
", argv[0]);
exit (-1);
}
N = atoi (argv[1]); // convert command line argument to int
a = malloc (N * sizeof (int *)); // allocate row pointers
am = malloc (N * N * sizeof (int)); // allocate NxN matrix of int's
for (i = 0; i < N; i++) // initialize row pointers
a[i] = &am[i * N];
b = malloc (N * sizeof (int *)); // (same for b and c)
bm = malloc (N * N * sizeof (int));
for (i = 0; i < N; i++)
b[i] = &bm[i * N];
c = malloc (N * sizeof (int *));
cm = malloc (N * N * sizeof (int));
for (i = 0; i < N; i++)
c[i] = &cm[i * N];
for (i = 0; i < N; i++) // initialize a and b to random int's
for (j = 0; j < N; j++) // mod 32768
{
a[i][j] = rand () % 32768;
b[i][j] = rand () % 32768;
}
start = clock (); // start the clock
/* begin matrix multiply */
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
{
c[i][j] = 0;
for (k = 0; k < N; k++) // inner loop to compute dot product
c[i][j] += a[i][k] * b[k][j];
}
/* end matrix multiply */
/* note that there is no unit stride in inner loop, so cache misses account
for substantial overhead */
start = clock () - start; // stop clock and calc. elapsed clocks
printf ("N = %d Time = %f
", N, (float) start / CLOCKS_PER_SEC);
free (a);
free (am);
free (b);
free (bm);
free (c);
free (cm);
return (0);
}
Solution
Further tune the inner loop by making it "unit stride." What this means is that for multiple-dimensioned arrays, access to each element is fastest if it can be done on the smallest possible step size (i.e., unit stride). The outer loop will automatically be unit stride, but the inner loop will not, because its stride is N, which is always assumed greater than one. The comments in the matmul_stride.c program explain how the unit stride was obtained.
/**********************************/
#include /* matmul_stride.c */
#include
#include
/**********************************/
int main (int argc, char *argv[])
{
int i, j, k;
int N; // command line arg for no. of iterations
int *a, *b, *c; // command line arg - no. of iterations
int *am, *bm, *cm; // pointers to matrices a, b, and c
int stride_scale; // stride_scale provides unit stride in inner loop
clock_t start;
if (argc != 2)
{
printf ("Usage: %s
", argv[0]);
exit (-1);
}
N = atoi (argv[1]); // convert command line argument to int
a = malloc (N * sizeof (int *)); // allocate row pointers
am = malloc (N * N * sizeof (int)); // allocate NxN matrix of int's
for (i = 0; i < N; i++) // initialize row pointers
a[i] = &am[i * N];
b = malloc (N * sizeof (int *)); // (same for b and c)
bm = malloc (N * N * sizeof (int));
for (i = 0; i < N; i++)
b[i] = &bm[i * N];
c = malloc (N * sizeof (int *));
cm = malloc (N * N * sizeof (int));
for (i = 0; i < N; i++)
c[i] = &cm[i * N];
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
{
a[i][j] = rand () % 32768;
b[i][j] = rand () % 32768;
}
start = clock (); // start the clock
/* begin matrix multiply */
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
c[i][j] = 0;
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
{
stride_scale = b[j][i];
for (k = 0; k < N; k++) // inner loop to compute dot product
c[j][k] += a[j][k] * stride_scale; // but now we have unit stride
} // in inner loop,
/* end matrix multiply */
start = clock () - start;;
printf ("N = %d Time = %f
", N, (float) start / CLOCKS_PER_SEC);
free (a);
free (am);
free (b);
free (bm);
free (c);
free (cm);
return (0);
}
/**********************************/
The matmul_stride.c program demonstrates how the inner loop can be coerced into being unit stride (under the assumption that the matrix, b is already transposed). This means that once N is longer than the cache line, the performance degradation that existed in matmul_cont.c prior to this tuning is eliminated. (It is worth noting that in C, accessing matrix 'b' is by column and C uses row-major order.)
Пожалуйста, обратитесь к странице Уведомление об оптимизации для более подробной информации относительно производительности и оптимизации в программных продуктах компании Intel.