How to Implement Efficient MADD Operations on 64-Bit Intel® Architecture

Теги:

Challenge

Implement a multiply and add operation efficiently on 64-bit Intel® architecture. Integer matrix multiplication is a common procedure that is used generically to explore options for optimizing nested loops and specifically for computations with multimedia algorithms, such as JPEG image decoding, all the way to uses as diverse as oceanographic analysis. The essence of this algorithm is the assignment statement:

c[i][j] = c[i,j] + a[i][k] * b[k][j];

 

This innocuous looking line of code, which is typically known as "madd" (for multiply and add), is one of the more difficult processing tasks that a CPU must perform. Such computations break down into a series of multiply-accumulate operations that cannot efficiently be built out of simple, separate functions. The problem here is that multiplication takes several clock cycles and addition only one. Therefore, a critical scheduling problem exists. If the addition is done too early, the CPU stalls. If it is done too late, there is wasted time, because the multiplication units are not kept busy.

The following algorithm performs the madd algorithm, but not as efficiently as possible:

/***********************************/

#include       /* matmul_cont.c */

#include 

#include 


/***********************************/


int

main (int argc, char *argv[])

{

int i, j, k;

int N;               // command line arg for no. of iterations

int *a, *b, *c;      // pointers to matrices a, b, and c

int *am, *bm, *cm;   // row pointers for matrices a, b, and c


clock_t start;       // from 'time.h'


if (argc != 2)

{

printf ("Usage: %s 


", argv[0]);

exit (-1);

}


N = atoi (argv[1]);  // convert command line argument to int


a = malloc (N * sizeof (int *));    // allocate row pointers

am = malloc (N * N * sizeof (int)); // allocate NxN matrix of int's

for (i = 0; i < N; i++)             // initialize row pointers

a[i] = &am[i * N];


b = malloc (N * sizeof (int *));    // (same for b and c)

bm = malloc (N * N * sizeof (int));

for (i = 0; i < N; i++)

b[i] = &bm[i * N];


c = malloc (N * sizeof (int *));

cm = malloc (N * N * sizeof (int));

for (i = 0; i < N; i++)

c[i] = &cm[i * N];


for (i = 0; i < N; i++)      // initialize a and b to random int's

for (j = 0; j < N; j++)  // mod 32768

{

a[i][j] = rand () % 32768;

b[i][j] = rand () % 32768;

}


start = clock ();            // start the clock


/* begin matrix multiply */

for (i = 0; i < N; i++)

for (j = 0; j < N; j++)

{

c[i][j] = 0;

for (k = 0; k < N; k++)   // inner loop to compute dot product

c[i][j] += a[i][k] * b[k][j];

}

/* end matrix multiply */


/* note that there is no unit stride in inner loop, so cache misses account

for substantial overhead */


start = clock () - start; // stop clock and calc. elapsed clocks


printf ("N = %d		Time = %f

", N, (float) start / CLOCKS_PER_SEC);


free (a);

free (am);

free (b);

free (bm);

free (c);

free (cm);


return (0);

}

 


Solution

Further tune the inner loop by making it "unit stride." What this means is that for multiple-dimensioned arrays, access to each element is fastest if it can be done on the smallest possible step size (i.e., unit stride). The outer loop will automatically be unit stride, but the inner loop will not, because its stride is N, which is always assumed greater than one. The comments in the matmul_stride.c program explain how the unit stride was obtained.

/**********************************/

#include       /* matmul_stride.c */

#include 

#include 

/**********************************/


int main (int argc, char *argv[])

{

int i, j, k;

int N;         // command line arg for no. of iterations

int *a, *b, *c;   // command line arg - no. of iterations

int *am, *bm, *cm;   // pointers to matrices a, b, and c


int stride_scale;   // stride_scale provides unit stride in inner loop


clock_t start;


if (argc != 2)

{

printf ("Usage: %s 


", argv[0]);

exit (-1);

}


N = atoi (argv[1]);      // convert command line argument to int


a = malloc (N * sizeof (int *));    // allocate row pointers

am = malloc (N * N * sizeof (int)); // allocate NxN matrix of int's

for (i = 0; i < N; i++)       // initialize row pointers

a[i] = &am[i * N];


b = malloc (N * sizeof (int *));    // (same for b and c)

bm = malloc (N * N * sizeof (int));

for (i = 0; i < N; i++)

b[i] = &bm[i * N];


c = malloc (N * sizeof (int *));

cm = malloc (N * N * sizeof (int));

for (i = 0; i < N; i++)

c[i] = &cm[i * N];


for (i = 0; i < N; i++)

for (j = 0; j < N; j++)

{

a[i][j] = rand () % 32768;

b[i][j] = rand () % 32768;

}


start = clock ();         // start the clock


/* begin matrix multiply */

for (i = 0; i < N; i++)

for (j = 0; j < N; j++)

c[i][j] = 0;


for (i = 0; i < N; i++)

for (j = 0; j < N; j++)

{

stride_scale = b[j][i];

for (k = 0; k < N; k++)   // inner loop to compute dot product

c[j][k] += a[j][k] * stride_scale;   // but now we have unit stride

}                      // in inner loop,

/* end matrix multiply */


start = clock () - start;;


printf ("N = %d		Time = %f

", N, (float) start / CLOCKS_PER_SEC);


free (a);

free (am);

free (b);

free (bm);

free (c);

free (cm);


return (0);

}

/**********************************/

 

The matmul_stride.c program demonstrates how the inner loop can be coerced into being unit stride (under the assumption that the matrix, b is already transposed). This means that once N is longer than the cache line, the performance degradation that existed in matmul_cont.c prior to this tuning is eliminated. (It is worth noting that in C, accessing matrix 'b' is by column and C uses row-major order.)


Source

Tests of Efficient Implementation of Madd Algorithms on an Itanium®-Based System

 


Пожалуйста, обратитесь к странице Уведомление об оптимизации для более подробной информации относительно производительности и оптимизации в программных продуктах компании Intel.