How to Achieve Peak Transfer Rate - C/C++

Compiler Methodology for Intel® MIC Architecture

Overview

This short example measures data transfer rates between the host and the Intel® Xeon® Phi™ coprocessor. It shows how to replace malloc() and free() with _mm_malloc() and _mm_free() to allocate and free buffers aligned on 4K boundaries, which is optimal for DMA transfers to the coprocessor. No measured data rates are reported here; the sample only demonstrates techniques for efficient data transfer.

Topics

Data has to be allocated with 4K alignment for optimal DMA performance. DMA is used for efficient data movement over PCIe to the Intel® Xeon® Phi™ coprocessor.
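The alignment requirement can be checked on the host alone. The following is a minimal sketch, assuming a POSIX system: it uses posix_memalign() as a portable stand-in for _mm_malloc(), and the helper names is_power_of_two(), is_aligned(), and alloc_aligned() are hypothetical, not part of the original sample.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign() under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if n is a non-zero power of two, 0 otherwise. */
static int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Returns 1 if ptr is aligned on an `align`-byte boundary.
 * Precondition: align is a power of two. */
static int is_aligned(const void *ptr, size_t align)
{
    assert(is_power_of_two(align));
    return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* Hypothetical helper (assumption, not from the original sample):
 * allocates `size` bytes aligned to `align` bytes.
 * Returns NULL if align is invalid or the allocation fails.
 * Memory obtained here is released with free(), not _mm_free(). */
static void *alloc_aligned(size_t size, size_t align)
{
    void *p = NULL;

    /* posix_memalign() needs a power of two that is at least sizeof(void *). */
    if (!is_power_of_two(align) || align < sizeof(void *))
        return NULL;
    if (posix_memalign(&p, align, size) != 0)
        return NULL;
    return p;
}
```

Calling alloc_aligned(8192, 4096) should yield a pointer for which is_aligned(p, 4096) holds, while a non-power-of-two alignment such as 3000 is rejected with NULL. With the Intel compiler, _mm_malloc(size, 4096) and _mm_free() play the same role, as in the listing in this article.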

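Allocate both buffers before the timing loop: the host buffer with _mm_malloc(), and the coprocessor buffer with an initial offload that specifies free_if(0) so the buffer stays resident on the coprocessor. Inside the loop, use alloc_if(0) free_if(0) on each transfer so that the existing coprocessor buffer is reused and only the data movement is timed. One simple version of the code is given below.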

How to run the code:

-bash-4.1$ icc -offload-build bwtest.c

-bash-4.1$ ./a.out -h -a <buffer alignment> -d <device ID> -n <number of iterations>

Example run with the default options:

-bash-4.1$ ./a.out

Bandwidth test. Buffer alignment: 4096. DeviceID: 0. Number of iterations: 20.

          Size(Bytes) Send(Bytes/sec) Receive(Bytes/sec)

          <your results will be shown here>

-bash-4.1$

Code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <ia32intrin.h>

/* buffer alignment */
static int align = 4096;

/* device id */
static int device = 0;

/* number of iterations in benchmarking loop */
static int niters = 20;

/* CPU buffer */
__declspec(target(mic))
static char* buf;

/* buffer sizes */
static const int bufsizes[] =
{
    4096,
    8192,
    16384,
    32768,
    65536,
    131072,
    262144,
    524288,
    1048576,
    2097152,
    4194304,
    8388608,
    16777216,
    33554432,
    67108864,
    134217728,
    268435456,
    536870912,
    0
};

static void parse_options(int argc, char** argv)
{
    int opt;

    while ((opt = getopt(argc, argv, "ha:d:n:")) != -1) {
        switch (opt) {
            case 'a':
                align = atoi(optarg);
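                /* reject non-positive values and values that are not a power of two;
                   the parentheses matter because != binds tighter than & */
                if (align <= 0 || (align & (align-1)) != 0) {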
                    printf("Invalid alignment %d\n", align);
                    exit(1);
                }
                break;

            case 'd':
                device = atoi(optarg);
                if (device < 0) {
                    printf("Invalid device ID %d\n", device);
                    exit(1);
                }
                break;

            case 'n':
                niters = atoi(optarg);
                if (niters <= 0) {
                    printf("Invalid number of iterations %d\n", niters);
                    exit(1);
                }
                break;

            default:
                printf("Usage:\n\t%s -h -a <buffer alignment> -d <device ID> "
                       "-n <number of iterations>\n", argv[0]);
                exit(0);
        }
    }
}

static inline double get_cpu_time()
{
    struct timeval tv;
    if (gettimeofday(&tv, 0)) {
        printf("gettimeofday returned error\n");
        abort();
    }
    return tv.tv_sec + tv.tv_usec/1e6;
}

int main(int argc, char **argv)
{
    int     i, j;
    double  send;
    double  receive;

    parse_options(argc, argv);

    printf("Bandwidth test. Buffer alignment: %d. DeviceID: %d. Number of iterations: %d.\n\n",
           align, device, niters);

    printf("%20s %20s %20s\n",
            "Size(Bytes)", "Send(Bytes/sec)", "Receive(Bytes/sec)");

    for (i = 0; bufsizes[i] > 0; i++) {
        /* alloc CPU buffer */
        buf = (char*) _mm_malloc(bufsizes[i], align);
        if (buf == 0) {
            printf("Cannot allocate buffer (%d bytes)\n", bufsizes[i]);
            abort();
        }

        /* alloc MIC buffer */
#pragma offload target(mic: device) \
                in(buf : length(bufsizes[i]) free_if(0))
        {}

        /* The main benchmarking loop */
        send = 0;
        receive = 0;

        for (j = 0; j < niters; j++) {
            double start;

            /* send */
            start = get_cpu_time();
#pragma offload target(mic: device) \
                in(buf : length(bufsizes[i]) alloc_if(0) free_if(0))
            {}
            send += get_cpu_time() - start;

            /* receive */
            start = get_cpu_time();
#pragma offload target(mic: device) \
                out(buf : length(bufsizes[i]) alloc_if(0) free_if(0))
            {}
            receive += get_cpu_time() - start;
        }

        send /= niters;
        receive /= niters;

        printf("%20d %20.2f %20.2f\n",
               bufsizes[i], bufsizes[i]/send, bufsizes[i]/receive);

        /* free MIC buffer */
#pragma offload target(mic: device) \
                out(buf : length(bufsizes[i]) alloc_if(0))
        {}

        /* free CPU buffer */
        _mm_free(buf);
    }

    return 0;
}
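The figures printed by the listing above are the buffer size divided by the average time per transfer. The sketch below isolates that arithmetic. The helper names transfer_rate() and to_gigabytes_per_sec() are hypothetical, and the guard against a zero elapsed time is an added assumption that the original listing does not include.

```c
#include <stddef.h>

/* Hypothetical helper (not in the original sample): average transfer rate
 * in bytes per second, given the time accumulated over all iterations.
 * Returns 0.0 for a non-positive iteration count or elapsed time; this
 * guard is an assumption added here to avoid dividing by zero. */
static double transfer_rate(size_t bytes, double total_seconds, int iterations)
{
    double avg_seconds;

    if (iterations <= 0 || total_seconds <= 0.0)
        return 0.0;

    avg_seconds = total_seconds / iterations;  /* mean time per transfer */
    return (double)bytes / avg_seconds;
}

/* Converts bytes per second to decimal gigabytes per second (1 GB = 1e9 bytes). */
static double to_gigabytes_per_sec(double bytes_per_sec)
{
    return bytes_per_sec / 1e9;
}
```

For instance, transfer_rate(bufsizes[i], send, niters), called before send is divided by niters, would give the same value the listing prints in the Send column. The conversion helper is only a convenience for reading large results.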

Takeaways

This article shows how to use _mm_malloc() and _mm_free() instead of malloc() and free() to get data buffers aligned on 4K boundaries. 4K boundaries are optimal for DMA transfers. This article also provides code to measure transfer rates for various buffer sizes. This can assist you in determining the optimal buffer sizes for your data.

NEXT STEPS

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

For more complete information about compiler optimizations, see our Optimization Notice.