No performance boost on reduction using ArBB add_reduce: looking for reasons?


I was trying out some of the ArBB code samples and decided to write a simple application that uses the ArBB add_reduce function. I'm running Windows 7 on an Intel Core 2 Duo E7500 CPU at 2.93 GHz.

Running my sample with ARBB_OPT_LEVEL=O3 gives me a speedup of just 1.6x on 2 cores.

Here is the code:


#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <arbb.hpp>

#define LENGTH 40000000
#define NUM_RUNS 100

#define MICRO   (1000000)

using namespace std;
using namespace arbb;

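// Returns the current time in microseconds, read from the
// Windows high-resolution performance counter.
ULONG64 readTime() {
  LARGE_INTEGER time_t, time_fre;
  QueryPerformanceCounter(&time_t);
  QueryPerformanceFrequency(&time_fre);
  return (ULONG64)time_t.QuadPart * MICRO / (ULONG64)time_fre.QuadPart;
}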

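// Serial C reference: sums the array with a plain loop.
float cpu_addReduce(float*a, int size) {
    float sum = 0;
    for(int i=0; i<size; i++)
        sum += a[i];
    return sum;
}

int main() {
    // [Lost from the posted listing: the allocation and initialisation of a,
    //  the declarations of startTime, stopTime, cTime and arbbTime, and the
    //  timing of cpu_addReduce() that produces cTime.]

    // Bind the C array to an ArBB dense container.
    dense<f32> vec_a; bind(vec_a, a, LENGTH);
    f32 d;

//    arbb stuff
    // Warm-up
    d = add_reduce(vec_a);

    startTime = readTime();
    for( int i=0; i < NUM_RUNS; i++) {
        d = add_reduce(vec_a);
    }
    stopTime = readTime();
    // Average time per reduction, in microseconds.
    arbbTime = (stopTime - startTime) / NUM_RUNS;
    printf("Value on ArBB reduction : %f\n", value(d));
    printf("%10s %12s  %-s\n", "Version", "Time(s)", "Speed Up");
    printf("%10s %12.6f  %-16.3f\n", "C", (double)cTime / MICRO, (double)cTime / cTime);
    printf("%10s %12.6f  %-16.3f\n", "ArBB", (double)arbbTime / MICRO, (double)cTime / arbbTime);

    return EXIT_SUCCESS;
}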

I'm a newbie with ArBB, but it seems to me this is a memory bandwidth problem (the number of LOAD/STORE and FP instructions is about the same). --Gennady
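A rough way to test the memory bandwidth idea is to work out how many bytes one reduction pass reads and divide by the measured time. The sketch below is plain C++ and does not use ArBB; the helper names `bytes_per_pass` and `effective_bandwidth_gbs` are made up for illustration and appear nowhere in the original post.

```cpp
#include <cassert>
#include <cstddef>

// Bytes read by one sum-reduction pass over n floats
// (each element is loaded once and added once).
inline double bytes_per_pass(std::size_t n) {
    return static_cast<double>(n) * sizeof(float);
}

// Effective bandwidth in GB/s for a pass that moved `bytes`
// in `seconds` of wall-clock time.
inline double effective_bandwidth_gbs(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

With LENGTH = 40000000 a pass reads 160 MB and performs one addition per 4 bytes loaded. If the figure computed from the serial loop's time is already close to what the memory system can sustain, a second core has little room to help, which would be consistent with a speedup below 2x.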

Please take a look at my Knowledge Base article "Three Things to Consider After Initial Speedups". That should shed some light. Your kernel is not large enough to overcome the JIT-compile overhead. Please give your kernel more work to do. ArBB is meant to be used for large kernels containing large amounts of data, not small, incremental kernels. Item #1 in the article goes into more detail. Also, please do not forget that the first run incurs the highest overhead and all subsequent runs incur little to none.
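To see how much of the measured time is first-run overhead, the first invocation can be timed separately from the later ones. This is a minimal sketch in standard C++ only; `Timing` and `time_first_and_steady` are hypothetical names, and the callable passed in stands for whatever work is being measured (for example, a lambda wrapping the reduction).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Seconds spent in the first call and the mean of the following calls.
struct Timing {
    double first_call_s;
    double steady_mean_s;
};

// Runs `work` once (timed on its own), then `runs` more times,
// and reports the two figures separately.
inline Timing time_first_and_steady(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;

    const auto t0 = clock::now();
    work();                                   // first call: includes any one-time cost
    const auto t1 = clock::now();
    for (int i = 0; i < runs; ++i) work();    // steady-state calls
    const auto t2 = clock::now();

    const std::chrono::duration<double> first = t1 - t0;
    const std::chrono::duration<double> rest  = t2 - t1;
    return Timing{ first.count(), rest.count() / runs };
}
```

A large gap between `first_call_s` and `steady_mean_s` points at one-time cost such as compilation; if the two are similar, the per-call work itself is the limit.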

Furthermore, I see no ArBB functions created in this code. Please take a look at the User's Guide concerning the creation of ArBB functions. You need to use add_reduce inside of an ArBB function in order for it to be evaluated for multiple cores. If an ArBB function with the correct input/output signature is not used with an invocation of call(), it will be evaluated as if it is serial C++ code.
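The change described above might look roughly like the sketch below: the reduction is moved into a function and invoked through call(). The names `arbb_sum` and `sum_array` are hypothetical, and the ArBB function signature (input by const reference, output by non-const reference) is an assumption that should be checked against the User's Guide. So that the sketch also builds on a machine without ArBB installed, it falls back to a plain serial loop when `<arbb.hpp>` is not found; that fallback is only scaffolding for the example.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__has_include)
#  if __has_include(<arbb.hpp>)
#    include <arbb.hpp>
#    define HAVE_ARBB 1
#  endif
#endif

#ifdef HAVE_ARBB
// ArBB function (assumed signature): the reduction lives inside it,
// so the runtime can compile and parallelise it when invoked via call().
void arbb_sum(const arbb::dense<arbb::f32>& in, arbb::f32& out) {
    out = arbb::add_reduce(in);
}
#endif

// Sums n floats, through an ArBB function if ArBB is available.
float sum_array(float* a, std::size_t n) {
    assert(a != nullptr || n == 0);
#ifdef HAVE_ARBB
    arbb::dense<arbb::f32> vec_a;
    arbb::bind(vec_a, a, n);          // same bind() usage as in the posted code
    arbb::f32 d;
    arbb::call(arbb_sum)(vec_a, d);   // invoke through call(), not directly
    return arbb::value(d);
#else
    float s = 0.0f;                   // serial fallback, no ArBB
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
#endif
}
```

In the posted benchmark, the timed loop would then repeat `call(arbb_sum)(vec_a, d);` in place of `d = add_reduce(vec_a);`.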
