No performance boost on reduction using ArBB add_reduce: Looking for reasons?


I was trying out some of the ArBB code samples and decided to write a simple application that uses the ArBB add_reduce function. I'm running Windows 7 on an Intel Core 2 Duo E7500 CPU at 2.93 GHz.

With ARBB_OPT_LEVEL=O3 set, my sample runs only 1.6x faster than the plain C version on 2 cores.

Here is the code:


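#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <arbb.hpp>

#define LENGTH 40000000
#define NUM_RUNS 100

#define MICRO   (1000000)

using namespace std;
using namespace arbb;

// Current time in microseconds, read from the Windows performance counter.
ULONG64 readTime() {
  LARGE_INTEGER time_t, time_fre;
  QueryPerformanceCounter(&time_t);
  QueryPerformanceFrequency(&time_fre);
  return (ULONG64)time_t.QuadPart * MICRO / (ULONG64)time_fre.QuadPart;
}

float cpu_addReduce(float*a, int size) {
    float sum = 0;
    for(int i=0; i<size; i++) {
        sum += a[i];
    }
    return sum;
}

// NOTE: the post lost the text between the loop header above and the declaration
// of vec_a. The setup and C timing below are a reconstruction (an assumption)
// based on the variables used later; the fill value 1.0f is a guess.
int main() {
    ULONG64 startTime, stopTime, cTime, arbbTime;
    float* a = (float*)malloc(sizeof(float) * LENGTH);
    for(int i=0; i<LENGTH; i++) a[i] = 1.0f;

    float c = 0;
    startTime = readTime();
    for( int i=0; i < NUM_RUNS; i++) {
        c = cpu_addReduce(a, LENGTH);
    }
    stopTime = readTime();
    cTime = (stopTime - startTime) / NUM_RUNS;
    printf("Value on C reduction : %f\n", c);

    dense<f32> vec_a; bind(vec_a, a, LENGTH);
    f32 d;

//    arbb stuff
    // Warm-up
    d = add_reduce(vec_a);

    startTime = readTime();
    for( int i=0; i < NUM_RUNS; i++) {
        d = add_reduce(vec_a);
    }
    stopTime = readTime();
    arbbTime = (stopTime - startTime) / NUM_RUNS;
    printf("Value on ArBB reduction : %f\n", value(d));
    printf("%10s %12s  %-s\n", "Version", "Time(s)", "Speed Up");
    printf("%10s %12.6f  %-16.3f\n", "C", (double)cTime / MICRO, (double)cTime / cTime);
    printf("%10s %12.6f  %-16.3f\n", "ArBB", (double)arbbTime / MICRO, (double)cTime / arbbTime);

    return EXIT_SUCCESS;
}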

I'm a newbie with ArBB, but it seems to me this is a memory bandwidth problem (the number of LOAD/STORE and FP instructions is about the same). --Gennady
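To put a number on the bandwidth argument, here is a small back-of-the-envelope sketch in plain C++ (no ArBB needed). With LENGTH = 40000000 floats, one pass of the sum streams 160,000,000 bytes and performs one add per 4-byte load. The helper names below are hypothetical, and the elapsed time you plug in should come from your own measurement.

```cpp
#include <cassert>
#include <cstddef>

// Bytes read from memory by one pass of a sum over `count` floats.
constexpr double bytes_per_pass(std::size_t count) {
    return static_cast<double>(count) * sizeof(float);
}

// Arithmetic intensity of a plain sum: one floating-point add per float loaded.
constexpr double flops_per_byte() {
    return 1.0 / sizeof(float);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one pass that took `seconds`.
inline double effective_bandwidth_gbs(std::size_t count, double seconds) {
    assert(seconds > 0.0);
    return bytes_per_pass(count) / seconds / 1e9;
}
```

If the value computed from the single-threaded C time is already close to what the memory system can sustain, a second core has little room to help, whatever the compiler or runtime does.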

Please take a look at my Knowledge Base article "Three Things to Consider After Initial Speedups"; it should shed some light. Your kernel is not large enough to overcome the JIT-compile overhead, so please give it more work to do. ArBB is meant for large kernels that process large amounts of data, not small, incremental kernels. Item #1 in the article goes into more detail. Also keep in mind that the first run incurs the highest overhead, while all subsequent runs incur little to none.
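One way to see how much of the cost is one-time overhead is to time the first call separately from the rest. The posted code already makes one untimed warm-up call, so this is only a sketch of how to make that split visible. It uses standard C++ only; `Timing` and `time_first_and_steady` are hypothetical names, not part of ArBB.

```cpp
#include <cassert>
#include <chrono>

struct Timing {
    double first_call_s;  // first call, including any one-time cost such as JIT compilation
    double steady_avg_s;  // mean time of the remaining calls
};

// Calls `work` `runs` times and reports the first call apart from the others.
template <typename F>
Timing time_first_and_steady(F&& work, int runs) {
    assert(runs >= 2);
    using clock = std::chrono::steady_clock;
    auto seconds = [](clock::duration d) {
        return std::chrono::duration<double>(d).count();
    };

    auto t0 = clock::now();
    work();                                  // first call, timed on its own
    auto t1 = clock::now();
    for (int i = 1; i < runs; ++i) work();   // steady-state calls
    auto t2 = clock::now();

    return { seconds(t1 - t0), seconds(t2 - t1) / (runs - 1) };
}
```

Pass the reduction as a lambda and compare the two fields: a large gap means the first call is dominated by setup, and a small one means the per-call cost itself is what limits the speedup.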

Furthermore, I see no ArBB functions in this code. Please take a look at the User's Guide section on creating ArBB functions. add_reduce has to be used inside an ArBB function for it to be evaluated on multiple cores. Unless an ArBB function with the correct input/output signature is invoked through call(), the code is evaluated as if it were serial C++.
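A minimal sketch of what that could look like for the reduction in the question. The ArBB branch follows the pattern described above (inputs by value, outputs by non-const reference, invoked through call()), but it is written from memory of the API and has not been compiled against ArBB, so check it against the User's Guide. `arbb_sum`, `sum_floats`, and `HAVE_ARBB` are hypothetical names; the serial fallback exists only so the snippet builds where the ArBB headers are not installed.

```cpp
#include <cstddef>

#if defined(__has_include)
#  if __has_include(<arbb.hpp>)
#    include <arbb.hpp>
#    define HAVE_ARBB 1
#  endif
#endif

#ifdef HAVE_ARBB
// ArBB function (assumed signature style): input by value, output by reference.
void arbb_sum(arbb::dense<arbb::f32> in, arbb::f32& out) {
    out = arbb::add_reduce(in);
}
#endif

// Sums n floats, through an ArBB function when ArBB is available.
float sum_floats(float* data, std::size_t n) {
#ifdef HAVE_ARBB
    arbb::dense<arbb::f32> v;
    arbb::bind(v, data, n);            // same bind() usage as in the question
    arbb::f32 result;
    arbb::call(arbb_sum)(v, result);   // invoke through call() rather than directly
    return arbb::value(result);
#else
    float s = 0.0f;                    // serial fallback, no ArBB involved
    for (std::size_t i = 0; i < n; ++i) s += data[i];
    return s;
#endif
}
```

In the posted benchmark, the two direct `d = add_reduce(vec_a);` statements would become `call(arbb_sum)(vec_a, d);`, with the warm-up call kept outside the timed loop.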
