ArBB Performance Problems


I'm running ArBB v1.0.0.030 on 64-bit Ubuntu 10.04 with this simple test:

dense maa=fill(3, ARRAY_SIZE);
/*
{
    const_range r=maa.read_only_range();

    for(int i=0;i<10;i++)
    {
        printf("maa[%d]=%d\n",i,maa[i]);
    }
    for(int i=ARRAY_SIZE-10;i<ARRAY_SIZE;i++)
    {
        printf("maa[%d]=%d\n",i,maa[i]);
    }
}
*/
dense maa2=maa+1;

dense maa3=maa2*maa2;

for(int p=0;p<PDS_ITER;p++)
{
    maa3=maa3*maa3;
}

where ARRAY_SIZE is on the order of 2^10 and PDS_ITER is about 10^4.

In addition, I execute the same operations using standard arrays (with no
parallelism except vectorization using -O3 and SSE4.1 on my Intel
Core i7 CPU 860 @ 2.80GHz).
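
For readers who want to reproduce the standard-array comparison, the following is a minimal sketch of what such a baseline might look like. It is not the poster's actual code: the function name run_baseline and the choice of 32-bit unsigned elements (so that the repeated squaring wraps around instead of overflowing a signed type) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain-array version of the test: fill with 3, add 1, square once,
// then square the whole array `iterations` more times.
// Assumption: 32-bit unsigned elements, so overflow wraps modulo 2^32.
std::vector<std::uint32_t> run_baseline(std::size_t array_size, int iterations)
{
    std::vector<std::uint32_t> maa(array_size, 3u);
    std::vector<std::uint32_t> maa2(array_size);
    std::vector<std::uint32_t> maa3(array_size);

    // maa2 = maa + 1
    for (std::size_t i = 0; i < array_size; ++i)
        maa2[i] = maa[i] + 1u;

    // maa3 = maa2 * maa2
    for (std::size_t i = 0; i < array_size; ++i)
        maa3[i] = maa2[i] * maa2[i];

    // Each pass reads and writes the whole array for one multiply per element.
    for (int p = 0; p < iterations; ++p)
        for (std::size_t i = 0; i < array_size; ++i)
            maa3[i] = maa3[i] * maa3[i];

    return maa3;
}
```

Calling run_baseline(ARRAY_SIZE, PDS_ITER) and compiling with -O3 would give a single-threaded reference point, with the inner loops left to the compiler's auto-vectorizer.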

And finally, I execute the same operations with my own array library, based on tbb::parallel_for and SIMD-friendly code (it has pretty much the
same goals as ArBB but is not as sophisticated, so I want to compare the
performance of my solution to ArBB).

I have exported:

export ARBB_VERBOSE=1
export ARBB_NUM_CORES=8
export ARBB_OPT_LEVEL=O3

Running the example given above, I don't get any verbose output from
ArBB and I can see that only 1 core is used and it takes about 10s to
execute.

On standard arrays, it takes about 3 s (also one core only).

And my own library needs about 2s utilizing 8 cores and SSE (as can be
seen from an assembly dump). The speedup is not as good as one would
hope for with 8 cores because the arithmetic to memory ratio is bad and
the FSB becomes the bottleneck (I think).

Anyways, so far I found no way to make ArBB actually use more than one
thread and I don't know if SSE is used but the performance is quite bad.
I guess I'm doing something wrong??

Thx,
Alex


By switching to the libarbb_dev (instead of libarbb) I now get the verbose output:

>> ArBB GC: the heap is allocated, reserve size = 8192 bytes !
>> ArBB GC: the heap is allocated, allocated size = 4096 bytes !
>> ArBB HLO: Disabled Opt:dsfusion block metacheck wrtcheck invDemote fusemap aggressive_memopt dump_cpp recursive_call accessor scalarize_map
>> ArBB GC: the heap is allocated, reserve size = 1073741824 bytes !
>> ArBB GC: the heap is allocated, allocated size = 134217728 bytes !
>> ArBB HLO: compilation time 546 us
>> ArBB HLO: compilation time 445 us
>> ArBB CCG: proc(DPtask_0_2), start compilation...
>> ArBB CCG: DPtask_0_2 compilation time 1616 us
>> ArBB CCG: proc(eu_3_1), start compilation...
>> ArBB CCG: eu_3_1 compilation time 967 us
>> ArBB LLO: LLO compilation time for EU: 1284 us
>> ArBB LLO: The static count of memory allocations in current EU is: 2
>> ArBB LLO: The number of spawned tasks in current EU is: 1
>> ArBB HLO: compilation time 189 us
>> ArBB CCG: proc(eu_0_3), start compilation...
>> ArBB CCG: eu_0_3 compilation time 320 us
>> ArBB LLO: LLO compilation time for EU: 100 us
>> ArBB LLO: The static count of memory allocations in current EU is: 0
>> ArBB LLO: The number of spawned tasks in current EU is: 0
>> ArBB HLO: compilation time 261 us
>> ArBB CCG: proc(DPtask_0_5), start compilation...
>> ArBB CCG: DPtask_0_5 compilation time 1488 us
>> ArBB CCG: proc(eu_8_4), start compilation...
>> ArBB CCG: eu_8_4 compilation time 964 us
>> ArBB LLO: LLO compilation time for EU: 500 us
>> ArBB LLO: The static count of memory allocations in current EU is: 2
>> ArBB LLO: The number of spawned tasks in current EU is: 1
>> ArBB HLO: compilation time 163 us
>> ArBB CCG: proc(eu_7_6), start compilation...
>> ArBB CCG: eu_7_6 compilation time 378 us
>> ArBB LLO: LLO compilation time for EU: 83 us
>> ArBB LLO: The static count of memory allocations in current EU is: 0
>> ArBB LLO: The number of spawned tasks in current EU is: 0
>> ArBB GC: [GC count = 1, Freed Objs = 39, Freed Bytes = 117450496 bytes, Live Objs = 3, Live Bytes = 12583680 bytes, GC Time = 82 us, GC Total Time = 82 us]
>> ArBB GC: [GC count = 2, Freed Objs = 28, Freed Bytes = 117447680 bytes, Live Objs = 3, Live Bytes = 12583680 bytes, GC Time = 75 us, GC Total Time = 157 us]

// Then the line "ArBB GC: [GC count = 2, Freed Objs = 28, Freed Bytes = 117447680 bytes,
Live Objs = 3, Live Bytes = 12583680 bytes, GC Time = 75 us, GC Total
Time = 157 us]" is repeated very often. Apparently ArBB is garbage collecting itself to death? :S

A lot of optimizations are "disabled":

>> ArBB HLO: Disabled Opt:dsfusion block metacheck wrtcheck
invDemote fusemap aggressive_memopt dump_cpp recursive_call accessor
scalarize_map

Could that be the cause? Why are they disabled if I export ARBB_.._OPTIM=O3?

Alex

Ok..some more information I gathered:

Apparently OPT=O3 is picked up correctly, and if I move the loop of my example
into an application of map (instead of using the * operator on maa3 many times) and make it static (_for), then at last all my cores are used. It still seems that the simpler solution in my own library is about twice as fast
(when applying a similar map).
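
To illustrate why moving the loop into a per-element function can help, here is a minimal plain C++ sketch (not ArBB code) contrasting the two loop orders. The function names and the 32-bit unsigned element type are assumptions for illustration only. The first version streams the whole array through memory on every iteration; the second loads each element once, squares it repeatedly in a local variable, and stores it once, which is roughly what a map with the loop inside makes possible.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Whole-array form: one load, one multiply, and one store per element per pass.
std::vector<std::uint32_t> square_whole_array(std::vector<std::uint32_t> data,
                                              int iterations)
{
    for (int p = 0; p < iterations; ++p)
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = data[i] * data[i];
    return data;
}

// Per-element form: the iteration loop runs on a local value,
// so each element is loaded and stored only once.
std::vector<std::uint32_t> square_per_element(std::vector<std::uint32_t> data,
                                              int iterations)
{
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::uint32_t v = data[i];
        for (int p = 0; p < iterations; ++p)
            v = v * v;
        data[i] = v;
    }
    return data;
}
```

Both functions compute the same values; only the ratio of arithmetic to memory traffic differs.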

Are the operators applied to dense arrays parallelized at all?

Does ArBB (attempt) to use OpenCL currently?

Thx,
Alex

Hi, have you had a chance to read our knowledge base article Things to Consider After Initial Speedups? Please read through that first - it should answer your questions. I'd rather not go into too much detail here because we spent so much time trying to cover all the bases in that article.

I'm afraid the article doesn't answer my question. In the specific use case I presented in my first post, the performance of ArBB is really bad (when compared to a solution that uses standard C arrays with -O3 and SSE4.1 optimizations with icc).
So there is actually a slowdown rather than a speedup. It is clear that better performance can be achieved by restructuring this specific computation. Still, my use case is not (completely) unrealistic, and instances of it may occur in reality where no restructuring is possible (sadly, it is quite common that the ratio between calculation and memory access is not particularly good).

I don't see a specific reason why ArBB should perform that badly in this case, so either I'm misusing it somehow (if so, please tell me how), or there is a performance problem in ArBB for this use case, which should be fixed because it would make the library unusable (at least for me, and I guess for some other people as well).

Thx,
Alex

Is this performance on your first run or subsequent runs? All first runs will encounter a significant runtime compilation penalty, and once that code is cached, runs 2 through x should run quickly. Let me know if you are timing the first run or subsequent runs. The first run should never be timed due to that penalty, and as mentioned in the KB article I linked you to, ArBB shouldn't be used for applications that only need an ArBB function run once.
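
One way to separate the first-run penalty from steady-state performance is to time the same piece of work twice and compare. The following is a minimal sketch using only the C++ standard library; RunTimes and time_two_runs are hypothetical names introduced here, and the callable passed in stands for whatever computation is being measured.

```cpp
#include <chrono>

// Wall-clock durations, in seconds, of two consecutive runs of the same work.
struct RunTimes {
    double first_seconds;
    double second_seconds;
};

// Runs `work` twice and times each run separately, so that any one-time
// cost (such as runtime compilation) shows up only in first_seconds.
template <typename F>
RunTimes time_two_runs(F&& work)
{
    using clock = std::chrono::steady_clock;

    const auto t0 = clock::now();
    work();
    const auto t1 = clock::now();
    work();
    const auto t2 = clock::now();

    return { std::chrono::duration<double>(t1 - t0).count(),
             std::chrono::duration<double>(t2 - t1).count() };
}
```

If second_seconds is close to first_seconds, as reported later in this thread, the one-time cost is not what dominates the runtime.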

Looking at the verbose output I can see that only a small fraction of the time (<0.1s with overall runtime at 10s) is taken by the compilation but I have timed a second run anyways and the results are still as bad.

Alex

Are you encapsulating all of your ArBB computations within an ArBB function and then calling that ArBB function in the proper fashion, as demonstrated by the examples, not once but at least twice from main()? If you do not use the calling convention or use map(), none of the ArBB code will be threaded or vectorized; it will run serially.

Ah, that was the problem.
I was not aware that something like dense maa3=maa2*maa2; must be placed inside a call(doit)() to get any optimization.

Thx,
Alex
