I'm running ArBB v1.0.0.030 on 64-bit Ubuntu 10.04 with this simple test:
dense maa=fill(3, ARRAY_SIZE);
where ARRAY_SIZE is on the order of 2^10 and PDS_ITER is about 10^4.
In addition, I execute the same operations using standard arrays (and no
parallelism except vectorization using -O3 and SSE4.1 on my Intel
Core i7 CPU 860 @ 2.80GHz).
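For reference, a plain-array baseline of this kind could look roughly like the sketch below. The actual test body is not shown in this post, so the element-wise multiply-add in `kernel` is a hypothetical stand-in, and `run_baseline` is an illustrative name; only the ARRAY_SIZE and PDS_ITER magnitudes come from the description above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sizes matching the orders of magnitude quoted in the post.
constexpr std::size_t ARRAY_SIZE = 1u << 10;  // 2^10 elements
constexpr int PDS_ITER = 10000;               // 10^4 repetitions

// Hypothetical stand-in kernel: the original test body is not shown,
// so an element-wise multiply-add is used purely for illustration.
void kernel(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& out) {
    assert(a.size() == out.size() && b.size() == out.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] * b[i] + out[i];
    }
}

// Runs the kernel PDS_ITER times on a single thread and returns the
// elapsed wall-clock time in seconds. Any vectorization is left to the
// compiler.
double run_baseline(std::vector<float>& out) {
    const std::vector<float> a(ARRAY_SIZE, 3.0f);
    const std::vector<float> b(ARRAY_SIZE, 0.5f);
    out.assign(ARRAY_SIZE, 0.0f);

    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < PDS_ITER; ++it) {
        kernel(a, b, out);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Built with something like `g++ -O3 -msse4.1`, this gives a single-threaded reference time to compare the ArBB and TBB variants against, provided all three run the same kernel.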
And finally, I execute the same operations with my own array library, which
is built on tbb::parallel_for and SIMD-friendly loops (it has pretty much the
same goals as ArBB but is not as sophisticated, so I want to compare the
performance of my solution to ArBB).
I have exported:
Running the example given above, I don't get any verbose output from
ArBB, and I can see that only 1 core is used and it takes about 10 s to complete.
On standard arrays, it takes about 3 s (also one core only).
And my own library needs about 2 s utilizing all 8 hardware threads (4 cores
with Hyper-Threading) and SSE (as can be seen from an assembly dump). The
speedup is not as good as one would hope for because the
arithmetic-to-memory ratio is bad and memory bandwidth becomes the
bottleneck (I think).
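To put a rough number on that ratio: the kernel I actually run is not reproduced here, so assume, purely as an illustration, an element-wise add c_i = a_i + b_i on 4-byte floats. Each element then costs one floating-point operation and three memory accesses (two loads, one store):

```latex
I \;=\; \frac{\text{flops per element}}{\text{bytes moved per element}}
  \;=\; \frac{1}{3 \times 4\ \text{bytes}}
  \;=\; \frac{1}{12}\ \text{flop/byte}
```

With a ratio that low, extra threads only help until the memory system is saturated, and this only applies when the arrays are streamed from main memory rather than served from cache.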
Anyways, so far I have found no way to make ArBB actually use more than one
thread, and I don't know whether SSE is used, but the performance is quite bad.
I guess I'm doing something wrong?