Do scalable allocators work like a pool...how deep?

Does scalable_realloc work like a pool, or does it only grab the small chunk actually needed?

My application scales poorly, even though each thread is completely independent and the work/overhead ratio is significant. It happens to allocate 250K tiny vectors (about 5 pointers in each) in a little under 2 seconds, and 16 threads are doing this simultaneously. My test application that performs similar "work" but does arithmetic in place of all these allocations scales perfectly.

Does scalable_realloc etc. grab significant-size chunks and dole them out as needed (a la a pool), or are all these little allocations between threads possibly competing via the OS? I am tempted to make my own pool, but then I lose lots of other TBB benefits. I do not see an appreciable change when switching from the standard allocators to the scalable ones. Does that mean my scaling issues are elsewhere?
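For reference, this is roughly the shape of the switch I made (Node and make_vectors are placeholder names for this sketch, not my real code) -- just swapping the allocator template argument on the tiny vectors:

#include <cstddef>
#include <vector>
#include <tbb/scalable_allocator.h>

struct Node;   // stand-in for whatever the ~5 pointers actually point to

// Current setup: every small allocation goes through the default heap.
typedef std::vector<Node*> TinyVecStd;

// With the scalable allocator: small blocks should come from per-thread pools.
typedef std::vector<Node*, tbb::scalable_allocator<Node*> > TinyVecTbb;

void make_vectors(std::vector<TinyVecTbb>& out, std::size_t n)
{
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        TinyVecTbb v;
        v.reserve(5);        // ~5 pointers each, as in the real workload
        out.push_back(v);    // copy is fine for the purposes of the sketch
    }
}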

Thread Profiler seems to think all threads are nearly 100% utilized. Could it perhaps be misinterpreting waiting on memory as "work"?

Thanks,

Brian Rundle

brundle:

Thread Profiler seems to think all threads are nearly 100% utilized. Could it perhaps be misinterpreting waiting on memory as "work"?

Yes, it could. Accesses to main memory, cache-line transfers between cores (i.e. sharing or false sharing), pipeline stalls, etc. are all counted as useful CPU work. So 100% CPU utilization doesn't mean 100% efficiency. A cache-line transfer can take up to 300 cycles, so in the limit efficiency can be as low as 0.3%.
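As a rough illustration (the 64-byte line size and the whole setup here are assumptions for the sketch, not your code): both versions below keep two cores 100% "busy", but the unpadded one spends most of its cycles shipping a single cache line back and forth between cores.

#include <atomic>
#include <thread>

// Two per-thread counters that happen to share one cache line: each thread
// only touches its own counter, yet the line ping-pongs between cores.
struct SharedLine {
    std::atomic<long> a;
    std::atomic<long> b;
};

// Giving each counter its own 64-byte line (line size assumed) removes that
// traffic. A profiler shows ~100% CPU utilization for both versions.
struct PaddedLine {
    alignas(64) std::atomic<long> a;
    alignas(64) std::atomic<long> b;
};

template <typename Counters>
void hammer(Counters& c, long iters)
{
    c.a = 0;
    c.b = 0;
    std::thread t1([&c, iters] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&c, iters] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main()
{
    SharedLine s;
    PaddedLine p;
    hammer(s, 100000000L);   // slow: the cache line bounces between cores
    hammer(p, 100000000L);   // much faster, same instruction count
    return 0;
}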

As for your main question, I believe it must work like a pool, but I don't know the exact details offhand.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Allocations up to 8-something kB (currently) work like a pool; bigger ones go straight to malloc() by default.
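At the C level the entry points look like this (a minimal sketch only; the sizes are arbitrary examples, and the pool threshold mentioned above is version-dependent):

#include <tbb/scalable_allocator.h>

int main()
{
    // Small request: served from the allocator's per-thread pool.
    void* small_block = scalable_malloc(64);

    // Large request (1 MB): above the pool threshold, so it falls through
    // to the regular malloc()/OS path by default.
    void* big_block = scalable_malloc(1024 * 1024);

    // Growing a small block stays inside the pool machinery when it can.
    small_block = scalable_realloc(small_block, 128);

    scalable_free(small_block);
    scalable_free(big_block);
    return 0;
}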

One thing to try would be to take a flat VTune profile and see where it says the time is being spent.
