I am the author of an odd piece of software that (among other things) replaces the custom heap used internally by a particular 3rd party piece of 32-bit software with a memory manager of the user's choice. I have a bunch of reports from users that crashes occur consistently when tbbmalloc.dll is used while the large-address-aware (LAA) flag is enabled on the executable. This does not happen when other heaps are used.
Unfortunately I can't really tell you anything useful about this possible bug, as the code where the crash occurs is in the 3rd party application which is closed source. I can't even guarantee that there is a bug, as the 3rd party application in question has its own (unknown) bugs that interact with the memory management algorithm.
Still, it might be worth checking whether the tests you normally run on 32-bit TBBMM still pass with LAA enabled when compiled for 32-bit targets, on machines that either have /3GB enabled or are running a 64-bit OS.
edit: Never mind, it appears to be specific to that particular piece of 3rd party software, and thus likely an unusually deterministic bug in it.
edit2: Ignore the "never mind" above; it was incorrect.
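For anyone wondering why LAA matters at all: with the flag set, a 32-bit process can receive addresses at or above 0x80000000, and any code that treats an address as a signed 32-bit value starts misbehaving. I don't know that this is what happens here; the sketch below only illustrates that general class of bug, using simulated 32-bit addresses and hypothetical function names (nothing in it comes from tbbmalloc or the 3rd party application).

```cpp
#include <cstdint>

// Simulated 32-bit addresses. Both functions are hypothetical illustrations,
// not code from tbbmalloc or from the 3rd party application.

// Buggy ordering test: reinterprets the addresses as signed values, so any
// address at or above 0x80000000 becomes negative and sorts before every
// address in the lower 2 GB. (The conversion wraps on all mainstream
// compilers and is guaranteed to wrap from C++20 onward.)
inline bool below_signed(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a) < static_cast<std::int32_t>(b);
}

// Correct ordering test: addresses are compared as unsigned values.
inline bool below_unsigned(std::uint32_t a, std::uint32_t b) {
    return a < b;
}
```

With both addresses below 2 GB the two functions agree, which is why this kind of bug can stay hidden until LAA is switched on. With 0x7FFF0000 and 0x80010000 they disagree: only the unsigned version reports the first address as lower.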
side note: I recently benchmarked a few heaps, and the results might interest TBBmalloc developers:
Those are average times in nanoseconds per malloc-or-free for all threads combined, so smaller numbers are better and 2-thread results should ideally be half of the comparable 1-thread result. These were measured on a Core 2 Duo. TCM refers to tcmalloc, TH3 refers to a heap I wrote (which, so far, lacks critical features like basic anti-fragmentation measures). All benchmarks listed there concentrate heavily on small allocation sizes. The "chained" benchmark is basically a measure of memory manager impact on thread creation/destruction times: it's roughly the same thing as the isolated benchmark, but run from threads that are being continually created and destroyed. The "oblivion" benchmark times are measured using RDTSC from inside the 3rd party software in question (Oblivion, by Bethesda Softworks).
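To make the unit concrete, here is a minimal sketch of how "nanoseconds per malloc-or-free for all threads combined" can be computed: wall-clock time divided by the total number of operations across all threads. This is not my actual benchmark code; `churn`, `ns_per_op`, and the allocation pattern are assumptions made up for illustration, and it uses `std::chrono` rather than RDTSC.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical worker: repeatedly allocates a batch of small blocks and then
// frees them. max_size must be at least 1.
void churn(std::size_t rounds, std::size_t batch, std::size_t max_size) {
    std::vector<void*> blocks(batch);
    for (std::size_t r = 0; r < rounds; ++r) {
        for (std::size_t i = 0; i < batch; ++i) {
            blocks[i] = std::malloc(1 + (i * 37 + r) % max_size);
            // Touch the block so the allocation is not optimized away.
            if (blocks[i]) *static_cast<volatile char*>(blocks[i]) = 1;
        }
        for (void* p : blocks) std::free(p);
    }
}

// Wall-clock nanoseconds divided by the total number of malloc and free
// calls made by all threads together. With perfect scaling, doubling the
// thread count halves the result.
double ns_per_op(unsigned threads, std::size_t rounds, std::size_t batch,
                 std::size_t max_size) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(churn, rounds, batch, max_size);
    for (auto& th : pool) th.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    // Each block accounts for one malloc and one free.
    double total_ops = 2.0 * threads * rounds * batch;
    return std::chrono::duration<double, std::nano>(elapsed).count() / total_ops;
}
```

Note that this sketch times thread startup and shutdown along with the allocations, so `rounds` needs to be large for that overhead to become negligible. Keeping `rounds` small and calling `ns_per_op` in a loop would instead emphasize thread creation/destruction cost, which is roughly the idea behind the "chained" variant.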
XP - slow, with a horrible corner-case
FastMM4 - fast, but poor scalability, and with a horrible corner-case
TBBMM - solid performance, faster than FastMM4 here
TCM - solid performance, a little faster than TBBMM, though a touch slow on thread creation/destruction
TH3 - very fast (though missing important functionality atm)
Also I've heard some details that I don't have actual numbers for (or not in comparable units anyway):
Reportedly, on a Pentium 4 (with hyperthreading enabled), on the "Oblivion" benchmark, FastMM4 outperforms TBBMM by 25% and a predecessor of TH3 by 70%. Nothing is known about other benchmarks on that hardware.
Reportedly, on Core 2s, Win7 performance in all benchmarks is similar to WinXP, except that the corner case on the "Oblivion" benchmark has been fixed.