"However, I would strongly recommend against relying on this, as the smallest change could affect the behaviour so that it is no longer what you may expect."
You're preaching to the choir. :-)
"In particular, the compiler may choose to generate the write to pInstance before it invokes the constructor of the new Singleton object"
Even with optimisation turned off? Remember, I'm trying to crack through "if it's not broken, don't fix it", so I'll have to be convincing.
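To make sure we are talking about the same hazard, here is a hand-written sketch of what I understand the quoted reordering to mean. The `Singleton` body, its `value` member and the function name `publishBeforeConstruct` are placeholders of mine, not the real code, and whether a particular compiler actually emits this order at a given optimisation level is exactly the open question.

```cpp
#include <cassert>
#include <new>

// Placeholder class; the real Singleton's contents are not shown in this thread.
struct Singleton {
    int value;
    Singleton() : value(42) {}
};

Singleton* pInstance = nullptr;

// Hand-written version of one possible expansion of `pInstance = new Singleton;`.
// Returns true if the pointer was already non-null before the constructor ran.
bool publishBeforeConstruct() {
    void* raw = ::operator new(sizeof(Singleton));   // 1. allocate raw memory
    pInstance = static_cast<Singleton*>(raw);        // 2. publish the pointer (too early)
    bool publishedEarly = (pInstance != nullptr);    // a racing reader would skip the lock here
    Singleton* p = new (raw) Singleton();            // 3. only now run the constructor
    assert(p->value == 42);                          // fully constructed from this point on
    return publishedEarly;
}
```

Between steps 2 and 3, a second thread doing the unlocked first check would see a non-null `pInstance` and go on to use an object whose constructor has not run yet.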
"If you are going to rely on compiler and platform specific behaviour, then you are better off using the compiler-specific atomic primitives such as the gcc __sync_xxx builtins. Such primitives will have a documented consequence on the compiler behaviour, and might therefore provide the requisite guarantees."
That seems like overkill: the __sync_xxx builtins are documented as "full barrier", which is more expensive than I need. All I need is a release-store and a load-acquire, each of which shouldn't take more than a single MOV of properly aligned data on x86.
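For comparison, this is roughly what I mean by release-store and load-acquire, written as a minimal sketch with C++11 `std::atomic`. It assumes a C++11 compiler is available, which may not be the case for the code base in question, and the names `initMutex`, `value_` and `value()` are made up for illustration.

```cpp
#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance() {
        // Load-acquire: if we see a non-null pointer, we also see the
        // constructor's writes that happened before the release-store.
        Singleton* p = pInstance.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(initMutex);
            // Second check under the lock; the mutex already orders this load.
            p = pInstance.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Singleton;
                // Release-store: the pointer is published only after construction.
                pInstance.store(p, std::memory_order_release);
            }
        }
        return p;
    }

    int value() const { return value_; }   // hypothetical accessor, for illustration

private:
    Singleton() : value_(42) {}
    int value_;                             // hypothetical member, for illustration

    static std::atomic<Singleton*> pInstance;
    static std::mutex initMutex;            // hypothetical name for the init lock
};

std::atomic<Singleton*> Singleton::pInstance(nullptr);
std::mutex Singleton::initMutex;
```

On x86, mainstream compilers typically turn the acquire load and the release store into plain MOV instructions, so the fast path should cost no more than the unguarded read, while the ordering requirement is stated in the source rather than left to chance.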