In my project I had to deal with so-called mode collapse. The term usually refers to the problem where all (or most) of the generator's outputs are identical. But what causes mode collapse, and how can we fight it?
Unfortunately, as I found out, mode collapse can be triggered in a seemingly random fashion, which makes experimenting with Generative Adversarial Network (GAN) architectures very difficult.
In the real world, distributions are complicated and multimodal: the probability distribution that describes the data may have multiple “peaks” where different sub-groups of samples are concentrated. In such a case the generator can learn to produce images from only one of the sub-groups, causing mode collapse. This is what happened in my research: I was getting the same output for different input noise vectors.
I’ve found a few ways to tackle mode collapse:
1. Directly encourage diversity
For this purpose it’s possible to use minibatch discrimination and feature matching (see Improved Techniques for Training GANs). Minibatch discrimination gives the discriminator the power to compare samples across a batch, helping it determine whether the batch is real or fake. Feature matching modifies the generator cost function to factor in the diversity of generated batches. It does this by matching statistics of discriminator features for fake batches to those of real batches.
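To make feature matching concrete, here is a minimal numpy sketch (the function name and the random feature arrays are mine, purely for illustration): the generator is trained to match the mean activations of some hidden discriminator layer on fake batches to those on real batches.

```python
import numpy as np

def feature_matching_loss(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Feature matching: penalize the squared distance between the mean
    activations of an intermediate discriminator layer on real vs. fake
    batches. Inputs have shape (batch_size, n_features)."""
    mu_real = real_feats.mean(axis=0)
    mu_fake = fake_feats.mean(axis=0)
    return float(np.sum((mu_real - mu_fake) ** 2))

# Stand-in features: in a real GAN these come from a discriminator layer.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(64, 16))
fake = rng.normal(0.5, 1.0, size=(64, 16))
loss = feature_matching_loss(real, fake)
```

Because the loss compares batch-level statistics rather than individual samples, a generator that collapses to a single output produces the wrong batch statistics and is penalized.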
2. Unrolled GANs
Unrolled Generative Adversarial Networks. Unrolled GANs let the generator “unroll” updates of the discriminator in a fully differentiable way. Instead of learning to fool the current discriminator, the generator learns to maximally fool the discriminator after it has had a chance to respond, thus taking counterplay into account.
The downsides of this approach are increased training time (each generator update has to simulate multiple discriminator updates) and a more complicated gradient calculation (backpropagating through an optimiser update step can be difficult).
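Here is a toy, self-contained sketch of the unrolling idea (the 1-D setup and all names are mine, not from the paper): the generator outputs a constant, the discriminator is a scalar logistic classifier, and the generator loss is evaluated against the discriminator *after* a few simulated gradient steps. A real implementation would backpropagate through those unrolled steps with autodiff rather than use finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def disc_loss(d, real, fake):
    # standard GAN discriminator loss: -log D(real) - log(1 - D(fake)),
    # with D(x) = sigmoid(d * x) for a scalar parameter d
    return -np.mean(np.log(sigmoid(d * real))) - np.mean(np.log(1 - sigmoid(d * fake)))

def unrolled_disc(d, real, fake, k=3, lr=0.1, eps=1e-5):
    # simulate k discriminator gradient steps (finite-difference gradient
    # for brevity; autodiff frameworks differentiate through these steps)
    for _ in range(k):
        grad = (disc_loss(d + eps, real, fake) - disc_loss(d - eps, real, fake)) / (2 * eps)
        d = d - lr * grad
    return d

real = np.array([1.0, 1.2, 0.8])   # real data near +1
g = -1.0                           # generator's (collapsed) output
fake = np.full(3, g)
d_unrolled = unrolled_disc(0.0, real, fake)
# The generator loss is computed against the unrolled discriminator,
# so the generator anticipates the discriminator's counter-move:
gen_loss = -np.mean(np.log(sigmoid(d_unrolled * fake)))
```

The point of the exercise: the generator's gradient now reflects how the discriminator will punish a collapsed output, which discourages collapsing in the first place.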
3. Multiple GANs
Try Boosting Generative Models, a mix of boosting and traditional neural network approaches (see AdaGAN: Boosting Generative Models). But the implementation can be complicated, and training time will increase as well.
But all the methods described above require substantial changes to the model. To avoid this, there is one more approach:
4. Wasserstein GANs
Really interesting ideas are described in the paper about WGANs (Wasserstein GAN).
We use the Wasserstein distance. Let $\Pi(P_r, P_g)$ be the set of all joint distributions $\gamma(x, y)$ whose marginal distributions are $P_r$ and $P_g$. Then

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]$$
We use this distance as the loss function. This is only a general explanation; WGANs deserve a separate post.
Unfortunately, computing the Wasserstein distance exactly is intractable. The paper shows how to compute it approximately: by the Kantorovich-Rubinstein duality, $W$ is equivalent to

$$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$$

where the supremum is taken over all 1-Lipschitz functions $f$.
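To build intuition for what this distance measures, note that in one dimension, for two equal-size empirical samples, the Wasserstein-1 distance has a closed form: sort both samples and average the pointwise distances. A small numpy illustration (this example is mine, not from the paper):

```python
import numpy as np

def wasserstein_1d(x: np.ndarray, y: np.ndarray) -> float:
    """W1 distance between two equal-size 1-D empirical distributions:
    the optimal transport plan simply matches sorted samples pairwise."""
    assert x.shape == y.shape
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])
w = wasserstein_1d(a, b)  # every sample is shifted by 1, so W1 = 1.0
```

Unlike the Jensen-Shannon divergence used implicitly by a standard GAN, this distance keeps changing smoothly as one distribution slides toward the other, which is what gives the generator useful gradients.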
In short, according to the paper, WGAN samples are more detailed and suffer from mode collapse much less than samples from a standard GAN. In fact, the authors report never running into mode collapse at all with WGANs!
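In practice, the duality above turns into two simple training signals plus weight clipping. Here is a hedged numpy sketch of those pieces (function names and the sample critic scores are mine, for illustration only):

```python
import numpy as np

def critic_loss(f_real: np.ndarray, f_fake: np.ndarray) -> float:
    # The critic maximizes E[f(real)] - E[f(fake)]; we negate it so that
    # minimizing this loss maximizes the Wasserstein estimate.
    return float(-(np.mean(f_real) - np.mean(f_fake)))

def generator_loss(f_fake: np.ndarray) -> float:
    # The generator tries to raise the critic's score on fake samples.
    return float(-np.mean(f_fake))

def clip_weights(weights, c=0.01):
    # Weight clipping crudely enforces the 1-Lipschitz constraint on f
    # by clamping every critic parameter to [-c, c] after each update.
    return [np.clip(w, -c, c) for w in weights]

f_real = np.array([0.9, 1.1, 1.0])   # critic scores on real samples
f_fake = np.array([-0.2, 0.1, 0.1])  # critic scores on generated samples
c_loss = critic_loss(f_real, f_fake)
clipped = clip_weights([np.array([0.5, -0.5, 0.003])])
```

Note that the critic outputs are unbounded scores, not probabilities, so the resulting loss values are not confined to any particular range.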
I tried the last approach. Fortunately, I was able to find the Wasserstein metric in neon, and it works: my generated samples are no longer identical.
I’d appreciate any comments or questions; they really help my understanding.
P.S. By the way, does anyone know how to interpret the W loss? For instance, can it be negative? It isn’t clear to me. What do you think?
Thanks to Aiden Nibali for the article about mode collapse!