# Mode collapse in GANs

By Elena O., published on August 21, 2017

Hello everyone!

In my project I had to deal with so-called mode collapse. The term usually refers to the problem where all, or most, of the generator's outputs are identical. But what causes mode collapse, and how can we combat it?

Unfortunately, as I figured out, mode collapse can be triggered in a seemingly random fashion, making it very difficult to play around with Generative Adversarial Network (GAN) architectures.

Real-world distributions are complicated and multimodal: the probability distribution that describes the data may have multiple “peaks” where different sub-groups of samples are concentrated. In such a case a generator can learn to yield images from only one of the sub-groups, causing mode collapse. This happened in my research: I was getting the same output for different input noise vectors.
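To make the idea of “peaks” concrete, here is a minimal sketch that counts how many modes a set of samples covers. It assumes a toy 1-D mixture with three peaks; the names `mode_coverage`, `modes`, and `radius` are hypothetical and only serve the illustration, and the "collapsed generator" is simulated rather than trained.

```python
import random


def mode_coverage(samples, modes, radius):
    """Return the indices of modes that have at least one sample within `radius`."""
    covered = set()
    for x in samples:
        # Index of the mode closest to this sample.
        nearest = min(range(len(modes)), key=lambda i: abs(x - modes[i]))
        if abs(x - modes[nearest]) <= radius:
            covered.add(nearest)
    return covered


rng = random.Random(0)
modes = [-4.0, 0.0, 4.0]  # hypothetical peak locations

# "Real" data: each sample is drawn around a randomly chosen peak.
real = [rng.gauss(rng.choice(modes), 0.3) for _ in range(1000)]
# Simulated collapsed generator: every sample lands near the middle peak.
collapsed = [rng.gauss(modes[1], 0.3) for _ in range(1000)]

print("modes covered by real data:", sorted(mode_coverage(real, modes, radius=1.0)))
print("modes covered by collapsed generator:", sorted(mode_coverage(collapsed, modes, radius=1.0)))
```

A check like this only works when the peak locations are known, which is rarely the case for images, but it is a handy sanity test on synthetic data.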

I’ve found a few ways to tackle mode collapse:

#### 1. Directly encourage diversity

For this purpose it’s possible to use minibatch discrimination and feature matching (see Improved Techniques for Training GANs). Minibatch discrimination lets the discriminator compare samples across a batch to help determine whether the batch is real or fake. Feature matching modifies the generator cost function to factor in the diversity of generated batches. It does this by matching statistics of discriminator features for fake batches to those of real batches.
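As a rough illustration of feature matching, the sketch below computes the squared distance between the mean feature vector of a real batch and that of a fake batch. It assumes the features have already been extracted from some intermediate discriminator layer; `real_features` and `fake_features` are hypothetical stand-ins for those activations, given here as plain lists.

```python
def feature_mean(batch):
    """Per-feature mean over a batch given as a list of equal-length feature vectors."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]


def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between the mean feature vectors of the two batches."""
    mu_real = feature_mean(real_features)
    mu_fake = feature_mean(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))


# Hypothetical activations: the real batch is diverse, the fake batch is collapsed.
real_features = [[1.0, 0.0], [0.0, 1.0]]
fake_features = [[1.0, 0.0], [1.0, 0.0]]

print(feature_matching_loss(real_features, fake_features))
print(feature_matching_loss(real_features, real_features))
```

The collapsed batch is penalized because its mean features differ from those of the real batch, while a batch with matching statistics gives a loss of zero.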

#### 2. Unrolled GANs

Unrolled GANs (see Unrolled Generative Adversarial Networks) allow the generator to “unroll” updates of the discriminator in a fully differentiable way. Now instead of the generator learning to fool the current discriminator, it learns to maximally fool the discriminator after it has had a chance to respond, thus taking counterplay into account.

Downsides of this approach are increased training time (each generator update has to simulate multiple discriminator updates) and a more complicated gradient calculation (backprop through an optimiser update step can be difficult).
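The unrolling idea can be sketched on a toy problem. The example below is not a GAN: it assumes a scalar game with a made-up value function `V(theta, phi) = theta * phi - phi**2 / 2`, where `phi` plays the discriminator (maximizing V) and `theta` plays the generator (minimizing V). All names are hypothetical, and a finite difference stands in for backprop through the optimiser steps.

```python
def value(theta, phi):
    """Toy value function V: phi (discriminator) maximizes it, theta (generator) minimizes it."""
    return theta * phi - 0.5 * phi ** 2


def grad_phi(theta, phi):
    """Analytic dV/dphi for the toy value function."""
    return theta - phi


def unrolled_value(theta, phi, steps, lr):
    """V after the discriminator takes `steps` gradient-ascent updates in response to theta."""
    for _ in range(steps):
        phi = phi + lr * grad_phi(theta, phi)
    return value(theta, phi)


def generator_grad(theta, phi, steps, lr, eps=1e-4):
    """Central finite difference of the unrolled objective with respect to theta."""
    up = unrolled_value(theta + eps, phi, steps, lr)
    down = unrolled_value(theta - eps, phi, steps, lr)
    return (up - down) / (2 * eps)


theta, phi = 1.0, 0.0
# steps=0 is the ordinary update: the generator only sees the current discriminator.
print("standard gradient:", generator_grad(theta, phi, steps=0, lr=0.5))
# steps=1 lets the generator anticipate one discriminator update.
print("unrolled gradient:", generator_grad(theta, phi, steps=1, lr=0.5))
```

In this toy setup the standard gradient is zero because the current discriminator carries no signal, while the unrolled gradient already reflects how the discriminator would react. Each extra unrolled step costs one more simulated discriminator update, which is the training-time overhead mentioned above.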

#### 3. Multiple GANs

Try Boosting Generative Models, a mix of boosting and traditional neural network approaches (see AdaGAN: Boosting Generative Models). However, the implementation can be complicated, and training time will increase as well.

But all the methods described above require completely changing the model. To avoid this, the following approach exists:

#### 4. Wasserstein GANs

Really interesting ideas are described in the paper about WGANs (Wasserstein GAN).

We use the Wasserstein distance. Let $\Pi(P_r, P_g)$ be the set of all joint distributions $\gamma$ whose marginal distributions are $P_r$ and $P_g$. Then

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$

We use this distance as the loss function. This is just a general explanation; WGANs require a post of their own.
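For intuition, the distance has a simple closed form in one special case: two 1-D samples with the same number of points. There, the cheapest way to move one sample onto the other pairs the points in sorted order. The sketch below assumes exactly that setting (the function name `wasserstein_1d` is hypothetical), and it does not carry over to high-dimensional data such as images.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    With equal sample counts in one dimension, the optimal coupling
    matches the i-th smallest x with the i-th smallest y.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


# Every point has to travel 5 units, so the distance is 5.
print(wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0]))
# Same points in a different order: nothing needs to move.
print(wasserstein_1d([3.0, 0.0, 1.0], [1.0, 3.0, 0.0]))
```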

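Unfortunately, computing the Wasserstein distance exactly is intractable. The paper shows how we can compute it approximately. As proved there, **W** is equivalent to

$$W(P_r, P_g) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$$

where the supremum is taken over all 1-Lipschitz functions $f$.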
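In the dual (Kantorovich-Rubinstein) form used in the WGAN paper, the distance is the largest gap between the mean critic score on real samples and the mean critic score on generated samples, taken over 1-Lipschitz critics. The sketch below evaluates that gap for a few hand-picked 1-Lipschitz critics on toy 1-D data. The names `critic_gap`, `real`, and `fake` are hypothetical, and a real WGAN would learn the critic as a neural network instead.

```python
def critic_gap(critic, real, fake):
    """Mean critic score on real samples minus mean critic score on fake samples."""
    mean_real = sum(critic(x) for x in real) / len(real)
    mean_fake = sum(critic(x) for x in fake) / len(fake)
    return mean_real - mean_fake


real = [5.0, 6.0, 8.0]
fake = [0.0, 1.0, 3.0]

# Three 1-Lipschitz critics (slope at most 1 in absolute value).
good_critic = lambda x: x         # best possible critic for this toy data
weak_critic = lambda x: 0.5 * x   # underestimates the distance
bad_critic = lambda x: -x         # points the wrong way, so the gap is negative

for name, critic in [("good", good_critic), ("weak", weak_critic), ("bad", bad_critic)]:
    print(name, critic_gap(critic, real, fake))
```

The distance itself is never negative, but the gap computed with a particular critic can be, for example early in training when the critic is still poor. The value that gets reported as the "W loss" is this kind of estimate.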

In short, according to the paper, WGAN samples are more detailed and suffer from mode collapse much less than those of a standard GAN. In fact, the authors report never running into mode collapse at all for WGANs!

I tried to use the last approach. Fortunately, I could find the W metric in neon, and it works! My generated samples are not identical.

I’d appreciate any comments or questions. They are really useful for understanding.

P.S. By the way, maybe someone knows how to interpret the W loss? I mean, for instance, can it be negative? It's not clear to me. What do you think?

Thanks Aiden Nibali for the article about mode collapse!

## 2 comments

TopPrajjwal said on Jun 11,2018

This may prove to be helpful for clearing your doubt:

https://software.intel.com/en-us/articles/better-generative-modelling-through-wasserstein-gans

Simonetto, Luca said on Mar 26,2018

Nice post, very informative!

Regarding your last question, the W loss can surely be negative. You can see it as the amount of "mass" you need to move from one distribution to the other to match them. WGAN uses the Wasserstein distance because it is informative in the quality of the generated samples, and the lower the better.

I'm currently working with WGANs for time series generation and I can say that mode collapse can be seen also when using this kind of model, so I will make sure to check your methods to see if I can improve diversification!
