The following research uses Intel® AI DevCloud, a cloud-hosted hardware and software platform available for developers, researchers and startups to learn, sandbox and get started on their Artificial Intelligence projects. This free cloud compute is available for Intel® AI Developer Program members.
The year 2017 was a period of scientific breakthroughs in deep learning, with the publication of numerous research papers. Every year seems like a big leap toward artificial general intelligence, or AGI.
One exciting development involves generative modelling and the use of Wasserstein GANs (Generative Adversarial Networks). An influential paper on the topic has completely changed the approach to generative modelling, moving beyond the time when Ian Goodfellow published the original GAN paper.
Why Wasserstein GANs are such a big deal:
This paper differs from earlier work: the training algorithm is backed up by theory, and few examples exist where theory-justified papers gave good empirical results. The big thing about WGANs is that developers can train their discriminator to convergence, which was not possible earlier. Doing this eliminates the need to balance generator updates with discriminator updates.
When dealing with discrete probability distributions, the Wasserstein Distance is also known as Earth mover's distance (EMD). Imagining different heaps of earth in varying quantities, EMD would be the minimal total amount of work it takes to transform one heap into another. Here, work is defined as the product of the amount of earth being moved and the distance it covers. Two discrete probability distributions are usually defined as Pr and P(theta).
Pr comes from unknown distribution, and the goal is to learn P(theta) that approximates Pr.
Calculation of EMD is an optimization process with infinite solution approaches; the challenge is to find the optimal one.
One approach would be to directly learn probability density function P(theta). This would mean that P(theta) is some differentiable function that can be optimized by maximum likelihood estimation. To do that, minimize the KL (Kullback–Leibler) divergence KL(Pr||(P(theta)) and add a random noise to P(theta) when training the model for maximum likelihood estimation. This ensures that distribution is defined elsewhere; otherwise, if a single point lies outside P(theta), the KL divergence can explode.
Adversarial training makes it hard to see whether models are training. It has been shown that GANs are related to actor-critic methods in reinforcement learning. Learn More.
We drop −H(p) going from (18) − (19) because it is a constant. We can see if we minimize the LHS (Left-hand side), we are maximizing the expectation of log q(x) over the distribution p. Therefore, minimizing the LHS is maximizing the RHS, which is maximizing the log-likelihood of the data.
DKL achieves the minimum zero when p(x) == q(x) everywhere.
It is noticeable from the formula that KL divergence is asymmetric. In cases where P(x) is close to zero, but Q(x) is significantly non-zero, the q’s effect is disregarded. It could cause buggy results when the intention was just to measure the similarity between two equally important distributions.
Given two Gaussian distributions, P with mean=0 and std=1 and Q with mean=1 and std=1. The average of two distributions is labelled as m=(p+q)/2. KL divergence DKL is asymmetric but JS divergence DJS is symmetric.
GAN consists of two models:
It is almost impossible to exhaust all the joint distributions in Π(pr,pg) to compute infγ∼Π(pr,pg). Instead, the authors proposed a smart transformation of the formula based on the Kantorovich-Rubinstein duality:
One big problem involves maintaining the K-Lipschitz continuity of fw during the training to make everything work out. The paper presented a simple but very practical noteworthy trick: after the gradient gets updated, clamping the weights w to a small window is required, such as [−0.01,0.01], resulting in a compact parameter space W; and thus, fw obtains it's lower and upper bounds in order to preserve the Lipschitz continuity.
Compared to the original GAN algorithm, the WGAN undertakes the following changes:
Empirically the authors recommended usage of RMSProp optimizer on the critic, rather than a momentum-based optimizer such as Adam which could cause instability in the model training.
The following suggestions are proposed to help stabilize and improve the training of GANs.
Wasserstein metric is proposed to replace JS divergence because it has a much smoother value space.
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. As compared to supervised learning, ConvNets have received little attention. Deep convolutional generative adversarial networks (DCGANs) have certain architectural constraints and demonstrate a strong potential for unsupervised learning. Training on various image datasets show convincing evidence that a deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, the learned features were used for novel tasks - demonstrating their applicability as general image representations.
Replacing JS divergence with the Wasserstein metric gives a much smoother value space.
Training a Generative Adversarial Network faces a major problem:
GANs faced the problem of good objective function that can give better insight of the whole training process. A good evaluation metric was needed. Wasserstein Distance sought to address this problem.
GANs are difficult to train since convergence is an issue. Using Intel® AI DevCloud and implementing with TensorFlow* served to hasten the process. The first step was to determine the evaluation metric, followed by getting the generator and discriminator to work as required. Other steps included defining the Wasserstein Distance and making use of Residual Blocks in the generator and discriminator.
Initially the project dealt with GANs, which are powerful models but suffer from training stability. Switching to DCGANs and making this project with WGANs sought to make progress towards stable training of GANs. Images generated with DCGANs were not good quality and failed to converge during training. In the initial WGANs paper instead of optimizing Jenson Shannon Divergence, they proposed using Wasserstein metric (measure of the distance between two probability distributions).
The reason why Wasserstein distance is better than JS or KL divergence is that when two distributions are located in lower dimensional manifolds without overlaps, Wasserstein distance can still provide a meaningful and smooth representation of the distance in-between.
Additionally, there is almost no hyperparameter tuning.
The project made use of Jupyter notebook on the Intel® AI DevCloud (using Intel® Xeon® Scalable processors) to write the code and for visualization purposes. Also made used was information from the Intel® AI Developer Program forum.
These are some very few applications of GANs (just to provide some ideas) but they can be extended to do so much than what we can possibly think of. There are many papers which have made use of different architectures of GANs, some are listed below:
The code can be found in this Github* repository.
Initially the paper (Soumith at al.) demonstrated the real difference between GAN and WGAN. A GAN Discriminator and Wasserstein GAN critic are trained optimality. In the following graph blue depicts real Gaussian distribution and green depicts fake ones then the values are plotted. The red curve depicts the GAN discriminator output.
Both GAN and WGAN will identify which distribution is fake and which ones are real, but GAN Discriminator does this in such a way that gradients vanish over this high dimensional space. WGANs make use of weight clamping which gives them an edge and it which is able to give gradients in almost every point in space. Wasserstein loss seems to correlate well with image quality also.
Signup for the Intel AI Developer Program and access essential learning materials, community, tools and technology to boost your AI development.
Apply to become an AI Student Ambassador and share your expertise with other student data scientists and developer.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804