In this article, we will explore the scope for optimization in Cycle-GAN for unpaired image-to-image translation and arrive at a new architecture. We will also dive deeper into using Intel® AI DevCloud to further speed up the training process by harnessing multiple compute nodes of the cluster.
Image-to-image translation involves transferring the characteristics of an image from one domain to another. For learning such mapping, the training dataset of images can be either paired or unpaired. Paired images imply that each example is in the form of a pair, having an image from both source and target domain; the dataset is said to be unpaired when there is no one-to-one correspondence between training images from input domain X and target domain Y.
Figure 1. Paired versus unpaired image datasets. The paired image dataset contains examples such that for every ith example there is an image pair xi and yi; here, xi and yi are a sketch and its corresponding actual photograph, respectively. The unpaired image dataset contains separate sets of images for actual photographs (X) and paintings (Y). Source: Cycle-GAN Paper
Previous works such as pix2pix* have offered image-to-image translation using paired training data; for example, converting a photograph from daytime to nighttime. In that case, paired data is obtained by taking pictures of the same location during the day as well as at night.
Figure 2. Some applications of pix2pix* trained on a paired image dataset; that is, when it is possible to obtain images of the same subject under different domains. Source: pix2pix paper
However, obtaining paired training data can be difficult, and sometimes impossible. For example, to convert a horse in an image into a zebra, it is impossible to obtain a pair of images of a horse and a zebra in exactly the same location and the same posture. This is where unpaired image-to-image translation is desired. Still, converting an image from one domain to another is challenging when no paired examples are available. For example, such a system would have to convert only the part of the image where the horse is detected, without altering the background, so that one-to-one correspondence exists between the source image and the target image.
Cycle-GAN provides an effective technique for learning mappings from unpaired image data. Some of the applications of using Cycle-GAN are shown below:
Figure 3. Applications of Cycle-GAN. This technique uses an unpaired dataset for training and is still able to effectively learn to translate images from one domain to another. Source: Cycle-GAN Paper
Cycle-GAN has applications in domains where a paired image dataset is not available. Even when paired images could be obtained, it is easier to collect images from both domains separately than to selectively obtain pairs. Also, a dataset of unpaired images can be built much larger and faster. Cycle-GAN is further discussed in the next section.
Figure 4. Generative adversarial networks for image generation—the generator draws samples from latent random variables and the discriminator tells whether the sample came from the generator or the real world. Source: Deep Learning for Computer Vision: Generative models and adversarial training (UPC 2016)
The generative adversarial network not only involves a neural network for generating content (generator), but also a neural network for determining whether the content is real or fake. It is called the adversarial (discriminator) network. The training of both generator and discriminator is performed simultaneously, such that both optimize against each other in a two-player zero-sum game setting, until both networks lead to an equilibrium point (Nash equilibrium) of such game.
The combination of the generator network and the discriminator (adversarial) network opened up creative tasks for computers that were never possible with any other method. Facebook*'s AI research director Yann LeCun referred to the adversarial training of GANs as "the most interesting idea in the last 10 years in ML." However, despite the plethora of creative possibilities GANs bring to AI, one of the weaknesses of early GANs was limited training stability.
The Cycle-GAN architecture was proposed in the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. On the prospect of using a simple GAN for this problem, Jun-Yan Zhu and his colleagues (2017) suggested:
"With large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, an adversarial loss alone cannot guarantee that the learned function can map an individual input to a desired output."
In other words, a vanilla GAN would have no sense of direction for maintaining the correspondence between the source and the target image. To provide this sense of direction to the network, the authors introduced the cycle-consistency loss.
Figure 5. Cycle-consistency loss in Cycle-GAN. If an input image A from domain X is transformed into a target image B from domain Y via some generator G, then when image B is translated back to domain X via some generator F, the resulting image should match the input image A. The difference between these two images is defined as the cycle-consistency loss. The same loss is applied to images from domain Y. Source: Cycle-GAN Paper
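In code, the cycle-consistency loss is just an L1 penalty on the round trip through both generators. Here is a minimal NumPy sketch; the toy mappings standing in for the generators G and F are my own assumptions, chosen so that F exactly inverts G:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: x -> G(x) -> F(G(x)) should recover x,
    and y -> F(y) -> G(F(y)) should recover y."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return forward + backward

# Toy stand-ins for the generators: F inverts G, so the loss is ~0.
G = lambda img: img + 0.5   # hypothetical X -> Y mapping
F = lambda img: img - 0.5   # hypothetical Y -> X mapping
x = np.random.rand(4, 8, 8, 3)  # batch of domain-X images
y = np.random.rand(4, 8, 8, 3)  # batch of domain-Y images
loss = cycle_consistency_loss(x, y, G, F)
```

With real convolutional generators, the loss is nonzero and is minimized jointly with the adversarial losses.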
This approach requires creating two pairs of generators and discriminators: one for A2B (source to target conversion) and another for B2A (target to source conversion).
Figure 6. A simplified architecture of Cycle-GAN. Considering an example for converting an image of a horse into a zebra, Cycle-GAN requires two generators. The generator A2B converts a horse into a zebra and B2A converts a zebra into a horse. Both train together to ensure one-to-one correspondence between the input horse image and the generated image of the zebra. The two discriminators determine real or fake images for horse and zebra, respectively. Source: Understanding and Implementing CycleGAN in TensorFlow
I took special interest in Cycle-GAN due to its impressive results. Initially, my goal was to implement the approach from the paper on the TensorFlow* framework and study its technical aspects in detail.
While implementing it, I noticed that the training process was time consuming and that there was scope for optimization in Cycle-GAN:
Let us reconsider the purpose of introducing the cycle-consistency loss in Cycle-GAN:
While generative adversarial networks perform very well at generating samples or images, the unpaired image-to-image translation problem additionally demands correspondence between the input image and the target output image.
Figure 7. When converting an image of a horse into a zebra, if the generator creates an image of a zebra that has no relation to the input horse image, the discriminator will still accept it.
It turns out that GANs do not force the output to correspond to its input. To address this issue, that is, to preserve the true content/characteristics of the input image in the translated image, the authors of the Cycle-GAN paper introduced cycle-consistency loss.
I questioned whether there was any other way through which this goal could be achieved without having to create a second generator-discriminator pair.
This idea of optimizing an already well-performing architecture came to my mind through inspiration from Gandhian Engineering, which talks about reducing the cost of a product through innovation. The core of this approach is to create a product that has more value and yet is accessible to more people; that is, more for less for more. Its key principle is that nothing goes unquestioned.
For this, I specifically targeted the problem of converting an image of an apple into an image of an orange. Thus, in this case the goal would be to modify the portion of the input image where the apple is detected, but keep the background intact.
This is a different perspective from that of Cycle-GAN, which tries to modify the complete image and makes no assumption that the background will remain intact. There, the second discriminator has to learn and enforce this assumption, which costs extra training time.
I figured this goal could be achieved using a single neural network. For this, we essentially need a system that takes images from both domain A (source) and domain B (target) and outputs only images from domain B.
Figure 8. Cycle-consistency loss versus deviation loss. By applying deviation loss, only one generator can ensure one-to-one correspondence between source and target image. This eliminates the need for two generator networks.
To ensure that images from domain B do not change, I introduced deviation loss, defined as the difference between the encodings of an image and the output of the generator network. This loss replaces the cycle-consistency loss of the Cycle-GAN architecture. The deviation loss regularizes training of the translator by directing it to translate only the bare-minimum part of the encoded domain-A image needed to make it appear like a real domain-B encoding. It also enforces that the spatial features are kept intact throughout the translation.
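As a sketch of the definition above (the encoding shapes and perturbation are illustrative assumptions, not values from the article), the deviation loss is simply an L1 distance between the translator's input encoding and its output:

```python
import numpy as np

def deviation_loss(encoding, translated):
    """Penalize the translator for changing the encoding more than the
    bare minimum needed to make it look like a domain-B encoding."""
    return np.mean(np.abs(translated - encoding))

# A domain-B encoding should pass through the translator unchanged.
enc_b = np.random.rand(64, 64, 256)          # assumed encoding shape
unchanged = deviation_loss(enc_b, enc_b)      # -> 0.0

# A domain-A encoding may be edited, but only slightly.
enc_a = np.random.rand(64, 64, 256)
translated = enc_a + 0.01 * np.random.randn(*enc_a.shape)
small = deviation_loss(enc_a, translated)     # small edit, small penalty
```

The adversarial loss pushes the translator to edit the encoding; the deviation loss resists any edit, so only the changes the discriminator insists on survive.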
I found another opportunity for optimization in the discriminator of Cycle-GAN or convolutional GANs in general.
I started rethinking generative adversarial networks. As mentioned earlier, a GAN is essentially an optimization problem for a two-player zero-sum game, in which the generator tries to fool the discriminator and the discriminator tries to detect fake objects. There is an entire field of research around game theory that has not been applied to GANs, even though a GAN is a game-theoretic problem at its core, and most game-playing strategies involve acquiring as much information as possible about the opponent's thinking. Thus, it makes sense for the two competing players to share some of their perspective on the game.
So, the generator and discriminator should share as much information as possible while maintaining enough exclusiveness to keep the competition alive.
This led to another modification in the way the generator and discriminator receive their inputs. In Cycle-GAN, the discriminator takes the whole image and predicts whether it looks real or fake. We can consider this discriminator to be working in two parts: the first part encodes the input image, and the second part predicts from the encoding.
Figure 9. The discriminator in Cycle-GAN. The second half of the discriminator (Conv-2) needs feature encodings of the image, which are already available as the output of the translator network; thus, unnecessarily upsampling this encoding in the decoder and re-encoding it in the first part of the discriminator (Conv-1) not only induces error into the encodings but also consumes more training resources.
So, the latter half of the discriminator need not take the output of the generator's decoder (a whole image) as input; it can directly take the translated feature encodings from the translator network.
Figure 10. The discriminator in Cycle-GAN versus the discriminator in the proposed architecture. Note that the need for the first part of the discriminator is completely eliminated; the generator can provide feature encodings directly to the discriminator.
Due to this change, the decoder part of the generator will be unable to take part in the generator-discriminator optimization. However, it can be optimized separately, along with the encoder, in the same way as an AutoEncoder.
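To make the saving concrete, here is a rough sketch; all tensor shapes and the 1×1-convolution-style scoring head are my own illustrative assumptions. The proposed discriminator skips Conv-1 entirely and scores the translator's encodings directly:

```python
import numpy as np

# Illustrative shapes, not taken from the paper.
image = np.zeros((256, 256, 3))      # what Cycle-GAN's discriminator sees
encoding = np.zeros((64, 64, 256))   # what the proposed discriminator sees

def discriminate_encoding(enc, w):
    """Second half of the discriminator only: a 1x1-conv-like projection
    over the feature encodings, averaged into one real/fake score."""
    logits = enc @ w      # (64, 64, 256) @ (256, 1) -> (64, 64, 1)
    return logits.mean()  # single real/fake score

w = np.random.randn(256, 1) * 0.01
score = discriminate_encoding(encoding, w)
```

Dropping Conv-1 removes the decode-then-re-encode round trip, which is where the claimed saving in training resources comes from.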
Also, this major change of using only one generator-discriminator pair instead of two makes further optimization possible. In the Cycle-GAN architecture there were two separate encoder-decoder pairs: one for encoding and decoding images from the source domain, and the other for the target domain.
Since there is now only one generator, a single encoder-decoder pair can encode and decode the images from both domains. This pair can even be trained separately, which has its own advantages.
This separate training step can be governed by the cyclic loss, or reconstruction loss, defined as the difference between the original image and the image obtained after encoding and decoding it. This is similar to an AutoEncoder, except that the translator (generator) network is sandwiched between the encoder and the decoder.
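A toy sketch of this reconstruction signal, with a flatten/unflatten pair standing in for the real convolutional encoder and decoder (the stand-ins are assumptions chosen to be exact inverses):

```python
import numpy as np

def reconstruction_loss(imgs, encode, decode):
    """Cyclic/reconstruction loss: encoding and then decoding an image
    should reproduce it, exactly as in an AutoEncoder."""
    return np.mean((decode(encode(imgs)) - imgs) ** 2)

# Toy encoder/decoder that invert each other, so the loss is zero.
encode = lambda im: im.reshape(im.shape[0], -1)    # flatten to a vector
decode = lambda z: z.reshape(z.shape[0], 8, 8, 3)  # unflatten to an image
imgs = np.random.rand(4, 8, 8, 3)
loss = reconstruction_loss(imgs, encode, decode)
```

In the full architecture this loss trains the shared encoder-decoder pair, while the translator sandwiched between them is trained by the adversarial and deviation losses.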
For training the discriminator network, the conventional GAN discriminator loss is used. If the discriminator correctly classifies the encodings of a fake image, the translator network is penalized; if it misclassifies either real or fake encodings, the discriminator network is penalized. This loss is kept similar to that of Cycle-GAN, but the structure of the discriminator has changed in the proposed architecture.
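This penalty assignment can be sketched with the standard binary cross-entropy GAN loss, applied here to encodings rather than whole images; the example discriminator scores below are made up for illustration:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on sigmoid discriminator outputs."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

d_real = np.array([0.9, 0.8])  # scores on encodings of real images
d_fake = np.array([0.2, 0.1])  # scores on translated (fake) encodings

# Discriminator is penalized for misclassifying real or fake encodings.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# Translator is penalized when its fake encodings are detected,
# i.e., it wants d_fake pushed toward 1.
t_loss = bce(d_fake, np.ones_like(d_fake))
```

Here the discriminator is doing well (scores near the right labels), so `d_loss` is small and `t_loss` is large, which is exactly the penalty split described above.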
When I started working on implementing Cycle-GAN, I soon realized that I lacked the computational resources to do so: generative adversarial networks, and Cycle-GAN in particular, are very sensitive to initialization and to choosing just the right hyperparameters, and training such a big network on a local system using only a CPU is not a good idea.
Intel AI DevCloud works especially well for testing research ideas. This is because it can independently perform computations on multiple nodes of the cluster. Thus, several ideas can be tested simultaneously without waiting for others to complete execution.
To utilize multiple nodes of the cluster for higher performance, I created several versions of the implementation to find the right set of hyperparameters. For example, I created one job with a learning rate of 0.005, another with a learning rate of 0.01, another with 0.02, and so on. If three such jobs are submitted simultaneously, the process is effectively sped up 3x compared to running each version sequentially. This technique is very general and can be used for training any model on Intel AI DevCloud.
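This hyperparameter fan-out can be sketched as a small script that writes one job file per learning rate; the file names, the `train.py` entry point, and its flag are my assumptions, not details from the article. On Intel AI DevCloud, each generated file would then be submitted as a separate job with `qsub`:

```python
# Generate one DevCloud job script per learning rate so the runs can
# execute in parallel on separate compute nodes.
learning_rates = [0.005, 0.01, 0.02]
for lr in learning_rates:
    script = (
        "cd $PBS_O_WORKDIR\n"                       # job starts in its submit dir
        "python train.py --learning-rate {}\n".format(lr)
    )
    with open("job_lr_{}.sh".format(lr), "w") as f:
        f.write(script)
# Then, on DevCloud: qsub job_lr_0.005.sh, qsub job_lr_0.01.sh, ...
```

Each job runs independently, so three submissions give the 3x effective speed-up described above.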
For this optimized architecture specifically, further possibilities emerge to speed up the training process. The architecture consists mainly of three modules: the encoder-decoder pair, the translator network, and the discriminator network.
I observed that each of these modules can be trained on a separate compute node. The only catch is that the inputs of the translator and discriminator networks depend on the encoder's output, and the encoder itself needs to be trained; likewise, the discriminator's input depends on the translator's output, and the translator's loss depends on the discriminator's output. Thus, if the three networks train on separate compute nodes, they must periodically share their updated weights with one another. Since all the submitted jobs use the same storage area, I chose to exchange weights at the end of each epoch. Each of the three jobs writes its own checkpoint, and each trains only its own network's weights while refreshing its copies of the others. That is, the translator job trains only the translator network but reloads the encoder-decoder pair and the discriminator every epoch, using them only for inference. Similarly, the discriminator job uses the periodically updated weights of the other two networks for inference while training only the discriminator network.
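The epoch-end weight exchange through the shared storage area can be sketched like this; the file names and the NumPy checkpoint format are illustrative assumptions (the real implementation would use TensorFlow checkpoints):

```python
import os
import tempfile

import numpy as np

ckpt_dir = tempfile.mkdtemp()  # stands in for the jobs' shared storage area

def publish(name, weights):
    """Called by a job at the end of each epoch to share its weights."""
    np.savez(os.path.join(ckpt_dir, name + ".npz"), **weights)

def refresh(name):
    """Called by the other jobs to reload a peer's latest weights."""
    with np.load(os.path.join(ckpt_dir, name + ".npz")) as f:
        return {k: f[k] for k in f.files}

# Translator job: trains its own weights, then publishes them.
translator = {"w": np.random.randn(4, 4)}
publish("translator", translator)

# Discriminator job: refreshes a frozen copy of the translator,
# using it for inference only.
frozen_translator = refresh("translator")
match = np.allclose(frozen_translator["w"], translator["w"])
```

Because each job only reads its peers' checkpoints and writes its own, no locking is needed beyond the once-per-epoch exchange.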
Therefore, this technique can further speed up the training of a single implementation by up to 3x. Combined with submitting multiple jobs for different implementations, three implementations can yield up to a 9x speed-up.
The final proposed architecture for unpaired image-to-image translation:
Figure 11. Proposed architecture. The aim is to obtain image Fake B from image Input A. Neural networks are represented by solid boundaries and those having the same color represent the same network. The orange-colored circles indicate loss terms.
Explanation: Consider the example of converting an image of an apple into an orange. The goal is to perform this task while keeping the background intact. The forward pass involves downsampling the input image of an apple, translating it into the encoding of an orange, and upsampling that encoding to produce the image of an orange. Deviation loss ensures that the output of the translator is always the feature encodings of an orange. Thus, an image of an orange passes through unchanged (including the background), whereas an image of an apple changes just enough that the apples become oranges (since the discriminator network is forcing this conversion) while everything in the background stays the same (since the deviation loss is resisting any change). The key idea is that the translator network learns not to alter the background or an orange, but only to convert an apple into an orange.
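The forward pass described above is a straight pipeline through the three modules. A skeletal sketch with identity stand-ins for the modules (the stand-ins are assumptions that only demonstrate the data flow):

```python
import numpy as np

def forward(img_a, encode, translate, decode):
    """Proposed forward pass: downsample the apple image, translate its
    encoding into an orange encoding, and upsample back to an image."""
    return decode(translate(encode(img_a)))

# Identity stand-ins; the real modules are convolutional networks.
encode = lambda im: im      # real encoder downsamples to feature encodings
translate = lambda z: z     # real translator edits only the apple regions
decode = lambda z: z        # real decoder upsamples encodings to an image
img_a = np.random.rand(256, 256, 3)
out = forward(img_a, encode, translate, decode)
```

Only `translate` is trained adversarially; `encode` and `decode` are trained by the reconstruction loss, which is why the pipeline can be split across compute nodes.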
The performance of this architecture was compared with the Cycle-GAN implementation on the TensorFlow framework on Intel AI DevCloud using Intel® Xeon® Gold 6128 processors.
Table 1. Comparison of time taken by Cycle-GAN and proposed architecture.
| No. of Epoch(s) | Time by Cycle-GAN | Time by Proposed Architecture | Speed-up |
|---|---|---|---|
| 1 | 66.27 minutes | 32.92 minutes | 2.0128x |
| 2 | 132.54 minutes | 65.84 minutes | 2.0130x |
| 3 | 198.81 minutes | 98.76 minutes | 2.0138x |
| 15 | 994.09 minutes | 493.80 minutes | 2.0131x |
Furthermore, this speed-up is achieved using only a single compute node. By using multiple nodes on Intel AI DevCloud, the speed-up can reach 18x. It is also observed that, because the same neural network encodes and decodes images from both domains and the decoder is less complex, the proposed system converges nearly twice as fast; that is, it needs nearly half the number of epochs Cycle-GAN requires to produce the same result.
The neural networks were trained on images of apples and oranges collected from ImageNet* and were directly available from Taesung Park's Cycle-GAN Datasets. The images were 256 x 256 pixels. The training set consisted of 1177 images of class apple and 996 images of class orange.
Figure 12. Results. The input images of apples are converted into oranges in the output. Note that the image background has not changed in the process.
Summary of the proposed architecture:
This optimized architecture speeds up the training process by at least 2x; it is also observed that convergence is achieved in fewer epochs than with Cycle-GAN. Also, by using optimization techniques specific to Intel AI DevCloud, up to 18x speed-up can be achieved.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804