DCGANs (Deep Convolutional Generative Adversarial Networks) are a widely studied architecture in the field of generative adversarial networks, known for their ability to generate realistic synthetic data. However, one of the main challenges with DCGANs is their instability during training, which often results in issues such as mode collapse, where the generator produces a limited variety of outputs, or the inability of the generator and discriminator to reach a stable equilibrium.
To mitigate these challenges, the Wasserstein GAN (WGAN) was introduced and later improved with techniques such as the gradient penalty and spectral normalization of the weight tensors. These changes aim to provide more stability during the training process of the network.
The architecture of a DCGAN typically consists of a generator and a discriminator: the generator uses transposed convolutional (deconvolutional) layers, while the discriminator uses convolutional layers. The generator maps a random noise vector sampled from a latent Gaussian distribution to a synthetic image, while the discriminator is trained to classify whether an image is real or generated.
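As an illustration, a minimal PyTorch sketch of such an architecture is shown below. The layer widths, the latent dimension `latent_dim = 100`, and the 64×64 image resolution are assumptions for the example, not the exact configuration used in this project.

```python
import torch
import torch.nn as nn

latent_dim = 100  # assumed size of the latent noise vector

# Generator: maps a latent vector to a 3x64x64 image via transposed convolutions
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 256, kernel_size=4, stride=1, padding=0),  # 4x4
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),         # 8x8
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # 16x16
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),           # 32x32
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),            # 64x64
    nn.Tanh(),
)

# Discriminator: maps a 3x64x64 image to a single real/fake score (raw logit)
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),     # 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),   # 16x16
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),  # 8x8
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 1, kernel_size=8, stride=1, padding=0),    # 1x1 score
)

z = torch.randn(16, latent_dim, 1, 1)   # batch of latent noise vectors
fake_images = generator(z)              # -> (16, 3, 64, 64)
scores = discriminator(fake_images)     # -> (16, 1, 1, 1)
```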
This training involves a minimax game between the generator and the discriminator, formulated as:
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$
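In practice, this objective is usually implemented with binary cross-entropy on the discriminator's logits. The sketch below is one common way to do so, reusing the `generator` and `discriminator` modules assumed above; it also uses the widely used non-saturating generator loss (maximizing $\log D(G(z))$) rather than the literal $\log(1 - D(G(z)))$ term.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # discriminator outputs raw logits

def discriminator_loss(discriminator, real_images, fake_images):
    # D is trained to output 1 for real images and 0 for generated ones
    real_logits = discriminator(real_images).view(-1)
    fake_logits = discriminator(fake_images.detach()).view(-1)
    loss_real = bce(real_logits, torch.ones_like(real_logits))
    loss_fake = bce(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(discriminator, fake_images):
    # Non-saturating trick: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))
    fake_logits = discriminator(fake_images).view(-1)
    return bce(fake_logits, torch.ones_like(fake_logits))
```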
However, the loss function in this formulation does not correlate well with the quality of the generated samples, since the discriminator's objective is only to classify images as real or generated.
The WGAN architecture introduces the Wasserstein distance (also known as the Earth Mover's distance) as a measure of the distance between the real data distribution and the distribution of generated samples. In its dual (Kantorovich-Rubinstein) form, it can be written as:

$$ W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{\tilde{x} \sim p_g}[f(\tilde{x})] $$

where $p_g$ is the distribution of generated samples and the supremum is taken over all 1-Lipschitz functions $f$, a role played in practice by the discriminator (often called the critic in this setting).
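Under this duality, the critic simply maximizes the gap between its average scores on real and generated samples, and the generator tries to close that gap. A minimal sketch of the resulting losses, assuming a critic `D` that outputs unbounded scores rather than probabilities:

```python
import torch

def critic_loss(D, real_images, fake_images):
    # The critic maximizes E[D(real)] - E[D(fake)]; we minimize the negative of that gap
    return D(fake_images.detach()).mean() - D(real_images).mean()

def wgan_generator_loss(D, fake_images):
    # The generator tries to raise the critic's score on its generated samples
    return -D(fake_images).mean()
```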
The Wasserstein distance in this dual form requires the critic to satisfy a 1-Lipschitz continuity condition. To enforce this condition, we can use a gradient penalty. The goal of the gradient penalty is to regularize the norm of the gradients of the discriminator with respect to its inputs, pushing it towards 1.
To compute the norm used in the gradient penalty, a random tensor $\epsilon$ is sampled uniformly from $[0, 1]$.
For each real data sample $x$ and generated sample $\tilde{x}$, an interpolated sample is built as $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$.
This interpolation ensures that $\hat{x}$ lies on the straight line between a real and a generated sample, which is where the penalty is enforced.
Next, the interpolated sample $\hat{x}$ is passed through the discriminator, and the gradient of $D(\hat{x})$ with respect to $\hat{x}$ is computed.
This gradient represents how sensitive the discriminator's output is to changes in the interpolated input.
The gradient norm is computed using the $L_2$ (Euclidean) norm over all dimensions of each sample.
The difference between this norm and 1 is squared, averaged over the batch, and multiplied by the regularization coefficient $\lambda$, giving the penalty term:
$$ \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right] $$
This penalty encourages the gradient norm to stay close to 1, which helps to stabilize the training process by preventing issues like gradient explosion or vanishing gradients.
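A minimal PyTorch sketch of this computation is given below. The critic `D` is assumed from the snippets above, and `lambda_gp = 10` is only the value commonly suggested for WGAN-GP, not necessarily the one used in this project.

```python
import torch

def gradient_penalty(D, real_images, fake_images, lambda_gp=10.0):
    batch_size = real_images.size(0)
    # Random interpolation coefficient per sample, broadcast over the image dimensions
    eps = torch.rand(batch_size, 1, 1, 1, device=real_images.device)
    interpolated = eps * real_images + (1.0 - eps) * fake_images.detach()
    interpolated.requires_grad_(True)

    scores = D(interpolated)
    # Gradient of the critic's output with respect to the interpolated inputs
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interpolated,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )[0]

    # L2 norm of the gradient for each sample in the batch
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```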
Spectral normalization is another technique used to enforce the Lipschitz continuity of the discriminator. It involves normalizing the weights of each layer in the discriminator by their largest singular value. Mathematically, this can be represented as:

$$ \bar{W} = \frac{W}{\sigma(W)} $$

where $W$ is the weight matrix of a layer and $\sigma(W)$ is its spectral norm, i.e. its largest singular value.
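In PyTorch this can be applied with the built-in `torch.nn.utils.spectral_norm` wrapper, which estimates the largest singular value of each weight tensor via power iteration. A sketch reusing the critic layout assumed earlier:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization applied to every convolutional layer of the critic
critic = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(256, 1, kernel_size=8, stride=1, padding=0)),
)
```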
Both models were implemented using the same base architecture and trained under comparable hardware and dataset conditions. The training process spanned 10 epochs for both the DCGAN and the WGAN.
DCGAN Performance

During the final epoch of the DCGAN, the loss per iteration was as follows:
The plot reveals notable instability in the training process, particularly in the generator loss on real data. This is evidenced by sharp peaks scattered throughout the plot, indicating moments of significant volatility in the learning process.
Despite these fluctuations, the DCGAN managed to produce the following images in the last epoch:
WGAN Performance

For the WGAN, we analyzed the Wasserstein distance per iteration:
The Wasserstein distance, which serves as the loss function in WGANs, showed a much more stable profile during training. While the initial stages of training saw a rapid spike in this value—likely due to the effects of spectral normalization on the initial weights—the model quickly stabilized. This early instability can be attributed to the larger norms of the weights while the early filters and biases were still being learned. Once these weights stabilized, the Wasserstein distance remained consistent across iterations, reflecting a more controlled training process for both the generator and the discriminator.
But this stability comes with a tradeoff: each WGAN epoch took around 120 s, compared with roughly 40 s per epoch for the DCGAN.
Here are the images generated by the WGAN at the last epoch:
For further implementation details, refer to the notebooks in this repository.