DCGANs (Deep Convolutional Generative Adversarial Networks) are a widely studied architecture in the field of generative adversarial networks, known for their ability to generate realistic synthetic data. However, one of the main challenges with DCGANs is their instability during training, which often results in issues such as mode collapse, where the generator produces a limited variety of outputs, or the inability of the generator and discriminator to reach a stable equilibrium.
To mitigate these challenges, the Wasserstein GAN (WGAN) was introduced and later improved with techniques such as the gradient penalty and spectral normalization of the weight tensors. These changes aim to provide more stability during the training process of the network.
The architecture of a DCGAN typically consists of a generator and a discriminator: the generator uses transposed convolutional (deconvolutional) layers, while the discriminator uses convolutional layers. The generator maps a random noise vector sampled from a latent Gaussian distribution to a synthetic image, while the discriminator is trained to classify whether an image is real or generated.
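As an illustration, a minimal PyTorch sketch of such an architecture is shown below. The layer widths, the latent dimension `latent_dim = 100`, and the 64×64 image resolution are assumptions for the example, not the exact configuration used in this project.

```python
import torch
import torch.nn as nn

latent_dim = 100  # assumed size of the latent noise vector

# Generator: maps a latent vector to a 3x64x64 image via transposed convolutions
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 256, kernel_size=4, stride=1, padding=0),  # 4x4
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),         # 8x8
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # 16x16
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),           # 32x32
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),            # 64x64
    nn.Tanh(),
)

# Discriminator: maps a 3x64x64 image to a single real/fake score (raw logit)
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),     # 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),   # 16x16
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),  # 8x8
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 1, kernel_size=8, stride=1, padding=0),    # 1x1 score
)

z = torch.randn(16, latent_dim, 1, 1)   # batch of latent noise vectors
fake_images = generator(z)              # -> (16, 3, 64, 64)
scores = discriminator(fake_images)     # -> (16, 1, 1, 1)
```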
This training involves a minimax game between the generator and the discriminator, formulated as:
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$
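In practice, this objective is usually implemented with binary cross-entropy on the discriminator's logits. The sketch below is one common way to do so, reusing the `generator` and `discriminator` modules assumed above; it also uses the widely used non-saturating generator loss (maximizing $\log D(G(z))$) rather than the literal $\log(1 - D(G(z)))$ term.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # discriminator outputs raw logits

def discriminator_loss(discriminator, real_images, fake_images):
    # D is trained to output 1 for real images and 0 for generated ones
    real_logits = discriminator(real_images).view(-1)
    fake_logits = discriminator(fake_images.detach()).view(-1)
    loss_real = bce(real_logits, torch.ones_like(real_logits))
    loss_fake = bce(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(discriminator, fake_images):
    # Non-saturating trick: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))
    fake_logits = discriminator(fake_images).view(-1)
    return bce(fake_logits, torch.ones_like(fake_logits))
```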
However, the loss function in this formulation does not correlate well with the quality of the generated samples, since the discriminator's objective is only to classify images as real or generated.
The WGAN architecture introduces the Wasserstein distance (also known as the Earth Mover's distance) as a measure of the distance between the real data distribution and the distribution of generated samples. In its dual (Kantorovich-Rubinstein) form, it can be written as:

$$ W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{\tilde{x} \sim p_g}[f(\tilde{x})] $$

where $p_g$ is the distribution of generated samples and the supremum is taken over all 1-Lipschitz functions $f$, a role played in practice by the discriminator (often called the critic in this setting).
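Under this duality, the critic simply maximizes the gap between its average scores on real and generated samples, and the generator tries to close that gap. A minimal sketch of the resulting losses, assuming a critic `D` that outputs unbounded scores rather than probabilities:

```python
import torch

def critic_loss(D, real_images, fake_images):
    # The critic maximizes E[D(real)] - E[D(fake)]; we minimize the negative of that gap
    return D(fake_images.detach()).mean() - D(real_images).mean()

def wgan_generator_loss(D, fake_images):
    # The generator tries to raise the critic's score on its generated samples
    return -D(fake_images).mean()
```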
The Wasserstein distance in this dual form requires the critic to satisfy a 1-Lipschitz continuity condition. To enforce this condition, we can use a gradient penalty. The goal of the gradient penalty is to regularize the norm of the gradients of the discriminator with respect to its inputs, pushing it towards 1.
To compute the norm used in the gradient penalty, a random tensor $\epsilon$ is sampled uniformly from $[0, 1]$.
For each real data sample $x$ and generated sample $\tilde{x}$, an interpolated sample is built as $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$.
This interpolation ensures that $\hat{x}$ lies on the straight line between a real and a generated sample, which is where the penalty is enforced.
Next, the interpolated sample $\hat{x}$ is passed through the discriminator, and the gradient of $D(\hat{x})$ with respect to $\hat{x}$ is computed.
This gradient represents how sensitive the discriminator's output is to changes in the interpolated input.
The gradient norm is computed using the $L_2$ (Euclidean) norm over all dimensions of each sample.
The difference between this norm and 1 is squared, averaged over the batch, and multiplied by the regularization coefficient $\lambda$, giving the penalty term:
$$ \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right] $$
This penalty encourages the gradient norm to stay close to 1, which helps to stabilize the training process by preventing issues like gradient explosion or vanishing gradients.
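A minimal PyTorch sketch of this computation is given below. The critic `D` is assumed from the snippets above, and `lambda_gp = 10` is only the value commonly suggested for WGAN-GP, not necessarily the one used in this project.

```python
import torch

def gradient_penalty(D, real_images, fake_images, lambda_gp=10.0):
    batch_size = real_images.size(0)
    # Random interpolation coefficient per sample, broadcast over the image dimensions
    eps = torch.rand(batch_size, 1, 1, 1, device=real_images.device)
    interpolated = eps * real_images + (1.0 - eps) * fake_images.detach()
    interpolated.requires_grad_(True)

    scores = D(interpolated)
    # Gradient of the critic's output with respect to the interpolated inputs
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interpolated,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )[0]

    # L2 norm of the gradient for each sample in the batch
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```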
Spectral normalization is another technique used to enforce the Lipschitz continuity of the discriminator. It involves normalizing the weights of each layer in the discriminator by their largest singular value. Mathematically, this can be represented as:

$$ \bar{W} = \frac{W}{\sigma(W)} $$

where $W$ is the weight matrix of a layer and $\sigma(W)$ is its spectral norm, i.e. its largest singular value.
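In PyTorch this can be applied with the built-in `torch.nn.utils.spectral_norm` wrapper, which estimates the largest singular value of each weight tensor via power iteration. A sketch reusing the critic layout assumed earlier:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization applied to every convolutional layer of the critic
critic = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(256, 1, kernel_size=8, stride=1, padding=0)),
)
```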
Both models were implemented using the same base architecture and trained under comparable hardware and dataset conditions. The training process spanned 10 epochs for both the DCGAN and the WGAN.
DCGAN Performance

During the final epoch of the DCGAN, the loss per iteration was as follows:
The plot reveals notable instability in the training process, particularly in the generator loss on real data. This is evidenced by sharp peaks scattered throughout the plot, indicating moments of significant volatility in the learning process.
Despite these fluctuations, the DCGAN managed to produce the following images in the last epoch:
WGAN Performance

For the WGAN, we analyzed the Wasserstein distance per iteration:
The Wasserstein distance, which serves as the loss function in WGANs, showed a much more stable profile during training. While the initial stages of training saw a rapid spike in this value—likely due to the effects of spectral normalization on the initial weights—the model quickly stabilized. This early instability can be attributed to the larger norms of the weights while the early filters and biases were still being learned. Once these weights stabilized, the Wasserstein distance remained consistent across iterations, reflecting a more controlled training process for both the generator and the discriminator.
But this stability comes with a tradeoff: each WGAN epoch took around 120 s, compared with roughly 40 s per epoch for the DCGAN.
Here are the images generated by the WGAN at the last epoch:
For further implementation details, refer to the notebooks in this repository.