Vanilla GAN with Numpy

July 9, 2019

Generative Adversarial Networks (GANs) have achieved tremendous success in generating high-quality synthetic images and efficiently internalising the essence of the images that they learn from. Their potential is enormous, as they can learn to do that for any distribution of data.

In order to keep up with the latest advancements, I decided to explore their theoretical underpinnings by implementing a simple GAN in Python using Numpy. In this post, I will go through the implementation steps based on Ian Goodfellow’s Generative Adversarial Nets paper.

The full code is available on my GitHub repository.

Figure 1: Sample digit generated by the GAN that we will implement in this tutorial

Architecture

GAN

Figure 2: GAN architecture

We will implement a GAN that generates handwritten digits. The basic principle of GANs is inspired by the two-player zero-sum game, in which the total gains of the two players are zero, and each player’s gain or loss of utility is exactly balanced by the loss or gain of utility of the other player [1]. A GAN comprises two models:

  1. Generator: learns to generate new images, which have a similar data distribution as the real dataset. Crucially, it has no direct access to the real images; it only learns through its interaction with the discriminator.

  2. Discriminator: learns to distinguish candidates produced by the generator from the real data distribution. It outputs the probability that an input image belongs to the real data distribution rather than the generator distribution.

Although both models are typically implemented as convolutional neural nets, they could be implemented by any form of differentiable system that maps data from one space to another. This implementation uses multilayer perceptrons, as they are less computationally expensive and much easier to code from scratch.

Generator

The generator, $G$, is fed random noise, $z$, drawn from a standard normal distribution (zero mean, standard deviation of 1). As we will train both the generator and discriminator using mini-batch gradient descent, the input noise will be a numpy array of size [batch size, input layer size].

The output of the generator will be a batch of flattened images of size [batch size, image dimension^2] . The image dimension corresponds to the number of pixels of the training images (for MNIST, each image has size 28×28 pixels).
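As a quick illustration of these shapes, the sketch below samples a noise batch and pushes it through an untrained two-layer network. The layer sizes match the defaults used later in this post; the variable names and the 0.01 weight scale are assumptions chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, noise_dim, hidden_dim, image_dim = 64, 100, 128, 28

# noise batch drawn from a standard normal distribution
z = rng.standard_normal((batch_size, noise_dim))

# untrained weights, only used here to illustrate the shapes
W0 = rng.standard_normal((noise_dim, hidden_dim)) * 0.01
W1 = rng.standard_normal((hidden_dim, image_dim ** 2)) * 0.01

hidden = np.maximum(z @ W0, 0)      # ReLU hidden layer
fake_images = np.tanh(hidden @ W1)  # tanh keeps pixel values in [-1, 1]

print(z.shape)            # (64, 100)
print(fake_images.shape)  # (64, 784)
```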

Figure 3: Generator architecture

Discriminator

The discriminator, $D$, will be fed a batch of real images from the MNIST dataset and a batch of fake images from the generator. For each input image, it will output the probability that the image is real.

The original paper suggests training the discriminator for k steps before training the generator for one step. We will choose the least computationally expensive solution, k=1, therefore training the discriminator and generator equally.

Figure 4: Discriminator architecture

Practitioners have been experimenting with the more sophisticated Deep Convolutional GANs to determine the optimal activation functions for the hidden and output layers [2]. I have found their recommendations to be very effective for a simple GAN as well.

In the generator network, it is recommended to use a ReLU activation in the hidden layers and a tanh activation in the output layer. It was observed that using a bounded activation allowed the model to learn more quickly to saturate [2]. A saturating output activation also pushes pixels that would appear grey toward either black or white, resulting in crisper images. In the discriminator network, leaky ReLU was found to work well [3,4] in the hidden layers, as it keeps a small, non-zero gradient for negative inputs. At the output layer, a sigmoid activation is commonly used, which maps the discriminator's logit to a probability between 0 and 1. This is in contrast to the original GAN paper, which used the maxout activation.

Cost function

Training of GANs involves balancing two conflicting objectives:

  1. Training $D$ to maximise the probability of assigning the correct label to the training examples, $D(x)$ and samples from the generator, $D(G(z))$. The discriminator therefore wants to maximise:

    $$ \begin{aligned} \mathbb{E}_{x \sim p_{data}(x)}\log D(x) + \mathbb{E}_{z \sim p_{z}(z)}\log(1-D(G(z))) \end{aligned} $$

    where $p_{data}$ is the training data distribution and $p_{z}$ the noise prior. This is equivalent to minimising:

    $$ \begin{aligned} J^{(D)} = -\mathbb{E}_{x \sim p_{data}(x)}\log D(x) -\mathbb{E}_{z \sim p_{z}(z)}\log(1-D(G(z))) \end{aligned} $$

This is just the standard cross-entropy cost that is minimised when training a binary classifier with a sigmoid output. The only difference is that the classifier is trained on two mini-batches of data; one coming from the dataset, where the label is 1 for all examples, and one coming from the generator, where the label is 0 for all examples.
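The equivalence with binary cross-entropy can be checked numerically. The sketch below uses random numbers as stand-ins for the discriminator outputs; the helper binary_cross_entropy is a name introduced here for illustration and is not part of the implementation that follows.

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Mean binary cross-entropy between predictions p and labels y."""
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(0)
d_real = rng.uniform(0.05, 0.95, size=(64, 1))  # stand-in for D(x)
d_fake = rng.uniform(0.05, 0.95, size=(64, 1))  # stand-in for D(G(z))

# discriminator cost as written above
j_d = np.mean(-np.log(d_real)) + np.mean(-np.log(1 - d_fake))

# same quantity as two cross-entropy terms: label 1 for real, label 0 for fake
bce = (binary_cross_entropy(d_real, np.ones_like(d_real))
       + binary_cross_entropy(d_fake, np.zeros_like(d_fake)))

print(np.isclose(j_d, bce))  # True
```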

  2. Training $G$ to minimise the likelihood of the generated images not coming from the real data distribution. In other words, $G$ is trying to maximally confuse the discriminator. It tries to minimise:

    $$ \begin{aligned} J^{(G)} = \mathbb{E}_{z \sim p_{z}(z)}\log(1-D(G(z))) \end{aligned} $$

    Typically, an alternative, non-saturating training criterion is used for the generator, because the criterion above provides only weak gradients early in training, when the discriminator rejects the generated samples with high confidence:

    $$ \begin{aligned} J^{(G)} = -\mathbb{E}_{z \sim p_{z}(z)}\log D(G(z)) \end{aligned} $$
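To see why the second criterion is called non-saturating, it helps to compare the gradients of both generator costs with respect to the discriminator's logit. The sketch below is a minimal illustration; the logit values are arbitrary examples.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# discriminator logits for fake images, from "confidently fake" to "undecided"
logits = np.array([-6.0, -3.0, 0.0])
d_fake = sigmoid(logits)

# d/ds log(1 - sigmoid(s)) = -sigmoid(s)       (original, saturating cost)
grad_saturating = -d_fake
# d/ds -log(sigmoid(s))    = sigmoid(s) - 1    (non-saturating cost)
grad_non_saturating = d_fake - 1.0

for s, g_sat, g_ns in zip(logits, grad_saturating, grad_non_saturating):
    print(f"logit {s:+.1f}: saturating {g_sat:+.4f}, non-saturating {g_ns:+.4f}")
```

When the discriminator is confident that a sample is fake (a very negative logit), the saturating gradient is close to zero while the non-saturating one stays close to -1.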

Implementation

Imports

Let’s start by importing numpy, matplotlib.pyplot and other useful libraries.

keras.datasets is imported to get access to the MNIST dataset, imageio to generate a gif from sample images saved during training, and Path to define the location where those sample images will be exported.

import imageio
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

Data Loading

We need a set of real handwritten digits to give the discriminator a starting point in distinguishing between real and fake images. We’ll use MNIST, a benchmark dataset in deep learning. It consists of 70k images of handwritten digits compiled by the U.S. National Institute of Standards and Technology from Census Bureau employees and high school students.

As we will only use the train data, the test data (10k images) will be ignored.

(x_train, y_train), (_, _) = mnist.load_data()

print("y_train.shape",y_train.shape)
print("x_train.shape",x_train.shape)
y_train.shape (60000,)  
x_train.shape (60000, 28, 28)

Initialisation

We will wrap all functions in the GAN class.

On one GPU, a GAN will need hours to train, and on one CPU more than a day. In addition, GANs are difficult to optimise. For these reasons, I recommend trying to generate one digit at a time, by limiting the training data from the digits 0-9 to the digit specified in the numbers list.

We will use a mini-batch size of 64 (batch_size). The input layer of the discriminator is determined by the size of the training images [batch_size, image dimension^2], which needs to match the output of the generator, i.e. the fake images. The number of neurons at the input layer of the generator (input_layer_size_g) as well as the hidden layers of both models (hidden_layer_size_g, hidden_layer_size_d) need to be defined by us.

Next, to visualise training performance, we can generate a gif of sample images. If create_gif is enabled, a grid of sample images will be saved in your local directory by default and their filename will be stored in the filenames list to enable sourcing the images for stitching at the end of training.

While GANs are commonly trained with momentum or with the Adam optimiser, which adapts the learning rate, we will use plain gradient descent with a simple learning rate decay controlled by decay_rate.

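Finally, all weights will be initialised from a zero-centred normal distribution whose standard deviation is scaled by the number of inputs to the layer, $\sqrt{2/n_{in}}$, in the spirit of the Xavier algorithm [5] (this particular scaling is often referred to as He initialisation). It makes sure the weights are ‘just right’ by keeping the signal in a reasonable range of values through many layers.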

class GAN:
    def __init__(self, numbers, epochs=100, batch_size=64, input_layer_size_g=100,
                 hidden_layer_size_g=128, hidden_layer_size_d=128, learning_rate=1e-3,
                 decay_rate=1e-4, image_size=28, display_epochs=5, create_gif=True):
        # -------- Initialise hyperparameters --------#
        self.numbers = numbers # chosen numbers to be generated
        self.epochs = epochs # #training iterations
        self.batch_size = batch_size # #of training examples in each batch
        self.nx_g = input_layer_size_g # #neurons in the generator's input layer
        self.nh_g = hidden_layer_size_g # #neurons in the generator's hidden layer
        self.nh_d = hidden_layer_size_d # #neurons in the discriminator's hidden layer
        self.lr = learning_rate # how much newly acquired info. overrides old info.
        self.dr = decay_rate # learning rate decay after every epoch
        self.image_size = image_size # # pixels of training images
        self.display_epochs = display_epochs # interval for displaying results
        self.create_gif = create_gif # if true, a gif of sample images will be made
        
        self.image_dir = Path('./GAN_sample_images') # new folder in current directory
        
        if not self.image_dir.is_dir():
            self.image_dir.mkdir()

        self.filenames = [] # stores filenames of sample images if create_gif is True

        # -------- Initialise weights with Xavier method --------#
        # -------- Generator --------#
        self.W0_g = np.random.randn(self.nx_g, self.nh_g) \
                    * np.sqrt(2. / self.nx_g)  #100x128
        self.b0_g = np.zeros((1, self.nh_g))  # 1x128

        self.W1_g = np.random.randn(self.nh_g, self.image_size ** 2) \
                    * np.sqrt(2. / self.nh_g) #128x784
        self.b1_g = np.zeros((1, self.image_size ** 2))  #1x784

        # -------- Discriminator --------#
        self.W0_d = np.random.randn(self.image_size ** 2, self.nh_d) \
                    * np.sqrt(2. / self.image_size ** 2) #784x128
        self.b0_d = np.zeros((1, self.nh_d))  # 1x128

        self.W1_d = np.random.randn(self.nh_d, 1) \
                    * np.sqrt(2. / self.nh_d)  # 128x1
        self.b1_d = np.zeros((1, 1))  # 1x1
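
The effect of the weight scale used in the initialisation can be illustrated with a small experiment. The sketch below passes random inputs through a stack of ReLU layers, once with the sqrt(2 / n_in) scaling used in this post and once with unscaled standard normal weights. The layer width, depth and helper name are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 5
x = rng.standard_normal((1000, n))

def rms_after_relu_stack(scale):
    """Root-mean-square activation after `depth` ReLU layers of width n."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale
        a = np.maximum(a @ W, 0)
    return np.sqrt(np.mean(a ** 2))

rms_scaled = rms_after_relu_stack(np.sqrt(2.0 / n))  # scaling used in this post
rms_unscaled = rms_after_relu_stack(1.0)             # plain standard normal weights

print(f"scaled: {rms_scaled:.2f}, unscaled: {rms_unscaled:.2e}")
```

With the scaling, the activations stay at roughly the same magnitude as the input; without it, they grow by orders of magnitude from layer to layer.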

Data Preprocessing

Five pre-processing steps were applied to the training data:

  1. Limiting it to the subset of digits selected by the user through the numbers list
  2. Removing images that can’t be part of a full training batch
  3. Flattening the images in an array with 784 values representing each pixel’s intensity
  4. Scaling the images to the range of the tanh activation function [-1,1]
  5. Shuffling it to enable convergence
def preprocess_data(self, x, y):
    x_train = []
    y_train = []

    # limit the data to a subset of digits from 0-9
    for i in range(y.shape[0]):
        if y[i] in self.numbers:
            x_train.append(x[i])
            y_train.append(y[i])

    x_train = np.array(x_train)
    y_train = np.array(y_train)

    # limit the data to full batches only
    num_batches = x_train.shape[0] // self.batch_size
    x_train = x_train[: num_batches * self.batch_size]
    y_train = y_train[: num_batches * self.batch_size]
    
    # flatten the images (_,28,28)->(_, 784)
    x_train = np.reshape(x_train, (x_train.shape[0], -1))
    
    # normalise the data to the range [-1,1]
    x_train = (x_train.astype(np.float32) - 127.5) / 127.5

    # shuffle the data
    idx = np.random.permutation(len(x_train))
    x_train, y_train = x_train[idx], y_train[idx]
    return x_train, y_train, num_batches

GAN.preprocess_data = preprocess_data
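
The digit filter at the top of preprocess_data can also be expressed without an explicit loop. The sketch below shows one possible vectorised form using np.isin on toy data; it is an alternative for illustration, not the code used in this post.

```python
import numpy as np

# toy labels and images standing in for MNIST
y = np.array([3, 1, 3, 7, 0, 3])
x = np.arange(6 * 4).reshape(6, 2, 2)
numbers = [3, 7]

# loop-based filter, as in preprocess_data
x_loop = np.array([x[i] for i in range(y.shape[0]) if y[i] in numbers])

# vectorised equivalent using a boolean mask
mask = np.isin(y, numbers)
x_masked = x[mask]

print(mask.tolist())                     # [True, False, True, True, False, True]
print(np.array_equal(x_loop, x_masked))  # True
```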

Activation Functions

Here, we will implement the activation functions that will be used in forward propagation. Numpy’s tanh is used directly. The leaky ReLU function (lrelu) effectively acts as the relu function when the alpha parameter is set to zero.

def lrelu(self, x, alpha=1e-2):
    return np.maximum(x, x * alpha)
GAN.lrelu = lrelu

def sigmoid(self, x):
    return 1. / (1. + np.exp(-x))
GAN.sigmoid = sigmoid

As usual, the derivatives of the activation functions will be needed in backward propagation.

def dlrelu(self, x, alpha=1e-2):
    dx = np.ones_like(x)
    dx[x < 0] = alpha
    return dx
GAN.dlrelu = dlrelu

def dsigmoid(self, x):
    y = self.sigmoid(x)
    return y * (1. -y)
GAN.dsigmoid = dsigmoid

def dtanh(self, x):
    return 1. - np.tanh(x)** 2
GAN.dtanh = dtanh
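
A simple way to gain confidence in these derivatives is to compare them with a finite-difference approximation. The sketch below re-declares the functions as standalone versions (without self) so that it runs on its own; numerical_derivative is a helper introduced here for illustration.

```python
import numpy as np

def lrelu(x, alpha=1e-2):
    return np.maximum(x, x * alpha)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlrelu(x, alpha=1e-2):
    dx = np.ones_like(x)
    dx[x < 0] = alpha
    return dx

def dsigmoid(x):
    y = sigmoid(x)
    return y * (1.0 - y)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2

def numerical_derivative(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.3, 1.7])  # avoids the kink of leaky ReLU at 0

checks = {
    "lrelu": np.allclose(dlrelu(x), numerical_derivative(lrelu, x)),
    "sigmoid": np.allclose(dsigmoid(x), numerical_derivative(sigmoid, x)),
    "tanh": np.allclose(dtanh(x), numerical_derivative(np.tanh, x)),
}
print(checks)
```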

Forward propagation

Next, we will implement forward propagation for the generator and discriminator networks. After the input layer, each layer applies the affine transformation $z = xW + b$, where each row of $x$ is one example in the batch, followed by an activation function $a(z)$.

In the generator, the random noise, $z$, is propagated through the network to produce a batch of fake images, a1_g.

In the discriminator, a batch of images (real or fake), $x$, are propagated through the network to predict a classification (real or fake).

def forward_generator(self, z):
    self.z0_g = np.dot(z, self.W0_g) + self.b0_g
    self.a0_g = self.lrelu(self.z0_g, alpha=0)

    self.z1_g = np.dot(self.a0_g, self.W1_g) + self.b1_g
    self.a1_g = np.tanh(self.z1_g)  # range [-1,1]
    return self.z1_g, self.a1_g
GAN.forward_generator = forward_generator

def forward_discriminator(self, x):
    self.z0_d = np.dot(x, self.W0_d) + self.b0_d
    self.a0_d = self.lrelu(self.z0_d)

    self.z1_d = np.dot(self.a0_d, self.W1_d) + self.b1_d
    self.a1_d = self.sigmoid(self.z1_d)  # output probability [0,1]
    return self.z1_d, self.a1_d
GAN.forward_discriminator = forward_discriminator

Backward propagation

The GAN setup is reminiscent of reinforcement learning, where the generator is receiving a reward signal from the discriminator letting it know whether the generated data is accurate or not. The key difference with GANs, however, is that we can backward propagate gradient information from the discriminator to the generator, so the generator knows how to adapt its parameters in order to produce output data that can mislead the discriminator.

We will start by backward propagating the gradients for the real and the fake images through the discriminator; the generator is handled afterwards. To do so, we need to pass the following information to backward_discriminator:

  1. x_real: a batch of real images from the training data
  2. z1_real: logit output from the discriminator D(x)
  3. a1_real: the discriminator’s output predictions for the real images
  4. x_fake: a batch with fake images produced by the generator
  5. z1_fake: logit output from the discriminator D(G(z))
  6. a1_fake: the discriminator’s output predictions for the fake images

The gradients are derived by simply differentiating the loss function with respect to each parameter. I will not derive the gradients here as there are many tutorials online, for instance Andrew Ng’s video lectures. For an intuitive understanding of backward propagation, I recommend Andrej Karpathy’s blog.
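Only the output-layer step is sketched here, as the chain rule gives a compact result that is useful when reading the code. Writing $s$ for the discriminator's logit and $\sigma(s)$ for its sigmoid output (the symbol $s$ is introduced for illustration only), and ignoring the small constant added for numerical stability:

$$ \begin{aligned} \frac{\partial}{\partial s}\left[-\log \sigma(s)\right] &= -\frac{1}{\sigma(s)}\,\sigma(s)\,(1-\sigma(s)) = \sigma(s) - 1 \\ \frac{\partial}{\partial s}\left[-\log(1-\sigma(s))\right] &= \frac{1}{1-\sigma(s)}\,\sigma(s)\,(1-\sigma(s)) = \sigma(s) \end{aligned} $$

The first line corresponds to the product of da1_real and dsigmoid(z1_real) for the real batch, and the second to the product of da1_fake and dsigmoid(z1_fake) for the fake batch.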

def backward_discriminator(self, x_real, z1_real, a1_real, x_fake, z1_fake, a1_fake):
    # -------- Backprop through Discriminator --------#
    # J_D = np.mean(-np.log(a1_real) - np.log(1 - a1_fake))
    
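    # the cached self.z0_d / self.a0_d hold the most recent forward pass
    # (the fake batch), so recompute the hidden layer for the real batch
    z0_real = np.dot(x_real, self.W0_d) + self.b0_d
    a0_real = self.lrelu(z0_real)

    # real input gradients -np.log(a1_real)
    da1_real = -1. / (a1_real + 1e-8)  # 64x1

    dz1_real = da1_real * self.dsigmoid(z1_real)  # 64x1
    dW1_real = np.dot(a0_real.T, dz1_real)
    db1_real = np.sum(dz1_real, axis=0, keepdims=True)

    da0_real = np.dot(dz1_real, self.W1_d.T)
    dz0_real = da0_real * self.dlrelu(z0_real)
    dW0_real = np.dot(x_real.T, dz0_real)
    db0_real = np.sum(dz0_real, axis=0, keepdims=True)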

    # fake input gradients -np.log(1 - a1_fake)
    da1_fake = 1. / (1. - a1_fake + 1e-8)

    dz1_fake = da1_fake * self.dsigmoid(z1_fake)
    dW1_fake = np.dot(self.a0_d.T, dz1_fake)
    db1_fake = np.sum(dz1_fake, axis=0, keepdims=True)

    da0_fake = np.dot(dz1_fake, self.W1_d.T)
    dz0_fake = da0_fake * self.dlrelu(self.z0_d)
    dW0_fake = np.dot(x_fake.T, dz0_fake)
    db0_fake = np.sum(dz0_fake, axis=0, keepdims=True)

    # -------- Combine gradients for real & fake images--------#
    dW1 = dW1_real + dW1_fake
    db1 = db1_real + db1_fake

    dW0 = dW0_real + dW0_fake
    db0 = db0_real + db0_fake

    # -------- Update weights using SGD --------#
    self.W0_d -= self.lr * dW0
    self.b0_d -= self.lr * db0

    self.W1_d -= self.lr * dW1
    self.b1_d -= self.lr * db1
    
GAN.backward_discriminator = backward_discriminator
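
Hand-written backpropagation is easy to get subtly wrong, so a finite-difference gradient check on a tiny network can be worthwhile. The sketch below is a standalone illustration for the real-image branch of the discriminator loss; the network sizes, the 0.5 weight scale and the helper real_loss are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrelu(x, alpha=1e-2):
    return np.maximum(x, x * alpha)

def dlrelu(x, alpha=1e-2):
    dx = np.ones_like(x)
    dx[x < 0] = alpha
    return dx

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tiny discriminator: 5 inputs -> 4 hidden units -> 1 output
x = rng.standard_normal((3, 5))
W0 = rng.standard_normal((5, 4)) * 0.5
b0 = np.zeros((1, 4))
W1 = rng.standard_normal((4, 1)) * 0.5
b1 = np.zeros((1, 1))

def real_loss(W0):
    """Summed loss -log D(x) over the batch, as a function of W0."""
    a0 = lrelu(x @ W0 + b0)
    a1 = sigmoid(a0 @ W1 + b1)
    return np.sum(-np.log(a1))

# analytic gradient, following the same chain of steps as backward_discriminator
z0 = x @ W0 + b0
a0 = lrelu(z0)
z1 = a0 @ W1 + b1
a1 = sigmoid(z1)
dz1 = (-1.0 / a1) * a1 * (1.0 - a1)
dz0 = (dz1 @ W1.T) * dlrelu(z0)
dW0_analytic = x.T @ dz0

# numerical gradient by central finite differences
eps = 1e-6
dW0_numeric = np.zeros_like(W0)
for i in range(W0.shape[0]):
    for j in range(W0.shape[1]):
        step = np.zeros_like(W0)
        step[i, j] = eps
        dW0_numeric[i, j] = (real_loss(W0 + step) - real_loss(W0 - step)) / (2 * eps)

max_abs_diff = np.max(np.abs(dW0_analytic - dW0_numeric))
print(max_abs_diff)
```

A difference many orders of magnitude smaller than the gradient itself suggests that the analytic and numerical gradients agree.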

In backward_generator, we backpropagate the generator's loss through the discriminator to obtain the gradient with respect to its input (the fake images), but we won’t update the discriminator weights.

def backward_generator(self, z, x_fake, z1_fake, a1_fake):
    # -------- Backprop through Discriminator --------#
    # propagate the generator loss back to the discriminator's input (no weight update)

    # fake input gradients -np.log(a1_fake)
    da1_d = -1.0 / (a1_fake + 1e-8)  # 64x1

    dz1_d = da1_d * self.dsigmoid(z1_fake)
    da0_d = np.dot(dz1_d, self.W1_d.T)
    dz0_d = da0_d * self.dlrelu(self.z0_d)
    dx_d = np.dot(dz0_d, self.W0_d.T)

    # -------- Backprop through Generator --------#
    # J_G = np.mean(-np.log(a1_fake))
    dz1_g = dx_d * self.dtanh(self.z1_g)
    dW1_g = np.dot(self.a0_g.T, dz1_g)
    db1_g = np.sum(dz1_g, axis=0, keepdims=True)

    da0_g = np.dot(dz1_g, self.W1_g.T)
    dz0_g = da0_g * self.dlrelu(self.z0_g, alpha=0)
    dW0_g = np.dot(z.T, dz0_g)
    db0_g = np.sum(dz0_g, axis=0, keepdims=True)

    # -------- Update weights using SGD --------#
    self.W0_g -= self.lr * dW0_g
    self.b0_g -= self.lr * db0_g

    self.W1_g -= self.lr * dW1_g
    self.b1_g -= self.lr * db1_g
    
GAN.backward_generator = backward_generator

Sampling & GIF generation

sample_images will enable us to view digits from the generator’s distribution at the frequency defined by the user through the display_epochs hyperparameter. After each epoch, we will generate a grid of sample images (but not show it if the frequency criterion is not met) and, if create_gif is enabled, save it in the GAN_sample_images folder in your current directory.

def sample_images(self, images, epoch, show):
    images = np.reshape(images, (self.batch_size, self.image_size, self.image_size))
    
    fig = plt.figure(figsize=(4, 4))

    for i in range(16):
        plt.subplot(4, 4, i + 1)
        plt.imshow(images[i] * 127.5 + 127.5, cmap='gray')
        plt.axis('off')

    # saves generated images in the GAN_sample_images folder
    if self.create_gif:
        current_epoch_filename = self.image_dir.joinpath(f"GAN_epoch{epoch}.png")
        self.filenames.append(current_epoch_filename)
        plt.savefig(current_epoch_filename)

    if show:
        plt.show()
    else:
        plt.close()

GAN.sample_images = sample_images

At the end of training, we will generate a gif from the sample images of the generator if create_gif is initialised to True. This can be achieved in a few lines of code with imageio, which can read from filenames, file objects, http, zipfiles and bytes.

def generate_gif(self):
    images = []
    for filename in self.filenames:
        images.append(imageio.imread(filename))
    imageio.mimsave("GAN.gif", images)
    
GAN.generate_gif = generate_gif

Training

Finally, we will define a function to train the model. train takes as input raw images and labels and outputs the loss of the Generator and Discriminator at each training step.

In order to speed up training, we will train on mini-batches. The number of batches (num_batches) is determined by the total number of training images divided by the user-defined batch_size.

def train(self, x, y):
    J_Ds = []  # stores the discriminator losses
    J_Gs = []  # stores the generator losses

    # preprocess input; note that the labels aren't needed
    x_train, _, num_batches = self.preprocess_data(x, y)

    for epoch in range(self.epochs):
        for i in range(num_batches):
            # ------- PREPARE INPUT BATCHES & NOISE -------#
            x_real = x_train[i * self.batch_size: (i + 1) * self.batch_size] # 64x784
            z = np.random.normal(0, 1, size=[self.batch_size, self.nx_g])  # 64x100

            # ------- FORWARD PROPAGATION -------#
            z1_g, x_fake = self.forward_generator(z)

            z1_d_real, a1_d_real = self.forward_discriminator(x_real)
            z1_d_fake, a1_d_fake = self.forward_discriminator(x_fake)

            # ------- CROSS ENTROPY LOSS -------#
            # ver1 : max log(D(x)) + log(1 - D(G(z))) (in original paper)
            # ver2 : min -log(D(x)) - log(1 - D(G(z))) (implemented here)
            J_D = np.mean(-np.log(a1_d_real) - np.log(1 - a1_d_fake))
            J_Ds.append(J_D)

            # ver1 : minimize log(1 - D(G(z))) (in original paper)
            # ver2 : maximize log(D(G(z)))
            # ver3 : minimize -log(D(G(z))) (implemented here)
            J_G = np.mean(-np.log(a1_d_fake))
            J_Gs.append(J_G)
            # ------- BACKWARD PROPAGATION -------#
            self.backward_discriminator(x_real, z1_d_real, a1_d_real,
                                        x_fake, z1_d_fake, a1_d_fake)
            self.backward_generator(z, x_fake, z1_d_fake, a1_d_fake)

        if epoch % self.display_epochs == 0:
            print(f"Epoch:{epoch:}|G loss:{J_G:.4f}|D loss:{J_D:.4f}|D(G(z))avg:{np.mean(a1_d_fake):.4f}|D(x)avg:{np.mean(a1_d_real):.4f}|LR:{self.lr:.6f}")
            self.sample_images(x_fake, epoch, show=True) # display sample images
        else:
            self.sample_images(x_fake, epoch, show=False)

        # reduce learning rate after every epoch
        self.lr = self.lr * (1.0 / (1.0 + self.dr * epoch))

    # generate gif
    if self.create_gif:
        self.generate_gif()
        
    return J_Ds, J_Gs

GAN.train = train
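
Because the decay at the end of each epoch is applied to the already decayed learning rate, its effect compounds over time. The sketch below replays the same update rule outside the class, using the default values from this post, to show how the learning rate evolves.

```python
learning_rate, decay_rate, epochs = 1e-3, 1e-4, 100

lr = learning_rate
history = []
for epoch in range(epochs):
    history.append(lr)  # learning rate used during this epoch
    lr = lr * (1.0 / (1.0 + decay_rate * epoch))  # same update as in train

print(f"first epoch: {history[0]:.6f}, last epoch: {history[-1]:.6f}")
```

With these settings, the learning rate in the last epoch has dropped to roughly 60% of its initial value.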

We can now train our GAN by alternating the training of the discriminator and the generator. As discussed earlier, to get quick results, I recommend running the model for one digit only, which is defined in the numbers list.

numbers = [3]
model = GAN(numbers, learning_rate = 1e-3, decay_rate = 1e-4, epochs = 100)
J_Ds, J_Gs = model.train(x_train, y_train)

The next figure visualises the loss of the discriminator and generator at each training step. As training progresses the generator error decreases, implying that the images it generates are improving. While the generator improves, the discriminator’s error increases, because the synthetic images are becoming more realistic each time.

plt.plot(J_Ds)
plt.plot(J_Gs)

plt.xlabel("# training steps")
plt.ylabel("training cost")
plt.legend(['Discriminator', 'Generator'])
plt.show()

Figure 5: Evolution of the discriminator and generator training losses
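
Per-step losses can be noisy, which may make trends hard to read. One option is to smooth the curves with a moving average before plotting. The sketch below uses a synthetic curve as a stand-in for J_Ds or J_Gs; the helper moving_average and the window size are assumptions made for this example.

```python
import numpy as np

def moving_average(values, window=100):
    """Average each run of `window` consecutive values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# synthetic noisy curve standing in for J_Ds or J_Gs
rng = np.random.default_rng(0)
steps = np.arange(2000)
noisy_loss = 1.0 + 0.5 * np.exp(-steps / 500) + rng.normal(0, 0.3, size=steps.size)

smoothed = moving_average(noisy_loss, window=100)
print(noisy_loss.shape, smoothed.shape)  # (2000,) (1901,)
```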

Remarks

GANs are known to be difficult to optimise. Without the right network architecture, hyperparameters, and training procedure, the discriminator can overpower the generator, or vice-versa. You can experience this yourself by trying to optimise the GAN implemented in this tutorial for all digits (0-9). The two most common failure modes are:

  1. The generator overpowers the discriminator (mode collapse). The generator can collapse to a parameter setting where it always emits the same samples that the discriminator believes are highly realistic. You can recognise mode collapse in your GAN if it generates very similar images. Mode collapse can sometimes be corrected by strengthening the discriminator in some way—for instance, by adjusting its learning rate or by reconfiguring its layers.
  2. The discriminator overpowers the generator, classifying generated images as fake with absolute certainty. When the discriminator responds with absolute certainty, it leaves no gradient for the generator to descend.
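
Besides inspecting the sample grids by eye, a rough numerical indicator of mode collapse is how much the images within a generated batch differ from each other. The sketch below is a simple heuristic on hypothetical batches, not an established metric; mean_pairwise_distance is a helper introduced here for illustration.

```python
import numpy as np

def mean_pairwise_distance(images):
    """Mean Euclidean distance between all distinct pairs of flattened images."""
    flat = images.reshape(len(images), -1)
    diffs = flat[:, None, :] - flat[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    n = len(flat)
    return dists.sum() / (n * (n - 1))  # diagonal entries are zero

rng = np.random.default_rng(0)

# hypothetical batches: one varied, one with near-identical samples
diverse = rng.uniform(-1, 1, size=(16, 784))
collapsed = np.tile(rng.uniform(-1, 1, size=(1, 784)), (16, 1))
collapsed += rng.normal(0, 0.01, size=collapsed.shape)

print(mean_pairwise_distance(diverse) > mean_pairwise_distance(collapsed))  # True
```

A value that shrinks toward zero over the course of training would suggest that the generator is producing nearly identical samples.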

Practitioners have amassed many strategies to mitigate these instabilities and improve the performance of GANs [5, 6]. A summary of key strategies can be found at this GitHub repository. These should be regarded as techniques that are worth trying out, not as best practices. As implementing and testing these techniques with Numpy would be extremely time-consuming, I recommend using a deep learning library like TensorFlow. You can find my improved version of a GAN, implemented with TensorFlow 2.0, in my GitHub repository.

References

[1] Wang Kunfeng, Gou Chao, Duan Yanjie, Lin Yilun, Zheng Xinhu and Wang Fei-Yue. (2017). “Generative Adversarial Networks: Introduction and Outlook”.

[2] Radford Alec, Metz Luke and Chintala Soumith. (2015). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”.

[3] Maas Andrew L, Hannun Awni Y, and Ng Andrew Y. (2013). “Rectifier nonlinearities improve neural network acoustic models”.

[4] Xu Bing, Wang Naiyan, Chen Tianqi, and Li Mu. (2015). “Empirical evaluation of rectified activations in convolutional network”.

[5] Glorot Xavier and Bengio Yoshua. (2010). “Understanding the difficulty of training deep feedforward neural networks”.

[6] Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec and Chen Xi. (2016). “Improved techniques for training GANs”.