Autoencoder in TensorFlow

Introduction
Autoencoders for Image Compression
- The Simplest Autoencoder
- Convolutional Autoencoder
Autoencoder for Denoising

Introduction

An autoencoder is a data compression algorithm that consists of an encoder, which compresses the original input, and a decoder, which reconstructs the input from the compressed representation. The image below shows the basic idea of autoencoders. In almost all contexts where the term “autoencoder” is used, the compression and decompression functions are implemented with neural networks.

Autoencoders have some notable properties:

Autoencoders are data-specific, which means that they will only be able to compress data similar to what they have been trained on.
Autoencoders are lossy: the output is rebuilt from a smaller compressed representation, so it will not exactly match the original input.
Autoencoders usually perform worse than standard compression algorithms such as JPEG.
Autoencoders are rarely used in practical applications. Today, two interesting practical applications of autoencoders are data denoising (which we feature later in this post) and dimensionality reduction for data visualization. Another reason why autoencoders still attract attention is their potential for solving unsupervised learning problems.
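
The "data-specific" and "lossy" properties can be seen even without a neural network. The sketch below is not the TensorFlow model from this post: it swaps in a linear encoder/decoder fitted with PCA (via SVD) on synthetic 16-dimensional data, and all names and sizes are made up for illustration. It only shows that reconstruction is inexact, and much worse on data unlike the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training" data: 16-dimensional points lying close to a 4-dimensional subspace.
basis = rng.normal(size=(4, 16))
train = rng.normal(size=(500, 4)) @ basis + 0.1 * rng.normal(size=(500, 16))

# Fit a linear encoder/decoder pair (PCA via SVD) with a 4-number code.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:4]                       # shape (4, 16)

def encode(x):
    return (x - mean) @ components.T      # 16 numbers -> 4 numbers

def decode(code):
    return code @ components + mean       # 4 numbers -> 16 numbers

def reconstruction_mse(x):
    return float(np.mean((x - decode(encode(x))) ** 2))

# New data from the same process as the training data, and unrelated data.
similar = rng.normal(size=(200, 4)) @ basis + 0.1 * rng.normal(size=(200, 16))
different = rng.normal(size=(200, 16))

err_similar = reconstruction_mse(similar)
err_different = reconstruction_mse(different)
print("similar data MSE:  ", err_similar)
print("different data MSE:", err_different)
```

The error on similar data is small but not zero (lossy), while the error on unrelated data is far larger (data-specific).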

Autoencoders for Image Compression

As mentioned above, the encoder and decoder are usually implemented as neural networks, so let’s see how that is done in TensorFlow. In the following examples, we will see how autoencoders compress the MNIST dataset. Let’s dive in!

The Simplest Autoencoder

The simplest encoder and decoder would be a 1-layer neural network, whose structure is shown below:

Following the above architecture, we can easily build the network in TensorFlow.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data
# validation_size=0 since we are not predicting anything so no need for validation dataset
mnist = input_data.read_data_sets('MNIST_data', validation_size=0)

tf.reset_default_graph()
encoding_dim = 32

# Input and target placeholders
inputs_ = tf.placeholder(tf.float32, [None, 784], name='inputs')
targets_ = tf.placeholder(tf.float32, [None, 784], name='targets')

# Output of hidden layer, single fully connected layer here with ReLU activation
encoded = tf.layers.dense(inputs_, encoding_dim, activation=tf.nn.relu)

# Output layer logits, fully connected layer with no activation
logits = tf.layers.dense(encoded, 784, activation=None)

# MNIST pixel values are already normalized to [0, 1], so we apply a sigmoid
# to squash the logits into the same range
# Sigmoid output from logits
decoded = tf.sigmoid(logits, name='output')

# Sigmoid cross-entropy loss
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=targets_)
# Mean of the loss
cost = tf.reduce_mean(loss)

# Adam optimizer, minimizing the mean loss
opt = tf.train.AdamOptimizer().minimize(cost)
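
To make the loss less of a black box, here is a small NumPy sketch of what a sigmoid cross-entropy on logits computes per pixel. It uses the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)), where x is the logit and z is the target pixel value. The function names and sample values are illustrative, and this is a re-implementation for explanation rather than the TensorFlow source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, labels):
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return (np.maximum(logits, 0.0)
            - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

logits = np.array([-3.0, -0.5, 0.0, 2.0])
labels = np.array([0.0, 0.2, 0.5, 1.0])   # pixel intensities in [0, 1], not hard 0/1 labels

# Textbook definition, which can overflow for large |logits|.
p = sigmoid(logits)
naive = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

stable = sigmoid_cross_entropy(logits, labels)   # one loss value per pixel
cost = stable.mean()                             # scalar that the optimizer minimizes
print(stable, cost)
```

The loss is element-wise, which is why the graph reduces it to a mean before handing it to the optimizer.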

After defining the autoencoder network, we can start training it with images.

epochs = 20
batch_size = 200
# Create the session that runs the graph, then initialize all variables
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for e in range(epochs):
    for ii in range(mnist.train.num_examples//batch_size):
        batch = mnist.train.next_batch(batch_size)
        feed = {inputs_: batch[0], targets_: batch[0]}
        batch_cost, _ = sess.run([cost, opt], feed_dict=feed)

        print("Epoch: {}/{}...".format(e+1, epochs),
              "Training loss: {:.4f}".format(batch_cost))

After training is done, we can compare the reconstructed images with the originals:

fig, axes = plt.subplots(nrows=2, ncols=10, sharex=True, sharey=True, figsize=(20,4))
in_imgs = mnist.test.images[:10]
reconstructed, compressed = sess.run([decoded, encoded], feed_dict={inputs_: in_imgs})

for images, row in zip([in_imgs, reconstructed], axes):
    for img, ax in zip(images, row):
        ax.imshow(img.reshape((28, 28)), cmap='Greys_r')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)

fig.tight_layout(pad=0.1)

Convolutional Autoencoder

A network with several convolutional layers usually performs better on images than a 1-layer fully connected network, so let’s see how convolutional autoencoders compare to our simple version above. The encoder part of the network will be a typical convolutional pyramid: each convolutional layer is followed by a max-pooling layer that reduces the width and height of the layers. The decoder needs to convert the narrow representation back into a wide reconstructed image. One approach is to use transposed convolution layers, which increase the width and height of the layers. They work almost exactly the same as convolutional layers, but in reverse. However, transposed convolution layers can lead to artifacts in the final images, such as checkerboard patterns. These are caused by overlap in the kernels, which can be avoided by setting the stride and kernel size equal. Here we instead upsample the images with nearest-neighbor resizing and follow each resize with a convolutional layer, which works better than transposed convolution layers. A typical schematic for such a convolutional network is shown below:
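
The overlap argument can be checked with a small count. The sketch below is a 1-D simplification with a hypothetical helper, `overlap_counts`, that is not part of TensorFlow: it counts how many kernel placements of a transposed convolution write to each output position, assuming no padding or cropping.

```python
import numpy as np

def overlap_counts(input_length, kernel_size, stride):
    """Count how many kernel placements touch each output position (1-D, no cropping)."""
    output_length = (input_length - 1) * stride + kernel_size
    counts = np.zeros(output_length, dtype=int)
    for i in range(input_length):
        start = i * stride                      # each input value paints one kernel-sized window
        counts[start:start + kernel_size] += 1
    return counts

uneven = overlap_counts(input_length=4, kernel_size=3, stride=2)
even = overlap_counts(input_length=4, kernel_size=2, stride=2)
print(uneven)  # [1 1 2 1 2 1 2 1 1]
print(even)    # [1 1 1 1 1 1 1 1]
```

With kernel size 3 and stride 2, some positions receive twice as many contributions as their neighbors, and that alternating pattern is what shows up as a checkerboard in 2-D. With kernel size equal to stride, every position is written exactly once.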

We will again use the MNIST data here. Although we usually omit parameters that keep their default values, it is better to know which defaults are used, so the code below spells them out. It is also good practice to give each layer a specific name. We use the higher-level TensorFlow API here, but it is good to know what the parameters are so that we know what they are doing.

tf.reset_default_graph()
learning_rate = 0.001

# Input and target placeholders
inputs_ = tf.placeholder(tf.float32, [None, 28,28,1], name='inputs')
targets_ = tf.placeholder(tf.float32, [None, 28,28,1], name='targets')


### Encoder
# 28x28x16
conv1 = tf.layers.conv2d(inputs_, filters=16, kernel_size=(3,3),strides=1, 
                         padding='SAME',activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(), name='conv1')
print('conv1.shape',conv1.shape)

# 14x14x16
maxpool1 = tf.layers.max_pooling2d(conv1, pool_size=(2,2),strides=2,padding='SAME',name='maxpool1')
print('maxpool1.shape',maxpool1.shape)

# 14x14x8
conv2 = tf.layers.conv2d(maxpool1,filters=8,kernel_size=(3,3),strides=1,
                        padding='SAME',activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(), name='conv2')
print('conv2.shape',conv2.shape)

# 7x7x8
maxpool2 = tf.layers.max_pooling2d(conv2,pool_size=(2,2),strides=2,padding='SAME',name='maxpool2')
print('maxpool2.shape',maxpool2.shape)

# 7x7x8
conv3 = tf.layers.conv2d(maxpool2, filters=8,kernel_size=(3,3),strides=1,
                        padding='SAME',activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv3')
print('conv3.shape',conv3.shape)

# 4x4x8
encoded = tf.layers.max_pooling2d(conv3,pool_size=(2,2),strides=2,padding='SAME',name='maxpool3')
print('encoded.shape',encoded.shape)


### Decoder
# 7x7x8
upsample1 = tf.image.resize_nearest_neighbor(encoded, (7,7), name='upsample1')
print('upsample1.shape',upsample1.shape)

# 7x7x8
conv4 = tf.layers.conv2d(upsample1, filters=8, kernel_size=(3,3),strides=1,
                        padding='SAME',activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv4')
print('conv4.shape',conv4.shape)

# 14x14x8
upsample2 = tf.image.resize_nearest_neighbor(conv4, (14,14), name='upsample2')
print('upsample2.shape',upsample2.shape)

# 14x14x8
conv5 = tf.layers.conv2d(upsample2,filters=8,kernel_size=(3,3),strides=1,
                        padding='SAME',activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv5')
print('conv5.shape',conv5.shape)

# 28x28x8
upsample3 = tf.image.resize_nearest_neighbor(conv5, (28,28), name='upsample3')
print('upsample3.shape',upsample3.shape)

# 28x28x16
conv6 = tf.layers.conv2d(upsample3,filters=16,kernel_size=(3,3),strides=1,
                        padding='SAME',activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv6')
print('conv6.shape',conv6.shape)

# 28x28x1
logits = tf.layers.conv2d(conv6,filters=1,kernel_size=(3,3),strides=1,
                        padding='SAME',activation=None,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='logits')
print('logits.shape',logits.shape)


# reconstructed image
decoded = tf.sigmoid(logits)

# calculate the cross-entropy loss
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=targets_)

# cost and optimizer
cost = tf.reduce_mean(loss)
opt = tf.train.AdamOptimizer(learning_rate).minimize(cost)
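
The placeholders of this network expect images of shape 28x28x1, while the MNIST loader used earlier returns flat rows of 784 pixels, so each batch needs a reshape before it is fed to the graph. A minimal NumPy sketch follows. The random array is only a stand-in for a real batch, and the commented `feed` line shows where the reshaped array would go.

```python
import numpy as np

# Stand-in for `mnist.train.next_batch(batch_size)[0]`: flat rows of 784 pixels.
flat_batch = np.random.rand(200, 784).astype(np.float32)

# Reshape to (batch, height, width, channels); -1 lets NumPy infer the batch size.
imgs = flat_batch.reshape((-1, 28, 28, 1))
print(imgs.shape)  # (200, 28, 28, 1)

# In the training loop, the same array serves as both input and target:
# feed = {inputs_: imgs, targets_: imgs}
```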

The training and plotting code is the same as for the simple autoencoder, except that each flat batch of 784 pixels must be reshaped to 28x28x1 to match the new placeholders. The result for the above convolutional autoencoder is shown below:

In convolutional neural networks we usually initialize the weights and biases with a truncated normal distribution, but in this case the network failed to converge and produced much worse results. The first image shows the result with both weights and biases drawn from a truncated normal distribution. The second shows the result with weights drawn from a truncated normal distribution and biases set to zero.

Instead, we used glorot_normal_initializer for the weights and zeros_initializer for the biases. The glorot_normal_initializer draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / (fan_in + fan_out)), where fan_in is the number of input units in the weight tensor and fan_out is the number of output units in the weight tensor.
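
As a worked example of the formula, the sketch below computes the standard deviation for a 3x3 kernel that maps 1 channel to 16 channels. It assumes the usual convention that a convolution kernel's fans are the receptive-field size times the channel count. The helper names are illustrative, not TensorFlow functions.

```python
import math

def conv_fans(kernel_height, kernel_width, in_channels, out_channels):
    # Assumed convention: each unit sees kernel_height * kernel_width positions.
    receptive_field = kernel_height * kernel_width
    return receptive_field * in_channels, receptive_field * out_channels

def glorot_stddev(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = conv_fans(3, 3, in_channels=1, out_channels=16)
conv_stddev = glorot_stddev(fan_in, fan_out)

# For comparison, a dense weight matrix mapping 784 inputs to 32 units.
dense_stddev = glorot_stddev(784, 32)

print(fan_in, fan_out, conv_stddev)
print(dense_stddev)
```

Layers with more incoming and outgoing connections get a smaller standard deviation, which keeps the scale of activations roughly comparable from layer to layer.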

Autoencoder for Denoising

As mentioned earlier, autoencoders like the ones we’ve built so far aren’t too useful in practice. However, they can be used to denoise images quite successfully just by training the network on noisy images. We can create the noisy images ourselves by adding Gaussian noise to the training images, then clipping the values to be between 0 and 1. We’ll use noisy images as input and the original, clean images as targets.
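
The noise step can be sketched in NumPy as follows. The helper name `add_gaussian_noise` and the `noise_factor` value of 0.5 are assumptions chosen for illustration, and the random array stands in for a batch of real MNIST images.

```python
import numpy as np

def add_gaussian_noise(images, noise_factor=0.5, rng=None):
    """Add zero-mean Gaussian noise, then clip back into the valid pixel range [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images + noise_factor * rng.normal(size=images.shape)
    return np.clip(noisy, 0.0, 1.0).astype(images.dtype)

rng = np.random.default_rng(0)
clean = rng.random((4, 28, 28, 1)).astype(np.float32)   # stand-in for a batch of images
noisy = add_gaussian_noise(clean, noise_factor=0.5, rng=rng)

# Noisy images go in as inputs, clean images are the targets:
# feed = {inputs_: noisy, targets_: clean}
print(noisy.shape, noisy.min(), noisy.max())
```

A larger `noise_factor` makes the task harder. Drawing fresh noise for every batch means the network rarely sees the same corrupted image twice.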

We can define a network similar to the one above for this task.

learning_rate = 0.001
tf.reset_default_graph()
inputs_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='inputs')
targets_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='targets')

### Encoder
# 28x28x32
conv1 = tf.layers.conv2d(inputs_, filters=32, kernel_size=(3,3),strides=1,
                         padding='SAME', activation =tf.nn.relu,
                        kernel_initializer = tf.glorot_normal_initializer(),
                        bias_initializer = tf.zeros_initializer(), name='conv1')
# 14x14x32
maxpool1 = tf.layers.max_pooling2d(conv1,pool_size=(2,2),strides=2,padding='SAME',name='maxpool1')

# 14x14x32
conv2 = tf.layers.conv2d(maxpool1,filters=32, kernel_size=(3,3),strides=1,
                        padding='SAME', activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv2')
# 7x7x32
maxpool2 = tf.layers.max_pooling2d(conv2,pool_size=(2,2),strides=2,padding='SAME',name='maxpool2')

# 7x7x16
conv3 = tf.layers.conv2d(maxpool2,filters=16, kernel_size=(3,3),strides=1,
                        padding='SAME', activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv3')

# 4x4x16
encoded = tf.layers.conv2d(conv3,filters=16, kernel_size=(3,3),strides=2,
                        padding='SAME', activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='encoded')


### Decoder
# 7x7x16
upsample1 = tf.image.resize_nearest_neighbor(encoded,(7,7),name='upsample1')

# 7x7x16
conv4 = tf.layers.conv2d(upsample1,filters=16, kernel_size=(3,3),strides=1,
                        padding='SAME', activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv4')

# 14x14x16
upsample2 = tf.image.resize_nearest_neighbor(conv4,(14,14),name='upsample2')

# 14x14x32
conv5 = tf.layers.conv2d(upsample2,filters=32, kernel_size=(3,3),strides=1,
                        padding='SAME', activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv5')

# 28x28x32
upsample3 = tf.image.resize_nearest_neighbor(conv5,(28,28),name='upsample3')

# 28x28x32
conv6 = tf.layers.conv2d(upsample3,filters=32, kernel_size=(3,3),strides=1,
                        padding='SAME', activation=tf.nn.relu,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='conv6')


# 28x28x1
logits = tf.layers.conv2d(conv6,filters=1, kernel_size=(3,3),strides=1,
                        padding='SAME', activation=None,
                        kernel_initializer=tf.glorot_normal_initializer(),
                        bias_initializer=tf.zeros_initializer(),name='logits')

# reconstructed image
decoded = tf.sigmoid(logits)

# the cross-entropy loss
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits,labels=targets_)

# cost and optimizer
cost = tf.reduce_mean(loss)
opt = tf.train.AdamOptimizer(learning_rate).minimize(cost)
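
The shape comments in both convolutional networks follow from one rule: with 'SAME' padding, a layer with stride s turns a side of length n into ceil(n / s). The short sketch below traces the encoder sizes and the size of the resulting code. The helper name is illustrative.

```python
import math

def same_output_size(size, stride):
    # With 'SAME' padding the output side length is ceil(input / stride).
    return math.ceil(size / stride)

size = 28
encoder_sizes = [size]
for _ in range(3):                      # three stride-2 downsampling steps
    size = same_output_size(size, stride=2)
    encoder_sizes.append(size)
print(encoder_sizes)                    # [28, 14, 7, 4]

# Number of values in the encoded representation versus the 784 input pixels.
compression_code = encoder_sizes[-1] ** 2 * 8     # 4x4x8 code of the compression network
denoising_code = encoder_sizes[-1] ** 2 * 16      # 4x4x16 code of the denoising network
print(compression_code, denoising_code, 28 * 28)  # 128 256 784
```

Because 7 rounds up to 4, simply doubling on the way back would give 8 rather than 7, which is presumably why the decoders resize to explicit target sizes of 7, 14 and 28.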

There are several advanced topics about autoencoders that I have not included in this post, such as sequence-to-sequence autoencoders, variational autoencoders, and contractive autoencoders. You can find more about them in this blog.

Disclaimer: This post includes my personal reflections and notes from the Deep Learning Nanodegree from Udacity. Some text and images are taken from the learning materials for educational purposes.