Basics about Convolutional Neural Networks II

Parameter Sharing
Terms in ConvNets
TensorFlow Convolution Layer
Pooling
1x1 Convolutions
Put it together

When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known as translation invariance. How can we achieve this?

As we saw earlier, the classification of a given patch in an image is determined by the weights and biases corresponding to that patch.

If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.

This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren’t shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.

An example of sharing weights would be:

In this example, we have three weights (the red, blue, and green arrows) that are reused 3 times each, once for each layer m neuron.
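
To make the idea concrete, here is a minimal sketch in plain Python. The weight values, the five-neuron input layer and the helper name shared_layer are assumptions made up for illustration; they are not taken from the figure. The same three weights are applied at every position, so a pattern produces the same peak response wherever it appears.

```python
# Hypothetical toy values: three shared weights, standing in for the three arrows.
shared_weights = [0.5, -1.0, 2.0]

def shared_layer(inputs, weights):
    """Slide one set of weights over the inputs, producing one output per position."""
    width = len(weights)
    return [
        sum(w * x for w, x in zip(weights, inputs[start:start + width]))
        for start in range(len(inputs) - width + 1)
    ]

# Five input neurons give three output neurons, each using the same three weights.
layer_m = shared_layer([1.0, 2.0, 3.0, 4.0, 5.0], shared_weights)
print(layer_m)  # [4.5, 6.0, 7.5]

# The same pattern (1, 2, 3) placed on the left or on the right of the input
# triggers the same maximum response, only at a different position.
left = shared_layer([1.0, 2.0, 3.0, 0.0, 0.0], shared_weights)
right = shared_layer([0.0, 0.0, 1.0, 2.0, 3.0], shared_weights)
print(max(left), max(right))  # 4.5 4.5
```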

Terms in ConvNets

ConvNets are neural networks that share their parameters across space.

Suppose we take a small patch (also called a kernel or filter) of the input image and run a tiny neural network on it that produces, say, K outputs. Let’s represent the outputs as a vertical column with a depth of K (also called the filter depth). Then we slide the little neural network horizontally and vertically across the entire input image without changing the weights. When the whole process is done, we end up with an output that has a smaller width and height but a greater depth (shown in the graph below). This process is called convolution.

If we keep stacking convolution outputs layer after layer, we end up with a neural network structured like the graph below. On top of the last output we can put a classifier, so that the whole network can be trained to classify objects.

Each slice along the depth of the stack is called a feature map. For example, the first convolutional layer maps the 3 feature maps R, G and B to K feature maps.

Another term we need to know is stride, which is the number of pixels by which we shift the filter at each step of the convolution. For example, the graph below shows a stride of 1 pixel.

Two more terms are related to stride: valid padding and same padding. As we shift the patch towards the edge of the image, we can either stop at the edge, in which case we have valid padding, or go beyond the edge and pad the area outside the image with 0s. With a stride of 1, the latter gives an output with the same width and height as the input image, which is why it is called same padding (shown in the graph below).

The graph below shows another example to look at the relations between stride, valid and same padding, and output image size.
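
The interaction between stride, padding and output size can also be checked with a naive convolution written in plain Python. This is only a sketch under simplifying assumptions: a single channel, a square filter, symmetric zero padding, and a hypothetical helper name conv2d_single. It is not how TensorFlow implements the operation.

```python
def conv2d_single(image, kernel, stride=1, pad=0):
    """Naive single-channel convolution (no kernel flip, as in CNN layers)."""
    h, w = len(image), len(image[0])
    f = len(kernel)  # square kernel assumed
    # Zero-pad the image on all four sides.
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    out = []
    # Slide the kernel over the padded image, moving `stride` pixels at a time.
    for top in range(0, h + 2 * pad - f + 1, stride):
        row = []
        for left in range(0, w + 2 * pad - f + 1, stride):
            row.append(sum(
                kernel[i][j] * padded[top + i][left + j]
                for i in range(f) for j in range(f)
            ))
        out.append(row)
    return out

image = [[1] * 5 for _ in range(5)]    # 5x5 image of ones
kernel = [[1] * 3 for _ in range(3)]   # 3x3 filter of ones

valid = conv2d_single(image, kernel, stride=1, pad=0)    # no padding
same = conv2d_single(image, kernel, stride=1, pad=1)     # pad with 0s
strided = conv2d_single(image, kernel, stride=2, pad=0)  # larger stride

print(len(valid), len(valid[0]))      # 3 3
print(len(same), len(same[0]))        # 5 5
print(len(strided), len(strided[0]))  # 2 2
print(same[0][0], same[0][2], same[2][2])  # 4 6 9
```

The corner and edge values of the padded output are smaller because part of the filter lies over the padded 0s there.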

Let’s look at how dimensions change across layers. Suppose we are given:

our input layer has a width of W and a height of H.
our convolutional layer has a filter size F.
we have a stride of S.
a padding of P.
and the number of filters K.

So, we will know:

The width of the next layer would be: (W − F + 2P)/S + 1.
The output height would be: (H − F + 2P)/S + 1.
And the output depth would be equal to the number of filters, K.
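
As a quick check of these formulas, here is a small sketch in plain Python. The function names and the example numbers (a 32 x 32 input, 8 x 8 filters, stride 2, padding 1, 20 filters) are assumptions chosen for illustration, and integer division is used on the assumption that the filter positions divide evenly.

```python
def conv_output_size(in_size, filter_size, stride, padding):
    """Output width or height: (W - F + 2P) / S + 1."""
    return (in_size - filter_size + 2 * padding) // stride + 1

def conv_output_shape(width, height, filter_size, stride, padding, num_filters):
    """Return (width, height, depth) of the next layer."""
    return (
        conv_output_size(width, filter_size, stride, padding),
        conv_output_size(height, filter_size, stride, padding),
        num_filters,  # the output depth equals the number of filters K
    )

shape = conv_output_shape(width=32, height=32, filter_size=8,
                          stride=2, padding=1, num_filters=20)
print(shape)  # (14, 14, 20)
```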

TensorFlow uses the following equations for ‘SAME’ and ‘VALID’ padding.

With SAME padding, the output height and width are computed as:
out_height = ceil(float(in_height) / float(strides[1]))
out_width = ceil(float(in_width) / float(strides[2]))

With VALID padding, the output height and width are computed as:
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))
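
The two rules above can be wrapped in a small plain-Python helper to try out different settings without building a graph. The function name tf_output_size is hypothetical; only the arithmetic comes from the equations above.

```python
import math

def tf_output_size(in_size, filter_size, stride, padding):
    """Output height or width for TensorFlow-style 'SAME' or 'VALID' padding."""
    if padding == 'SAME':
        return math.ceil(in_size / stride)
    if padding == 'VALID':
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# A 10-pixel-wide input, a 5-pixel-wide filter and a stride of 2.
print(tf_output_size(10, 5, 2, 'SAME'))   # 5
print(tf_output_size(10, 5, 2, 'VALID'))  # 3
```

Note that with SAME padding the result does not depend on the filter size at all, only on the input size and the stride.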

Let’s ask several questions about the process of making convolutions.
First, how many neurons does each patch connect to?

Answer: That depends on our filter depth. If we have a depth of k, we connect each patch of pixels to k neurons in the next layer. This gives the next layer a depth of k.

Second, why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?

Answer: Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture. Having multiple neurons for a given patch ensures that our CNN can learn to capture whatever characteristics the CNN learns are important. Remember that the CNN isn’t “programmed” to look for certain characteristics. Rather, it learns on its own which characteristics to notice.

After we build all the convolutional layers and connect them together, we can attach a few fully connected layers to the end of the structure, and we will be ready to train our classifier (shown in the graph below).

Having figured out the dimensions of the output layer, let’s see how many parameters we save by sharing parameters.

Assume the input image has size 32 * 32 * 3 and we apply 20 filters of size 8 * 8 * 3. With a stride of 2 and a padding of 1, the formula above gives an output layer of size 14 * 14 * 20.

Without parameter sharing, we would have (8 * 8 * 3 + 1) * (14 * 14 * 20) = 756560 parameters, where the + 1 is for the bias. We basically have to connect every pixel in the filter to every neuron in the output layer, each with its own weight.

With parameter sharing, we would only have (8 * 8 * 3 + 1) * 20 = 3860 parameters, because each filter reuses the same weights and bias at every position of the output. Note that we should not share parameters across feature maps, because each feature map is meant to learn a different feature. That’s why we get rid of the 14 * 14 but still keep the depth of 20.
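
The arithmetic is easy to verify in a few lines of plain Python. The variable names are made up for this sketch; the numbers are the ones used in the example above.

```python
filter_h, filter_w, in_depth = 8, 8, 3
out_h, out_w, out_depth = 14, 14, 20

# Weights of one filter plus one bias.
params_per_neuron = filter_h * filter_w * in_depth + 1

# Every output neuron gets its own weights and bias.
without_sharing = params_per_neuron * (out_h * out_w * out_depth)

# One set of weights and one bias per filter, reused at every position.
with_sharing = params_per_neuron * out_depth

print(params_per_neuron)                # 193
print(without_sharing)                  # 756560
print(with_sharing)                     # 3860
print(without_sharing // with_sharing)  # 196, i.e. 14 * 14
```

The saving factor is exactly the number of spatial positions in the output, 14 * 14 = 196.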

TensorFlow Convolution Layer

TensorFlow provides the tf.nn.conv2d() and tf.nn.bias_add() functions to create our own convolutional layers. An example is given below:

import tensorflow as tf  # this example uses the TensorFlow 1.x API

# output depth
k_output = 64

# input image dimensions
image_width = 10
image_height = 10
color_channels = 3

# convolutional filter
filter_width = 5
filter_height = 5

# input in TF
input = tf.placeholder(tf.float32, 
                      shape=[None, image_height, 
                      image_width, color_channels])

# weight and bias
weight = tf.Variable(tf.truncated_normal([filter_height,
                                          filter_width,
                                          color_channels,
                                          k_output]))
bias = tf.Variable(tf.zeros(k_output))

# apply convolution
conv_layer = tf.nn.conv2d(input, weight, 
                          strides=[1,2,2,1], 
                          padding='SAME')
# add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# apply activation
conv_layer = tf.nn.relu(conv_layer)

TensorFlow uses a stride for each input dimension, [batch, input_height, input_width, input_channels]. We generally set the stride for batch and input_channels (i.e. the first and fourth elements of the strides array) to 1.

Pooling

The idea of max pooling is that we take the maximum pixel value in the filter area we are looking at (shown in the graph below).

Conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values. Pooling can also help prevent overfitting.
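
As a small illustration, max pooling on a single feature map can be sketched in plain Python. The 4 x 4 input values and the helper name max_pool2d are made up for this example, and no padding is applied.

```python
def max_pool2d(feature_map, k=2, stride=2):
    """Keep only the maximum of each k x k window (no padding)."""
    out = []
    for top in range(0, len(feature_map) - k + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - k + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(k) for j in range(k)]
            row.append(max(window))
        out.append(row)
    return out

feature_map = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

A 2 x 2 window with a stride of 2 halves the width and height, so the 16 input values are reduced to 4.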

TensorFlow provides the tf.nn.max_pool() function to apply max pooling to the convolution layers. An example is given:

conv_layer = tf.nn.conv2d(input, weight, strides=[1,2,2,1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)

# apply max pooling
conv_layer = tf.nn.max_pool(conv_layer, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

Similar to the strides in convolution layers, the ksize and strides parameters are also structured as 4-element lists, with each element corresponding to a dimension of the input tensor ([batch, height, width, channels]). For both ksize and strides, the batch and channel dimensions are typically set to 1.

There is also another pooling technique, average pooling, which is conceptually the same as max pooling except that it takes the average value in the filter area instead of the maximum. Average pooling can be thought of as producing a lower-resolution, blurred version of the feature map.
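
For comparison, here is a plain-Python sketch of average pooling on a made-up 4 x 4 feature map. The helper name avg_pool2d and the input values are assumptions for illustration, and no padding is applied.

```python
def avg_pool2d(feature_map, k=2, stride=2):
    """Replace each k x k window by its average (no padding)."""
    out = []
    for top in range(0, len(feature_map) - k + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - k + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(k) for j in range(k)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

feature_map = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

avg_pooled = avg_pool2d(feature_map)
print(avg_pooled)  # [[2.75, 4.75], [1.75, 1.75]]
```

Every value in a window contributes to the result, which is why the output looks like a blurred, smaller copy of the input rather than a map of its strongest activations.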

1x1 Convolutions

Recall that running a convolutional layer on a small patch is like having a linear classifier for that patch. However, if we add a 1x1 convolution in the middle, then we have a mini neural network running over the patch instead of a linear classifier. Interspersing the convolutions with 1x1 convolutions is a very inexpensive way to make the models deeper and give them more parameters without completely changing their structure. They are very cheap to add because they are in fact just matrix multiplications, and they have relatively few parameters.
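
To see why a 1x1 convolution is just a matrix multiplication, here is a plain-Python sketch. The tiny 2 x 2 x 3 feature map, the 3 x 2 weight matrix and the helper name conv1x1 are made up for illustration, and the bias and activation are left out.

```python
def conv1x1(feature_map, weights):
    """Apply the same C_in x C_out matrix to the channel vector of every pixel.

    feature_map: nested lists of shape H x W x C_in
    weights:     nested lists of shape C_in x C_out
    """
    c_out = len(weights[0])
    return [
        [
            [sum(pixel[c] * weights[c][k] for c in range(len(pixel)))
             for k in range(c_out)]
            for pixel in row
        ]
        for row in feature_map
    ]

feature_map = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 0]],
]  # 2 x 2 pixels, 3 channels each

weights = [
    [1, 0],
    [0, 1],
    [1, 1],
]  # maps 3 input channels to 2 output channels with only 6 weights

out = conv1x1(feature_map, weights)
print(out)  # [[[4, 5], [10, 11]], [[16, 17], [0, 1]]]
```

The width and height are unchanged; only the depth goes from 3 to 2, and the number of weights depends only on the two depths, not on the image size.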

Put it together

Having covered the concepts and details of convolutional neural networks, it is time to see how to implement one in TensorFlow. We will try it on the MNIST digits data.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)
# define parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

# number of samples to calculate validation and accuracy
test_valid_size = 256

# network parameter
n_classes = 10 # 0-9 digits
dropout = 0.75

# weights and bias
weights = {
'wc1': tf.Variable(tf.random_normal([5,5,1,32])), # [filter_height, filter_width, color_channels, k_output]
'wc2': tf.Variable(tf.random_normal([5,5,32,64])),
'wd1': tf.Variable(tf.random_normal([7*7*64,1024])),
'out': tf.Variable(tf.random_normal([1024, n_classes]))}
biases = {
'bc1': tf.Variable(tf.random_normal([32])),
'bc2': tf.Variable(tf.random_normal([64])),
'bd1': tf.Variable(tf.random_normal([1024])),
'out': tf.Variable(tf.random_normal([n_classes]))}

# calculate convolution
def conv2d(x, w, b, strides=1):
    conv_layer = tf.nn.conv2d(x, w, strides=[1, strides, strides, 1], padding='SAME')
    conv_layer = tf.nn.bias_add(conv_layer, b)
    return tf.nn.relu(conv_layer)

# calculate max pooling
def maxpool2d(x, k=2):
    return tf.nn.max_pool(x, ksize=[1,k,k,1],strides=[1,k,k,1], padding='SAME')

# put it together
def conv_net(x, weights, biases, dropout):
    # from 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)
    
    # from 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)
    
    # fully connected layer from 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)
    
    # output layer from 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

# put in session
# tf graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)  # dropout keep probability

# conv model
logits = conv_net(x, weights, biases, keep_prob)

# loss and optimizer
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

# accuracy
correct_pred = tf.equal(tf.argmax(logits,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# initialize all tf variables
init = tf.global_variables_initializer()

# launch training
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples // batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: dropout})

            # evaluate with dropout switched off (keep_prob = 1.)
            batch_loss = sess.run(loss, feed_dict={x: batch_x, y: batch_y, keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={x: mnist.validation.images[:test_valid_size],
                                                      y: mnist.validation.labels[:test_valid_size],
                                                      keep_prob: 1.})

            print('Epoch {:>2}, Batch {:>3}, Loss: {:>10.4f}, Validation Accuracy: {:.6f}'.format(epoch+1, batch+1, batch_loss, valid_acc))
        test_acc = sess.run(accuracy, feed_dict={x: mnist.test.images[:test_valid_size],
                                                 y: mnist.test.labels[:test_valid_size],
                                                 keep_prob: 1.})
        print('Testing Accuracy: {}'.format(test_acc))
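
As a sanity check on the shape comments in conv_net, the layer sizes can be traced in plain Python without TensorFlow. This is only a sketch: the helper names same_out and count are made up, and it assumes the SAME padding rule (output = ceil(input / stride)) for every convolution and pooling step, as in the code above.

```python
import math

def same_out(size, stride):
    """Output size under 'SAME' padding: ceil(input / stride)."""
    return math.ceil(size / stride)

def count(shape):
    """Number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

size = 28                      # MNIST images are 28 x 28 x 1
size = same_out(size, 1)       # conv1, stride 1 -> 28
size = same_out(size, 2)       # max pool, k=2   -> 14
size = same_out(size, 1)       # conv2, stride 1 -> 14
size = same_out(size, 2)       # max pool, k=2   -> 7
flattened = size * size * 64   # input size of the fully connected layer
print(size, flattened)         # 7 3136

weight_shapes = {
    'wc1': (5, 5, 1, 32),
    'wc2': (5, 5, 32, 64),
    'wd1': (flattened, 1024),
    'out': (1024, 10),
}
bias_shapes = {'bc1': (32,), 'bc2': (64,), 'bd1': (1024,), 'out': (10,)}

total_params = (sum(count(s) for s in weight_shapes.values())
                + sum(count(s) for s in bias_shapes.values()))
print(total_params)  # 3274634
```

Almost all of the parameters sit in the fully connected layer wd1, while the two convolutional layers together need only about 52 thousand, which is the effect of parameter sharing.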

Disclaimer: This post includes my personal reflections and notes from taking the Deep Learning Nanodegree from Udacity. Some text and images are taken from the course materials for educational purposes.