Lei Luo Machine Learning Engineer

Introduction to TensorFlow

Linear Functions

Hello World!
import tensorflow as tf
#define a tensorflow object
hello = tf.constant('Hello World')

with tf.Session() as sess:
    # run the graph to evaluate the constant
    output = sess.run(hello)
    print(output)

In TensorFlow, data isn’t stored as integers, floats, or strings. These values are encapsulated in an object called a tensor. Some more examples:

# A is a 0-dimensional int32 tensor
A = tf.constant(1234) 
# B is a 1-dimensional int32 tensor
B = tf.constant([123,456,789]) 
# C is a 2-dimensional int32 tensor
C = tf.constant([ [123,456,789], [222,333,444] ])

TensorFlow’s API is built around the idea of a computational graph, which is a way of visualizing a mathematical process. A session, created with tf.Session(), is an environment for running a graph.

The most common operation in neural networks is calculating the linear combination of inputs, weights, and biases. Linear functions can be written as:

$$y = xW + b$$

where $W$ is a matrix of the weights connecting two layers. The output $y$, the input $x$, and the biases $b$ are all vectors.
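To make the shapes concrete, here is a minimal sketch in plain Python (no TensorFlow) of a linear function for a single input vector, with the input multiplied on the left of the weight matrix. The helper name `linear_forward` and all the numbers are made up for illustration.

```python
def linear_forward(x, W, b):
    """Compute y = xW + b for one input vector x.

    x: list of n_features floats
    W: n_features x n_labels matrix, stored as a list of rows
    b: list of n_labels floats
    """
    n_labels = len(b)
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        for j in range(n_labels)
    ]


x = [1.0, 2.0]                 # 2 features
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 3.0]]          # 2 features x 3 labels
b = [0.1, 0.2, 0.3]            # one bias per label

y = linear_forward(x, W, b)
print(y)                       # approximately [2.1, 4.2, 5.3]
```

The output has one entry per label, which is why the weight matrix has shape (n_features, n_labels).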

Weights and Bias in TensorFlow

The goal of training a neural network is to modify the weights and biases to best predict the labels. To use weights and biases, you’ll need a tensor that can be modified. This rules out tf.placeholder() and tf.constant(), since those tensors can’t be modified. This is where the tf.Variable class comes in. A variable can be defined as: x = tf.Variable(5).

A tf.Variable stores its state in the session, so you must initialize that state manually. You’ll use the tf.global_variables_initializer() function to initialize the state of all the Variable tensors, which can be done like this:

init = tf.global_variables_initializer()
with tf.Session() as sess:
    # run the initializer op so every variable gets its initial value
    sess.run(init)

The tf.global_variables_initializer() call returns an operation that will initialize all TensorFlow variables from the graph. You call the operation using a session to initialize all the variables as shown above. Using the tf.Variable class allows us to change the weights and bias, but an initial value needs to be chosen.

Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from getting stuck in the same place every time you train it.

Similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights. We can use tf.truncated_normal() like this:

n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))

The tf.truncated_normal() function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.
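As a rough illustration of what "truncated" means here, the sketch below draws from a normal distribution and re-draws any value that falls more than 2 standard deviations from the mean. It uses only Python's standard library. The helper name `truncated_normal` and the fixed seed are assumptions for this example, and it is not how TensorFlow implements the function internally.

```python
import random


def truncated_normal(mean=0.0, stddev=1.0, rng=random):
    """Draw one sample, re-drawing until it lies within 2 stddev of the mean."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value


rng = random.Random(0)  # fixed seed so the run is repeatable
n_features, n_labels = 120, 5

# Build an n_features x n_labels weight matrix as a list of rows.
weights = [[truncated_normal(rng=rng) for _ in range(n_labels)]
           for _ in range(n_features)]

largest = max(abs(w) for row in weights for w in row)
print(largest <= 2.0)   # True: no weight is more than 2 stddev from the mean
```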

Since the weights are already helping prevent the model from getting stuck, you don’t need to randomize the bias. Let’s use the simplest solution and set the bias to 0. It can be defined like this:

n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))

The tf.zeros() function returns a tensor with all zeros.


Linear functions in TensorFlow can be implemented as:

def linear(input, n_features, n_labels):
    # define weights
    weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))
    # define bias
    bias = tf.Variable(tf.zeros(n_labels))
    # put it together
    output = tf.add(tf.matmul(input, weights), bias)
    return output

Activation functions

Besides the sigmoid, one of the most commonly used activation functions, there are other activation functions. Before we look at them, let’s look at the drawbacks of the sigmoid.

A plot of the sigmoid’s derivative shows that it maxes out at 0.25. This means that when you perform backpropagation with sigmoid units, the errors going back into the network are shrunk by at least 75% at every layer. With many layers, the weight updates for layers close to the input become tiny, and those weights take a very long time to train. Because of this, sigmoids have fallen out of favor as activations on hidden units.
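The 0.25 ceiling can be checked numerically. The sketch below uses only the standard library, scans a grid of inputs for the largest slope, and then shows how quickly a factor of 0.25 per layer compounds. It is a best-case illustration that ignores the weights, which also scale the error.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)


# Scan inputs from -10 to 10; the derivative peaks at x = 0.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_slope = max(sigmoid_prime(x) for x in grid)
print(max_slope)   # 0.25

# Best case: every layer scales the error by the maximum slope.
for n_layers in (1, 5, 10):
    print(n_layers, max_slope ** n_layers)
```

After 10 layers the best-case factor is already below one in a million, which is the vanishing effect described above.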

Rectified Linear Units

Instead of sigmoids, most recent deep learning networks use rectified linear units (ReLUs) for the hidden layers. A rectified linear unit outputs 0 if the input is less than 0, and the raw input otherwise. That is, if the input is greater than 0, the output is equal to the input. Mathematically, that looks like:

$$f(x) = \max(x, 0)$$

Graphically, it looks like:

ReLU activations are the simplest non-linear activation function you can use. When the input is positive, the derivative is 1, so there isn’t the vanishing effect you see on backpropagated errors from sigmoids.
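A minimal sketch of a ReLU and its derivative in plain Python follows. Treating the derivative at exactly 0 as 0 is a common convention and an assumption of this example, since the derivative is not defined at that point.

```python
def relu(x):
    # Output 0 for negative inputs, the input itself otherwise.
    return max(x, 0.0)


def relu_prime(x):
    # 1 for positive inputs, 0 for negative ones (and, by convention, at 0).
    return 1.0 if x > 0 else 0.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), relu_prime(x))
```

Because the slope is exactly 1 for every positive input, multiplying by it leaves a backpropagated error unchanged instead of shrinking it.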

Research has shown that ReLUs result in much faster training for large networks. Most frameworks, like TensorFlow and TFLearn, make it simple to use ReLUs on the hidden layers, so you won’t need to implement them yourself. ReLUs also have a drawback: a large gradient can set the weights such that a ReLU unit always outputs 0. These “dead” units stay at 0, and the computation spent on them during training is wasted.


In many classification problems, we can use softmax as an alternative to sigmoid. The softmax function squashes the outputs of each unit to be between 0 and 1, just like a sigmoid. It also divides each output such that the total sum of the outputs is equal to 1. The output of the softmax function is equivalent to a categorical probability distribution; it tells you the probability that any of the classes is true.

The only real difference between this and a normal sigmoid is that the softmax normalizes the outputs so that they sum to one. In both cases you put in a vector and get out a vector of the same size, with all values squashed between 0 and 1. You would use a sigmoid with one output unit for binary classification. But if you’re doing multinomial classification, you’d want to use multiple output units (one for each class) and the softmax activation on the output.

Mathematically, the softmax function is shown below, where $z$ is a vector of the inputs to the output layer and $j$ is the index for the output units.

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

In TensorFlow, softmax can be used as:

import tensorflow as tf

def run():
    data = [1.0, 2.0, 3.0]
    logits = tf.placeholder(tf.float32)
    # calculate softmax
    softmax = tf.nn.softmax(logits)
    # run the graph in a session
    with tf.Session() as sess:
        output = sess.run(softmax, feed_dict = {logits: data})
    return output
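For comparison, here is a minimal plain-Python sketch of the same softmax computation without TensorFlow, applied to the same three logits. Subtracting the maximum before exponentiating is a common numerical-stability trick and an addition of this example; it does not change the result.

```python
import math


def softmax(z):
    # Shift by the max so the exponentials cannot overflow.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    # Divide each exponential by the total so the outputs sum to 1.
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)        # roughly [0.090, 0.245, 0.665]
print(sum(probs))   # 1.0 up to floating-point rounding
```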


Recall from my previous post on decision trees that entropy is defined as:

$$H(p) = -\sum_i p_i \log p_i$$

Cross Entropy in TF

In TensorFlow, we can use tf.reduce_sum() to calculate the summation and tf.log() to calculate the natural log. Putting it all together, it looks like this:

import tensorflow as tf

input = [0.1, 0.2, 0.7]

tf_input = tf.placeholder(tf.float32)
# negative sum of p * log(p), using the natural log
tf_sum = -tf.reduce_sum(tf.multiply(tf_input, tf.log(tf_input)))

with tf.Session() as sess:
    output = sess.run(tf_sum, feed_dict={tf_input: input})
    print(output)
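The same quantities can be sketched in plain Python. The function names and the one-hot label below are assumptions for illustration. `cross_entropy` uses the usual form, the negative sum of each label times the log of the predicted probability, so with a one-hot label it reduces to the negative log of the probability assigned to the true class.

```python
import math


def entropy(p):
    """Negative sum of p_i * log(p_i), natural log, skipping zero entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(predicted, one_hot):
    """Negative sum of label_i * log(predicted_i)."""
    return -sum(l * math.log(s) for s, l in zip(predicted, one_hot) if l > 0)


p = [0.1, 0.2, 0.7]
print(entropy(p))                          # about 0.802
print(cross_entropy(p, [0.0, 0.0, 1.0]))   # about 0.357, i.e. -log(0.7)
```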

Logistic Regression

Logistic Regression introduces an extra non-linearity over a linear classifier, $f(x)=wx+b$, by using a logistic (or sigmoid) function $\sigma()$. It can be defined as the graph below and the loss function is also derived:
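As a rough illustration, the sketch below applies the sigmoid to a scalar linear function and scores the prediction with the usual binary cross-entropy (log) loss. It assumes labels in {0, 1}, so the loss may differ in form from the one derived in the post. The parameter values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of the positive class for a scalar feature x."""
    return sigmoid(w * x + b)


def log_loss(y_true, p):
    """Binary cross-entropy for one example with label y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


w, b = 2.0, -1.0           # hypothetical parameters
p = predict(0.5, w, b)     # w * x + b = 0, so the sigmoid gives 0.5
print(p, log_loss(1, p))   # 0.5 and log(2), about 0.693
```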

Minimize the Loss

To minimize the loss function, we usually calculate its derivatives. To do so, we need to compute the loss over all data points. If that takes $x$ floating-point operations, computing the derivatives takes about $3x$. As a result, this approach does not scale well to big datasets. Instead, we can get around this and cheat by using Stochastic Gradient Descent (SGD), which scales well with data and model size.

The basic idea of Stochastic Gradient Descent is that we randomly pick a small number of data points, calculate the loss and derivatives on them as described above, and step in the opposite direction of the derivative. Then we repeat the process. Note that the direction we take might not be the best direction, and sometimes it is the wrong direction, so a step can increase the loss rather than decrease it. We compensate for this by repeating the process many times. Each step is a lot cheaper to compute, but we have to take many of them. Overall, it effectively decreases the loss in a relatively small amount of time.
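The loop described above can be sketched in a few lines of plain Python. The toy problem (fitting a line to noise-free data), the squared-error loss, and every constant below are assumptions chosen to keep the example small; they are not from the course material.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy data for y = 3x + 1, purely illustrative.
xs = [i / 50.0 for i in range(100)]
data = [(x, 3.0 * x + 1.0) for x in xs]


def mean_squared_error(w, b, points):
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


w, b = 0.0, 0.0
learning_rate = 0.05
batch_size = 8

initial_loss = mean_squared_error(w, b, data)
for step in range(2000):
    # Pick a small random subset of the data.
    batch = rng.sample(data, batch_size)
    # Gradient of the mean squared error on the batch only.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
    # Step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b, data)
print(initial_loss, final_loss, w, b)
```

Each step touches only 8 of the 100 points, yet after many cheap steps the parameters end up close to the true values of 3 and 1.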

Momentum and Learning Rate Decay

In SGD, each step moves a small distance in the opposite direction of the derivatives. Instead of following the current gradient directly, we can take advantage of the steps we have taken before by using a running average of the gradients. This is called Momentum. This technique works well and often leads to better convergence. Later iterations of SGD also benefit from taking smaller and smaller steps. We can do this with Learning Rate Decay. Applying exponential decay to the learning rate is one option, and there are many other ways to go about it.
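Both ideas can be sketched together on a one-dimensional problem. The function name `sgd_momentum`, the exponential-moving-average form of the running average, and all the constants are assumptions for this example. Other formulations of momentum and decay exist.

```python
def sgd_momentum(grad_fn, w0, lr0=0.1, momentum=0.9, decay_rate=0.96,
                 decay_steps=10, n_steps=200):
    w = w0
    velocity = 0.0
    for step in range(n_steps):
        # Exponential learning-rate decay: the step size shrinks over time.
        lr = lr0 * decay_rate ** (step / decay_steps)
        # Running (exponential moving) average of past gradients.
        velocity = momentum * velocity + (1.0 - momentum) * grad_fn(w)
        # Step against the averaged gradient instead of the raw one.
        w -= lr * velocity
    return w


# Minimize f(w) = (w - 4)^2, whose gradient is 2 * (w - 4).
w_final = sgd_momentum(lambda w: 2.0 * (w - 4.0), w0=0.0)
print(w_final)   # close to 4
```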

In SGD there are several hyper-parameters we need to tune, including the initial learning rate, learning rate decay, momentum, batch size, and weight initialization. With so many hyper-parameters, SGD can be hard to tune. We can use a modified version of SGD called AdaGrad, which implicitly handles momentum, the initial learning rate, and learning rate decay, so it is less sensitive to hyper-parameters.

Putting it together

Here, we are going to see an example of using logistic regression with gradient descent to train a model that identifies digits from the MNIST dataset.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from helper import batches

learning_rate = 0.001
n_input = 784 # mnist data shape 28*28
n_classes = 10 # 0-9 digits

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images
train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights and bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# logits = xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# define loss and optimizer
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

# calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))

batch_size = 128
init = tf.global_variables_initializer()

with tf.Session() as sess:
    # initialize all variables before training
    sess.run(init)
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})
    # calculate accuracy for test dataset
    test_accuracy = sess.run(accuracy, feed_dict={features: test_features, labels: test_labels})
print('Test Accuracy: {}'.format(test_accuracy))
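The batches function imported from helper is not shown in this post. A minimal sketch of what such a helper might look like, assuming it simply splits the features and labels into consecutive chunks of at most batch_size items, is:

```python
def batches(batch_size, features, labels):
    """Split features and labels into consecutive (features, labels) batches.

    Hypothetical stand-in for the helper used above; the last batch may be
    smaller than batch_size.
    """
    assert len(features) == len(labels)
    output = []
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        output.append((features[start:end], labels[start:end]))
    return output


example_features = [[i] for i in range(5)]
example_labels = [i % 2 for i in range(5)]
for batch_features, batch_labels in batches(2, example_features, example_labels):
    print(batch_features, batch_labels)
```

Because it relies only on len() and slicing, the same sketch works on Python lists and on NumPy arrays such as the MNIST features.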

Disclaimer: This post includes my personal reflections and notes on learning Deep Learning Nanodegree from Udacity. Some texts and images are from the learning materials for better educational purposes.
