KMeans
A simple analogy for k-means: at a popular tourist site, say the Great Wall, there are three different tourist groups, each led by a tour guide, and the tourists follow and move around their corresponding guide. That is a dynamic version of k-means; in reality, k-means is simpler because the data points (the tourists) don't move.
Suppose that we want to divide our input data into $k$ categories. We allocate $k$ cluster centers in our input space, and we would like to position these centers so that there is one in the middle of each cluster. However, we don't know where the clusters are, let alone where their middles are, so we need an algorithm that will find them. Learning algorithms generally try to minimize some sort of error, so we need an error criterion that describes this aim. Here, we measure the distance between each datapoint and its nearest cluster center and try to minimize the average of those distances. A natural way to do this is to compute the mean of the points in each cluster and put the cluster center there, which is equivalent to minimizing the squared Euclidean distance (the sum-of-squares error again) from each datapoint to its cluster center.
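In symbols, if $\boldsymbol{\mu}_j$ is the center of cluster $j$ and $c(i)$ is the index of the cluster that datapoint $\mathbf{x}_i$ is assigned to (notation I'm introducing here for clarity, not the book's), the quantity being minimized is
$$E = \sum_i \|\mathbf{x}_i - \boldsymbol{\mu}_{c(i)}\|^2.$$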
We can start by positioning the cluster centers randomly in the input space, since we don't know where to put them, and then update their positions according to the data. We decide which cluster each datapoint belongs to by computing the distance between that datapoint and all of the cluster centers and assigning it to the closest one. We then compute the mean of all the points assigned to each cluster and move the cluster center to that mean. We iterate the algorithm until the cluster centers stop moving. Here is the pseudocode:
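(A reconstruction of the steps described above.)

Initialization
    choose a value for k
    place the k cluster centers at random positions in the input space
Learning
    repeat until the cluster centers stop moving:
        for each datapoint:
            compute the distance to every cluster center
            assign the datapoint to the nearest cluster center
        for each cluster center:
            move the center to the mean of the datapoints assigned to it
Usage
    for each test point:
        compute the distance to every cluster center
        assign the point to the cluster with the nearest center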
The Python implementation is provided on my GitHub repo and was tested on the iris dataset:
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)
np.random.shuffle(X)
split = int(0.8 * X.shape[0])   # 80/20 train/test split
train_data = X[:split]
test_data = X[split:]
kmeans = KMeans(k=3)            # KMeans class from the repo linked above
kmeans.train(train_data, 100)
kmeans.predict(test_data)
K-Means Neural Network
It might seem a long way from k-means to neural networks, but it isn't. If we think of the cluster centers whose positions we optimize as locations in weight space, then we can position neurons at those locations and use neural network training. The computation in the k-means algorithm was that each input decided which cluster center it was closest to by calculating the distance to all of the centers. We can do this inside a neural network, too: the location of each neuron is its position in weight space, which is given by the values of its weights. So for each input, we simply make the activation of a node a measure of the distance between that node's position in weight space and the current input. Training is then just moving the position of the node, which means adjusting its weights.
We will use just one layer of neurons, together with some input nodes, and no bias node. The first layer consists of the inputs, which don't do any computation, as usual, and the second layer is a layer of competitive neurons, that is, neurons that compete to fire, with only one of them actually succeeding. Only one cluster center can represent a particular input vector, so we choose the neuron with the highest activation $h$ to be the one that fires. This is known as winner-takes-all activation, and it is an example of competitive learning.
We will choose $k$ neurons and fully connect the inputs to the neurons, as usual. We will use neurons with a linear transfer function, computing the activation of the neurons as simply the product of the weights and inputs:
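$$h_i = \sum_j w_{ij} x_j = \mathbf{w}_i \cdot \mathbf{x},$$
where $h_i$ is the activation of neuron $i$, $w_{ij}$ is the weight connecting input $j$ to neuron $i$, and $x_j$ is the $j$th component of the input vector.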
Provided that the inputs are normalized so that they all have the same length, this effectively measures the distance between the input vector and the cluster center represented by that neuron, with larger numbers (higher activations) meaning that the two points are closer together.
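To see why, suppose the weight vectors are also kept at unit length (an assumption I'm making for this illustration; the book treats weight normalization separately). Then
$$\|\mathbf{x} - \mathbf{w}_i\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{w}_i\|^2 - 2\,\mathbf{w}_i \cdot \mathbf{x} = 2 - 2h_i,$$
so the neuron with the highest activation $h_i$ is exactly the one whose weight vector is closest to the current input.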
The question is how we can then change the position of that neuron in weight space, that is, how we update its weights. With the inputs normalized, we can use the following weight update rule:
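$$\Delta w_{ij} = \eta\,(x_j - w_{ij}),$$
where $\eta$ is the learning rate,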
which has the effect of moving the weight $w_{ij}$ directly towards the current input. Remember that the only weights that we are updating are those of the winning unit.
For many of our supervised learning algorithms we minimized the sum-of-squares difference between the outputs and the targets. That was a global error criterion that affected all of the weights together. Here, we feed the datapoints in one at a time, and each input affects the weights independently (only the winning neuron is updated), so there is no global error criterion. This makes it very difficult to analyze the behavior of the algorithm, which is a general problem for competitive learning algorithms. However, they do tend to work well in practice.
Now that we have a weight update rule that works, we can consider the entire algorithm for the on-line k-means network:
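As a rough illustration, here is a minimal Python sketch of the on-line k-means network described above. This is my own reconstruction, not the implementation from the repo below; the class name, learning rate, and number of epochs are all assumptions.

import numpy as np

class OnlineKMeans:  # hypothetical name, not the repo's KMeansNetwork class
    def __init__(self, k, eta=0.1, n_epochs=100):  # eta and n_epochs are assumed values
        self.k = k
        self.eta = eta            # learning rate for the winner's weight update
        self.n_epochs = n_epochs

    def _normalize(self, data):
        # normalize each datapoint to unit length so that the dot product
        # with the weights measures closeness
        return data / np.linalg.norm(data, axis=1, keepdims=True)

    def train(self, data):
        data = self._normalize(data)
        # initialize the weights (cluster centers) to random values
        self.weights = np.random.rand(self.k, data.shape[1])
        for _ in range(self.n_epochs):
            for x in data:
                h = self.weights @ x              # activations of all neurons
                winner = np.argmax(h)             # winner-takes-all
                # move only the winning neuron towards the current input
                self.weights[winner] += self.eta * (x - self.weights[winner])
        return self

    def predict(self, data):
        data = self._normalize(data)
        # assign each point to the neuron with the highest activation
        return np.argmax(data @ self.weights.T, axis=1)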
The full Python implementation is provided on my GitHub repo and was tested on the iris dataset:
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)
np.random.shuffle(X)
split = int(0.8 * X.shape[0])   # 80/20 train/test split
train_data = X[:split]
test_data = X[split:]
kmeans = KMeansNetwork(k=3)     # KMeansNetwork class from the repo linked above
kmeans.train(train_data)
kmeans.predict(test_data)
Disclaimer: This post includes my personal reflections and notes on reading Machine Learning: An Algorithmic Perspective, 2nd Edition by Stephen Marsland. Some text and images are from the book and are used for educational purposes.