Lei Luo Machine Learning Engineer

# Support Vector Machines

2018-07-09

## Introdction

In SVMs, we are trying to find the maximum-margin classifier. The math to find the maximum margin is presented below: Assuming the two target class are labelled as -1, +1, then we can have $t_{i} \times y_{i}$. Since we want to classify correctly we would like the following: ## Kernels

For some non-linearly separable dataset, we might be able to linearly separate the data if we can use more dimensions, then we might be able to find a linear decision boundary that separates the classes. We can’t invent new data, and we need to transform them. Assume we transform using ∅(x) on the input features. Now our prediction becomes:

Think about a basis that consists of the polynomials of everything up to degree 2. It contains the constant value 1, each of the individual (scalar) input elements x1, x2, . . . , xd, has been transformed as:

The above equation as be written as $(1+x^Ty)^2$. The dot product here is in the original space and is obviously much better. The important thing is that we remove the problem of computing the dot products of all the extended basis vectors, which is expensive, with the computation of a kernel matrix (also known as the Gram matrix) K that is made from the dot product of the original vectors, which is only linear in cost. This is sometimes known as the kernel trick.

## Choosing Kernels

So how do we go about finding a suitable kernel? Any symmetric function that is positive definite (meaning that it enforces positivity on the integral of arbitrary functions) can be used as a kernel. This is a result of Mercer’s theorem, which also says that it is possible to convolve kernels together and the result will be another kernel. There are three different types of basis functions that are commonly used, and they have nice kernels that correspond to them:

• polynomials up to some degree s in the elements $x_{k}$ of the input vector (e.g., $x_{3}^3 \;\; or \;\; x_{1}\times x_{4}$) with kernels (s = 1 gives a linear kernel):
• sigmoid functions of the $x_{k}s$ with parameters $\mathit{k}$ and $\delta$, and the kernel:
• radial basis function expansions of the $x_{k}s$ with parameters $\delta$ and the kernel:

Choosing which kernel to use and the parameters in these kernels is a tricky problem. Most people just experiment with different values and find one that works, using a validation set.

## The SVM Algorithm

Quadratic programming solvers tend to be very complex (lots of the work is in identifying the active set), and we would be a long way off topic if we tried to write one. We will use cvxopt, which is a convex optimisation package that includes a wrapper for Python. There is a link to the relevant website on the book webpage. Cvxopt has a nice and clean interface so we can use this to do the computational heavy lifting for an implementation of the SVM. In essence, the approach is fairly simple: we choose a kernel and then for given data, assemble the relevant quadratic problem and its constraints as matrices, and then pass them to the solver, which finds the decision boundary and necessary support vectors for us. These are then used to build a classifier for that training data. The pseudo code is provided as: ## Implementation in Python

One of core problems SVMs need to solve is the quadratic problem:

Covert this to the dual problem, we need to solve(for the full derivation please refer to my origin post):

We will use cvxopt, which is a convex optimisation package that includes a wrapper for Python. the approach is fairly simple: we choose a kernel and then for given data, assemble the relevant quadratic problem and its constraints as matrices, and then pass them to the solver, which finds the decision boundary and necessary support vectors for us. These are then used to build a classifier for that training data.

The cvxopt quadratic program solver is cvxopt.solvers.qp(). This method takes the following inputs cvxopt.solvers.qp(P, q, G, h, A, b) and then solves:

where $x$ is the variable we are solving for, which is $\lambda$ for us. We need to multiply the objective function by -1, and set $P=t_{i}t_{j}K$ and $q$ is just a column vector containing −1s. $A=t$.
We also want to include two constraints $(0\leq\lambda_{i} \; \textrm{and} \; \lambda_{i} \leq C)$. To do this, we double up on the number of constraints, multiplying the ones where we want $\geq$ instead of $\leq$ by -1.

The full code was post on my github repo.

Here is a snippet of how to use it. I tested on the iris data. I only selected two features in order to show results on a 2-d image. The results are as follows:

# load data
# select 2 feature for show results on 2-d map
features = iris[:,:2]
target = iris[:,-1]
features.shape = (features.shape,features.shape)
target.shape = (target.shape,1)

# training and testing split
from sklearn.cross_validation import train_test_split
from sklearn import svm
X_train, X_test, y_train, y_test = train_test_split(features,
target,
test_size = 0.2,
random_state = 123)

# plot contours
def plot_contours(ax, clf, xx, yy, **params):
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
out = ax.contourf(xx, yy, Z, **params)
return out

# make meshgrid
def make_meshgrid(x, y, h=.05):
x_min, x_max = x.min() - 1, x.max() + 1
y_min, y_max = y.min() - 1, y.max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
return xx, yy

import matplotlib.pyplot as plt
# define three SVMs with different kernel and parameters
models = [SVM(kernel = 'rbf', C = 1),
SVM(kernel = 'polynomial', degree = 3, C = 1),
SVM(kernel = 'sigmoid', gamma = 0.7, C = 0)]

# training models
models = [clf.train(X_train,y_train) for clf in models]
titles = ('SVC with RBF kernel',
'SVC with polynomial kernel',
'SVC with sigmoid kernel')
fig, sub = plt.subplots(1, 3,figsize=(15,5))
X0, X1 = X_train[:, 0], X_train[:, 1]
xx, yy = make_meshgrid(X0, X1)

for clf, title in zip(models,titles):
output = clf.predict(X_test,soft=False)
err1 = np.where((output==1.) & (y_test==-1.))
err2 = np.where((output==-1.) & (y_test==1.))
print('*******************************')
print(title)
print("Class 1 errors ",len(err1)," from ",len(y_test[y_test==-1]))
print("Class 2 errors ",len(err2)," from ",len(y_test[y_test==1]))
print("Test accuracy ",1. -(float(len(err1)+len(err2)))/ (len(y_test[y_test==1]) + len(y_test[y_test==-1])))
print()

for clf, title, ax in zip(models, titles, sub.flatten()):
plot_contours(ax, clf, xx, yy,
cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train.ravel(), cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
plt.show() Disclaimer: This post includes my personal reflections and notes on reading Machine Learning: An Algorithmic Perspective., 2nd Edition by Stephen Marsland. Some texts and images are from the book for better educational purposes.

Previous post Multi-Layer Perceptron

Next post Decision Tree