Lei Luo Machine Learning Engineer

Ensemble Learning


The basic idea of ensemble learning is that by having lots of learners that each get slightly different results on a dataset – some learning certain things well and some learning others – and putting them together, the results that are generated will be significantly better than any one of them on its own.

Naturally we need to ask three questions: which learners should we use, how should we ensure that they learn different things, and how should we combine their results? In fact, we can use any classifier at all. Although in general they only use one type of classifier at a time, they do not have to. A common choice of classifier is the decision tree. Ensuring that the learners see different things can be performed in different ways, and it is the actually primary difference between the algorithms. Some use different subsets of the data, and some use different features from the data. When it comes to combining results one very simple way is to use majority voting.

I find the quote from Stephen Marsland amusing and educational at the same time, which goes “if majority voting is good enough for electing governments in elections, it’s good enough for machine learning.” Majority voting has the interesting property that for binary classification, the combined classifier will only get the answer wrong if more than half of the classifiers were wrong. Hopefully, this isn’t going to happen too often.


The principal algorithm of boosting is named AdaBoost, which stands for adaptive boosting and was first described in the mid-1990s by Freund and Shapiro. The innovation that AdaBoost uses is to give weights to each datapoint according to how difficult previous classifiers have found to get it correct. These weights are given to the classifier as part of the input when it is trained.

The AdaBoost algorithm is conceptually very simple. At each iteration a new classifier is trained on the training set, with the weights that are applied to the training set for each datapoint being modified at each iteration according to how successfully that datapoint has been classified in the past. The weights are initially all set to the same value, $1/N$, where $N$ is the number of datapoints in the training set. Then, at each iteration, the error $\epsilon $ is computed as the sum of the weights of the misclassified points, and the weights for incorrect examples are updated by being multiplied by $\alpha =\epsilon(1-\epsilon)$. Weights for correct examples are left alone, and then the whole set is normalised so that it sums to 1 (which is effectively a reduction in the importance of the correctly classified datapoints). Training terminates after a set number of iterations, or when either all of the datapoints are classified correctly, or one point contains more than half of the available weight. Before the pseudo code is given, we need to spend some time looking at the math behind it in order to better understand why the code is written like that. Assume we have chosen some weak learners $h_{t}$ with different weights $\alpha_{t}$ at some iteration $t$, so the training error is illustrate in the graph below:

where $m$ is the number of data points, $y_{i}$ is the true value for a data point, $z_i$ is the normalization factor to make sure the sum of weights of all points add up to 1. $\epsilon_{t}$ is the training error at iteration $t$.

If we follow the math above, the pseudo code is not difficult to formulate:

The implementation is hosted on my github repo.

The results of AdaBoost running two class classification on Hastie dataset is shown below:

X, y = make_hastie_10_2()
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

for i in range(10, 600, 10):    
    er_i = AdaBoostClassifier(n_estimators=i).train(X_train,Y_train, X_test, Y_test)
# Compare error rate vs number of iterations
plot_error_rate(er_train, er_test)


The simplest method of combining classifiers is known as bagging, which stands for bootstrap aggregating. A bootstrap sample is a sample taken from the original dataset with replacement, so that we may get some data several times and others not at all. The bootstrap sample is the same size as the original, and lots and lots of these samples are taken: B of them, where B is at least 50, and could even be in the thousands. One benefit of bagging is that we will get lots of learners that perform slightly differently, which is exactly what we want for an ensemble method.

The Python implementation of Bagging is hosted on my github repo.

The testing example was on Hastie dataset and the results are shown below:

X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

bagging = Bagging()

time_spend = []
accuracies = []
for i in np.arange(0.1, 1.0, 0.1):
    cur_time = bagging.train(X_train, y_train, percentage = i)
    accuracy = bagging.predict(X_test, y_test)[1]


Unlike bagging that uses the same size of sample data, it is the fairly obvious idea in subagging that we don’t need to produce samples that are the same size as the original data. If we make smaller datasets, then it makes sense to sample without replacement, but otherwise the implementation is only very slightly different from the bagging one.

Random Forest

The basic idea of random forests is that if one tree is good, then many trees (a forest) should be better, provided that there is enough variety between them. The most interesting thing about a random forest is the ways that it creates randomness from a standard dataset.

One method is to use bagging. If we wish to create a forest then we can make the trees different by training them on slightly different data, so we take bootstrap samples from the dataset for each tree. However, this isn’t enough randomness yet. The other obvious place where it is possible to add randomness is to limit the choices that the decision tree can make. At each node, a random subset of the features is given to the tree, and it can only pick from that subset rather than from the whole set.

Since we use fewer features to search at each stage, not only we create randomness, but also decrease computation. Surely, it does introduce a new parameter (how many features to consider), but the random forest does not seem to be very sensitive to this parameter; in practice, a subset size that is the square root of the number of features seems to be common. The effect of these two forms of randomness is to reduce the variance without affecting the bias.

Another benefit of this is that there is no need to prune the trees. There is another parameter that we don’t know how to choose yet, which is the number of trees to put into the forest. However, this is fairly easy to pick if we want optimal results: we can keep on building trees until the error stops decreasing.

Once the set of trees are trained, the output of the forest is the majority vote for classification, as with the other committee methods that we have seen, or the mean response for regression. And those are pretty much the main features needed for creating a random forest. The pseudo code is provided below:

The Python implementation of Random Forest is hosted on my github repo.

Sample code of using random forest to perform classification on the Car dataset is illustrated below:

data = pd.read_csv('car.csv')
feature_names = data.columns.values[0:-1].tolist()
X = data.iloc[:,0:-1].values
y = data.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

forest = RandomForest()
forest.train(X_train, y_train, feature_names)
prediction = forest.predict(X_test, y_test)[1]
print('Random Forest Accuracy: ', prediction)

The classification result is shown:

Comparison with Boosting

The most general thing is that boosting is exhaustive, in that it searches over the whole set of features at each stage, and each stage depends on the previous one. This means that boosting has to run sequentially, and the individual steps can be expensive to run.

Since the algorithm only searches a small subset of the data at each stage, it cannot be expected to be as good as boosting for the same number of trees. However, since the trees are cheaper to train, we can make more of them in the same computational time, and often the results are amazingly good even on very large and complicated datasets.

In fact, the most amazing thing about random forests is that they seem to deal very well with really big datasets and they seem to also produce good outputs based on surprisingly small parts of the problem space seen by each tree.

Ways to Combine Classifiers

Both boosting and bagging take a vote from amongst the classifiers, although they do it in different ways: boosting takes a weighted vote, while bagging simple takes the majority vote. For regression problems, rather than taking the majority vote, it is common to take the mean of the outputs. However, the mean is heavily affected by outliers, with the result that the median is a more common average to use.

Disclaimer: This post includes my personal reflections and notes on reading Machine Learning: An Algorithmic Perspective., 2nd Edition by Stephen Marsland. Some texts and images are from the book for better educational purposes.

Previous post Decision Tree

Next post KMeans