## Entropy

In 1948 Claude Shannon proposed the measure of information entropy in the paper “A Mathematical Theory of Communication.” Information entropy describes the amount of impurity in a set of features. The entropy $H$ of a set of probabilities $p_{i}$ is:

where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define $0log0=0$. A graph of the entropy is given in the graph below:

If all of the examples are positive, then we don’t get any extra information from knowing the value of the feature for any particular example, since whatever the value of the feature, the example will be positive. Thus, the entropy of that feature is 0. However, if the feature separates the examples into 50% positive and 50% negative, then the amount of entropy is at a maximum, and knowing about that feature is very useful to us. The basic concept is that it tells us how much extra information we would get from knowing the value of that feature.

For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one with the highest entropy. After using that feature, we re-evaluate the entropy of each feature and again pick the one with the highest entropy.

## ID3

The important idea is to work out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is known as the information gain, and it is defined as:

where $S$ is the set of examples, $F$ is a possible feature out of the set of all possible ones, and $S_{f}$ is a count of the number of members of S that have value $f$ for feature $F$.

The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value. Then the best feature is selected and then removed from the dataset. It searches the space of possible trees in a greedy way by choosing the feature with the highest information gain at each stage. The pseudo code for ID3 is presented:

It uses a method known as the inductive bias. The choice of the next feature to add into the tree is the one with the highest information gain, which biases the algorithm towards smaller trees, since it tries to minimise the amount of information that is left. This is consistent with a well-known principle that short solutions are usually better than longer ones (not necessarily true, but simpler explanations are usually easier to remember and understand). The principle of ‘Occam’s Razor’ can also be thought as KISS (Keep It Simple, Stupid). In fact, there is a sound information theoretic way to write down this principle. It is known as the Minimum Description Length (MDL) and was proposed by Rissanen in 1989. In essence it says that the shortest description of something, i.e., the most compressed one, is the best description.

Another benefit of decision trees is that they can deal with missing data. we can skip that node of the tree and carry on without it, summing over all the possible values that that feature could have taken. When it comes to dealing with missing data, we can impute any missing values, either by identifying similar data points and using their value or by using the mean or median of the data values for that feature. This assumes that the data that is missing is randomly distributed within the dataset, not missing because of some unknown process.

Saying that ID3 is biased towards short trees is only partly true. The algorithm uses all of the features that are given to it, even if some of them are not necessary. This obviously runs the risk of overfitting, indeed it makes it very likely. There are a few things that you can do to avoid overfitting, the simplest one being to limit the size of the tree. You can also use a variant of early stopping by using a validation set and measuring the performance of the tree so far against it. However, the approach that is used in more advanced algorithms (most notably C4.5, which Quinlan invented to improve on ID3) is pruning.

C4.5 uses a different method called rule post-pruning. This consists of taking the tree generated by ID3, converting it to a set of if-then rules, and then pruning each rule by removing preconditions if the accuracy of the rule increases without it. The rules are then sorted according to their accuracy on the training set and applied in order. The advantages of dealing with rules are that they are easier to read and their order in the tree does not matter, just their accuracy in the classification.

## Classification and Regression Trees (CART)

The only real difference with classification in CART is that a different information measure is commonly used. It uses gini impurity instead of information gain. The “impurity” in the name suggests that the aim of the decision tree is to have each leaf node represent a set of data points that are in the same class, so that there are no mismatches. For feature $k$, it is computed as:

where $c$ is the number of classes, $N(i)$ is the the fraction of the number of data points that belong to a class $i$. If the node is pure, it should be 0 for all except one value of i. The above equation is equivalent to:

The information measure can be changed in another way, which is to add a weight to the misclassifications. The idea is to consider the cost of misclassifying an instance of class $i$ as class $j$ and add a weight that says how important each datapoint is. It is typically labelled as $ij$ and is presented as a matrix, with element $ij$ representing the cost of misclassifying $i$ as $j$. Using it is simple, modifying the Gini impurity equation to be:

## Regression in Trees

Suppose that the outputs are continuous, so that a regression model is appropriate. None of the node impurity measures that we have considered so far will work. Instead, we’ll go back to our old favourite the sum-of-squares error. To evaluate the choice of which feature to use next, we also need to find the value at which to split the dataset according to that feature.

The output is a value at each leaf, which is just a constant value computed as the mean average of all the data points that are situated in that leaf. This is the optimal choice in order to minimise the sum-of-squares error, but it also means that we can choose the split point quickly for a given feature, by choosing it to minimise the sum-of-squares error. We can then pick the feature that has the split point that provides the best sum-of-squares error, and continue to use the algorithm as for classification.

The full implementation of ID3 algorithm using Python is provided on my github repo. The example dataset came from Ilya Grigorik in his github repo (https://github.com/igrigorik/decisiontree).

```
import json
import pandas as pd
# bring in the data
data = pd.read_csv('discrete-training.csv',encoding='utf-8-sig')
train_X = data.drop(labels=['will_buy?'], axis=1).values.tolist()
feature_names = ['Age','Education','Income','Marital Status']
train_y = data['will_buy?'].values.tolist()
# define a tree and start training
dtree = DTree()
tree = dtree.train(X, y,feature_names=feature_names)
# print the trained tree
print()
print('The tree is: ',json.dumps(tree, indent=1))
print()
# define testing data
test_data = pd.read_csv('discrete-test.csv',encoding='utf-8-sig')
test_X = test_data.drop(labels=['will_buy?'], axis=1).values.tolist()
test_y = test_data['will_buy?'].values.tolist()
# compute accuracy
total_correct = 0
for i in range(len(test_X)):
prediction = dtree.predict(tree,test_X[i])
if test_y[i] == prediction:
total_correct += 1
print('test accuracy is:',total_correct/len(test_X))
```

**Disclaimer:** This post includes my personal reflections and notes on reading Machine Learning: An Algorithmic Perspective., 2nd Edition by Stephen Marsland. Some texts and images are from the book for better educational purposes.