Decision Trees

The unreasonable power of nested decision rules.

By Jared Wilber & Lucía Santamaría



Let's Build a Decision Tree

Let's pretend we're farmers with a new plot of land. Given only the Diameter and Height of a tree trunk, we must determine if it's an Apple, Cherry, or Oak tree. To do this, we'll use a Decision Tree.

Start Splitting

Almost every tree with a Diameter ≥ 0.45 is an Oak tree! Thus, we can probably assume that any other trees we find in that region will also be one.

This first decision node will act as our root node. We'll draw a vertical line at this Diameter and classify everything above it as Oak (our first leaf node), and continue to partition our remaining data on the left.

Split Some More

We continue along, hoping to split our plot of land in the most favorable manner. We see that creating a new decision node at Height ≤ 4.88 leads to a nice section of Cherry trees, so we partition our data there.

Our Decision Tree updates accordingly, adding a new leaf node for Cherry.

And Some More

After this second split we're left with an area containing many Apple and some Cherry trees. No problem: a vertical division can be drawn to separate the Apple trees a bit better.

Once again, our Decision Tree updates accordingly.

And Yet Some More

The remaining region just needs a further horizontal division and boom - our job is done! We've obtained an optimal set of nested decisions.

That said, some regions still enclose a few misclassified points. Should we continue splitting, partitioning into smaller sections?

Hmm...

Don't Go Too Deep!

If we do, the resulting regions would start becoming increasingly complex, and our tree would become unreasonably deep. Such a Decision Tree would learn too much from the noise of the training examples and not enough generalizable rules.

Does this ring familiar? It is the well known tradeoff that we have explored in our explainer on The Bias Variance Tradeoff ! In this case, going too deep results in a tree that overfits our data, so we'll stop here.

We're done! We can simply pass any new data point's Height and Diameter values through the newly created Decision Tree to classify them as either an Apple, Cherry, or Oak tree!

Where To Partition?

We saw how a Decision Tree works at a basic level: it starts from the top and makes rules to split the data into clear groups for sorting. But with so many ways to split, how does the computer choose the best spot? To answer that, we need to learn about Entropy.

Entropy is a way to measure how mixed up or uncertain a group of things is. We'll use it to find areas with mostly similar items (pure) or a mix of different ones (impure).

For a set of things that happen with certain chances , the total entropy is calculated as the negative sum of those chances multiplied by their weights:

The quantity has a number of interesting properties:

Entropy Properties

  1. is zero only if all but one of the are zero, and that one is 1. This happens when there's no uncertainty, meaning everything is predictable.
  2. is highest when all the are equal. This is the most uncertain or mixed situation.
  3. Making the chances more equal increases .

We can use entropy to measure how mixed up a group of data points is: a group with many different types is impure, while one with just one type is pure.

Above, you can calculate the entropy of a group of data points from two categories, which is common in yes/no problems. Click on the Add and Remove buttons to change the mix in the bubble.

Did you notice that pure groups have zero entropy, while mixed ones have higher values? Entropy helps us see how pure or mixed a group is. We'll use this to teach Decision Trees by creating Information Gain.

Information Gain

With the intuition gained with the above animation, we can now describe the logic to train Decision Trees. As the name implies, information gain measures an amount the information that we gain. It does so using entropy. The idea is to subtract from the entropy of our data before the split the entropy of each possible partition thereafter. We then select the split that yields the largest reduction in entropy, or equivalently, the largest increase in information.

The core algorithm to calculate information gain is called ID3. It's a recursive procedure that starts from the root node of the tree and iterates top-down on all non-leaf branches in a greedy manner, calculating at each depth the difference in entropy:

To be specific, the algorithm's steps are as follows:

ID3 Algorithm Steps

  1. Calculate the entropy associated to every feature of the data set.
  2. Partition the data set into subsets using different features and cutoff values. For each, compute the information gain as the difference in entropy before and after the split using the formula above. For the total entropy of all children nodes after the split, use the weighted average taking into account , i.e. how many of the samples end up on each child branch.
  3. Identify the partition that leads to the maximum information gain. Create a decision node on that feature and split value.
  4. When no further splits can be done on a subset, create a leaf node and label it with the most common class of the data points within it if doing classification or with the average value if doing regression.
  5. Recurse on all subsets. Recursion stops if after a split all elements in a child node are of the same type. Additional stopping conditions may be imposed, such as requiring a minimum number of samples per leaf to continue splitting, or finishing when the trained tree has reached a given maximum depth.

Of course, reading the steps of an algorithm isn't always the most intuitive thing. To make things easier to understand, let's revisit how information gain was used to determine the first decision node in our tree.

Recall our first decision node split on Diameter ≤ 0.45. How did we choose this condition? It was the result of maximizing information gain.

Each of the possible splits of the data on its two features (Diameter and Height) and cutoff values yields a different value of the information gain.

The line chart displays the different split values for the Diameter feature. Move the decision boundary yourself to see how the data points in the top chart are assigned to the left or right children nodes accordingly. On the bottom you can see the corresponding entropy values of both children nodes as well as the total information gain.

The ID3 algorithm will select the split point with the largest information gain, shown as the peak of the black line in the bottom chart of 0.574 at Diameter = 0.45.

Recall our first decision node split on Diameter ≤ 0.45. How did we choose this condition? It was the result of maximizing information gain.

Each of the possible splits of the data on its two features (Diameter and Height) and cutoff values yields a different value of the information gain.

The visualization on the right allows to try different split values for the Diameter feature. Move the decision boundary yourself to see how the data points in the top chart are assigned to the left or right children nodes accordingly. On the bottom you can see the corresponding entropy values of both children nodes as well as the total information gain.

The ID3 algorithm will select the split point with the largest information gain, shown as the peak of the black line in the bottom chart of 0.574 at Diameter = 0.45.

A Note On Information Measures

An alternative to the entropy for the construction of Decision Trees is the Gini impurity. This quantity is also a measure of information and can be seen as a variation of Shannon's entropy. Decision trees trained using entropy or Gini impurity are comparable, and only in a few cases do results differ considerably. In the case of imbalanced data sets, entropy might be more prudent. Yet Gini might train faster as it does not make use of logarithms.

Another Look At Our Decision Tree

Let's recap what we've learned so far. First, we saw how a Decision Tree classifies data by repeatedly partitioning the feature space into regions according to some conditional series of rules. Second, we learned about entropy, a popular metric used to measure the purity (or lack thereof) of a given sample of data. Third, we learned how Decision Trees use entropy in information gain and the ID3 algorithm to determine the exact conditional series of rules to select. Taken together, the three sections detail the typical Decision Tree algorithm.

To reinforce concepts, let's look at our Decision Tree from a slightly different perspective.

The tree below maps exactly to the tree we showed in How to Build a Decision Tree section above. However, instead of showing the partitioned feature space alongside our trees structure, let's look at the partitioned data points and their corresponding entropy at each node itself:


From the top down, our sample of data points to classify shrinks as it gets partitioned to different decision and leaf nodes. In this manner, we could trace the full path taken by a training data point if we so desired. Note also that not every leaf node is pure: as discussed previously (and in the next section), we don't want the structure of our Decision Trees to be too deep, as such a model likely won't generalize well to unseen data.

The Problem of Pertubations

Without question, Decision Trees have a lot of things going for them. They're simple models that are easy to interpret. They're fast to train and require minimal data preprocessing. And they hand outliers with ease. Yet they suffer from a major limitation, and that is their instability compared with other predictors. They can be extremely sensitive to small perturbations in the data: a minor change in the training examples can result in a drastic change in the structure of the Decision Tree.

Check for yourself how small random Gaussian perturbations on just 5% of the training examples create a set of completely different Decision Trees:



Why Is This A Problem?

In their vanilla form, Decision Trees are unstable.

If left unchecked, the ID3 algorithm to train Decision Trees will work endlessly to minimize entropy. It will continue splitting the data until all leaf nodes are completely pure - that is, consisting of only one class. Such a process may yield very deep and complex Decision Trees. In addition, we just saw that Decision Trees are subject to high variance when exposed to small perturbations of the training data.

Both issues are undesirable, as they lead to predictors that fail to clearly distinguish between persistent and random patterns in the data, a problem known as overfitting. This is problematic because it means that our model won't perform well when exposed to new data.

There are ways to prevent excessive growth of Decision Trees by pruning them, for instance constraining their maximum depth, limiting the number of leaves that can be created, or setting a minimum size for the amount of items in each leaf and not allowing leaves with too few items in them.

As for the issue of high variance? Well, unfortunately it's an intrinsic characteristic when training a single Decision Tree.

The Need to Go Beyond Decision Trees

Perhaps ironically, one way to alleviate the instability induced by perturbations is to introduce an extra layer of randomness in the training process. In practice this can be achieved by creating collections of Decision Trees trained on slightly different versions of the data set, the combined predictions of which do not suffer so heavily from high variance. This approach opens the door to one of the most successful Machine Learning algorithms thus far: random forests.

Stay tuned for our future article!

The End

Thanks for reading! We hope that the article is insightful no matter where you are along your Machine Learning journey, and that you came away with a better understanding of the Decision Tree algorithm.

To make things compact, we skipped over some relevant topics, such as using Decision Trees for regression, end-cut preference in tree models, and other tree-specific hyperparameters. Check out the resources listed below to learn more about those topics.

To learn more about Machine Learning, check out our self-paced courses, our YouTube videos, and the Dive into Deep Learning textbook. If you have any comments or ideas related to MLU-Explain articles, feel free to reach out directly to Jared or Lucía. The code for this article is available here.

A special thanks goes out to Brent Werness for valuable contributions to this article.






References + Open Source

This article is a product of the following resources + the awesome people who made (& contributed to) them:

A Mathematical Theory Of Communication
(Claude E. Shannon, 1948).

Induction of decision trees
(John Ross Quinlan, 1986).

A Study on End-Cut Preference in Least Squares Regression Trees
(Luis Torgo, 2001).

The Origins Of The Gini Index: Extracts From Variabilità e Mutabilità (Corrado Gini, 1912)
(Lidia Ceriani & Paolo Verne, 2012).

D3.js
(Mike Bostock & Philippe Rivière)

d3-annotation
(Susie Lu)

KaTeX
(Emily Eisenberg & Sophie Alpert)