MLU-explAIn

Decision Trees

The unreasonable power of nested decision rules.

Let's Build a Decision Tree

Let's pretend we're farmers with a new plot of land. Given only the Diameter and Height of a tree trunk, we must determine if it's an Apple, Cherry, or Oak tree. To do this, we'll use a Decision Tree.

Start Splitting

Almost every tree with a Diameter ≥ 0.45 is an Oak tree! Thus, we can probably assume that any other trees we find in that region will also be one.

This first decision node will act as our root node. We'll draw a vertical line at this Diameter and classify everything to the right of it (Diameter ≥ 0.45) as Oak (our first leaf node), and continue to partition our remaining data on the left.

Split Some More

We continue along, hoping to split our plot of land in the most favorable manner. We see that creating a new decision node at Height ≤ 4.88 leads to a nice section of Cherry trees, so we partition our data there.

Our Decision Tree updates accordingly, adding a new leaf node for Cherry.
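The tree at this stage can be read as a pair of nested rules. Below is a minimal Python sketch of just these first two splits, using the thresholds quoted above. The function name `classify_partial` and the `"undecided"` label are illustrative assumptions, standing in for the region that has not been split yet.

```python
def classify_partial(diameter: float, height: float) -> str:
    """Apply only the first two splits described in the text."""
    # Root node: almost every tree with Diameter >= 0.45 is an Oak.
    if diameter >= 0.45:
        return "Oak"
    # Second decision node: among the remaining trees, Height <= 4.88 is Cherry.
    if height <= 4.88:
        return "Cherry"
    # Hypothetical placeholder: this region still mixes Apple and Cherry trees
    # and is handled by the later splits.
    return "undecided"
```

Each `if` corresponds to one decision node, and each `return` of a species name corresponds to a leaf node.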

And Some More

After this second split we're left with an area containing many Apple and some Cherry trees. No problem: a vertical division can be drawn to separate the Apple trees a bit better.

Once again, our Decision Tree updates accordingly.

And Yet Some More

The remaining region just needs a further horizontal division and boom - our job is done! We've obtained an optimal set of nested decisions.

That said, some regions still enclose a few misclassified points. Should we continue splitting, partitioning into smaller sections?

Hmm...

Don't Go Too Deep!

If we do, the resulting regions would start becoming increasingly complex, and our tree would become unreasonably deep. Such a Decision Tree would learn too much from the noise of the training examples and not enough generalizable rules.

Does this sound familiar? It is the well-known tradeoff that we explored in our explainer on The Bias Variance Tradeoff! In this case, going too deep results in a tree that overfits our data, so we'll stop here.

We're done! We can simply pass any new data point's Height and Diameter values through the newly created Decision Tree to classify them as either an Apple, Cherry, or Oak tree!

Where To Partition?

We saw how a Decision Tree works at a basic level: it starts from the top and makes rules to split the data into clear groups for sorting. But with so many ways to split, how does the computer choose the best spot? To answer that, we need to learn about Entropy.

Entropy is a way to measure how mixed up or uncertain a group of things is. We'll use it to find areas with mostly similar items (pure) or a mix of different ones (impure).

For a set of events that occur with probabilities $p_1, \ldots, p_n$, the total entropy $H$ is the negative sum of each probability multiplied by its logarithm:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$
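
As a quick worked example, assume a base-2 logarithm (so entropy is measured in bits) and a group with two classes. An evenly mixed group with probabilities $(0.5, 0.5)$ gives

$$H = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1,$$

while a mostly pure group with probabilities $(0.9, 0.1)$ gives

$$H = -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right) \approx 0.469.$$

The more one class dominates, the lower the entropy.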

The entropy $H$ has a number of interesting properties:

Entropy Properties

The entropy $H$ is zero if and only if all but one of the probabilities $p_i$ are zero, and that one is 1. This happens when there's no uncertainty, meaning everything is predictable.
$H$ is highest when all the $p_i$ are equal. This is the most uncertain or mixed situation.
Any change that makes the probabilities more equal increases $H$.
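
These properties can be checked numerically. The sketch below assumes a base-2 logarithm and treats terms with zero probability as contributing nothing. The function name `entropy` and the example distributions are illustrative choices, not values from the article.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Skip p == 0: p * log2(p) tends to 0 as p approaches 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

certain = entropy([1.0, 0.0, 0.0])      # one outcome is guaranteed
skewed = entropy([0.8, 0.1, 0.1])       # one outcome dominates
more_even = entropy([0.6, 0.2, 0.2])    # probabilities closer to equal
uniform = entropy([1/3, 1/3, 1/3])      # all outcomes equally likely

print(certain, skewed, more_even, uniform)
```

The four values increase from left to right: zero for the certain case, and the largest value for the uniform case.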

We can use entropy to measure how mixed up a group of data points is: a group with many different types is impure, while one with just one type is pure.
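
To apply this to data, class probabilities can be estimated from the label counts in a group. The following is a minimal sketch under that assumption. `group_entropy` and the two example groups are hypothetical and not taken from the article's dataset.

```python
import math
from collections import Counter

def group_entropy(labels):
    """Entropy, in bits, of a group, using label frequencies as probabilities."""
    total = len(labels)
    counts = Counter(labels)
    # Every count is at least 1, so each fraction is strictly positive.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

pure_group = ["Oak"] * 6                        # a single class
mixed_group = ["Apple"] * 3 + ["Cherry"] * 3    # two classes, evenly mixed

print(group_entropy(pure_group))
print(group_entropy(mixed_group))
```

The pure group has no uncertainty, while the evenly mixed two-class group is as uncertain as a two-class group can be.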

[Interactive figure: an entropy calculator for a group of data points drawn from two classes, the common case in binary (yes/no) problems. Add and Remove buttons change the class mix in the group.]

Did you notice that pure groups have zero entropy, while mixed ones have higher values? Entropy lets us quantify how pure or mixed a group is. We'll use it to train Decision Trees through a quantity called Information Gain.
