W4 - Decision trees

Explore decision trees, from the basics of their structure and learning process to advanced concepts like tree ensembles, random forests, and XGBoost.

This week, you’ll learn about a practical and very commonly used learning algorithm: the decision tree. You’ll also learn about variations of the decision tree, including random forests and boosted trees (XGBoost).

Learning Objectives

  • See what a decision tree looks like and how it can be used to make predictions
  • Learn how a decision tree learns from training data
  • Learn the “impurity” metric “entropy” and how it’s used when building a decision tree
  • Learn how to use multiple trees, “tree ensembles” such as random forests and boosted trees
  • Learn when to use decision trees or neural networks

Decision trees

Decision tree model

Here’s an example of a model that you might get after training a decision tree learning algorithm on the data set that you just saw.

Terminology:

  • Every one of these ovals or rectangles is called a node in the tree.
  • The top node is called the root node
  • Oval nodes are called decision nodes (because, based on the features, they decide whether to go right or left down the tree)
  • Rectangles at the bottom are called leaf nodes, and they make predictions
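
As a minimal sketch (not from the course), here is what training such a model could look like with scikit-learn on a toy cat/not-cat data set; the feature encoding and values are illustrative assumptions:

```python
# Minimal sketch (assumed encoding): each animal is described by three binary features
# [ear_shape (1 = pointy, 0 = floppy), face_shape (1 = round, 0 = not), whiskers (1 = present, 0 = absent)]
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
    [1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0],
]
y = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]  # 1 = cat, 0 = not cat (five of each)

model = DecisionTreeClassifier(criterion="entropy", max_depth=2)
model.fit(X, y)

# The printed tree shows the root node, the decision nodes, and the leaf nodes.
print(export_text(model, feature_names=["ear_shape", "face_shape", "whiskers"]))
print(model.predict([[1, 1, 0]]))  # prediction for a new animal
```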

Among these different decision trees, some will do better and some will do worse on the training set or on the cross-validation and test sets. The job of the decision tree learning algorithm is, out of all possible decision trees, to pick one that hopefully does well on the training set and also generalizes well to new data, such as your cross-validation and test sets.

Learning Process

A node is completely pure when there is no longer a mix of cats and dogs.

At the root node, as well as on the left branch and the right branch of the decision tree, we had to decide which feature to split on when the examples at that node are a mix of cats and dogs. That decision should maximize purity: by purity, I mean you want to end up with subsets that are as close as possible to all cats or all dogs.

If you can get to highly pure subsets of examples, then you can predict either cat or not cat and get it mostly right.

If you have decided on a maximum depth for the decision tree, then you would not split any nodes below that level, for two reasons:

  • make sure the tree doesn’t get too big and unwieldy
  • it makes it less prone to overfitting

We also stop splitting if the improvement in purity is below a threshold, or if the number of examples in a node is below a threshold.

It may seem complicated, but these different pieces fit together into a very effective learning algorithm. Open-source packages handle most of the complicated procedure of making all these decisions, such as deciding when to stop splitting.
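
For example, in scikit-learn these stopping criteria appear directly as hyperparameters of the tree learner (the values below are arbitrary illustrations, not recommendations from the course):

```python
# Sketch: the stopping criteria above, expressed as scikit-learn hyperparameters.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure used to choose splits
    max_depth=3,                 # stop splitting once the tree reaches a maximum depth
    min_impurity_decrease=0.01,  # stop if the purity improvement is below a threshold
    min_samples_split=5,         # stop if the number of examples in a node is below a threshold
)
```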

Decision trees learning

Measuring purity

Entropy is a measure of the impurity of a set of data

Note that :

  • the entropy function is defined with $\log_2$ (base 2)
  • $\log(0)$ is technically undefined (it tends to negative infinity), but by convention, for the purposes of computing entropy, we take $0 \log(0)$ to be $0$
  • the definition of entropy looks a lot like the definition of the logistic loss (the mathematical rationale is not covered here)
  • entropy could be replaced by the Gini criterion
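
For a node where a fraction $p_1$ of the examples are cats, the entropy is $H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$. A small sketch of computing it, using the convention above:

```python
import numpy as np

def entropy(p1):
    """Binary entropy H(p1), where p1 is the fraction of positive (cat) examples in the node."""
    # Convention: 0 * log2(0) = 0, so a completely pure node has entropy 0.
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.5))  # 1.0 -> most impure (a 50/50 mix of cats and dogs)
print(entropy(1.0))  # 0.0 -> completely pure (all cats)
```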

Choosing a split: Information Gain

At the root node, we start off with all 10 examples (five cats and five dogs), so $p_1 = 5/10$. The quantity we compute to choose a split is called information gain, and it measures the reduction in entropy achieved by that split.

A more formal definition is:
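
With $w^{\text{left}}$ and $w^{\text{right}}$ denoting the fractions of a node’s examples that go to the left and right branches, the information gain of a split is

$$\text{Information gain} = H\big(p_1^{\text{root}}\big) - \Big(w^{\text{left}}\, H\big(p_1^{\text{left}}\big) + w^{\text{right}}\, H\big(p_1^{\text{right}}\big)\Big)$$

where $p_1^{\text{left}}$ and $p_1^{\text{right}}$ are the fractions of positive (cat) examples in the left and right branches.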

Putting it together

The way we built the right subtree was by, again, building a decision tree on a subset of five examples. In computer science, this is an example of a recursive algorithm.
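
A compact illustrative sketch of that recursion for binary 0/1 features (not the course’s reference implementation):

```python
# Illustrative sketch of the recursive splitting procedure, for binary 0/1 features and labels.
import numpy as np

def entropy(p1):
    # Convention: 0 * log2(0) = 0
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(X, y, feature):
    left, right = y[X[:, feature] == 1], y[X[:, feature] == 0]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    w_left, w_right = len(left) / len(y), len(right) / len(y)
    return entropy(y.mean()) - (w_left * entropy(left.mean()) + w_right * entropy(right.mean()))

def build_tree(X, y, depth=0, max_depth=2):
    # Stop splitting if the node is pure or the maximum depth is reached.
    if entropy(y.mean()) == 0 or depth == max_depth:
        return {"leaf": int(round(y.mean()))}  # predict the majority class
    gains = [information_gain(X, y, f) for f in range(X.shape[1])]
    best = int(np.argmax(gains))
    if gains[best] == 0:
        return {"leaf": int(round(y.mean()))}
    mask = X[:, best] == 1
    return {
        "feature": best,
        # Recursion: build a decision tree on each subset of the examples.
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(build_tree(X, y))  # {'feature': 0, 'left': {'leaf': 1}, 'right': {'leaf': 0}}
```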

Using one-hot encoding of categorical features

When a feature has three possible values instead of just two, you could create three subsets of the data and end up building three sub-branches for this tree.

But there is a different way of addressing features that can take on more than two values, which is to use one-hot encoding.

In our example, the ear shape feature, which can take three possible values, is replaced by three binary (0/1) features.

The idea of using one-hot encoding for categorical features also works for training neural networks. In particular, if you take the face shape feature and replace round and not round with 1 and 0 (and encode the other features the same way), the resulting list of numeric features can also be fed to a neural network.
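
A small sketch of the encoding step with pandas (the ear-shape category names are assumed for illustration):

```python
# Sketch: one-hot encode a categorical feature with three possible values,
# turning it into three binary 0/1 features instead of three sub-branches.
import pandas as pd

df = pd.DataFrame({"ear_shape": ["pointy", "floppy", "oval", "pointy"]})
encoded = pd.get_dummies(df, columns=["ear_shape"], dtype=int)
print(encoded)  # columns ear_shape_floppy, ear_shape_oval, ear_shape_pointy; exactly one 1 per row
```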

Continuous valued features

We calculate the information gain for different threshold values and split on the threshold that gives the highest gain.
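
A sketch of that procedure on a made-up weight feature (the data values and candidate thresholds are illustrative, not the course’s numbers):

```python
# Sketch: pick a threshold on a continuous feature (e.g. weight in lbs) by computing
# the information gain for several candidate thresholds. All numbers are made up.
import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

weight = np.array([7.2, 8.8, 15.0, 9.2, 8.4, 7.6, 11.0, 10.2, 18.0, 20.0])
is_cat = np.array([1, 1, 0, 1, 1, 1, 0, 0, 0, 0])

for threshold in (8, 9, 13):
    left, right = is_cat[weight <= threshold], is_cat[weight > threshold]
    w_left, w_right = len(left) / len(is_cat), len(right) / len(is_cat)
    gain = entropy(is_cat.mean()) - (w_left * entropy(left.mean()) + w_right * entropy(right.mean()))
    print(f"threshold {threshold}: information gain = {gain:.3f}")
```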

Regression Trees (optional)

Tree ensembles

Using multiple decision trees

One of the weaknesses of using a single decision tree is that it can be highly sensitive to small changes in the data: changing just one training example can cause the algorithm to come up with a different split at the root, and therefore a totally different tree, which makes the algorithm not very robust.

The final prediction is the majority vote of the predictions of these trees.
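
As a trivial sketch of the voting step (the predictions below are made up):

```python
# Sketch: the ensemble prediction is the majority vote of the individual trees' predictions.
import numpy as np

tree_predictions = np.array([1, 0, 1])              # made-up predictions from 3 trees for one example
majority_vote = int(np.round(tree_predictions.mean()))
print(majority_vote)                                # 1 -> two of the three trees predict "cat"
```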

Sampling with replacement

The term replacement means that after we choose one token from the bag, we put that token back in the bag before drawing the next one.

We build a training set using sampling with replacement

Sampling with replacement lets you construct a new training set that is a little bit similar to, but also pretty different from, your original training set. This is the key building block for building an ensemble of trees.
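
A sketch of the resampling step with NumPy (the array names X and y are placeholders):

```python
# Sketch: build a new training set of the same size m by sampling indices with replacement.
import numpy as np

rng = np.random.default_rng(seed=0)
m = 10                                         # number of training examples
indices = rng.choice(m, size=m, replace=True)  # duplicates are expected; some examples are left out
print(indices)
# X_boot, y_boot = X[indices], y[indices]      # the resampled training set (assuming NumPy arrays X, y)
```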

Random forest algorithm

A typical choice of $B$, the number of trees you build, might be around 100. Having built an ensemble of $B$ different trees, all of these trees vote on the final prediction.

Even so, with sampling with replacement alone it’s not uncommon that for many or even all $B$ training sets you end up with the same choice of feature at the root node and at a few of the nodes near the root. For that reason, the random forest algorithm also randomizes the feature choice: at each node, only a random subset of $k$ of the $n$ features (typically $k \approx \sqrt{n}$) is considered when picking the split. The resulting random forest algorithm is much more robust than a single decision tree.
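
A sketch with scikit-learn following these typical choices (X_train, y_train, and X_new are placeholders, not defined here):

```python
# Sketch: a random forest with B = 100 trees; at each split only a random
# subset of roughly sqrt(n) features is considered, which de-correlates the trees.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
# forest.fit(X_train, y_train)          # X_train / y_train are placeholders
# prediction = forest.predict(X_new)    # majority vote of the 100 trees
```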

XGBoost

One of the innovations in XGBoost is that it also has built-in regularization to prevent overfitting.

XGBoost is often a highly competitive algorithm. In fact, XGBoost and deep learning algorithms seem to be the two types of algorithms that win a lot of machine learning competitions (such as Kaggle competitions).

The details of XGBoost are quite complex to implement, which is why many practitioners will use the open source libraries that implement XGBoost.
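
A minimal sketch with the open-source xgboost Python package (hyperparameter values are illustrative; X_train, y_train, and X_new are placeholders):

```python
# Sketch: a boosted-tree classifier with the open-source xgboost library.
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
# model.fit(X_train, y_train)
# prediction = model.predict(X_new)
# For regression problems, XGBRegressor is the analogous class.
```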

When to use decision trees

Acknowledgements
