Classification And Regression Trees (CART)

Simple decision trees

Tree traversal

Training decision trees with information gain

We can train a decision tree by starting with the simplest possible tree: all outcomes in a single node.

We can then do a greedy search to identify which split of that node is best.

We can then iterate this process on the resulting child nodes.

Training with information gain

We choose the split that most reduces entropy, i.e. the split with maximum information gain.

Entropy is:

\(E= -\sum_{i=1}^n p_i\log_2 p_i\)

Where \(p_i\) is the proportion of samples in class \(i\), and we sum across all \(n\) classes in the node.

Information gain

The information gain of a split is the entropy of the original node minus the size-weighted average of the entropies of each branch.
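A minimal sketch of entropy and information gain in pure Python (function names are illustrative, not from any particular library):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, branches):
    """Entropy of the parent node minus the size-weighted
    entropy of each branch produced by the split."""
    n = len(parent)
    return entropy(parent) - sum(len(b) / n * entropy(b) for b in branches)
```

A perfect split of `[0, 0, 1, 1]` into `[0, 0]` and `[1, 1]` gives a gain of exactly 1 bit.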

Information gain ratio

Training decision trees with Gini impurity

Pruning decision trees

Training a decision tree until each leaf contains only one entry from the training set will result in overfitting.

We can use pruning to regularise trees.


Reduced error pruning

Starting from the bottom, replace each node with a leaf predicting its most popular class. Accept the replacement if accuracy on a held-out validation set does not fall.

Cost complexity pruning

Take full tree \(T_0\)

Iteratively find the subtree whose replacement with a leaf is cheapest, where the cost function trades off accuracy against the number of leaves.

Remove it, generating \(T_{i+1}\).

Once only the root remains, choose a single tree from the sequence using cross-validation (CV).

Growing and pruning

Generally we would split the data up. Grow the tree with one set and then prune with the other.

We can split our data up and iterate between growing and pruning.

When pruning, for each pair of leaves we test to see if they should be merged.

If our two sets are \(A\) and \(B\) we can do:

  • \(A\): Grow

  • \(B\): Prune

  • \(B\): Grow

  • \(A\): Prune

And repeat this process.

Partial regression trees

Once we have built a tree, we keep a single leaf and discard the rest.

Decision trees with many classes

Training decision trees with many classes

Regression trees

Classic regression trees

In a classical regression tree, we follow a decision process as before, but the outcome is a real number.

Within each leaf, all inputs are assigned that same number.


With a regression problem we cannot split nodes the same way as we did for classification.

Instead we split by the residual sum of squares (RSS).
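A minimal sketch of an RSS-based split search on a single feature, assuming each group is predicted by its mean (names are illustrative):

```python
def best_rss_split(x, y):
    """Find the threshold on x that minimises the total residual sum of
    squares of the two resulting groups, each predicted by its mean."""
    def rss(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best_t, best_cost = None, float("inf")
    for t in sorted(set(x))[:-1]:  # every candidate threshold
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        cost = rss(left) + rss(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

For `x = [1, 2, 3, 4]` and `y = [0, 0, 10, 10]` the search finds the threshold `2`, which separates the two groups perfectly (RSS of 0).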

Training decision trees with Mean Squared Error (MSE)

Mixed regression trees

In classical trees all items in a leaf are assigned the same value. In this model, all items in a leaf instead share the parameters \(\theta \) of a parametric model.

This makes the resulting trees smoother.

We have some \(y_i = f(\mathbf x_i, \theta ) + \epsilon_i \)

The approach generalises classic regression trees. There the estimate was \(\bar y\). Here it’s a regression.


At each node we fit an OLS regression. If the \(R^2\) of the model is less than some constant, we find the split which maximises the minimum of the two children's \(R^2\) values.
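A sketch of that rule for one feature, assuming simple (one-variable) OLS at each node and an illustrative \(R^2\) threshold of 0.9:

```python
def r_squared(x, y):
    """R^2 of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot if ss_tot else 1.0

def split_if_poor_fit(x, y, threshold=0.9):
    """If the node's R^2 is below the threshold, return the threshold on x
    that maximises the minimum R^2 of the two children; otherwise None."""
    if r_squared(x, y) >= threshold:
        return None
    best_t, best_score = None, -1.0
    for t in sorted(set(x))[1:-1]:
        left = [(xi, yi) for xi, yi in zip(x, y) if xi <= t]
        right = [(xi, yi) for xi, yi in zip(x, y) if xi > t]
        if len(left) < 2 or len(right) < 2:
            continue
        score = min(r_squared(*zip(*left)), r_squared(*zip(*right)))
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

On piecewise-linear data that rises then falls, a single line fits poorly, so the node is split at the kink, where both children become perfect linear fits.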

Classifying with probabilistic decision trees

Previously our decision tree classifier was binary.

We can instead adapt the mixed tree model, using a probit model at each leaf.

Bayesian trees

Priors of trees

Priors for simple trees

We can define a tree as a set of nodes: \(T\).

For each node we define a splitting variable \(k\) and a splitting threshold \(r\).

Our prior is \(P(T, k, r)\).

We split this up to:

\(P(T, k, r)=P(T)P(k, r|T)\)

\(P(T, k, r)=P(T)P(k|T)P(r|T, k)\)

So we want to estimate:

  • \(P(T)\) - The number of nodes.

  • \(P(k|T)\) - Which variables we split by, given the tree size.

  • \(P(r|T, k)\) - The cutoff, given the tree size and the variables we are splitting by.

Priors for mixed trees

If at the leaf we have a parametric model, our prior is instead:

\(P(T, k, r, \theta )=P(T)P(k|T)P(r|T, k)P(\theta | T, k, r)\)

We then need to additionally estimate \(P(\theta | T, k, r)\).

The pinball prior

We can generate a tree with a fixed number of leaves, according to our prior.

As we start the tree we associate the root node with a count of all leaves.

As we split a node, the remaining leaf counts are divided between the directions. If there is only one leaf left, we do no further splitting.
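The process above can be sketched as follows, assuming the leaf budget at each split is divided uniformly at random (the names and the uniform choice are illustrative assumptions):

```python
import random

def pinball_counts(n_leaves, rng=None):
    """Sample a tree shape with exactly n_leaves leaves: each split divides
    the remaining leaf budget at random between the left and right children;
    a budget of 1 means no further splitting."""
    rng = rng or random.Random(0)
    if n_leaves == 1:
        return 1  # a leaf
    left = rng.randint(1, n_leaves - 1)
    return (pinball_counts(left, rng), pinball_counts(n_leaves - left, rng))

def count_leaves(tree):
    """Count the leaves of a nested (left, right) tuple tree."""
    return 1 if tree == 1 else count_leaves(tree[0]) + count_leaves(tree[1])
```

However the budget is divided, the sampled tree always has exactly the requested number of leaves.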

Estimating other priors

Bayesian CART

Our prior

Call the collective parameters of the tree \(\Theta =(T, k, r)\) and \(\theta \).

Collectively our prior is defined by \(P(\Theta )\) and \(P(\theta )\).

Bayes’ theorem

We want to know the posterior given our data \(X\).

\(P(\Theta | X)=\dfrac{P(X|\Theta )P(\Theta )}{P(X)}\)

\(P(\Theta | X)\propto P(X|\Theta )P(\Theta )\)

Expanded posterior

We now explore \(P(X|\Theta )\):

\(P(X|\Theta )=\int P(X| \theta , \Theta)P(\theta)d\theta \)

This means our posterior is:

\(P(\Theta | X)\propto P(\Theta )\int P(X| \theta , \Theta)P(\theta)d\theta \)


This can be estimated with MCMC.

Causal trees

Measuring treatment effects in leaves

Sample splitting for treatment effects

Honest trees

We use part of the sample to estimate \(\Theta \), and another part of the sample to estimate the treatment effect.

Estimating ATE using MCMC


Training with unbalanced data

An unbalanced dataset contains more examples of one class than of the others.

We can reduce the sample of the majority class (undersampling), or synthetically generate minority-class examples (oversampling).
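A minimal sketch of random undersampling, dropping majority-class examples until every class has as many examples as the rarest one (names are illustrative):

```python
import random

def undersample(X, y, rng=None):
    """Randomly drop examples so every class has as many as the rarest class."""
    rng = rng or random.Random(0)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(items) for items in by_class.values())
    X_bal, y_bal = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, n_min):  # keep n_min per class
            X_bal.append(xi)
            y_bal.append(label)
    return X_bal, y_bal
```

With 8 examples of class 0 and 2 of class 1, the result keeps 2 of each.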