We can train a decision tree by starting with the simplest possible tree - all outcomes in the same single node.

We can then do a greedy search to identify which split on the node is best.

We can then iterate this process on the resulting child nodes.

We split nodes to maximise the information gain - the reduction in entropy.

Entropy is:

\(E= -\sum_{i=1}^n p_i\log_2 p_i\)

Where we are summing across all classes in the node, with \(p_i\) the proportion of the node's samples in class \(i\).

The information gain of a split is the original entropy minus the entropy of each branch, weighted by branch size.
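A minimal sketch of these calculations in Python (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy of the parent node minus the size-weighted entropy of each branch."""
    n = len(parent)
    weighted = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - weighted
```

A perfect split of a 50/50 node, e.g. `information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])`, gains one full bit.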

Training a decision tree until each leaf contains only one entry from the training set will result in overfitting.

We can use pruning to regularise trees.

From the bottom up, replace each internal node with a leaf predicting its most popular class. Accept the replacement if there is no reduction in accuracy.
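A sketch of this bottom-up (reduced-error) pruning on a toy tree represented as nested dicts; the structure and field names are purely illustrative:

```python
def predict(node, x):
    """Follow splits down to a leaf and return its class."""
    if "class" in node:
        return node["class"]
    child = "left" if x[node["k"]] <= node["r"] else "right"
    return predict(node[child], x)

def accuracy(root, data):
    return sum(predict(root, x) == y for x, y in data) / len(data)

def prune(node, root, data):
    """Bottom-up pruning: try replacing each internal node with a leaf
    of its most popular class; keep the change only if accuracy on
    `data` does not drop."""
    if "class" in node:
        return
    prune(node["left"], root, data)
    prune(node["right"], root, data)
    before = accuracy(root, data)
    saved = dict(node)                    # remember the subtree
    node.clear()
    node["class"] = saved["majority"]     # most popular class, stored at build time
    if accuracy(root, data) < before:     # reduction in accuracy: restore
        node.clear()
        node.update(saved)

# Toy tree: the right-hand split separates two leaves that both predict
# class 1, so it prunes away; the root split is needed and survives.
tree = {"k": 0, "r": 0.5, "majority": 0,
        "left": {"class": 0},
        "right": {"k": 0, "r": 0.8, "majority": 1,
                  "left": {"class": 1}, "right": {"class": 1}}}
data = [([0.2], 0), ([0.6], 1), ([0.9], 1)]
prune(tree, tree, data)
```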

Take the full tree \(T_0\)

Iteratively find a subtree to replace with a leaf. The cost function trades off accuracy against the number of leaves.

Removing it generates \(T_{i+1}\)

When we have pruned back to just the root, choose a single tree from the sequence using CV.
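As a concrete illustration, assuming scikit-learn is available: `cost_complexity_pruning_path` returns the sequence of effective penalties, one per subtree \(T_i\), and we can choose among them by cross-validation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each effective alpha corresponds to one subtree in the pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Choose the subtree (alpha) with the best cross-validated accuracy.
scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5
    ).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```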

Generally we would split the data up. Grow the tree with one set and then prune with the other.

We can split our data up and iterate between growing and pruning.

When pruning, for each pair of leaves we test to see if they should be merged.

If our two sets are \(A\) and \(B\) we can do:

\(A\): Grow

\(B\): Prune

\(B\): Grow

\(A\): Prune

And repeat this process.

Once we have built the trees, we keep a single tree and discard the rest.

In a classical regression tree, we follow a decision process as before, but the outcome is a real number.

Within each leaf, all inputs are assigned that same number.

With a regression problem we cannot split nodes the same way as we did for classification.

Instead we split by the residual sum of squares (RSS).
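A sketch of the greedy RSS split search on a single feature, assuming NumPy (names are illustrative):

```python
import numpy as np

def rss(y):
    """Residual sum of squares of y around its mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Greedy search over thresholds on a single feature x, choosing
    the split that minimises the summed RSS of the two child nodes."""
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:           # every value except the last is a candidate
        left, right = y[x <= t], y[x > t]
        cost = rss(left) + rss(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

In a full tree this search would run over every feature, taking the overall minimum.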

In classical trees all items in a leaf are assigned the same value. In this model, all are given the parameters \(\theta \) of a parametric model fitted in that leaf.

This makes the resulting trees smoother.

We have some \(y_i = f(\mathbf x_i, \theta ) + \epsilon_i \)

The approach generalises classical regression trees: there the leaf estimate was \(\bar y\); here it is a regression model.

At each node we fit OLS. If the \(R^2\) of the model is less than some constant, we find the split which maximises the minimum of the two child nodes' \(R^2\) values.
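A sketch of the node-level computation, assuming NumPy; the threshold constant and function names are illustrative:

```python
import numpy as np

R2_THRESHOLD = 0.9  # illustrative constant: stop splitting once the fit is good enough

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (with an intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def should_split(X, y):
    """Only split nodes whose own OLS fit falls below the threshold."""
    return r_squared(X, y) < R2_THRESHOLD

def split_score(X, y, feature, threshold):
    """Score a candidate split by the minimum of the two child R^2 values."""
    mask = X[:, feature] <= threshold
    return min(r_squared(X[mask], y[mask]), r_squared(X[~mask], y[~mask]))
```

The split actually chosen is the one maximising `split_score` over candidate features and thresholds.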

Previously our decision tree classifier was binary.

We can instead adapt the mixed tree model by using a probit model at each leaf.

We can define a tree as a set of nodes: \(T\).

For each node we define a splitting variable \(k\) and a splitting threshold \(r\).

Our prior is \(P(T, k, r)\).

We can split this up as:

\(P(T, k, r)=P(T)P(k, r|T)\)

\(P(T, k, r)=P(T)P(k|T)P(r|T, k)\)

So we want to estimate:

\(P(T)\) - The number of nodes.

\(P(k|T)\) - Which variables we split by, given the tree size.

\(P(r|T, k)\) - The cutoff, given the tree size and the variables we are splitting by.

If at the leaf we have a parametric model, our prior is instead:

\(P(T, k, r, \theta )=P(T)P(k|T)P(r|T, k)P(\theta | T, k, r)\)

We then need to additionally estimate \(P(\theta | T, k, r)\).

We can generate a tree with a fixed number of leaves, according to our prior.

As we start the tree we associate the root node with a count of all leaves.

As we split a node, the remaining leaf count is divided between the two branches. If a node has only one leaf left, we do no further splitting.
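A sketch of sampling a tree with a fixed leaf count from such a prior, with uniform priors over the splitting variable \(k\) and threshold \(r\) (the dict representation and names are illustrative):

```python
import random

def sample_tree(n_leaves, n_features, rng=None):
    """Recursively sample a tree with exactly n_leaves leaves.
    Each split divides the remaining leaf budget at random between
    the two branches; a budget of 1 means the node becomes a leaf."""
    rng = rng or random.Random(0)
    if n_leaves == 1:
        return {"leaf": True}
    left = rng.randint(1, n_leaves - 1)   # leaves sent down the left branch
    return {
        "leaf": False,
        "k": rng.randrange(n_features),   # splitting variable, uniform prior
        "r": rng.random(),                # splitting threshold, uniform prior
        "left": sample_tree(left, n_features, rng),
        "right": sample_tree(n_leaves - left, n_features, rng),
    }

def count_leaves(node):
    return 1 if node["leaf"] else count_leaves(node["left"]) + count_leaves(node["right"])
```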

Call the collective tree parameters \(\Theta =(T, k, r)\), with leaf parameters \(\theta \).

Collectively our prior is defined by \(P(\Theta )\) and \(P(\theta )\)

We want to know the posterior given our data \(X\).

\(P(\Theta | X)=\dfrac{P(X|\Theta )P(\Theta )}{P(X)}\)

\(P(\Theta | X)\propto P(X|\Theta )P(\Theta )\)

We now explore \(P(X|\Theta )\)

\(P(X|\Theta )=\int P(X| \theta , \Theta)P(\theta)d\theta \)

This means our posterior is:

\(P(\Theta | X)\propto P(\Theta )\int P(X| \theta , \Theta)P(\theta)d\theta \)

This can be estimated with MCMC.

We use part of the sample to estimate \(\Theta \), and another part of the sample to estimate the treatment effect.

An unbalanced dataset has more of one class than the others.

We can reduce the sample of the majority class (undersampling), or synthetically generate minority-class examples (oversampling, e.g. SMOTE).
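A sketch of random undersampling of the majority classes; SMOTE-style synthetic generation of the minority class is the other common option (names are illustrative):

```python
import random
from collections import Counter

def undersample_majority(X, y, rng=None):
    """Randomly drop rows of the larger classes until every class has
    as many rows as the rarest one."""
    rng = rng or random.Random(0)
    target = min(Counter(y).values())     # size of the rarest class
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    Xb, yb = [], []
    for cls, rows in by_class.items():
        for xi in rng.sample(rows, target):
            Xb.append(xi)
            yb.append(cls)
    return Xb, yb
```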