If we generate different classifiers then each can give different predictions for the same input.

If we have different predictions how should we proceed?

It is not obvious that we should want to use more than one model. If one model were superior to another, there might be no benefit to using the information from the additional model.

However, if different models make different errors, then combining multiple models can lead to better performance, because each individual model can contribute unique information.

One approach to using different predictions is to use majority voting.

If we have a collection of hard classifiers then we choose the classification with the most votes.

Consider a collection of classifiers, where classifier \(i\) has a probability \(p_i\) of classifying correctly.

If the voters are independent, and each voter has a chance greater than \(0.5\) of being right, then the more voters, the better.
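This effect is easy to check numerically. A minimal sketch, assuming independent voters with a fixed per-voter accuracy (the value \(p = 0.6\) is a made-up example):

```python
from math import comb

def majority_correct(n, p):
    """Probability that a majority of n independent voters, each
    correct with probability p, votes for the right answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.6 > 0.5, accuracy grows with the number of voters.
for n in (1, 11, 101):
    print(n, round(majority_correct(n, 0.6), 3))
```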

A weak model can still be useful, provided its errors are independent of those of the other models.

If we have multiple predictions, we can take an average of these, possibly weighted.

We have \(m\) regressors \(g_j(\mathbf x_i)\).

Our output is:

\(h(\mathbf x_i)=\sum_j w_jg_j(\mathbf x_i)\)
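For instance (the predictions and weights below are made-up numbers):

```python
import numpy as np

g = np.array([2.0, 2.4, 1.9])   # g_j(x_i): three regressors' outputs
w = np.array([0.5, 0.3, 0.2])   # weights w_j, here chosen to sum to 1

h = w @ g                       # h(x_i) = sum_j w_j g_j(x_i)
print(h)                        # weighted average of the predictions
```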

With stacking we take the predictions from each of our classifiers, and then train a new model using these predictions as inputs.

Hard classifications (\(0\) or \(1\)) and soft classifications (values between \(0\) and \(1\)) can both be used as inputs.
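As a sketch, a least-squares meta-model trained on the soft predictions of two base classifiers (all numbers here are made up for illustration):

```python
import numpy as np

p1 = np.array([0.9, 0.2, 0.8, 0.3])   # soft outputs of base classifier 1
p2 = np.array([0.7, 0.4, 0.6, 0.1])   # soft outputs of base classifier 2
y  = np.array([1, 0, 1, 0])           # true labels

X_meta = np.column_stack([p1, p2])    # stacked predictions as inputs

# Meta-model: least-squares weights on the base predictions.
w, *_ = np.linalg.lstsq(X_meta, y, rcond=None)
y_hat = (X_meta @ w > 0.5).astype(int)
print(y_hat)
```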

We can select hyper-parameters using cross-validation; however, there is an issue with using cross-validation twice on the same data: once for the underlying classifiers, and again for the stacked model.

We have \(m\) models.

Part 1: Train each of them on all data.

Part 2: Split the data into k sets

For each set, treat all of the other data as its training set.

For each fold:

Fit each model on the training data

Predict on the held-out set.

Create a weighted predictor, choosing the weightings to minimise the error on these held-out predictions.

Part 3: Use these weightings to combine the models trained on all the data in Part 1.
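The three parts above can be sketched on toy data, with two hypothetical regressors (a constant predictor and a least-squares line) standing in for the models:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.1, 20)   # made-up noisy linear data

def fit_const(xt, yt):               # model 1: predict the training mean
    m = yt.mean()
    return lambda xs: np.full_like(xs, m)

def fit_linear(xt, yt):              # model 2: least-squares line
    a, b = np.polyfit(xt, yt, 1)
    return lambda xs: a * xs + b

fitters = [fit_const, fit_linear]

# Part 2: out-of-fold predictions -- each point is predicted by models
# that never saw it during fitting.
k = 5
folds = np.array_split(np.arange(20), k)
P = np.zeros((20, len(fitters)))
for idx in folds:
    train = np.setdiff1d(np.arange(20), idx)
    for j, fit in enumerate(fitters):
        P[idx, j] = fit(x[train], y[train])(x[idx])

# Choose weightings that minimise the squared error of the combination.
w, *_ = np.linalg.lstsq(P, y, rcond=None)

# Part 3: apply the weights to the models trained on all the data (Part 1).
full = [fit(x, y) for fit in fitters]
h = sum(wj * f(x) for wj, f in zip(w, full))
print(np.round(w, 2))
```

On this data the linear model dominates, so nearly all the weight lands on it.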

Boosting is a way to create multiple learners for use in an ensemble predictor.

The goal is to create many predictors which may not themselves be very accurate, but which have a high degree of independence.

AdaBoost is a popular algorithm for boosting.

It works by:

Creating a set of weak learners using different restrictions on features in the training data.

Choosing the weak learner that most reduces the error of the combined learners, and giving it the weighting that minimises that error.

Creating a new weighting for the dataset, in which points poorly predicted by the combination of learners so far are given high weights.

Repeating the process a fixed number of times.
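The steps above can be sketched end-to-end. This is a minimal AdaBoost run on made-up 1-D data, with simple threshold stumps standing in for the weak learners (labels in \(\{-1, +1\}\)):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.45, 0.6, 0.7, 0.8, 0.9])
y = np.array([ 1,   1,  -1,  -1,   1,   1,  -1,  -1 ])   # labels in {-1, +1}

# The pool of weak learners: one-threshold classifiers in both directions.
def stump(th, sign):
    return lambda xs: sign * np.where(xs < th, 1, -1)
stumps = [stump(t, s) for t in np.linspace(0, 1, 21) for s in (1, -1)]

w = np.full(len(x), 1 / len(x))     # uniform weights on the dataset
F = np.zeros(len(x))                # combined (weighted) prediction
for _ in range(5):
    # Pick the weak learner with the smallest weighted error.
    errs = [np.sum(w * (h(x) != y)) for h in stumps]
    h = stumps[int(np.argmin(errs))]
    err = max(min(errs), 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # its weighting
    F += alpha * h(x)
    # Re-weight the dataset: misclassified points get higher weight.
    w *= np.exp(-alpha * y * h(x))
    w /= w.sum()

print((np.sign(F) == y).mean())     # ensemble accuracy on the toy data
```

No single stump classifies this pattern perfectly, but the weighted combination does.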

Gradient boosting does not iteratively change the weights on the training data. Instead, it trains each new learner on the errors of the current ensemble.

While AdaBoost trains each weak classifier to reduce a weighted error, gradient boosting trains each new learner on the difference between the actual value and the current prediction (the residual).
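A sketch of that idea for regression with squared error, where each round fits a depth-1 "stump" to the current residuals (the toy data and the stump fitter are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x)            # made-up regression target

def fit_stump(x, r):
    """Best single-split (depth-1) regressor for the residuals r."""
    best = None
    for th in np.linspace(0.05, 0.95, 19):
        mask = x < th
        if mask.all() or not mask.any():
            continue                 # skip splits with an empty side
        left, right = r[mask].mean(), r[~mask].mean()
        sse = ((r - np.where(mask, left, right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, th, left, right)
    _, th, left, right = best
    return lambda xs: np.where(xs < th, left, right)

F = np.full_like(y, y.mean())        # start from the mean prediction
for _ in range(50):
    h = fit_stump(x, y - F)          # fit the residuals y - F(x)
    F += 0.5 * h(x)                  # add a damped correction

print(np.mean((y - F) ** 2))         # training error after 50 rounds
```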

Bagging, or Bootstrap AGGregatING, is a way of generating weak learners.

Bootstrapping refers to taking samples with replacement from the training set. This is how the dataset for each of the weak learners is formed.

We take samples from the training set, with replacement, and train a learner on each sample separately. This gives us our weak learners.
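A sketch of the resampling step (the array contents are arbitrary stand-ins for training examples):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)          # stand-in for a training set of 10 examples

# One bootstrap sample per weak learner: same size, drawn with replacement.
samples = [X[rng.choice(len(X), size=len(X), replace=True)] for _ in range(3)]
for s in samples:
    print(sorted(s))       # duplicates appear; some points are left out
```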

Gradient tree boosting applies gradient boosting to decision trees.

Random forests use bagging techniques with randomised trees.

At each node, rather than considering every feature, we consider a random selection of them.

Given \(d\) dimensions, sample \(m\) of them at each node.

Choose \(m\le \sqrt d\)
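A sketch of the per-node feature sampling (the value \(d = 16\) is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                            # total number of features
m = int(np.sqrt(d))               # sample m = 4 of them per node

node_features = rng.choice(d, size=m, replace=False)
print(sorted(node_features))      # the features considered at this node
```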

For this to work, we need each tree's estimate to be unbiased: averaging reduces variance, but it does not remove bias.