# Ensemble methods

## Combining learners

### Majority voting

#### Combining classifiers

If we generate different classifiers, then each can give a different prediction for the same input.

If we have different predictions, how should we proceed?

#### Using one model or multiple models

It is not obvious that we should want to use more than one model. If one model is superior to another, there may be no benefit in using the information from the weaker model.

However, if the models make different errors, then combining them can lead to better performance, as each individual model can contribute information the others lack.

#### Majority voting

One approach to using different predictions is to use majority voting.

If we have a collection of hard classifiers then we choose the classification with the most votes.
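As a minimal sketch of majority voting for a single input, assuming each classifier has already produced a hard label. The function name `majority_vote` and the tie-breaking rule (the label seen first wins) are choices made for this illustration only.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by the most classifiers.

    `predictions` holds one hard label per classifier for a single input.
    Ties go to the label that appears first (an arbitrary choice made here).
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Three hypothetical classifiers vote on one input.
print(majority_vote(["cat", "dog", "cat"]))
```

With an odd number of binary classifiers a tie cannot occur, which is one reason odd ensemble sizes are convenient.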

#### Condorcet’s Jury Theorem

Consider a collection of classifiers, each acting as a voter. For each classifier $$i$$ there is a chance $$p_i$$ of the classification being correct.

If the voters are independent, and the chance of each vote being right is greater than $$0.5$$, then the more voters there are, the more likely the majority vote is to be correct.

A weak model can therefore still be useful, provided its errors are independent of those of the other models.
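The effect can be checked numerically. The sketch below makes two simplifying assumptions that are not in the text above: every voter has the same accuracy $$p$$ (rather than individual $$p_i$$), and the number of voters $$n$$ is odd so that ties cannot occur. Under those assumptions the number of correct votes is binomial, and the function name `majority_correct_probability` is hypothetical.

```python
from math import comb

def majority_correct_probability(n, p):
    """Probability that more than half of n independent voters are correct.

    Assumes every voter is correct with the same probability p, and n is odd.
    """
    smallest_majority = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(smallest_majority, n + 1)
    )

# Accuracy of the majority vote as the number of voters grows, for p = 0.6.
for n in (1, 3, 11, 51):
    print(n, round(majority_correct_probability(n, 0.6), 3))
```

For three voters with $$p=0.6$$ this gives about $$0.648$$, already better than a single voter. With $$p<0.5$$ the same formula shows the opposite: adding voters makes the majority worse.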

### Averaging regression predictions

If we have multiple predictions, we can take an average of these, possibly weighted.

We have $$m$$ regressors $$g_j(\mathbf x_i)$$.

Our output is:

$$h(\mathbf x_i)=\sum_j w_jg_j(\mathbf x_i)$$
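A minimal sketch of this weighted sum for one input, assuming the $$m$$ regressor outputs $$g_j(\mathbf x_i)$$ are already available as a list. The default of equal weights $$w_j = 1/m$$ is an assumption made here; the text above does not fix how the weights are chosen.

```python
def weighted_average_prediction(predictions, weights=None):
    """Combine regressor outputs g_j(x) into h(x) = sum_j w_j * g_j(x)."""
    m = len(predictions)
    if weights is None:
        # Assumption: with no weights given, fall back to a plain average.
        weights = [1.0 / m] * m
    if len(weights) != m:
        raise ValueError("need exactly one weight per regressor")
    return sum(w * g for w, g in zip(weights, predictions))

# Three hypothetical regressors predict 1.0, 2.0 and 3.0 for the same input.
print(weighted_average_prediction([1.0, 2.0, 3.0]))
print(weighted_average_prediction([1.0, 2.0, 3.0], weights=[0.5, 0.25, 0.25]))
```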

### Stacking and SuperLearner

#### Introduction

With stacking we take the predictions from each of our classifiers, and then train a new model using these predictions as inputs.

#### Hard and soft inputs

Hard classification ($$0$$ or $$1$$) and soft classification (between $$0$$ and $$1$$) can be used as inputs.
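To make the idea concrete, here is a deliberately tiny stacking sketch that uses hard inputs. The stacked model is a lookup table: for each pattern of base-classifier predictions it remembers the most common true label. This choice of meta-model, and the name `fit_stacker`, are assumptions for illustration; in practice the stacked model is usually something like a logistic regression.

```python
from collections import Counter, defaultdict

def fit_stacker(base_predictions, labels):
    """Train a lookup-table meta-model on the base classifiers' hard outputs.

    base_predictions[i] is the tuple of base predictions for example i.
    """
    table = defaultdict(Counter)
    for preds, label in zip(base_predictions, labels):
        table[tuple(preds)][label] += 1
    lookup = {pattern: c.most_common(1)[0][0] for pattern, c in table.items()}
    # Patterns never seen in training fall back to the most common label.
    default = Counter(labels).most_common(1)[0][0]

    def predict(preds):
        return lookup.get(tuple(preds), default)

    return predict

# Two hypothetical base classifiers; the first one is usually wrong.
base_predictions = [(0, 1), (1, 0), (0, 1), (1, 1)]
labels = [1, 0, 1, 1]
stacker = fit_stacker(base_predictions, labels)
print(stacker((0, 1)))
```

Unlike majority voting, the stacked model can learn that a particular base classifier tends to be wrong and discount or invert its vote.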

#### Cross validation

We can select hyper-parameters using cross validation. However, there is an issue with using cross validation twice on the same data: once for the underlying classifiers, and again for the stacked model.

#### SuperLearner

We have $$m$$ models.

Part 1: Train each of them on all the data.

Part 2: Split the data into $$k$$ folds.

For each fold, treat all the other data as the training set.

For each fold:

• Fit each model on the training data.

• Predict on the held-out fold.

• Create a weighted predictor from the held-out predictions, choosing the weights to minimise the error.

Part 3: Apply these weights to the models from Part 1, which were trained on all (unrestricted) data.
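The three parts can be sketched as follows. This is one reading of the steps, under several assumptions not stated in the text: it is a regression problem with squared error, held-out predictions from all folds are pooled before a single set of weights is chosen, the weights are non-negative, sum to one, and are found by a coarse grid search, and $$k\ge 2$$. The two base models (`fit_mean`, `fit_line`) and the function `super_learner` are hypothetical names for this illustration.

```python
from itertools import product

def fit_mean(xs, ys):
    """Hypothetical base model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Hypothetical base model: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def super_learner(fitters, xs, ys, k=3, steps=10):
    n, m = len(xs), len(fitters)
    # Part 1: train every model on all the data.
    full_models = [fit(xs, ys) for fit in fitters]
    # Part 2: for each fold, fit on the rest and predict on the held-out fold.
    held_out = [None] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        fold_models = [
            fit([xs[i] for i in train], [ys[i] for i in train]) for fit in fitters
        ]
        for i in range(fold, n, k):
            held_out[i] = [g(xs[i]) for g in fold_models]

    def cv_error(weights):
        # Squared error of the weighted predictor on held-out predictions.
        return sum(
            (y - sum(w * p for w, p in zip(weights, preds))) ** 2
            for preds, y in zip(held_out, ys)
        )

    # Coarse grid over non-negative weights that sum to one.
    grid = [
        tuple(c / steps for c in combo)
        for combo in product(range(steps + 1), repeat=m)
        if sum(combo) == steps
    ]
    weights = min(grid, key=cv_error)

    # Part 3: apply the weights to the models trained on all the data.
    def predict(x):
        return sum(w * g(x) for w, g in zip(weights, full_models))

    return predict, weights

xs = list(range(12))
ys = [2 * x + 1 for x in xs]
predict, weights = super_learner([fit_mean, fit_line], xs, ys)
print(weights, predict(20))
```

On this exactly linear toy data the line model receives all of the weight. The grid search grows quickly with the number of models, so a constrained least-squares solver would be the more usual choice for larger $$m$$.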

## Generating learners with boosting

### $$L_2$$ boosting

#### Introduction

Boosting is a way to create multiple learners for use in an ensemble predictor.

The goal is to create many predictors which may not themselves be very accurate, but which have a high degree of independence.

AdaBoost is a popular algorithm for boosting.

It works by:

• Creating a set of weak learners using different restrictions on features in the training data.

• Choosing the weak learner that most reduces the error of the combined learners, and giving it the weighting that most reduces that error.

• Creating a new weighting for the dataset, where examples poorly predicted (by the combination of learners) are given high weights.

• Repeating the process a fixed number of times.

Gradient boosting does not iteratively re-weight the training data. Instead, each new learner is trained on the errors of the current ensemble.

While AdaBoost trains each weak classifier to reduce the weighted classification error, gradient boosting trains each new learner on the difference between the actual value and the current prediction.
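For squared error this "difference" is simply the residual, which gives $$L_2$$ boosting. The sketch below is a minimal illustration under assumptions of its own: a single numeric feature, regression stumps (one split, two constant predictions) as weak learners, an initial prediction of zero, and a fixed `learning_rate`. The function names are hypothetical.

```python
from statistics import mean

def fit_stump(xs, targets):
    """Weak learner: the one-split regression stump with the lowest squared error."""
    best = None
    # The largest x is skipped so that the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical, so predict a constant.
        constant = mean(targets)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def l2_boost(xs, ys, rounds=50, learning_rate=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    stumps = []

    def predict(x):
        return sum(learning_rate * stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]
model = l2_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each round can only lower the training error here, because the stump is the least-squares fit to the current residuals and only a fraction of it is added.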

## Generating learners with Bootstrapped AGGregation (bagging)

### Bagging

#### Introduction

Bagging, or Bootstrap AGGregation, is a way of generating weak learners.

#### Bootstrapping

Bootstrapping refers to taking samples with replacement from the training set. This is how the datasets for each of the weak learners are formed.

#### How to do bagging

We take samples from the training set, with replacement, and train a separate learner on each sample. This gives us our weak learners.
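A minimal sketch of the procedure, assuming the training set is a list of `(x, y)` pairs and that the learners are aggregated by averaging their predictions (the usual choice for regression; for classification one would vote instead). The weak learner `fit_mean` and the other names are hypothetical and chosen only to keep the example short.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) items from `data`, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def fit_mean(sample):
    """Hypothetical weak learner: ignore x and predict the mean target."""
    target_mean = mean(y for _, y in sample)
    return lambda x: target_mean

def bagging(data, fit, n_learners=25, seed=0):
    """Train one learner per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    learners = [fit(bootstrap_sample(data, rng)) for _ in range(n_learners)]

    def predict(x):
        return mean(learner(x) for learner in learners)

    return predict

data = [(0, 1.0), (1, 2.0), (2, 2.5), (3, 4.0)]
predict = bagging(data, fit_mean)
print(predict(1.5))
```

Because sampling is with replacement, each bootstrap sample typically repeats some examples and omits others, which is what makes the learners differ from one another.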

## Ensemble methods for trees

The ensemble methods above can be applied to decision trees. For example, gradient boosting can use trees as its weak learners.

### Random forests

Random forests use bagging with randomised trees.

At each node, rather than considering every feature when choosing a split, we consider only a random selection of them.

If the data has $$d$$ dimensions, we sample $$m$$ of them at each node.

Choose $$m\le \sqrt d$$.
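The per-node sampling step can be sketched as below. The function name `sample_split_features` is hypothetical, and the default $$m=\lfloor\sqrt d\rfloor$$ (with a minimum of one) is just one choice satisfying $$m\le\sqrt d$$.

```python
import math
import random

def sample_split_features(d, rng, m=None):
    """Pick the indices of the m features considered at a single node."""
    if m is None:
        # floor(sqrt(d)) satisfies m <= sqrt(d); at least one feature is kept.
        m = max(1, math.isqrt(d))
    # Sampling without replacement, so no feature is considered twice.
    return sorted(rng.sample(range(d), m))

rng = random.Random(0)
# With d = 16 features, each node looks at 4 of them.
for node in range(3):
    print(node, sample_split_features(16, rng))
```

A fresh subset is drawn at every node, so different nodes of the same tree can split on different features.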

## Bootstrapping moments of ensemble statistics

### Bootstrapping confidence intervals

For this we need to know that the estimate is unbiased.
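As an illustration, the sketch below computes a percentile bootstrap interval, which is one common variant and an assumption here, since the text does not say which construction is intended. The statistic is resampled many times and the interval is read off the sorted estimates. The function name, the default of 2000 resamples and the 95% level are all choices made for this example.

```python
import random
from statistics import mean

def bootstrap_confidence_interval(
    data, statistic=mean, n_resamples=2000, alpha=0.05, seed=0
):
    """Percentile bootstrap interval for `statistic` at level 1 - alpha."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples)
    )
    # Read the lower and upper percentiles off the sorted estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
print(bootstrap_confidence_interval(data))
```

If the estimate is biased, the interval is centred in the wrong place, which is why the requirement above matters.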