Regularising linear regression for prediction

OLS predictions with many parameters

Too many variables

If there are more independent variables than samples, OLS has no unique solution: there are infinitely many perfect fits.

For example, if we regress height on genetic information with \(1000\) people, there will be too little data to fit using OLS.

This is due to collinearity.

We can also end up with too many variables through the use of derived variables, for example if we include \(x\), \(x^2\), \(x^3\), etc.
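A minimal numpy sketch (illustrative data, not the genetics example) of why OLS fails here: with more columns than rows, \(X^TX\) is rank-deficient, and any null-space direction of \(X\) can be added to a perfect fit to give another perfect fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                            # more variables than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X cannot be inverted: its rank is at most n, far below p.
print(np.linalg.matrix_rank(X.T @ X))    # 10

# lstsq returns one particular (minimum-norm) solution; it fits y exactly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))             # True

# Adding any null-space direction of X gives another perfect fit.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                        # X @ null_dir is (numerically) zero
print(np.allclose(X @ (w + 5 * null_dir), y))   # True again
```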

Optimal sparse regression

Optimal is \(\lambda = 2\sigma\sqrt{\frac{2\log (pn)}{n}}\)

Relies on knowing \(\sigma\), which we may not.

Instead we can use root LASSO.

Minimise the square root of the sum of squares loss (divided by \(n\)), and use \(\lambda = \sqrt{2\log (pn)/n}\).

This does not depend on \(\sigma\).

LASSO estimates are biased towards zero, and many coefficients are set to exactly \(0\).

Post-LASSO

We can use LASSO for model selection, then run OLS using only the selected variables.
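A hedged sketch of post-LASSO with scikit-learn (the penalty value and simulated data are illustrative assumptions, not the plug-in \(\lambda\) above): LASSO chooses the support, then OLS is refitted on the selected columns to remove the shrinkage bias.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # sparse true coefficients
y = X @ beta + rng.normal(size=n)

# Step 1: LASSO for model selection (alpha is an illustrative choice).
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Step 2: OLS on the selected variables only, removing the shrinkage bias.
post = LinearRegression().fit(X[:, selected], y)
print(selected)
print(post.coef_)
```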

Least Absolute Shrinkage and Selection Operator (LASSO)

Introduction

With LASSO we add a constraint to \(\hat \theta\).

\(\sum_i |\hat \theta_i| \le t\)

Regularisation of LLS: the sum of the absolute values of the coefficients is constrained to be below the hyperparameter \(t\).

\(L_1\) regularisation

This is also known as sparse regression, because many weights are set to \(0\).

This now looks like:

\(w_{lasso} = \arg \min ||y-Xw||^2_2+\lambda ||w||_1\)
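A minimal scikit-learn sketch of this objective (assuming scikit-learn as the tool; the data is simulated for illustration): note that `Lasso` minimises \(\frac{1}{2n}||y-Xw||_2^2 + \alpha||w||_1\), so its `alpha` corresponds to \(\lambda/(2n)\) in the formulation above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 50, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

lam = 10.0                       # lambda in ||y - Xw||^2 + lambda*||w||_1
# scikit-learn minimises (1/(2n))*||y - Xw||^2 + alpha*||w||_1,
# so alpha = lambda / (2n) matches the formulation above.
lasso = Lasso(alpha=lam / (2 * n)).fit(X, y)
print(lasso.coef_)               # many entries are exactly zero (sparse)
```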

Hyperparameter

\(t\) is a hyperparameter.

Optimal hyperparameter for LASSO

Feature scaling

Ridge regression

Regularisation of LLS. The cost function now includes a squared norm penalty on \(M\theta\) (for standard ridge regression \(M\) is the identity, so the penalty is on \(\theta\) itself).

\(L_2\) regularisation

This allows us to solve problems where there are too many features. \(L_1\) regularisation also allows us to do this.

Overspecified

If \(d>n\) (more features than samples) we can minimise the norm of the weights subject to \(Xw=y\). This is the least-norm solution.
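A small numpy sketch of the least-norm solution in this underdetermined case (simulated data for illustration): the Moore-Penrose pseudoinverse picks the \(w\) with the smallest norm among all exact fits.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 100                           # underdetermined: more features than samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Minimum-norm solution of Xw = y via the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(X @ w_min_norm, y))    # exact fit
print(np.linalg.norm(w_min_norm))        # smallest norm among all exact fits
```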

Maximum a posteriori (MAP) estimator for linear regression

Maximum a posteriori estimation with a Gaussian prior centred at \(0\) is equivalent to ridge regression.

\(w_{RR}=(\lambda I+X^TX)^{-1}X^Ty\)

\(E[w_{RR}]=(\lambda I+X^TX)^{-1}X^TXw\)

\(Var[w_{RR}]=\sigma^2(\lambda I+X^TX)^{-1}X^TX(\lambda I+X^TX)^{-1}\)

Elasticnet

Regularisation of LLS. Combines lasso and ridge regression.

\(L_1\) and \(L_2\) regularisation
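An illustrative scikit-learn sketch (assuming its `ElasticNet` parameterisation; the data is simulated): `alpha` scales the overall penalty and `l1_ratio` sets the mix between the \(L_1\) and \(L_2\) terms.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# scikit-learn minimises
#   (1/(2n))*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2,
# i.e. a weighted mix of the LASSO (L1) and ridge (L2) penalties.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```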

\(L_p\) regularisation

Introduction

We can generalise this to:

\(w_{l_p} = \arg \min ||y-Xw||^2_2+\lambda ||w||^p_p\)

For ridge regression there is always a solution.

For least squares there is a unique solution if \(X^TX\) is invertible.

For Lasso we must use numerical optimisation.

LASSO (\(L_1\) regularisation) induces sparsity.

The goal is to minimise \(||y - f(x)||^2 + \lambda g(w)\).

Ridge regression: \(g(w)=||w||_2^2\)

If \(\lambda = 0\) we recover OLS; as \(\lambda \to \infty\), \(w\) shrinks to \(0\).

The normal equation changes to: \(w = (\lambda I + X^TX)^{-1}X^Ty\)

We can preprocess to avoid including a column of \(1\)s for the intercept: shift the mean of \(y\) to \(0\), and normalise each column of \(X\) to mean \(0\) and variance \(1\).
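A numpy sketch of this recipe (simulated data and an illustrative \(\lambda\)): centre \(y\), standardise the columns of \(X\), then apply the modified normal equation directly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 10
X = rng.normal(loc=2.0, scale=3.0, size=(n, d))
y = 5.0 + X[:, 0] - X[:, 1] + rng.normal(size=n)

# Preprocess so no column of 1s (intercept) is needed:
# centre y, and standardise each column of X to mean 0 and variance 1.
y_c = y - y.mean()
X_s = (X - X.mean(axis=0)) / X.std(axis=0)

# Ridge solution from the modified normal equation.
lam = 1.0
w_ridge = np.linalg.solve(lam * np.eye(d) + X_s.T @ X_s, X_s.T @ y_c)
print(w_ridge)
```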

Lava

Introduction

Alternative to ElasticNet

Each parameter is split into

\(\theta_i=\rho_i+\phi_i\)

There is an \(L_2\) penalty on \(\rho\) and an \(L_1\) penalty on \(\phi\).

This means that large coefficients can be penalised like \(L_1\) and small coefficients like \(L_2\).
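A hedged sketch of one way to compute a lava-style fit (block coordinate descent is an illustrative choice here, not necessarily the estimator's original algorithm): alternate a closed-form ridge update for \(\rho\) with a LASSO update for \(\phi\) on the residual.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lava_fit(X, y, lam1=1.0, lam2=1.0, n_iter=50):
    """Minimise ||y - X(rho + phi)||^2 + lam2*||rho||^2 + lam1*||phi||_1
    by alternating over the two blocks (a simple illustrative scheme)."""
    n, d = X.shape
    rho, phi = np.zeros(d), np.zeros(d)
    ridge_mat = np.linalg.inv(lam2 * np.eye(d) + X.T @ X)
    # scikit-learn's Lasso uses (1/(2n))*||r - X*phi||^2 + alpha*||phi||_1,
    # so alpha = lam1 / (2n) matches the objective above.
    lasso = Lasso(alpha=lam1 / (2 * n), fit_intercept=False)
    for _ in range(n_iter):
        rho = ridge_mat @ (X.T @ (y - X @ phi))   # closed-form ridge step
        phi = lasso.fit(X, y - X @ rho).coef_     # L1 step on the residual
    return rho, phi

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 30))
theta = rng.normal(scale=0.1, size=30)            # many small coefficients...
theta[:3] += [4.0, -3.0, 2.0]                     # ...plus a few large ones
y = X @ theta + rng.normal(size=100)
rho, phi = lava_fit(X, y)
print(np.round(phi, 2))   # sparse part picks up the large coefficients
print(np.round(rho, 2))   # dense part absorbs the small ones
```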

Tests

The Ramsey RESET test

The Ramsey Regression Equation Specification Error Test (RESET)

Once we have done our OLS we have \(\hat y\).

The Ramsey RESET test is an additional stage, which takes these predictions and estimates:

\(y=\theta x+\alpha_1 \hat{y}^2+\alpha_2 \hat{y}^3+\alpha_3 \hat{y}^4\)

We then run an F-test on \(\alpha\), with the null that \(\alpha = 0\).
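A statsmodels sketch of the procedure (assuming statsmodels; the simulated nonlinear data is illustrative), comparing the original regression with one augmented by \(\hat{y}^2\), \(\hat{y}^3\) and \(\hat{y}^4\) via an F-test.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(size=200)   # true relationship is nonlinear

# Restricted model: plain OLS of y on x.
X = sm.add_constant(x)
restricted = sm.OLS(y, X).fit()
y_hat = restricted.fittedvalues

# Augmented model: add powers of the fitted values.
X_aug = np.column_stack([X, y_hat**2, y_hat**3, y_hat**4])
unrestricted = sm.OLS(y, X_aug).fit()

# F-test of the null that all coefficients on the added powers are zero.
f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f_stat, p_value)   # a small p-value suggests misspecification
```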

Introduction

Alternative to RESET

We have \(\hat y\).

We regress \(y=\alpha + \beta \hat y + \gamma \hat y^2\).

We test that \(\gamma =0\).

If it is not zero, this suggests the model is misspecified.
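A short statsmodels sketch of this check (same illustrative simulated setup as the RESET example): regress \(y\) on \(\hat{y}\) and \(\hat{y}^2\) and inspect the p-value on \(\gamma\).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(size=200)

# Original fit and its predictions.
fit = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = fit.fittedvalues

# Regress y on y_hat and y_hat^2; test gamma (the coefficient on y_hat^2).
Z = sm.add_constant(np.column_stack([y_hat, y_hat**2]))
aux = sm.OLS(y, Z).fit()
print(aux.pvalues[-1])   # a small p-value for gamma suggests misspecification
```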

Bias trade-off

Introduction

There is a trade-off between parameter accuracy and prediction accuracy: regularisation biases the coefficient estimates towards zero, but the reduction in variance can improve predictions.