Regularising linear regression for prediction

OLS predictions with many parameters

Too many variables

If there are more independent variables than samples then OLS will not work: there will be an infinite number of perfect fits.

For example, if we regress height on genetic information for \(1000\) people, there will be too little data to fit using OLS, because there are many more genetic variants than people.

This is due to collinearity: with more variables than samples, some columns of the design matrix are necessarily linear combinations of the others.

We could also end up with too many variables through the use of derived variables, for example if we choose to include \(x\), \(x^2\), \(x^3\) and so on.
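A small numpy sketch (with illustrative dimensions) shows why OLS breaks down when there are more features than samples: \(X^TX\) is rank-deficient, and distinct coefficient vectors fit the data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                       # more features than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is p x p but has rank at most n, so it is singular
# and the normal equations have no unique solution.
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)                         # at most n = 10, not p = 50

# Two different coefficient vectors that both fit y exactly:
w1 = np.linalg.pinv(X) @ y          # minimum-norm solution
null_dir = np.linalg.svd(X)[2][-1]  # a direction in the null space of X
w2 = w1 + null_dir
print(np.allclose(X @ w1, y), np.allclose(X @ w2, y))
```

Since `w1` and `w2` differ but give identical predictions on the training data, least squares cannot choose between them.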

Optimal sparse regression

The optimal penalty is \(\lambda = 2\sigma \sqrt {2\log (pn)/n}\)

This relies on knowing the noise level \(\sigma \), which we may not.

Instead we can use the square-root LASSO.

Minimise the square root of the sum of squares loss (divided by \(n\)), and use \(\lambda = \sqrt{2\log (pn)/n}\).

This choice of \(\lambda \) does not depend on \(\sigma \).

LASSO estimates are biased towards \(0\), and many are set exactly to \(0\).


We can use LASSO for model selection, then run OLS using only the selected variables.
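This two-stage "post-LASSO" procedure can be sketched with scikit-learn on synthetic data (the dimensions, true coefficients, and penalty `alpha=0.1` are illustrative choices, not prescribed values):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.5, 1.0]          # only 3 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

# Stage 1: LASSO for variable selection.
selector = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(selector.coef_)

# Stage 2: OLS restricted to the selected variables,
# which removes the shrinkage bias of the LASSO estimates.
ols = LinearRegression().fit(X[:, selected], y)
print(selected, ols.coef_)
```

The refit OLS coefficients are no longer shrunk towards zero, while the LASSO stage keeps the model sparse.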

Least Absolute Shrinkage and Selection Operator (LASSO)

With LASSO we add a constraint to \(\hat \theta \):

\(\sum_i |\hat \theta_i| \le t\)

Regularisation of LLS. The sum of the absolute values of the coefficients is constrained to be below the hyperparameter \(t\).

L1 regularisation

This is also known as sparse regression, because many weights are set exactly to \(0\).

This now looks like:

\(w_{lasso} = \arg \min ||y-Xw||^2_2+\lambda ||w||_1\)


\(t\) is a hyperparameter.

Optimal hyperparameter for LASSO

Feature scaling

Ridge regression

Ridge regression

Regularisation of LLS. The cost function now includes a penalty on the norm of the parameter vector \(\theta \).

\(L_2\) regularisation

This allows us to solve problems where there are too many features. \(L_1\) regularisation also allows us to do this.


If \(d>n\) we can minimise the norm of the weights subject to \(Xw=y\). This is the minimum-norm solution.
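In the underdetermined case (more features than samples), the minimum-norm solution can be computed directly; a numpy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 12                        # d > n: underdetermined
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Minimum-norm solution to Xw = y via w = X^T (X X^T)^{-1} y
w_min = X.T @ np.linalg.solve(X @ X.T, y)
print(np.allclose(X @ w_min, y))    # exact fit

# Any other exact fit has a larger norm: w_min lies in the row
# space of X, so adding a null-space direction only adds norm.
null_dir = np.linalg.svd(X)[2][-1]  # direction in the null space of X
w_other = w_min + null_dir
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))
```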

Maximum A Posteriori (MAP) estimator for linear regression

Maximum a posteriori estimation. Ridge regression is equivalent to MAP estimation with a Gaussian prior centred at \(0\).

\(w_{RR}=(\lambda I+X^TX)^{-1}X^Ty\)

\(\mathbb{E}[w_{RR}]=(\lambda I+X^TX)^{-1}X^TXw\)
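The closed-form ridge estimator is easy to verify numerically; a sketch on synthetic data (the noise level and \(\lambda \) values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """w_RR = (lambda I + X^T X)^{-1} X^T y, solved without explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# lambda = 0 recovers OLS; a larger lambda shrinks the weights.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ridge(X, y, 0.0), w_ols))
print(np.linalg.norm(ridge(X, y, 100.0)) < np.linalg.norm(w_ols))
```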



Elastic net

Regularisation of LLS. Combines LASSO and ridge regression.

\(L_1\) and \(L_2\) regularisation

\(L_p\) regularisation


We can generalise this to:

\(w_{l_p} = \arg \min ||y-Xw||^2_2+\lambda ||w||^p_p\)

For ridge regression there is always a closed-form solution.

For least squares there is a solution if \(X^TX\) is invertible.

For LASSO there is no closed-form solution, so we must use numerical optimisation.
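One standard numerical method for LASSO is coordinate descent with soft-thresholding; a minimal numpy sketch on synthetic data (the penalty \(\lambda = 20\) and the iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrink towards 0, clip at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - Xw||_2^2 + lam * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's contribution added back in.
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r, lam / 2) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ w_true + 0.1 * rng.normal(size=100)

w = lasso_cd(X, y, lam=20.0)
print(w)   # many weights are exactly zero: sparsity
```

The soft-threshold step is what produces exact zeros, which is why \(L_1\) penalties give sparse solutions while \(L_2\) penalties only shrink.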

LASSO (\(L_1\) regularisation) induces sparsity.

The goal is \(\min_w ||y - f(x)||^2_2 + \lambda g(w)\)

Ridge regression: \(g(w)=||w||^2_2\)

If \(\lambda =0\) we recover OLS; as \(\lambda \to \infty \), \(w\) goes to \(0\).

The normal equation changes to: \(w = (\lambda I + X^TX)^{-1}X^Ty\)

We can preprocess to avoid fitting a column of \(1\)s for the intercept: shift the mean of \(y\) to \(0\), and normalise each feature of \(x\) to mean \(0\) and variance \(1\).
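A numpy sketch of this preprocessing (the data-generating coefficients and intercept are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(loc=5.0, scale=2.0, size=(n, d))
y = 1.5 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Standardise X (mean 0, variance 1) and centre y (mean 0).
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
Xs = (X - x_mean) / x_std
yc = y - y.mean()

# Now we can fit ridge regression without a column of 1s.
lam = 1.0
w = np.linalg.solve(lam * np.eye(d) + Xs.T @ Xs, Xs.T @ yc)

# Predictions add the mean of y back on.
y_hat = Xs @ w + y.mean()
print(np.mean((y - y_hat) ** 2))   # small residual error
```

Centring matters for regularised fits in particular, because we do not want the penalty to shrink the intercept.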



Alternative to ElasticNet

Each parameter is split into two components, \(\rho \) and \(\phi \).


There is \(L_2\) loss on \(\rho \) and \(L_1\) loss on \(\phi \).

This means that large coefficients can be penalised like \(L_1\) and small coefficients like \(L_2\).


The Ramsey RESET test

The Ramsey Regression Equation Specification Error Test (RESET)

Once we have done our OLS we have \(\hat y\).

The Ramsey RESET test is an additional stage, which takes these predictions and estimates:

\(y=\theta x+\sum_{i=1}^3\alpha_i \hat y^{\,i+1}\)

The powers start at \(\hat y^2\), because \(\hat y\) itself is a linear function of \(x\) and so is collinear with the existing regressor.

We then run an F-test on the \(\alpha_i\), with the null hypothesis that all \(\alpha_i = 0\).
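A sketch of the RESET test with numpy and scipy, on data where the fitted model deliberately omits a quadratic term (the simulated model and the powers \(\hat y^2\) to \(\hat y^4\) are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + x ** 2 + 0.5 * rng.normal(size=n)  # true model is nonlinear

def ols_rss(Z, y):
    """Residual sum of squares from an OLS fit of y on columns of Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return resid @ resid

# Restricted model: intercept and x only.
Z0 = np.column_stack([np.ones(n), x])
y_hat = Z0 @ np.linalg.lstsq(Z0, y, rcond=None)[0]
rss0 = ols_rss(Z0, y)

# Augmented model adds powers of the fitted values.
Z1 = np.column_stack([Z0, y_hat ** 2, y_hat ** 3, y_hat ** 4])
rss1 = ols_rss(Z1, y)

# F-test of the null that all added coefficients are zero.
q, k1 = 3, Z1.shape[1]
F = ((rss0 - rss1) / q) / (rss1 / (n - k1))
p_value = stats.f.sf(F, q, n - k1)
print(F, p_value)   # large F and tiny p: misspecification detected
```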


Alternative to RESET

We have \(\hat y\).

We regress \(y=\alpha + \beta \hat y + \gamma \hat y^2\).

We test that \(\gamma =0\).

If it is not, then this suggests the model is misspecified.

Bias trade-off


There is a trade-off between parameter accuracy and prediction accuracy: regularisation biases the parameter estimates, but this can reduce prediction error.