Ordinary Least Squares for prediction

Constructing a linear model

Defining linear models

Defining

One option for \(f(X)\) is a linear model.

\(f(X_i)=\hat{Y_i}= \beta_0+\sum_{j=1}^p\beta_jX_{ij}\)

The values for \(\beta\) are the regression coefficients.

So we have:

\(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+e(X_i)+e_i\)

We define the error of the estimate as:

\(\epsilon_i=Y_i-\hat{Y_i}\)

\(\epsilon_i=e(X_i)+e_i\)

So:

\(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+\epsilon_i\)

The linear model could be wrong for two reasons: no linear model may be appropriate, or the wrong coefficients may be provided for a linear model.

The model is linear regression if \(f\) is a linear function of the parameters \(w\). Note that it need not be linear in \(x\): terms such as \(x^2\) can be included, provided the model remains linear in \(w\).

Intercept

Modelling non-linear functions as linear

Polynomials

The function \(y=x^2\) is not linear, however we can model it as linear by including \(x^2\) as a variable.

We can extend this, using linear models to estimate parameters for functions such as:

\(y=ax^3+bx^2+cx\)
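
A minimal sketch (assuming NumPy and made-up data and coefficient values) of fitting such a cubic with ordinary linear least squares, by treating the powers of \(x\) as separate columns:

```python
import numpy as np

# Hypothetical data: y = 2x^3 - x^2 + 0.5x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 2 * x**3 - x**2 + 0.5 * x + rng.normal(scale=0.1, size=100)

# Design matrix with columns x^3, x^2, x: non-linear in x, linear in the coefficients
X = np.column_stack([x**3, x**2, x])

# Ordinary linear least squares recovers (a, b, c)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # approximately [2, -1, 0.5]
```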

Logarithms and exponentials

We can also transform data using logarithms and exponents.

For example, we can model:

\(\ln y=\theta \ln x\)
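
A similar sketch (simulated data, hypothetical exponent) of estimating \(\theta\) by regressing \(\ln y\) on \(\ln x\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
theta_true = 1.7
# y = x^theta with multiplicative noise, so ln y = theta * ln x + noise
y = x**theta_true * np.exp(rng.normal(scale=0.05, size=200))

# Transform to logs and fit a linear model (no intercept here)
log_x = np.log(x).reshape(-1, 1)
log_y = np.log(y)
theta_hat, *_ = np.linalg.lstsq(log_x, log_y, rcond=None)
print(theta_hat)  # approximately 1.7
```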

Geometric interpretation of OLS

Best Approximation Theorem

Calculating Ordinary Least Squares (OLS) estimators

Normal equation

Least squares

The squared error is \(\sum_i (\hat{y_i}-y_i)^2\).

The derivative of this with respect to \(\hat{\theta_j}\) is:

\(2\sum_i \dfrac{\partial \hat{y_i}}{\partial \hat{\theta_j}}(\hat{y_i}-y_i)\)

The stationary point is where this is zero:

\(\sum_i \dfrac{\partial \hat{y_i}}{\partial \hat{\theta_j}}(\hat{y_i}-y_i)=0\)

Linear least squares

Here, \(\hat{y_i}= \sum_j x_{ij}\hat{\theta_j}\)

Therefore: \(\dfrac{\partial \hat{y_i}}{\partial \hat{\theta_j}}=x_{ij}\)

And so the stationary point is where

\(\sum_i x_{ij}\left( \sum_k x_{ik}\hat{\theta_k}-y_i\right)=0\)

\(\sum_i x_{ij}\sum_k x_{ik}\hat{\theta_k}= \sum_i x_{ij}y_i\)

Normal equation

We can write this in matrix form.

\(X^TX\hat{\theta }=X^Ty\)

We can solve this as:

\(\hat{\theta }=(X^TX)^{-1}X^Ty\)
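
A minimal NumPy sketch (simulated data) of solving the normal equation; in practice a linear solve or QR factorisation is preferred to forming \((X^TX)^{-1}\) explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # include an intercept column
theta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Normal equation: X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```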

Perfectly correlated variables

If variables are perfectly correlated then \(X^TX\) is singular and we cannot solve the normal equation.

Intuitively, this is because for perfectly correlated variables there is no single best parameter, as changes to one parameter can be counteracted by changes to another.
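
A sketch of this failure mode, with one column an exact multiple of another so that \(X^TX\) is singular:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 2 * x1                      # perfectly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=100)

print(np.linalg.matrix_rank(X.T @ X))  # 1, not 2: X^T X is singular
try:
    np.linalg.solve(X.T @ X, X.T @ y)
except np.linalg.LinAlgError as e:
    print("Cannot solve the normal equation:", e)

# The pseudoinverse still returns a (minimum-norm) least squares solution
print(np.linalg.pinv(X) @ y)
```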

Mean and variance of predictions

Bias

\(\hat y =\theta x\)

The bias of the prediction is \(E[\hat y-y]=E[\theta x-y]\).

Since \(y=\hat y + \epsilon\), conditioning on \(X\) gives:

\(E[y-\hat y |X]=E[\epsilon |X]\)

The prediction is unbiased so long as \(X\) is independent of the error term, so that \(E[\epsilon |X]=0\).

Variance

\(Var [\hat y-y]=Var [\theta x-y]\)

Conditioning on \(X\) and using \(y=\hat y+\epsilon\):

\(Var[y-\hat y |X]=Var[\epsilon |X]\)

The Moore-Penrose pseudoinverse

For a matrix \(X\) with linearly independent columns, the pseudoinverse is \((X^*X)^{-1}X^*\).

For real matrices, this is: \((X^TX)^{-1}X^T\)

The pseudoinverse can be written as \(X^+\)

Therefore \(\hat\theta\) is the pseudoinverse of the inputs multiplied by the outputs:

\(\hat\theta = X^+y\)

The pseudoinverse satisfies:

\(XX^+X=X\)

\(X^+XX^+=X^+\)
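
A sketch (simulated data) checking that NumPy's pinv gives the same coefficients as the normal-equation solution, and that the pseudoinverse identities hold:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

theta_pinv = np.linalg.pinv(X) @ y                # theta = X^+ y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal-equation solution
print(np.allclose(theta_pinv, theta_normal))      # True

# Pseudoinverse identities
X_plus = np.linalg.pinv(X)
print(np.allclose(X @ X_plus @ X, X))             # X X^+ X = X
print(np.allclose(X_plus @ X @ X_plus, X_plus))   # X^+ X X^+ = X^+
```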

Leverage

Introduction

Leverage measures how much the predicted value of \(y_i\), \(\hat y_i\), changes as \(y_i\) changes.

We have:

\(\mathbf y = \mathbf X \theta +\mathbf u\)

\(\hat{\mathbf y} =X(X^TX)^{-1}X^T\mathbf y\)

\(\hat{\mathbf y} =P_X\mathbf y\)

The leverage score is defined as:

\(h_i=P_{ii}\)
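
A sketch of computing leverage scores as the diagonal of the projection (hat) matrix, with one deliberately unusual observation:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
X[0, 1] = 10.0                          # an unusual x value should get high leverage

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection (hat) matrix
h = np.diag(P)                          # leverage scores h_i = P_ii
print(h[0], h[1:].mean())               # the unusual point has much higher leverage
print(np.isclose(h.sum(), X.shape[1]))  # leverages sum to the number of columns
```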

Making forecasts with OLS

The projection and annihilation matrices

The projection matrix

We have the design matrix \(X\).

The projection matrix is \(X(X^TX)^{-1}X^T\)

The projection matrix maps the actual \(y\) to the predicted \(y\):

\(\hat y = Py\)

Each entry is the covariance between a fitted value and an actual value, scaled by the variance:

\(P_{ij}=\dfrac{\operatorname{Cov}(\hat y_i, y_j)}{\operatorname{Var}(y_j)}\)

The annihilation matrix

We can get residuals too:

\(u=y-\hat y=y-Py=(I-P)y\)

\(I-P\) is called the annihilator matrix.

We can now use the propagation of uncertainty formula:

\(\Sigma^f = A\Sigma^x A^T\)

to get:

\(\Sigma^u = (I-P)\Sigma^y (I-P)^T\)

The annihilator matrix is:

\(M_X=I-X(X^TX)^{-1}X^T\)

It is called this because:

\(M_XX=X-X(X^TX)^{-1}X^TX\)

\(M_XX=0\)

It is also called the residual maker, since \(M_X\mathbf y\) gives the residuals.
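
A sketch (simulated data) of the projection and annihilator matrices, checking that \(M_X X \approx 0\) and that \(M_X y\) gives the residuals:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.3, size=40)

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix
M = np.eye(len(y)) - P                  # annihilator (residual maker)

y_hat = P @ y                           # fitted values
u = M @ y                               # residuals

print(np.allclose(M @ X, 0))            # the annihilator kills X
print(np.allclose(u, y - y_hat))        # the residual maker gives the residuals
```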

Frisch-Waugh-Lovell theorem

Introduction

If we have a partitioned linear regression model:

\(\mathbf y=\mathbf X\theta+\mathbf Z\beta+\mathbf u\)

Multiply through by the annihilator matrix \(M_X\):

\(M_X\mathbf y=M_X\mathbf X\theta+M_X\mathbf Z\beta+M_X\mathbf u\)

\(M_X\mathbf y=M_X\mathbf Z\beta+M_X\mathbf u\)

We can then estimate \(\beta\) from this reduced regression.

The Frisch-Waugh-Lovell theorem says that this gives the same estimate of \(\beta\) as the original regression.
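
A numerical sketch of the theorem (simulated data): regressing \(M_X\mathbf y\) on \(M_X\mathbf Z\) gives the same \(\hat\beta\) as the full regression of \(\mathbf y\) on \([\mathbf X, \mathbf Z]\):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))
y = X @ np.array([1.0, 2.0]) + Z @ np.array([-0.5, 0.3]) + rng.normal(scale=0.1, size=n)

# Full regression of y on [X, Z]: the last two coefficients are beta
full = np.linalg.lstsq(np.column_stack([X, Z]), y, rcond=None)[0]
beta_full = full[-2:]

# FWL: annihilate X from both y and Z, then regress
M_X = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
beta_fwl = np.linalg.lstsq(M_X @ Z, M_X @ y, rcond=None)[0]

print(np.allclose(beta_full, beta_fwl))  # True
```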

Trimming

Introduction

OLS:

\(\hat \theta =\frac{\sum_i (x_i-\mu_x)(y_i-\mu_y)}{\sum_i(x_i-\mu_x)^2}\)

Trimming

\(\hat \theta =\frac{n^{-1}\sum_i (x_i-\mu_x)(y_i-\mu_y)\mathbf 1_i}{n^{-1}\sum_i(x_i-\mu_x)^2\mathbf 1_i}\)

Where:

\(\mathbf 1_i=\mathbf 1(\hat f(z_i)\ge b)\)

Where \(b=b(n)\) is a trimming parameter with \(b\rightarrow 0\) as \(n\rightarrow \infty\).

Best linear predictor

Introduction

The best linear predictor is the one which minimises the expected squared error:

\(E[(Y-X\theta)^2]\)

Under what circumstances is this the same as OLS? When \(n \gg p\). When \(n\) is not much larger than \(p\), other linear estimators (like the LASSO) can be better.

Other

Cook’s distance

Cook’s distance measures the effect of deleting an observation: work out the predictions with that observation removed, and sum the squared differences in \(\hat y\) across all observations.

Outliers have a high Cook’s distance.
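
A sketch following the description above: refit with each observation deleted and sum the squared changes in \(\hat y\), scaled by \(p\,s^2\) as in the usual definition (simulated data with one planted outlier):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[0] += 10.0                              # make the first observation an outlier

p = X.shape[1]
theta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ theta
s2 = np.sum((y - y_hat) ** 2) / (n - p)   # residual variance estimate

cooks = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    theta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    # change in fitted values across all observations when i is deleted
    cooks[i] = np.sum((y_hat - X @ theta_i) ** 2) / (p * s2)

print(cooks[0], cooks[1:].max())  # the outlier has by far the largest Cook's distance
```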

Bayesian linear regression

In linear regression we have the likelihood:

\(P(y|X, \theta, \sigma^2_\epsilon )\)

For Bayesian linear regression we want:

\(P(\theta, \sigma^2_\epsilon |y, X)\)

We can use Bayes' rule:

\(P(\theta, \sigma^2_\epsilon |y, X)\propto P(y|X, \theta, \sigma^2_\epsilon )P(\theta |\sigma^2_\epsilon )P(\sigma^2_\epsilon )\)
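
A minimal sketch of the simpler case where \(\sigma^2_\epsilon\) is treated as known and \(\theta\) has a Gaussian prior, so the posterior over \(\theta\) is available in closed form (simulated data; the prior hyperparameters are placeholders):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([1.0, 2.0])
sigma2 = 0.25                                   # noise variance, assumed known here
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Gaussian prior: theta ~ N(mu0, Sigma0)
mu0 = np.zeros(p)
Sigma0 = np.eye(p) * 10.0

# Conjugate update: the posterior over theta is also Gaussian
Sigma_n = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma2)
mu_n = Sigma_n @ (np.linalg.inv(Sigma0) @ mu0 + X.T @ y / sigma2)

print(mu_n)                        # posterior mean, close to theta_true and the OLS estimate
print(np.sqrt(np.diag(Sigma_n)))   # posterior standard deviations
```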