Ordinary Least Squares for inference

Bias of OLS estimators

Expectation of OLS estimators

Expectation in terms of observables

We have: \(\hat{\theta }=(X^TX)^{-1}X^Ty\)

Let’s take the expectation.

\(E[\hat{\theta }]=E[(X^TX)^{-1}X^Ty]\)

Expectation in terms of errors

Let’s model \(y\) as a function of \(X\). As we place no restrictions on the error terms, this is not an assumption.

\(y=X\theta +\epsilon\).

\(E[\hat{\theta }]=E[(X^TX)^{-1}X^T(X\theta +\epsilon)]\)

\(E[\hat{\theta }]=E[(X^TX)^{-1}X^TX\theta ]+E[(X^TX)^{-1}X^T \epsilon]\)

\(E[\hat{\theta }]=\theta +E[(X^TX)^{-1}X^T \epsilon]\)

\(E[\hat{\theta }]=\theta +E[(X^TX)^{-1}X^T]E[\epsilon]+\mathrm{cov}[(X^TX)^{-1}X^T ,\epsilon]\)

The Gauss-Markov assumption: the expected error is \(0\)

\(E[\epsilon ]=0\)

This means that:

\(E[\hat{\theta }]=\theta + \mathrm{cov}[(X^TX)^{-1}X^T ,\epsilon]\)

The Gauss-Markov assumption: errors and independent variables are uncorrelated

If the error terms are mean-independent of \(X\), that is \(E[\epsilon|X]=0\), then the covariance term vanishes and:

\(E[\hat{\theta }]=\theta\)

So OLS is an unbiased estimator, so long as both conditions hold.
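A quick simulation sketch (with made-up \(\theta\) values and a fixed design) illustrates the unbiasedness: averaging \(\hat{\theta }\) over many draws of \(\epsilon\) that are independent of \(X\) recovers \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
theta = np.array([1.0, -2.0, 0.5])    # illustrative true coefficients
X = rng.normal(size=(n, p))           # design held fixed across replications

# Average theta-hat over many independent draws of epsilon
estimates = []
for _ in range(2000):
    eps = rng.normal(scale=1.5, size=n)   # E[eps] = 0, independent of X
    y = X @ theta + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

theta_bar = np.mean(estimates, axis=0)
print(theta_bar)   # close to [1.0, -2.0, 0.5]
```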

Variance of OLS estimators


Variance-covariance matrix

We know:

\(\hat \theta =(X^TX)^{-1}X^Ty\)

\(y=X\theta +\epsilon\)


\(\hat \theta =(X^TX)^{-1}X^T(X\theta +\epsilon)\)

\(\hat \theta =\theta +(X^TX)^{-1}X^T\epsilon\)

\(\hat \theta -\theta =(X^TX)^{-1}X^T\epsilon\)

\(Var [\hat \theta ]=E[(\hat \theta -\theta)(\hat \theta -\theta )^T]\)

\(Var [\hat \theta ]=E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T\epsilon )^T]\)

\(Var [\hat \theta ]=E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]\)

Treating \(X\) as fixed, we can move it outside the expectation:

\(Var [\hat \theta ]=(X^TX)^{-1}X^TE[\epsilon \epsilon^T ]X(X^TX)^{-1}\)

We write:

\(\Omega=E[\epsilon \epsilon^T]\)

\(Var [\hat \theta ]=(X^TX)^{-1}X^T\Omega X(X^TX)^{-1}\)

Depending on how we estimate \(\Omega\), we get different variance terms.

Variance under IID


\(\Omega = I\sigma^2_{\epsilon }\)

\(Var [\hat \theta ]=(X^TX)^{-1}X^TI\sigma^2_{\epsilon } X(X^TX)^{-1}\)

\(Var [\hat \theta ]=\sigma^2_\epsilon (X^TX)^{-1}\)
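The IID formula can be checked numerically. A sketch with a simulated, fixed design: the empirical covariance of \(\hat \theta\) across repeated error draws should match \(\sigma^2_\epsilon (X^TX)^{-1}\) (all parameter values here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))        # design held fixed across replications
theta = np.array([2.0, -1.0])
sigma = 0.8

# Theoretical covariance under IID errors: sigma^2 (X'X)^{-1}
V_theory = sigma**2 * np.linalg.inv(X.T @ X)

# Empirical covariance of theta-hat across repeated error draws
draws = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ theta + rng.normal(scale=sigma, size=n)))
    for _ in range(5000)
])
V_empirical = np.cov(draws.T)
print(np.max(np.abs(V_empirical - V_theory)))  # small
```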

Heteroskedasticity-Consistent (HC) standard errors

Variance of OLS estimators

\(Var [\hat \theta ]=(X^TX)^{-1}X^T\Omega X(X^TX)^{-1}\)

Robust standard errors for heteroskedasticity


These are also known as the Eicker-Huber-White standard errors, the White correction, or simply robust standard errors. The HC0 variant estimates \(\Omega\) with \(\hat{\Omega }=\mathrm{diag}(\hat \epsilon_1^2,\dots ,\hat \epsilon_n^2)\), where \(\hat \epsilon_i\) are the OLS residuals.
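A minimal sketch of the sandwich computation, assuming the common HC0 choice \(\hat{\Omega }=\mathrm{diag}(\hat \epsilon_i^2)\) and simulated heteroskedastic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * x   # error sd grows with x

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ theta_hat

# HC0 sandwich: bread = (X'X)^{-1}, meat = X' diag(resid^2) X
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (resid[:, None]**2 * X)
V_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(V_hc0))

# Naive IID standard errors for comparison
sigma2_hat = resid @ resid / (n - X.shape[1])
se_iid = np.sqrt(np.diag(sigma2_hat * XtX_inv))
print(se_hc0, se_iid)
```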

Properties of the OLS estimator

Maximum Likelihood Estimator (MLE) and OLS equivalence

The OLS estimator

\(\hat \theta_{OLS}=(X^TX)^{-1}X^Ty\)

\(E[\hat \theta_{OLS}]=\theta\)

\(Var[\hat \theta_{OLS}]=\sigma^2 (X^TX)^{-1}\)

The MLE estimator

\(y_i=\mathbf x_i\theta +\epsilon_i \)

\(P(y=y_i|x=x_i)=P(\epsilon_i=y_i-\mathbf x_i \theta )\)

If we assume \(\epsilon_i \sim N(0, \sigma^2_\epsilon )\) we have:

\(P(y=y_i|x=x_i)=\dfrac{1}{\sqrt {2\pi \sigma^2_\epsilon }}e^{-\dfrac{(y_i-\mathbf x_i\theta )^2}{2\sigma_\epsilon^2}}\)

\(L(X, \theta )=\prod_{i=1}^n\dfrac{1}{\sqrt {2\pi \sigma^2_\epsilon }}e^{-\dfrac{(y_i-\mathbf x_i\theta )^2}{2\sigma_\epsilon^2}}\)

\(l(X, \theta )=\sum_{i=1}^n -\dfrac{1}{2}\ln (2\pi \sigma_\epsilon^2)-\dfrac{(y_i-\mathbf x_i\theta )^2}{2\sigma_\epsilon^2}\)

\(\dfrac{\partial l}{\partial \theta_j }=\sum_{i=1}^n\dfrac{x_{ij}(y_i-\mathbf x_{i}\theta )}{\sigma^2_\epsilon}\)

Setting this to zero at \(\hat \theta_{MLE}\):

\(\sum_{i=1}^nx_{ij}(y_i-\mathbf x_{i}\hat \theta_{MLE} )=0\)

\(X^T(y-X\hat \theta_{MLE} )=0\)

\(X^Ty=X^TX\hat \theta_{MLE} \)

\(\hat \theta_{MLE}=(X^TX)^{-1}X^Ty\)


If the errors are IID and normally distributed, then:

\(\hat \theta_{OLS}=\hat \theta_{MLE}\)
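A numerical sketch of the equivalence: maximizing the Gaussian log-likelihood by gradient ascent on the score derived above lands on the closed-form OLS solution (the data and step-size choice are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -0.5]) + rng.normal(size=n)

# Closed-form OLS solution
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Maximize the Gaussian log-likelihood in theta by gradient ascent
# (sigma^2 fixed at 1: it scales the score but not its root)
theta_mle = np.zeros(p)
step = 1.0 / np.linalg.norm(X.T @ X, 2)   # below 1/L, guarantees convergence
for _ in range(5000):
    score = X.T @ (y - X @ theta_mle)     # gradient of the log-likelihood
    theta_mle = theta_mle + step * score

print(np.allclose(theta_ols, theta_mle, atol=1e-6))  # True
```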

Gauss-Markov theorem

The theorem requires the following conditions:

The errors have zero mean. If the model's errors should lie only on the upside or only on the downside for some reason, OLS cannot capture this.

The errors are homoscedastic (all have the same variance). Violating this does not bias the coefficient estimates, but it does bias the variance estimates.

The errors are uncorrelated with each other. Serial correlation in the errors suggests adding lagged variables to the model.

Under normally distributed errors, OLS is the Best Unbiased Estimator (BUE); for non-normally distributed errors it is only the Best Linear Unbiased Estimator (BLUE).


T-test selection



Checking for heteroskedasticity using the White test
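A sketch of the White test implemented directly from its definition: regress the squared OLS residuals on the regressors, their squares, and cross-products, and compare \(nR^2\) from this auxiliary regression to a \(\chi^2\) critical value. The data here are simulated with error variance growing in \(x\), so the test should reject.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = rng.uniform(0.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (0.5 + x)  # sd grows with x

resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

# Auxiliary regression: squared residuals on the regressor and its square
# (with a single regressor there are no cross-product terms)
Z = np.column_stack([np.ones(n), x, x**2])
u = resid**2
u_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)
r2 = 1.0 - np.sum((u - u_hat)**2) / np.sum((u - u.mean())**2)

lm = n * r2        # asymptotically chi-square(2) under homoskedasticity
print(lm > 5.991)  # 5.991 is the 5% critical value of chi-square(2)
```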

Robust standard errors


Regression dilution

Noise in \(y\) does not bias the coefficient estimates; it is simply absorbed into the error term.

Noise in \(x\) does cause bias: the estimated slope is attenuated towards zero, so a correction is needed.
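A simulation sketch of regression dilution, with illustrative variances: measurement noise in \(x\) shrinks the slope by the reliability ratio \(\mathrm{var}(x)/(\mathrm{var}(x)+\mathrm{var}(\text{noise}))\), and dividing by that ratio undoes the attenuation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
theta = 2.0
x_true = rng.normal(size=n)                       # var(x_true) = 1
y = theta * x_true + rng.normal(scale=0.5, size=n)

# Observe x with measurement noise of variance 1
x_obs = x_true + rng.normal(scale=1.0, size=n)

slope_noisy = (x_obs @ y) / (x_obs @ x_obs)       # regression through the origin
reliability = 1.0 / (1.0 + 1.0)                   # var(x_true) / var(x_obs)
print(slope_noisy)                                # ~ theta * reliability = 1.0

slope_corrected = slope_noisy / reliability
print(slope_corrected)                            # ~ 2.0
```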



Causality vs correlation: a model that captures only correlation may perform badly out of sample.

Correlation cannot distinguish the direction of causation, for example between "disease causes symptom" and "symptom causes disease".

Linear models can be rearranged to place any variable on the left-hand side, so the regression equation itself does not establish a causal direction.