We have some data \(X\). We model it as a function of parameters, \(\theta \).

\(P(X|\theta )=f(\theta )\).

There is a data generating process. We use models to say things about the data generating process.

Parameters could be arguments to some function, or moments.

We can create a distribution of \(\theta \):

\(P(\theta | X)\).

So to create a model for the data we need:

A model - \(P\)

Parameters for the model - \(\theta \)

There are therefore two questions: how to select a model and, given a model, how to select parameters.

We want to estimate parameters. One way of looking into this is to look at the likelihood function:

\(L(\theta ; X)=P(X|\theta )\)

The likelihood function shows the chance of the observed data being generated, given specific parameters.

If the likelihood has a high peak in some region, this is evidence that \(\theta \) lies in that region.

For multiple events, the likelihood function is:

\(L(\theta ; X)=P(X|\theta )\)

\(L(\theta ; X)=P(A_1 \land B_2 \land C_3 \land D_4…|\theta )\)

If the events are independent, that is, the outcome of one coin flip doesn’t depend on any other outcomes, then:

\(L(\theta ; X)=P(A_1|\theta ).P(B_2|\theta ).P(C_3|\theta ).P(D_4|\theta )...\)

If the events are also identically distributed, that is, the chance of flipping a head doesn’t change across flips (for example, the heads side doesn’t get heavier over time), then:

\(L(\theta ; X)=P(A|\theta ).P(B|\theta ).P(C|\theta ).P(D|\theta )...\)

\(L(\theta ; X)=\prod_{i=1}^n P(X_i|\theta )\)


\(\ln L(\theta ; X)=\ln \prod_{i=1}^n P(X_i|\theta )\)

\(l(\theta ; X)=\sum_{i=1}^n \ln P(X_i|\theta )\)

The log-likelihood for IID data is:

\(l(\theta ; X)=\sum_{i=1}^n \ln P(X_i|\theta )\)
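As a sketch, we can evaluate this log-likelihood directly; the Bernoulli (coin flip) sample below is hypothetical:

```python
import numpy as np

# Hypothetical sample of coin flips: 1 = heads, 0 = tails.
flips = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

def log_likelihood(theta, x):
    # Sum of ln P(x_i | theta) for IID Bernoulli observations.
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# The log-likelihood is higher near the sample proportion of heads (0.6).
print(log_likelihood(0.6, flips))
print(log_likelihood(0.1, flips))
```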

The score is defined as the derivative of the log-likelihood function with respect to \(\theta \).

\(V(\theta, X)=\dfrac{\partial }{\partial \theta }l(\theta ; X) \)

\(V(\theta, X)=\dfrac{1 }{\prod_{i=1}^nP(X_i|\theta )}\dfrac{\partial }{\partial \theta}L(\theta; X) \)

The expectation of the score, given the true value of \(\theta \), is:

\(E[V(\theta, X)]=\int V(\theta, X)P(X|\theta )dX\)

\(E[V(\theta, X)]=\int \dfrac{1 }{\prod_{i=1}^nP(X_i|\theta )}\dfrac{\partial }{\partial \theta}L(\theta; X)P(X|\theta )dX\)

Since \(L(\theta ; X)=P(X|\theta )=\prod_{i=1}^nP(X_i|\theta )\), the density cancels:

\(E[V(\theta, X)]=\int \dfrac{\partial }{\partial \theta}L(\theta; X)dX=\dfrac{\partial }{\partial \theta}\int L(\theta; X)dX=\dfrac{\partial }{\partial \theta}1=0\)

So the expected value of the score at the true parameter is \(0\).
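We can check this zero expectation by simulation; a minimal sketch assuming Bernoulli data:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3  # assumed true parameter

def score(theta, x):
    # d/dtheta ln P(x|theta) for a Bernoulli observation.
    return x / theta - (1 - x) / (1 - theta)

x = rng.binomial(1, theta_true, size=100_000)
# At the true parameter, the average score is close to 0.
print(abs(score(theta_true, x).mean()))
```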

The variance of the score is:

\(var [V(\theta, X)]=var [\dfrac{\partial }{\partial \theta }l(\theta ; X) ]\)

Since the expectation of the score is \(0\), this is \(E[V(\theta, X)^2]\).

The Fisher information is this variance:

\(I(\theta )=E[(\dfrac{\partial }{\partial \theta }\log f(X, \theta ))^2 |\theta ]\)

Under regularity conditions it is also the negative expected second derivative:

\(I(\theta )=-E[\dfrac{\partial^2 }{\partial \theta^2 }\log f(X, \theta ) |\theta ]\)

It is the same as the expectation of the score squared, because the score is centred around \(0\).

If we have \(k\) parameters, the Fisher information becomes a \(k\times k\) matrix:

\(I(\theta )_{ij}=E[(\dfrac{\partial }{\partial \theta_i}\log f(X, \theta ))(\dfrac{\partial }{\partial \theta_j }\log f(X, \theta ))|\theta ]\)

The Fisher information matrix contains information about the population.

The observed Fisher information is the negative of the Hessian of the log-likelihood.

We have:

\(l(\theta |\mathbf X)=\sum_i\ln P(\mathbf x_i|\theta )\)

\(J(\theta^*)=-\nabla \nabla^Tl(\theta|\mathbf X )|_{\theta = \theta^*}\)

The Fisher information is the expected value of this.

\(I(\theta )=E[J(\theta)]\)
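A numerical check, assuming Bernoulli data: the observed information (here a finite-difference negative second derivative) should be close to the expected Fisher information \(n/(\theta (1-\theta ))\):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.4
x = rng.binomial(1, theta, size=10_000)

def loglik(t):
    return np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))

# Observed information: negative second derivative of the log-likelihood,
# approximated with central finite differences.
h = 1e-5
J = -(loglik(theta + h) - 2 * loglik(theta) + loglik(theta - h)) / h**2

# Expected (Fisher) information for n Bernoulli draws.
I = len(x) / (theta * (1 - theta))
print(J / I)  # close to 1
```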

Two parameters are called orthogonal if their entry in the Fisher information matrix is \(0\).

This means that the parameters can be estimated separately: their maximum likelihood estimates can be computed independently of each other.

The zero expectation of the score can also be written as a moment condition.

We want to generate the probability distribution of \(\theta \) given the evidence \(X\).

We can transform this using Bayes rule.

\(P(\theta | X)=\dfrac{P(X|\theta )P(\theta )}{P(X)}\)

Here we have:

Our prior - \(P(\theta )\)

Our likelihood function - \(P(X|\theta )\)

Our posterior - \(P(\theta | X)\)

If our prior is a normal distribution with mean \(\mu_0 \) and variance \(\sigma_0^2 \) then:

\(P(\theta )=\dfrac{1}{\sqrt {2\pi \sigma_0^2}}e^{-\dfrac{(\theta -\mu_0 )^2}{2\sigma_0^2}}\)

Similarly, if our likelihood function \(P(X|\theta )\) is a normal distribution with mean \(\theta \) and variance \(\sigma^2 \) then:

\(P(X|\theta )=\dfrac{1}{\sqrt {2\pi \sigma^2}}e^{-\dfrac{(x-\theta)^2}{2\sigma ^2}}\)

We can now plug these into Bayes rule:

\(P(\theta |X)=\dfrac{1}{P(X)}\dfrac{1}{\sqrt {2\pi \sigma_0^2}}e^{-\dfrac{(\theta-\mu_0)^2}{2\sigma_0^2}}\dfrac{1}{\sqrt {2\pi \sigma^2}}e^{-\dfrac{(x-\theta)^2}{2\sigma ^2}}\)

\(P(\theta |X)\propto e^{-\dfrac{1}{2}[\dfrac{(\theta-\mu_0)^2}{\sigma_0^2}+\dfrac{(x-\theta)^2}{\sigma ^2}]}\)

Completing the square in \(\theta \), this is a new Gaussian:

\(P(\theta |X)=\dfrac{1}{\sqrt {2\pi \sigma_1^2}}e^{-\dfrac{(\theta -\mu_1 )^2}{2\sigma_1^2}}\)

Where:

\(\sigma_1^2=(\dfrac{1}{\sigma_0^2}+\dfrac{1}{\sigma ^2})^{-1}\)

\(\mu_1=\sigma_1^2(\dfrac{\mu_0}{\sigma_0^2}+\dfrac{x}{\sigma ^2})\)
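The update can be sketched numerically; the helper below is hypothetical and assumes a single observation with known likelihood variance:

```python
def normal_posterior(x, mu0, sigma0_sq, sigma_sq):
    # Conjugate normal update: precisions add, and the posterior mean is a
    # precision-weighted average of the prior mean and the observation.
    precision = 1 / sigma0_sq + 1 / sigma_sq
    post_var = 1 / precision
    post_mean = post_var * (mu0 / sigma0_sq + x / sigma_sq)
    return post_mean, post_var

# With equal prior and likelihood variances, the posterior mean is the
# midpoint of the prior mean and the observation.
mean, var = normal_posterior(x=2.0, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0)
print(mean, var)  # 1.0 0.5
```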

We can calculate the posterior probability for \(\theta \), but we need a prior \(P(\theta )\).

\(P(\theta | X)=\dfrac{P(X|\theta )P(\theta )}{P(X)}\)

With empirical Bayes we get our prior from the data.

We have a likelihood - \(P(X|\theta )\)

And a prior that depends on a hyperparameter \(\rho \) - \(P(\theta |\rho )\)

We observe \(X\) and want to estimate \(\theta \).

\(P(\theta |X)=\dfrac{P(X|\theta)P(\theta )}{P(X)}=\dfrac{P(X|\theta)}{P(X)}\int P(\theta | \rho )P(\rho )d\rho \)

Our prior predictive distribution for \(X\) depends on our prior for \(\theta \).

\(P(\mathbf x)=\int_\Theta P(\mathbf x|\theta)P(\theta )d\theta \)

Once we have calculated \(P(\theta |X)\), we can calculate a posterior probability distribution for \(X\).

\(P(\mathbf x|\mathbf X)=\int_\Theta P(\mathbf x|\theta)P(\theta |\mathbf X)d\theta \)

When we compute statistics from a sample we are often concerned with inferring properties of the underlying probability function.

As the properties of the probability distribution function affect the chance of observing the sample, we can analyse samples to infer properties of the underlying distribution.

There are many properties we could be interested in. These include moments and parameters of a specific probability distribution function.

An estimator is a statistic which is our estimate of one of these values.

Statistics and estimators are different things: a statistic may be a terrible estimator, but be useful for other purposes.

We can make estimates of a population parameter using statistics from the sample.

A statistic is sufficient if it contains all the information needed to estimate the parameter.

We can describe the role of a parameter as:

\(P(x|\theta, t)\)

\(t\) is a sufficient statistic for \(\theta \) if:

\(P(x|t)=P(x|\theta, t)\)

The error of an estimator is the difference between it and the actual parameter.

\(Error_{\theta }[\hat \theta ]=\hat \theta - \theta \)

The bias of an estimator is the expected error.

\(Bias_\theta [\hat \theta ]:=E_\theta [\hat \theta -\theta ]\)

\(Bias_\theta [\hat \theta ]:=E_\theta [\hat \theta] -\theta\)

Mean squared error

\(MSE = E[(\hat \theta - \theta )^2]=E[((\hat \theta - E[\hat \theta ])+(E[\hat \theta ]-\theta ))^2]\)

\(MSE = E[(\hat \theta - \theta )^2]=E[(\hat \theta - E[\hat \theta ])^2+(E[\hat \theta ]-\theta )^2+2(E[\hat \theta ]-\theta )(\hat \theta- E[\hat \theta ])]\)

\(MSE = E[(\hat \theta - \theta )^2]=E[(\hat \theta - E[\hat \theta ])^2]+E[(E[\hat \theta ]-\theta)^2] +E[2(E[\hat \theta ]-\theta )(\hat \theta- E[\hat \theta ])]\)

\(MSE = E[(\hat \theta - \theta )^2]=Var(\hat \theta )+(E[\hat \theta ]-\theta)^2 +2(E[\hat \theta ]-\theta )E[\hat \theta- E[\hat \theta ]]\)

\(MSE = E[(\hat \theta - \theta )^2]=Var(\hat \theta )+Bias (\hat \theta )^2\)
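The decomposition can be checked by simulation; the shrunk sample mean below is a deliberately biased estimator, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 5.0          # assumed true mean
n, reps = 20, 200_000

# A deliberately biased estimator: shrink the sample mean towards 0.
samples = rng.normal(theta, 2.0, size=(reps, n))
est = 0.9 * samples.mean(axis=1)

mse = np.mean((est - theta) ** 2)
var = est.var()
bias_sq = (est.mean() - theta) ** 2
print(mse, var + bias_sq)  # the two agree
```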

The root mean squared error (RMSE) is the square root of the MSE.

It is also called the Root Mean Square Deviation (RMSD)

A statistic \(\hat \theta \) is a consistent estimator for \(\theta \) if its error tends to \(0\).

That is:

\(\hat \theta \xrightarrow{p} \theta \)

We can show that an estimator is consistent if we can write:

\(\hat \theta -\theta \) as a function of \(n\) which tends to \(0\) as \(n\rightarrow \infty \).

Efficiency measures the speed at which a consistent estimator tends towards the true value.

(Note that an estimator can be fairly efficient and still biased.) Efficiency is measured as:

\(e(\hat \theta )=\dfrac{\dfrac{1}{I(\theta )}}{Var (\hat \theta )}\)

If an estimator has an efficiency of \(1\) and is unbiased, it is efficient.

We can measure the relative efficiency of two consistent estimators:

The relative efficiency is the variance of the first estimator, divided by the variance of the second.
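For example, for normal data both the sample mean and the sample median are consistent for the centre, but the median has higher variance; a simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 50_000
samples = rng.normal(0.0, 1.0, size=(reps, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

# Asymptotically the ratio is pi/2 (about 1.57): the median needs roughly
# 57% more data than the mean to match its precision on normal data.
print(var_median / var_mean)
```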

An estimator is root-n consistent if it is consistent and its variance is:

\(O(\dfrac{1}{n})\)

A consistent estimator is \(n^\delta \)-consistent if its variance is:

\(O(\dfrac{1}{n^{2 \delta }})\)

For an unbiased estimator, the variance cannot be below the Cramér-Rao lower bound.

\(Var (\hat \theta )\ge \dfrac{1}{I(\theta )}\)

Where \(I(\theta )\) is the Fisher information.
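For the mean of a normal distribution with known variance, the Fisher information from \(n\) draws is \(n/\sigma ^2\), so the bound is \(\sigma ^2/n\); the sample mean attains it:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, reps = 2.0, 50, 100_000

# Cramer-Rao bound for an unbiased estimator of the mean of N(mu, sigma^2).
bound = sigma**2 / n

# The sample mean is unbiased and its variance equals the bound.
means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
print(means.var() / bound)  # close to 1
```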

We can prove this. We have the score:

\(V=\dfrac{\partial }{\partial \theta }\ln f(X, \theta )\)

\(V=\dfrac{1}{f(X, \theta )}\dfrac{\partial }{\partial \theta } f(X, \theta )\)

The expectation of the score is \(0\):

\(E[V]=E[\dfrac{1}{f(X, \theta )}\dfrac{\partial }{\partial \theta } f(X, \theta )]\)

\(E[V]=\int \dfrac{1}{f(X, \theta )}\dfrac{\partial }{\partial \theta } f(X, \theta )f(X, \theta )dx=\int \dfrac{\partial }{\partial \theta } f(X, \theta )dx=\dfrac{\partial }{\partial \theta }\int f(X, \theta )dx=0\)

The bound then follows by applying the Cauchy-Schwarz inequality to the covariance of \(\hat \theta \) and \(V\), which equals \(1\) for an unbiased estimator.

Bias-variance trade-off: if we care about the mean squared error \(E[(\hat \theta -\theta )^2]\) then we may not want an unbiased estimator. By adding some bias we may be able to reduce the variance a lot.

We have a likelihood function of the data.

\(L(\theta ; X)=P(X|\theta )\)

We choose values for \(\theta \) which maximise the likelihood function.

\(\arg\max_\theta P(X|\theta )\)

That is, for which values of \(\theta \) was the observation we saw most likely?

This is a mode estimate.

\(L(\theta ; X)=\prod_i P(x_i|\theta )\)

We can take logarithms, which preserve stationary points: the logarithm is strictly increasing, so the likelihood and the log-likelihood are maximised at the same \(\theta \). As logarithms are defined for all values above \(0\), this preserves all solutions with positive likelihood.

We look for the stationary points of:

\(\ln L(\theta ; X)=\ln \prod_i P(x_i|\theta )\)

\(\ln L(\theta ; X)=\sum_i \ln P(x_i|\theta )\)

Let’s take our simple example about coins. Heads and tails are the only options, so \(P(H)+P(T)=1\).

\(P(H|\theta )=\theta \)

\(P(T|\theta )=1-\theta \)

\(\ln L(\theta ; X)=\sum_i \ln P(x_i|\theta )\)

If we had \(5\) heads and \(5\) tails we would have:

\(\ln L(\theta ; X)=5\ln (\theta )+ 5\ln (1-\theta )\)

Differentiating and setting to \(0\) gives \(\dfrac{5}{\theta }-\dfrac{5}{1-\theta }=0\), so \(\theta =\dfrac{1}{2}\). That is, \(P(H)=\dfrac{1}{2}\) is the value which makes our observation most likely.
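The same answer comes out of a numerical maximisation over a grid:

```python
import numpy as np

# Log-likelihood for 5 heads and 5 tails.
def ll(theta):
    return 5 * np.log(theta) + 5 * np.log(1 - theta)

# Grid search over (0, 1); the maximiser is theta = 0.5.
grid = np.linspace(0.01, 0.99, 9801)
print(grid[np.argmax(ll(grid))])
```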


For a normal distribution, the parameters are the population mean and covariance matrix.

The MLE estimator for the mean is the sample mean.

The MLE estimator for the covariance matrix is the unadjusted sample covariance (divisor \(n\) rather than \(n-1\)).

The maximum a posteriori (MAP) estimate is a mode estimate of the posterior:

\(\arg\max_\theta P(\theta | X)\)

Using Bayes theorem:

\(P(\theta | X)= \dfrac{P(X|\theta )P(\theta)}{P(X)}\)

So:

\(\arg\max_\theta P(\theta | X)=\arg\max_\theta \dfrac{P(X|\theta )P(\theta)}{P(X)}\)

The denominator doesn’t depend on \(\theta \), so:

\(\arg\max_\theta P(\theta | X)=\arg\max_\theta P(X|\theta )P(\theta)\)

If \(P(\theta )\) is a constant then this is the same as the MLE estimator.
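As a sketch with a non-constant prior: for coin flips with a Beta prior, the posterior is Beta and its mode is known in closed form, so a grid search over the unnormalised log-posterior can be checked against it (the counts and prior below are hypothetical):

```python
import numpy as np

heads, tails, a, b = 7, 3, 2.0, 2.0  # hypothetical data and Beta(a, b) prior

def log_posterior(theta):
    # Unnormalised: log-likelihood plus log-prior; P(X) is dropped.
    log_lik = heads * np.log(theta) + tails * np.log(1 - theta)
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
    return log_lik + log_prior

grid = np.linspace(0.001, 0.999, 9981)
map_est = grid[np.argmax(log_posterior(grid))]
# Closed-form posterior mode: (a + heads - 1) / (a + b + heads + tails - 2).
print(map_est)  # close to 8/12
```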


If we have \(k\) parameters to estimate, we can solve this if we have \(k\) equations. We generate these as follows:

First, we link the first \(k\) moments to functions of the parameters.

Then we replace the moments with sample estimates.

The moments of the population distribution are:

\(\mu_i =E[X^i]=g_i(\theta_1,...,\theta_k)\)

We have a sample.

\(X=[X_1,...,X_n]\)

We now define the method of moments estimator using the sample moments:

\(\hat \mu_i=\dfrac{1}{n}\sum_{j=1}^nx_j^i\)

We then solve the system \(g_i(\hat \theta_1,...,\hat \theta_k)=\hat \mu_i\) for the \(k\) parameter estimates.
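A sketch for a two-parameter example: for a gamma distribution with shape \(k\) and scale \(s\), \(E[X]=ks\) and \(Var[X]=ks^2\), so matching the first two sample moments gives closed-form estimates (the true parameters below are assumed for the simulation):

```python
import numpy as np

rng = np.random.default_rng(5)
shape_true, scale_true = 3.0, 2.0  # assumed true parameters
x = rng.gamma(shape_true, scale_true, size=100_000)

# Solve E[X] = k s and Var[X] = k s^2 with the population moments
# replaced by their sample counterparts.
m1 = x.mean()
v = x.var()
scale_hat = v / m1
shape_hat = m1 / scale_hat
print(shape_hat, scale_hat)  # close to 3.0 and 2.0
```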

We have a function of the data and a parameter:

\(g(y, \theta )\)

A moment condition is that the expectation of such a function is \(0\).

\(m(\theta )=E[g(y, \theta )]=0\)

To do GMM, we estimate this using:

\(\hat m(\theta )=\dfrac{1}{n}\sum_ig(y_i, \theta )\)

We define:

\(\Omega = E[g(y, \theta )g(y, \theta)^T]\)

\(G=E[\nabla_\theta g(y, \theta)]\)

And then minimise the norm:

\(||\hat m(\theta )||^2_W=\hat m(\theta )^TW\hat m(\theta )\)

Where \(W\) is a positive definite matrix for the norm.

Choosing \(W=\Omega ^{-1}\) is most efficient. But we don’t know \(\Omega \): it depends on \(\theta \).

If the data are IID, we can estimate it using a first-stage estimate \(\hat \theta \):

\(\hat W(\hat \theta )= (\dfrac{1}{n}\sum_i g(y_i, \hat \theta)g(y_i, \hat \theta)^T)^{-1}\)

To get the first-stage estimate, use \(\mathbf W=\mathbf I\). This is consistent, but not efficient.

Many familiar estimators can be written as moment conditions.

OLS:

\(E[x(y-x\theta)]=0\)

WLS:

\(E[x(y-x\theta)/\sigma^2(x)]=0\)

IV:

\(E[z(y-x\theta)]=0\)

MLE:

\(E[\nabla_\theta \ln f(x, \theta)]=0\)
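For instance, the OLS condition has a closed-form sample analogue, \(\hat \theta =\sum x_iy_i/\sum x_i^2\); a simulation sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, theta_true = 10_000, 1.5
x = rng.normal(0.0, 1.0, n)
y = theta_true * x + rng.normal(0.0, 0.5, n)

# Sample version of the moment condition E[x(y - x theta)] = 0,
# solved for theta.
theta_hat = np.mean(x * y) / np.mean(x * x)
print(theta_hat)  # close to 1.5
```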