Parametric estimation of probability distributions

Introduction

Parametric models

Introduction

We have some data \(X\). We model it as a function of parameters, \(\theta \).

\(P(X|\theta )=f(\theta )\).

There is a data generating process. We use models to say things about the data generating process.

Parameters

Parameters could be arguments to some function, or moments.

Distributions of \(\theta \)

We can create a distribution of \(\theta \):

\(P(\theta | X)\).

So to create a model for the data we need:

  • A model - \(P\)

  • Parameters for the model - \(\theta \)

There are therefore two questions: how to select a model and, given a model, how to select parameters.

Likelihood functions

Likelihood function

Likelihood function

We want to estimate parameters. One way of looking into this is to look at the likelihood function:

\(L(\theta ; X)=P(X|\theta )\)

The likelihood function shows the chance of the observed data being generated, given specific parameters.

If the likelihood function has a sharp peak, this is evidence that \(\theta \) lies in that region.

IID

For multiple events, the likelihood function is:

\(L(\theta ; X)=P(X|\theta )\)

\(L(\theta ; X)=P(A_1 \land B_2 \land C_3 \land D_4…|\theta )\)

If the events are independent, that is, the outcome of one observation (for example, one coin flip) does not depend on any of the others, then:

\(L(\theta ; X)=P(A_1|\theta )\cdot P(B_2|\theta )\cdot P(C_3|\theta )\cdot P(D_4|\theta )...\)

If the events are identically distributed, that is, the chance of flipping a head does not change across flips (for example, the heads side does not get heavier over time), then:

\(L(\theta ; X)=P(A|\theta )\cdot P(B|\theta )\cdot P(C|\theta )\cdot P(D|\theta )...\)

\(L(\theta ; X)=\prod_{i=1}^n P(X_i|\theta )\)

Log-likelihood function

\(L(\theta ; X)=\prod_{i=1}^n P(X_i|\theta )\)

\(\ln L(\theta ; X)=\ln \prod_{i=1}^n P(X_i|\theta )\)

\(l(\theta ; X)=\sum_{i=1}^n \ln P(X_i|\theta )\)
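As a minimal sketch of this in Python, for IID Bernoulli data (coin flips, with \(P(H|\theta )=\theta \); the function name is illustrative):

    import numpy as np

    def log_likelihood(theta, flips):
        """Log-likelihood of IID Bernoulli data: sum of ln P(x_i | theta)."""
        flips = np.asarray(flips)
        # P(x_i | theta) is theta for heads (1) and 1 - theta for tails (0)
        return np.sum(flips * np.log(theta) + (1 - flips) * np.log(1 - theta))

    # 5 heads and 5 tails, as in the coin example later in these notes
    print(log_likelihood(0.5, [1] * 5 + [0] * 5))  # 10 * ln(0.5) ~ -6.93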

Score

Recap: Log-likelihood

The log-likelihood for IID data is:

\(l(\theta ; X)=\sum_{i=1}^n \ln P(X_i|\theta )\)

The score

The score is defined as the differential of the log-likelihood function with respect to \(\theta \).

\(V(\theta; X)=\dfrac{\partial }{\partial \theta }l(\theta ; X) \)

\(V(\theta; X)=\dfrac{1 }{\prod_{i=1}^nP(X_i|\theta )}\dfrac{\partial }{\partial \theta}L(\theta; X) \)

Expectation of the score

The expectation of the score, evaluated at the true value of \(\theta \), is:

\(E[V(\theta; X)]=\int V(\theta; X)P(X|\theta )dX\)

\(E[V(\theta; X)]=\int \dfrac{1}{P(X|\theta )}\dfrac{\partial }{\partial \theta}P(X|\theta )P(X|\theta )dX\)

\(E[V(\theta; X)]=\int \dfrac{\partial }{\partial \theta}P(X|\theta )dX\)

\(E[V(\theta; X)]=\dfrac{\partial }{\partial \theta}\int P(X|\theta )dX=\dfrac{\partial }{\partial \theta}1=0\)

So, provided we can interchange differentiation and integration, the expected value of the score at the true parameter is \(0\).
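A quick Monte Carlo check of this result for a Bernoulli model (a sketch, not part of the derivation above):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 0.3  # true parameter

    # Score of one Bernoulli observation: d/dtheta ln P(x | theta)
    # = x / theta - (1 - x) / (1 - theta)
    x = rng.binomial(1, theta, size=1_000_000)
    score = x / theta - (1 - x) / (1 - theta)
    print(score.mean())  # close to 0 at the true theta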

Variance of the score

The variance of the score is:

\(var [V]=E[V^2]-E[V]^2\)

Since the expectation of the score at the true \(\theta \) is \(0\), this is simply:

\(var [V]=E[(\dfrac{\partial }{\partial \theta }l(\theta ; X))^2]\)

Fisher information

The Fisher information is the variance of the score:

\(I(\theta )=E[(\dfrac{\partial }{\partial \theta }\log f(X, \theta ))^2 |\theta ]\)

Under regularity conditions, this equals the negative expected second derivative:

\(I(\theta )=-E[\dfrac{\partial^2 }{\partial \theta^2 }\log f(X, \theta ) |\theta ]\)

The variance equals the expectation of the squared score because the score is centred around \(0\).

Fisher information matrix

If we have \(k\) parameters, the Fisher information becomes a \(k\times k\) matrix:

\(I(\theta )_{ij}=E[(\dfrac{\partial }{\partial \theta_i}\log f(X, \theta ))(\dfrac{\partial }{\partial \theta_j }\log f(X, \theta ))|\theta ]\)

Observed Fisher information matrix

The Fisher information matrix contains information about the population.

The observed Fisher information is the negative of the Hessian of the log-likelihood.

We have:

  • \(l(\theta |\mathbf X)=\sum_i\ln P(\mathbf x_i|\theta )\)

  • \(J(\theta^*)=-\nabla \nabla^Tl(\theta|\mathbf X )|_{\theta = \theta^*}\)

The Fisher information is the expected value of this.

\(I(\theta )=E[J(\theta)]\)
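As a sketch, the observed information can be computed numerically as the negative second derivative of the log-likelihood; here for the Bernoulli model (all names are illustrative):

    import numpy as np

    def observed_information(loglik, theta, eps=1e-5):
        """Negative second derivative of the log-likelihood,
        via a central finite difference."""
        return -(loglik(theta + eps) - 2 * loglik(theta) + loglik(theta - eps)) / eps**2

    flips = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])  # 6 heads, 4 tails
    loglik = lambda t: np.sum(flips * np.log(t) + (1 - flips) * np.log(1 - t))

    print(observed_information(loglik, theta=0.6))
    # Analytic value at the MLE: n / (theta * (1 - theta)) = 10 / 0.24 ~ 41.7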

Orthogonality

Two parameters are called orthogonal if their entry in the Fisher information matrix is \(0\).

This means that the parameters can be estimated separately: their maximum likelihood estimates can be computed independently of each other.

Orthogonality can also be written as a moment condition.

Bayesian parameter estimation

Bayesian parameter estimation

Bayes rule

We want to generate the probability distribution of \(\theta \) given the evidence \(X\).

We can transform this using Bayes rule.

\(P(\theta | X)=\dfrac{P(X|\theta )P(\theta )}{P(X)}\)

Here we have:

  • Our prior - \(P(\theta )\)

  • Our likelihood function - \(P(X|\theta )\)

  • Our posterior - \(P(\theta | X)\)

Normal priors and posteriors

If our prior is a normal distribution then:

\(P(\theta )=\dfrac{1}{\sqrt {2\pi \sigma_0^2}}e^{-\dfrac{(\theta -\mu_0)^2}{2\sigma_0^2}}\)

Similarly, if our likelihood function \(P(X|\theta )\) is a normal distribution centred on \(\theta \) then:

\(P(X|\theta )=\dfrac{1}{\sqrt {2\pi \sigma^2}}e^{-\dfrac{(x-\theta )^2}{2\sigma ^2}}\)

We can now plug these into Bayes rule:

\(P(\theta |X)=\dfrac{1}{P(X)}\dfrac{1}{\sqrt {2\pi \sigma_0^2}}e^{-\dfrac{(\theta -\mu_0)^2}{2\sigma_0^2}}\dfrac{1}{\sqrt {2\pi \sigma^2}}e^{-\dfrac{(x-\theta )^2}{2\sigma ^2}}\)

\(P(\theta |X)\propto e^{-\dfrac{1}{2}[\dfrac{(\theta -\mu_0)^2}{\sigma_0^2}+\dfrac{(x-\theta )^2}{\sigma ^2}]}\)

Completing the square in \(\theta \) shows that the posterior is itself a Gaussian:

\(P(\theta |X)=\dfrac{1}{\sqrt {2\pi \sigma_1^2}}e^{-\dfrac{(\theta -\mu_1)^2}{2\sigma_1^2}}\)

Where:

\(\sigma_1^2=(\dfrac{1}{\sigma_0^2}+\dfrac{1}{\sigma^2})^{-1}\)

\(\mu_1=\sigma_1^2(\dfrac{\mu_0}{\sigma_0^2}+\dfrac{x}{\sigma^2})\)

The posterior mean is a precision-weighted average of the prior mean and the observation.
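A minimal sketch of this conjugate update for a single observation (function and variable names are illustrative):

    def normal_posterior(x, mu0, sigma0_sq, sigma_sq):
        """Posterior over theta: prior N(mu0, sigma0_sq),
        likelihood x | theta ~ N(theta, sigma_sq)."""
        sigma1_sq = 1.0 / (1.0 / sigma0_sq + 1.0 / sigma_sq)
        mu1 = sigma1_sq * (mu0 / sigma0_sq + x / sigma_sq)
        return mu1, sigma1_sq

    # A vague prior is pulled almost all the way to the observation
    print(normal_posterior(x=2.0, mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0))
    # (1.818..., 0.909...)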

Empirical Bayes

Bayes rule

We can calculate the posterior probability for \(\theta \), but we need a prior \(P(\theta )\).

\(P(\theta | X)=\dfrac{P(X|\theta )P(\theta )}{P(X)}\)

Empirical Bayes

With empirical Bayes we get our prior from the data.

We have a likelihood \(P(X|\theta )\), and a prior \(P(\theta |\rho )\) which depends on a hyperparameter \(\rho \).

We observe \(X\) and want to estimate \(\theta \).

\(P(\theta |X)=\dfrac{P(X|\theta)P(\theta )}{P(X)}=\dfrac{P(X|\theta)}{P(X)}\int P(\theta | \rho )P(\rho )d\rho \)

Prior and posterior predictive distributions

Prior predictive distribution

Our prior predictive distribution for \(X\) depends on our prior for \(\theta \).

\(P(\mathbf x)=\int_\Theta P(\mathbf x|\theta)P(\theta )d\theta \)

Posterior predictive distribution

Once we have calculated \(P(\theta |X)\), we can calculate a posterior probability distribution for \(X\).

\(P(\mathbf x|\mathbf X)=\int_\Theta P(\mathbf x|\theta)P(\theta |\mathbf X)d\theta \)
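For the normal model of the earlier section, this integral has a closed form: the posterior predictive is itself normal, \(x_{new}|\mathbf X \sim N(\mu_1, \sigma_1^2+\sigma^2)\), adding the posterior uncertainty about \(\theta \) to the observation noise.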

Point estimates for parameters

Estimators

When we take statistics we are often concerned with inferring properties of the underlying probability function.

As the properties of the probability distribution function affect the chance of observing the sample, we can analyse samples to infer properties of the underlying distribution.

There are many properties we could be interested in. These include moments and the parameters of a specific probability distribution function.

An estimator is a statistic which is our estimate of one of these values.

Note that statistics and estimators are different things: a statistic may be a terrible estimator, but be useful for other purposes.

Sufficient statistics

We can make estimates of a population parameter using statistics from the sample.

A statistic is sufficient if it contains all the information needed to estimate the parameter.

We can describe the role of a parameter as:

\(P(x|\theta, t)\)

\(t\) is a sufficient statistic for \(\theta \) if:

\(P(x|t)=P(x|\theta, t)\)

Properties of point estimators

Estimator error and bias

Error of an estimator

The error of an estimator is the difference between it and the actual parameter.

\(Error_{\theta }[\hat \theta ]=\hat \theta - \theta \)

Bias of an estimator

The bias of an estimator is the expected error.

\(Bias_\theta [\hat \theta ]:=E_\theta [\hat \theta -\theta ]\)

\(Bias_\theta [\hat \theta ]=E_\theta [\hat \theta] -\theta \)

Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) of an estimator

Mean squared error

Mean squared error

\(MSE = E[(\hat \theta - \theta )^2]=E[((\hat \theta - E[\hat \theta ])+(E[\hat \theta ]-\theta ))^2]\)

\(MSE = E[(\hat \theta - \theta )^2]=E[(\hat \theta - E[\hat \theta ])^2+(E[\hat \theta ]-\theta )^2+2(E[\hat \theta ]-\theta )(\hat \theta- E[\hat \theta ])]\)

\(MSE = E[(\hat \theta - \theta )^2]=E[(\hat \theta - E[\hat \theta ])^2]+E[(E[\hat \theta ]-\theta)^2] +E[2(E[\hat \theta ]-\theta )(\hat \theta- E[\hat \theta ])]\)

\(MSE = E[(\hat \theta - \theta )^2]=Var(\hat \theta )+(E[\hat \theta ]-\theta)^2 +2(E[\hat \theta ]-\theta )E[\hat \theta- E[\hat \theta ]]\)

Since \(E[\hat \theta - E[\hat \theta ]]=0\), the final term vanishes:

\(MSE = E[(\hat \theta - \theta )^2]=Var(\hat \theta )+Bias (\hat \theta )^2\)
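A quick Monte Carlo check of this decomposition, using the biased \(\dfrac{1}{n}\)-variance estimator as an example (a sketch):

    import numpy as np

    rng = np.random.default_rng(1)
    n, true_var = 5, 4.0

    # Biased variance estimator (divides by n) over many repeated samples
    estimates = np.array([np.var(rng.normal(0, 2, n)) for _ in range(100_000)])

    mse = np.mean((estimates - true_var) ** 2)
    bias = estimates.mean() - true_var
    print(mse, estimates.var() + bias ** 2)  # the two sides agree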

Root Mean Square Error (RMSE)

This is the square root of the MSE.

It is also called the Root Mean Square Deviation (RMSD).

Asymptotic properties of estimators

Consistency and efficiency of estimators

Consistency

A statistic \(\hat \theta \) is a consistent estimator for \(\theta \) if it converges in probability to \(\theta \) as the sample size grows.

That is:

\(\hat \theta \xrightarrow{p} \theta \)

We can show that an estimator is consistent if we can write \(\hat \theta -\theta \) as a function of \(n\) which tends to \(0\).

Efficiency

Efficiency measures the speed at which a consistent estimator tends towards the true value.

Note that an estimator can be fairly efficient and still biased. Efficiency is measured as:

\(e(\hat \theta )=\dfrac{1/I(\theta )}{Var (\hat \theta )}\)

If an estimator has an efficiency of \(1\) and is unbiased, it is efficient.

Relative efficiency

We can measure the relative efficiency of two consistent estimators:

The relative efficiency is the variance of the first estimator, divided by the variance of the second.

Root-n estimators

An estimator is root-n consistent if it is consistent and its variance is:

\(O(\dfrac{1}{n})\)

\(n^\delta \)-consistent estimators

A consistent estimator is \(n^\delta \)-consistent if its variance is:

\(O(\dfrac{1}{n^{2 \delta }})\)

Cramér-Rao lower bound

For an unbiased estimator, the variance cannot be below the Cramér-Rao lower bound.

\(Var (\hat \theta )\ge \dfrac{1}{I(\theta )}\)

Where \(I(\theta )\) is the Fisher information.

We can sketch the proof.

We have the score:

\(V=\dfrac{\partial }{\partial \theta }\ln f(X, \theta )\)

\(V=\dfrac{1}{f(X, \theta )}\dfrac{\partial }{\partial \theta } f(X, \theta )\)

The expectation of the score is \(0\):

\(E[V]=E[\dfrac{1}{f(X, \theta )}\dfrac{\partial }{\partial \theta } f(X, \theta )]\)

\(E[V]=\int \dfrac{1}{f(X, \theta )}\dfrac{\partial }{\partial \theta } f(X, \theta )f(X, \theta )dx=\int \dfrac{\partial }{\partial \theta } f(X, \theta )dx=\dfrac{\partial }{\partial \theta }\int f(X, \theta )dx=0\)

So \(Var(V)=E[V^2]=I(\theta )\). For an unbiased estimator one can also show \(Cov(V, \hat \theta )=1\), so the Cauchy-Schwarz inequality \(Cov(V, \hat \theta )^2\le Var(V)Var(\hat \theta )\) gives \(Var (\hat \theta )\ge \dfrac{1}{I(\theta )}\).
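As a numerical illustration (a sketch): for IID Bernoulli data the sample mean attains the bound, since \(Var(\hat \theta )=\theta (1-\theta )/n=1/I(\theta )\).

    import numpy as np

    rng = np.random.default_rng(2)
    n, theta = 50, 0.3

    # Variance of the MLE (the sample mean) over repeated samples
    estimates = rng.binomial(1, theta, size=(100_000, n)).mean(axis=1)

    print(estimates.var())          # ~ 0.0042
    print(theta * (1 - theta) / n)  # Cramer-Rao bound: 1 / I(theta) = 0.0042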

Bias-Variance trade-off

If we care about the mean squared error \(E[(\hat \theta -\theta )^2]\) then we may not want an unbiased estimator: by accepting some bias we may be able to reduce the variance by more, lowering the overall MSE.

Maximum Likelihood Estimation

Maximum likelihood estimation

Maximising the likelihood function

We have a likelihood function of the data.

\(L(\theta ; X)=P(X|\theta )\)

We choose values for \(\theta \) which maximise the likelihood function.

\(\arg\max_\theta P(X|\theta )\)

That is, for which values of \(\theta \) was the observation we saw most likely?

This is a mode estimate.

IID

\(L(\theta ; X)=\prod_i P(x_i|\theta )\)

Logarithms

We can take logarithms, which preserves the maximiser: the logarithm is strictly increasing, so wherever the likelihood is non-zero the log-likelihood has the same stationary points.

We therefore work with:

\(\ln L(\theta ; X)=\ln \prod_i P(x_i|\theta )\)

\(\ln L(\theta ; X)=\sum_i \ln P(x_i|\theta )\)

Example: Coin flip

Let’s take our simple example about coins. Heads and tails are the only options, so \(P(H)+P(T)=1\).

\(P(H|\theta )=\theta \)

\(P(T|\theta )=1-\theta \)

\(\ln L(\theta ; X)=\sum_i \ln P(x_i|\theta )\)

If we had \(5\) heads and \(5\) tails we would have:

\(\ln L(\theta ; X)=5\ln (\theta )+ 5\ln (1-\theta )\)

Differentiating and setting to zero:

\(\dfrac{\partial }{\partial \theta }\ln L(\theta ; X)=\dfrac{5}{\theta }-\dfrac{5}{1-\theta }=0\)

This gives \(\theta =\dfrac{1}{2}\). So \(P(H)=\dfrac{1}{2}\) is the value which makes our observation most likely.
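We can confirm this numerically with a simple grid search over \(\theta \) (a sketch):

    import numpy as np

    thetas = np.linspace(0.01, 0.99, 99)
    loglik = 5 * np.log(thetas) + 5 * np.log(1 - thetas)  # 5 heads, 5 tails

    print(thetas[np.argmax(loglik)])  # 0.5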

Relative likelihood ratios

The relative likelihood \(L(\theta ; X)/L(\hat \theta ; X)\) compares parameter values to the maximum likelihood estimate. Regions of \(\theta \) where this ratio is high behave a bit like confidence intervals.

Asymptotic normality of the MLE
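Under standard regularity conditions the MLE is asymptotically normal: \(\sqrt n(\hat \theta -\theta )\xrightarrow{d} N(0, I_1(\theta )^{-1})\), where \(I_1(\theta )\) is the Fisher information of a single observation. This is why standard errors for MLEs are computed from the (observed) Fisher information.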

MLE of the Gaussian distribution

The parameters are the population mean and covariance matrix.

The MLE estimator for the mean is the sample mean.

The MLE estimator for the covariance matrix is the unadjusted sample covariance.
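That is:

\(\hat \mu =\dfrac{1}{n}\sum_{i=1}^n\mathbf x_i\)

\(\hat \Sigma =\dfrac{1}{n}\sum_{i=1}^n(\mathbf x_i-\hat \mu )(\mathbf x_i-\hat \mu )^T\)

Note the \(\dfrac{1}{n}\) factor: the MLE of the covariance is biased, unlike the \(\dfrac{1}{n-1}\) adjusted sample covariance.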

MLE of the Poisson distribution
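For IID Poisson data the log-likelihood is \(l(\lambda ; X)=\sum_i (x_i\ln \lambda -\lambda -\ln x_i!)\). Setting the score \(\sum_i (\dfrac{x_i}{\lambda }-1)\) to zero gives \(\hat \lambda =\dfrac{1}{n}\sum_i x_i\): the MLE is the sample mean.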

MLE of the Bernoulli and binomial distributions
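The coin flip example above generalises: for \(n\) Bernoulli trials with \(k\) successes, \(\ln L(\theta ; X)=k\ln \theta +(n-k)\ln (1-\theta )\), giving \(\hat \theta =\dfrac{k}{n}\), the sample proportion. The binomial likelihood differs only by a constant factor \(\binom{n}{k}\), so it has the same MLE.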

Maximum A Posteriori Estimation

Maximum A Posteriori (MAP) estimation

Definition

The MAP estimate is the mode of the posterior:

\(\arg\max_\theta P(\theta | X)\)

Using Bayes theorem:

\(P(\theta | X)= \dfrac{P(X|\theta )P(\theta)}{P(X)}\)

So:

\(\arg\max_\theta P(\theta | X)=\arg\max_\theta \dfrac{P(X|\theta )P(\theta)}{P(X)}\)

The denominator does not depend on \(\theta \), so:

\(\arg\max_\theta P(\theta | X)=\arg\max_\theta P(X|\theta )P(\theta)\)

If \(P(\theta )\) is a constant then this is the same as the MLE estimator.
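As a sketch of MAP versus MLE for the coin example, assuming (purely for illustration) a Beta(\(5, 5\)) prior over \(\theta \):

    import numpy as np

    heads, tails = 9, 1
    thetas = np.linspace(0.01, 0.99, 99)

    loglik = heads * np.log(thetas) + tails * np.log(1 - thetas)
    # Beta(5, 5) prior density, up to an additive constant in log space
    logprior = 4 * np.log(thetas) + 4 * np.log(1 - thetas)

    print(thetas[np.argmax(loglik)])             # MLE: 0.9
    print(thetas[np.argmax(loglik + logprior)])  # MAP: pulled towards 0.5 (0.72)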


MAP of the Gaussian distribution
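With the normal prior and likelihood of the earlier Bayesian section, the posterior is the Gaussian derived there, so the MAP estimate is its mode (and mean): \(\hat \theta_{MAP}=\mu_1=\sigma_1^2(\dfrac{\mu_0}{\sigma_0^2}+\dfrac{x}{\sigma^2})\).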

M-estimators
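M-estimators generalise maximum likelihood: instead of maximising \(\sum_i \ln P(x_i|\theta )\), we maximise a general criterion \(\sum_i \rho (x_i, \theta )\). MLE corresponds to \(\rho =\ln P\); other choices of \(\rho \) (for example Huber's loss) trade some efficiency for robustness.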

Method of Moments

Method of moments

Introduction

If we have \(k\) parameters to estimate, we can solve for them if we have \(k\) equations.

We generate these equations as follows.

First, we link each of the first \(k\) moments to a function of the parameters.

Then we replace the moments with sample estimates and solve.

Estimation

The moments of the population distribution are functions of the parameters:

\(\mu_i =E[X^i]=g_i(\theta_1,...,\theta_k)\)

We have a sample.

\(X=[X_1,...,X_n]\)

We now define the method of moments estimator. Each population moment is replaced by its sample analogue:

\(\hat \mu_i=\dfrac{1}{n}\sum_{j=1}^nx_j^i\)

We then solve the system \(\hat \mu_i=g_i(\hat \theta_1,...,\hat \theta_k)\) for \(i=1,...,k\) to obtain the estimates, as in the sketch below.
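A minimal sketch for a normal population, where \(g_1(\mu ,\sigma^2)=\mu \) and \(g_2(\mu ,\sigma^2)=\mu^2+\sigma^2\):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=2.0, scale=1.5, size=10_000)

    m1 = x.mean()           # first sample moment
    m2 = (x ** 2).mean()    # second sample moment

    mu_hat = m1             # invert g_1
    var_hat = m2 - m1 ** 2  # invert g_2
    print(mu_hat, var_hat)  # ~ 2.0, ~ 2.25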

Generalised Method of Moments (GMM)

Introduction

We have a function of the data and the parameter:

\(g(y, \theta )\)

A moment condition is that the expectation of such a function is \(0\).

\(m(\theta )=E[g(y, \theta )]=0\)

To do GMM, we estimate this using:

\(\hat m(\theta )=\dfrac{1}{n}\sum_ig(y_i, \theta )\)

We define:

\(\Omega = E[g(y, \theta )g(y, \theta)^T]\)

\(G=E[\nabla_\theta g(y, \theta)]\)

And then minimise the norm:

\(||\hat m(\theta )||^2_W=\hat m(\theta )^TW\hat m(\theta )\)

Where \(W\) is a positive definite matrix for the norm.

Choosing \(W=\Omega ^{-1}\) gives the most efficient estimator. But we don't know \(\Omega \): it depends on \(\theta \).

If the data are IID, we can estimate it from a preliminary estimate \(\hat \theta \):

\(\hat W(\hat \theta )= (\dfrac{1}{n}\sum_i g(y_i, \hat \theta)g(y_i, \hat \theta)^T)^{-1}\)

Two-step feasible GMM

First, estimate \(\hat \theta_1\) using \(\mathbf W=\mathbf I\). This is consistent, but not efficient.

Then use \(\hat \theta_1\) to form \(\hat W(\hat \theta_1)\) and re-estimate. The second-step estimator is asymptotically efficient.

Moment conditions

Many estimators can be expressed as moment conditions:

  • OLS: \(E[x(y-x\theta)]=0\)

  • WLS: \(E[x(y-x\theta)/\sigma^2(x)]=0\)

  • IV: \(E[z(y-x\theta)]=0\)

  • MLE: \(E[\nabla_\theta \ln f(x, \theta)]=0\)
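As a sketch, the OLS moment condition can be solved directly from its sample analogue \(\dfrac{1}{n}\sum_i x_i(y_i-x_i\theta )=0\), which gives \(\hat \theta =\sum_i x_iy_i/\sum_i x_i^2\):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=1000)
    y = 3.0 * x + rng.normal(size=1000)  # true theta = 3

    theta_hat = np.sum(x * y) / np.sum(x ** 2)  # solves the sample moment condition
    print(theta_hat)  # ~ 3.0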