We want to estimate the parameters \(\theta \) of a model from observed data \(X\). A natural starting point is the likelihood function:

\(L(\theta ; X)=P(X|\theta )\)

The likelihood function gives the probability of the observed data being generated, given specific parameter values.

If the likelihood has a sharp peak, the data provide strong evidence that \(\theta \) lies in the region of that peak.

For multiple events, the likelihood function is:

\(L(\theta ; X)=P(X|\theta )\)

\(L(\theta ; X)=P(A_1 \land B_2 \land C_3 \land D_4 \land \dots |\theta )\)

If the events are independent, that is, the outcome of one flip doesn’t depend on any other outcome, then:

\(L(\theta ; X)=P(A_1|\theta )\cdot P(B_2|\theta )\cdot P(C_3|\theta )\cdot P(D_4|\theta )\cdots \)

If the events are also identically distributed, that is, the chance of flipping a head doesn’t change across flips (for example, the heads side doesn’t get heavier over time), then:

\(L(\theta ; X)=P(A|\theta )\cdot P(B|\theta )\cdot P(C|\theta )\cdot P(D|\theta )\cdots \)

\(L(\theta ; X)=\prod_{i=1}^n P(X_i|\theta )\)
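The product formula can be sketched for the coin-flip case. The example below assumes each flip is a Bernoulli trial with \(\theta \) the probability of heads; the data in `flips` and the helper name `log_likelihood` are hypothetical, chosen only for illustration. It works with the log of the product (a sum of logs) and looks for the peak over a grid of candidate values.

```python
import math

def log_likelihood(theta, flips):
    """Log of prod_i P(X_i | theta) for i.i.d. Bernoulli flips (1 = heads, 0 = tails)."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in flips)

# Hypothetical data: 7 heads in 10 flips.
flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate values of theta strictly inside (0, 1), so the logs are defined.
grid = [i / 100 for i in range(1, 100)]

# The grid point with the highest log-likelihood marks the peak.
best_theta = max(grid, key=lambda t: log_likelihood(t, flips))
print(best_theta)
```

For this model the peak sits at the observed proportion of heads, which is where the data say \(\theta \) is most plausibly located.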

The score is defined as the derivative of the log-likelihood function \(l(\theta ; X)=\log L(\theta ; X)\) with respect to \(\theta \).

\(V(\theta, X)=\dfrac{\partial }{\partial \theta }l(\theta ; X) \)

By the chain rule:

\(V(\theta, X)=\dfrac{1 }{\prod_{i=1}^nP(X_i|\theta )}\dfrac{\partial }{\partial \theta}L(\theta; X) \)

The expectation of the score, taken over the data \(X\) at the true value of \(\theta \), is:

\(E[V(\theta, X)]=\int V(\theta, X)P(X|\theta )dX\)

\(E[V(\theta, X)]=\int \dfrac{1 }{\prod_{i=1}^nP(X_i|\theta )}\dfrac{\partial }{\partial \theta}L(\theta; X)P(X|\theta )dX\)

Since \(P(X|\theta )=L(\theta ; X)=\prod_{i=1}^nP(X_i|\theta )\), the likelihood cancels:

\(E[V(\theta, X)]=\int \dfrac{\partial }{\partial \theta}P(X|\theta )dX\)

Assuming differentiation and integration can be exchanged, and using the fact that a probability density integrates to \(1\):

\(E[V(\theta, X)]=\dfrac{\partial }{\partial \theta}\int P(X|\theta )dX=\dfrac{\partial }{\partial \theta}1=0\)

So the expected value of the score is \(0\).
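As a quick check of the zero-mean property, the sketch below again assumes Bernoulli coin flips (an illustrative choice, with hypothetical names `likelihood`, `score`, and `theta_true`). Because the data are discrete, the integral becomes a sum over every possible sample of \(n\) flips.

```python
import itertools
import math

def likelihood(theta, xs):
    """P(X | theta) for i.i.d. Bernoulli outcomes xs (1 = heads, 0 = tails)."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in xs)

def score(theta, xs):
    """Derivative of the log-likelihood with respect to theta."""
    return sum(x / theta - (1 - x) / (1.0 - theta) for x in xs)

theta_true = 0.3  # assumed true parameter
n = 3             # number of flips per sample

# E[V] = sum over all 2**n samples of V(theta, X) * P(X | theta).
expected_score = sum(
    score(theta_true, xs) * likelihood(theta_true, xs)
    for xs in itertools.product([0, 1], repeat=n)
)
print(expected_score)
```

The printed value should be zero up to floating-point rounding. Evaluating the score at a value other than the one used to weight the samples would in general give a non-zero result.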

The variance of the score is:

\(var [\dfrac{\partial }{\partial \theta }l(\theta ; X) ]\)

\(var [\dfrac{1 }{\prod_{i=1}^nP(X_i|\theta )}\dfrac{\partial }{\partial \theta }L(\theta ; X) ]\)

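The Fisher information is the variance of the score. Because the score is centred around \(0\), its variance is the same as the expectation of the squared score:

\(I(\theta )=E[(\dfrac{\partial }{\partial \theta }\log f(X, \theta ))^2 |\theta ]\)

Under regularity conditions (the log-likelihood is twice differentiable and differentiation can be exchanged with integration), this also equals the negative of the expected second derivative:

\(I(\theta )=-E[\dfrac{\partial^2 }{\partial \theta^2 }\log f(X, \theta ) |\theta ]\)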
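The two expressions for the Fisher information can be compared numerically. The sketch below assumes a single Bernoulli observation with probability of heads \(\theta \), for which the known closed form is \(1/(\theta (1-\theta ))\); the function names are hypothetical.

```python
theta = 0.3
# The two possible outcomes with their probabilities: (x, P(x | theta)).
outcomes = [(1, theta), (0, 1.0 - theta)]

def score_one(x, t):
    """First derivative of log f(x, t) for one Bernoulli observation."""
    return x / t - (1 - x) / (1.0 - t)

def curvature_one(x, t):
    """Second derivative of log f(x, t) for one Bernoulli observation."""
    return -x / t**2 - (1 - x) / (1.0 - t) ** 2

mean_score = sum(p * score_one(x, theta) for x, p in outcomes)
variance_of_score = sum(p * (score_one(x, theta) - mean_score) ** 2 for x, p in outcomes)
expected_squared_score = sum(p * score_one(x, theta) ** 2 for x, p in outcomes)
neg_expected_curvature = -sum(p * curvature_one(x, theta) for x, p in outcomes)
closed_form = 1.0 / (theta * (1.0 - theta))

print(variance_of_score, expected_squared_score, neg_expected_curvature, closed_form)
```

All four printed numbers should agree up to rounding, which shows in this model that the variance form, the squared-score form, and the negative-curvature form describe the same quantity.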

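With \(k\) parameters, \(\theta =(\theta_1, \dots , \theta_k)\), the Fisher information is a \(k\times k\) matrix with entries:

\(I(\theta )_{ij}=E[(\dfrac{\partial }{\partial \theta_i}\log f(X, \theta ))(\dfrac{\partial }{\partial \theta_j }\log f(X, \theta ))|\theta ]\)

The Fisher information matrix measures how much information the data carry about the population parameters.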

The observed Fisher information is the negative of the Hessian of the log-likelihood, evaluated at a particular value \(\theta^*\).

We have:

\(l(\theta |\mathbf X)=\sum_i\ln P(\mathbf x_i|\theta )\)

\(J(\theta^*)=-\nabla \nabla^Tl(\theta |\mathbf X )|_{\theta = \theta^*}\)

The Fisher information is the expected value of this.

\(I(\theta )=E[J(\theta)]\)
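With a single parameter the Hessian reduces to a second derivative, so the observed information can be approximated by finite differences. The sketch below assumes the Bernoulli coin model with hypothetical data (7 heads in 10 flips); the step size `h` and the function names are illustrative choices, and the evaluation point is taken to be the maximum likelihood estimate.

```python
import math

def log_likelihood(theta, heads, n):
    """Log-likelihood of n i.i.d. Bernoulli flips containing `heads` heads."""
    return heads * math.log(theta) + (n - heads) * math.log(1.0 - theta)

def observed_information(theta, heads, n, h=1e-4):
    """Negative second derivative of the log-likelihood by central differences."""
    second = (
        log_likelihood(theta + h, heads, n)
        - 2.0 * log_likelihood(theta, heads, n)
        + log_likelihood(theta - h, heads, n)
    ) / h**2
    return -second

heads, n = 7, 10       # hypothetical data
theta_hat = heads / n  # maximum likelihood estimate for this model

j_numeric = observed_information(theta_hat, heads, n)
# Analytic value of -l''(theta) at theta_hat for the Bernoulli model.
j_exact = heads / theta_hat**2 + (n - heads) / (1.0 - theta_hat) ** 2
print(j_numeric, j_exact)
```

For this model the observed information at the maximum likelihood estimate coincides with the expected information \(n/(\theta (1-\theta ))\) evaluated at the same point; in general the two can differ for a given sample.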

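Two parameters \(\theta_i\) and \(\theta_j\) are called orthogonal if the corresponding entry \(I(\theta )_{ij}\) of the Fisher information matrix is \(0\).

This means that the parameters can be estimated separately: their maximum likelihood estimates are asymptotically uncorrelated.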

This can be written as a moment condition

\(\delta \)
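A standard example of orthogonal parameters is the mean and variance of a normal distribution. The sketch below assumes that model (it is not taken from the text above) and approximates each entry \(E[V_iV_j]\) of the Fisher information matrix for one observation with a midpoint Riemann sum; the names `score_vector` and `fisher_matrix`, the grid size, and the integration width are all illustrative choices.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def score_vector(x, mu, var):
    """Partial derivatives of log f(x; mu, var) with respect to mu and var."""
    d_mu = (x - mu) / var
    d_var = -0.5 / var + (x - mu) ** 2 / (2.0 * var**2)
    return d_mu, d_var

def fisher_matrix(mu, var, steps=20000, width=10.0):
    """Approximate E[score_i * score_j] over mu +/- width standard deviations."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    dx = 2.0 * width * sd / steps
    info = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        x = lo + (k + 0.5) * dx           # midpoint of the k-th cell
        s = score_vector(x, mu, var)
        w = normal_pdf(x, mu, var) * dx   # probability mass of the cell
        for i in range(2):
            for j in range(2):
                info[i][j] += s[i] * s[j] * w
    return info

info = fisher_matrix(mu=1.0, var=4.0)
print(info)
```

The off-diagonal entries should come out as zero up to rounding, while the diagonal entries should be close to \(1/\sigma^2\) and \(1/(2\sigma^4)\), the known values for this model.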