For parametric models without independent variables (no inputs \(X\)) we have a form:

\(P(y| \theta )\)

And we have various ways of estimating \(\theta \).

We can write this as a likelihood function:

\(L(\theta ;y )=P(y|\theta)\)
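
For example, for \(n\) independent coin flips \(y_1, \dots, y_n \in \{0, 1\}\) with bias \(\theta \), the likelihood is:

\(L(\theta ;y )=\prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i}\)

Maximising this (maximum likelihood estimation) gives \(\hat\theta =\frac{1}{n}\sum_i y_i\).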

In discriminative models we learn:

\(P(y|X, \theta )\)

Which we can write as a likelihood function:

\(L(\theta ;y, X )=P(y| X, \theta)\)
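
For example, linear regression with Gaussian noise is a discriminative model of this form, with \(\theta =(\mathbf w, \sigma^2)\):

\(P(y|X, \theta )=\mathcal N(y;\, X\mathbf w,\, \sigma^2 I)\)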

In generative models we learn:

\(P(y, X| \theta )\)

Which we can write as a likelihood function:

\(L(\theta ;y, X )=P(y, X|\theta)\)
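
For example, a class-conditional generative model (such as naive Bayes) factorises the joint into a class prior and a class-conditional distribution for the features:

\(P(y, X|\theta )=P(y|\theta )P(X|y, \theta )\)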

We can use the generative model to calculate conditional probabilities.

\(P(y| X, \theta )=\dfrac{P(y, X| \theta )P(\theta )}{P(X, \theta )}\)

\(P(y| X, \theta )=\dfrac{P(y, X| \theta )}{P(X| \theta )}\)
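
For discrete \(y\) the denominator can itself be computed from the joint model (with an integral instead of a sum if \(y\) is continuous):

\(P(X|\theta )=\sum_y P(y, X|\theta )\)

\(P(y|X, \theta )=\dfrac{P(y, X|\theta )}{\sum_{y'} P(y', X|\theta )}\)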

For models without independent variables we had:

\(P(\theta |y)=\dfrac{P(y, \theta)}{P(y)}\)

\(P(\theta |y)=\dfrac{P(y| \theta)P(\theta )}{P(y)}\)

The denominator is a normalisation factor (it does not depend on \(\theta \)), and so we can use:

\(P(\theta |y)\propto P(y| \theta)P(\theta )\)

We have here:

Our prior - \(P(\theta )\)

Our posterior - \(P(\theta |y)\)

Our likelihood function - \(P(y| \theta)\)
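
For example, for \(n\) coin flips \(y_i \in \{0, 1\}\) with a Bernoulli likelihood and a \(\mathrm{Beta}(a, b)\) prior on \(\theta \), the posterior is another Beta distribution:

\(P(\theta |y)\propto \left[\prod_i \theta^{y_i}(1-\theta)^{1-y_i}\right]\theta^{a-1}(1-\theta)^{b-1}\)

\(P(\theta |y)=\mathrm{Beta}\left(a+\sum_i y_i,\; b+n-\sum_i y_i\right)\)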

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}\)

The denominator is a normalisation factor, and so we can use:

\(P(\theta |y,X)\propto P(y, X| \theta)P(\theta)\)

We have here:

Our prior - \(P(\theta )\)

Our posterior - \(P(\theta |y,X)\)

Our likelihood function - \(P(y, X| \theta )\)

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta, X)}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X|\theta )}{P(y, X)}\)

We assume \(X\) does not depend on \(\theta \), so \(P(X|\theta )=P(X)\), and so:

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X)}{P(y, X)}\)

The denominator is a normalisation factor, and so we can use:

\(P(\theta |y,X)\propto P(y| X, \theta)P(\theta)\)

We have here:

Our prior - \(P(\theta )\)

Our posterior - \(P(\theta |y,X)\)

Our likelihood function - \(P(y| X, \theta )\)

Our prior predictive distribution for \(P(y|X)\) depends on our prior for \(\theta \).

\(P(y|X)=\int_\Theta P(y|X, \theta)P(\theta )d\theta \)

Once we have calculated \(P(\theta |\mathbf y, \mathbf X)\), we can calculate the posterior predictive distribution for \(y\) at a new input \(\mathbf x\).

\(P(y|\mathbf x, \mathbf y, \mathbf X )=\int_\Theta P(y|\mathbf x, \theta)P(\theta |\mathbf y, \mathbf X)d\theta \)
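
A minimal sketch of these two integrals, assuming (for illustration only) a simple Bernoulli model with no covariates \(X\), and approximating the integral over \(\theta \) with a grid:

```python
import numpy as np

# A minimal sketch of the prior and posterior predictive calculations,
# assuming a Bernoulli model with no covariates X, and approximating the
# integral over theta with a grid.

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # observed data
theta = np.linspace(0.001, 0.999, 999)           # grid over the parameter space
prior = np.full_like(theta, 1.0 / len(theta))    # flat prior P(theta), normalised

# Likelihood P(y | theta) evaluated at every grid point
likelihood = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())

# Posterior P(theta | y) proportional to P(y | theta) P(theta)
posterior = likelihood * prior
posterior /= posterior.sum()

# Prior predictive P(y* = 1) = sum over the grid of P(y* = 1 | theta) P(theta)
prior_predictive = np.sum(theta * prior)

# Posterior predictive P(y* = 1 | y) = sum of P(y* = 1 | theta) P(theta | y)
posterior_predictive = np.sum(theta * posterior)

print(prior_predictive, posterior_predictive)
```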

Classification models are a type of regression model, where \(y\) is discrete rather than continuous.

So we want to find a mapping from a vector \(X\) to probabilities across discrete \(y\) values.

A classifier takes \(X\) and returns a vector.

For a classifier we have \(K\) classes, so the vector has \(K\) entries, one per class.

Multiclass classification

For example, an email could belong to one of several classes: work, friends, family or hobby.

We can measure the risk of a classifier. This is the chance of misclassification.

\(R(C)=P(C(X)\ne Y)\)

The Bayes classifier is the classifier \(C(X)\) which minimises this risk.

It takes the output of the soft classifier and chooses the class with the highest probability.
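
Written out, this is:

\(C(\mathbf x)=\underset{k\in \{1,\dots ,K\}}{\arg\max}\; P(Y=k|X=\mathbf x)\)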

Our data \((\mathbf y, \mathbf X)\) consists of individual observations \((y_i, \mathbf x_i)\).

We create a function \(\hat y_i = f(\mathbf x_i)\).

The best predictor of \(y\) given \(x\), in the sense of minimising the expected squared error, is:

\(g(X)=E[Y|X]\)

The goal of regression is to find an approximation of this function.
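
To see why this is the best choice under squared error: for any predictor \(f\),

\(E[(Y-f(X))^2|X]=E[(Y-E[Y|X])^2|X]+(E[Y|X]-f(X))^2\)

The first term does not depend on \(f\), and the second term is zero exactly when \(f(X)=E[Y|X]\).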

We model this with \(P(y|X, \theta )\) and make point predictions:

\(\hat y =f(\mathbf x)\)

\(\epsilon_i = y_i- \hat y_i\)

The residual sum of squares:

\(RSS=\sum_i \epsilon_i^2=\sum_i (y_i-\hat y_i)^2\)

The explained sum of squares:

\(ESS=\sum_i (\bar y-\hat y_i)^2\)

The total sum of squares:

\(TSS=\sum_i (y_i-\bar y)^2\)
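
A minimal sketch of these quantities in Python; the data and the least-squares straight-line fit below are purely illustrative:

```python
import numpy as np

# A minimal sketch of the sums of squares for a fitted regression.

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

slope, intercept = np.polyfit(x, y, 1)   # fitted model, giving y_hat = f(x)
y_hat = slope * x + intercept

rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
ess = np.sum((y.mean() - y_hat) ** 2)    # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares

print(rss, ess, tss)  # for a linear fit with an intercept, TSS = RSS + ESS
```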

A hard classifier can return a one-hot vector, with \(1\) in the position of the predicted class and \(0\) elsewhere.

A soft classifier returns probabilities for each entry in the vector.

The \(k\)-th entry of the vector represents \(P(Y=k|X=x)\).

For two classes we can use a cutoff (for example, predict the positive class if its probability is above \(0.5\)).

If there are more than two classes we can choose the one with the highest score.
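
A minimal sketch of both rules; the probabilities below are made up for illustration:

```python
import numpy as np

# A minimal sketch of turning soft classifier outputs into hard labels.

# Two classes: P(Y = 1 | X = x) for three examples, with a cutoff of 0.5
p_positive = np.array([0.2, 0.7, 0.55])
hard_binary = (p_positive >= 0.5).astype(int)

# More than two classes: one row of P(Y = k | X = x) per example (K = 4)
p_classes = np.array([[0.1, 0.6, 0.2, 0.1],
                      [0.3, 0.3, 0.2, 0.2]])
hard_multiclass = p_classes.argmax(axis=1)   # choose the class with the highest score

print(hard_binary, hard_multiclass)
```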

There are two types of error: false positives (predicting the positive class when the true class is negative) and false negatives (predicting the negative class when the true class is positive).

\(R^2= 1-\dfrac{RSS}{TSS}\)

A point estimate, such as the posterior mean, can be calculated either for a parameter or for a predicted value of \(y\).

Linear models

For a linear model with Gaussian noise, maximum likelihood estimation (MLE) is the same as minimising squared loss.

Maximum a posteriori (MAP) estimation with a Gaussian prior on the weights is the same as minimising squared loss with \(L_2\) regularisation.
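
A sketch of why, assuming \(y_i=\mathbf w^\top \mathbf x_i+\epsilon_i\) with \(\epsilon_i \sim \mathcal N(0, \sigma^2)\):

\(-\log P(\mathbf y|\mathbf X, \mathbf w)=\dfrac{1}{2\sigma^2}\sum_i (y_i-\mathbf w^\top \mathbf x_i)^2+\text{const}\)

so maximising the likelihood minimises the squared loss. Adding a Gaussian prior \(\mathbf w \sim \mathcal N(0, \tau^2 I)\) gives:

\(-\log P(\mathbf w|\mathbf y, \mathbf X)=\dfrac{1}{2\sigma^2}\sum_i (y_i-\mathbf w^\top \mathbf x_i)^2+\dfrac{1}{2\tau^2}\lVert \mathbf w\rVert^2+\text{const}\)

so the MAP estimate minimises squared loss with an \(L_2\) (ridge) penalty.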

For classification we don't want predicted probabilities outside \(0\) and \(1\).

\(F_1\) score: \(\dfrac{2PR}{P+R}\), where \(P\) is precision and \(R\) is recall.

We may not just care about accuracy; in breast cancer screening, for example, a false negative is much more costly than a false positive.

High accuracy can result from a very basic model (e.g. predicting that everyone on the Titanic died).
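
A minimal sketch pulling these metrics together; the confusion-matrix counts below are made up:

```python
# A minimal sketch of accuracy, precision, recall and the F1 score from the
# counts of a binary confusion matrix.

tp, fp, fn, tn = 8, 2, 4, 86   # e.g. a screening test on 100 cases

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# A model that always predicts "negative" would get (fp + tn) / 100 = 0.88
# accuracy on this data while catching none of the positives.
print(accuracy, precision, recall, f1)
```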