Parametric models for dependent variables

Introduction

Generative and discriminative models

Recap

For parametric models without dependent variables the model takes the form:

\(P(y| \theta )\)

And we have various ways of estimating \(\theta \).

We can write this as a likelihood function:

\(L(\theta ;y )=P(y|\theta)\)

Discriminative models

In discriminative models we learn:

\(P(y|X, \theta )\)

Which we can write as a likelihood function:

\(L(\theta ;y, X )=P(y| X, \theta)\)

Generative models

In generative models we learn:

\(P(y, X| \theta )\)

Which we can write as a likelihood function:

\(L(\theta ;y, X )=P(y, X|\theta)\)

We can use the generative model to calculate conditional probabilities:

\(P(y| X, \theta )=\dfrac{P(y, X| \theta )P(\theta )}{P(X, \theta )}\)

\(P(y| X, \theta )=\dfrac{P(y, X| \theta )}{P(X| \theta )}\)
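
A minimal sketch of this identity, using a made-up joint table that plays the role of \(P(y, X|\theta )\) for binary \(y\) and \(X\): the conditional is obtained by dividing the joint by the marginal \(P(X|\theta )\).

```python
import numpy as np

# Hypothetical generative model over a binary label y and a binary feature X,
# stored as a joint table P(y, X | theta) for one fixed theta (numbers made up).
#                 X=0   X=1
joint = np.array([[0.30, 0.10],   # y=0
                  [0.20, 0.40]])  # y=1

# P(X | theta): marginalise the joint over y.
p_x = joint.sum(axis=0)

# P(y | X, theta) = P(y, X | theta) / P(X | theta), as in the identity above.
p_y_given_x = joint / p_x

print(p_y_given_x[:, 1])  # P(y | X=1, theta) -> [0.2, 0.8]
```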

Bayesian parameter estimation

Bayesian parameter estimation for dependent models

Recap

For non-dependent models we had:

\(P(\theta |y)=\dfrac{P(y, \theta)}{P(y)}\)

\(P(\theta |y)=\dfrac{P(y| \theta)P(\theta )}{P(y)}\)

The denominator is a normalisation factor, and so we can use:

\(P(\theta |y)\propto P(y| \theta)P(\theta )\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y)\)

  • Our likelihood function - \(P(y| \theta)\)
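
A minimal sketch of this on a grid, assuming made-up Bernoulli data and a uniform prior: the posterior is just likelihood times prior, normalised.

```python
import numpy as np

# Grid approximation of P(theta | y) ∝ P(y | theta) P(theta)
# for a Bernoulli model (data and prior are assumed for illustration).
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # observed data (made up)
theta = np.linspace(0.001, 0.999, 999)      # grid over the parameter
prior = np.ones_like(theta)                  # uniform prior P(theta)

# Likelihood P(y | theta) = prod theta^y_i (1 - theta)^(1 - y_i)
likelihood = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())

# Unnormalised posterior, then normalise (this plays the role of dividing by P(y)).
posterior = likelihood * prior
posterior /= posterior.sum() * (theta[1] - theta[0])

print(theta[np.argmax(posterior)])  # posterior mode, ~0.75 here
```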

Bayesian regression for generative models

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}\)

The denominator is a normalisation factor, and so we can use:

\(P(\theta |y,X)\propto P(y, X| \theta)P(\theta)\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y,X)\)

  • Our likelihood function - \(P(y, X| \theta )\)
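
A minimal sketch, assuming a made-up generative model in which \(y\) is a class label, \(x|y\) is Gaussian, and \(\theta \) is the unknown class-1 mean. The key point is that the likelihood is the joint \(P(y, X|\theta )\), evaluated here on a grid.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical generative model: y ~ Bernoulli(pi), x | y ~ N(mu_y, 1),
# with pi and mu_0 known and theta = mu_1 (the class-1 mean) unknown.
rng = np.random.default_rng(0)
pi, mu0, true_mu1 = 0.5, 0.0, 2.0
y = rng.binomial(1, pi, size=50)
x = rng.normal(np.where(y == 1, true_mu1, mu0), 1.0)

mu1_grid = np.linspace(-1, 4, 501)               # grid over theta
prior = norm.pdf(mu1_grid, loc=0, scale=10)      # broad Gaussian prior

# Joint likelihood P(y, X | theta) = prod P(y_i) P(x_i | y_i, theta).
log_lik = np.zeros_like(mu1_grid)
for i, mu1 in enumerate(mu1_grid):
    means = np.where(y == 1, mu1, mu0)
    log_lik[i] = (y * np.log(pi) + (1 - y) * np.log(1 - pi)
                  + norm.logpdf(x, loc=means, scale=1.0)).sum()

posterior = np.exp(log_lik - log_lik.max()) * prior
posterior /= posterior.sum() * (mu1_grid[1] - mu1_grid[0])
print(mu1_grid[np.argmax(posterior)])            # concentrates near 2.0
```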

Bayesian regression for discriminative models

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta, X)}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X|\theta )}{P(y, X)}\)

We assume \(X\) does not depend on \(\theta \), so \(P(X|\theta )=P(X)\), and so:

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X)}{P(y, X)}\)

The denominator is a normalisation factor, and so we can use:

\(P(\theta |y,X)\propto P(y| X, \theta)P(\theta)\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y,X)\)

  • Our likelihood function - \(P(y| X, \theta )\)
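
A minimal sketch, assuming a made-up discriminative model \(y_i \sim N(\theta x_i, 1)\) with an unknown slope \(\theta \). Note that \(P(X)\) is never modelled; \(X\) only enters through the likelihood of \(y\).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical discriminative model: y_i ~ N(theta * x_i, 1), slope theta unknown.
rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 1.5 * x + rng.normal(scale=1.0, size=40)    # true slope 1.5 (assumed)

theta_grid = np.linspace(-1, 4, 501)
prior = norm.pdf(theta_grid, loc=0, scale=5)    # broad Gaussian prior

# Likelihood P(y | X, theta): X only sets the mean, its distribution is not modelled.
log_lik = np.array([norm.logpdf(y, loc=t * x, scale=1.0).sum() for t in theta_grid])

posterior = np.exp(log_lik - log_lik.max()) * prior
posterior /= posterior.sum() * (theta_grid[1] - theta_grid[0])
print(theta_grid[np.argmax(posterior)])         # concentrates near 1.5
```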

Prior and posterior predictive distributions for dependent variables

Prior predictive distribution

The prior predictive distribution \(P(y|X)\) depends on our prior for \(\theta \).

\(P(y|X)=\int_\Theta P(y|X, \theta)P(\theta )d\theta \)

Posterior predictive distribution

Once we have calculated \(P(\theta |\mathbf y, \mathbf X)\), we can calculate the posterior predictive distribution for \(y\) at a new input \(\mathbf x\).

\(P(y|\mathbf x, \mathbf y, \mathbf X )=\int_\Theta P(y|\mathbf x, \theta)P(\theta |\mathbf y, \mathbf X)d\theta \)
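
A minimal sketch of the posterior predictive integral, approximated as a sum over a grid posterior (model and data made up, as in the sketches above). The prior predictive is the same sum with the prior in place of the posterior.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical model: y ~ N(theta * x, 1); grid posterior over theta as before.
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=1.0, size=30)

theta_grid = np.linspace(-1, 4, 501)
dtheta = theta_grid[1] - theta_grid[0]
prior = norm.pdf(theta_grid, loc=0, scale=5)
log_lik = np.array([norm.logpdf(y, loc=t * x, scale=1.0).sum() for t in theta_grid])
posterior = np.exp(log_lik - log_lik.max()) * prior
posterior /= posterior.sum() * dtheta

# P(y_new | x_new, y, X) ≈ sum_j P(y_new | x_new, theta_j) P(theta_j | y, X) dtheta
x_new = 2.0
y_new = np.linspace(-3, 8, 200)
pred = np.array([(norm.pdf(yv, loc=theta_grid * x_new, scale=1.0) * posterior).sum() * dtheta
                 for yv in y_new])
print(y_new[np.argmax(pred)])   # predictive density peaks near 1.5 * x_new = 3.0
```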

Classification

Classification

Classification models are a type of regression model, where \(y\) is discrete rather than continuous.

So we want to find a mapping from a vector \(X\) to probabilities across discrete \(y\) values.

A classifier with \(K\) classes takes \(X\) and returns a vector with one entry per class.

Multiclass classification

Multiclass classification

For example, an email might belong to one of several classes: work, friends, family or hobby.

Bayesian classifier

Classification risk

We can measure the risk of a classifier. This is the chance of misclassification.

\(R(C)=P(C(X)\ne Y)\)

The Bayesian classifier

This is the classifier \(C(X)\) which minimises the chance of misclassification.

It takes the output of the soft classifier and chooses the class with the highest probability \(P(Y=k|X=x)\).
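
A minimal sketch, with made-up class probabilities: take the argmax of the soft output, and estimate the risk as the misclassification rate against (assumed) true labels.

```python
import numpy as np

# Hypothetical soft classifier output P(Y = k | X = x) for K = 3 classes.
soft_output = np.array([
    [0.70, 0.20, 0.10],   # P(Y = k | x_1)
    [0.10, 0.25, 0.65],   # P(Y = k | x_2)
    [0.40, 0.45, 0.15],   # P(Y = k | x_3)
])

hard_labels = soft_output.argmax(axis=1)    # choose the most probable class
print(hard_labels)                          # [0, 2, 1]

# The risk R(C) = P(C(X) != Y) can be estimated as the misclassification rate
# against known labels (assumed here).
true_labels = np.array([0, 2, 2])
print((hard_labels != true_labels).mean())  # 1/3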

Point parameter estimation

Maximum Likelihood Estimation (MLE) for generative and discriminative models

The MLE is the value of \(\theta \) that maximises the likelihood function: \(\hat \theta_{MLE}=\arg\max_\theta L(\theta ;y, X)\). For a generative model the likelihood is \(P(y, X|\theta )\); for a discriminative model it is \(P(y|X, \theta )\).

Maximum A Posteriori estimation (MAP) for generative models

Bayesian regression for generative models

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}\)

The denominator is a normalisation factor, and so we can use:

\(P(\theta |y,X)\propto P(y, X| \theta)P(\theta)\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y,X)\)

  • Our likelihood function - \(P(y, X| \theta )\)

The MAP estimate is the value of \(\theta \) that maximises this posterior: \(\hat \theta_{MAP}=\arg\max_\theta P(y, X|\theta )P(\theta )\).
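
A minimal sketch with made-up Bernoulli data and a Beta-shaped prior: the MAP estimate is just the argmax of likelihood times prior on a grid.

```python
import numpy as np

# MAP estimation on a grid for a Bernoulli model (data and prior made up).
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
theta = np.linspace(0.001, 0.999, 999)
prior = theta * (1 - theta)                                   # proportional to Beta(2, 2)
likelihood = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())

theta_map = theta[np.argmax(likelihood * prior)]
print(theta_map)   # ~0.7, the mode of the implied Beta(8, 4) posterior
```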

Point predictions in discriminative models

Predictions and residuals

Introduction

Our data \((\mathbf y, \mathbf X)\) is divided into \((y_i, \mathbf x_i)\).

We create a function \(\hat y_i = f(\mathbf x_i)\).

The best predictor of \(y\) given \(X\), in the mean squared error sense, is:

\(g(X)=E[Y|X]\)

The goal of regression is to find an approximation to this function: we model \(P(y|X, \theta )\) and use it to make point predictions \(\hat y =f(\mathbf x)\).

Residuals

\(\epsilon_i = y_i- \hat y_i\)

Residual sum of squares (RSS)

\(RSS=\sum_i \epsilon_i^2\)

\(RSS=\sum_i (y_i-\hat y_i)^2\)

Explained sum of squares (ESS)

\(ESS=\sum_i (\bar y-\hat y_i)^2\)

Total sum of squares (TSS)

\(TSS=\sum_i (y_i-\bar y)^2\)

Hard and soft classifiers

A hard classifier returns a single class, which we can represent as a sparse (one-hot) vector with \(1\) in the relevant class.

A soft classifier returns probabilities for each entry in the vector.

The vector represents \(P(Y=k|X=x)\)

Transforming soft classifiers into hard classifiers

For binary classification we can use a cutoff.

If there are more than two classes we can choose the class with the highest probability.
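
A minimal sketch with made-up probabilities, using an assumed cutoff of \(0.5\) for the binary case and argmax for the multiclass case.

```python
import numpy as np

# Binary case: apply a cutoff (0.5 is assumed here) to P(Y = 1 | x).
p_class1 = np.array([0.92, 0.31, 0.55, 0.08])
binary_hard = (p_class1 >= 0.5).astype(int)         # [1, 0, 1, 0]

# Multiclass case: pick the class with the highest probability.
p_classes = np.array([[0.1, 0.6, 0.3],
                      [0.5, 0.2, 0.3]])
multiclass_hard = p_classes.argmax(axis=1)           # [1, 0]
print(binary_hard, multiclass_hard)
```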

Confusion matrix

A confusion matrix tabulates predicted classes against true classes. For binary classification the cells are true positives, false positives (Type I errors), false negatives (Type II errors) and true negatives.
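
A minimal sketch that counts the four cells from made-up hard predictions.

```python
import numpy as np

# Hypothetical hard predictions against true binary labels.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = ((y_pred == 1) & (y_true == 1)).sum()   # true positives
fp = ((y_pred == 1) & (y_true == 0)).sum()   # false positives (Type I errors)
fn = ((y_pred == 0) & (y_true == 1)).sum()   # false negatives (Type II errors)
tn = ((y_pred == 0) & (y_true == 0)).sum()   # true negatives

confusion = np.array([[tp, fp],
                      [fn, tn]])
print(confusion)   # [[3, 1], [1, 3]]
```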

Coefficient of determination (\(R^2\))

\(R^2= 1-\dfrac{RSS}{TSS}\)
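
A minimal sketch computing RSS, ESS, TSS and \(R^2\) from made-up observations and predictions.

```python
import numpy as np

# Hypothetical observations y and predictions y_hat.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.2])
y_bar = y.mean()

rss = ((y - y_hat) ** 2).sum()       # residual sum of squares
ess = ((y_hat - y_bar) ** 2).sum()   # explained sum of squares
tss = ((y - y_bar) ** 2).sum()       # total sum of squares

r_squared = 1 - rss / tss
print(rss, ess, tss, r_squared)
```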

Loss functions for point predictions

Minimum Mean Square Error (MMSE)

The MMSE estimate is the posterior mean.

We can do this for a parameter, or for a point prediction of \(y\).

Linear models

For linear models with Gaussian noise, MLE is equivalent to minimising squared error loss.

MAP with a Gaussian prior on the weights is equivalent to squared error loss with L2 regularisation.
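
A minimal sketch on made-up data, assuming Gaussian noise: the MLE is the least squares solution, and the MAP estimate with a Gaussian prior is the ridge solution, where \(\lambda \) is set by the prior variance (taken as \(1\) here purely for illustration).

```python
import numpy as np

# Hypothetical linear-Gaussian data: y = X w + noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=100)

# MLE / least squares: w = (X^T X)^{-1} X^T y
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP / ridge: w = (X^T X + lambda I)^{-1} X^T y
lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(w_mle, w_map)   # the MAP estimate is shrunk slightly towards zero
```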

Loss functions for regression

Binary classification loss functions

Hinge loss

Hinge loss: \(\max(0, 1-y\hat y)\), for labels \(y \in \{-1, +1\}\) and a real-valued score \(\hat y\).

Brier score

The Brier score is the mean squared difference between the predicted probability and the outcome: \(\frac{1}{N}\sum_i (\hat p_i - y_i)^2\).

Loss functions for hard classifiers

We don’t want predicted probabilities outside \(0\) and \(1\).
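
A minimal sketch computing both losses above on made-up scores, probabilities and labels.

```python
import numpy as np

# Hinge loss uses labels in {-1, +1} and a real-valued score (all values made up).
scores = np.array([2.0, -0.5, 0.3, -1.2])      # classifier scores
labels_pm = np.array([1, -1, -1, -1])          # labels as -1 / +1
hinge = np.maximum(0.0, 1.0 - labels_pm * scores).mean()

# Brier score uses labels in {0, 1} and a predicted probability.
probs = np.array([0.9, 0.4, 0.6, 0.2])         # predicted P(Y = 1 | x)
labels_01 = np.array([1, 0, 0, 0])             # labels as 0 / 1
brier = ((probs - labels_01) ** 2).mean()

print(hinge, brier)
```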

F score

F1 score

\(F_1\) score: \(\dfrac{2PR}{P+R}\), where \(P\) is precision and \(R\) is recall.

We may not just care about accuracy, for example in breast cancer screening.

High accuracy can result from a very basic model (for example, predicting that everyone died on the Titanic).
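
A minimal sketch computing precision, recall and \(F_1\) from made-up predictions.

```python
import numpy as np

# Hypothetical hard predictions against true binary labels.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = ((y_pred == 1) & (y_true == 1)).sum()
fp = ((y_pred == 1) & (y_true == 0)).sum()
fn = ((y_pred == 0) & (y_true == 1)).sum()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many we found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75, 0.75, 0.75
```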

Receiver Operating Characteristic (ROC) Area Under Curve (AUC)

The ROC curve plots the true positive rate against the false positive rate as the classification cutoff varies; the AUC is the area under this curve.
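
A minimal sketch using the probabilistic interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, counting ties as a half), on made-up scores and labels.

```python
import numpy as np

# Hypothetical classifier scores and true labels.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0])

pos = scores[labels == 1]
neg = scores[labels == 0]
pairs = pos[:, None] - neg[None, :]                  # all positive-negative score pairs
auc = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()
print(auc)                                           # 0.8125 here
```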

Other

Estimating other priors

Estimating \(P(k|T)\) - Which variables we split by, given the tree size

Estimating \(P(r|T, k)\) - The cutoff, given the tree size and the variables we are splitting by

Estimating \(P(\theta | T, k, r)\)