# Parametric models for dependent variables

## Introduction

### Generative and discriminative models

#### Recap

For parametric models without dependent variables we have a form:

$$P(y| \theta )$$

And we have various ways of estimating $$\theta$$.

We can write this as a likelihood function:

$$L(\theta ;y )=P(y|\theta)$$

#### Discriminative models

In discriminative models we learn:

$$P(y|X, \theta )$$

Which we can write as a likelihood function:

$$L(\theta ;y, X )=P(y| X, \theta)$$

#### Generative models

In generative models we learn:

$$P(y, X| \theta )$$

Which we can write as a likelihood function:

$$L(\theta ;y, X )=P(y, X|\theta)$$

We can use the generative model to calculate conditional probabilities.

$$P(y| X, \theta )=\dfrac{P(y, X| \theta )P(\theta )}{P(X, \theta )}$$

$$P(y| X, \theta )=\dfrac{P(y, X| \theta )}{P(X| \theta )}$$
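As a rough illustration of the last identity, the sketch below takes a small joint table and divides by the marginal over $$y$$. The table values are hypothetical and chosen only so the entries sum to one; `joint`, `marginal_x` and `conditional` are illustrative names.

```python
import numpy as np

# Hypothetical joint distribution P(y, X | theta) over 2 classes (rows)
# and 3 discrete feature values (columns); the entries sum to 1.
joint = np.array([
    [0.10, 0.20, 0.10],   # y = 0
    [0.30, 0.05, 0.25],   # y = 1
])

# Marginal P(X | theta): sum the joint over y.
marginal_x = joint.sum(axis=0)

# Conditional P(y | X, theta) = P(y, X | theta) / P(X | theta).
conditional = joint / marginal_x

print(conditional)
```

Each column of `conditional` sums to one, because it is a distribution over $$y$$ for a fixed value of $$X$$.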

## Bayesian parameter estimation

### Bayesian parameter estimation for dependent models

#### Recap

$$P(\theta |y)=\dfrac{P(y, \theta)}{P(y)}$$

$$P(\theta |y)=\dfrac{P(y| \theta)P(\theta )}{P(y)}$$

The bottom bit is a normalisation factor, and so we can use:

$$P(\theta |y)\propto P(y| \theta)P(\theta )$$

We have here:

• Our prior - $$P(\theta )$$

• Our posterior - $$P(\theta |y)$$

• Our likelihood function - $$P(y| \theta)$$

#### Bayesian regression for generative models

We know:

$$P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}$$

$$P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}$$

The bottom bit is a normalisation factor, and so we can use:

$$P(\theta |y,X)\propto P(y, X| \theta)P(\theta)$$

We have here:

• Our prior - $$P(\theta )$$

• Our posterior - $$P(\theta |y,X)$$

• Our likelihood function - $$P(y, X| \theta )$$

#### Bayesian regression for discriminative models

We know:

$$P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}$$

$$P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta, X)}{P(y, X)}$$

$$P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X|\theta )}{P(y, X)}$$

We assume $$X$$ is independent of $$\theta$$, so $$P(X|\theta )=P(X)$$, and so:

$$P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X)}{P(y, X)}$$

The bottom bit is a normalisation factor, and $$P(X)$$ in the numerator does not depend on $$\theta$$, and so we can use:

$$P(\theta |y,X)\propto P(y| X, \theta)P(\theta)$$

We have here:

• Our prior - $$P(\theta )$$

• Our posterior - $$P(\theta |y,X)$$

• Our likelihood function - $$P(y| X, \theta )$$
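A minimal sketch of this proportionality, assuming a one-parameter model $$y=\theta x+\epsilon$$ with Gaussian noise of known standard deviation 1 and a wide Gaussian prior. The data, the prior scale and the grid are all assumptions made for illustration. The posterior is evaluated on a grid and normalised numerically, which plays the role of dividing by the normalisation factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2 * x + Gaussian noise with known sd 1.
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(0.0, 1.0, size=50)

# Grid of candidate values for the single parameter theta (the slope).
theta_grid = np.linspace(-5.0, 5.0, 2001)

# Log prior: theta ~ Normal(0, 10^2), up to an additive constant.
log_prior = -0.5 * (theta_grid / 10.0) ** 2

# Log likelihood log P(y | X, theta) for each grid value, up to a constant.
residuals = y[None, :] - theta_grid[:, None] * x[None, :]
log_lik = -0.5 * np.sum(residuals ** 2, axis=1)

# Posterior is proportional to likelihood times prior; normalise on the grid.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum()

posterior_mean = np.sum(theta_grid * post)
print(posterior_mean)
```

Working with log probabilities and subtracting the maximum before exponentiating avoids numerical underflow when there are many observations.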

### Prior and posterior predictive distributions for dependent variables

#### Prior predictive distribution

Our prior predictive distribution for $$P(y|X)$$ depends on our prior for $$\theta$$.

$$P(y|X)=\int_\Theta P(y|X, \theta)P(\theta )d\theta$$

#### Posterior predictive distribution

Once we have calculated $$P(\theta |\mathbf y, \mathbf X)$$, we can calculate a posterior predictive distribution for a new $$y$$ given a new input $$\mathbf x$$.

$$P(y|\mathbf x, \mathbf y, \mathbf X )=\int_\Theta P(y|\mathbf x, \theta)P(\theta |\mathbf y, \mathbf X)d\theta$$
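The integral can be approximated by sampling. The sketch below assumes a conjugate setup chosen only for illustration: $$y=\theta x+\epsilon$$ with known noise standard deviation `sigma` and a Gaussian prior with standard deviation `tau`, so that the posterior for $$\theta$$ is Gaussian and the posterior predictive has a closed form to compare against. All data and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 0.5   # assumed known noise sd
tau = 2.0     # assumed prior sd for theta

# Hypothetical training data (y, X) generated with slope 1.5.
X_train = rng.uniform(-2.0, 2.0, size=40)
y_train = 1.5 * X_train + rng.normal(0.0, sigma, size=40)

# Conjugate posterior P(theta | y, X) = Normal(post_mean, post_var).
post_precision = np.sum(X_train ** 2) / sigma ** 2 + 1.0 / tau ** 2
post_var = 1.0 / post_precision
post_mean = post_var * np.sum(X_train * y_train) / sigma ** 2

# Monte Carlo estimate of the integral: draw theta from the posterior,
# then draw y from P(y | x_new, theta).
x_new = 1.0
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=200_000)
y_samples = rng.normal(theta_samples * x_new, sigma)

# Closed-form posterior predictive for comparison.
pred_mean = post_mean * x_new
pred_var = x_new ** 2 * post_var + sigma ** 2

print(y_samples.mean(), pred_mean)
print(y_samples.var(), pred_var)
```

The predictive variance has two parts: uncertainty about $$\theta$$ (the `post_var` term) and the noise in $$y$$ itself (the `sigma ** 2` term).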

## Classification

### Classification

Classification models are a type of regression model, where $$y$$ is discrete rather than continuous.

So we want to find a mapping from a vector $$X$$ to probabilities across discrete $$y$$ values.

A classifier takes $$X$$ and returns a vector.

For a classifier we have $$K$$ classes.

### Multiclass classification

What if there are more than two classes? For example, an email could be for work, friends, family or a hobby.

### Bayesian classifier

#### Classification risk

We can measure the risk of a classifier. This is the chance of misclassification.

$$R(C)=P(C(X)\ne Y)$$

#### The Bayesian classifier

This is the classifier $$C(X)$$ which minimises the chance of misclassification.

It takes the output of the soft classifier and chooses the class with the highest probability.
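To make this concrete, the sketch below uses a small, hypothetical joint distribution over three values of $$X$$ and two classes, builds the classifier that picks the most probable class for each $$x$$, and checks its risk against every other deterministic classifier. The table and the helper `risk` are illustrative assumptions.

```python
import itertools
import numpy as np

# Hypothetical joint distribution P(X = x, Y = k): rows are 3 values of x,
# columns are K = 2 classes. Entries sum to 1.
joint = np.array([
    [0.30, 0.10],
    [0.05, 0.25],
    [0.20, 0.10],
])

def risk(classifier):
    """R(C) = P(C(X) != Y) for a classifier given as one class label per x."""
    correct = sum(joint[x, k] for x, k in enumerate(classifier))
    return 1.0 - correct

# For each x pick the class with the highest P(Y = k | X = x).
# Dividing by P(X = x) does not change the argmax, so the joint is used directly.
bayes = tuple(int(k) for k in joint.argmax(axis=1))

# Compare against every possible deterministic classifier.
n_x, n_classes = joint.shape
all_classifiers = list(itertools.product(range(n_classes), repeat=n_x))
best_risk = min(risk(c) for c in all_classifiers)

print(bayes, risk(bayes), best_risk)
```

In practice the true probabilities are unknown, so the output of a fitted soft classifier is used in their place.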

## Point parameter estimation

### Maximum A Posteriori estimation (MAP) for generative models

#### Bayesian regression for generative models

We know:

$$P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}$$

$$P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}$$

The bottom bit is a normalisation factor, and so we can use:

$$P(\theta |y,X)\propto P(y, X| \theta)P(\theta)$$

We have here:

• Our prior - $$P(\theta )$$

• Our posterior - $$P(\theta |y,X)$$

• Our likelihood function - $$P(y, X| \theta )$$
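As a sketch of how this gives a point estimate, the MAP estimate is the value of $$\theta$$ at which the posterior is largest. The normalisation factor does not depend on $$\theta$$, so it can be dropped. The notation $$\hat \theta_{MAP}$$ is introduced here for illustration.

$$\hat \theta_{MAP}=\arg\max_\theta P(\theta |y,X)$$

$$\hat \theta_{MAP}=\arg\max_\theta P(y, X| \theta)P(\theta)$$

Taking logs does not move the maximum, so equivalently:

$$\hat \theta_{MAP}=\arg\max_\theta [\log P(y, X| \theta)+\log P(\theta)]$$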

## Point predictions in discriminative models

### Predictions and residuals

#### Introduction

Our data $$(\mathbf y, \mathbf X)$$ is divided into $$(y_i, \mathbf x_i)$$.

We create a function $$\hat y_i = f(\mathbf x_i)$$.

Under squared-error loss, the best predictor of $$y$$ given $$X$$ is the conditional expectation:

$$g(X)=E[Y|X]$$

The goal of regression is to find an approximation of this function.

$$P(y|X, \theta )$$

$$\hat y =f(\mathbf x)$$

#### Residuals

$$\epsilon_i = y_i- \hat y_i$$

#### Residual sum of squares (RSS)

$$RSS=\sum_i \epsilon_i^2$$

$$RSS=\sum_i (y_i-\hat y_i)^2$$

#### Explained sum of squares (ESS)

$$ESS=\sum_i (\bar y-\hat y_i)^2$$

#### Total sum of squares (TSS)

$$TSS=\sum_i (y_i-\bar y)^2$$
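The three sums can be computed directly from a fitted model. The sketch below assumes a straight-line fit by least squares on hypothetical data. For least squares with an intercept, the total splits exactly into explained plus residual, which the final line checks.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with a roughly linear relationship.
x = np.linspace(0.0, 10.0, 30)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, size=30)

# Fit a straight line by least squares; np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
y_bar = y.mean()

rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ess = np.sum((y_bar - y_hat) ** 2)  # explained sum of squares
tss = np.sum((y - y_bar) ** 2)      # total sum of squares

print(rss, ess, tss, np.isclose(tss, ess + rss))
```

The decomposition is not guaranteed for other fitting methods or for models without an intercept.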

### Hard and soft classifiers

A hard classifier returns a sparse vector with $$1$$ in the entry for the predicted class and $$0$$ elsewhere.

A soft classifier returns probabilities for each entry in the vector.

The vector represents $$P(Y=k|X=x)$$

#### Transforming soft classifiers into hard classifiers

For two classes we can use a cutoff on the predicted probability.

If there are more than two classes we can choose the one with the highest score.
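Both conversions can be sketched in a few lines. The function names `harden_binary` and `harden_multiclass`, the default cutoff of 0.5 and the example probabilities are assumptions for illustration.

```python
import numpy as np

def harden_binary(p_positive, cutoff=0.5):
    """Return 1 where P(Y=1|X=x) is at least the cutoff, else 0."""
    return (np.asarray(p_positive) >= cutoff).astype(int)

def harden_multiclass(probs):
    """Turn rows of class probabilities into one-hot rows via argmax."""
    probs = np.asarray(probs)
    one_hot = np.zeros_like(probs, dtype=int)
    one_hot[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1
    return one_hot

binary = harden_binary([0.2, 0.7, 0.5])
multi = harden_multiclass([[0.1, 0.6, 0.3],
                           [0.5, 0.2, 0.3]])

print(binary)
print(multi)
```

The cutoff need not be 0.5. Moving it trades one type of error for the other.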

### Confusion matrix

Include error types here

### Coefficient of determination ($$R^2$$)

$$R^2= 1-\dfrac{RSS}{TSS}$$
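Two reference points help when reading this number. The sketch below, with a hypothetical helper `r_squared` and made-up values, shows that perfect predictions give $$1$$ and that always predicting the mean of $$y$$ gives $$0$$.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

y = np.array([1.0, 2.0, 4.0, 7.0])

perfect = r_squared(y, y)                            # predictions equal the data
baseline = r_squared(y, np.full_like(y, y.mean()))   # always predict the mean

print(perfect, baseline)
```

A model that predicts worse than the mean gives a negative value.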

## Loss functions for point predictions

### Minimum Mean Square Error (MMSE)

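The MMSE estimate is the mean of the posterior distribution.

This can be done for a parameter, or for a predicted value of $$y$$.

For linear models with Gaussian noise:

MLE is the same as minimising squared-error loss.

MAP with a Gaussian prior is the same as minimising squared-error loss with $$L_2$$ regularisation.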
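A minimal sketch of the two equivalences for a linear model, assuming Gaussian noise with standard deviation `sigma` and an independent zero-mean Gaussian prior with standard deviation `tau` on each coefficient. Under those assumptions the MLE is the ordinary least squares solution and the MAP estimate is the ridge solution with penalty `sigma ** 2 / tau ** 2`. The data and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical design matrix and targets.
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
sigma = 1.0   # assumed noise sd
tau = 0.5     # assumed prior sd for each coefficient
y = X @ true_theta + rng.normal(0.0, sigma, size=100)

# MLE under Gaussian noise: minimise the sum of squared errors.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior theta ~ Normal(0, tau^2 I): squared error plus an L2
# penalty with weight lam = sigma^2 / tau^2.
lam = sigma ** 2 / tau ** 2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior_grad(theta):
    """Gradient of the negative log posterior, multiplied by sigma^2."""
    return X.T @ (X @ theta - y) + lam * theta

print(theta_mle)
print(theta_map)
print(neg_log_posterior_grad(theta_map))
```

The gradient is numerically zero at `theta_map`, and the MAP coefficients are pulled towards zero relative to the MLE.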

### Loss functions for hard classifiers

We don’t want predictions outside the range $$0$$ to $$1$$.

#### F1 score

$$F_1$$ score: $$\dfrac{2PR}{(P+R)}$$, where $$P$$ is precision and $$R$$ is recall.

We may not just care about accuracy, e.g. in breast cancer screening.

High accuracy can result from a very basic model (e.g. predicting that everyone died on the Titanic).
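The gap between accuracy and $$F_1$$ can be seen with made-up counts. The numbers below are hypothetical and are not real Titanic or screening figures. The positive class is taken to be the rare one, and returning $$0$$ when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def accuracy(tp, fp, fn, tn):
    """Share of all cases that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical: 100 cases, 10 positives, model always predicts negative.
trivial = dict(tp=0, fp=0, fn=10, tn=90)
acc_trivial = accuracy(**trivial)
_, _, f1_trivial = precision_recall_f1(trivial["tp"], trivial["fp"], trivial["fn"])

# Hypothetical: a model that finds 6 of the 10 positives with 4 false alarms.
useful = dict(tp=6, fp=4, fn=4, tn=86)
acc_useful = accuracy(**useful)
_, _, f1_useful = precision_recall_f1(useful["tp"], useful["fp"], useful["fn"])

print(acc_trivial, f1_trivial)
print(acc_useful, f1_useful)
```

The two models have similar accuracy, but only the second has a non-zero $$F_1$$ score.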