Bayesian parameter estimation of discriminative models

Introduction

Generative and discriminative models

Recap

For parametric models without independent variables we have a model of the form:

\(P(y| \theta )\)

And we have various ways of estimating \(\theta\).

We can write this as a likelihood function:

\(L(\theta ;y )=P(y|\theta)\)
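
As a concrete illustration, here is a minimal Python sketch of such a likelihood, assuming the observations \(y\) are i.i.d. Bernoulli draws; the data and candidate values of \(\theta\) are hypothetical.

```python
# A minimal sketch, assuming y is a vector of hypothetical coin flips
# modelled as i.i.d. Bernoulli(theta): L(theta; y) = P(y | theta) is the
# probability of the observed data, viewed as a function of theta.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])  # hypothetical observations

def likelihood(theta, y):
    # P(y | theta) = prod_i theta^{y_i} (1 - theta)^{1 - y_i}
    return np.prod(theta ** y * (1 - theta) ** (1 - y))

for theta in (0.3, 0.5, 0.7):
    print(f"L({theta}; y) = {likelihood(theta, y):.5f}")
```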

Discriminative models

In discriminative models we learn:

\(P(y|X, \theta )\)

Which we can write as a likelihood function:

\(L(\theta ;y, X )=P(y| X, \theta)\)
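
A minimal sketch of a discriminative likelihood, assuming a logistic regression model \(P(y=1|x, \theta)=\sigma(\theta^\top x)\); the design matrix, labels, and parameter values are hypothetical.

```python
# A minimal sketch of a discriminative likelihood, assuming a logistic
# regression model: P(y=1 | x, theta) = sigmoid(theta . x). The design
# matrix, labels, and parameter values are hypothetical.
import numpy as np

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])  # rows are observations
y = np.array([1, 0, 1])

def likelihood(theta, y, X):
    # P(y | X, theta) = prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.prod(p ** y * (1 - p) ** (1 - y))

print(likelihood(np.array([0.1, 0.8]), y, X))
```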

Generative models

In generative models we learn:

\(P(y, X| \theta )\)

Which we can write as a likelihood function:

\(L(\theta ;y, X )=P(y, X|\theta)\)

We can use the generative model to calculate conditional probabilities:

\(P(y| X, \theta )=\dfrac{P(y, X| \theta )P(\theta )}{P(X, \theta )}\)

\(P(y| X, \theta )=\dfrac{P(y, X| \theta )}{P(X| \theta )}\)
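
A minimal sketch of this, assuming a toy generative model where \(y\sim\text{Bernoulli}(\pi)\) and \(x|y\sim\mathcal{N}(\mu_y, 1)\), with \(\theta =(\pi, \mu_0, \mu_1)\); normalising the joint over \(y\) recovers the conditional exactly as derived above. All values are hypothetical.

```python
# A minimal sketch, assuming a toy generative model with scalar x:
# y ~ Bernoulli(pi), x | y ~ Normal(mu_y, 1), theta = (pi, mu_0, mu_1).
# Normalising the joint over y recovers P(y | x, theta) as derived above.
from scipy.stats import norm

theta = {"pi": 0.4, "mu": [0.0, 2.0]}  # hypothetical parameter values

def joint(y, x, theta):
    # P(y, x | theta) = P(y | theta) P(x | y, theta)
    prior_y = theta["pi"] if y == 1 else 1 - theta["pi"]
    return prior_y * norm.pdf(x, loc=theta["mu"][y], scale=1.0)

def conditional(y, x, theta):
    # P(y | x, theta) = P(y, x | theta) / P(x | theta)
    evidence = joint(0, x, theta) + joint(1, x, theta)
    return joint(y, x, theta) / evidence

print(conditional(1, 1.5, theta))
```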

Bayesian parameter estimation

Bayesian parameter estimation for dependent models

Recap

For non-dependent models we had:

\(P(\theta |y)=\dfrac{P(y, \theta)}{P(y)}\)

\(P(\theta |y)=\dfrac{P(y| \theta)P(\theta )}{P(y)}\)

The denominator is a normalisation factor that does not depend on \(\theta\), and so we can use:

\(P(\theta |y)\propto P(y| \theta)P(\theta )\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y)\)

  • Our likelihood function - \(P(y| \theta)\)
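
A minimal sketch of this proportionality, assuming Bernoulli data with a hypothetical Beta(2, 2) prior, approximating the posterior on a grid over \(\theta\).

```python
# A minimal sketch of P(theta | y) ∝ P(y | theta) P(theta) on a grid,
# assuming Bernoulli data and a hypothetical Beta(2, 2) prior.
import numpy as np
from scipy.stats import beta

y = np.array([1, 0, 1, 1, 0, 1])         # hypothetical observations
thetas = np.linspace(0.001, 0.999, 999)  # grid over parameter space

prior = beta.pdf(thetas, 2, 2)
likelihood = np.array([np.prod(t ** y * (1 - t) ** (1 - y)) for t in thetas])

unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()  # normalise over the grid

print("posterior mean of theta:", (thetas * posterior).sum())
```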

Bayesian parameter estimation for generative models

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}\)

The denominator is a normalisation factor that does not depend on \(\theta\), and so we can use:

\(P(\theta |y,X)\propto P(y, X| \theta)P(\theta)\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y,X)\)

  • Our likelihood function - \(P(y, X| \theta )\)
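
A minimal sketch for the generative case, reusing the toy model above (\(y\sim\text{Bernoulli}(\pi)\), \(x|y\sim\mathcal{N}(\mu_y, 1)\)) with the means fixed and a uniform prior on the unknown \(\pi\); note the likelihood here is the joint \(P(y, X|\pi)\). All data are hypothetical.

```python
# A minimal sketch for the generative case, reusing the toy model above:
# y ~ Bernoulli(pi), x | y ~ Normal(mu_y, 1), with mu fixed and pi unknown.
# The likelihood is the joint P(y, X | pi); the prior on pi is uniform.
import numpy as np
from scipy.stats import norm

y = np.array([1, 0, 1, 1])
x = np.array([1.8, -0.3, 2.2, 1.1])  # hypothetical paired observations
mu = np.array([0.0, 2.0])            # fixed class means

pis = np.linspace(0.001, 0.999, 999)

def joint_likelihood(pi):
    # P(y, X | pi) = prod_i P(y_i | pi) P(x_i | y_i)
    prior_y = np.where(y == 1, pi, 1 - pi)
    return np.prod(prior_y * norm.pdf(x, loc=mu[y], scale=1.0))

unnorm = np.array([joint_likelihood(p) for p in pis])  # uniform prior drops out
posterior = unnorm / unnorm.sum()
print("posterior mean of pi:", (pis * posterior).sum())
```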

Bayesian parameter estimation for discriminative models

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta, X)}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X|\theta )}{P(y, X)}\)

We assume that \(X\) does not depend on \(\theta\), that is \(P(X|\theta )=P(X)\), and so:

\(P(\theta |y,X)=\dfrac{P(y| \theta, X )P(\theta )P(X)}{P(y, X)}\)

Both \(P(X)\) and the denominator are constant with respect to \(\theta\), and so we can use:

\(P(\theta |y,X)\propto P(y| X, \theta)P(\theta)\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y,X)\)

  • Our likelihood function - \(P(y| X, \theta )\)
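
A minimal sketch for the discriminative case, assuming a one-weight logistic regression with a standard normal prior on the weight, again using a grid approximation; the data are hypothetical.

```python
# A minimal sketch for the discriminative case: logistic regression with a
# single weight w, a standard normal prior, and a grid approximation of
# P(w | y, X) ∝ P(y | X, w) P(w). Data are hypothetical.
import numpy as np
from scipy.stats import norm

x = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])
y = np.array([0, 0, 1, 1, 1])

ws = np.linspace(-5, 5, 1001)

def likelihood(w):
    # P(y | X, w) for the logistic model P(y=1 | x, w) = sigmoid(w * x)
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.prod(p ** y * (1 - p) ** (1 - y))

unnorm = np.array([likelihood(w) for w in ws]) * norm.pdf(ws, 0, 1)
posterior = unnorm / unnorm.sum()
print("posterior mean of w:", (ws * posterior).sum())
```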

Prior and posterior predictive distributions for dependent variables

Prior predictive distribution

Our prior predictive distribution for \(P(y|X)\) depends on our prior for \(\theta\).

\(P(y|X)=\int_\Theta P(y|X, \theta)P(\theta )d\theta\)
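
A minimal sketch of this integral by Monte Carlo, assuming the hypothetical one-weight logistic model from the previous sketch: draw \(\theta\) from the prior and average \(P(y|x, \theta)\) over the draws.

```python
# A minimal sketch of the prior predictive by Monte Carlo, assuming the
# one-weight logistic model with a standard normal prior on w: draw w from
# the prior and average P(y=1 | x, w) over the draws.
import numpy as np

rng = np.random.default_rng(0)
w_draws = rng.normal(0.0, 1.0, size=10_000)  # w ~ prior

def prior_predictive(x_new):
    # P(y=1 | x) ≈ mean over prior draws of sigmoid(w * x)
    return np.mean(1.0 / (1.0 + np.exp(-w_draws * x_new)))

print(prior_predictive(1.5))
```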

Posterior predictive distribution

Once we have calculated the posterior \(P(\theta |\mathbf y, \mathbf X)\) from training data \(\mathbf X\) and \(\mathbf y\), we can calculate a posterior predictive distribution for a new input \(\mathbf x\).

\(P(y|\mathbf x, \mathbf y, \mathbf X )=\int_\Theta P(y|\mathbf x, \theta)P(\theta |\mathbf y, \mathbf X)d\theta\)
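
A minimal sketch of the posterior predictive for the same hypothetical one-weight logistic model: weight \(P(y=1|\mathbf x, w)\) by the grid posterior over \(w\) and sum over the grid to approximate the integral.

```python
# A minimal sketch of the posterior predictive for the hypothetical
# one-weight logistic model: recompute the grid posterior over w, then
# weight P(y=1 | x_new, w) by it and sum over the grid.
import numpy as np
from scipy.stats import norm

x = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])
y = np.array([0, 0, 1, 1, 1])
ws = np.linspace(-5, 5, 1001)

def likelihood(w):
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.prod(p ** y * (1 - p) ** (1 - y))

unnorm = np.array([likelihood(w) for w in ws]) * norm.pdf(ws, 0, 1)
posterior = unnorm / unnorm.sum()

def posterior_predictive(x_new):
    # P(y=1 | x_new, y, X) = sum over grid of P(y=1 | x_new, w) P(w | y, X)
    p = 1.0 / (1.0 + np.exp(-ws * x_new))
    return (p * posterior).sum()

print(posterior_predictive(1.5))
```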