Point variable estimates for discriminative models

Predictions and residuals


Our data \((\mathbf y, \mathbf X)\) consists of observations \((y_i, \mathbf x_i)\), one for each \(i\).

We create a function \(f\) which gives a prediction \(\hat y_i = f(\mathbf x_i)\) for each observation.

Under squared error loss, the best predictor of \(y\) given \(\mathbf x\) is the conditional expectation:

\(f(\mathbf x) = E[y|\mathbf x]\)

The goal of regression is to find an approximation of this function.


The residual is the difference between the observed value and the prediction: \(\epsilon_i = y_i - \hat y_i\)

Residual sum of squares (RSS)

\(RSS=\sum_i \epsilon_i^2\)

\(RSS=\sum_i (y_i-\hat y_i)^2\)

Explained sum of squares (ESS)

\(ESS=\sum_i (\bar y-\hat y_i)^2\)

Total sum of squares (TSS)

\(TSS=\sum_i (y_i-\bar y)^2\)

Relationship between prediction and probability distribution

\(P(y|X, \theta )\)

\(\hat y =f(\mathbf x)\)

The point prediction can be taken as the mean of the distribution, found through integration:

\(E[y|X, \theta ] = \int y P(y|X, \theta ) dy\)
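As a rough numerical check of this relationship, the sketch below takes a normal density for \(y\) and recovers the point prediction as the integral of \(y\) times the density, using a trapezoidal approximation. The choice of density, its mean and its standard deviation are assumptions made only for illustration.

```python
import math

def normal_pdf(y, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma.
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def expected_value(pdf, lo, hi, steps=20000):
    # Trapezoidal approximation of the integral of y * pdf(y) over [lo, hi].
    h = (hi - lo) / steps
    total = 0.5 * (lo * pdf(lo) + hi * pdf(hi))
    for i in range(1, steps):
        yv = lo + i * h
        total += yv * pdf(yv)
    return total * h

mu, sigma = 2.0, 1.5  # hypothetical parameters of P(y | x, theta)
y_hat = expected_value(lambda y: normal_pdf(y, mu, sigma),
                       mu - 10 * sigma, mu + 10 * sigma)
print(y_hat)  # close to mu
```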

Coefficient of determination (\(R^2\))

\(R^2= 1-\dfrac{RSS}{TSS}\)
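A minimal sketch of these quantities, assuming a straight line fitted by ordinary least squares to made-up data. For a least squares line with an intercept, \(TSS = ESS + RSS\), which the last line checks.

```python
# Toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares fit of y = a + b * x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # residual sum of squares
ess = sum((y_bar - yhi) ** 2 for yhi in y_hat)           # explained sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
r_squared = 1 - rss / tss

print(rss, ess, tss, r_squared)
print(abs(ess + rss - tss) < 1e-9)
```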


Binary classification

Classification models are a type of regression model, where \(y\) is discrete rather than continuous.

So we want to find a mapping from a vector \(X\) to probabilities across discrete \(y\) values.

A classifier takes \(X\) and returns a vector with one entry for each of the \(K\) classes.


The confusion matrix counts true positives, false positives, false negatives and true negatives.

We can use these counts to get:

Accuracy: percentage of all predictions which are correct

Precision: percentage of positive predictions which are correct

Recall (sensitivity): percentage of positive cases that were predicted as positive

Specificity: percentage of negative cases that were predicted as negative
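The four counts and the measures built from them can be sketched as follows. The labels and predictions are made up, with \(1\) as the positive class.

```python
# Made-up true labels and hard predictions; 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, specificity)
```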

Multiclass classification

What if there are more than two classes? For example, an email could be for work, friends, family or a hobby.

Confusion matrix

Include error types here

Hard and soft classifiers

A hard classifier returns a sparse vector with \(1\) in the entry for the predicted class and \(0\) elsewhere.

A soft classifier returns probabilities for each entry in the vector.

The entries of the vector represent \(P(Y=k|X=x)\) for each class \(k\).

Transforming soft classifiers into hard classifiers

For two classes we can use a cutoff: predict the positive class if its probability is at or above the cutoff.

If there are more than two classes we can choose the one with the highest probability.
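Both rules can be sketched in a few lines. The function names and the default cutoff of \(0.5\) are assumptions for illustration.

```python
def harden_binary(p_positive, cutoff=0.5):
    # Predict the positive class when its probability reaches the cutoff.
    return 1 if p_positive >= cutoff else 0

def harden_multiclass(probs):
    # Return a one-hot vector with 1 at the class with the highest probability.
    best = max(range(len(probs)), key=lambda k: probs[k])
    return [1 if k == best else 0 for k in range(len(probs))]

print(harden_binary(0.7))              # cutoff 0.5
print(harden_binary(0.7, cutoff=0.8))  # stricter cutoff
print(harden_multiclass([0.1, 0.6, 0.3]))
```

Raising the cutoff makes the classifier predict the positive class less often, which trades recall for precision.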

Loss functions for point predictions

Minimum Mean Square Error (MMSE)

The estimate which minimises the mean square error is the mean of the distribution.

We can do this for a parameter, or for a prediction of \(y\).

Linear models

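With Gaussian noise, MLE is the same as minimising squared error loss.

With a Gaussian prior on the parameters as well, MAP is the same as minimising squared error loss with \(L_2\) regularisation.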
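As a sketch of why this holds, assume (this is an assumption, not part of the notes above) that \(y_i = \hat y_i + \epsilon_i\) with independent noise \(\epsilon_i \sim N(0, \sigma^2)\), where \(\hat y_i = f(\mathbf x_i)\) depends on the parameters \(\theta\). The log-likelihood is then:

\(\log P(\mathbf y|\mathbf X, \theta ) = -\dfrac{n}{2}\log (2\pi \sigma^2) - \dfrac{1}{2\sigma^2}\sum_i (y_i-\hat y_i)^2\)

Only the last term depends on \(\theta\), so maximising the likelihood is the same as minimising \(\sum_i (y_i-\hat y_i)^2\).

If we also assume a prior \(\theta \sim N(0, \tau^2 I)\), the log-posterior is, up to a constant:

\(-\dfrac{1}{2\sigma^2}\sum_i (y_i-\hat y_i)^2 - \dfrac{1}{2\tau^2}\sum_j \theta_j^2\)

Maximising this is the same as minimising \(\sum_i (y_i-\hat y_i)^2 + \lambda \sum_j \theta_j^2\) with \(\lambda = \sigma^2/\tau^2\).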

Loss functions for soft classifiers

Hinge loss

Brier score
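For a binary outcome, the Brier score is the mean squared difference between the predicted probability and the outcome coded as \(0\) or \(1\). A minimal sketch with made-up predictions:

```python
def brier_score(probs, outcomes):
    # Mean squared difference between predicted probability and 0/1 outcome.
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

probs = [0.9, 0.2, 0.6, 0.4]  # made-up predicted probabilities of the positive class
outcomes = [1, 0, 1, 1]       # made-up observed outcomes
score = brier_score(probs, outcomes)
print(score)
```

Lower is better: a perfect soft classifier scores \(0\), and always predicting \(0.5\) scores \(0.25\).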

Loss functions for hard classifiers

Don’t want answers outside \(0\) and \(1\).

F score

F1 score

\(F_1=\dfrac{2PR}{P+R}\), where \(P\) is precision and \(R\) is recall.

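We may not just care about accuracy. For example, in breast cancer screening a missed case (a false negative) can be much more costly than a false alarm.

High accuracy can result from a very basic model, for example one which predicts that everyone died on the Titanic.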
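The sketch below illustrates the point with made-up imbalanced data: a model that always predicts the negative class is 90% accurate but has an \(F_1\) score of zero. Returning \(0\) when there are no true positives is a convention assumed here, since precision and recall are then zero or undefined.

```python
def accuracy_score(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # assumed convention when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up data: 10 positive cases out of 100.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100  # very basic model: always predict negative

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(acc, f1)
```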

Receiver Operating Characteristic (ROC) Area Under Curve (AUC)
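The area under the ROC curve equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition by comparing every positive and negative pair. The scores and labels are made up, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]  # made-up soft classifier outputs
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

A value of \(1\) means every positive case is ranked above every negative case, and \(0.5\) is what random scores would give on average.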


Estimating other priors

Estimating \(P(k|T)\) - Which variables we split by, given the tree size

Estimating \(P(r|T, k)\) - The cutoff, given the tree size and the variables we are splitting by

Estimating \(P(\theta | T, k, r)\)

Maximum Likelihood Estimation (MLE) for generative and discriminative models

Maximum a posteriori estimation (MAP) for generative models

Bayesian regression for generative models

We know:

\(P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}\)

\(P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}\)

The denominator is a normalisation factor which does not depend on \(\theta\), and so we can use:

\(P(\theta |y,X)\propto P(y, X| \theta)P(\theta)\)

We have here:

  • Our prior - \(P(\theta )\)

  • Our posterior - \(P(\theta |y,X)\)

  • Our likelihood function - \(P(y, X| \theta )\)
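The proportionality above can be sketched numerically on a grid. This simplified example is an assumption made for illustration: it has no \(X\), uses a single parameter \(\theta\) (the success probability of a Bernoulli outcome), a flat prior and made-up data. The posterior is the likelihood times the prior, rescaled to sum to one.

```python
successes, trials = 7, 10  # made-up data

grid = [i / 100 for i in range(101)]  # candidate values of theta
prior = [1.0 for _ in grid]           # flat prior (assumed)
likelihood = [t ** successes * (1 - t) ** (trials - successes) for t in grid]

# Posterior is proportional to likelihood times prior.
unnormalised = [l * p for l, p in zip(likelihood, prior)]
total = sum(unnormalised)             # plays the role of the normalisation factor
posterior = [u / total for u in unnormalised]

# Point estimates read off the grid posterior.
theta_map = grid[max(range(len(grid)), key=lambda i: posterior[i])]
theta_mean = sum(t * p for t, p in zip(grid, posterior))
print(theta_map, theta_mean)
```

With a flat prior the posterior mode coincides with the maximum likelihood estimate. A prior that is not flat would pull it away from that value.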

Bayesian classifier

Classification risk

We can measure the risk of a classifier. This is the chance of misclassification.

\(R(C)=P(C(X)\ne Y)\)

The Bayesian classifier

This is the classifier \(C(X)\) which minimises the chance of misclassification.

It takes the output of the soft classifier and chooses the class with the highest probability.
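A small sketch, assuming the true distributions are known (in practice they have to be estimated). The three input values and their probabilities are made up. The classifier picks the most probable class for each input, and its risk is compared with that of a classifier which always predicts class \(0\).

```python
# Hypothetical known distributions: P(X = x) and P(Y = k | X = x) for k = 0, 1.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
p_y_given_x = {"a": [0.9, 0.1], "b": [0.4, 0.6], "c": [0.2, 0.8]}

def bayes_classifier(x):
    # Choose the class with the highest conditional probability.
    probs = p_y_given_x[x]
    return max(range(len(probs)), key=lambda k: probs[k])

def risk(classifier):
    # R(C) = P(C(X) != Y) = sum over x of P(x) * (1 - P(Y = C(x) | X = x)).
    return sum(p_x[x] * (1 - p_y_given_x[x][classifier(x)]) for x in p_x)

bayes_risk = risk(bayes_classifier)
always_zero_risk = risk(lambda x: 0)
print(bayes_risk, always_zero_risk)
```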