Point variable estimates for discriminative models

Predictions and residuals

Predictions

Our data $$(\mathbf y, \mathbf X)$$ consists of pairs $$(y_i, \mathbf x_i)$$.

We create a function $$\hat y_i = f(\mathbf x_i)$$.

The best predictor of $$Y$$ given $$X$$, in the mean squared error sense, is the conditional expectation:

$$g(X)=E[Y|X]$$

The goal of regression is to find an approximation of this function.
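As a minimal sketch (the data here is made up for illustration), we can approximate $$E[Y|X]$$ with a linear function fitted by least squares:

```python
import numpy as np

# Hypothetical data: y = 2x + 1 plus noise, so E[Y|X=x] = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + 1 + rng.normal(0, 1, size=200)

# Approximate g(X) = E[Y|X] with a linear function via least squares.
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # roughly [1, 2]: the fit recovers the intercept and slope
```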

Residuals

$$\epsilon_i = y_i- \hat y_i$$

$$RSS=\sum_i \epsilon_i^2$$

$$RSS=\sum_i (y_i-\hat y_i)^2$$

Explained sum of squares (ESS)

$$ESS=\sum_i (\bar y-\hat y_i)^2$$

Total sum of squares (TSS)

$$TSS=\sum_i (y_i-\bar y)^2$$

Relationship between prediction and probability distribution

A probabilistic model gives a full distribution over outcomes:

$$P(y|X, \theta )$$

A point prediction gives a single value:

$$\hat y =f(\mathbf x)$$

We can recover a point prediction from the predictive distribution through integration. For example, the conditional mean is:

$$E[y|X] = \int y \, P(y|X, \theta ) \, dy$$
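A quick numerical check, assuming for illustration a Gaussian predictive distribution with mean $$3$$:

```python
import numpy as np

# Hypothetical predictive distribution: Gaussian with mean 3, std 2.
mu, sigma = 3.0, 2.0
y = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 10_001)
p = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# E[y|X] = integral of y * P(y|X, theta) dy, via the trapezoidal rule.
print(np.trapz(y * p, y))  # ≈ 3.0, the mean
```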

Coefficient of determination ($$R^2$$)

$$R^2= 1-\dfrac{RSS}{TSS}$$
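Putting these quantities together in a short sketch (the observations and predictions are made up):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2])    # hypothetical model predictions
y_bar = y.mean()

rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ess = np.sum((y_hat - y_bar) ** 2)  # explained sum of squares
tss = np.sum((y - y_bar) ** 2)      # total sum of squares

r2 = 1 - rss / tss
print(rss, ess, tss, r2)
```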

Classification

Binary classification

Classification models are a type of regression model, where $$y$$ is discrete rather than continuous.

So we want to find a mapping from a vector $$X$$ to probabilities across discrete $$y$$ values.

A classifier takes $$X$$ and returns a vector with one entry for each of the $$K$$ classes.

Confusion matrix

The confusion matrix counts true positives, false positives, false negatives, and true negatives.

We can use this to get:

• Accuracy: percentage of all predictions which are correct

• Precision: percentage of positive predictions which are correct

• Recall (sensitivity): percentage of positive cases that were predicted as positive

• Specificity: percentage of negative cases that were predicted as negative
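A minimal sketch computing these from raw counts (the counts here are made up):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 80, 10, 20, 90

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)   # sensitivity
specificity = tn / (tn + fp)

print(accuracy, precision, recall, specificity)
```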

Multiclass classification

What if an email could instead belong to one of several classes: work, friends, family, or hobby?

Confusion matrix

For $$K$$ classes the confusion matrix becomes a $$K \times K$$ table, with one row per true class and one column per predicted class; the off-diagonal entries show which classes get confused with which.

Hard and soft classifiers

A hard classifier can return a sparse vector with $$1$$ in the entry for the predicted class (and $$0$$ elsewhere).

A soft classifier returns probabilities for each entry in the vector.

The $$k$$-th entry of the vector represents $$P(Y=k|X=x)$$.

Transforming soft classifiers into hard classifiers

For binary classification we can use a cutoff, e.g. predict the positive class if $$P(Y=1|X=x)>0.5$$.

If there are more than two classes we can choose the class with the highest probability.
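A minimal sketch of both rules (the probabilities are made up):

```python
import numpy as np

# Binary case: threshold the positive-class probability.
p_positive = 0.73
print(int(p_positive > 0.5))       # 1, using a cutoff of 0.5

# Multiclass case: pick the class with the highest probability.
probs = np.array([0.1, 0.6, 0.3])  # soft classifier output over K=3 classes
print(np.argmax(probs))            # 1
```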

Loss functions for point predictions

Minimum Mean Square Error (MMSE)

The MMSE estimate is the mean: the mean minimises the expected squared error.

This can be done for a parameter, or for a point prediction of $$y$$.
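As a quick check that the mean is the minimiser, differentiate the expected squared error with respect to the estimate $$c$$ and set it to zero:

$$\frac{d}{dc} E[(Y-c)^2] = -2E[Y-c] = 0 \quad\Rightarrow\quad c = E[Y]$$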

Linear models

For linear models with Gaussian noise, MLE is the same as minimising squared error loss.

MAP is the same as squared error loss with regularisation (a Gaussian prior on the weights gives $$L_2$$ regularisation).
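To see the MLE case: with $$y_i = \mathbf x_i^T \boldsymbol\beta + \epsilon_i$$ and $$\epsilon_i \sim N(0, \sigma^2)$$, the log-likelihood is

$$\log P(\mathbf y|\mathbf X, \boldsymbol\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \mathbf x_i^T \boldsymbol\beta)^2$$

so maximising it over $$\boldsymbol\beta$$ is the same as minimising the sum of squared residuals.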

Loss functions for hard classifiers

We don’t want predictions outside $$0$$ and $$1$$, so plain squared error on the labels is a poor fit.

F1 score

$$F_1$$ score: $$\dfrac{2PR}{P+R}$$, where $$P$$ is precision and $$R$$ is recall.

We may not just care about accuracy, e.g. in breast cancer screening a false negative is far more costly than a false positive.

High accuracy can also result from a very basic model on imbalanced data (e.g. predicting that everyone on the Titanic died).
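Continuing the hypothetical confusion-matrix counts from above:

```python
tp, fp, fn = 80, 10, 20   # hypothetical counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)
print(f1)
```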

Other

Maximum A Posteriori estimation (MAP) for generative models

Bayesian regression for generative models

We know:

$$P(\theta |y,X)=\dfrac{P(y, \theta, X )}{P(y, X)}$$

$$P(\theta |y,X)=\dfrac{P(y, X |\theta )P(\theta )}{P(y, X)}$$

The denominator is a normalisation factor that does not depend on $$\theta$$, so we can use:

$$P(\theta |y,X)\propto P(y, X| \theta)P(\theta)$$

We have here:

• Our prior - $$P(\theta )$$

• Our posterior - $$P(\theta |y,X)$$

• Our likelihood function - $$P(y, X| \theta )$$
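As a minimal sketch, assuming a made-up one-parameter Gaussian model and a standard normal prior, we can evaluate the unnormalised posterior on a grid:

```python
import numpy as np

# Hypothetical model: y_i ~ N(theta * x_i, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=30)
y = 1.5 * x + rng.normal(0, 1, size=30)   # true theta = 1.5

theta = np.linspace(-2, 4, 601)           # grid of candidate parameters

# Log-likelihood and log-prior for each candidate theta.
log_lik = np.array([-0.5 * np.sum((y - t * x) ** 2) for t in theta])
log_prior = -0.5 * theta ** 2

# Posterior ∝ likelihood × prior; normalise over the grid.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, theta)

print(theta[np.argmax(post)])             # MAP estimate, ≈ 1.5
```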

Bayesian classifier

Classification risk

We can measure the risk of a classifier. This is the chance of misclassification.

$$R(C)=P(C(X)\ne Y)$$

The Bayesian classifier

This is the classifier $$C(X)$$ which minimises the chance of misclassification.

It takes the output of the soft classifier and chooses the class with the highest probability:

$$C(x)=\arg\max_k P(Y=k|X=x)$$
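A quick simulation of this, under a made-up two-class model where $$P(Y=1|X=x)=x$$ and $$X$$ is uniform on $$[0,1]$$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up generative model: X ~ Uniform(0, 1), P(Y=1|X=x) = x.
x = rng.uniform(0, 1, size=100_000)
y = (rng.uniform(0, 1, size=x.size) < x).astype(int)

# Bayes classifier: predict 1 when P(Y=1|X=x) = x > 0.5.
bayes_pred = (x > 0.5).astype(int)
print(np.mean(bayes_pred != y))   # ≈ 0.25, the minimum risk here

# Any other cutoff does worse, e.g. 0.3:
other_pred = (x > 0.3).astype(int)
print(np.mean(other_pred != y))   # ≈ 0.29, higher risk
```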