# Choosing parametric probability distributions

## Choosing the form of a model

### Sample sizes

If you’re modelling house prices using just size, getting a larger sample won’t help much.

A larger sample can improve low-bias models.

Is data size an issue? To check, artificially restrict the training set to $$m$$ examples, then evaluate the error as $$m$$ grows.

Training error:

+ is zero for small $$m$$
+ increases as $$m$$ increases, because the degrees of freedom per example (degrees of freedom / $$m$$) fall

Cross-validation error:

+ decreases as the data set grows, because the estimate of $$\theta$$ becomes more accurate

The two curves converge towards each other for very large $$m$$.
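The check described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes a one-feature linear model fitted by least squares, the squared-error cost $$\frac{1}{2m}\sum (h(x)-y)^2$$, and made-up synthetic data. The helper names `fit_line`, `cost` and `learning_curve` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x (one feature)."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


def cost(theta, xs, ys):
    """Squared-error cost (1/2m) * sum (h(x) - y)^2."""
    theta0, theta1 = theta
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * len(xs))


def learning_curve(train_x, train_y, cv_x, cv_y, sizes):
    """For each m, fit on the first m training points and record both errors."""
    rows = []
    for m in sizes:
        theta = fit_line(train_x[:m], train_y[:m])
        train_err = cost(theta, train_x[:m], train_y[:m])
        cv_err = cost(theta, cv_x, cv_y)
        rows.append((m, train_err, cv_err))
    return rows


# Synthetic data (an assumption for illustration): y = 2x + 1 plus alternating noise.
train_x = [float(i) for i in range(20)]
train_y = [2 * x + 1 + 0.5 * (-1) ** i for i, x in enumerate(train_x)]
cv_x = [i + 0.5 for i in range(20)]
cv_y = [2 * x + 1 for x in cv_x]

curve = learning_curve(train_x, train_y, cv_x, cv_y, sizes=[2, 5, 10, 20])
for m, train_err, cv_err in curve:
    print(f"m={m:2d}  train={train_err:.4f}  cv={cv_err:.4f}")
```

With two points and two parameters the line passes through the training data exactly, so the training error starts at zero and rises as $$m$$ grows, while the cross-validation error falls.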

When are large datasets useful?

When all the relevant features are available:

Predicting house prices using just size won’t benefit from more data.

Choosing the correct word in a sentence (to, too, two) benefits more from extra data, because the surrounding words carry the information needed.

A useful test: could a human expert do it from the same features? If so, more data is probably helpful.

An expert realtor probably couldn’t do much with just size, but a fluent speaker could answer the word-choice question.

Low-bias algorithms do well with more data.

More data is good if the model has a large number of parameters, or a lot of hidden units.

### Overfitting

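Role of $$\lambda$$: a high value lowers the impact of additional variables, giving high bias.

A low value lets additional variables have a strong impact, giving high variance.

For a classifier, the decision cut-off offers a further trade-off: for example, only predict positive if the predicted probability is above $$0.7$$.

How should $$\lambda$$ be chosen? This is difficult to do directly, as $$\lambda$$ sits inside the cost function.

It can be chosen in the same way as the polynomial degree $$d$$:

Run the fit for a range of $$\lambda$$ (e.g. 0, 0.01, 0.02, 0.04, 0.08, doubling up to about 10), then pick the value with the lowest error on the cross-validation set.

A low $$\lambda$$ always gives a low cost on the training set, but not necessarily on the cross-validation set.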
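The selection procedure can be sketched as below. This is a toy illustration under stated assumptions: a one-feature linear model whose slope is penalised (the intercept is not), errors measured without the penalty term, and made-up data in which the training slope overstates the true slope. The names `fit_ridge_line` and `squared_error` are hypothetical.

```python
def fit_ridge_line(xs, ys, lam):
    """Minimise (1/2m) * [sum (theta0 + theta1*x - y)^2 + lam * theta1^2]."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    theta1 = sxy / (sxx + lam)        # the penalty shrinks the slope
    theta0 = y_bar - theta1 * x_bar   # the intercept is not penalised
    return theta0, theta1


def squared_error(theta, xs, ys):
    """Unpenalised cost (1/2m) * sum (h(x) - y)^2, used to compare fits."""
    theta0, theta1 = theta
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * len(xs))


# Toy data (an assumption): the training set suggests slope 1.5, the CV set slope 1.
train_x, train_y = [-1.0, 0.0, 1.0], [-1.5, 0.0, 1.5]
cv_x, cv_y = [-2.0, 2.0], [-2.0, 2.0]

# Candidate values: 0, then 0.01 doubled repeatedly up to 10.24.
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

results = []
for lam in lambdas:
    theta = fit_ridge_line(train_x, train_y, lam)
    results.append((lam,
                    squared_error(theta, train_x, train_y),
                    squared_error(theta, cv_x, cv_y)))

best_lambda = min(results, key=lambda r: r[2])[0]          # lowest CV error
lowest_train_lambda = min(results, key=lambda r: r[1])[0]  # lowest training error
print("best lambda on the CV set:", best_lambda)
print("lambda with lowest training error:", lowest_train_lambda)
```

On this data the training error is smallest at $$\lambda = 0$$, while the cross-validation error is smallest at a strictly positive value, which is why the choice is made on the cross-validation set.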

Regularisation: add the size of the parameters to the error term, which penalises large parameters.

An overfitted model may not fit outside the sample.

High bias: e.g. house prices and size. A linear model would have high bias for out-of-sample data (underfitting).

High variance: e.g. fitting a polynomial that passes through all the data points (overfitting).

Overfitting can be reduced by reducing the number of features, either manually or using model selection.

Or by regularisation: keep all features, but reduce the magnitude of $$\theta$$.

### Regularisation

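Make the cost function include the size of the $$\theta_j^2$$ values. For example, to penalise $$\theta_3$$ and $$\theta_4$$ heavily:

$$\min \dfrac{1}{2m} \left[\sum (h(x)-y)^2 + 1000 \theta_3^2 + 1000 \theta_4^2\right]$$

More generally, with a single weight $$\lambda$$ on all parameters:

$$\min \dfrac{1}{2m}\left[\sum (h(x)-y)^2 + \lambda \sum \theta_j^2\right]$$

By convention $$\theta_0$$ is not included in the penalty, so it is not regularised.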
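As a small sketch of the regularised cost, assuming a linear hypothesis $$h(x)=\theta'x$$ and a design matrix whose first column is all ones (so `theta[0]` plays the role of $$\theta_0$$); the function name `regularised_cost` is hypothetical:

```python
def regularised_cost(theta, X, y, lam):
    """(1/2m) * [sum (h(x) - y)^2 + lam * sum of theta_j^2 for j >= 1]."""
    m = len(y)
    squared_errors = 0.0
    for row, target in zip(X, y):
        prediction = sum(t * x for t, x in zip(theta, row))  # h(x) = theta'x
        squared_errors += (prediction - target) ** 2
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is left out
    return (squared_errors + penalty) / (2 * m)


# Two examples that theta fits exactly, so only the penalty contributes.
X = [[1.0, 1.0], [1.0, 2.0]]
y = [3.0, 5.0]
theta = [1.0, 2.0]

unpenalised = regularised_cost(theta, X, y, lam=0.0)
penalised = regularised_cost(theta, X, y, lam=1.0)
print(unpenalised, penalised)
```

Here the fit is exact, so the cost with $$\lambda=1$$ is just $$\lambda\theta_1^2/(2m) = 4/4 = 1$$.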

The update for linear regression is

$$\theta_j = \theta_j -\alpha \left[\dfrac{1}{m} \sum (h(x)-y)x_j + \dfrac{\lambda}{m} \theta_j\right]$$

$$\theta_j = \theta_j \left(1- \dfrac{\alpha \lambda}{m}\right) -\alpha \dfrac{1}{m} \sum (h(x)-y)x_j$$

This is the same as the unregularised update, except that $$\theta_j$$ is first shrunk by the factor $$(1-\alpha\lambda/m)$$ at each step.
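The two ways of writing the update can be checked numerically. The sketch below assumes a linear hypothesis, a design matrix with a leading column of ones, and that $$\theta_0$$ is updated without the penalty; the function names are hypothetical.

```python
def gradient(theta, X, y):
    """Unregularised gradient: (1/m) * sum (h(x) - y) * x_j for each j."""
    m = len(y)
    errors = [sum(t * x for t, x in zip(theta, row)) - target
              for row, target in zip(X, y)]
    return [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]


def step_penalty_form(theta, X, y, alpha, lam):
    """theta_j - alpha * [gradient_j + (lam/m) * theta_j] for j >= 1."""
    m = len(y)
    grad = gradient(theta, X, y)
    new_theta = [theta[0] - alpha * grad[0]]  # theta_0: no penalty
    new_theta += [t - alpha * (g + lam / m * t)
                  for t, g in zip(theta[1:], grad[1:])]
    return new_theta


def step_shrink_form(theta, X, y, alpha, lam):
    """theta_j * (1 - alpha*lam/m) - alpha * gradient_j for j >= 1."""
    m = len(y)
    grad = gradient(theta, X, y)
    shrink = 1 - alpha * lam / m
    new_theta = [theta[0] - alpha * grad[0]]  # theta_0: no shrinkage
    new_theta += [t * shrink - alpha * g
                  for t, g in zip(theta[1:], grad[1:])]
    return new_theta


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = [0.5, 1.0]

a = step_penalty_form(theta, X, y, alpha=0.1, lam=3.0)
b = step_shrink_form(theta, X, y, alpha=0.1, lam=3.0)
print(a, b)
```

Both forms give the same new parameters, since the second is only an algebraic rearrangement of the first.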

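The normal equation also needs a change. Without regularisation it is

$$\theta=(X'X)^{-1}X'y$$

With regularisation it becomes

$$\theta=(X'X+\lambda I)^{-1}X'y$$

Because $$\theta_0$$ is not regularised, $$I$$ here is the identity matrix with its first diagonal element set to 0.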
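A dependency-free sketch of the regularised normal equation follows. It assumes a design matrix with a leading column of ones and that $$X'X+\lambda I$$ is non-singular. It solves the linear system by Gaussian elimination rather than forming the inverse explicitly; that is a design choice of this sketch, and the helper names `solve` and `regularised_normal_equation` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular.
    """
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def regularised_normal_equation(X, y, lam):
    """Solve (X'X + lam * I0) theta = X'y, where I0 is the identity with I0[0][0] = 0."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)]
           for i in range(n)]
    Xty = [sum(row[i] * target for row, target in zip(X, y))
           for i in range(n)]
    for j in range(1, n):  # skip j = 0 so theta_0 is not regularised
        XtX[j][j] += lam
    return solve(XtX, Xty)


# Data lying exactly on y = 1 + 2x.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
y = [-1.0, 1.0, 3.0]

theta_plain = regularised_normal_equation(X, y, lam=0.0)
theta_ridge = regularised_normal_equation(X, y, lam=2.0)
print(theta_plain, theta_ridge)
```

With $$\lambda=0$$ the exact line is recovered; with $$\lambda=2$$ the slope is shrunk while the intercept is unchanged on this centred data.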

### Regularisation for logistic regression

Add to the end of $$J(\theta)$$:

$$\dfrac{\lambda }{2m} \sum \theta_j^2$$

The update for $$\theta_j$$ with $$j>0$$ is the same as for linear regression, but $$h(x)$$ is a different function (the logistic function of $$\theta'x$$).
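A minimal sketch of this, assuming $$h(x)$$ is the logistic (sigmoid) hypothesis, $$J(\theta)$$ is the usual average log loss, labels are 0 or 1, and the design matrix has a leading column of ones. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, row):
    """h(x) = sigmoid(theta'x)."""
    return sigmoid(sum(t * x for t, x in zip(theta, row)))


def logistic_cost(theta, X, y, lam):
    """Average log loss plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    m = len(y)
    log_loss = 0.0
    for row, target in zip(X, y):
        h = hypothesis(theta, row)
        log_loss -= target * math.log(h) + (1 - target) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    return log_loss / m + penalty


def logistic_step(theta, X, y, alpha, lam):
    """Same update form as linear regression, with the sigmoid hypothesis."""
    m = len(y)
    errors = [hypothesis(theta, row) - target for row, target in zip(X, y)]
    grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]
    new_theta = [theta[0] - alpha * grad[0]]  # theta_0 is not regularised
    new_theta += [t - alpha * (g + lam / m * t)
                  for t, g in zip(theta[1:], grad[1:])]
    return new_theta


X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

cost_at_zero = logistic_cost([0.0, 0.0], X, y, lam=5.0)
extra = (logistic_cost([0.0, 1.0], X, y, lam=5.0)
         - logistic_cost([0.0, 1.0], X, y, lam=0.0))
theta_next = logistic_step([0.0, 0.0], X, y, alpha=0.1, lam=5.0)
print(cost_at_zero, extra, theta_next)
```

At $$\theta=0$$ every prediction is 0.5, so the cost is $$\log 2$$ whatever the value of $$\lambda$$; the penalty only contributes once $$\theta_j \ne 0$$ for some $$j>0$$.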

## Kullback-Leibler divergence

Bayesian inference gives us the full distribution $$p(w)$$, not just its moments or a specific point estimate.

### Cross entropy

$$H(P,Q)=E_P(I(Q))$$

where $$I(Q)=-\log Q(x)$$ is the information content of $$x$$ under $$Q$$.

So for a discrete distribution this is:

$$H(P,Q)=-\sum_x P(x)\log Q(x)$$

Here $$Q$$ is the prior and $$P$$ is the posterior.
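The discrete formula translates directly into code. The sketch below assumes distributions given as lists of probabilities over the same outcomes, natural logarithms (so the result is in nats), and $$Q(x)>0$$ wherever $$P(x)>0$$. The example prior and posterior values are made up.

```python
import math


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); terms with P(x) = 0 contribute 0.

    Assumes q[i] > 0 wherever p[i] > 0, otherwise math.log raises ValueError.
    """
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


prior = [0.25, 0.75]     # Q (made-up values)
posterior = [0.5, 0.5]   # P (made-up values)

h_pq = cross_entropy(posterior, prior)
h_pp = cross_entropy(posterior, posterior)  # H(P, P) is the entropy H(P)
print(h_pq, h_pp)
```

Setting $$Q=P$$ recovers the ordinary entropy $$H(P)$$, which for this uniform two-outcome $$P$$ is $$\log 2$$.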

### Kullback-Leibler divergence

When we move from a prior to a posterior distribution, the entropy of the probability distribution changes.

$$D_{KL}(P||Q)=H(P,Q)-H(P)$$

KL divergence is also called the information gain.
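As a sketch, the definition $$H(P,Q)-H(P)$$ can be computed and compared with the equivalent direct sum $$\sum_x P(x)\log\frac{P(x)}{Q(x)}$$. This assumes discrete distributions as lists of probabilities, natural logarithms, and $$Q(x)>0$$ wherever $$P(x)>0$$. The example values are made up.

```python
import math


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), skipping terms with P(x) = 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


def entropy(p):
    """H(P) = H(P, P)."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)


def kl_direct(p, q):
    """Equivalent form: sum_x P(x) log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


prior = [0.25, 0.75]     # Q (made-up values)
posterior = [0.5, 0.5]   # P (made-up values)

gain = kl_divergence(posterior, prior)
print(gain, kl_direct(posterior, prior))
print(kl_divergence(prior, posterior))  # the arguments are not interchangeable
```

Swapping the arguments gives a different number, so the divergence is not symmetric in $$P$$ and $$Q$$.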

### Gibbs’ inequality

$$D_{KL}(P||Q)\ge 0$$
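One standard way to see this, sketched here for discrete distributions, uses the bound $$\log t \le t-1$$. Writing $$D_{KL}(P||Q)=\sum_x P(x)\log \dfrac{P(x)}{Q(x)}$$ and summing only over $$x$$ with $$P(x)>0$$:

$$-D_{KL}(P||Q)=\sum_x P(x)\log \dfrac{Q(x)}{P(x)} \le \sum_x P(x)\left(\dfrac{Q(x)}{P(x)}-1\right)=\sum_x Q(x)-\sum_x P(x)\le 0$$

The last step holds because $$\sum_x P(x)=1$$ while $$Q$$ sums to at most 1 over the same outcomes. Equality requires $$Q(x)=P(x)$$ for every $$x$$, so the divergence is zero only when the two distributions coincide.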