Choosing parametric probability distributions

Choosing the form of a model

Sample sizes

If you’re modelling house prices using just size, getting a large sample size won’t help too much

More data can improve low-bias models

Is data size an issue? Can artificially restrict the training set size and then evaluate error as it grows (a learning curve)

Training error: near zero for low \(m\); increases as \(m\) grows, since degrees of freedom per example fall

CV error: decreases as the data set grows, as \(\theta\) becomes more accurate. The two curves converge towards each other for very large \(m\)
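A minimal sketch of this procedure, fitting a two-parameter line to progressively larger slices of a synthetic training set (the data and sizes are illustrative assumptions, not from the course):

```python
# Learning-curve sketch: restrict training size m, fit, track train vs cv error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=200)

# Split into training and cross-validation sets
X_tr, y_tr = X[:150], y[:150]
X_cv, y_cv = X[150:], y[150:]

def fit_line(x, t):
    """Least-squares fit of t = a*x + b via the normal equations."""
    A = np.column_stack([x, np.ones_like(x)])
    theta, *_ = np.linalg.lstsq(A, t, rcond=None)
    return theta

def mse(theta, x, t):
    A = np.column_stack([x, np.ones_like(x)])
    return np.mean((A @ theta - t) ** 2)

train_err, cv_err = [], []
sizes = [2, 5, 10, 25, 50, 100, 150]
for m in sizes:
    theta = fit_line(X_tr[:m], y_tr[:m])
    train_err.append(mse(theta, X_tr[:m], y_tr[:m]))
    cv_err.append(mse(theta, X_cv, y_cv))

# Training error starts near zero for tiny m (2 points, 2 parameters) and rises;
# cv error falls towards the noise floor; the two curves converge for large m.
```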

When are large datasets useful?

When all features available:

Predicting house price using just size won't benefit from more data

If choosing the correct word in a sentence (to, too, two), more data is helpful

If a human expert can do it, then more data is probably helpful

An expert realtor probably couldn't do much with just size, but a human could easily answer the word-choice question

Could expert do it?

Low bias algorithms do well with more data

More data is good if there are a large number of parameters, or many hidden units.

Choosing the form of a model


Role of lambda: high lambda shrinks the impact of extra variables => high bias

Low lambda leaves their impact strong => high variance

Can trade off using a cut-off: only classify as positive if above \(0.7\), say

How to choose lambda? Difficult, as lambda appears inside the cost function!

Can do similarly to choosing \(d\):

Run for a range of lambda (eg 0, 0.01, 0.02, 0.04, 0.08, ..., 10), then pick the value with the lowest error on the cross-validation set

Low lambda always has low cost on the training set, but not on the cv set.
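A sketch of this selection procedure, using ridge regression via the regularised normal equation on synthetic data (the data, split, and grid are illustrative assumptions):

```python
# Try a doubling grid of lambda values; pick the one with lowest cv error.
import numpy as np

rng = np.random.default_rng(1)
m, n = 60, 8
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
true_theta = np.zeros(n + 1)
true_theta[:3] = [1.0, 2.0, -3.0]          # only a few features matter
y = X @ true_theta + rng.normal(0, 0.5, size=m)

X_tr, y_tr = X[:40], y[:40]
X_cv, y_cv = X[40:], y[40:]

def ridge_fit(X, y, lam):
    # Identity with first diagonal element zeroed: theta_0 is not regularised
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
cv_errors = []
for lam in lambdas:
    theta = ridge_fit(X_tr, y_tr, lam)
    cv_errors.append(np.mean((X_cv @ theta - y_cv) ** 2))

best_lam = lambdas[int(np.argmin(cv_errors))]
```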

Regularisation: add to the error term the size of the parameters; this penalises large parameters

May not fit well out of sample

High bias: eg house prices and size; a linear fit would have high bias on out-of-sample data (underfitting)

High variance: a polynomial passing through all the data points (overfitting)

Can reduce overfitting by reducing features, either manually or using a model selection algorithm

OR regularisation: keep all features, but reduce magnitude of theta


Make cost function include size of \(\theta^2\) values

\(\min \dfrac{1}{2m} \left[\sum (h(x)-y)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2\right]\)

or more broadly:

\(\min \dfrac{1}{2m}\left[\sum (h(x)-y)^2 + \lambda \sum \theta_j^2\right]\)

By convention \(\theta_0\) is not included, ie no regularisation on the intercept

Update for linear regression is

\(\theta_j := \theta_j - \alpha\left[\dfrac{1}{m}\sum (h(x)-y)x_j + \dfrac{\lambda}{m}\theta_j\right]\)

\(\theta_j := \theta_j\left(1 - \dfrac{\alpha\lambda}{m}\right) - \alpha\dfrac{1}{m}\sum (h(x)-y)x_j\)

This is the same update as before, but each step shrinks \(\theta_j\) slightly first.
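The shrink-then-step form of this update can be sketched as follows, on synthetic data with illustrative choices of \(\alpha\) and \(\lambda\):

```python
# Regularised gradient descent for linear regression:
# theta_j := theta_j*(1 - alpha*lambda/m) - alpha*(1/m)*sum((h(x)-y)*x_j),
# with theta_0 left unregularised.
import numpy as np

rng = np.random.default_rng(2)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])  # x_0 = 1
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(0, 0.1, size=m)

alpha, lam = 0.1, 1.0
theta = np.zeros(3)

def cost(theta):
    err = X @ theta - y
    return (err @ err + lam * np.sum(theta[1:] ** 2)) / (2 * m)

costs = [cost(theta)]
for _ in range(500):
    grad = (X.T @ (X @ theta - y)) / m       # data term of the gradient
    shrink = np.ones(3) - alpha * lam / m    # (1 - alpha*lambda/m) factor
    shrink[0] = 1.0                          # no shrinkage for theta_0
    theta = theta * shrink - alpha * grad
    costs.append(cost(theta))
```

The fixed point of this iteration is exactly the minimiser of the regularised cost, so it agrees with the bracketed update form above.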

Normal equation needs a change


Now is

\(\theta = (X'X+\lambda I)^{-1}X'y\)

although for \(\theta_0\) lambda is zero, so \(I\) here is the identity matrix with its first diagonal element set to \(0\)
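A short sketch of this regularised normal equation, with the modified identity matrix written out explicitly (synthetic data for illustration):

```python
# theta = (X'X + lambda*L)^{-1} X'y, where L is the identity matrix
# with its first diagonal element zeroed so theta_0 is not regularised.
import numpy as np

rng = np.random.default_rng(3)
m = 30
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(0, 0.1, size=m)

lam = 0.5
L = np.eye(4)
L[0, 0] = 0.0                      # do not regularise the intercept
theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# With lambda = 0 this reduces to ordinary least squares
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

A side benefit: for \(\lambda > 0\) the matrix \(X'X + \lambda L\) is better conditioned, so the solve is more stable even when \(X'X\) alone is near-singular.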


add to end of \(J(\theta)\):

\(\dfrac{\lambda }{2m} \sum \theta_j^2\)

Update for \(\theta_j\), \(j>0\): same as for linear regression, but \(h(x)\) is a different function
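Assuming this refers to regularised logistic regression, where \(h(x)\) is the sigmoid of \(\theta'x\), a sketch on synthetic data (step size and \(\lambda\) are illustrative):

```python
# Regularised logistic regression: cross-entropy loss plus
# (lambda/2m) * sum(theta_j^2) for j > 0; gradient has the same form
# as linear regression, but h(x) is the sigmoid.
import numpy as np

rng = np.random.default_rng(4)
m = 40
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)   # separable toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, lam=1.0):
    h = sigmoid(X @ theta)
    eps = 1e-12                              # avoid log(0)
    data = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return data + lam / (2 * m) * np.sum(theta[1:] ** 2)

def grad(theta, lam=1.0):
    g = X.T @ (sigmoid(X @ theta) - y) / m   # same form as linear regression
    g[1:] += lam / m * theta[1:]             # theta_0 not regularised
    return g

theta = np.zeros(3)
for _ in range(1000):
    theta -= 0.5 * grad(theta)
```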

Choosing the form of a model

AIC, AICc, Bayes factor, BIC

Choosing the form of a model

Kullback-Leibler divergence

Bayesian inference means we have the full distribution \(p(w)\), not just a point estimate or its moments

Cross entropy:


So for a discrete distribution this is:

\(H(P,Q)=-\sum_x P(x)\log Q(x)\)

\(Q\) is prior

\(P\) is posterior

Kullback-Leibler divergence

When we move from a prior to a posterior distribution, the entropy of the probability distribution changes.


KL divergence is also called the information gain.
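Writing this out for a discrete distribution, using the cross-entropy definition above:

\(D_{KL}(P||Q)=\sum_x P(x)\log\dfrac{P(x)}{Q(x)}=H(P,Q)-H(P)\)

ie the cross entropy minus the entropy of \(P\).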

Gibbs’ inequality

\(D_{KL}(P||Q)\ge 0\)

Bayesian model selection