There is a distribution \(P(x)\) which we want to know more about.

If the distribution had a closed form, we could estimate it by evaluating it at values of \(x\).

However, if the distribution does not have such a form, we cannot do that.

We can’t plug in values, because the function is intractable to evaluate directly.

Sometimes we may know a function of the form:

\(f(x)=cP(x)\)

That is, a multiple of the distribution, where the constant \(c\) is unknown.

This can happen from Bayes’ theorem:

\(P(y|x)=\dfrac{P(x|y)P(y)}{P(x)}\)

We may be able to estimate \(P(x|y)\) and \(P(y)\), but not \(P(x)\).

This means we have, with \(c=1/P(x)\) unknown:

\(P(y|x)=cP(x|y)P(y)\)

The Metropolis-Hastings algorithm creates a set of samples \(x\) such that the distribution of the samples approaches the goal distribution.

The algorithm takes an arbitrary starting sample \(x_0\). It then must decide which sample to consider next.

It does this using a Markov chain: there is a proposal distribution \(g(x_j \mid x_i)\), giving the probability of proposing candidate \(x_j\) from the current sample \(x_i\).

This distribution is generally a normal distribution around \(x_i\), making the process a random walk.

Now that we have a candidate sample, we can either accept or reject it. It is this step that makes the final distribution of samples approximate the target.

We accept if \(\dfrac{f(x_j)}{f(x_i)}>u\), where \(u\) is a uniform random variable between \(0\) and \(1\), generated afresh each time.

We can calculate this ratio because we know \(f\); the unknown constant \(c\) cancels, since \(\dfrac{f(x_j)}{f(x_i)}=\dfrac{P(x_j)}{P(x_i)}\).
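The procedure above can be sketched in Python. The target \(f\) here is an illustrative unnormalised density (proportional to a standard normal), and the proposal step size is an assumed choice:

```python
import math
import random

def metropolis_hastings(f, x0, n_samples, step=1.0):
    """Sample from a distribution proportional to f, using a Gaussian
    random-walk proposal (step is the proposal standard deviation)."""
    samples = []
    x = x0
    for _ in range(n_samples):
        x_new = random.gauss(x, step)   # candidate from g, centred on x
        u = random.random()             # fresh uniform draw each time
        if f(x_new) / f(x) > u:         # accept rule; c cancels in the ratio
            x = x_new
        samples.append(x)
    return samples

# Illustrative target: f is proportional to a standard normal density,
# with the normalising constant deliberately unknown to the sampler.
f = lambda x: math.exp(-x * x / 2)
samples = metropolis_hastings(f, 0.0, 10000)
```

The histogram of `samples` then approximates \(P(x)\), even though only \(f(x)=cP(x)\) was ever evaluated.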

As with Metropolis-Hastings, in Gibbs sampling we want to generate samples for \(P(X)\) and use these to approximate its form.

We do this by using the conditional distributions. If \(X\) is a vector with components \(x_0,...,x_n\), then we also have:

\(P(x_j|x_0,...,x_{j-1},x_{j+1},...,x_n)\)

We use our knowledge of this distribution.

Start with vector \(x_0\).

This has components \(x_{0,j}\).

To form the next vector \(x_1\) we loop through each component.

\(P(x_{1,0}|x_{0,1},...,x_{0,n})\)

We sample from this to form \(x_{1,0}\).

However, after the first component we update the conditioning so that it uses the already-updated components:

\(P(x_{1,k}|x_{1,0},...,x_{1,k-1},x_{0,k+1},...,x_{0,n})\)

This means we only need to know the conditional distributions.
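As a sketch of this loop, here is a minimal Gibbs sampler in Python. The bivariate standard normal target (whose conditionals are themselves normal and known in closed form) and the correlation parameter are illustrative assumptions:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples):
    """Gibbs sampler for a bivariate standard normal with correlation rho,
    an illustrative target whose conditionals are known in closed form."""
    x1, x2 = 0.0, 0.0                      # arbitrary starting vector x_0
    sd = math.sqrt(1.0 - rho * rho)        # conditional standard deviation
    samples = []
    for _ in range(n_samples):
        # Update each component in turn from its conditional distribution,
        # conditioning on the most recently updated values.
        x1 = random.gauss(rho * x2, sd)    # draw from P(x1 | x2)
        x2 = random.gauss(rho * x1, sd)    # draw from P(x2 | x1), new x1
        samples.append((x1, x2))
    return samples
```

Only the conditional distributions are ever used; the joint distribution is never evaluated.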

We start with a random set of parameters, \(x\).

We then loop through the following:

We define a search space local to our current selection.

We randomly select a point from this space.

We compare the new point to our current point. If the new point is better, we move to it.

This is similar to random search, however we use a multivariate Gaussian distribution around our current point rather than a hypersphere.
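A minimal sketch of this loop in Python, assuming an illustrative quadratic objective and an assumed Gaussian step scale:

```python
import random

def hill_climb(f, x0, n_iters=1000, scale=0.1):
    """Stochastic hill climbing: propose from a Gaussian around the
    current point and move only if the proposal scores higher."""
    x = list(x0)
    best = f(x)
    for _ in range(n_iters):
        # Candidate drawn from a Gaussian centred on the current point.
        candidate = [xi + random.gauss(0.0, scale) for xi in x]
        value = f(candidate)
        if value > best:        # greedy: accept improvements only
            x, best = candidate, value
    return x, best

# Illustrative objective: concave quadratic with its maximum at (1, 2).
f = lambda x: -((x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2)
```

Because only improvements are accepted, this can get stuck at a local maximum; the next section relaxes that.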

We can use a version of Metropolis-Hastings to find the global maximum of a function \(f(x)\).

We start with an arbitrary point \(x_0\).

We move randomly from this to identify a candidate point \(x_c\).

We accept this with a probability depending on the relationship between \(f(x_0)\) and \(f(x_c)\).

This process will converge on the global maximum.

There is a hyperparameter controlling this selection. At one extreme the acceptance rule becomes greedy.
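A sketch of this optimisation loop in Python. Interpreting the selection hyperparameter as a temperature (as in simulated annealing) is an assumption, as are the step size and the example objective:

```python
import math
import random

def optimise_mh(f, x0, n_iters=5000, step=0.5, temperature=1.0):
    """Metropolis-style maximisation sketch.  `temperature` is the
    selection hyperparameter: as it approaches zero the acceptance
    rule becomes greedy."""
    x = x0
    best_x, best_f = x0, f(x0)
    for _ in range(n_iters):
        xc = random.gauss(x, step)              # candidate point
        delta = f(xc) - f(x)
        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature falls.
        if delta > 0 or random.random() < math.exp(delta / temperature):
            x = xc
        if f(x) > best_f:
            best_x, best_f = x, f(x)
    return best_x, best_f
```

Accepting occasional worse candidates is what lets the walk escape local maxima.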

If we have sampled from the hyperparameter space we know something about the shape.

Can we use this to inform where we should next look?

The shape of the function is \(y=f(\mathbf x)\).

We have observations \(\mathbf X\) and \(\mathbf y\).

So what’s our posterior, \(P(y|\mathbf X, \mathbf y)\)?

There can be a tradeoff between:

Exploring - which gives us a better shape for \(y=f(x)\); and

Exploiting - which gives us a better estimate for the global optimum.

We do not know \(y=f(x)\), but we model it as:

\(z(x)=y(x)+\epsilon\)

We can then maximise \(z\).

We want an algorithm which maps from our history of observations to a new candidate.

There are different approaches:

Probability of improvement - choosing the candidate with the highest probability of improving on the best value found so far.

Expected improvement - choosing the candidate with the largest expected improvement over the best value found so far.

Entropy search - choosing the candidate which most reduces uncertainty about the location of the global maximum.
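As an illustration of the first approach, a sketch of the probability-of-improvement calculation in Python. The `mu` and `sigma` arguments are assumed to come from some surrogate model's posterior (such as a fitted Gaussian process), which is not shown:

```python
import math

def probability_of_improvement(mu, sigma, f_best):
    """Probability that a candidate improves on the best value so far,
    assuming the surrogate's posterior at that point is N(mu, sigma^2).
    (Hypothetical helper; a real optimiser would take mu and sigma
    from a fitted surrogate model.)"""
    if sigma == 0.0:
        return 1.0 if mu > f_best else 0.0
    z = (mu - f_best) / sigma
    # Standard normal CDF, written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

The optimiser would evaluate this over the candidate set and query the point with the highest value.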

We generate a set of candidate parameter values, \(x\).

We evaluate each of these against a fitness function (the function we are optimising).

We assign fitness values to each individual.

We generate a second generation. We select “parents” randomly using the fitness values as weightings.

The values of the new individual are a function of the values of the parents, and noise (mutation).

We do this for each member in the next generation.

We iterate this process across successive generations.
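The steps above can be sketched in Python. The population size, mutation scale, initial parameter range, and averaging crossover are illustrative choices:

```python
import random

def genetic_algorithm(fitness, n_params, pop_size=50, n_gens=60,
                      mutation=0.1):
    """Minimal genetic algorithm sketch: parents are selected with
    probability proportional to fitness, children average the parents'
    parameter values, and Gaussian noise provides mutation.
    (Fitness must be non-negative for the weighted selection.)"""
    # First generation: random candidate parameter vectors.
    pop = [[random.uniform(-5.0, 5.0) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(n_gens):
        weights = [fitness(ind) for ind in pop]
        new_pop = []
        for _ in range(pop_size):
            # Select two parents, weighted by their fitness values.
            p1, p2 = random.choices(pop, weights=weights, k=2)
            # Child values combine the parents, plus mutation noise.
            child = [(a + b) / 2.0 + random.gauss(0.0, mutation)
                     for a, b in zip(p1, p2)]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)
```

Many variants exist (tournament selection, crossover by swapping components, elitism); this shows only the fitness-weighted selection and mutation described above.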