Optimisation

Markov Chain Monte Carlo (MCMC) methods

Metropolis-Hastings algorithm

Direct sampling

Density estimation through direct sampling

There is a distribution \(P(x)\) which we want to know more about.

If the function had a closed form, we could estimate its shape simply by evaluating it at values of \(x\).
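As a minimal illustrative sketch (assuming, purely for example, that \(P(x)\) is a standard normal density), evaluating the closed form is all that is needed:

```python
# A minimal sketch, assuming P(x) is a standard normal density
# (an illustrative stand-in for any closed-form distribution).
import numpy as np

def p(x):
    # Closed-form density of a standard normal.
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

xs = np.linspace(-4, 4, 9)    # values of x at which to evaluate P(x)
print(list(zip(xs, p(xs))))   # the density is known exactly at each point
```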

Limitations of direct sampling

However, if the function does not have such a form, we cannot do that.

We cannot simply plug in values of \(x\), because there is no expression for the density that we can evaluate.

Sometimes we may know a function of the form:

\(f(x)=cP(x)\)

That is, a multiple of the function.

This can happen from Bayes’ theorem:

\(P(y|x)=\dfrac{P(x|y)P(y)}{P(x)}\)

We may be able to estimate \(P(x|y)\) and \(P(y)\), but not \(P(x)\)

This means we have

\(P(y|x)=cP(x|y)P(y)\), where \(c=1/P(x)\) is an unknown normalising constant.
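As a sketch of what this looks like in practice, suppose (for illustration only) a binomial likelihood with a uniform prior; the unnormalised posterior \(f\) can be evaluated even though \(c\) is unknown:

```python
# A minimal sketch: an unnormalised posterior over a parameter theta, assuming
# an illustrative binomial likelihood (7 successes in 10 trials) and a uniform prior.
import numpy as np

k, n = 7, 10                          # assumed observed data
theta = np.linspace(0.01, 0.99, 99)   # grid of parameter values

likelihood = theta**k * (1 - theta)**(n - k)   # P(x|y), up to a constant
prior = np.ones_like(theta)                    # P(y): uniform

f = likelihood * prior       # f(theta) = c * P(theta|x); c = 1/P(x) is unknown
print(theta[np.argmax(f)])   # the mode can still be found without knowing c
```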

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm creates a sequence of samples \(x\) such that the distribution of the samples approaches the target distribution.

Initialisation

The algorithm takes an arbitrary starting sample \(x_0\). It then must decide which sample to consider next.

Generation

It does this using a Markov chain. That is, there is a proposal distribution \(g(x_j\mid x_i)\) giving the probability of proposing candidate \(x_j\) from the current sample \(x_i\).

This distribution is generally a normal distribution around \(x_i\), making the process a random walk.

Acceptance

Now that we have a candidate sample, we can either accept or reject it. It is this step that makes the final distribution of samples approximate the target distribution.

We accept if \(\dfrac{f(x_j)}{f(x_i)}>u\), where \(u\) is a random variable drawn uniformly between \(0\) and \(1\), generated afresh each time.

We can calculate this ratio because we know \(f\): the unknown constant \(c\) cancels, since \(\dfrac{f(x_j)}{f(x_i)}=\dfrac{P(x_j)}{P(x_i)}\).
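A minimal sketch of the whole loop, assuming an illustrative unnormalised target \(f\) and a symmetric normal random-walk proposal:

```python
# A minimal Metropolis sketch with a symmetric normal random-walk proposal.
# f is an assumed unnormalised target (an illustrative bimodal density).
import numpy as np

def f(x):
    return np.exp(-0.5 * (x - 2)**2) + 0.5 * np.exp(-0.5 * (x + 2)**2)

rng = np.random.default_rng(0)
x = 0.0                  # arbitrary starting sample x_0
samples = []
for _ in range(10_000):
    x_new = x + rng.normal(scale=1.0)   # generation: random walk around x_i
    u = rng.uniform()                   # fresh u in (0, 1) each iteration
    if f(x_new) / f(x) > u:             # acceptance: the unknown constant c cancels
        x = x_new
    samples.append(x)                   # rejected steps repeat the current sample

# The histogram of `samples` approximates the target distribution P(x).
```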

Properties

Gibbs sampling

Gibbs sampling

Introduction

As with Metropolis-Hastings, we want to generate samples for \(P(X)\) and use this to approximate its form.

We do this by using the conditional distribution. If \(X\) is a vector then we also have:

\(P(x_j|x_0,...,x_{j-1},x_{j+1},...,x_n)\)

We use our knowledge of this distribution.

Start with vector \(x_0\).

This has components \(x_{0,j}\)

To form the next vector \(x_1\) we loop through each component.

\(P(x_{1,0}\mid x_{0,1},...,x_{0,n})\)

We use this to form \(x_{1,0}\)

However, after the first component we update the conditioning so that it uses the components already updated in this sweep.

\(P(x_{1,k}\mid x_{1,0},...,x_{1,k-1},x_{0,k+1},...,x_{0,n})\)

This means we only need to know the conditional distributions.
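A minimal sketch, assuming a bivariate normal target with correlation \(\rho\), for which both conditional distributions are normal and known in closed form:

```python
# A minimal Gibbs sketch for a bivariate normal with correlation rho
# (an assumed example where both conditionals are known in closed form).
import numpy as np

rho = 0.8
rng = np.random.default_rng(0)
x = np.zeros(2)            # starting vector x_0
samples = []
for _ in range(10_000):
    # Component 0 given the current value of component 1.
    x[0] = rng.normal(loc=rho * x[1], scale=np.sqrt(1 - rho**2))
    # Component 1 given the freshly updated component 0.
    x[1] = rng.normal(loc=rho * x[0], scale=np.sqrt(1 - rho**2))
    samples.append(x.copy())

# The samples approximate the joint distribution P(x).
```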

Random search

We start with a random set of parameters, \(x\).

We then loop through the following:

  • We define a search space local to our current selection.

  • We randomly select a point from this space.

  • We compare the new point to our current point. If the new point is better, we move to it.
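A minimal sketch of this loop, assuming an illustrative quadratic objective and a hypersphere of fixed radius as the local search space:

```python
# A minimal random-search sketch. The objective, radius, and iteration count
# are assumed for illustration; candidates are drawn uniformly from a
# hypersphere around the current point.
import numpy as np

def objective(x):
    return np.sum((x - 3.0)**2)   # illustrative function to minimise

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=2)       # random initial parameters
radius = 1.0
for _ in range(1_000):
    direction = rng.normal(size=x.shape)
    direction /= np.linalg.norm(direction)
    step = direction * radius * rng.uniform() ** (1 / x.size)  # uniform in the ball
    candidate = x + step                     # random point in the local space
    if objective(candidate) < objective(x):  # move only if the new point is better
        x = candidate

print(x)   # close to the optimum at (3, 3)
```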

Random optimisation

This is similar to random search, however we use a multivariate Gaussian distribution around our current point rather than a hypersphere.
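A minimal sketch of the change, using the same illustrative objective; only the proposal step differs:

```python
# A minimal random-optimisation sketch: identical to random search above except
# that the candidate is drawn from a multivariate Gaussian centred on the
# current point (the covariance is an assumed tuning choice).
import numpy as np

def objective(x):
    return np.sum((x - 3.0)**2)   # same illustrative objective as above

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=2)
cov = 0.5 * np.eye(2)                    # assumed proposal covariance
for _ in range(1_000):
    candidate = rng.multivariate_normal(mean=x, cov=cov)   # Gaussian step
    if objective(candidate) < objective(x):
        x = candidate

print(x)
```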

Simulated annealing

Introduction

We can use a version of Metropolis-Hastings to find the global maximum of a function \(f(x)\).

We start with an arbitrary point \(x_0\).

We move randomly from this to identify a candidate point \(x_c\).

We accept this with a probability depending on the relationship between \(f(x_0)\) and \(f(x_c)\).

With a suitable cooling schedule, this process converges towards the global maximum.

Hyperparameter

There is a hyperparameter for the acceptance step, usually called the temperature: the higher it is, the more readily worse candidates are accepted. As the temperature is lowered towards zero, the acceptance rule becomes greedy.
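A minimal sketch, assuming an illustrative one-dimensional objective, a Gaussian random move, and a geometric cooling schedule for the temperature:

```python
# A minimal simulated-annealing sketch for maximising f. The objective, step
# size, and geometric cooling schedule are assumed for illustration; T is the
# temperature hyperparameter discussed above.
import numpy as np

def f(x):
    return np.exp(-0.1 * x**2) * np.cos(3 * x)   # illustrative multimodal function

rng = np.random.default_rng(0)
x = 5.0                    # arbitrary starting point x_0
best = x
T = 1.0
for _ in range(10_000):
    x_c = x + rng.normal(scale=0.5)   # random move to a candidate point
    delta = f(x_c) - f(x)
    # Always accept improvements; accept worse candidates with probability
    # exp(delta / T), which shrinks as the temperature falls.
    if delta > 0 or rng.uniform() < np.exp(delta / T):
        x = x_c
    if f(x) > f(best):
        best = x
    T *= 0.999             # cooling: as T -> 0 the rule becomes greedy

print(best)
```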

Bayesian optimisation

Bayesian optimisation

Introduction

If we have already sampled some points from the hyperparameter space, we know something about the shape of the objective over that space.

Can we use this to inform where we should next look?

The shape of the function is \(y=f(\mathbf x)\)

We have observations \(\mathbf X\) and \(\mathbf y\).

So what’s our posterior, \(P(y|\mathbf X, \mathbf y)\)?

Exploration and exploitation

There can be a trade-off between:

  • Exploring - which gives us a better shape for \(y=f(x)\); and

  • Exploiting - which gives us a better estimate for the global optimum.

The surrogate function

We do not know \(y=f(x)\), but we model it as:

\(z(x)=y(x)+\epsilon\)

We can then maximise \(z\) in place of the unknown \(f\).

Proposing new candidates

We want an algorithm which maps from our history of observations to a new candidate.

There are different approaches:

  • Probability of improvement - Choosing one with the highest chance of a more optimal value

  • Expected improvement - Choosing one with the biggest expected increase in the optimal value

  • Entropy search - Choosing one which reduces uncertainty about the global maximum.
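A minimal sketch of the whole loop, assuming a Gaussian-process surrogate (via scikit-learn) and expected improvement as the acquisition rule, on an illustrative one-dimensional objective:

```python
# A minimal Bayesian-optimisation sketch: a Gaussian-process surrogate fitted to
# the observations (X, y), with expected improvement proposing each new candidate.
# The objective, search grid, and GP settings are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    return -(x - 2.0)**2 + np.sin(5 * x)   # expensive function to maximise

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(5, 1))             # initial observations
y = objective(X).ravel()

grid = np.linspace(0, 4, 400).reshape(-1, 1)   # candidate points
for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)  # surrogate z(x)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.max()
    # Expected improvement over the best observation so far.
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = grid[np.argmax(ei)]               # propose the next candidate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print(X[np.argmax(y)], y.max())   # best point found
```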

Evolutionary algorithms

Evolutionary algorithms

Initialisation

We generate a set of candidate parameter values, \(x\).

Evaluate using the fitness function

We evaluate each of these against a fitness function (the function we are optimising).

We assign fitness values to each individual.

Crossover and mutation

We generate a second generation. We select “parents” randomly using the fitness values as weightings.

The values of the new individual are a function of the values of the parents, and noise (mutation).

We do this for each member in the next generation.

We iterate this process across successive generations.
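A minimal sketch, assuming fitness-proportional parent selection, arithmetic crossover, and Gaussian mutation on an illustrative two-parameter problem:

```python
# A minimal evolutionary-algorithm sketch: fitness-weighted parent selection,
# arithmetic crossover, and Gaussian mutation (all parameter choices assumed).
import numpy as np

def fitness(x):
    return -np.sum((x - 3.0)**2, axis=-1)   # higher is better; optimum at (3, 3)

rng = np.random.default_rng(0)
pop = rng.uniform(-10, 10, size=(50, 2))     # initial population of candidates

for _ in range(100):
    fit = fitness(pop)
    weights = fit - fit.min() + 1e-9         # shift so weights are positive
    weights /= weights.sum()
    new_pop = []
    for _ in range(len(pop)):
        # Select two parents at random, weighted by fitness.
        i, j = rng.choice(len(pop), size=2, p=weights)
        alpha = rng.uniform()
        child = alpha * pop[i] + (1 - alpha) * pop[j]      # crossover
        child += rng.normal(scale=0.1, size=child.shape)   # mutation noise
        new_pop.append(child)
    pop = np.array(new_pop)

print(pop[np.argmax(fitness(pop))])   # best individual in the final generation
```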

Differential evolution

Differential evolution

Particle swarms