Stochastic optimisation

Random search

We start with a random set of parameters, \(x\).

We then loop through the following:

  • We define a search space local to our current selection.

  • We randomly select a point from this space.

  • We compare the new point to our current point. If the new point is better we move to that.

Random optimisation

This is similar to random search, however we use a multivariate Gaussian distribution around our current point rather than a hypersphere.

Simulated annealing

Introduction

We can use a version of Metropolis-Hastings to find the global maximum of a function \(f(x)\).

We start with an arbitrary point \(x_0\).

We move randomly from this to identiy a candidate point \(x_c\).

We accept this with probability depending on the the relationship between \(x_0\) and \(x_c\).

This process will converge on the global maximum.

Hyperparameter

There is a hyperparameter for selection. At the extreme this becomes a greedy function.

Bayesian optimisation

Bayesian optimisation

Introduction

If we have sampled from the hyperparameter space we know something about the shape.

Can we use this to inform where we should next look?

The shape of the function is \(y=f(\mathbf x)\)

We have observations \(\mathbf X\) and \(\mathbf y\).

So what’s our posterior, \(P(y|\mathbf X, \mathbf y)\)?

Exploration and exploitation

The can be a tradeoff between:

  • Exploring - which gives us a better shape for \(y=f(x)\); and

  • Exploiting - which gives us a better estimate for the global optimum.

The surrogate function

We do not know \(y=f(x)\), but we model it as:

\(z(x)=y(x)+\epsilon\)

We can then maximise \(z\)

Proposing new candidates

We want an algorithm which maps from our history of observations to a new candidate.

There are different approaches:

  • Probability of improvement - Choosing one with the highest chance of a more optimal value

  • Expected improvement - Choosing one with the biggest expected increase in the optimal value

  • Entropy search - choosing one which reduces uncertainty about the global maximum.

Evolutionary algorithms

Evolutionary algorithms

Initialisation

We generate a set of candidate parameter values, \(x\).

Evaluate using the fitness function

We evaluate each of these against a fitness function (the function we are optimising).

We assign fitness values to each individual.

Crossover and mutation

We generate a second generation. We select “parents” randomly using the fitness values as weightings.

The values of the new individual are a function of the values of the parents, and noise (mutation).

We do this for each member in the next generation.

We iterate this process across successive generations.

Differential evolution

Differential evolution

Particle swarms

Particle swarms