Multi-layer perceptrons and backpropagation

Multi-Layer Perceptron (MLP)

Hidden layers

In the perceptron we have input vector \(x\), and output:

\(a=f(wx)\)

We can augment the perceptron by adding a hidden layer.

Now the output of the activation function is an input to a second layer. By using different weights for each hidden unit, we create a vector of inputs to the second layer.

The parameters of a feed forward model

\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\). So we have \(\Theta^1\) and \(\Theta^2\).

If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:

  • The dimension of \(\Theta^1\) is \((n+1) \times s\)

  • The dimension of \(\Theta^2\) is \((s+1) \times k\)

These include the offsets for each layer.

The activation function of a multi-layer perceptron

For a perceptron we had \(a=f(wx)\). Now we have:

\(a^{j}=f(a^{j-1}\Theta^{j-1})\)

We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).
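As a concrete illustration, here is a minimal NumPy sketch of this forward pass for one hidden layer, using the shapes above (the sigmoid activation and all names are illustrative choices, not fixed by these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    # Prepend the dummy (bias) unit, fixed at 1, to each row.
    return np.hstack([np.ones((a.shape[0], 1)), a])

n, s, k = 4, 5, 3                      # features, hidden units, classes
rng = np.random.default_rng(0)
Theta1 = rng.normal(scale=0.1, size=(n + 1, s))   # maps layer 1 -> 2
Theta2 = rng.normal(scale=0.1, size=(s + 1, k))   # maps layer 2 -> 3

x = rng.normal(size=(1, n))            # one example as a row vector
a1 = add_bias(x)                       # activations of the input layer
a2 = add_bias(sigmoid(a1 @ Theta1))    # a^2 = f(a^1 Theta^1), plus bias
a3 = sigmoid(a2 @ Theta2)              # a^3 = f(a^2 Theta^2), the output
print(a3.shape)                        # (1, k)
```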

Initialising parameters

We start by randomly initialising the value of each \(\theta\).

We do this to break symmetry: if every weight started with the same value, each neuron would compute the same output and receive the same update, so the neurons would move in sync.
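A small sketch of why (my own illustration): with all-equal weights every hidden unit produces the same activation, whereas random weights give each unit a different starting point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -0.3, 2.0]])        # one example

# All-equal initialisation: every hidden unit is identical.
Theta_equal = np.full((3, 4), 0.1)
print(sigmoid(x @ Theta_equal))          # four identical activations

# Random initialisation breaks the symmetry.
rng = np.random.default_rng(0)
Theta_rand = rng.normal(scale=0.1, size=(3, 4))
print(sigmoid(x @ Theta_rand))           # four different activations
```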

Dummies in neural networks

Each layer other than the output has a dummy (bias) unit whose activation is fixed at \(1\); its weights provide the offsets, which is why the weight matrices above have the extra \(+1\) row.

Backpropagation

Adapting the delta rule

To arrive at the delta rule we considered the cost function:

\(E=\sum_j\dfrac{1}{2}(y_j-a_j)^2\)

And used the chain rule:

\(\dfrac{\partial E}{\partial \theta_i}=\sum_j\dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial z_j}\dfrac{\partial z_j}{\partial \theta_i}\)

This gave us:

\(\Delta \theta_i=\alpha \sum_j(y_j-a_j)a'(z_j)x_{ij}\)

Or, setting \(\delta_j=-\dfrac{\partial E}{\partial z_j}=(y_j-a_j)a'(z_j)\):

\(\Delta \theta_i=\alpha \sum_j\delta_j x_{ij}\)
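A minimal NumPy sketch of this batch update for a single unit, assuming a sigmoid activation and treating \(j\) as indexing the training examples (the data and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # rows are examples j, columns features i
y = (X[:, 0] > 0).astype(float)         # toy targets
theta = np.zeros(3)
alpha = 0.1

for _ in range(100):
    z = X @ theta                        # z_j for every example
    a = sigmoid(z)                       # a_j
    delta = (y - a) * a * (1 - a)        # delta_j = (y_j - a_j) a'(z_j)
    theta += alpha * X.T @ delta         # Delta theta_i = alpha * sum_j delta_j x_ij
```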

Let’s update the rule for multiple layers:

\(\dfrac{\partial E}{\partial \theta_{li}}=\dfrac{\partial E}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\dfrac{\partial z_{lj}}{\partial \theta_{li}}\)

Previously \(\dfrac{\partial z_{j}}{\partial \theta_{i}}=x_i\). We now use the more general \(\dfrac{\partial z_{lj}}{\partial \theta_{li}}=a_{li}\), the activation feeding the weight. For the first hidden layer these are the same, since the activations feeding it are the inputs \(x_i\).

We can then instead write:

\(\Delta \theta_{li}=\alpha \delta_{lj} a_{li}\)

Calculating delta values

Now we need a way of calculating the value of \(\delta_{lj}\) for all neurons.

\(\delta_{lj}=-\dfrac{\partial E}{\partial z_{lj}}\)

If \(j\) is an output node, then this is simply \((y_j-a_j)a'(z_j)\)

If this is not an output node, then a change in its input affects the error through all of the neurons in the layers that follow.

In this case:

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(l)}\dfrac{\partial E}{\partial z_{k}}\dfrac{\partial z_{k}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(l)}-\delta_{k}\dfrac{\partial z_{k}}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(l)}-\delta_{k}\theta_{kj}a'(z_{lj})\)

\(\delta_{lj}=a'(z_{lj})\sum_{k\in \mathrm{succ}(l)}\delta_{k}\theta_{kj}\)

For each layer there is a matrix of weights \(\theta\), whose rows and columns correspond to the units of the current layer and of the next layer, so there is one such matrix for every layer-to-layer mapping in the network.
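Putting the pieces together, a hedged sketch of backpropagation for the one-hidden-layer network above, assuming sigmoid activations and the squared-error cost (my own illustrative implementation, not code from these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    return np.hstack([np.ones((a.shape[0], 1)), a])

rng = np.random.default_rng(0)
n, s, k = 3, 4, 2
X = rng.normal(size=(10, n))
Y = np.eye(k)[rng.integers(0, k, size=10)]        # one-hot targets
Theta1 = rng.normal(scale=0.1, size=(n + 1, s))
Theta2 = rng.normal(scale=0.1, size=(s + 1, k))
alpha = 0.5

for _ in range(200):
    # Forward pass
    a1 = add_bias(X)
    a2 = add_bias(sigmoid(a1 @ Theta1))
    a3 = sigmoid(a2 @ Theta2)

    # Output-layer deltas: (y - a) a'(z)
    delta3 = (Y - a3) * a3 * (1 - a3)

    # Hidden-layer deltas: a'(z) * sum_k delta_k theta_kj
    # (drop the bias column, which has no delta of its own)
    delta2 = (a2 * (1 - a2) * (delta3 @ Theta2.T))[:, 1:]

    # Weight updates: Delta theta = alpha * a^T delta
    Theta2 += alpha * a2.T @ delta3
    Theta1 += alpha * a1.T @ delta2
```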

Ill conditioning

Gradient descent can skip over the actual minimum if the error surface is not “smooth” enough (ill conditioned).

Deep neural networks

More layers allow for more complex functions

With additional hidden layers we can map more complex functions.

The hidden units behave like logic gates, and additional layers allow them to be combined effectively.

2 hidden layers can map highly complex functions

With only two hidden layers we can map any function for classification, including discontinuous functions.

Convexity

Unlike for linear models, the error function for a neural network is not convex, so gradient descent can get stuck in local minima or saddle points.

Unstable gradient problem

Vanishing gradient problem

Gradients can become very small as they are propagated back through the layers, so learning in the earlier layers can be very slow.

Exploding gradient problem

Gradients can become very large, so the updates overshoot and training does not converge.

ReLU

ReLU, \(f(z)=\max(0,z)\), helps address the unstable gradient problem: its derivative is \(1\) for positive inputs, so it does not shrink gradients the way saturating activations such as the sigmoid do.
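A small illustration of the difference (my own example): the sigmoid's derivative is at most \(0.25\), so a product of such factors over many layers vanishes, whereas ReLU's derivative is \(1\) for positive inputs.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

z, depth = 1.5, 20                        # a positive pre-activation, 20 layers
print(sigmoid_grad(z) ** depth)           # tiny: repeated factors <= 0.25 vanish
print(relu_grad(z) ** depth)              # 1.0: the gradient passes through unchanged
```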

Curse of dimensionality

Increasing the number of dimensions in a layer

The topology of the layers matters: increasing the number of units in subsequent layers is like increasing the dimension of the representation.

We are trying to make the data linearly separable. It may be that we need additional dimensions to do this, rather than a series of transformations within the existing number of dimensions.

For example, for a circle of data within a circle of data there is no linearly separating line, so no amount of depth will split the data without increasing the number of dimensions.
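A concrete sketch of that example (my own illustration): adding a third feature, the squared radius, makes the two circles separable by a plane.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.where(rng.random(200) < 0.5, 1.0, 3.0)   # inner vs outer circle
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (radii > 2.0).astype(int)

# Lift into three dimensions: the extra feature is the squared radius.
X3 = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# In 3D the classes are separated by the plane x3 = 4.
print(np.all((X3[:, 2] > 4.0) == y))      # True
```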

Optimisation

Input layer normalisation

We normalise the input features, for example by subtracting each feature's mean and dividing by its standard deviation.

This speeds up training and makes regularisation behave more sensibly, since all features are on a comparable scale.
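A minimal sketch of input normalisation (standardising each feature using statistics from the training set, which would be reused at test time):

```python
import numpy as np

def fit_normaliser(X_train):
    # Per-feature mean and standard deviation from the training data only.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8    # avoid division by zero
    return mu, sigma

def normalise(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
mu, sigma = fit_normaliser(X_train)
X_norm = normalise(X_train, mu, sigma)
print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))
```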

Batch normalisation

We can normalise the other layers as well. For each unit we take its inputs over a mini-batch, subtract the batch mean, and divide by the batch standard deviation.

Batch normalisation and covariate shift

Batch normalisation reduces internal covariate shift, the change in the distribution of a layer's inputs as earlier layers are updated. It can also make networks better adapted for related problems.

Training with batch normalisation

During training, the mean and standard deviation are computed on each mini-batch, and learned scale and shift parameters (\(\gamma\) and \(\beta\)) are applied to the normalised values. At test time, running averages of the training statistics are used instead.
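A hedged sketch of the batch normalisation transform for one layer over a mini-batch, with learned scale \(\gamma\) and shift \(\beta\) and a small constant \(\epsilon\) for numerical stability (illustrative, not code from these notes):

```python
import numpy as np

def batch_norm_train(z, gamma, beta, eps=1e-5):
    # Normalise each unit over the mini-batch, then scale and shift.
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=5.0, size=(32, 4))   # a mini-batch of pre-activations
gamma, beta = np.ones(4), np.zeros(4)              # learned per-unit parameters
out, batch_mu, batch_var = batch_norm_train(z, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```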

Other

Representational sparsity

This is where the activations of the nodes are often exactly \(0\) (as happens with ReLU), as opposed to sparsity in the parameters themselves.
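For instance (my own illustration), a ReLU layer typically leaves a large fraction of activations exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 50))            # pre-activations of a layer
a = np.maximum(0, z)                      # ReLU activations
print((a == 0).mean())                    # roughly half the activations are exactly 0
```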

Catastrophic interference

This is where training a network on a new task overwrites what it learned on earlier tasks, so performance on those earlier tasks degrades sharply.