Backpropagation

Recap on the delta rule

To arrive at the delta rule we considered the cost function:

\(E=\sum_j\dfrac{1}{2}(y_j-a_j)^2\)

This gave us:

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))f'(\sum_i\theta^ix_j^i)x_j^i\)

By defining \(z_j=\sum_i\theta^ix_j^i\) and \(a=f\) we have:

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-a(z_j))a'(z_j)x_j^i\)
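As a quick sanity check, this gradient can be compared against a finite-difference estimate of the cost. The sketch below is purely illustrative: the sigmoid activation, the random data, and all variable names are assumptions rather than anything fixed by these notes.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # assumed sigmoid activation

def f_prime(z):
    return f(z) * (1.0 - f(z))        # its derivative

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))           # x_j^i: 5 examples, 3 inputs
y = rng.random(5)                     # y_j: targets
theta = rng.normal(size=3)            # parameters

def cost(theta):
    a = f(X @ theta)
    return 0.5 * np.sum((y - a) ** 2)

# Analytic gradient: dE/dtheta^i = -sum_j (y_j - a(z_j)) a'(z_j) x_j^i
z = X @ theta
grad_analytic = -(X.T @ ((y - f(z)) * f_prime(z)))

# Central-difference estimate for comparison
eps = 1e-6
grad_numeric = np.array([(cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
                         for e in np.eye(3)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```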

We define delta as:

\(\delta_j=-\dfrac{\delta E}{\delta z_j}=(y_j-a_j)a'(z_j)\)

So:

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j\delta_j x_j^i\)

We update the parameters using gradient descent:

\(\Delta \theta^i=\alpha \sum_j\delta_j x_j^i\)

Or, written out in full:

\(\Delta \theta^i=\alpha \sum_j(y_j-a_j)a'(z_j)x_j^i\)
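Putting the pieces together, here is a minimal sketch of the delta rule as a training loop (again assuming a sigmoid activation, random data, and an illustrative learning rate; none of these choices come from the notes):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # assumed sigmoid activation

def f_prime(z):
    return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))           # x_j^i: 5 examples, 3 inputs
y = rng.random(5)                     # y_j: targets
theta = rng.normal(size=3)
alpha = 0.1                           # learning rate (illustrative)

for _ in range(1000):
    z = X @ theta                     # z_j = sum_i theta^i x_j^i
    a = f(z)                          # a_j = f(z_j)
    delta = (y - a) * f_prime(z)      # delta_j = (y_j - a_j) a'(z_j)
    theta += alpha * (X.T @ delta)    # Delta theta^i = alpha sum_j delta_j x_j^i
```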

Adapting the delta rule for multiple layers

Let’s update the rule for multiple layers. For the single layer we used the chain rule:

\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta E}{\delta a_j}\dfrac{\delta a_j}{\delta z_j}\dfrac{\delta z_j}{\delta \theta^i}\)

Where \(a_j=f(z_j)\) and \(z_j=\boldsymbol{\theta}\cdot\boldsymbol{x}_j\)

With layers indexed by \(l\), the same chain rule for a parameter \(\theta_{lij}\) connecting input \(i\) to neuron \(j\) of layer \(l\) becomes:

\(\dfrac{\delta E}{\delta \theta_{lij}}=\dfrac{\delta E}{\delta a_{lj}}\dfrac{\delta a_{lj}}{\delta z_{lj}}\dfrac{\delta z_{lj}}{\delta \theta_{lij}}\)

Previously \(\dfrac{\delta z_{lj}}{\delta \theta_{lij}}=x_i\). We now use the more general \(a_{li}\), the activation of neuron \(i\) feeding into layer \(l\). For the first layer these are the same, since the inputs to the first layer are just the \(x_i\).

We can then instead write:

\(\Delta \theta_{lij}=\alpha \delta_{lj} a_{li}\)
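In vector form this update is just an outer product between the activations feeding the layer and that layer’s deltas. A minimal sketch, with purely illustrative shapes and values:

```python
import numpy as np

alpha = 0.1
a_in = np.array([0.2, 0.7, 0.1])           # a_{li}: activations feeding layer l
delta_l = np.array([0.05, -0.02])          # delta_{lj}: one value per neuron in layer l
Theta_l = np.zeros((3, 2))                 # theta_{lij}: input i -> neuron j

# Delta theta_{lij} = alpha * delta_{lj} * a_{li}
Theta_l += alpha * np.outer(a_in, delta_l)
```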

Now we need a way of calculating the value of \(\delta_{lj}\) for all neurons.

\(\delta_{lj}=-\dfrac{\delta E}{\delta z_{lj}}\)

If this is an output node, then this is simply \((y_j-a_{lj})a'(z_{lj})\), just as in the single-layer case.

If this is not an output node, then a change in \(z_{lj}\) affects the error only indirectly, through every neuron it feeds in the next layer.

In this case:

\(\dfrac{\delta E}{\delta z_{lj}}=\sum_{k\in \mathrm{succ}(j)}\dfrac{\delta E}{\delta z_{(l+1)k}}\dfrac{\delta z_{(l+1)k}}{\delta z_{lj}}\)

\(\dfrac{\delta E}{\delta z_{lj}}=\sum_{k\in \mathrm{succ}(j)}-\delta_{(l+1)k}\dfrac{\delta z_{(l+1)k}}{\delta a_{lj}}\dfrac{\delta a_{lj}}{\delta z_{lj}}\)

\(\dfrac{\delta E}{\delta z_{lj}}=-\sum_{k\in \mathrm{succ}(j)}\delta_{(l+1)k}\theta_{(l+1)jk}a'(z_{lj})\)

\(\delta_{lj}=a'(z_{lj})\sum_{k\in \mathrm{succ}(j)}\delta_{(l+1)k}\theta_{(l+1)jk}\)

where \(\mathrm{succ}(j)\) is the set of neurons in layer \(l+1\) that neuron \(j\) feeds into.

For each layer there is a matrix of parameters, whose rows and columns index the \(\theta\) values between that layer and the next; the network as a whole is described by one such matrix per layer.
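The recursion above is enough to sketch a full backward pass for a small fully connected network. Everything below is an illustrative assumption (the sigmoid activation, the 3-4-2 architecture, the random data and learning rate); it is only meant to show the delta computation layer by layer, with one parameter matrix per layer.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # assumed sigmoid activation

def f_prime(z):
    return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # one input example
y = rng.random(2)                          # targets for the output layer
thetas = [rng.normal(size=(3, 4)),         # layer 1: 3 inputs  -> 4 neurons
          rng.normal(size=(4, 2))]         # layer 2: 4 neurons -> 2 outputs

# Forward pass: keep z_l and a_l for every layer
activations, zs = [x], []
for Theta in thetas:
    z = activations[-1] @ Theta
    zs.append(z)
    activations.append(f(z))

deltas = [None] * len(thetas)

# Output layer: delta_j = (y_j - a_j) a'(z_j)
deltas[-1] = (y - activations[-1]) * f_prime(zs[-1])

# Hidden layers: delta_{lj} = a'(z_{lj}) * sum_k delta_{(l+1)k} theta_{(l+1)jk}
for l in range(len(thetas) - 2, -1, -1):
    deltas[l] = f_prime(zs[l]) * (thetas[l + 1] @ deltas[l + 1])

# Update: Delta theta_{lij} = alpha * delta_{lj} * a_{li}
alpha = 0.1
for l, Theta in enumerate(thetas):
    Theta += alpha * np.outer(activations[l], deltas[l])
```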

Initialising parameters

We start by randomly initialising the value of each \(\theta\).

We do this to break the symmetry between neurons: if every \(\theta\) started at the same value, all the neurons in a layer would receive identical gradients and would update in sync.
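A minimal sketch of such an initialisation (the layer sizes and the 0.01 scale are illustrative assumptions, not something the notes prescribe):

```python
import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [3, 4, 2]                    # illustrative architecture

# One parameter matrix per layer, filled with small random values so that
# neurons in the same layer start out different and do not update in sync.
thetas = [0.01 * rng.normal(size=(n_in, n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
```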