In the perceptron we have an input vector \(x\), a weight vector \(w\), and output:

\(a=f(wx)\)

We can augment the perceptron by adding a hidden layer.

Now the output of the activation function in the hidden layer becomes the input to a second layer. By using different weights, we create a vector of inputs to this second layer.

\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\). So we have \(\Theta^1\) and \(\Theta^2\).

If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:

The dimension of \(\Theta^1\) is \((n+1) \times s\)

The dimension of \(\Theta^2\) is \((s+1) \times k\)

These dimensions include the offset (bias) term for each layer.
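With hypothetical sizes (say \(n=4\) features, \(s=5\) hidden units, \(k=3\) classes), the weight-matrix shapes can be checked with a short numpy sketch:

```python
import numpy as np

# Hypothetical sizes: n features, s hidden units, k classes.
n, s, k = 4, 5, 3

# Theta^1 maps the (bias-augmented) input layer to the hidden layer.
Theta1 = np.zeros((n + 1, s))  # (n+1) x s; row 0 holds the offsets
# Theta^2 maps the (bias-augmented) hidden layer to the output layer.
Theta2 = np.zeros((s + 1, k))  # (s+1) x k

print(Theta1.shape)  # (5, 5)
print(Theta2.shape)  # (6, 3)
```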

For a perceptron we had \(a=f(wx)\). Now we have:

\(a^{j}=f(a^{j-1}\Theta^{j-1})\)

We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).
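The layer-to-layer update can be sketched in numpy. This is a minimal sketch assuming a sigmoid activation and the \((\text{units}+1)\times\text{units}\) weight shapes above; all sizes and values are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward pass: each layer's activation, bias-augmented, feeds the next.

    Thetas[j] has shape (units_in + 1, units_out), matching the
    (n+1) x s and (s+1) x k dimensions above.
    """
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))  # prepend the offset unit
        a = sigmoid(a @ Theta)
    return a

# Hypothetical example: n=2 features, s=3 hidden units, k=2 classes.
rng = np.random.default_rng(0)
Theta1 = rng.normal(scale=0.1, size=(3, 3))  # (n+1) x s
Theta2 = rng.normal(scale=0.1, size=(4, 2))  # (s+1) x k
out = forward(np.array([0.5, -1.0]), [Theta1, Theta2])
print(out.shape)  # (2,)
```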

We start by randomly initialising the value of each \(\theta\).

We do this to break symmetry: if the weights all started with the same value, every neuron in a layer would compute the same output and receive the same update, so they would move in sync.
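A small numpy illustration of why the symmetry must be broken (the weight and input values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -0.5])  # arbitrary bias-augmented input

# Two hidden units given identical weight columns compute identical
# activations, so they would also receive identical gradient updates
# and could never differentiate from one another.
same = np.column_stack([np.full(3, 0.1)] * 2)
a_same = sigmoid(x @ same)
assert a_same[0] == a_same[1]

# Small random initial values break the symmetry.
rng = np.random.default_rng(0)
rand = rng.uniform(-0.1, 0.1, size=(3, 2))
a_rand = sigmoid(x @ rand)  # the two units now behave differently
```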

To arrive at the delta rule we considered the cost function:

\(E=\sum_j\dfrac{1}{2}(y_j-a_j)^2\)

And used the chain rule:

\(\dfrac{\partial E}{\partial \theta_i }=\dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial z_j}\dfrac{\partial z_j}{\partial \theta_i}\)

This gave us:

\(\Delta \theta_i=\alpha \sum_j(y_j-a_j)a'(z_j)x_{ij}\)

Or, setting \(\delta_j=-\dfrac{\partial E}{\partial z_j}=(y_j-a_j)a'(z_j)\):

\(\Delta \theta_i=\alpha \sum_j\delta_j x_{ij}\)
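The delta rule for a single sample can be sketched as follows. All values here are hypothetical, and a sigmoid activation is assumed so that \(a'(z)=a(z)(1-a(z))\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single-sample update; x includes a bias input of 1.
alpha = 0.1
x = np.array([1.0, 0.5, -0.5])
theta = np.array([0.2, -0.1, 0.3])
y = 1.0

z = x @ theta
a = sigmoid(z)
delta = (y - a) * a * (1 - a)      # (y - a) a'(z), with a'(z) = a(1 - a)
theta = theta + alpha * delta * x  # Delta theta_i = alpha * delta * x_i
```

After the update the output for this sample moves towards the target \(y\).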

Let’s update the rule for multiple layers:

\(\dfrac{\partial E}{\partial \theta_{li}}=\dfrac{\partial E}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\dfrac{\partial z_{lj}}{\partial \theta_{li}}\)

Previously \(\dfrac{\partial z_{lj}}{\partial \theta_{li}}=x_i\). We now use the more general \(a_{li}\). For the first layer, these will be the same.

We can then instead write:

\(\Delta \theta_{li}=\alpha \delta_{lj} a_{li}\)

Now we need a way of calculating the value of \(\delta_{lj}\) for all neurons.

\(\delta_{lj}=-\dfrac{\partial E}{\partial z_{lj}}\)

If this is an output node \(j\), then this is simply \((y_j-a_j)a'(z_j)\).

If this is not an output node, then the impact of change in the parameter will affect the results through all intermediate neurons.

In this case:

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(j)}\dfrac{\partial E}{\partial z_{k}}\dfrac{\partial z_{k}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(j)}-\delta_{k}\dfrac{\partial z_{k}}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(j)}-\delta_{k}\theta_{kj}a'(z_{lj})\)

\(\delta_{lj}=a'(z_{lj})\sum_{k\in \mathrm{succ}(j)}\delta_{k}\theta_{kj}\)
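The recursion can be sketched for a small feed-forward network. This is a sketch under the conventions above (sigmoid activations, weight matrices of shape \((\text{units}+1)\times\text{units}\)); the names and sizes are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_deltas(x, y, Thetas):
    """Return a list of delta vectors, one per non-input layer.

    Thetas[l] has shape (units_in + 1, units_out); row 0 holds the
    offsets, which receive no delta of their own.
    """
    # Forward pass, keeping every (bias-augmented) activation vector.
    activations = [np.concatenate(([1.0], x))]
    for Theta in Thetas:
        a = sigmoid(activations[-1] @ Theta)
        activations.append(np.concatenate(([1.0], a)))
    a_out = activations[-1][1:]

    # Output layer: delta_j = (y_j - a_j) a'(z_j), with a'(z) = a(1 - a).
    deltas = [(y - a_out) * a_out * (1 - a_out)]
    # Hidden layers: delta_j = a'(z_j) * sum_k theta_kj delta_k,
    # summing over the successor units k in the next layer.
    for l in range(len(Thetas) - 1, 0, -1):
        a = activations[l][1:]               # drop the offset unit
        back = Thetas[l][1:, :] @ deltas[0]  # sum_k theta_kj delta_k
        deltas.insert(0, a * (1 - a) * back)
    return deltas

# Hypothetical two-layer network: 2 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
Thetas = [rng.normal(scale=0.1, size=(3, 3)),
          rng.normal(scale=0.1, size=(4, 2))]
deltas = backprop_deltas(np.array([0.5, -1.0]), np.array([1.0, 0.0]), Thetas)
print([d.shape for d in deltas])  # [(3,), (2,)]
```

Each \(\delta_{lj}\) is then combined with the activation \(a_{li}\) feeding it to give the weight update \(\alpha\,\delta_{lj}a_{li}\).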

For each layer there is a matrix of weights \(\theta\), whose rows and columns index the units of the current layer and of the next layer. We have one such matrix for each layer transition in the network.

If the error surface is not “smooth” enough, gradient descent can skip over the actual optimum.

With additional hidden layers we can map more complex functions.

Individual units can act like logic gates, and hidden layers allow these gates to be combined effectively.

With only two hidden layers we can map any function for classification, including discontinuous functions.

The error function for a neural network is not convex, so gradient descent may converge to a local minimum.

Gradients can become very small (vanishing gradients), so learning in the early layers can be very slow.

Gradients can also become very large (exploding gradients), so training may fail to converge.

Together, these are known as the unstable gradient problem.

Topology of layers: increasing the number of units in subsequent layers is like increasing the dimension of the representation.

We are trying to make the data linearly separable. It may be that we need additional dimensions to do this, rather than a series of transformations within the existing number of dimensions.

E.g. for a circle of data within a circle of data, there is no linearly separating boundary, so no amount of depth without increasing the dimension will split the data.
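The circle-in-circle example can be made concrete: adding one extra dimension \(r^2=x_1^2+x_2^2\) (a hand-picked lifting here, chosen for illustration) makes the two classes separable by a single threshold:

```python
import numpy as np

# Concentric circles: inner class at radius 0.5, outer class at radius 2.
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, size=200)
inner = np.c_[0.5 * np.cos(angles[:100]), 0.5 * np.sin(angles[:100])]
outer = np.c_[2.0 * np.cos(angles[100:]), 2.0 * np.sin(angles[100:])]

def lift(X):
    # Map (x1, x2) -> (x1, x2, x1^2 + x2^2): one extra dimension.
    return np.c_[X, (X ** 2).sum(axis=1)]

# No line separates the classes in 2D, but in the lifted 3D space the
# plane r^2 = 1 does.
assert (lift(inner)[:, 2] < 1.0).all()
assert (lift(outer)[:, 2] > 1.0).all()
```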

We normalise the input layer.

This speeds up training, and makes regularisation saner.

We can normalise other layers too. We take each input, subtract the batch mean, and divide by the batch standard deviation.
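A minimal normalisation sketch (the learnable scale and shift parameters of full batch normalisation are omitted; the values are hypothetical):

```python
import numpy as np

def batch_normalise(X, eps=1e-5):
    """Per-feature: subtract the batch mean, divide by the batch std.

    eps guards against division by zero; the learnable scale and shift
    of full batch normalisation are omitted.
    """
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

# Features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Xn = batch_normalise(X)
# Each column of Xn now has roughly zero mean and unit variance.
```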

Batch normalisation can make networks better adapted for related problems.

This is where the activations of the nodes are often \(0\), as opposed to just the parameters being \(0\).