Feedforward neural networks

Multi-Layer Perceptron (MLP)

Hidden layers

In the perceptron we have input vector \(x\), and output:

\(a=f(wx)\)
We can augment the perceptron by adding a hidden layer.

Now the output of the activation function is an input to a second layer. By using different weights, we can create a second vector of inputs to the second layer.

The parameters of a feedforward model

\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\). So we have \(\Theta^1\) and \(\Theta^2\).

If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:

  • The dimension of \(\Theta^1\) is \((n+1) \times s\)

  • The dimension of \(\Theta^2\) is \((s+1) \times k\)

These include the offsets for each layer.
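A minimal sketch of the forward pass using these shapes. The sizes \(n=4\), \(s=5\), \(k=3\) and the sigmoid activation are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n features, s hidden units, k classes
n, s, k = 4, 5, 3

Theta1 = rng.normal(size=(n + 1, s))   # (n+1) x s, includes the offset row
Theta2 = rng.normal(size=(s + 1, k))   # (s+1) x k, includes the offset row

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Forward pass for one example x of shape (n,)."""
    a1 = np.concatenate(([1.0], x))    # prepend the offset (bias) unit
    a2 = sigmoid(a1 @ Theta1)          # hidden layer activations, shape (s,)
    a2 = np.concatenate(([1.0], a2))   # offset unit for the next layer
    return sigmoid(a2 @ Theta2)        # output layer, shape (k,)

out = forward(rng.normal(size=n))
```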

The activation function of a multi-layer perceptron

For a perceptron we had \(a=f(wx)\). Now we have:

\(a_i^{j+1}=f\left(\sum_k \Theta^{j}_{ki} a_k^{j}\right)\)
We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).

Initialising parameters

We start by randomly initialising the value of each \(\theta \).

We do this to prevent the neurons from moving in sync: if all weights started equal, every neuron in a layer would compute the same output and receive the same update.

Dummies in neural networks

Back propagation

Adapting the delta rule

To arrive at the delta rule we considered the cost function:

\(E=\dfrac{1}{2}\sum_j(y_j-a_j)^2\)
And used the chain rule:

\(\dfrac{\partial E}{\partial \theta_i }=\dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial z_j}\dfrac{\partial z_j}{\partial \theta_i}\)

This gave us:

\(\Delta \theta_i=\alpha \sum_j(y_j-a_j)a'(z_j)x_{ij}\)

Or, setting \(\delta_j=-\dfrac{\partial E}{\partial z_j}=(y_j-a_j)a'(z_j)\):

\(\Delta \theta_i=\alpha \sum_j\delta_j x_{ij}\)

Let’s update the rule for multiple layers:

\(\dfrac{\partial E}{\partial \theta_{li}}=\dfrac{\partial E}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\dfrac{\partial z_{lj}}{\partial \theta_{li}}\)

Previously \(\dfrac{\partial z_{lj}}{\partial \theta_{li}}=x_i\). We now use the more general \(a_{li}\). For the first layer, these will be the same.

We can then instead write:

\(\Delta \theta_{li}=\alpha \delta_{lj} a_{li}\)

Calculating delta values

Now we need a way of calculating the value of \(\delta_{lj}\) for all neurons.

\(\delta_{lj}=-\dfrac{\partial E}{\partial z_{lj}}\)

If this is an output node, then this is simply \((y_j-a_j)a'(z_{lj})\)

If this is not an output node, then the impact of change in the parameter will affect the results through all intermediate neurons.

In this case:

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(j)}\dfrac{\partial E}{\partial z_{k}}\dfrac{\partial z_{k}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(j)}-\delta_{k}\dfrac{\partial z_{k}}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in \mathrm{succ}(j)}-\delta_{k}\theta_{kj}a'(z_{lj})\)

\(\delta_{lj}=a'(z_{lj})\sum_{k\in \mathrm{succ}(j)}\delta_{k}\theta_{kj}\)

For each layer there is a matrix of weights \(\theta \), whose rows and columns correspond to the units of the current layer and the next layer. We have one such matrix for each layer transition in the network.
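The backpropagation rules above can be sketched end-to-end for a single hidden layer. The XOR data, learning rate, number of hidden units, and sigmoid activation are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # prepend a column of 1s (the offset unit)
    return np.hstack([np.ones((A.shape[0], 1)), A])

# Toy data: XOR, which a single perceptron cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

n, s, k = 2, 8, 1
Theta1 = rng.normal(size=(n + 1, s))   # (n+1) x s
Theta2 = rng.normal(size=(s + 1, k))   # (s+1) x k
alpha = 1.0

err0 = np.mean((Y - sigmoid(add_bias(sigmoid(add_bias(X) @ Theta1)) @ Theta2)) ** 2)

for epoch in range(5000):
    # Forward pass
    A1 = add_bias(X)
    Z2 = A1 @ Theta1
    A2 = add_bias(sigmoid(Z2))
    Z3 = A2 @ Theta2
    A3 = sigmoid(Z3)

    # Output layer: delta = (y - a) a'(z)
    D3 = (Y - A3) * A3 * (1 - A3)
    # Hidden layer: delta = a'(z) * sum_k delta_k theta_kj
    # (drop the offset row of Theta2: no delta flows back to the offset unit)
    D2 = (D3 @ Theta2[1:].T) * sigmoid(Z2) * (1 - sigmoid(Z2))

    # Delta rule: Delta theta = alpha * delta * a
    Theta2 += alpha * A2.T @ D3
    Theta1 += alpha * A1.T @ D2

pred = sigmoid(add_bias(sigmoid(add_bias(X) @ Theta1)) @ Theta2)
err = np.mean((Y - pred) ** 2)
```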

Ill conditioning

If the error surface is ill conditioned (not "smooth" enough), gradient descent can skip over the actual optimum.

More than \(2\) classes


The softmax function is often used in the last layer of a classification network.

It takes a vector of dimension \(k\) and returns another vector of the same size. Only, this time all numbers are between \(0\) and \(1\) and the values sum to \(1\).

The softmax function generalises the sigmoid function:

\(\mathrm{softmax}(z)_i=\dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}\)
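A sketch of softmax, using the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max before exponentiating
    (this does not change the result, but avoids overflow)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
```

The outputs are all between \(0\) and \(1\) and sum to \(1\), as required.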


Local Winner Takes All (LWTA) layer

The output of every node in the layer's local group is \(0\) unless its activation is the greatest; the winner passes its value through.

Maxout layer

A single node that outputs the maximum of all its inputs.


Deep neural networks

More layers allow for more complex functions

With additional hidden layers we can map more complex functions.

These allow the effective combination of logic gates.

2 hidden layers can map highly complex functions

With only two hidden layers we can map any function for classification, including discontinuous functions.


The error function for neural networks is nearly convex.

Unstable gradient problem

Vanishing gradient problem

Gradients can become very small, so learning in early layers can be very slow.

Exploding gradient problem

Gradients can become too large, so training does not converge.


This addresses the unstable gradient problem.

Curse of dimensionality

Regularising neural networks

Feature normalisation

Dropout, and dropout layers

\(L_2\) regularisation (including how to change the backprop algorithm)

Sparse networks

Parameters are set to \(0\) and not trained.

Parameter sharing

Parameters share the same value and are trained together.

Weight decay

After each update, multiply the parameter by \(p<1\).

The anomaly detection problem

The input can be changed to obtain any classification.

Early stopping

Residual blocks

In a node we have:

\(a^{l+1}=f(\Theta^{l} a^{l})\)

That is, the value of a node is the activation applied to the weighted sum of the previous layer.

Residual blocks, however, look further back than one layer. They include the full data from an older layer (without weights):

\(a^{l+2}=f(z^{l+2}+a^{l})\)
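A minimal sketch of a residual block, assuming a ReLU activation and an identity (unweighted) skip connection of matching dimension:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a, W1, W2):
    """Two weighted layers, then add the block's input back in
    (the unweighted skip connection) before the final activation."""
    z = relu(a @ W1) @ W2
    return relu(z + a)

out = residual_block(np.array([1.0, -2.0, 3.0]), np.eye(3), np.eye(3))
```

With zero weight matrices the block reduces to \(f(a^l)\): the skip connection alone carries the data through.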



Input layer normalisation

We normalise the input layer.

This speeds up training, and makes regularisation saner.

Batch normalisation

We can normalise other layers too. We take each input, subtract the batch mean, and divide by the batch standard deviation.
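A sketch of this normalisation. The small `eps` guarding against division by zero is a standard addition, and the learnable scale and shift parameters used in practice are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(X, eps=1e-5):
    """Normalise each feature over the batch: subtract the batch mean,
    then divide by the batch standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

# A batch of 32 examples with 4 features, deliberately off-centre and scaled
Z = batch_norm(rng.normal(size=(32, 4)) * 10 + 5)
```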

Batch normalisation and covariance shift

By reducing internal covariate shift, batch normalisation can make networks better adapted for related problems.

Training with batch normalisation

Alternatives to backpropagation

Greedy pretraining

Cascade-correlation learning architecture

This is a method for both building and training.

We start with a bare-bones network. We then add nodes one by one, training each and then freezing its weights.

Extreme learning machines

This is an alternative to backpropagation for training a feedforward neural network.

We start with random parameters for each layer \(W_i\).

We have:

\(\hat y=W_2\sigma (W_1 x)\)


We calculate:

\(W_2=\sigma(W_1X)^{+}Y\)

where \(^{+}\) denotes the Moore–Penrose pseudoinverse.
So \(W_1\) is random and not updated.

\(W_2\) is assigned to minimise loss, where \(W_2\) has no activation function.
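A sketch of an extreme learning machine on a toy regression problem. The data, hidden size, and the use of `np.linalg.pinv` for the least-squares solve are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative): learn y = sin(x)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X)

h = 50
W1 = rng.normal(size=(1, h))   # random first layer, never updated
b1 = rng.normal(size=h)

# Hidden activations sigma(W1 x); rows are samples in this convention
H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))

# W2 is the least-squares solution, via the Moore-Penrose pseudoinverse
W2 = np.linalg.pinv(H) @ y

y_hat = H @ W2
```

Training is a single linear solve, with no iteration and no backpropagation.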

Convolutional layers


We can connect each node in the first hidden layer to a subset of the input layer, e.g. one node for each \(5\times 5\) patch of pixels.

We also share the same weights across the first layer. This gives far fewer parameters, and the layer can still learn everything useful.

This also uses windows. Instead of taking the max, we multiply the window by a matrix elementwise and sum the values.

Each matrix can represent some feature, like a curve.

We can use multiple convolution matrices to create multiple output matrices.

The matrices are called kernels. They are trained, starting from random values.
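The windowed multiply-and-sum can be sketched directly (stride 1, no padding; the edge-detecting kernel is an illustrative example, not from the notes):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each window position,
    multiply elementwise and sum (stride 1, no padding)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# An illustrative vertical-edge kernel applied to a half-dark image
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
edge = np.array([[-1, 1], [-1, 1]], dtype=float)
fmap = conv2d(img, edge)
```

The feature map responds only where the image changes from dark to light, i.e. where the edge feature is present.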

Training convolutional layers

Invariance of convolutional layers (rotation, translation)

Flattening layers

We split the data up every time we use convolutional layers.

Flattening layers bring it all back together.

The parameters are those of pooling layers (height, width, stride, padding), plus the set of convolutions.

Multi-scale convolutions

We use different window sizes in parallel.

Pooling (max pooling, average pooling, subsampling)

Pooling layers

The input is a matrix. We place a number of windows on the input matrix. The max of each window is an input to the next layer.

This means fewer parameters, easier computation, and less chance of overfitting.

Parameters: height and width of the window, and stride (the amount the window shifts each step).

We can also add padding to the edge of the image so we don’t lose data.

Same padding (use 0), valid padding (no padding)

A pooling layer compresses its input, e.g. taking \(2\times 2\) windows. Max pooling returns the highest activation in each window.
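A sketch of max pooling with the window and stride parameters described above:

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    """Max pooling: each window contributes only its largest value."""
    oh = (X.shape[0] - size) // stride + 1
    ow = (X.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = X[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

A = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]], dtype=float)
P = max_pool(A)   # 4x4 input compressed to 2x2
```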

Vector of Locally Aggregated Descriptors (VLAD)

Other window layers


Primary capsule layers

The outputs of convolutions are scalars. However, we can also create vectors if we associate some convolutions with each other.

E.g. if we have \(6\) convolutions, their outputs can be used to create a \(6\)-dimensional vector for each window.

Normalisation in primary capsule layers (vector squishing)

We can normalise the length of these vectors to between \(0\) and \(1\).

The output represents the chance of finding the feature the capsule is looking for, and its orientation.

If the vector length is low, the feature was not found; if high, it was found.

We get the orientation from the vector, and the position from the window.
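A sketch of this vector squashing. The specific function \(v=\dfrac{\|s\|^2}{1+\|s\|^2}\dfrac{s}{\|s\|}\) is the one from the original capsule-network work, which is an assumption here since the notes do not give a formula:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink a capsule vector so its length lies in (0, 1)
    while keeping its orientation unchanged."""
    norm2 = np.sum(s ** 2)
    norm = np.sqrt(norm2) + eps   # eps avoids division by zero
    return (norm2 / (1.0 + norm2)) * (s / norm)

long_vec = squash(np.array([3.0, 4.0]))    # long input: feature found
short_vec = squash(np.array([0.1, 0.0]))   # short input: feature not found
```

Long vectors are squashed to length just below \(1\), short vectors to near \(0\), and the direction (orientation) is preserved.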

Routing capsule layers

We now have a layer of position and orientation of basic shapes (triangles, rectangles etc)

We want to know which more complex thing they are part of.

So the output of this step is again a matrix with position and orientation, but of more complex features

To determine the activation from each basic shape to the next feature we use routing-by-agreement.

This takes each basic shape and works out what it would look like if the complex feature was present.

If a complex feature contains two basic shapes, both will predict the same complex shape. Otherwise the relationship is spurious and their predictions will not agree.

If they agree, we assign a high weight.

This process is complex and computationally expensive.

However, we no longer need pooling layers.

CapsNet

It does a normal convolution first, then a primary capsule layer, then a routing capsule layer.

CapsNet: reconstruction

We have a vector space of feature position and orientation, so we can recreate the output.



Pre-training: e.g. train on general pictures before the specific task. This means many parameters, such as those for detecting edges, are fitted first.

Learning rate

Reduce the learning rate.

Resetting parameters

Replace last layer (softmax) for new problem.

Freeze layers

Freeze feature learning of early layers.


Representational sparsity

This is where the values in nodes are often \(0\), as opposed to just the parameters.

Catastrophic interference

Attention and Neural Turing Machines

Probabilistic neural networks


Input layer

The input is the feature vector.

Pattern layer

The first layer has a node for each sample in the training set.

In each node, the value is a measure of the distance from the input to that stored sample.

This can be calculated using a Gaussian kernel, or another method.

Summation layer

One neuron for each category.

We map from the pattern layer to the summation layer according to the actual label of each training item.

I.e., if a training sample is labelled red, it feeds only into the red neuron.

The values are summed.

The largest value is selected.
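The three layers can be sketched together. The Gaussian kernel width `sigma` and the toy data are illustrative assumptions:

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=1.0):
    """Probabilistic neural network sketch.
    Pattern layer: one node per training sample, holding a Gaussian
    similarity between the input and that sample.
    Summation layer: one neuron per class, summing its samples' values.
    Output: the class with the largest sum."""
    d2 = np.sum((X_train - x) ** 2, axis=1)        # squared distances
    pattern = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian kernel
    classes = np.unique(y_train)
    sums = np.array([pattern[y_train == c].sum() for c in classes])
    return classes[np.argmax(sums)]

# Illustrative 2-class toy data: two clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
label = pnn_classify(np.array([0.2, 0.1]), X_train, y_train)
```

Note there is no iterative training at all: the network is built directly from the training set.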