In the perceptron we have input vector \(x\), activation function \(f\), and output:

\(a=f(wx)\)

We can augment the perceptron by adding a hidden layer.

Now the output of the activation function is an input to a second layer. By using different weights, we can create a vector of inputs to this second layer.

\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\). So we have \(\Theta^1\) and \(\Theta^2\).

If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:

The dimension of \(\Theta^1\) is \((n+1) \times s\)

The dimension of \(\Theta^2\) is \((s+1) \times k\)

These include the offsets for each layer.
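As a quick shape check, here is a minimal forward pass in numpy; the sizes, the sigmoid activation, and the row-vector convention \(a\,\Theta\) are illustrative assumptions:

```python
import numpy as np

n, s, k = 4, 5, 3                     # features, hidden units, classes
rng = np.random.default_rng(0)

# Weight matrices; the extra row holds the offsets for each layer.
Theta1 = rng.normal(size=(n + 1, s))  # maps layer 1 -> layer 2
Theta2 = rng.normal(size=(s + 1, k))  # maps layer 2 -> layer 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=n)                # one input vector
a1 = np.concatenate(([1.0], x))       # add the offset unit
a2 = sigmoid(a1 @ Theta1)             # hidden activations, shape (s,)
a2 = np.concatenate(([1.0], a2))      # add the offset unit again
a3 = sigmoid(a2 @ Theta2)             # outputs, shape (k,)
```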

For a perceptron we had \(a=f(wx)\). Now we have:

\(a^{j}=f(a^{j-1}\Theta^{j-1})\)

We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).

We start by randomly initialising each \(\theta\).

We do this to break symmetry: if all weights started equal, every neuron would compute the same value and receive the same update, so they would move in sync.

To arrive at the delta rule we considered the cost function:

\(E=\sum_j\dfrac{1}{2}(y_j-a_j)^2\)

And used the chain rule:

\(\dfrac{\partial E}{\partial \theta_i }=\dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial z_j}\dfrac{\partial z_j}{\partial \theta_i}\)

This gave us:

\(\Delta \theta_i=\alpha \sum_j(y_j-a_j)a'(z_j)x_{ij}\)

Or, setting \(\delta_j=-\dfrac{\partial E}{\partial z_j}=(y_j-a_j)a'(z_j)\):

\(\Delta \theta_i=\alpha \sum_j\delta_j x_{ij}\)
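A minimal sketch of this update rule, assuming a sigmoid activation and a toy AND dataset (both illustrative; \(j\) indexes the training samples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: rows of X are samples x_j (the leading 1 is the offset input),
# y holds the targets -- here the AND function.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])
theta = np.zeros(3)
alpha = 0.5

for _ in range(10000):
    a = sigmoid(X @ theta)           # a_j for every sample
    delta = (y - a) * a * (1 - a)    # delta_j = (y_j - a_j) a'(z_j)
    theta += alpha * (X.T @ delta)   # Delta theta_i = alpha sum_j delta_j x_ij

predictions = sigmoid(X @ theta)
```

After training, the (1,1) sample should receive the highest output.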

Let’s update the rule for multiple layers:

\(\dfrac{\partial E}{\partial \theta_{lji}}=\dfrac{\partial E}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\dfrac{\partial z_{lj}}{\partial \theta_{lji}}\)

Previously \(\dfrac{\partial z_{j}}{\partial \theta_{ji}}=x_i\). We now use the more general \(a_{(l-1)i}\), the activation of unit \(i\) in the previous layer. For the first hidden layer, these are the same.

We can then instead write:

\(\Delta \theta_{lji}=\alpha\,\delta_{lj}\,a_{(l-1)i}\)

Now we need a way of calculating the value of \(\delta_{lj}\) for all neurons.

\(\delta_{lj}=-\dfrac{\partial E}{\partial z_{lj}}\)

If this is an output node, then this is simply \((y_j-a_j)a'(z_{lj})\).

If this is not an output node, then a change in \(z_{lj}\) affects the error through all of the neurons it feeds into.

In this case:

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in succ(j)}\dfrac{\partial E}{\partial z_{(l+1)k}}\dfrac{\partial z_{(l+1)k}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in succ(j)}-\delta_{(l+1)k}\dfrac{\partial z_{(l+1)k}}{\partial a_{lj}}\dfrac{\partial a_{lj}}{\partial z_{lj}}\)

\(\dfrac{\partial E}{\partial z_{lj}}=\sum_{k\in succ(j)}-\delta_{(l+1)k}\,\theta_{(l+1)kj}\,a'(z_{lj})\)

\(\delta_{lj}=a'(z_{lj})\sum_{k\in succ(j)}\delta_{(l+1)k}\,\theta_{(l+1)kj}\)
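The recursion above can be sketched for a tiny two-layer network (numpy; the sigmoid activations, random weights, and omitted offsets are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Tiny network: 3 inputs -> 4 hidden -> 2 outputs (offsets omitted for brevity).
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(4, 2))
x = rng.normal(size=3)
y = np.array([1.0, 0.0])

# Forward pass.
z2 = x @ Theta1
a2 = sigmoid(z2)
z3 = a2 @ Theta2
a3 = sigmoid(z3)

# Output layer: delta_j = (y_j - a_j) a'(z_j).
delta3 = (y - a3) * a3 * (1 - a3)
# Hidden layer: delta_j = a'(z_j) * sum_k delta_k theta_kj.
delta2 = (a2 * (1 - a2)) * (Theta2 @ delta3)

# Negative gradients: Delta theta = delta (output side) * a (input side).
grad2 = np.outer(a2, delta3)
grad1 = np.outer(x, delta2)
```

The gradients can be checked against a finite-difference estimate of \(-\partial E/\partial\theta\).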

For each layer there is a matrix whose rows and columns hold the \(\theta\) values between the current layer and the next. We have one such matrix for each layer in the network.

A hard max is not smooth, so gradient-based training can skip over the actual maximum; softmax is a smooth alternative.

The softmax function is often used in the last layer of a classification network.

It takes a vector of dimension \(k\) and returns another vector of the same size, but now all values lie between \(0\) and \(1\) and sum to \(1\).

The softmax function generalises the sigmoid function to multiple classes.

\(a_j(z)=\dfrac{e^{z_j}}{\sum_{i}e^{z_i}}\)
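A minimal implementation (subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtract the max so the exponentials cannot overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)   # all values in (0, 1), summing to 1
```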

A hard max, by contrast, outputs \(0\) for every node in a layer except the greatest.

A single max node simply outputs the maximum of all its inputs.

With additional hidden layers we can map more complex functions.

These allow the effective combination of logic gates.

With only two hidden layers we can map any classification function, including discontinuous functions.

The error function for neural networks is generally not convex, so gradient descent may only find a local minimum.

Vanishing gradients: gradients can become small, so propagation through early layers can be very slow.

Exploding gradients: gradients can become too large, and training does not converge.

Together these are known as the unstable gradient problem.

Pruning: parameters are set to \(0\) and not trained.

Weight sharing: parameters share the same value and are trained together.

Weight decay: after each update, multiply the parameter by \(p<1\).

Adversarial examples: small, targeted changes to the input can produce any desired classification.

In a node we have:

\(a_{ij}=\sigma_{ij}(W_{ij}a_{i-1})\)

That is, the value of a node is the activation function applied to the weighted sum of the previous layer's outputs.

Residual blocks, however, look further back than one layer. They add in the full output of an earlier layer \(k\) (without weights):

\(a_{ij}=\sigma_{ij}(W_{ij}a_{i-1}+a_k)\)
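A sketch of a residual block under these definitions (numpy; the sigmoid activations and layer sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d = 4
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def residual_block(a_k):
    """Two weighted layers, then add the unweighted input a_k back in."""
    a_mid = sigmoid(W1 @ a_k)
    return sigmoid(W2 @ a_mid + a_k)   # skip connection: + a_k, no weights

a_in = rng.normal(size=d)
a_out = residual_block(a_in)
```

The skip connection requires the two layers to have the same dimension (or a projection to match them).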

We normalise the input layer.

This speeds up training, and makes regularisation saner.

We can normalise other layers too: for each input, subtract the batch mean and divide by the batch standard deviation.
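A minimal sketch of that normalisation step (numpy; the small epsilon guards against division by zero and is a standard addition):

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    """X has shape (batch, features): subtract the batch mean of each
    feature and divide by its batch standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Xn = batch_norm(X)   # each column now has mean 0 and std ~1
```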

Batch normalisation can make networks better adapted for related problems.

This is a method for both building and training.

We start with a bare bones network. We then add nodes one by one, training and then fixing their values.

This is an alternative to backpropagation for training a feedforward neural network.

We start with random parameters for each layer \(W_i\).

We have:

\(\hat y=W_2\sigma (W_1 x)\)

Etc.

We calculate:

\(W_2=(\sigma(W_1x))^+Y\), where \(^+\) denotes the Moore-Penrose pseudoinverse.

So \(W_1\) is random and not updated.

\(W_2\) is assigned to minimise loss, where \(W_2\) has no activation function.
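A sketch of the whole procedure, assuming numpy, a sigmoid hidden activation, and an illustrative toy regression problem; `np.linalg.pinv` computes the pseudoinverse \(^+\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy regression data (illustrative): a noisy linear target.
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

W1 = rng.normal(size=(3, 50))    # random, never updated
H = sigmoid(X @ W1)              # hidden activations sigma(W1 x)
W2 = np.linalg.pinv(H) @ Y       # W2 = sigma(W1 X)^+ Y, minimises squared loss

y_hat = H @ W2                   # no activation on the output layer
```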

We can connect each node in the first hidden layer to a subset of the input layer, e.g. one node for each 5x5 block of pixels.

We also share weights across the nodes of the first layer. This gives far fewer parameters, and the layer can still learn useful features.

This also uses windows. Instead of taking the max, we multiply the window by a matrix elementwise and sum the values.

Each matrix can represent some feature, like a curve.

We can use multiple convolution matrices to create multiple output matrices.

These matrices are called kernels. They start off random and are trained.
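A minimal sketch of such a convolution (numpy; the image, the kernel, and the "valid"-style output size are illustrative assumptions):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; elementwise multiply and sum."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])        # a simple diagonal-difference kernel
feature_map = conv2d(image, kernel)     # shape (3, 3)
```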

We split the data up every time we use convolutional layers.

Flattening layers bring them all back together

The parameters are those of the pooling layers (height, width, stride, padding), but also the set of convolution kernels.

We use different window sizes in parallel.

The input is a matrix. We place a number of windows on the input matrix. The max of each window is an input to the next layer.

This means fewer parameters, easier computation, and less chance of overfitting.

Parameters: height and width of the window, and stride (the amount the window shifts each step).

We can also add padding to the edge of the image so we don’t lose data.

Same padding (pad with \(0\)s so the output keeps the input size); valid padding (no padding).

The pooling layer compresses, e.g. taking 2x2 windows. Max pooling returns the highest activation.
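A minimal max-pooling sketch under these parameters (numpy; the input values are illustrative):

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Return the max of each size x size window, moving by `stride`."""
    ih, iw = image.shape
    oh = (ih - size) // stride + 1
    ow = (iw - size) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            window = image[r * stride:r * stride + size,
                           c * stride:c * stride + size]
            out[r, c] = window.max()
    return out

image = np.array([[1.0, 2.0, 5.0, 6.0],
                  [3.0, 4.0, 7.0, 8.0],
                  [9.0, 1.0, 2.0, 1.0],
                  [5.0, 6.0, 3.0, 4.0]])
pooled = max_pool(image)   # 2x2 output, one max per window
```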

The outputs of convolutions are scalars. However, we can also create vectors if we associate some convolutions with each other.

E.g. if we have 6 convolutions, their outputs can be combined into a 6-dimensional vector for each window.

We can normalise the length of these vectors to lie between \(0\) and \(1\).

The output then represents the chance of finding the feature the convolutions are looking for, together with its orientation.

If the vector length is low, the feature was not found; if it is high, the feature was found.

We get orientation from the vector, and position from the window.
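One common choice for this normalisation is the "squashing" function from capsule networks, which keeps a vector's direction but maps its length into \((0,1)\); a sketch (numpy; the example vectors are illustrative):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Scale s so its length lies in (0, 1) while keeping its direction."""
    norm = np.linalg.norm(s)
    scale = norm ** 2 / (1.0 + norm ** 2)
    return scale * s / (norm + eps)

short = squash(np.array([0.1, 0.0]))   # weak evidence -> length near 0
long_ = squash(np.array([10.0, 0.0]))  # strong evidence -> length near 1
```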

We now have a layer of position and orientation of basic shapes (triangles, rectangles etc)

We want to know which more complex thing they are part of.

So the output of this step is again a matrix with position and orientation, but of more complex features

To determine the activation from each basic shape to the next feature we use routing-by-agreement.

This takes each basic shape and works out what it would look like if the complex feature was present.

If a complex feature is made of two basic shapes, both will predict the same complex shape. Otherwise the relationship is spurious and they will not agree.

If they agree, we assign a high weight.

This process is complex and computationally expensive.

However, we no longer need pooling layers.

The network does normal convolutions first, then a primary capsule layer, then a secondary capsule layer.

We have a vector space of feature positions and orientations, so we can recreate the output.

Pre-training: e.g. train on general pictures before the specific task. This means many parameters, such as those for detecting edges, are fitted first.

Reduce the learning rate.

Replace last layer (softmax) for new problem.

Freeze feature learning of early layers.

This is where the values in nodes are often \(0\), as opposed to just the parameters.

The input is the feature vector.

The first layer has a node for each example in the training set.

In each node, the value is a measure of how close the input is to that node's stored example.

This can be calculated by applying a Gaussian kernel to the distance, or by another method.

One neuron for each category.

We map from the pattern layer to the summation layer according to the actual label of each training item.

I.e., if a sample is red, it will be fed only to the red neuron.

The values are summed.

Largest value is selected.
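The whole forward pass can be sketched as follows (numpy; the Gaussian width `sigma` and the toy data are illustrative assumptions):

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=0.5):
    """Probabilistic neural network forward pass (sketch).
    Pattern layer: Gaussian kernel of the distance to each training example.
    Summation layer: sum the kernel values per class; pick the largest."""
    d2 = np.sum((X_train - x) ** 2, axis=1)        # squared distances
    pattern = np.exp(-d2 / (2.0 * sigma ** 2))     # pattern-layer values
    classes = np.unique(y_train)
    sums = np.array([pattern[y_train == c].sum() for c in classes])
    return classes[np.argmax(sums)]                # largest value selected

# Two illustrative clusters, one per class.
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
pred = pnn_classify(np.array([0.05, 0.0]), X_train, y_train)
```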