In the perceptron we have input vector \(\mathbf{x}\) and output:
\(a=f(\mathbf{w}\cdot\mathbf{x})=f\left(\sum_{i=1}^{n} w_i x_i\right)\)
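A minimal sketch of this computation in NumPy (the step activation and the particular values of \(\mathbf{w}\) and \(\mathbf{x}\) are assumptions, purely for illustration):

```python
import numpy as np

def step(z):
    # Example activation f: a step function (any activation could be substituted)
    return np.where(z >= 0, 1, 0)

x = np.array([1.0, 0.5, -0.3])   # illustrative input vector
w = np.array([0.2, -0.4, 0.7])   # illustrative weight vector

# a = f(w . x) = f(sum_i w_i * x_i)
a = step(np.dot(w, x))
print(a)
```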
We can augment the perceptron by adding a hidden layer.
Now the output of the activation function becomes the input to a second layer. By using a different set of weights for each hidden unit, we obtain a vector of values that is fed to the second layer.
\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\).
In a \(2\)-layer perceptron we have \(\Theta^0\) and \(\Theta^1\).
If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:
The dimension of \(\Theta^0\) is \((n+1) \times s\)
The dimension of \(\Theta^1\) is \((s+1) \times k\)
These dimensions include the offset (bias) term for each layer, i.e. a constant unit prepended to the input of each layer.
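As a quick sketch of these shapes (with illustrative sizes \(n=4\), \(s=3\), \(k=2\)):

```python
import numpy as np

n, s, k = 4, 3, 2                  # illustrative sizes: features, hidden units, classes
Theta0 = np.zeros((n + 1, s))      # maps input layer (plus bias unit) to hidden layer
Theta1 = np.zeros((s + 1, k))      # maps hidden layer (plus bias unit) to output layer
print(Theta0.shape, Theta1.shape)  # (5, 3) (4, 2)
```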
For a perceptron we had:
\(a=f(\mathbf{w}\cdot\mathbf{x})=f\left(\sum_{i=1}^{n} w_i x_i\right)\).
Now we have:
\(a_i^1=f\bigl((\mathbf{x}\,\Theta^0)_i\bigr)=f\left(\sum_{m=0}^{n} x_m\,\Theta_{mi}^{0}\right)\)
\(a_i^2=f\bigl((\mathbf{a}^1\Theta^1)_i\bigr)=f\left(\sum_{m=0}^{s} a_m^1\,\Theta_{mi}^{1}\right)\)
where \(x_0=1\) and \(a_0^1=1\) are the bias units.
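A minimal forward-pass sketch for the 2-layer case (a sigmoid activation and random weights are assumptions, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, s, k = 4, 3, 2
Theta0 = rng.normal(size=(n + 1, s))   # (n+1) x s
Theta1 = rng.normal(size=(s + 1, k))   # (s+1) x k
x = rng.normal(size=n)                 # example input

# Layer 1: prepend the bias unit, then a^1 = f(x Theta^0)
a1 = sigmoid(np.concatenate(([1.0], x)) @ Theta0)

# Layer 2: prepend the bias unit, then a^2 = f(a^1 Theta^1)
a2 = sigmoid(np.concatenate(([1.0], a1)) @ Theta1)
print(a2)   # k-dimensional output
```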
For additional layers this is:
\(a_i^j=f\bigl((\mathbf{a}^{j-1}\Theta^{j-1})_i\bigr)=f\left(\sum_{m=0}^{s} a_m^{j-1}\,\Theta_{mi}^{j-1}\right)\), where \(s\) is the number of units in layer \(j-1\).
We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).
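For an arbitrary number of layers, the same update can be written as a loop over the weight matrices (a sketch under the same illustrative assumptions as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    # a^j = f(a^{j-1} Theta^{j-1}), with a bias unit prepended at every layer
    a = x
    for Theta in Thetas:
        a = sigmoid(np.concatenate(([1.0], a)) @ Theta)
    return a

rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(5, 3)), rng.normal(size=(4, 2))]  # (n+1) x s, (s+1) x k
print(forward(rng.normal(size=4), Thetas))
```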
If the output needs to be unbounded, we cannot apply a sigmoid function at the last step. Alternatively, a sigmoid function can be applied to the unbounded output to make it bounded (the sigmoid maps any real value into \((0,1)\)).
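A small sketch of the two options (the example values of the final pre-activation are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.5, 7.2])   # example unbounded outputs of the last layer
print(z)             # leave the last step linear: the output stays unbounded
print(sigmoid(z))    # apply a sigmoid: the output is squashed into (0, 1)
```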