Generalised linear models, the delta rule and binary classification

Introduction

Link/activation functions

Estimating parameters

Delta rule

Introduction

We want to train the parameters \(\boldsymbol{\theta }\).

We can do this with gradient descent, by working out how much the loss function changes as we change each parameter.

The delta rule tells us how to do this.

The loss function

If we have \(n\) features and \(m\) samples, the error of the network is:

\(E=\sum_j^m\dfrac{1}{2}(y_j-a_j)^2\)

We know that \(a_j=f(\boldsymbol{\theta x_j})=f(\sum_i^n\theta^i x_j^i)\) and so:

\(E=\sum_j\dfrac{1}{2}(y_j-f(\boldsymbol{\theta x_j}))^2\)

\(E=\sum_j\dfrac{1}{2}(y_j-f(\sum_i\theta^i x_j^i))^2\)
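As a concrete sketch (the arrays, numbers and the default identity \(f\) below are illustrative assumptions, not from the text), this loss can be computed as:

```python
import numpy as np

def squared_error_loss(theta, X, y, f=lambda z: z):
    """E = sum_j 0.5 * (y_j - f(theta . x_j))^2 over m samples with n features."""
    z = X @ theta              # z_j = sum_i theta^i x_j^i, shape (m,)
    a = f(z)                   # a_j = f(z_j)
    return 0.5 * np.sum((y - a) ** 2)

# Toy data: m = 3 samples, n = 2 features (the first column acts as an intercept)
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([1.0, 2.0, 4.0])
theta = np.array([0.5, 0.5])
print(squared_error_loss(theta, X, y))
```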

Minimising loss

We can see the change in error as we change the parameter:

\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-f(\boldsymbol{\theta x_j}))^2\)

\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-f(\sum_i\theta^i x_j^i))^2\)

\(\dfrac{\delta E}{\delta \theta^i }=\sum_j(y_j-f(\boldsymbol{\theta x_j}))\dfrac{\delta }{\delta \theta^i}(y_j-f(\boldsymbol{\theta x_j}))\)

\(\dfrac{\delta E}{\delta \theta^i }=\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta }{\delta \theta^i}(y_j-f(\sum_i\theta^i x_j^i))\)

\(\dfrac{\delta E}{\delta \theta^i }=\sum_j(y_j-f(\boldsymbol{\theta x_j}))\dfrac{\delta }{\delta \theta^i}(-f(\boldsymbol{\theta x_j}))\)

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta }{\delta \theta^i}f(\sum_i\theta^i x_j^i)\)

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta f(\sum_i\theta^ix_j^i)}{\delta \theta^i}\)

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta f(\sum_i\theta^ix_j^i)}{\delta \sum_i\theta^ix_j^i}\dfrac{\delta \sum_i\theta^ix_j^i}{\delta \theta^i}\)

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))f'(\sum_i\theta^ix_j^i)x_j^i\)

By defining \(z_j=\sum_i\theta^ix_j^i\) and \(a=f\) we have:

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-a(z_j))a'(z_j)x_j^i\)
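As a sanity check, a minimal sketch (assuming a sigmoid for \(f\) and made-up data, neither of which is specified in the text) comparing this formula with a finite-difference approximation of the gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    # E = sum_j 0.5 * (y_j - f(sum_i theta^i x_j^i))^2 with f = sigmoid
    a = sigmoid(X @ theta)
    return 0.5 * np.sum((y - a) ** 2)

def analytic_grad(theta, X, y):
    # dE/dtheta^i = -sum_j (y_j - f(z_j)) f'(z_j) x_j^i
    a = sigmoid(X @ theta)
    return -X.T @ ((y - a) * a * (1 - a))

def numeric_grad(theta, X, y, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (loss(theta + step, X, y) - loss(theta - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
theta = rng.normal(size=3)
print(np.allclose(analytic_grad(theta, X, y), numeric_grad(theta, X, y)))
```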

Minimising loss: a simpler alternative derivation

We can see the change in error as we change the parameter:

\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-f(\boldsymbol{\theta x_j}))^2\)

By defining \(z_j=\sum_i\theta^ix_j^i\) and \(a=f\) we have:

\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-a(z_j))^2\)

\(\dfrac{\delta E}{\delta \theta^i }=\sum_j\dfrac{\delta E}{\delta a_j}\dfrac{\delta a_j}{\delta z_j}\dfrac{\delta z_j}{\delta \theta^i}\)

\(\dfrac{\delta E}{\delta a_j}=-(y_j-a(z_j))\), \(\dfrac{\delta a_j}{\delta z_j}=a'(z_j)\) and \(\dfrac{\delta z_j}{\delta \theta^i}=x_j^i\), so:

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-a(z_j))a'(z_j)x_j^i\)

Delta

We define delta for each sample \(j\) as:

\(\delta_j=-\dfrac{\delta E}{\delta z_j}=(y_j-a_j)a'(z_j)\)

So:

\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j\delta_j x_j^i\)

The delta rule

We update the parameters using gradient descent, moving against the gradient with learning rate \(\alpha\):

\(\Delta \theta^i=\alpha \sum_j\delta_j x_j^i\)
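A minimal sketch of batch training with this update, assuming a sigmoid activation, an arbitrary learning rate and toy data (none of which are fixed by the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_fit(X, y, alpha=0.5, epochs=2000):
    """Batch delta-rule training: Delta theta^i = alpha * sum_j delta_j * x_j^i,
    with delta_j = (y_j - a(z_j)) * a'(z_j) and a = sigmoid (an assumed choice)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        z = X @ theta
        a = sigmoid(z)
        delta = (y - a) * a * (1 - a)    # delta_j for every sample
        theta += alpha * (X.T @ delta)   # apply the delta rule update
    return theta

# Toy data: the first column is a constant/bias feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = delta_rule_fit(X, y)
print(theta, sigmoid(X @ theta))   # predicted probabilities should track y
```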

Identity function

The function

\(a(z)=z\)

The derivative

\(a'(z)=1\)

Notes

This is the same as ordinary linear regression.
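To make the link explicit, substituting \(a(z)=z\) and \(a'(z)=1\) into the delta rule gives:

\(\delta_j=y_j-\sum_i\theta^ix_j^i\)

\(\Delta \theta^i=\alpha \sum_j(y_j-\sum_i\theta^ix_j^i)x_j^i\)

which is the usual gradient descent update for least squares.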

Link/activation functions: Classification

The binomial data generating process

Introduction

For linear regression our data generating process is:

\(y=\alpha + \beta x +\epsilon\)

For linear classification our data generating process is:

\(z=\alpha + \beta x +\epsilon\)

We set \(y\) to \(1\) if \(z>0\), and to \(0\) otherwise.

Or:

\(y=\mathbf I[\alpha+\beta x+\epsilon >0]\)

Probability of each class

The probability that an individual with characteristics \(x\) is classified as \(1\) is:

\(P_1=P(y=1|x)\)

\(P_1=P(\alpha + \beta x+\epsilon >0)\)

\(P_1=\int \mathbf I [\alpha + \beta x+\epsilon >0]f(\epsilon )d\epsilon\)

\(P_1=\int \mathbf I [\epsilon >-\alpha-\beta x ]f(\epsilon )d\epsilon\)

\(P_1=\int_{\epsilon=-\alpha-\beta x}^\infty f(\epsilon )d\epsilon\)

\(P_1=1-F(-\alpha-\beta x)\)

Example: The logistic function

Depending on the probability distribution of \(\epsilon\) we have different classifiers.

If \(\epsilon\) follows the standard logistic distribution, which is symmetric, then \(1-F(-\alpha-\beta x)=F(\alpha+\beta x)\) and we have:

\(P(y=1|x)=\dfrac{e^{\alpha + \beta x}}{1+e^{\alpha + \beta x}}\)
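A quick simulation sketch (the values of \(\alpha\), \(\beta\) and \(x\) are arbitrary assumptions) comparing the latent-variable process with this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_, beta_ = -1.0, 2.0        # arbitrary illustrative parameters
x = 0.8                          # a single value of the characteristic

# Latent-variable process: y = I[alpha + beta*x + eps > 0], eps ~ standard logistic
n = 200_000
eps = rng.logistic(size=n)
y = (alpha_ + beta_ * x + eps > 0).astype(float)

# Closed form: P(y=1|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))
z = alpha_ + beta_ * x
p_closed = np.exp(z) / (1 + np.exp(z))

print(y.mean(), p_closed)        # the simulated frequency should be close to p_closed
```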

Perceptron (step function)

The function

If the sum is above \(0\), \(a(z)=1\). Otherwise, \(a(z)=0\).

The derivative

This has a derivative of \(0\) at all points except \(0\), where it is undefined.

Notes

This function is not smooth.

This is the activation function used in the perceptron.

The perceptron only converges if the training data is linearly separable.

Even when the data is linearly separable, the perceptron does not necessarily find the best (for example, maximum-margin) separating boundary.

Perceptron

A perceptron is a single-node neural network. Its output is one or zero depending on whether the weighted sum of its inputs is large enough, so it performs binary classification.

If the prediction is wrong, the weights are updated, as in the sketch below.

Training only converges if the data is linearly separable, i.e. a linear boundary can completely separate the two classes.

A neural network has more layers of such nodes.

The perceptron works well when the true decision boundary is linear.
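A minimal sketch of the perceptron learning rule on toy, linearly separable data (the data and learning rate are illustrative assumptions):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, epochs=100):
    """Perceptron learning rule: predict with a step function and update the
    weights only when the prediction is wrong."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_j, y_j in zip(X, y):
            pred = 1.0 if x_j @ w > 0 else 0.0   # step activation
            if pred != y_j:
                w += alpha * (y_j - pred) * x_j  # move the boundary towards x_j
                mistakes += 1
        if mistakes == 0:                        # converged: everything separated
            break
    return w

# Linearly separable toy data; the first column is a constant/bias feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = perceptron_train(X, y)
print(w, [1.0 if x_j @ w > 0 else 0.0 for x_j in X])
```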

There are different ways to treat a node's output: leave it raw (linear), pass it through a sigmoid, or threshold it to \(0\)/\(1\).

For all of these we want the cost function to have only one minimum, as least squares does. This is not guaranteed for every choice of activation and loss.

For the logistic activation we can choose a loss that is convex in the parameters: \(-\log(f(x))\) when the correct \(y\) is \(1\), and \(-\log(1-f(x))\) when it is \(0\). This is sketched below.
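A minimal sketch of this convex cross-entropy loss and its gradient for a logistic output (the data and learning rate are illustrative assumptions; for this loss/activation pair the \(a'(z)\) factor cancels in the gradient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    """Loss = -log(f(x)) when y = 1 and -log(1 - f(x)) when y = 0,
    which is convex in theta for a logistic f."""
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(theta, X, y):
    # The sigmoid derivative cancels, leaving the gradient sum_j (p_j - y_j) x_j^i
    p = sigmoid(X @ theta)
    return X.T @ (p - y)

# Toy data; the first column is a constant/bias feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta -= 0.1 * cross_entropy_grad(theta, X, y)
print(cross_entropy_loss(theta, X, y), sigmoid(X @ theta))
```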

Logistic function (AKA sigmoid, logit)

The function

\(\sigma (z)=\dfrac{1}{1+e^{-z}}\)

The range of this activation is between \(0\) and \(1\).

The derivative

\(\sigma '(z)=\dfrac{e^{-z}}{(1+e^{-z})^2}\)

\(\sigma '(z)=\sigma (z)\dfrac{1+e^{-z}-1}{1+e^{-z}}\)

\(\sigma '(z)=\sigma (z)[1-\sigma (z)]\)
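A quick numerical check of this identity (a small sketch, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Uses the identity sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-5, 5, 11)
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference
print(np.allclose(sigmoid_prime(z), numeric))
```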

Notes

Probability unit (probit)

The function

The cumulative distribution function of the normal distribution.

\(\Phi (z)\)

The derivative

The normal distribution:

\(\Phi'(z)=\phi (z)\)
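A minimal sketch, assuming SciPy is available, using the standard normal CDF as the activation and its density as the derivative:

```python
import numpy as np
from scipy.stats import norm

def probit(z):
    return norm.cdf(z)    # activation: standard normal CDF, Phi(z)

def probit_prime(z):
    return norm.pdf(z)    # derivative: standard normal density, phi(z)

z = np.linspace(-3, 3, 7)
numeric = (probit(z + 1e-6) - probit(z - 1e-6)) / 2e-6   # central difference
print(np.allclose(probit_prime(z), numeric))
```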

tanh

arctan

Radial Basis Function (RBF) activation function

\(a(x)=\sum_i a_i f(||x-c_i||)\)

For example, with a Gaussian basis function \(f(r)=e^{-r^2}\):

\(a(x)=\sum_i a_i e^{-||x-c_i||^2}\)
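A minimal sketch of a Gaussian RBF activation (the centres, weights, input and width parameter \(\beta\) are illustrative assumptions; \(\beta=1\) recovers the formula above):

```python
import numpy as np

def rbf_activation(x, centres, weights, beta=1.0):
    """a(x) = sum_i a_i * exp(-beta * ||x - c_i||^2): a Gaussian radial basis
    expansion around fixed centres c_i."""
    dists_sq = np.sum((centres - x) ** 2, axis=1)   # ||x - c_i||^2 for each centre
    return weights @ np.exp(-beta * dists_sq)

# Illustrative centres, weights and input (all assumptions, not from the text)
centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
weights = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.5])
print(rbf_activation(x, centres, weights))
```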

Linear probability model

The probability is modelled directly as \(p=x\beta\). The fitted value can fall outside \([0,1]\).

Generalised Additive Models (GAMs)

Introduction