Generalised linear models

Introduction

Link/activation functions

Estimating parameters

Delta rule

Introduction

We want to train the parameters \(\theta \).

We can do this with gradient descent, by working out how the loss function changes as we change each parameter, and then moving each parameter in the direction that reduces the loss.

The delta rule tells us how to do this.

The loss function

The error of the network is:

\(E=\sum_j\dfrac{1}{2}(y_j-a_j)^2\)

We know that \(a_j=a(z_j)\), where \(z_j=\theta x_j\), and so:

\(E=\sum_j\dfrac{1}{2}(y_j-a(\theta x_j))^2\)

Minimising loss

Using the chain rule, we can see how the error changes as we change each parameter:

\(\dfrac{\partial E}{\partial \theta_i }=\sum_j \dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial z_j}\dfrac{\partial z_j}{\partial \theta_i}\)

\(\dfrac{\partial E}{\partial \theta_i }=-\sum_j(y_j-a_j)a'(z_j)x_{ij}\)

Delta

We define delta for each example as:

\(\delta_j=-\dfrac{\partial E}{\partial z_j}=(y_j-a_j)a'(z_j)\)

So:

\(\dfrac{\partial E}{\partial \theta_i }=-\sum_j\delta_j x_{ij}\)

The delta rule

We update the parameters by moving against the gradient, where \(\alpha \) is the learning rate:

\(\Delta \theta_i=\alpha \sum_j\delta_j x_{ij}\)
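
As a concrete illustration, here is a minimal NumPy sketch of the delta rule for a single-output model; the sigmoid activation, learning rate and toy data are illustrative assumptions rather than anything fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(theta, X, y, lr=0.1):
    """One delta-rule update for a single-layer model with a sigmoid activation."""
    z = X @ theta                      # pre-activations z_j = theta . x_j
    a = sigmoid(z)                     # outputs a_j = a(z_j)
    delta = (y - a) * a * (1.0 - a)    # delta_j = (y_j - a_j) a'(z_j), sigmoid derivative
    return theta + lr * (X.T @ delta)  # Delta theta_i = alpha * sum_j delta_j x_ij

# Illustrative usage on random data (assumed, not from the notes)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
theta = np.zeros(3)
for _ in range(200):
    theta = delta_rule_step(theta, X, y)
print(theta)
```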

Maximum likelihood

Link/activation functions: Regression

The function

\(a(z)=z\)

The derivative

\(a'(z)=1\)

Notes

This is the same as ordinary linear regression.

Absolute value rectification

\(a(z)=|z|\)

Rectified Linear Unit (ReLU)

The function

\(a(z)=\max (0,z)\)

The derivative

Its derivative is \(1\) for \(z>0\), and \(0\) for \(z<0\).

The derivative is undefined at \(z=0\); in practice this rarely matters, and implementations conventionally use \(0\) or \(1\) there.

Notes

The ReLU activation function induces sparsity, because many units output exactly zero.

Noisy ReLU

Leaky ReLU

Parametric ReLU

Softplus

The function

\(a(z)=\ln (1+e^z)\)

The derivative

Its derivative is the sigmoid function:

\(a'(z)=\dfrac{1}{1+e^{-z}}\)

Notes

The softplus function is a smooth approximation of the ReLU function.

Unlike the ReLU function, softplus does not induce sparsity, because its output is never exactly zero.
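
A small NumPy sketch (illustrative, not prescribed by these notes) comparing ReLU and softplus, and checking numerically that the softplus derivative is the sigmoid:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softplus(z):
    return np.log1p(np.exp(z))   # ln(1 + e^z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
print(relu(z))                   # exactly zero for z <= 0, hence sparsity
print(softplus(z))               # smooth and strictly positive everywhere

# Numerical derivative of softplus matches the sigmoid
h = 1e-6
numeric = (softplus(z + h) - softplus(z - h)) / (2 * h)
print(np.allclose(numeric, sigmoid(z), atol=1e-5))   # True
```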

Exponential Linear Unit (ELU)

Link/activation functions: Classification

The binomial data generating process

Introduction

For linear regression our data generating process is:

\(y=\alpha + \beta x +\epsilon \)

For linear classification our data generating process is:

\(z=\alpha + \beta x +\epsilon \)

And we set \(y\) to \(1\) if \(z>0\), and to \(0\) otherwise.

Or:

\(y=\mathbf I[\alpha+\beta x+\epsilon >0]\)

Probability of each class

The probability that an individual with characteristics \(x\) is classified as \(1\) is:

\(P_1=P(y=1|x)\)

\(P_1=P(\alpha + \beta x+\epsilon >0)\)

\(P_1=\int \mathbf I [\alpha + \beta x+\epsilon >0]f(\epsilon )d\epsilon \)

\(P_1=\int \mathbf I [\epsilon >-\alpha-\beta x ]f(\epsilon )d\epsilon \)

\(P_1=\int_{\epsilon=-\alpha-\beta x}^\infty f(\epsilon )d\epsilon \)

\(P_1=1-F(-\alpha-\beta x) \)

Example: The logistic function

Depending on the probability distribution of \(\epsilon \) we get different classifiers.

If \(\epsilon \) follows the logistic distribution we have:

\(P(y=1|x)=\dfrac{e^{\alpha + \beta x}}{1+e^{\alpha + \beta x}}\)
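
A short sketch, with assumed values for \(\alpha \), \(\beta \) and \(x\), showing that simulating the latent-variable process with logistic noise reproduces the closed-form probability above:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, x = -1.0, 2.0, 0.7      # assumed illustrative values

# Closed form: P(y=1|x) = e^(alpha + beta x) / (1 + e^(alpha + beta x))
v = alpha + beta * x
p_closed = np.exp(v) / (1.0 + np.exp(v))

# Simulation: y = I[alpha + beta x + epsilon > 0] with logistic epsilon
eps = rng.logistic(size=1_000_000)
p_sim = np.mean(v + eps > 0)

print(p_closed, p_sim)               # the two should agree closely
```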

Perceptron (step function)

The function

If \(z\) is above \(0\), \(a(z)=1\). Otherwise, \(a(z)=0\).

The derivative

This has a derivative of \(0\) at all points except \(z=0\), where it is undefined.

Notes

This function is not smooth.

This is the activation function used in the perceptron.

The perceptron training algorithm only converges if the data are linearly separable.

Even when the data are linearly separable, the perceptron does not necessarily find the best separating boundary (for example, the maximum-margin one).

Logistic function (AKA sigmoid, logit)

The function

\(\sigma (z)=\dfrac{1}{1+e^{-z}}\)

The range of this activation is between \(0\) and \(1\).

The derivative

\(\sigma '(z)=\dfrac{e^{-z}}{(1+e^{-z})^2}\)

\(\sigma '(z)=\sigma (z)\dfrac{1+e^{-z}-1}{1+e^{-z}}\)

\(\sigma '(z)=\sigma (z)[1-\sigma (z)]\)

Notes

Probability unit (probit)

The function

The cumulative distribution function of the standard normal distribution:

\(\Phi (z)\)

The derivative

The probability density function of the standard normal distribution:

\(\Phi'(z)=\phi (z)\)

tanH

ArcTan

Radial Basis Function (RBF) activation function

\(a(x)=\sum_i a_i f(||x-c_i||)\)

For example, with Gaussian basis functions: \(a(x)=\sum_i a_i e^{-\|x-c_i\|^2}\)
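
A minimal sketch of a Gaussian RBF expansion; the centres, weights and bandwidth \(\gamma \) are illustrative assumptions:

```python
import numpy as np

def rbf(x, centres, weights, gamma=1.0):
    """Gaussian RBF expansion: sum_i a_i * exp(-gamma * ||x - c_i||^2)."""
    dists_sq = np.sum((centres - x) ** 2, axis=1)
    return weights @ np.exp(-gamma * dists_sq)

centres = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])   # c_i (assumed)
weights = np.array([0.5, -0.2, 1.0])                        # a_i (assumed)
print(rbf(np.array([0.2, 0.1]), centres, weights))
```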

Multinomial classification

The multinomial data generating process

Introduction

In the binomial case we had:

\(z_i=\alpha + \beta x_i +\epsilon_i \)

And set \(y_i\) to \(1\) if \(z_i>0\)

In the multinomial case we have \(m\) alternatives

\(z_{ij}=\alpha + \beta x_{ij} +\epsilon_{ij} \)

And set \(y_{ij}=1\) if \(z_{ij}>z_{ik}\forall k\ne j\)

Generalised version

We can rewrite this as:

\(z_{ij}=v_{ij} +\epsilon_{ij} \)

Where:

\(v_{ij}=\alpha+\beta x_{ij}\)

In this case the coefficients \(\alpha \) and \(\beta \) do not depend on \(j\), but in other formulations they could.

Probabilities

\(P_{ij}=P(y_{ij}=1|x_{ij})\)

\(P_{ij}=P(z_{ij}>z_{ik}\forall k\ne j)\)

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

The form of the multinomial model: Intercepts

Previously we described the multinomial model

\(z_{ij}=v_{ij} +\epsilon_{ij} \)

Where:

\(v_{ij}=\alpha+\beta x_{ij}\)

The probability of \(j\) being chosen is:

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

Intercepts in \(v\) cancel out, because only the differences \(v_{ij}-v_{ik}\) enter the probability. Therefore in the basic model there is no need to use:

\(v_{ij}=\alpha+\beta x_{ij}\)

We can instead use:

\(v_{ij}=\beta x_{ij}\)

The form of the multinomial model: Conditional model

We have:

\(v_{ij}=\beta x_{ij}\)

What do we include in \(x_{ij}\)?

We can include observable characteristics for each product:

\(v_{ij}=\alpha_j + \beta x_j\)

One of the \(\alpha_j\) must be normalised to \(0\), as only differences matter. We cannot tell the difference if all \(\alpha \) are raised by the same amount.

For consistency with other models we can write this as:

\(v_{ij}=\beta x_{ij}\)

Even though this does not vary from individual to individual.

Here \(\beta \) represents average preferences for each product characteristic.

The form of the multinomial model: The multinomial model

We have differing characteristics for each individual:

\(v_{ij}=\beta x_i\)

However this gives the same value for every alternative, so it cannot discriminate between them. For it to discriminate we need coefficients that vary by alternative:

\(v_{ij}=\beta_j x_i\)

As we only observe differences, one of the \(\beta_j\) must be normalised to \(0\).

We can rewrite this.

\(v_{ij}=\sum_k \beta_k\delta_{kj} x_i\)

\(v_{ij}=\beta z_{ij}\)

The original \(x_i\) is dense and contains data about the individual.

\(z_{ij}\) is sparse, with non-zero entries only in the block corresponding to alternative \(j\).

Here \(\beta \) represents how preferences change as individual characteristics change.
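
A small sketch of how the dense individual vector \(x_i\) can be placed into the sparse \(z_{ij}\); the feature values and dimensions are illustrative assumptions:

```python
import numpy as np

def make_z(x_i, j, n_alternatives):
    """Place the individual's features x_i in the block for alternative j; zeros elsewhere."""
    k = len(x_i)
    z = np.zeros(n_alternatives * k)
    z[j * k:(j + 1) * k] = x_i
    return z

x_i = np.array([35.0, 1.0])   # e.g. age and an income indicator (assumed)
print(make_z(x_i, j=1, n_alternatives=3))
# [ 0.  0. 35.  1.  0.  0.]  -> beta . z_ij picks out beta_j . x_i for alternative j
```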

The form of the multinomial model: Combined multinomial and conditional model

If we have observations of the characteristics of both individuals and alternatives we can write:

\(v_{ij}=\beta_m m_{ij}+\beta_cc_{ij}\)

\(v_{ij}=\beta x_{ij}\)

Here \(\beta \) represents both:

  • Average preferences for product characteristics (conditional)

  • How preferences change as individual characteristics change (multinomial)

Extreme IID multinomial

IID

The probability of \(j\) being chosen is:

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

If these are independent then we have:

\(P_{ij}=\prod_{k\ne j} P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij})\)

\(P_{ij}=\prod_{k\ne j} F_\epsilon (v_{ij} -v_{ik} +\epsilon_{ij})\)

We do not know \(\epsilon_{ij}\) so we have to integrate over possibilities.

\(P_{ij}=\int [\prod_{k\ne j} F_\epsilon (v_{ij} -v_{ik} +\epsilon_{ij})]f_\epsilon(\epsilon_{ij})d\epsilon_{ij}\)

Extreme values

We have:

\(P_{ij}=\int [\prod_{k\ne j} F_\epsilon (v_{ij} -v_{ik} +\epsilon_{ij})]f_\epsilon(\epsilon_{ij})d\epsilon_{ij}\)

If \(\epsilon \) follows the Type-I extreme value (Gumbel) distribution this gives us:

\(P_{ij}=\dfrac{e^{v_{ij}}}{\sum_k e^{v_{ik}}}\)
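
A minimal sketch of these choice probabilities as a softmax over the \(v_{ij}\); the utility values are illustrative:

```python
import numpy as np

def logit_probs(v):
    """Multinomial logit probabilities P_ij = exp(v_ij) / sum_k exp(v_ik)."""
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

v = np.array([1.0, 0.5, -0.2])  # assumed representative utilities v_ij
print(logit_probs(v))
```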

Independence of irrelevant alternatives

Consider the ratio of two probabilities:

\(\dfrac{P_{ij}}{P_{im}}=\dfrac{e^{v_{ij}}}{e^{v_{im}}}\)

This means that changes to any other products do not affect relative odds.

This can be undesirable, because in reality substitution between alternatives can be unbalanced when one option is removed or made less attractive.

For example, raising the price of bus travel may push commuters towards trains far more than towards helicopters, a pattern this model cannot capture.
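
A small numerical check, with assumed utilities, that dropping an alternative leaves the ratio of the remaining logit probabilities unchanged:

```python
import numpy as np

def logit_probs(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

v_three = np.array([1.0, 0.5, -0.2])   # bus, train, helicopter (assumed utilities)
v_two = v_three[:2]                     # helicopter removed

p3 = logit_probs(v_three)
p2 = logit_probs(v_two)
print(p3[0] / p3[1], p2[0] / p2[1])     # identical ratios: exp(v_0 - v_1) in both cases
```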

Estimating multinomial logit models

Nested logit

The probability of \(j\) being chosen is:

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

If the error terms are not IID this is more difficult to calculate.

We divide the \(J\) alternatives into nests. Within each of these we assume IID error terms, but allow variation between them.

For example we could have nests for public and private transport, or nests for types of product, with the firms offering each product inside them.

The nested logit model applies two or more sequential IID logit models: one to select the nest, and another to select the alternative within the nest.
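
A sketch of one common nested logit parameterisation, using inclusive values and a dissimilarity parameter \(\lambda_k\) per nest; this particular formulation and the example values are assumptions, not something these notes specify:

```python
import numpy as np

def nested_logit_probs(v, nests, lam):
    """v: utilities; nests: list of index arrays; lam: one dissimilarity parameter per nest."""
    probs = np.zeros(len(v))
    # Inclusive value of each nest: IV_k = lam_k * log(sum_{m in nest k} exp(v_m / lam_k))
    iv = np.array([l * np.log(np.sum(np.exp(v[idx] / l))) for idx, l in zip(nests, lam)])
    p_nest = np.exp(iv) / np.sum(np.exp(iv))                       # first stage: choose a nest
    for idx, l, pn in zip(nests, lam, p_nest):
        within = np.exp(v[idx] / l) / np.sum(np.exp(v[idx] / l))   # second stage: within-nest
        probs[idx] = pn * within
    return probs

v = np.array([1.0, 0.8, 0.2, 0.1])            # assumed utilities
nests = [np.array([0, 1]), np.array([2, 3])]  # e.g. public vs private transport
print(nested_logit_probs(v, nests, lam=[0.5, 0.5]))
```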

Mixed logit (random coefficients)

Introduction

In our standard model we have:

\(z_{ij}=\beta x_{ij} +\epsilon_{ij} \)

If we allow the parameters to vary for each individual we have:

\(z_{ij}=\beta_i x_{ij} +\epsilon_{ij} \)

The probability of choosing \(j\) now depends on the distribution of \(\beta \).

In the IID case we had:

\(P_{ij}=\dfrac{e^{\beta x_{ij}}}{\sum_k e^{\beta x_{ik}}}\)

Rather than evaluate this at a single point \(\beta \) we integrate.

\(P_{ij}=\int \dfrac{e^{\beta x_{ij}}}{\sum_k e^{\beta x_{ik}}}f(\beta )d\beta \)

If the distribution of \(\beta \) is degenerate (a point mass) this reduces to the standard logit model.
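
A sketch of approximating the mixed logit integral by simulation, averaging logit probabilities over draws of \(\beta \); the normal mixing distribution and its parameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_probs(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

x = np.array([1.0, 0.5, -0.2])                          # one characteristic per alternative (assumed)
beta_draws = rng.normal(loc=1.0, scale=0.5, size=5000)  # assumed mixing distribution of beta

# P_ij is approximated by (1/R) * sum_r softmax(beta_r * x)_j
p = np.mean([logit_probs(b * x) for b in beta_draws], axis=0)
print(p)
```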

Multinomial probit

This relaxes the IID and extreme value assumptions.

The errors are instead jointly normal, with a general variance-covariance matrix.
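
A sketch of simulating multinomial probit choice probabilities with correlated normal errors; the covariance matrix and utilities are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

v = np.array([1.0, 0.5, -0.2])           # representative utilities (assumed)
cov = np.array([[1.0, 0.5, 0.0],         # assumed error variance-covariance matrix
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

# Simulate z_ij = v_ij + eps_ij with jointly normal errors and count how often each alternative wins
eps = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=200_000)
choices = np.argmax(v + eps, axis=1)
print(np.bincount(choices, minlength=3) / len(choices))   # simulated choice probabilities
```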