Generalised linear models and multiclass classification

Multinomial classification

The multinomial data generating process


In the binomial case we had:

\(z_i=\alpha + \beta x_i +\epsilon_i \)

And set \(y_i\) to \(1\) if \(z_i>0\)

In the multinomial case we have \(m\) alternatives

\(z_{ij}=\alpha + \beta x_{ij} +\epsilon_{ij} \)

And set \(y_{ij}=1\) if \(z_{ij}>z_{ik}\forall k\ne j\)
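
The data generating process above can be sketched in a few lines. This is only an illustration: the sizes, the parameter values and the choice of standard normal errors are assumptions, as the notes have not yet fixed an error distribution at this point.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 5, 3              # individuals and alternatives (illustrative sizes)
alpha, beta = 0.5, 1.0   # assumed parameter values

x = rng.normal(size=(n, m))      # one covariate x_ij per individual and alternative
eps = rng.normal(size=(n, m))    # assumed error distribution (standard normal)
z = alpha + beta * x + eps       # latent values z_ij

choice = z.argmax(axis=1)        # alternative with the largest z_ij for each i
y = np.zeros((n, m), dtype=int)
y[np.arange(n), choice] = 1      # y_ij = 1 only for the chosen alternative
```

Each row of `y` contains exactly one \(1\), in the column of the alternative with the largest latent value.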

Generalised version

We can rewrite this as:

\(z_{ij}=v_{ij} +\epsilon_{ij} \)


\(v_{ij}=\alpha+\beta x_{ij}\)

In this case the coefficients \(\alpha \) and \(\beta \) do not depend on \(j\), but in other formulations they could.



\(P_{ij}=P(z_{ij}>z_{ik}\forall k\ne j)\)

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

The form of the multinomial model: Intercepts

Previously we described the multinomial model:

\(z_{ij}=v_{ij} +\epsilon_{ij} \)


\(v_{ij}=\alpha+\beta x_{ij}\)

The probability of \(j\) being chosen is:

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

A common intercept in \(v\) cancels out of the differences \(v_{ij}-v_{ik}\). Therefore in the basic model there is no need to use

\(v_{ij}=\alpha+\beta x_{ij}\)

We can instead use:

\(v_{ij}=\beta x_{ij}\)
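
A quick numerical check of this cancellation, as a sketch with assumed parameter values and assumed standard normal errors: adding the same intercept to every alternative leaves every simulated choice unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 1000, 4
alpha, beta = 2.0, 0.7           # illustrative values only

x = rng.normal(size=(n, m))
eps = rng.normal(size=(n, m))    # assumed error distribution

# Choices with and without the common intercept alpha
choice_with = (alpha + beta * x + eps).argmax(axis=1)
choice_without = (beta * x + eps).argmax(axis=1)

same = bool((choice_with == choice_without).all())
```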

The form of the multinomial model: Conditional model

We have:

\(v_{ij}=\beta x_{ij}\)

What do we include in \(x_{ij}\)?

We can include observable characteristics for each product:

\(v_{ij}=\alpha_j + \beta x_j\)

One of the \(\alpha_j\) must be normalised to \(0\), as only differences matter: raising every \(\alpha_j\) by the same amount leaves the choice probabilities unchanged, so the levels cannot be identified.

For consistency with other models we can write this as:

\(v_{ij}=\beta x_{ij}\)

Note that here \(x_{ij}\) does not actually vary from individual to individual.

Here \(\beta \) represents average preferences for each product characteristic.

The form of the multinomial model: The multinomial model

We have differing characteristics for each individual:

\(v_{ij}=\beta x_i\)

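However, a common coefficient adds the same amount \(\beta x_i\) to every alternative, which cancels in the differences. For individual characteristics to discriminate between alternatives we need coefficients that vary by alternative.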

\(v_{ij}=\beta_j x_i\)

As we only observe differences, one of the \(\beta_j\) must be normalised to \(0\).

We can rewrite this as:

\(v_{ij}=\sum_k \beta_k\delta_{kj} x_i\)

\(v_{ij}=\beta z_{ij}\)

The original \(x_i\) is dense and contains data about the individual.

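\(z_{ij}\) is sparse and only has non-zero entries in the block belonging to alternative \(j\). Note that this \(z_{ij}\) is a stacked regressor, not the latent variable \(z_{ij}\) defined earlier.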
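
The stacking can be illustrated with a Kronecker product. This is a sketch under assumed sizes and random values; the helper name `stacked_regressor` is hypothetical and not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

m, d = 3, 2                        # alternatives, individual characteristics
x_i = rng.normal(size=d)           # dense characteristics of one individual
beta_j = rng.normal(size=(m, d))   # one coefficient vector per alternative
beta_j[0] = 0.0                    # normalise the first alternative to zero

beta = beta_j.reshape(-1)          # stacked coefficient vector of length m * d

def stacked_regressor(x_i, j, m):
    """Place x_i in block j of an otherwise zero vector of length m * len(x_i)."""
    e_j = np.zeros(m)
    e_j[j] = 1.0
    return np.kron(e_j, x_i)

v_direct = beta_j @ x_i            # v_ij = beta_j x_i for each j
v_stacked = np.array([beta @ stacked_regressor(x_i, j, m) for j in range(m)])
```

Both `v_direct` and `v_stacked` hold the same values, which is the point of the rewrite: the varying-coefficient model becomes a single-coefficient model on a sparse regressor.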

Here \(\beta \) represents how preferences for each alternative change as individual characteristics change.

The form of the multinomial model: Combined multinomial and conditional model

If we have observations of the characteristics of both individuals and alternatives we can write:

\(v_{ij}=\beta_m m_{ij}+\beta_cc_{ij}\)

\(v_{ij}=\beta x_{ij}\)

Here \(\beta \) represents both:

  • Average preferences for product characteristics (conditional)

  • How preferences change as individual characteristics change (multinomial)

Extreme IID multinomial


The probability of \(j\) being chosen is:

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

If the errors are independent across alternatives then, conditional on \(\epsilon_{ij}\), we have:

\(P_{ij}=\prod_{k\ne j} P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij})\)

\(P_{ij}=\prod_{k\ne j} F_\epsilon (v_{ij} -v_{ik} +\epsilon_{ij})\)

We do not observe \(\epsilon_{ij}\), so we have to integrate over its distribution.

\(P_{ij}=\int [\prod_{k\ne j} F_\epsilon (v_{ij} -v_{ik} +\epsilon_{ij})]f_\epsilon(\epsilon_{ij})d\epsilon_{ij}\)

Extreme values

We have:

\(P_{ij}=\int [\prod_{k\ne j} F_\epsilon (v_{ij} -v_{ik} +\epsilon_{ij})]f_\epsilon(\epsilon_{ij})d\epsilon_{ij}\)

If \(\epsilon \) follows a type-I extreme value (Gumbel) distribution, this integral has a closed form:

\(P_{ij}=\dfrac{e^{v_{ij}}}{\sum_k e^{v_{ik}}}\)
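
The closed form can be checked by simulation. The sketch below, with assumed values for \(v\) and an arbitrary number of draws, compares the formula with the share of draws in which each alternative has the largest \(v+\epsilon \) under standard Gumbel errors.

```python
import numpy as np

def logit_probs(v):
    """Closed-form probabilities e^{v_j} / sum_k e^{v_k}, computed stably."""
    v = np.asarray(v, dtype=float)
    e = np.exp(v - v.max())        # subtracting the max does not change the ratio
    return e / e.sum()

rng = np.random.default_rng(3)

v = np.array([0.0, 0.5, -1.0])     # assumed deterministic parts v_ij
draws = 200_000
eps = rng.gumbel(size=(draws, v.size))   # standard type-I extreme value errors

choices = (v + eps).argmax(axis=1)
simulated = np.bincount(choices, minlength=v.size) / draws
closed_form = logit_probs(v)
```

The simulated shares should agree with the closed form up to Monte Carlo noise.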

Independence of irrelevant alternatives

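Consider the ratio of two probabilities:

\(\dfrac{P_{ij}}{P_{il}}=\dfrac{e^{v_{ij}}}{e^{v_{il}}}=e^{v_{ij}-v_{il}}\)

This means that changes to any other alternative do not affect the relative odds of \(j\) and \(l\).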

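This can be undesirable, because in practice changing or removing one option often causes uneven substitution towards the remaining options.

For example, for a commute, raising the price of buses is likely to cause more substitution to trains than to helicopters, whereas the model predicts that both shares rise in the same proportion.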
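
A small sketch of this property, using made-up values of \(v\) for bus, train and helicopter: lowering the value of the bus changes all the probabilities, but the train and helicopter probabilities grow by the same factor, so their ratio is unchanged.

```python
import numpy as np

def logit_probs(v):
    """Closed-form logit probabilities e^{v_j} / sum_k e^{v_k}."""
    v = np.asarray(v, dtype=float)
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical values of v for [bus, train, helicopter]
v_before = np.array([1.0, 0.8, -2.0])
v_after = v_before.copy()
v_after[0] -= 1.0                  # a bus price rise lowers the value of the bus

p_before = logit_probs(v_before)
p_after = logit_probs(v_after)

ratio_before = p_before[1] / p_before[2]   # train versus helicopter
ratio_after = p_after[1] / p_after[2]
growth = p_after[1:] / p_before[1:]        # proportional rise of train, helicopter
```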

Estimating multinomial logit models

Estimating with individual level data.
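
The notes do not specify an estimator. One standard option, assumed here, is maximum likelihood: choose \(\beta \) to maximise the sum of \(\log P_{ij}\) over the observed choices. The sketch below uses plain gradient ascent on simulated data; the function names, step size and sample sizes are illustrative assumptions.

```python
import numpy as np

def choice_probs(beta, X):
    """Logit probabilities; X has shape (n, m, d) and beta has shape (d,)."""
    v = X @ beta                                   # (n, m) values v_ij = beta x_ij
    v = v - v.max(axis=1, keepdims=True)           # numerical stability
    e = np.exp(v)
    return e / e.sum(axis=1, keepdims=True)

def mean_log_likelihood(beta, X, y):
    p = choice_probs(beta, X)
    return np.log(p[np.arange(len(y)), y]).mean()

def fit(X, y, steps=500, lr=0.5):
    """Gradient ascent on the mean log-likelihood (assumed settings)."""
    n, m, d = X.shape
    beta = np.zeros(d)
    chosen = X[np.arange(n), y]                    # x_ij of the chosen alternative
    for _ in range(steps):
        p = choice_probs(beta, X)
        expected = (p[:, :, None] * X).sum(axis=1) # sum_j P_ij x_ij
        beta = beta + lr * (chosen - expected).mean(axis=0)
    return beta

# Simulated individual-level data with Gumbel errors
rng = np.random.default_rng(4)
n, m, d = 5000, 3, 2
true_beta = np.array([1.0, -0.5])
X = rng.normal(size=(n, m, d))
y = (X @ true_beta + rng.gumbel(size=(n, m))).argmax(axis=1)

beta_hat = fit(X, y)
```

With enough observations `beta_hat` should be close to `true_beta`. In practice a general-purpose optimiser would usually replace the hand-written loop.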

Estimating with market share level data.

Nested logit

The probability of \(j\) being chosen is:

\(P_{ij}=P(\epsilon_{ik} <v_{ij} -v_{ik} +\epsilon_{ij}\forall k\ne j)\)

If the error terms are not IID this is more difficult to calculate.

We divide the \(m\) alternatives into nests. Conditional on the nest, the choice between its alternatives is treated as an IID logit, while alternatives in the same nest are allowed to share unobserved components that alternatives in other nests do not.

For example, we could nest by public versus private transport, or by type of product, with the firms offering that product as the alternatives within each nest.

The nested logit model chains two or more sequential IID logit models: one to select the nest, and another to select the alternative within the nest.

Mixed logit (random coefficients)


In our standard model we have:

\(z_{ij}=\beta x_{ij} +\epsilon_{ij} \)

If we allow the parameters to vary for each individual we have:

\(z_{ij}=\beta_i x_{ij} +\epsilon_{ij} \)

The probability of choosing \(j\) now depends on the distribution of \(\beta \).

In the IID case we had:

\(P_{ij}=\dfrac{e^{\beta x_{ij}}}{\sum_k e^{\beta x_{ik}}}\)

Rather than evaluate this at a single point \(\beta \) we integrate.

\(P_{ij}=\int \dfrac{e^{\beta x_{ij}}}{\sum_k e^{\beta x_{ik}}}f(\beta )d\beta \)

If the distribution of \(\beta \) is degenerate (all of its mass is at a single point) this reduces to the standard logit model.
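
The integral generally has no closed form, so it is commonly approximated by averaging logit probabilities over simulated draws of \(\beta \). The sketch below assumes independent normal coefficients; the distribution, the function name `mixed_logit_probs` and the example values are assumptions for illustration.

```python
import numpy as np

def logit_probs(v):
    """Logit probabilities along the last axis."""
    v = v - v.max(axis=-1, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=-1, keepdims=True)

def mixed_logit_probs(x, mean, sd, draws=10_000, seed=0):
    """Monte Carlo approximation of the integral over beta.

    x has shape (m, d); beta is assumed to be N(mean, diag(sd**2)).
    """
    rng = np.random.default_rng(seed)
    betas = mean + sd * rng.normal(size=(draws, len(mean)))   # (draws, d)
    v = betas @ x.T                                           # (draws, m)
    return logit_probs(v).mean(axis=0)                        # average over draws

x = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # hypothetical x_ij
mean = np.array([1.0, -1.0])

p_mixed = mixed_logit_probs(x, mean, sd=np.array([1.0, 1.0]))
p_degenerate = mixed_logit_probs(x, mean, sd=np.zeros(2))
p_standard = logit_probs(x @ mean)
```

Setting the standard deviations to zero makes every draw equal to `mean`, so `p_degenerate` coincides with the standard logit probabilities `p_standard`.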

Multinomial probit

This relaxes both the IID and the extreme value assumptions.

The errors are instead jointly normal with a general variance-covariance matrix, so they can be correlated across alternatives.


The softmax function is often used in the last layer of a classification network.

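It takes a vector of dimension \(k\) and returns another vector of the same size, in which every entry lies between \(0\) and \(1\) and the entries sum to \(1\).

The softmax function generalises the sigmoid function: with two inputs, the first output is the sigmoid of the difference between them. It is the same expression as the logit choice probability \(P_{ij}\) above.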
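
A minimal sketch of the softmax function and its link to the sigmoid. The input values are arbitrary, and subtracting the maximum is a common numerical safeguard rather than part of the definition.

```python
import numpy as np

def softmax(z):
    """Map a vector to positive entries that sum to one."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # shift by the max to avoid overflow
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

p = softmax([2.0, 1.0, 0.1])       # three arbitrary inputs
two_class = softmax([1.5, 0.0])    # two inputs: first output equals sigmoid(1.5 - 0.0)
```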


Temperature for Softmax
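
The notes do not define the temperature. A common convention, assumed here, divides the inputs by a temperature \(T>0\) before applying softmax: a large \(T\) flattens the output towards uniform, and a small \(T\) concentrates it on the largest input. The function name and example values below are illustrative.

```python
import numpy as np

def softmax_with_temperature(z, temperature=1.0):
    """Softmax of z / temperature (assumed convention, temperature > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())        # shift by the max to avoid overflow
    return e / e.sum()

z = [2.0, 1.0, 0.1]
p_default = softmax_with_temperature(z)                    # T = 1: plain softmax
p_hot = softmax_with_temperature(z, temperature=100.0)     # close to uniform
p_cold = softmax_with_temperature(z, temperature=0.01)     # close to one-hot
```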