# The Naive Bayes classifier

## Naive Bayes

### The Naive Bayes posterior

#### Bayes theorem

Consider Bayes' theorem:

\(P(y|x_1,x_2,...,x_n)=\dfrac{P(x_1,x_2,...,x_n|y)P(y)}{P(x_1,x_2,...,x_n)}\)

Here, \(y\) is the label and \(x_1,x_2,...,x_n\) is the evidence. We want the probability of each label given the evidence.

The denominator, \(P(x_1,x_2,...,x_n)\), is the same for every label, so for comparing labels we only need:

\(P(y|x_1,x_2,...,x_n)\propto P(x_1,x_2,...,x_n|y)P(y)\)

#### The assumption of Naive Bayes

We assume each \(x_i\) is conditionally independent of the others, given the label \(y\). Therefore:

\(P(x_1,x_2,...,x_n|y)=P(x_1|y)P(x_2|y)...P(x_n|y)\)

\(P(y|x_1,x_2,...,x_n)\propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)\)
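As a small hypothetical illustration with \(n=2\): if \(P(x_1|y)=0.5\), \(P(x_2|y)=0.2\) and \(P(y)=0.3\), then the unnormalised score for that label is \(0.5\times0.2\times0.3=0.03\), which we would compare against the corresponding score for each other label.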

### The Naive Bayes classifier

#### Calculating the Naive Bayes estimator

With the Naive Bayes assumption we have:

\(P(y|x_1,x_2,...,x_n)\propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)\)

We now choose the \(y\) which maximises this (the maximum a posteriori label).

This is much easier to estimate: each one-dimensional term \(P(x_i|y)\) needs far fewer samples than the full joint distribution \(P(x_1,x_2,...,x_n|y)\), which would require examples of every combination of feature values.

This form is used when the evidence is also categorical; for a continuous feature the probability of any individual value is \(0\), so a density (for example, a Gaussian) would be used instead.
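A minimal sketch of the decision rule, assuming the probabilities have already been estimated and stored in hypothetical `priors` and `likelihoods` structures chosen just for this example:

```python
import math

def naive_bayes_predict(x, priors, likelihoods):
    """Return the label y maximising P(y) * prod_i P(x_i | y).

    priors:      dict mapping label -> P(y)
    likelihoods: dict mapping label -> list (one entry per feature) of
                 dicts mapping a feature value -> P(x_i | y)
    """
    best_label, best_score = None, -math.inf
    for y, prior in priors.items():
        # Work in log space: summing logs avoids underflow from
        # multiplying many small probabilities together.
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(likelihoods[y][i][value])
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```

Note that a zero likelihood would break the logarithm (and collapse the whole product), which is what the regularisation below is designed to avoid.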

#### Estimating \(P(y)\)

We can estimate \(P(y)\) directly as the relative frequency of each label in the training sample.
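A short sketch of this frequency count, using made-up labels:

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(y) as the relative frequency of each label."""
    counts = Counter(labels)
    return {y: n / len(labels) for y, n in counts.items()}

# estimate_priors(["spam", "ham", "ham", "ham"]) == {"spam": 0.25, "ham": 0.75}
```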

#### Estimating \(P(x_1|y)\)

Normally, \(P(x_1|y)=\dfrac{n_c}{n_y}\), where:

\(n_c\) is the number of training samples with label \(y\) in which \(x_1\) takes the observed value.

\(n_y\) is the number of training samples with label \(y\).
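A sketch of this ratio, assuming `samples` is a list of feature tuples aligned with `labels` (both hypothetical names):

```python
def estimate_likelihood(samples, labels, i, value, label):
    """Estimate P(x_i = value | y = label) as n_c / n_y."""
    n_y = sum(1 for l in labels if l == label)
    n_c = sum(1 for s, l in zip(samples, labels)
              if l == label and s[i] == value)
    return n_c / n_y
```

If a feature value never occurs with some label, this estimate is exactly \(0\), which motivates the regularisation below.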

#### Regularising the Naive Bayes estimator

To reduce the risk of individual probabilities being zero, we can adjust the estimates so that:

\(P(x_1|y)=\dfrac{n_c+mp}{n_y+m}\), where:

\(p\) is a prior estimate of the probability. If this is unknown, use \(\dfrac{1}{k}\), where \(k\) is the number of distinct values the feature can take.

\(m\) is a parameter called the equivalent sample size: the estimate behaves as if \(m\) extra "virtual" samples, distributed according to \(p\), had been added to the data.
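A one-line sketch of this regularised estimate:

```python
def m_estimate(n_c, n_y, m, p):
    """Regularised estimate of P(x_1 | y): (n_c + m*p) / (n_y + m)."""
    return (n_c + m * p) / (n_y + m)

# Even when n_c == 0 the estimate is m*p / (n_y + m), which is
# strictly positive, so no single likelihood zeroes out the product.
```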

### Text classification using Naive Bayes

#### Naive Bayes and text classification

Naive Bayes can be used to classify text documents. The \(x\) variables can be appearances of each word, and \(y\) can be the document classification.
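As a sketch of how this might look in practice, here is scikit-learn's multinomial Naive Bayes on a tiny made-up corpus (the documents and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Each x_i is a word count; y is the document class.
docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches now", "project meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()      # bag-of-words word counts
X = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)      # alpha smooths counts, like the m-estimate above
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["buy cheap pills"])))  # -> ['spam']
```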