# Summary statistics

## Basic statistics for a single variable

### N

This is the size of the sample, that is, the number of observations.

### Sample range

#### Minimum

This is the smallest value in the sample.

#### Maximum

This is the largest value in the sample.

#### Range

This is the difference between the maximum and minimum.

#### Median

This is the value below which 50% of the sample lies. It is the $$50$$th percentile.

#### Percentiles

The $$x$$th percentile is the value below which $$x\%$$ of the sample lies.

#### Interquartile range

This is the difference between the $$75$$th percentile and the $$25$$th percentile.
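The quantities above can be sketched with Python's standard library. The data in `sample` is made up for illustration, and the `"inclusive"` quantile method is an assumption: several percentile conventions exist and they can give slightly different quartiles on small samples.

```python
import statistics

# Illustrative data only (not from the text).
sample = [2, 4, 4, 5, 7, 9, 10, 12]

n = len(sample)                      # N, the sample size
minimum = min(sample)
maximum = max(sample)
sample_range = maximum - minimum     # maximum minus minimum
median = statistics.median(sample)   # 50th percentile

# quantiles(n=4) returns the three quartile cut points:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1                        # interquartile range
```

With this convention the second quartile `q2` coincides with `median`.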

### Sample mode

This is the most common value in the sample.
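A sample can have more than one most common value. The sketch below, using a hypothetical helper `sample_modes`, returns every value that attains the highest count. It assumes a non-empty sample.

```python
from collections import Counter


def sample_modes(values):
    """Return all values that occur most often, in order of first appearance."""
    counts = Counter(values)
    highest = max(counts.values())   # raises ValueError on an empty sample
    return [value for value, count in counts.items() if count == highest]


modes = sample_modes([1, 3, 3, 5, 5, 7])
```

Here both `3` and `5` occur twice, so both are returned.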

## Sample moments

### Sample mean

We previously defined the population mean as $$\mu=E[X]$$.

The sample mean is defined as $$\bar x = \dfrac{1}{n}\sum_i x_i$$.

#### Centred mean

We can subtract the mean from each entry in the sample. This will leave a new mean of $$0$$. This is convenient for many calculations.
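As a minimal sketch, the sample mean and the centring step might look like this. The function names `sample_mean` and `centre` are illustrative choices, not taken from the text.

```python
def sample_mean(xs):
    """Sample mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def centre(xs):
    """Subtract the sample mean from every entry."""
    mean = sample_mean(xs)
    return [x - mean for x in xs]


xs = [2.0, 4.0, 6.0, 8.0]
centred = centre(xs)   # the centred sample has mean 0
```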

### Sample variance

We previously defined the population variance as $$\sigma^2=E[(X-\mu)^2]$$.

We define the sample variance as $$\sigma^2=\dfrac{1}{n}\sum_i(x_i-\bar x)^2$$.

We can calculate this using matrices, treating $$X$$ as the column vector of the $$n$$ observations:

$$M=X-\bar x$$

$$\sigma^2=\dfrac{1}{n}M^TM$$.

#### Centred variance

If $$\bar x =0$$ then:

$$\sigma^2=\dfrac{1}{n}X^TX$$.
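The variance formulas above can be sketched in plain Python, with the inner product $$M^TM$$ written as a sum of squares. Note that this divides by $$n$$, as in the definition here, rather than by $$n-1$$. The standard library's `statistics.pvariance` uses the same divisor and serves as a cross-check. The function name `sample_variance` is illustrative.

```python
import statistics


def sample_variance(xs):
    """Variance with divisor n: (1/n) * M^T M, where M = X - mean."""
    n = len(xs)
    mean = sum(xs) / n
    m = [x - mean for x in xs]          # M = X - x_bar
    return sum(d * d for d in m) / n    # (1/n) M^T M


xs = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(xs)
reference = statistics.pvariance(xs)    # also divides by n
```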

## Statistics for two variables

### Sample covariance

#### Calculating

We previously defined the population covariance as $$\sigma_{XY}=E[(X-\mu_X)^T(Y-\mu_Y)]$$.

We define the sample covariance as $$\sigma_{XY}=\dfrac{1}{n}\sum_i(x_i-\bar x)(y_i-\bar y)$$.

We can calculate this using matrices:

$$M=X-\bar x$$

$$N=Y-\bar y$$

$$\sigma_{XY}=\dfrac{1}{n}M^TN$$.
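A minimal sketch of the sample covariance, again with divisor $$n$$ and with $$M^TN$$ written as a sum of products. The function name `sample_covariance` is an illustrative choice.

```python
def sample_covariance(xs, ys):
    """Covariance with divisor n: (1/n) * M^T N."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = [x - mean_x for x in xs]        # M = X - x_bar
    nn = [y - mean_y for y in ys]       # N = Y - y_bar
    return sum(a * b for a, b in zip(m, nn)) / n


cov_xy = sample_covariance([1, 2, 3, 4], [2, 4, 6, 8])
cov_xx = sample_covariance([1, 2, 3, 4], [1, 2, 3, 4])
```

The covariance of a variable with itself, `cov_xx`, is its variance.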

#### Sample correlation

$$\rho_{XY}=\dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

#### Covariance matrix

If we have $$k$$ variables, each observed $$n$$ times, we can form a $$k\times k$$ matrix $$\Sigma$$ where:

$$\Sigma_{ij} = \sigma_{ij}=\dfrac{1}{n}(X_i-\bar x_i)^T(X_j-\bar x_j)$$
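The covariance matrix can be sketched by applying the pairwise formula to every pair of variables. The layout is an assumption made for illustration: `columns` is a list holding one list of observations per variable, all of equal length.

```python
def covariance_matrix(columns):
    """Return Sigma, where Sigma[i][j] is the covariance of variables i and j."""
    n = len(columns[0])                 # number of observations
    centred = []
    for col in columns:
        mean = sum(col) / n
        centred.append([x - mean for x in col])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in centred]
        for ci in centred
    ]


sigma = covariance_matrix([
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 1],
])
```

The result is symmetric, and its diagonal holds the individual variances.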

#### Centred covariance

If $$\bar x = \bar y = 0$$ then:

$$\sigma_{XY}=\dfrac{1}{n}X^TY$$

#### Correlation matrix

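This is the analogue of the covariance matrix in which each entry is the correlation $$\rho_{ij}$$ rather than the covariance $$\sigma_{ij}$$. Every diagonal entry is therefore $$1$$.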

### Pearson correlation coefficient

The Pearson correlation coefficient is defined as the covariance normalised by the product of the individual standard deviations.

It ranges from $$-1$$ (total negative linear correlation) through $$0$$ (no linear correlation) to $$1$$ (total positive linear correlation).

$$\rho_{X,Y}=\dfrac{cov (X,Y)}{\sigma_X\sigma_Y}$$
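A minimal sketch of the Pearson coefficient, built from the covariance and the two standard deviations (all with divisor $$n$$). The function name `pearson` is illustrative. It assumes neither variable is constant, since a zero standard deviation would cause a division by zero.

```python
import math


def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, increasing
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly linear, decreasing
```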

### Spearman rank correlation

#### Ranking variables

We rank the observations of each of the $$2$$ variables separately, for example from smallest to largest.

From $$X$$ and $$Y$$ we then have $$R_X$$ and $$R_Y$$.

#### Calculating the Spearman rank correlation

We then calculate the Pearson correlation coefficient between the rankings.

$$r_S=\dfrac{cov(R_X, R_Y)}{\sigma_{R_X}\sigma_{R_Y}}$$
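The two steps, ranking and then correlating the ranks, can be sketched as follows. The text does not say how ties are handled. Giving tied values the average of their positions is an assumption here, though a common one. The names `rank`, `pearson` and `spearman` are illustrative.

```python
import math


def rank(values):
    """1-based ranks, smallest first; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1       # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation with divisor n (assumes non-constant inputs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Pearson correlation between the rankings R_X and R_Y."""
    return pearson(rank(xs), rank(ys))


r_s = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```

The relationship in the example is monotone but not linear, and the rank correlation is still $$1$$ up to floating-point rounding.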

## Updating statistics

### Updating the mean, variance and covariance

#### Mean

$$\bar x_{n+1} = \dfrac{n\bar x_n+x_{n+1}}{n+1}$$
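The update rule lets us maintain the mean of a stream without storing the earlier observations. A minimal sketch, with the illustrative function name `update_mean`:

```python
def update_mean(mean_n, n, x_new):
    """Mean of n + 1 observations, given the mean of the first n."""
    return (n * mean_n + x_new) / (n + 1)


mean, count = 0.0, 0
for x in [2, 4, 6, 8]:
    mean = update_mean(mean, count, x)
    count += 1
```

Starting from `count = 0`, the initial value of `mean` has no effect because it is multiplied by zero.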

#### Variance

If the sample is centred, so that $$\bar x = 0$$:

$$\sigma_n^2=\dfrac{1}{n}X_n^TX_n$$

So:

$$\sigma_{n+1}^2=\dfrac{n\sigma_n^2 +x_{n+1}^Tx_{n+1}}{n+1}$$

#### Covariance

If both samples are centred, so that $$\bar x = \bar y = 0$$:

$$\sigma_{XY}^n=\dfrac{1}{n}X_n^TY_n$$

So:

$$\sigma_{XY}^{n+1}=\dfrac{n\sigma^n_{XY}+x_{n+1}^Ty_{n+1}}{n+1}$$
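The centred updates can be sketched in the same way as the mean update. These formulas hold only while the data stay centred, that is, when the mean is known to be $$0$$ and is not re-estimated as observations arrive. In general a new observation shifts the sample mean. The function names are illustrative.

```python
def update_centred_variance(var_n, n, x_new):
    """Variance of n + 1 centred observations, given that of the first n."""
    return (n * var_n + x_new * x_new) / (n + 1)


def update_centred_covariance(cov_n, n, x_new, y_new):
    """Covariance of n + 1 centred pairs, given that of the first n."""
    return (n * cov_n + x_new * y_new) / (n + 1)


# Illustrative data chosen so that both samples have mean 0.
xs = [-1, 1, -2, 2]
ys = [-2, 2, -4, 4]

var, cov, count = 0.0, 0.0, 0
for x, y in zip(xs, ys):
    var = update_centred_variance(var, count, x)
    cov = update_centred_covariance(cov, count, x, y)
    count += 1
```

The running values agree with the batch formulas $$\dfrac{1}{n}X^TX$$ and $$\dfrac{1}{n}X^TY$$ applied to the full samples.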

## Splitting data into classes

### Splitting data into classes

We have $$X$$, and we want to split this into $$m$$ different matrices.

We can do this by creating an indicator vector $$v$$ for each group, with one entry per row of $$X$$: the entry is $$1$$ if the row is in the group and $$0$$ otherwise.

We then select the rows $$X[v]$$, treating $$v$$ as a boolean mask.

Alternatively we can multiply $$X$$ by the indicator, for example as $$\operatorname{diag}(v)X$$, which zeroes the rows outside the group. We must then trim these extra rows.
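A minimal sketch of the selection approach, with $$X$$ stored as a list of rows. The `labels` list, which gives the class of each row, is an assumption introduced for illustration, as is the function name `split_by_class`.

```python
def split_by_class(rows, labels):
    """Return a dict mapping each class label to the rows belonging to it."""
    groups = {}
    for label in sorted(set(labels)):
        # Indicator vector v: 1 if the row is in this group, 0 otherwise.
        v = [1 if row_label == label else 0 for row_label in labels]
        # Equivalent of X[v]: keep only the rows where v is 1.
        groups[label] = [row for row, keep in zip(rows, v) if keep]
    return groups


X = [[1, 2], [3, 4], [5, 6], [7, 8]]
labels = ["a", "b", "a", "b"]
groups = split_by_class(X, labels)
```

Each of the $$m$$ groups keeps the original row order, and every row lands in exactly one group.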