Summary statistics

Basic statistics for a single variable

N

This is the size of the sample.

Sample range

Minimum

This is the smallest value in the sample.

Maximum

This is the largest value in the sample.

Range

This is the difference between the maximum and minimum.

Median

This is the value below which \(50\%\) of the sample lies.

Percentiles

The \(x\)th percentile is the value below which \(x\%\) of the sample lies.

Interquartile range

This is the difference between the \(75\)th percentile and the \(25\)th percentile.
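
As a rough illustration of the range and percentile statistics above, the following Python sketch uses the standard statistics module on a made-up sample. Percentiles have several competing conventions; the "inclusive" interpolation method used here is only one of them, chosen as an assumption for this example.

```python
import statistics

sample = [7, 2, 10, 4, 15, 4, 9, 12, 5]  # hypothetical data

minimum = min(sample)
maximum = max(sample)
sample_range = maximum - minimum          # maximum minus minimum
median = statistics.median(sample)        # the 50th percentile

# Quartiles by linear interpolation between sorted values.
# "inclusive" is one of several percentile conventions.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1                             # 75th percentile minus 25th

print(minimum, maximum, sample_range, median, q1, q3, iqr)
```

Other conventions can give slightly different quartiles on the same data, so the interquartile range depends on the method chosen.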

Sample mode

This is the most common value in the sample.
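
A sample can have more than one most common value. The sketch below, on made-up data, counts occurrences and keeps every value that reaches the highest count; returning all ties is a choice made for this example.

```python
from collections import Counter

sample = [3, 1, 3, 2, 1, 3, 2]  # hypothetical data

counts = Counter(sample)
highest = max(counts.values())

# Collect every value that reaches the highest count, so ties are not lost.
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes)
```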

Sample moments

Sample mean

We previously defined the population mean as \(\mu=E[X]\).

The sample mean is defined as \(\bar x = \dfrac{1}{n}\sum_i x_i\).

Centred mean

We can subtract the sample mean from each entry in the sample. The centred sample then has a mean of \(0\), which is convenient for many calculations.
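
A minimal sketch of the sample mean and of centring, using a made-up sample: it computes \(\bar x\), subtracts it from every entry, and checks that the centred sample has mean zero up to rounding error.

```python
sample = [2.0, 4.0, 6.0, 8.0]  # hypothetical data

n = len(sample)
mean = sum(sample) / n                   # x-bar = (1/n) * sum of x_i
centred = [x - mean for x in sample]     # subtract the mean from each entry
centred_mean = sum(centred) / n          # zero, up to rounding error

print(mean, centred, centred_mean)
```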

Sample variance

We previously defined the population variance as \(\sigma^2=E[(X-\mu)^2]\).

We define the sample variance as \(\sigma^2=\dfrac{1}{n}\sum_i(x_i-\bar x)^2\).

We can calculate this using matrices:

\(M=X-\bar x\)

\(\sigma^2=\dfrac{1}{n}M^TM\).
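
The calculation above can be sketched in plain Python, treating the sample as a column vector stored in a list (made-up data). With the \(\dfrac{1}{n}\) factor the result agrees with statistics.pvariance from the standard library; statistics.variance divides by \(n-1\) instead and so gives a different number.

```python
import statistics

sample = [2.0, 4.0, 6.0, 8.0]  # hypothetical data

n = len(sample)
mean = sum(sample) / n
m = [x - mean for x in sample]           # M = X - x-bar
variance = sum(v * v for v in m) / n     # (1/n) * M^T M

# Cross-check against the standard library's 1/n variance.
library_variance = statistics.pvariance(sample)

print(variance, library_variance)
```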

Centred variance

If \(\bar x =0\) then:

\(\sigma^2=\dfrac{1}{n}X^TX\).

Statistics for two variables

Sample covariance

Calculating

We previously defined the population covariance as \(\sigma_{XY}=E[(X-\mu_X)(Y-\mu_Y)]\).

We define the sample covariance as \(\sigma_{XY}=\dfrac{1}{n}\sum_i(x_i-\bar x)(y_i-\bar y)\).

We can calculate this using matrices:

\(M=X-\bar x\)

\(N=Y-\bar y\)

\(\sigma_{XY}=\dfrac{1}{n}M^TN\).
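
As an illustration, the sketch below centres two made-up samples of equal length and takes the inner product of the centred vectors, divided by the sample size. The variable names are chosen for this example only.

```python
x = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
y = [2.0, 4.0, 5.0, 9.0]  # hypothetical data

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

centred_x = [xi - x_bar for xi in x]     # M = X - x-bar
centred_y = [yi - y_bar for yi in y]     # N = Y - y-bar

# (1/n) * M^T N: inner product of the centred vectors over the sample size.
covariance = sum(a * b for a, b in zip(centred_x, centred_y)) / n

print(covariance)
```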

Sample correlation

\(\rho_{XY}=\dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}\)

Covariance matrix

If we have \(k\) variables, each observed \(n\) times, we can form a \(k\times k\) matrix \(\Sigma\) where:

\(\Sigma_{ij} = \sigma_{ij}=\dfrac{1}{n}(X_i-\bar x_i)^T(X_j-\bar x_j)\)
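
A small sketch of building \(\Sigma\) entry by entry, assuming each variable is stored as a list of observations of the same length. The function name covariance_matrix and the data are made up for this example.

```python
def covariance_matrix(columns):
    """Return the matrix of pairwise sample covariances (1/n convention)."""
    n = len(columns[0])
    centred = []
    for column in columns:
        mean = sum(column) / n
        centred.append([value - mean for value in column])
    # Entry (i, j) is (1/n) * (X_i - mean_i)^T (X_j - mean_j).
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in centred]
        for ci in centred
    ]

columns = [
    [1.0, 2.0, 3.0, 4.0],  # hypothetical variable 1
    [2.0, 4.0, 5.0, 9.0],  # hypothetical variable 2
    [4.0, 3.0, 2.0, 1.0],  # hypothetical variable 3
]
sigma = covariance_matrix(columns)

for row in sigma:
    print(row)
```

The diagonal entries are the sample variances of the individual variables, and the matrix is symmetric because \(\sigma_{ij}=\sigma_{ji}\).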

Centred covariance

If \(\bar x = \bar y = 0\) then:

\(\sigma_{XY}=\dfrac{1}{n}X^TY\)

Correlation matrix

Here each entry is the correlation rather than the covariance.

Pearson correlation coefficient

The Pearson correlation coefficient is defined as the covariance normalised by the product of the individual standard deviations.

It ranges from \(-1\) (total negative linear correlation) through \(0\) (no linear correlation) to \(1\) (total positive linear correlation).

\(\rho_{X,Y}=\dfrac{cov(X,Y)}{\sigma_X\sigma_Y}\)
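
The following sketch computes the coefficient from the definition, using the \(\dfrac{1}{n}\) convention for both the covariance and the standard deviations (the factor cancels, so other conventions give the same ratio). The function name pearson and the data are made up for this example; an exactly increasing linear relationship should give \(1\) and an exactly decreasing one \(-1\), up to rounding error.

```python
import math

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

x = [1.0, 2.0, 3.0, 4.0]                    # hypothetical data
r_up = pearson(x, [3.0, 5.0, 7.0, 9.0])     # y = 2x + 1
r_down = pearson(x, [8.0, 6.0, 4.0, 2.0])   # y = 10 - 2x

print(r_up, r_down)
```

The sketch does not guard against a constant variable, for which the standard deviation is zero and the coefficient is undefined.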

Spearman rank correlation

Ranking variables

For each of the \(2\) variables we rank its values, for example from smallest to largest.

From \(X\) and \(Y\) we then have \(R_X\) and \(R_Y\).

Calculating the Spearman rank correlation

We then calculate the Pearson correlation coefficient between the rankings.

\(r_S=\dfrac{cov(R_X, R_Y)}{\sigma_{R_X}\sigma_{R_Y}}\)
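
A sketch of the two steps, ranking and then correlating the ranks. It assumes ranks start at \(1\) for the smallest value and that tied values share the average of their ranks, which is a common convention but an assumption here. The function names and data are made up for this example.

```python
import math

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical data
y = [1.0, 8.0, 27.0, 64.0, 125.0]    # y = x^3: monotone but not linear

r_x = ranks(x)
r_y = ranks(y)
r_s = pearson(r_x, r_y)              # Pearson correlation of the rankings

print(r_x, r_y, r_s)
```

Because the relationship is monotone, the rankings agree exactly and the rank correlation is \(1\) up to rounding error, even though the relationship is not linear.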

Updating statistics

Updating the mean, variance and covariance

Mean

\(\bar x_{n+1} = \dfrac{n\bar x_n+x_{n+1}}{n+1}\)
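
A minimal sketch of the update, feeding made-up observations one at a time and keeping only the running mean and the count. The function name update_mean is hypothetical.

```python
def update_mean(mean_n, n, x_new):
    """Mean of n + 1 values, given the mean of the first n and the new value."""
    return (n * mean_n + x_new) / (n + 1)

values = [2.0, 4.0, 6.0, 8.0]  # hypothetical data arriving one at a time

mean = 0.0  # placeholder; it is multiplied by n = 0 on the first update
for count, x in enumerate(values):
    mean = update_mean(mean, count, x)

print(mean, sum(values) / len(values))
```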

Variance

If it is centred:

\(\sigma_n^2=\dfrac{1}{n}X_n^TX_n\)

So:

\(\sigma_{n+1}^2=\dfrac{n\sigma_n^2 +x_{n+1}^Tx_{n+1}}{n+1}\)
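
The update can be sketched as below. It assumes the observations are already centred around a fixed mean of zero; if the mean shifts as new data arrives, the stored variance would need recentring, which this sketch does not attempt. The function name and data are made up for this example.

```python
def update_centred_variance(var_n, n, x_new):
    """Variance of n + 1 centred values from the variance of the first n."""
    return (n * var_n + x_new * x_new) / (n + 1)

centred = [-3.0, -1.0, 1.0, 3.0]  # hypothetical data with mean zero

variance = 0.0  # placeholder; it is multiplied by n = 0 on the first update
for count, x in enumerate(centred):
    variance = update_centred_variance(variance, count, x)

# Batch value (1/n) * X^T X for comparison.
batch = sum(x * x for x in centred) / len(centred)

print(variance, batch)
```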

Covariance

If it is centred:

\(\sigma_{XY}^n=\dfrac{1}{n}X_n^TY_n\)

So:

\(\sigma_{XY}^{n+1}=\dfrac{n\sigma^n_{XY}+x_{n+1}^Ty_{n+1}}{n+1}\)
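
A sketch of the covariance update under the same assumption that both variables are already centred around zero. The function name and data are made up for this example.

```python
def update_centred_covariance(cov_n, n, x_new, y_new):
    """Covariance of n + 1 centred pairs from the covariance of the first n."""
    return (n * cov_n + x_new * y_new) / (n + 1)

x = [-1.5, -0.5, 0.5, 1.5]   # hypothetical data with mean zero
y = [-3.0, -1.0, 0.0, 4.0]   # hypothetical data with mean zero

covariance = 0.0  # placeholder; it is multiplied by n = 0 on the first update
for count, (xi, yi) in enumerate(zip(x, y)):
    covariance = update_centred_covariance(covariance, count, xi, yi)

# Batch value (1/n) * X^T Y for comparison.
batch = sum(a * b for a, b in zip(x, y)) / len(x)

print(covariance, batch)
```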

Splitting data into classes

We have \(X\), and we want to split this into \(m\) different matrices.

We can do this by creating an indicator vector \(v\) for each group, with one entry per row of \(X\): the entry is \(1\) if the row is in the group and \(0\) otherwise.

We then select \(X[v]\).

Alternatively we can multiply by the diagonal matrix built from \(v\), \(\operatorname{diag}(v)X\), which zeroes every row outside the group; we must then trim those zero rows.
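
The selection \(X[v]\) can be sketched as follows, assuming \(X\) is stored as a list of rows and that each row carries a class label. The labels, the data and the function name split_by_class are made up for this example.

```python
def split_by_class(rows, labels):
    """Return one sub-matrix per class, selected with a 0/1 indicator vector."""
    groups = {}
    for label in sorted(set(labels)):
        # v has one entry per row: 1 if the row is in the group, 0 otherwise.
        v = [1 if row_label == label else 0 for row_label in labels]
        # X[v]: keep only the rows whose indicator is 1.
        groups[label] = [row for row, keep in zip(rows, v) if keep]
    return groups

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # hypothetical X
labels = ["a", "b", "a", "c"]                            # hypothetical classes

groups = split_by_class(rows, labels)

for label, matrix in groups.items():
    print(label, matrix)
```

Every row ends up in exactly one group, so the group sizes add up to the number of rows of \(X\).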