Summary statistics and visualisation for multiple variables

Statistics for two variables

Sample covariance

We previously defined the population covariance of two scalar random variables as \(\sigma_{XY}=E[(X-\mu_X)(Y-\mu_Y)]\).

We define the sample covariance as \(\sigma_{XY}=\dfrac{1}{n}\sum_i(x_i-\bar x)(y_i-\bar y)\).

Treating \(X\) and \(Y\) as column vectors holding the \(n\) observations, we can calculate this with a vector product:

\(M=X-\bar x\)

\(N=Y-\bar y\)

\(\sigma_{XY}=\dfrac{1}{n}M^TN\).
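
The vector form above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `sample_covariance` and the example data are assumptions made here, and the \(1/n\) normalisation follows the definition in the text rather than the \(1/(n-1)\) variant.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance with the 1/n normalisation used in the text."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x - x.mean()       # M = X - x_bar
    n_vec = y - y.mean()   # N = Y - y_bar
    return (m @ n_vec) / x.size  # (1/n) M^T N

# Hypothetical example data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]
cov_xy = sample_covariance(x, y)
print(cov_xy)
```

The result should agree with `np.cov(x, y, bias=True)[0, 1]`, where `bias=True` selects the same \(1/n\) normalisation.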

Sample correlation

\(\rho_{XY}=\dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}\)

Covariance matrix

If we have \(p\) variables \(X_1,\dots,X_p\), each observed \(n\) times, we can form a \(p\times p\) matrix \(\Sigma\) where:

\(\Sigma_{ij} = \sigma_{ij}=\dfrac{1}{n}(X_i-\bar x_i)^T(X_j-\bar x_j)\)
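
As a rough sketch, every entry of \(\Sigma\) can be computed at once by centring a data matrix and multiplying it by its own transpose. The layout (one row per observation, one column per variable), the function name `covariance_matrix` and the data are assumptions made for this example.

```python
import numpy as np

def covariance_matrix(data):
    """Covariance matrix with 1/n normalisation.

    Assumes `data` has one row per observation and one column per variable.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]                    # number of observations
    centred = data - data.mean(axis=0)   # subtract each column's mean
    return centred.T @ centred / n

# Hypothetical data: 4 observations of 3 variables.
data = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, 0.9],
    [4.0, 9.0, 0.3],
])
sigma = covariance_matrix(data)
print(sigma)
```

The diagonal entries are the variances of the individual variables, and the matrix is symmetric because \(\sigma_{ij}=\sigma_{ji}\).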

Centred covariance

If \(\bar x = \bar y = 0\) then:

\(\sigma_{XY}=\dfrac{1}{n}X^TY\)

Correlation matrix

Here each entry is the correlation rather than the covariance.
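
One way to obtain the correlation matrix, sketched below, is to divide each covariance entry by the product of the two corresponding standard deviations. The function name `correlation_matrix` and the data are illustrative assumptions, and the sketch assumes no variable is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def correlation_matrix(data):
    """Correlation matrix; assumes rows are observations, columns are variables."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    sigma = centred.T @ centred / data.shape[0]  # covariance matrix (1/n)
    std = np.sqrt(np.diag(sigma))                # standard deviation of each variable
    return sigma / np.outer(std, std)            # sigma_ij / (sigma_i * sigma_j)

# Hypothetical data: 4 observations of 3 variables.
data = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, 0.9],
    [4.0, 9.0, 0.3],
])
corr = correlation_matrix(data)
print(corr)
```

Every diagonal entry is \(1\), since each variable is perfectly correlated with itself.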

Correlation coefficients

Pearson correlation coefficient

The Pearson correlation coefficient is defined as the covariance normalised by the product of the individual standard deviations.

It lies between \(-1\) and \(1\): \(-1\) indicates total negative linear correlation, \(0\) no linear correlation, and \(1\) total positive linear correlation.

\(\rho_{X,Y}=\dfrac{cov (X,Y)}{\sigma_X\sigma_Y}\)
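
To make the two extremes concrete, the sketch below evaluates the coefficient on a perfectly increasing and a perfectly decreasing linear relationship. The function name `pearson` and the data are assumptions made for illustration; the \(1/n\) factors cancel between numerator and denominator, so they are omitted.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # Covariance over the product of standard deviations; the 1/n factors cancel.
    return (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))

x = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson(x, [2.0, 4.0, 6.0, 8.0])   # y = 2x, increasing line
r_neg = pearson(x, [8.0, 6.0, 4.0, 2.0])   # y = 10 - 2x, decreasing line
print(r_pos, r_neg)
```

Up to floating-point rounding, the first value is \(1\) and the second is \(-1\).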

Spearman rank correlation

For each of the two variables we rank its observations from smallest to largest.

From \(X\) and \(Y\) we then have \(R_X\) and \(R_Y\).

We then calculate the Pearson correlation coefficient between the rankings.

\(r_S=\dfrac{cov(R_X, R_Y)}{\sigma_{R_X}\sigma_{R_Y}}\)
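
A small sketch of this procedure follows. It assumes there are no tied values (ties are usually given averaged ranks, which this example does not handle); the helper names `ranks` and `spearman` and the data are hypothetical.

```python
import numpy as np

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    values = np.asarray(values)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    return np.corrcoef(rx, ry)[0, 1]

x = [1, 2, 3, 4, 5]
y_monotone = [1, 8, 27, 64, 125]   # y = x^3: monotone but not linear
y_swapped = [2, 1, 4, 3, 5]        # two adjacent pairs swapped

r_monotone = spearman(x, y_monotone)
r_swapped = spearman(x, y_swapped)
pearson_monotone = np.corrcoef(x, y_monotone)[0, 1]
print(r_monotone, r_swapped, pearson_monotone)
```

Because only the ordering matters, the cubic relationship gives a rank correlation of \(1\) even though its Pearson coefficient is below \(1\).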

Kendall rank correlation

Concordant and discordant pairs

\(\tau = \dfrac{n_{\text{concordant}}-n_{\text{discordant}}}{\binom{n}{2}}\)
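
A pair of observations \((x_i,y_i)\), \((x_j,y_j)\) is concordant when \(x\) and \(y\) order the two observations the same way, and discordant when they order them oppositely. The sketch below counts both kinds by checking every pair directly; it assumes no ties, takes quadratic time, and uses a hypothetical function name `kendall_tau` and example data.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau by direct pair counting; assumes no tied values."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # both variables order the pair the same way
        elif s < 0:
            discordant += 1   # the variables order the pair oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)
print(tau)
```

Here \(8\) of the \(10\) pairs are concordant and \(2\) are discordant, so \(\tau=(8-2)/10=0.6\).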

General correlation coefficient

Updating statistics

Updating the covariance

If it is centred:

\(\sigma_{XY}^n=\dfrac{1}{n}X_n^TY_n\)

So:

\(\sigma_{XY}^{n+1}=\dfrac{n\sigma^n_{XY}+x_{n+1}y_{n+1}}{n+1}\)
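
The update can be sketched as a running computation that never stores earlier observations. This example assumes the data are centred in the sense that the means are known to be zero, so a new observation does not change the centring; the function name `update_centred_covariance` and the data are hypothetical.

```python
def update_centred_covariance(cov_n, n, x_new, y_new):
    """Covariance after n + 1 observations, given the value after n.

    Assumes both variables have mean zero, so no re-centring is needed.
    """
    return (n * cov_n + x_new * y_new) / (n + 1)

# Hypothetical centred observations, processed one at a time.
xs = [0.5, -1.0, 2.0, -1.5]
ys = [1.0, -2.0, 3.0, -2.0]

cov = 0.0
for n, (x_new, y_new) in enumerate(zip(xs, ys)):
    cov = update_centred_covariance(cov, n, x_new, y_new)

# Batch value for comparison: (1/n) X^T Y.
batch = sum(a * b for a, b in zip(xs, ys)) / len(xs)
print(cov, batch)
```

The running value matches the batch value, since multiplying \(\sigma^n_{XY}\) by \(n\) recovers the sum of the first \(n\) products.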

Visualising multiple continuous variables

Time series

Scatter plots (with size as variable)

Q-Q plots

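Plot the quantiles of one variable against the corresponding quantiles of the other. If the two distributions have the same shape, the points lie close to a straight line.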
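
As a sketch of the underlying computation, the points of a Q-Q plot can be obtained by evaluating both samples at the same set of probabilities. The function name `qq_points`, the choice of probabilities and the simulated data are all assumptions made for illustration; drawing the result is left to a plotting library.

```python
import numpy as np

def qq_points(x, y, num=19):
    """Matching quantiles of two samples at evenly spaced probabilities."""
    probs = np.linspace(0.05, 0.95, num)
    return np.quantile(x, probs), np.quantile(y, probs)

# Simulated data: y is a shifted and rescaled copy of x,
# so the two samples share the same distributional shape.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0

qx, qy = qq_points(x, y)
# Plotting qx against qy (for example with a scatter plot) gives the Q-Q plot.
```

Because `y` is a linear function of `x` here, the quantile pairs fall exactly on the line \(q_y = 2q_x + 1\).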

Visualising a single class variable

Bar and column charts

Pie charts

Visualising multiple class variables

Stacked bar and column charts

Visualising class and continuous variables

Multiple box and whiskers

Scatter plots with colour

Visualising geographic data

Visualising time series

Heat maps

Sparklines