# Summary statistics and visualisation for multiple variables

## Statistics for two variables

### Sample covariance

We previously defined the population covariance as $$\sigma_{XY}=E[(X-\mu_X)^T(Y-\mu_Y)]$$.

For $$n$$ paired observations $$(x_i, y_i)$$, we define the sample covariance as $$\sigma_{XY}=\dfrac{1}{n}\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)$$.

We can calculate this with vectors, where $$X$$ and $$Y$$ are the column vectors of the $$n$$ observations:

$$M=X-\bar x$$

$$N=Y-\bar y$$

$$\sigma_{XY}=\dfrac{1}{n}M^TN$$.
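As a rough illustration, the calculation above can be sketched in plain Python. The function name `sample_covariance` and the example data are made up for this sketch. It uses the $$\dfrac{1}{n}$$ normalisation from the formula above, not the $$\dfrac{1}{n-1}$$ version that some libraries default to.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length lists, normalised by 1/n."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centre each variable: M = X - x_bar, N = Y - y_bar.
    m = [x - x_bar for x in xs]
    n_vec = [y - y_bar for y in ys]
    # M^T N is the dot product of the two centred vectors.
    return sum(a * b for a, b in zip(m, n_vec)) / n


# Hypothetical data: ys is exactly twice xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_xy = sample_covariance(xs, ys)  # 2.5
```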

### Sample correlation

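The sample correlation divides the sample covariance by the product of the two sample standard deviations:

$$\rho_{XY}=\dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$$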

### Covariance matrix

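If we have $$k$$ variables $$X_1,\dots,X_k$$, each observed $$n$$ times, we can form a $$k\times k$$ matrix $$\Sigma$$ where:

$$\Sigma_{ij} = \sigma_{ij}=\dfrac{1}{n}(X_i-\bar x_i)^T(X_j-\bar x_j)$$

The diagonal entries are the variances of the individual variables, and the matrix is symmetric because $$\sigma_{ij}=\sigma_{ji}$$.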
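A minimal sketch of building this matrix entry by entry, assuming the data arrive as a list of columns (one list per variable, all of the same length). The name `covariance_matrix` and the input layout are assumptions made for this example.

```python
def covariance_matrix(columns):
    """Covariance matrix (1/n normalisation) for a list of equal-length columns."""
    n_obs = len(columns[0])
    if n_obs == 0 or any(len(col) != n_obs for col in columns):
        raise ValueError("all columns must be non-empty and of equal length")
    # Centre every variable by subtracting its own mean.
    means = [sum(col) / n_obs for col in columns]
    centred = [[v - mu for v in col] for col, mu in zip(columns, means)]
    # Entry (i, j) is the dot product of centred columns i and j, divided by n.
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n_obs for cj in centred]
        for ci in centred
    ]


# Hypothetical data with two variables and four observations each.
sigma = covariance_matrix([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]])
```

The diagonal of `sigma` holds the two variances and the off-diagonal entries hold the covariance between the two variables.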

### Centred covariance

If $$\bar x = \bar y = 0$$ then:

$$\sigma_{XY}=\dfrac{1}{n}X^TY$$

### Correlation matrix

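Here each entry is the correlation rather than the covariance: entry $$(i,j)$$ is $$\rho_{ij}=\dfrac{\sigma_{ij}}{\sigma_i\sigma_j}$$, so every diagonal entry is $$1$$.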

## Correlation coefficients

### Pearson correlation coefficient

The Pearson correlation coefficient is defined as the covariance normalised by the product of the individual standard deviations.

It ranges from $$-1$$ (total negative linear correlation), through $$0$$ (no linear correlation), to $$1$$ (total positive linear correlation).

$$\rho_{X,Y}=\dfrac{cov(X,Y)}{\sigma_X\sigma_Y}$$
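The definition can be sketched in a few lines of Python. The function name `pearson` and the sample data are illustrative only. The normalisation constant cancels between numerator and denominator, so the result is the same whether $$\dfrac{1}{n}$$ or $$\dfrac{1}{n-1}$$ is used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: an exact increasing and an exact decreasing linear relation.
r_up = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

Up to floating-point rounding, `r_up` is $$1$$ and `r_down` is $$-1$$, the two extremes of the range.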

### Spearman rank correlation

For each of the two variables, we rank its observations from smallest to largest.

From $$X$$ and $$Y$$ we then have the rank vectors $$R_X$$ and $$R_Y$$.

We then calculate the Pearson correlation coefficient between the rankings.

$$r_S=\dfrac{cov(R_X, R_Y)}{\sigma_{R_X}\sigma_{R_Y}}$$
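The two steps (rank, then correlate) can be sketched as below. The helper names `average_ranks`, `pearson` and `spearman` are made up for this example. Giving tied values the average of the ranks they occupy is a common convention, but it is an assumption here, since the text above does not say how ties are handled.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: ys = xs**3 is monotonic but not linear in xs.
r_s = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 8.0, 27.0, 64.0])
```

Because the relationship is monotonic, the rankings agree exactly and `r_s` is $$1$$ up to rounding, even though the relationship is not linear.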

### Kendall rank correlation

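Consider every pair of observations $$(x_i, y_i)$$ and $$(x_j, y_j)$$ with $$i<j$$. The pair is concordant if $$x_i-x_j$$ and $$y_i-y_j$$ have the same sign, and discordant if they have opposite signs.

$$\tau = \dfrac{n_{concordant}-n_{discordant}}{\binom{n}{2}}$$

Here $$\binom{n}{2}=\dfrac{n(n-1)}{2}$$ is the total number of pairs.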
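A direct, quadratic-time sketch of this formula follows. The function name `kendall_tau` is illustrative. Pairs that are tied in either variable are counted as neither concordant nor discordant, which is an assumption of this sketch and not something stated above.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall rank correlation by counting concordant and discordant pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product says whether x and y move the same way.
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # s == 0 means a tie in x or y; the pair is skipped.
    n_pairs = n * (n - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: one swapped pair out of six gives (5 - 1) / 6.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```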

## Updating statistics

### Updating the covariance

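If the data are centred, the covariance of the first $$n$$ observations is:

$$\sigma_{XY}^n=\dfrac{1}{n}X_n^TY_n=\dfrac{1}{n}\sum_{i=1}^n x_iy_i$$

So when a new observation $$(x_{n+1}, y_{n+1})$$ arrives, and the data are still treated as centred:

$$\sigma_{XY}^{n+1}=\dfrac{n\sigma^n_{XY}+x_{n+1}y_{n+1}}{n+1}$$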
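The update rule can be sketched as a one-line function. The name `update_centred_covariance` and the example values are hypothetical. The sketch assumes, as in the formula, that both variables are treated as having mean zero before and after the new observation, so no mean is re-estimated.

```python
def update_centred_covariance(cov_n, n, x_new, y_new):
    """Covariance of n + 1 centred observations, given the covariance of the first n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # n * cov_n recovers the running sum of products x_i * y_i.
    return (n * cov_n + x_new * y_new) / (n + 1)


# Hypothetical centred data: two observations, then a third one arrives.
xs = [-1.0, 1.0]
ys = [-2.0, 2.0]
cov_2 = sum(x * y for x, y in zip(xs, ys)) / len(xs)
cov_3 = update_centred_covariance(cov_2, len(xs), 3.0, 1.0)
```

The updated value matches what the batch formula gives on all three observations, without revisiting the earlier data.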

## Visualising multiple continuous variables

### Q-Q plots

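Plot the quantiles of one variable against the corresponding quantiles of the other. If the two variables follow the same distribution, the points lie close to the line $$y=x$$.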
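As a sketch of the data behind such a plot, the snippet below computes matching quantiles for two samples using `statistics.quantiles` from the Python standard library. The function name `qq_points`, the number of quantiles and the `"inclusive"` interpolation method are choices made for this example. The resulting pairs can then be passed to any scatter-plot routine.

```python
from statistics import quantiles


def qq_points(xs, ys, n_quantiles=10):
    """Pairs of matching quantiles of xs and ys, ready to be scatter-plotted."""
    # quantiles() returns n_quantiles - 1 cut points for each sample.
    qx = quantiles(xs, n=n_quantiles, method="inclusive")
    qy = quantiles(ys, n=n_quantiles, method="inclusive")
    return list(zip(qx, qy))


# Hypothetical data: ys is a shifted and rescaled copy of xs.
xs = [float(v) for v in range(1, 21)]
ys = [2.0 * v + 1.0 for v in xs]
points = qq_points(xs, ys, n_quantiles=4)  # the three quartile pairs
```

The two samples do not need to be the same size, since the quantiles are computed separately. In this example the points fall on a straight line that is not $$y=x$$, which indicates the same distributional shape with a different location and scale.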