Summary statistics for multiple variables

Statistics for two variables

Sample covariance

We previously defined the population covariance as \(\sigma_{XY}=E[(X-\mu_X)^T(Y-\mu_Y)]\).

We define the sample covariance of \(n\) paired observations \((x_i, y_i)\) as \(\sigma_{XY}=\dfrac{1}{n}\sum_i(x_i-\bar x)(y_i-\bar y)\), where \(\bar x\) and \(\bar y\) are the sample means.

We can calculate this using matrices:

\(M=X-\bar x\)

\(N=Y-\bar y\)

\(\sigma_{XY}=\dfrac{1}{n}M^TN\).
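The matrix form above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `sample_covariance` and the example data are assumptions made for this sketch, and it uses the \(\dfrac{1}{n}\) normalisation from the definition above rather than \(\dfrac{1}{n-1}\).

```python
def sample_covariance(xs, ys):
    """Sample covariance with 1/n normalisation, as in the definition above."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m and k play the role of the centred vectors M and N.
    m = [x - mean_x for x in xs]
    k = [y - mean_y for y in ys]
    # M^T N is the dot product of the two centred vectors.
    return sum(a * b for a, b in zip(m, k)) / n


# Hypothetical example data: ys is exactly twice xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_xy = sample_covariance(xs, ys)  # 2.5
```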

Sample correlation

\(\rho_{XY}=\dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}\)

Covariance matrix

If we have \(k\) variables \(X_1,\dots,X_k\), each observed \(n\) times, we can form a \(k\times k\) matrix \(\Sigma\) where:

\(\Sigma_{ij} = \sigma_{ij}=\dfrac{1}{n}(X_i-\bar x_i)^T(X_j-\bar x_j)\)
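As a sketch of how each entry of \(\Sigma\) could be computed, the snippet below centres every variable and takes all pairwise dot products. The function name `covariance_matrix`, the column-list layout and the example data are assumptions for illustration. In the code the number of variables is called `k` to keep it apart from the number of observations `n` that appears in the \(\dfrac{1}{n}\) factor.

```python
def covariance_matrix(columns):
    """Covariance matrix of several variables, one list of observations per variable."""
    k = len(columns)        # number of variables
    n = len(columns[0])     # number of observations per variable
    if any(len(col) != n for col in columns):
        raise ValueError("all variables need the same number of observations")
    means = [sum(col) / n for col in columns]
    centred = [[v - m for v in col] for col, m in zip(columns, means)]
    # Entry (i, j) is (1/n) times the dot product of centred variables i and j.
    return [
        [sum(a * b for a, b in zip(centred[i], centred[j])) / n for j in range(k)]
        for i in range(k)
    ]


# Hypothetical example: three variables, four observations each.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
sigma = covariance_matrix(columns)
```

The diagonal of the result holds the variances, and the matrix is symmetric because the dot product does not depend on the order of its arguments.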

Centred covariance

If \(\bar x = \bar y = 0\) then:

\(\sigma_{XY}=\dfrac{1}{n}X^TY\)

Correlation matrix

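Here each entry is the correlation rather than the covariance: \(R_{ij}=\rho_{ij}=\dfrac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}\), so every diagonal entry is \(1\).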

Correlation coefficients

Pearson correlation coefficient

The Pearson correlation coefficient is defined as the covariance normalised by the product of the individual standard deviations.

It lies between \(-1\) and \(1\): \(-1\) is total negative linear correlation, \(0\) is no linear correlation and \(1\) is total positive linear correlation.

\(\rho_{X,Y}=\dfrac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y}\)
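A minimal sketch of the Pearson coefficient, assuming the \(\dfrac{1}{n}\) normalisation used for the sample covariance earlier (the factor cancels in the ratio, so \(\dfrac{1}{n-1}\) would give the same result). The function name `pearson` and the data are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sd_x * sd_y)


# Hypothetical data: one perfectly increasing and one perfectly decreasing linear relation.
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

With exact linear relations like these, `r_pos` and `r_neg` come out at the two ends of the range, up to floating-point rounding.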

Spearman rank correlation

For each of the two variables we replace every observation with its rank within that variable.

From \(X\) and \(Y\) we then have \(R_X\) and \(R_Y\).

We then calculate the Pearson correlation coefficient between the rankings.

\(r_S=\dfrac{cov(R_X, R_Y)}{\sigma_{R_X}\sigma_{R_Y}}\)
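The two steps above (rank each variable, then take the Pearson coefficient of the ranks) can be sketched as follows. The helper names `average_ranks`, `pearson` and `spearman` are assumptions for this sketch, and giving tied values the average of their ranks is one common convention that the text above does not specify.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation with 1/n normalisation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rankings."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: ys = xs cubed, a monotone but non-linear relation.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]
r_s = spearman(xs, ys)
tied = average_ranks([10, 20, 20, 30])  # [1.0, 2.5, 2.5, 4.0]
```

Because only the ordering matters, the cubic relation in the example gives the same rankings for both variables, which is where Spearman differs from Pearson on the raw values.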

Kendall rank correlation

Concordant and discordant pairs

\(\tau = \dfrac{n_{concordant}-n_{discordant}}{\begin{pmatrix}n\\2\end{pmatrix}}\)
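As an illustration of the formula, a pair of observations \((x_i,y_i)\), \((x_j,y_j)\) is concordant when \(x_i-x_j\) and \(y_i-y_j\) have the same sign and discordant when the signs differ. The sketch below counts both kinds over all \(\begin{pmatrix}n\\2\end{pmatrix}\) pairs. The function name `kendall_tau` and the data are assumptions, and counting tied pairs as neither concordant nor discordant is a choice made here, not something the text above states.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall rank correlation: (concordant - discordant) / (n choose 2)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; the pair is counted as neither.
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: one swapped pair out of six.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])  # (5 - 1) / 6
```

This direct count is quadratic in \(n\), which is fine for small samples.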

General correlation coefficient

Updating statistics

Updating the covariance

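If the data is centred (both means are zero), the covariance after \(n\) observations is:

\(\sigma_{XY}^n=\dfrac{1}{n}X_n^TY_n\)

Appending one observation \((x_{n+1}, y_{n+1})\) adds a single term to the sum, \(X_{n+1}^TY_{n+1}=X_n^TY_n+x_{n+1}y_{n+1}=n\sigma^n_{XY}+x_{n+1}y_{n+1}\), so:

\(\sigma_{XY}^{n+1}=\dfrac{n\sigma^n_{XY}+x_{n+1}y_{n+1}}{n+1}\)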
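The update rule can be checked against the batch formula with a short sketch. The function name `update_centred_covariance` and the data are illustrative assumptions. The code also assumes, as the text does, that both variables are known to have zero mean, so the means are not re-estimated as observations arrive.

```python
def update_centred_covariance(cov_n, n, x_new, y_new):
    """Centred covariance after n + 1 observations, given the value after n."""
    return (n * cov_n + x_new * y_new) / (n + 1)


# Hypothetical stream of paired observations, treated as already centred.
xs = [1.0, -2.0, 0.5]
ys = [2.0, -1.0, -0.5]

running = 0.0
for n, (x, y) in enumerate(zip(xs, ys)):
    # n is the number of observations already absorbed into `running`.
    running = update_centred_covariance(running, n, x, y)

# Batch formula (1/n) X^T Y for comparison.
batch = sum(x * y for x, y in zip(xs, ys)) / len(xs)
```

Each update costs constant time and only needs the previous covariance and the count, so the full history of observations does not have to be stored.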

Sparklines