# Distance metrics and outliers

## Measuring distance between vectors

### $$L_p$$ norms

$$L_p$$ norms can be used to measure the distance between two vectors.

If we have data points $$v$$ and $$w$$ the distance is:

$$||v-w||_p=\left(\sum_i |v_i-w_i|^p\right)^{\frac{1}{p}}$$

If $$p=2$$ we have the Euclidean norm. If $$p=1$$ we have the Manhattan norm.
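As a minimal sketch, both norms can be computed directly from the formula above (NumPy is an assumption here; the notes don't name a library):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 0.0, 3.0])

# Manhattan (p=1): sum of absolute differences
l1 = np.sum(np.abs(v - w))  # |1-4| + |2-0| + |3-3| = 5
# Euclidean (p=2): square root of the sum of squared differences
l2 = np.sqrt(np.sum((v - w) ** 2))  # sqrt(9 + 4 + 0) = sqrt(13)
```

`np.linalg.norm(v - w, ord=p)` gives the same results for any $$p$$.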

### Dot product

Given two vectors $$a$$ and $$b$$ we can calculate the cosine similarity:

$$\dfrac{a\cdot b}{||a||\,||b||}$$

If the two vectors point in the same direction, this is $$1$$. If they are orthogonal, this is $$0$$. If they point in opposite directions, this is $$-1$$.
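A minimal sketch of this calculation, assuming NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalised by the lengths of the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

For example, `cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))` returns `0.0` for the orthogonal case.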

### Kernels

This is a generalisation of the dot product, where we want to find the similarity between two vectors.

If we have data points $$v$$ and $$w$$ the similarity is:

$$K(v, w)$$
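One common choice of $$K$$ (an illustrative example, not prescribed by these notes) is the Gaussian (RBF) kernel, where similarity decays with squared Euclidean distance:

```python
import numpy as np

def rbf_kernel(v, w, gamma=1.0):
    # Gaussian (RBF) kernel: 1 when v == w, approaching 0 as the
    # squared Euclidean distance between v and w grows
    return np.exp(-gamma * np.sum((v - w) ** 2))
```

The linear kernel $$K(v, w) = v \cdot w$$ recovers the plain dot product as a special case.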

### Mahalanobis distance

We have a point and want to know how far it is from the mean of a distribution.

For a single dimension, we can count the number of standard deviations.

We could do this for each dimension separately, but there may be correlations between variables. If two variables are highly correlated, a point that is far along both is not really twice as far.

We use the covariance matrix $$S$$ of the distribution:

$$D_M(\mathbf x)=\sqrt{(\mathbf x-\boldsymbol \mu)^TS^{-1}(\mathbf x-\boldsymbol \mu)}$$
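A minimal sketch of the formula, assuming NumPy, where `S` is the covariance matrix of the data:

```python
import numpy as np

def mahalanobis(x, mu, S):
    # Distance of x from the mean mu, scaled by the inverse
    # covariance matrix S so correlated directions are not overcounted
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))
```

With `S` equal to the identity matrix, this reduces to the ordinary Euclidean distance from the mean.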

## Measuring distance between matrices

### Frobenius norm

If we have matrices $$A$$ and $$B$$ the distance is:

$$||A-B||=\sqrt {\sum_i \sum_j |a_{ij}-b_{ij}|^2}$$

This is the Euclidean norm applied to the entries of the matrix.
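A minimal sketch, assuming NumPy:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 4.0]])

# Frobenius norm of the difference: Euclidean norm over all entries
frob = np.sqrt(np.sum((A - B) ** 2))  # sqrt(0 + 4 + 9 + 0) = sqrt(13)
```

`np.linalg.norm(A - B, 'fro')` computes the same quantity directly.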

## Measuring distance between time series

### Dynamic time warping

We may want to examine the similarity between two sequences.

We want to match a sample from one sequence to a sample from the other sequence.

Simply matching at the same time point is naive, as sequences may move at different speeds or have offsets. Dynamic time warping instead finds the alignment between the two sequences that minimises the total matched distance.
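The standard dynamic-programming formulation of DTW can be sketched as follows (plain Python with NumPy, using absolute difference as the per-sample cost; both choices are assumptions):

```python
import numpy as np

def dtw_distance(s, t):
    # D[i, j] = minimum total cost of aligning s[:i] with t[:j]
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # A sample may advance in both sequences, or only in one,
            # which lets one sample match several in the other sequence
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because one sample may match several in the other sequence, `dtw_distance([1, 2, 2, 3], [1, 2, 3])` is `0`: the repeated `2` aligns with the single `2` at no extra cost, whereas naive pointwise matching would fail on the length mismatch.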

## Finding neighbours

### Finding neighbours

Say we have a distance function and a sample. How can we identify the $$k$$-nearest neighbours?

We can compute the distance from the sample to every point, sort these distances, and take the $$k$$ smallest.
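A minimal sketch of this brute-force approach, assuming NumPy and Euclidean distance:

```python
import numpy as np

def k_nearest(points, query, k):
    # Euclidean distance from the query to every point (one row each),
    # then the indices of the k smallest distances
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]
```

This costs $$O(n)$$ distance computations per query; spatial index structures such as k-d trees can avoid scanning every point.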