# Distance metrics and outliers

## Measuring distance between vectors

### \(L_p\) norms

\(L_p\) norms can be used to measure the distance between two vectors.

If we have data points \(v\) and \(w\) the distance is:

\(||v-w||_p=\left(\sum_i |v_i-w_i|^p\right)^{\frac{1}{p}}\)

If \(p=2\) we have the Euclidean norm. If \(p=1\) we have the Manhattan norm.
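A minimal sketch in NumPy (the function name is my own):

```python
import numpy as np

def lp_distance(v, w, p):
    """L_p distance between vectors v and w."""
    return np.sum(np.abs(np.asarray(v, float) - np.asarray(w, float)) ** p) ** (1 / p)

v = [1.0, 2.0, 3.0]
w = [4.0, 6.0, 3.0]
print(lp_distance(v, w, 2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(lp_distance(v, w, 1))  # Manhattan: 3 + 4 = 7.0
```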

### Dot product

Given two vectors \(a\) and \(b\) we can calculate the cosine similarity:

\(\dfrac{a\cdot b}{||a||\,||b||}\)

If the two vectors point in the same direction, this is \(1\). If they are orthogonal, this is \(0\). If they point in opposite directions, this is \(-1\).
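This can be sketched directly in NumPy (the helper name is my own):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [2, 0]))   # same direction: 1.0
print(cosine_similarity([1, 0], [0, 3]))   # orthogonal: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite: -1.0
```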

### Kernels

A kernel is a generalisation of the dot product: a function that measures the similarity between two vectors.

If we have data points \(v\) and \(w\), the similarity is:

\(K(v, w)\)
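One common choice (not named in the notes, so this is an illustrative assumption) is the Gaussian/RBF kernel, \(K(v, w)=\exp(-\gamma\,||v-w||^2)\); the `gamma` parameter is a bandwidth I have introduced for the sketch:

```python
import numpy as np

def rbf_kernel(v, w, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||v - w||^2)."""
    diff = np.asarray(v, float) - np.asarray(w, float)
    return np.exp(-gamma * (diff @ diff))

print(rbf_kernel([1, 2], [1, 2]))  # identical points: similarity 1.0
print(rbf_kernel([0, 0], [1, 0]))  # similarity decays with distance
```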

### Mahalanobis distance

We have a point. How far is it from the mean of the data?

For a single dimension: the number of standard deviations.

What about multidimensional data?

We could count standard deviations along each dimension, but this ignores correlations between variables: if two variables are highly correlated, a point far out in both is not really twice as far.

We use this:

\(D_M(\mathbf x)=\sqrt{(\mathbf x-\boldsymbol\mu)^T S^{-1}(\mathbf x-\boldsymbol\mu)}\)

where \(\boldsymbol\mu\) is the mean and \(S\) is the covariance matrix.
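A minimal sketch in NumPy, taking \(S\) to be the covariance matrix of the data (the function name is my own):

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a distribution
    with the given mean and covariance matrix."""
    diff = np.asarray(x, float) - np.asarray(mean, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# With identity covariance this reduces to Euclidean distance from the mean.
print(mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2)))  # 5.0
```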

## Measuring distance between matrices

### Frobenius norm

If we have matrices \(A\) and \(B\) the distance is:

\(||A-B||=\sqrt {\sum_i \sum_j |a_{ij}-b_{ij}|^2}\)

This is a Euclidean norm applied to the matrix entries.
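A minimal sketch in NumPy (the function name is my own; NumPy's `np.linalg.norm` with `ord='fro'` computes the same quantity):

```python
import numpy as np

def frobenius_distance(A, B):
    """Frobenius distance: Euclidean norm over all entries of A - B."""
    diff = np.asarray(A, float) - np.asarray(B, float)
    return float(np.sqrt(np.sum(diff ** 2)))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 4.0]])
print(frobenius_distance(A, B))  # sqrt(2^2 + 3^2)
```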

## Measuring distance between time series

### Dynamic time warping

We may want to examine the similarity between two sequences.

We want to match a sample from one sequence to a sample from the other sequence.

Simply matching samples at the same time point is naive, as the sequences may move at different speeds or be offset in time. Dynamic time warping instead finds an alignment between the two sequences that minimises the total distance between matched samples.

## Finding neighbours

### Nearest Neighbour Search (NNS)

Say we have a distance function and a sample. How can we identify the \(k\)-nearest neighbours?

We can compute the distance from the sample to every point, sort the distances, and take the \(k\) points with the smallest distances.
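This brute-force approach can be sketched in NumPy, using Euclidean distance (the function name is my own):

```python
import numpy as np

def k_nearest(points, query, k):
    """Brute-force k-nearest-neighbour search.
    Returns the indices of the k points closest to the query."""
    points = np.asarray(points, float)
    dists = np.linalg.norm(points - np.asarray(query, float), axis=1)
    return np.argsort(dists)[:k]  # sort all distances, keep the smallest k

points = [[0, 0], [1, 1], [5, 5], [0.5, 0.5]]
print(k_nearest(points, [0, 0], 2))  # indices 0 and 3: the two closest points
```

This is \(O(n \log n)\) per query because of the sort; structures such as k-d trees can do better for repeated queries.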

### k-Nearest Neighbour Search