Distance metrics and outliers

Measuring distance between vectors

\(L_p\) norms

\(L_p\) norms can be used to measure the distance between two vectors.

If we have data points \(v\) and \(w\) the distance is:

\(||v-w||_p=\left(\sum_i |v_i-w_i|^p\right)^{1/p}\)

If \(p=2\) we have the Euclidean norm. If \(p=1\) we have the Manhattan norm.
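The formula above can be sketched directly in NumPy. The helper name `lp_distance` is hypothetical, chosen for this example:

```python
import numpy as np

def lp_distance(v, w, p):
    """L_p distance: (sum_i |v_i - w_i|^p)^(1/p). Hypothetical helper."""
    return float(np.sum(np.abs(np.asarray(v) - np.asarray(w)) ** p) ** (1 / p))

v = np.array([0.0, 0.0])
w = np.array([3.0, 4.0])
print(lp_distance(v, w, 2))  # Euclidean: 5.0
print(lp_distance(v, w, 1))  # Manhattan: 7.0
```

For \(p=2\) this agrees with `np.linalg.norm(v - w)`; the `ord` parameter of `np.linalg.norm` selects other \(p\) values.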

Dot product

Given two vectors \(a\) and \(b\), we can calculate the cosine similarity:

\(\dfrac{a\cdot b}{||a||\,||b||}\)

If the two vectors point in the same direction, this is \(1\). If they are orthogonal, this is \(0\). If they point in opposite directions, this is \(-1\).
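A minimal sketch of this calculation; the function name `cosine_similarity` is an assumption for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: a.b / (||a|| ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([2.0, 0.0])))   # same direction: 1.0
print(cosine_similarity(a, np.array([0.0, 1.0])))   # orthogonal: 0.0
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # opposite: -1.0
```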

Kernels

A kernel is a generalisation of the dot product: a function that measures the similarity between two vectors.

If we have data points \(v\) and \(w\) the similarity is:

\(K(v, w)\)
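One common choice of kernel (not specified in the notes, used here as an example) is the Gaussian (RBF) kernel \(K(v,w)=\exp(-\gamma\,||v-w||^2)\); the name `rbf_kernel` and the `gamma` parameter are assumptions for this sketch:

```python
import numpy as np

def rbf_kernel(v, w, gamma=1.0):
    """Gaussian (RBF) kernel: similarity in (0, 1], equal to 1 when v == w."""
    d = np.asarray(v) - np.asarray(w)
    return float(np.exp(-gamma * np.dot(d, d)))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points: 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # similarity decays with distance
```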

Mahalanobis distance

We have a point. How far is it from the mean?

For a single dimension: the number of standard deviations.

What about multidimensional data?

We could count standard deviations along each dimension, but this ignores correlations between variables. If two variables are highly correlated, being far away in both is not really twice as far.

We use this:

\(D_M(\mathbf x)=\sqrt{(\mathbf x-\boldsymbol\mu)^T S^{-1}(\mathbf x-\boldsymbol\mu)}\)
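A sketch of this formula on toy correlated data (the dataset and the helper name `mahalanobis` are invented for illustration). It shows why the covariance term matters: a point lying along the correlation direction is less anomalous than one equally far away in Euclidean terms but off the trend:

```python
import numpy as np

# Toy data: two highly correlated variables
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]])
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance S^{-1}

def mahalanobis(x, mu, S_inv):
    """D_M(x) = sqrt((x - mu)^T S^{-1} (x - mu))."""
    d = x - mu
    return float(np.sqrt(d @ S_inv @ d))

# Both points are equally far from mu in Euclidean distance...
on_trend = mahalanobis(np.array([4.0, 4.0]), mu, S_inv)
off_trend = mahalanobis(np.array([4.0, 0.0]), mu, S_inv)
print(on_trend < off_trend)  # True: the off-trend point is far more anomalous
```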

Measuring distance between matrices

Frobenius norm

If we have matrices \(A\) and \(B\) the distance is:

\(||A-B||=\sqrt {\sum_i \sum_j |a_{ij}-b_{ij}|^2}\)

This is the Euclidean norm applied entrywise to the matrix.

Measuring distance between time series

Dynamic time warping

We may want to examine the similarity between two sequences.

We want to match a sample from one sequence to a sample from the other sequence.

Simply matching samples at the same time point is naive, as the sequences may progress at different speeds, or be offset from each other.
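Dynamic time warping handles this by allowing one sample to match several consecutive samples in the other sequence, choosing the alignment that minimises total cost. A minimal sketch of the classic dynamic-programming formulation (the function name and the absolute-difference cost are choices made for this example):

```python
import numpy as np

def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic-programming DTW distance."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three possible alignments
            D[i, j] = cost + min(D[i - 1, j],      # s[i] repeats
                                 D[i, j - 1],      # t[j] repeats
                                 D[i - 1, j - 1])  # one-to-one match
    return float(D[n, m])

# Same shape at half the speed: DTW cost is 0, pointwise matching would not be
print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3]))  # 0.0
```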

Finding neighbours

Nearest Neighbour Search (NNS)

Say we have a distance function and a sample. How can we identify the \(k\)-nearest neighbours?

We can compute the distance to every point, sort the distances, and take the closest \(k\) observations.
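This brute-force approach can be sketched as follows, using Euclidean distance (the function name `k_nearest` and the sample points are invented for the example):

```python
import numpy as np

def k_nearest(points, query, k):
    """Brute-force k-NN: compute all distances, sort, keep the k smallest."""
    dists = np.linalg.norm(points - query, axis=1)  # distance to every point
    idx = np.argsort(dists)[:k]                     # indices of the k closest
    return idx, dists[idx]

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.5]])
idx, d = k_nearest(points, np.array([0.0, 0.0]), k=2)
print(idx)  # [0 3]: the query itself, then (0.5, 0.5)
```

Sorting all \(n\) distances costs \(O(n \log n)\); a partial sort such as `np.argpartition` reduces this when only the top \(k\) are needed.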