Clustering

Clustering

Evaluating clusterings

Davies-Bouldin index

The Davies-Bouldin index is a method for evaluating clustering algorithms, such as k-means.

It examines the distance between centroids, and the tightness of centroids.

k-means clustering

Introduction

K-means clustering is the most widely used unsupervised model.

In k-means clustering we identify \(k\) centroids in the feature space. We then calculate the distance from each data point to each of the centroids, and allocate the data point to the nearest centroid.

This requires a method for calculating the location of the centroids.

Identifying the centroids

We apply an iterative approach to identifying the centroids.

We first initialise by assigning centroids randomly to existing data points.

We then iteratively perform the following:

  • Calculate the distances between each data point and each centroid.

  • Assign each data point to the closed centroid.

  • Update each centroid location to the mean of the data points allocated to it.

Calculating distances

This method requires us to calculate the distance between two points in the feature space.

For k-means we use the Euclidian distance.

Potential issues

It is possibile for a centroid to have no data assigned to it. If this happens we can eliminate the cluster, or reassign some data points.

The algorithm may only arrive at a local minima. In order to maximise the chance of an effective clustering, we can do k-means under different initialisations of the centroids in order to minimise risk of bad local optima.

Choosing \(k\)

If the points in each cluster follow a normal distribution, that’s a good sign. This can be tested with Anderson-Darling.

If it’s not normal, we can split the cluster into 2.

Using clusting as part of data analysis

We can choose \(k\) if output is being used in later data analysis (eg type assignment, complaint level or something)

k-medoids

Introduction

k-mediods is similar to k-means clustering, with two key differences:

  • Centroids are now always located on data points, rather than floating freely.

  • We mimimise \(l_1\) distance, rathern than \(l_2\).

Partitioning Around Medoids (PAM) algorithm

This is the most common approach for k-medoids.

We initialise randomly, as we do for k-means.

We then iterate the following:

  • Calculate the loss for the current allocation

  • For each medoid, see if swapping allocation with another (non-medoid) data point decreases the cost.

  • If it does, make the swap.