# K-means and k-medoids clustering

## Clustering

### Evaluating clusterings

#### Davies-Bouldin index

The Davies-Bouldin index is a method for evaluating clustering algorithms, such as k-means.

It compares the tightness of each cluster (how far its points sit from their centroid) with the separation between centroids; lower values indicate a better clustering.
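A minimal NumPy sketch of the index (the function name and interface are my own, not from a library):

```python
import numpy as np

def davies_bouldin(X, labels, centroids):
    """Davies-Bouldin index: lower is better.

    For each pair of clusters (i, j) we compute (S_i + S_j) / d(c_i, c_j),
    where S_i is the mean distance of cluster i's points to its centroid
    (tightness) and d(c_i, c_j) is the distance between the two centroids
    (separation). The index is the mean, over clusters, of each cluster's
    worst (largest) such ratio.
    """
    k = len(centroids)
    # Tightness: mean distance of each cluster's points to its centroid
    S = np.array([
        np.mean(np.linalg.norm(X[labels == i] - centroids[i], axis=1))
        for i in range(k)
    ])
    worst = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                d = np.linalg.norm(centroids[i] - centroids[j])
                worst[i] = max(worst[i], (S[i] + S[j]) / d)
    return worst.mean()
```

Tight, well-separated clusters give a small index; overlapping or diffuse clusters push it up.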

### k-means clustering

#### Introduction

K-means clustering is one of the most widely used unsupervised learning methods.

In k-means clustering we identify $$k$$ centroids in the feature space. We then calculate the distance from each data point to each of the centroids, and allocate the data point to the nearest centroid.

This requires a method for calculating the location of the centroids.

#### Identifying the centroids

We apply an iterative approach to identifying the centroids.

We first initialise by assigning centroids randomly to existing data points.

We then iteratively perform the following:

• Calculate the distances between each data point and each centroid.

• Assign each data point to the closest centroid.

• Update each centroid location to the mean of the data points allocated to it.
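The loop above can be sketched in NumPy (the `kmeans` function and its parameters are illustrative, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distances from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old location if a centroid has no points)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if (labels == i).any() else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels
```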

#### Calculating distances

This method requires us to calculate the distance between two points in the feature space.

For k-means we use the Euclidean ($$l_2$$) distance.
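All point-to-centroid distances can be computed at once with NumPy broadcasting (a small sketch; the sample points are made up):

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0]])
centroids = np.array([[0.0, 0.0], [6.0, 8.0]])

# (n, 1, d) - (1, k, d) broadcasts to (n, k, d); take the norm over coordinates
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
# dists[i, j] is the Euclidean distance from point i to centroid j
```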

#### Potential issues

It is possible for a centroid to have no data points assigned to it. If this happens we can eliminate the cluster, or reassign some data points to it.

The algorithm may only arrive at a local minimum. To maximise the chance of an effective clustering, we can run k-means several times with different random initialisations of the centroids and keep the run with the lowest total within-cluster distance, minimising the risk of a bad local optimum.
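A sketch of the restart strategy, keeping the run with the lowest inertia (total within-cluster squared distance); the function names are my own:

```python
import numpy as np

def kmeans_once(X, k, rng, n_iter=100):
    # One k-means run from a single random initialisation
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            X[labels == i].mean(axis=0) if (labels == i).any() else centroids[i]
            for i in range(k)
        ])
    # Inertia: the k-means objective, sum of squared distances to centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    inertia = (np.linalg.norm(X - centroids[labels], axis=1) ** 2).sum()
    return centroids, labels, inertia

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    # Run several times and keep the clustering with the lowest inertia
    return min((kmeans_once(X, k, rng) for _ in range(n_restarts)),
               key=lambda run: run[2])
```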

#### Choosing $$k$$

One heuristic: if the points in each cluster approximately follow a normal distribution, that's a sign the cluster is a good fit. This can be tested with the Anderson-Darling test.

If a cluster's points are not normally distributed, we can split that cluster into two (increasing $$k$$).
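A sketch of the normality check using SciPy's Anderson-Darling test. For simplicity this tests a 1-D sample; for multivariate clusters the test is usually applied to a 1-D projection of the cluster's points. The helper function is my own:

```python
import numpy as np
from scipy.stats import anderson

def rejects_normality(sample, level=5.0):
    """True if the Anderson-Darling test rejects normality at `level`%."""
    result = anderson(sample, dist='norm')
    # Pick the critical value matching the requested significance level
    crit = result.critical_values[list(result.significance_level).index(level)]
    return result.statistic > crit

rng = np.random.default_rng(0)
# A cluster that really is one Gaussian blob...
gaussian = rng.normal(loc=0.0, scale=1.0, size=500)
# ...and a "cluster" that is actually two merged blobs, so should be split
bimodal = np.concatenate([rng.normal(-4, 1, 250), rng.normal(4, 1, 250)])
```

Running `rejects_normality` on each sample: the bimodal "cluster" fails the test, flagging it as a candidate for splitting into two.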

#### Using clustering as part of data analysis

We can also choose $$k$$ based on how the output will be used in later data analysis (e.g. if each cluster must map to a predefined type or complaint level).

### k-medoids

#### Introduction

k-medoids is similar to k-means clustering, with two key differences:

• Centroids are now always located on data points, rather than floating freely.

• We minimise the $$l_1$$ distance, rather than the $$l_2$$ distance.

#### Partitioning Around Medoids (PAM) algorithm

This is the most common approach for k-medoids.

We initialise randomly, as we do for k-means.

We then iterate the following:

• Calculate the cost of the current allocation.

• For each medoid, check whether swapping it with another (non-medoid) data point decreases the cost.

• If it does, make the swap.
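A minimal sketch of the swap loop in NumPy, using the $$l_1$$ distance as the cost (the `pam` function and its interface are my own):

```python
import numpy as np

def pam(X, k, seed=0, max_iter=100):
    """Partitioning Around Medoids, minimising total l1 (Manhattan) distance."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute pairwise l1 distances between all data points
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    # Medoids are indices of data points, so they always sit on the data
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(ms):
        # Each point contributes its distance to the nearest medoid
        return D[:, ms].min(axis=1).sum()

    current = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate
                trial_cost = cost(trial)
                if trial_cost < current:  # make the swap only if it helps
                    medoids, current = trial, trial_cost
                    improved = True
        if not improved:
            break  # no single swap lowers the cost: we are done
    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels, current
```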