Clustering
Section outline
We use predictive modelling (supervised learning) when our goal is to predict the value of a dependent variable (also called target, outcome, class, label) based on independent variables (features, attributes, variables, properties). In a different setup, we do not have any particular target but only objects with known properties (essentially independent variables), and our goal is to find some general patterns. Such methods belong to the field of unsupervised modelling.
One such task is finding separate groups in the data, if those exist. We call such groups clusters, and the procedure is called clustering.
We learned about two approaches. In hierarchical clustering, every data point starts as its own cluster, and we iteratively merge the closest pair of clusters until only a single cluster remains. The process results in a hierarchy of clusters. The hierarchy is presented with a dendrogram, a tree-like visualization in which the lengths of branches correspond to distances between clusters. The dendrogram also offers an insight into the number of clusters in the data: we usually set a cut-off at a point where the distances begin to grow more rapidly. One of the decisions we have to make is the definition of the distance between two clusters of points: we may consider the closest pair (single linkage), the farthest pair (complete linkage), the average distance (average linkage) or a measure based on intra-cluster variance (Ward linkage). We will usually use the latter, unless there are principled arguments for one of the others.
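As an illustration, here is a minimal sketch in Python using SciPy (an assumed tool choice; the data is synthetic and made up for the example). It builds a Ward-linkage hierarchy, draws the dendrogram, and then cuts the hierarchy into two clusters.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# two made-up groups of points in 2D
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# build the hierarchy; also try "single", "complete" or "average"
Z = linkage(data, method="ward")
dendrogram(Z)
plt.ylabel("distance between merged clusters")
plt.show()

# cut the hierarchy into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```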
The other approach we considered is k-means clustering. We first set the number of clusters ($k$). The method puts $k$ centroids at random positions and assigns each data instance to the closest centroid, thus forming $k$ clusters. Next, it moves each centroid to the center of its cluster. Then it reassigns the data instances according to the new positions of the centroids, then it again moves the centroids... and so on, until the assignments no longer change.
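To make the loop concrete, below is a minimal, illustrative implementation of exactly these steps (in practice we would use a library implementation such as scikit-learn's KMeans; the synthetic data is made up for the example).

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start with k centroids placed at randomly chosen data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each instance to the closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the center of its cluster
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so assignments will not change
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, centroids = kmeans(data, k=2)
print(centroids)
```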
The two methods - hierarchical clustering and k-means clustering - differ in several important aspects.
Hierarchical clustering starts from a matrix of distances that do not necessarily correspond to any observable coordinates; they may represent, say, subjectively determined dissimilarities between people. K-means works with coordinates and computes (Euclidean) distances between them.
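To illustrate the first point, here is a minimal sketch of hierarchical clustering run directly on a matrix of dissimilarities rather than on coordinates, again with SciPy as an assumed tool choice; the numbers are made up.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# made-up, subjectively determined dissimilarities between four people
# (symmetric matrix with zeros on the diagonal)
D = np.array([[0.0, 0.3, 0.8, 0.9],
              [0.3, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])

# linkage expects the condensed (upper-triangular) form of the matrix
Z = linkage(squareform(D), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # two clusters of two people each
```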
Hierarchical clustering provides a nice visualization, which can be used to inspect the quality of clustering and determine the number of clusters. K-means does none of this.
Hierarchical clustering consumes a lot of memory and may be slow. K-means is often faster, although it may get stuck in suboptimal positions. In practice, we restart it multiple times and take the best result.
Hierarchical clustering results in a hierarchy, and we decide the number of clusters after the procedure. In k-means we set it in advance. In practice, we try different numbers of clusters and take the one with the highest silhouette score - another topic of the lecture.
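A minimal sketch of this procedure with scikit-learn (an assumed tool choice; the data and the range of $k$ are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (30, 2)),
                  rng.normal(6, 1, (30, 2)),
                  rng.normal((0, 6), 1, (30, 2))])

for k in range(2, 6):
    # n_init restarts k-means several times and keeps the best run
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, silhouette_score(data, labels))
# we prefer the k with the highest silhouette score
```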
The chapter below describes all of this in more detail.
Obligatory reading: sections 8.2 (you may skip 8.2.6), 8.3 (skip 8.3.3), The Silhouette Coefficient (pg. 541). Everything else is also quite easy to read, so we recommend it.