You can click on any of the method names to preview the Jupyter Notebooks online:
Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Each "link" measures the distance between clusters differently, and at every step the two closest (least distant) clusters are merged.
There are many variants of hierarchical clustering; here we explore three. The key difference between them is how the distance between two clusters S1 and S2 is measured.
Single-Link: inter-cluster distance is measured by the shortest link (the distance between the closest pair of points, one from cluster S1 and one from S2)
Complete-Link: inter-cluster distance is measured by the longest link (the distance between the farthest pair of points, one from cluster S1 and one from S2)
Mean-Link: inter-cluster distance is measured by the mean link (the distance between the centers of mass of clusters S1 and S2)
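The three linkage measures above can be sketched in a few lines of NumPy; the function names here are illustrative, not from the notebooks:

```python
import numpy as np

def single_link(S1, S2):
    # shortest pairwise distance between a point in S1 and a point in S2
    d = np.linalg.norm(S1[:, None] - S2[None], axis=2)
    return d.min()

def complete_link(S1, S2):
    # longest pairwise distance between a point in S1 and a point in S2
    d = np.linalg.norm(S1[:, None] - S2[None], axis=2)
    return d.max()

def mean_link(S1, S2):
    # distance between the two clusters' centers of mass
    return np.linalg.norm(S1.mean(axis=0) - S2.mean(axis=0))
```

For example, with S1 = {(0,0), (0,1)} and S2 = {(3,0), (4,0)}, the single link is 3 (from (0,0) to (3,0)) while the complete link is √17 (from (0,1) to (4,0)).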
Gonzalez-Algorithm: A greedy approximation algorithm for finding k clusters that minimize the maximum cluster diameter. It repeatedly picks the next cluster center as the point farthest from all existing cluster centers. Chokes on outliers, since an outlier is by definition far from everything and tends to get picked as a center.
Centroid-based: Finds k cluster centers and assigns each object to the nearest cluster center, such that the squared distances from the centers are minimized. Chokes on bad initial centroid placement.
Lloyd's Algorithm (K-means) consists of two important steps:
- Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance; intuitively, this is the "nearest" mean. (This partitions the points into a Voronoi diagram, using the means as the "post offices".)
- Update step: Calculate the new means to be the centroids of the observations in the new clusters.
The algorithm finishes when the new means stop changing with new iterations.
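The two steps above can be sketched as a short NumPy loop; the function name and the convergence check are illustrative, not from the notebooks:

```python
import numpy as np

def lloyd(points, centers, max_iters=100):
    """Lloyd's algorithm: alternate assignment and update steps
    until the means stop changing."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]                    # keep empty clusters in place
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):  # means stopped changing
            break
        centers = new_centers
    return centers, labels
```

With two well-separated blobs and reasonable initial centers, the loop converges in a couple of iterations to the blob means.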
K-means++ is meant to overcome K-means' initialization pitfalls: the first centroid is a uniformly random point, and each subsequent centroid is picked from a weighted probability distribution where a point p is chosen with probability proportional to D(p)^2, the distance from p to its nearest already-established centroid. After k centers have been chosen, the normal K-means algorithm proceeds as usual.
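The D(p)^2 seeding can be sketched as follows, again with illustrative names; the resulting centers would then be fed to the Lloyd's iteration described above:

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: first center uniform at random; each later
    center is drawn with probability proportional to D(p)^2, the squared
    distance from p to its nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # D(p)^2 for every point, against the nearest chosen center
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers],
                    axis=0)
        probs = d2 / d2.sum()                    # weighted distribution
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```

Because far-away points get quadratically more probability mass, the initial centers tend to spread out across the data instead of landing inside the same cluster.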