Discover the fascinating world of clustering, and learn how k-means clustering can assess the dominant colors of an image.
This blog post is related to the movie barcode project, in which the visually perceived dominant color of every frame of a movie is merged onto a canvas, encapsulating the visual color essence of a film.
The following introductory part is based on the Stanford CS229 lecture notes by Andrew Ng and Tengyu Ma.
The k-means clustering problem is an unsupervised learning problem. We are given a training set \(\{x^{(1)},...,x^{(n)}\}\), where \(x^{(i)} \in \mathbb{R}^{d}\), that we want to group into cohesive clusters, but no labels \(y^{(i)}\) are provided.
The k-means clustering algorithm is defined as follows:
Randomly initialize the cluster centroids \(\mu_1,\mu_2,...,\mu_k \in \mathbb{R}^{d}\)
Repeat until convergence:
i) For every i, set
\[c^{(i)} := \underset{\substack{j}}{\text{arg min}}\, || x^{(i)} - \mu_j ||^2.\]
ii) For every j, set
\[\mu_j := \frac{\sum_{i=1}^n 1\{c^{(i)}=j\}x^{(i)}}{\sum_{i=1}^n 1\{c^{(i)}=j\}}.\]
In this algorithm \(k\) is a parameter and denotes the number of clusters we want to find; the cluster centroids \(\mu_j\) denote our current guesses for the positions of the centers of the clusters. To set the initial cluster centroids (as described in step 1 of the algorithm above), one approach is to randomly select \(k\) training examples and then assign the cluster centroids the same values as these \(k\) selected examples. It’s worth noting that there are alternative methods for initializing cluster centroids as well. Check this paper out if you’re interested in cluster center initialization algorithms, or this one for cluster initialization approaches optimized for big data.
The inner loop of the algorithm iteratively performs two key steps:
i) Assigning each training example \(x^{(i)}\) to its closest cluster centroid \(\mu_j\).
ii) Updating each cluster centroid \(\mu_j\) to be the mean of the points assigned to it.
The algorithm repeats these two key steps until convergence. But will the k-means algorithm always converge?
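The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under my own naming and initialization choices (random training examples as starting centroids, as described earlier), not an optimized implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X has shape (n, d); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids as k randomly chosen training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step i): assign every x^(i) to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Step ii): move each centroid to the mean of its assigned points
        # (a centroid keeps its position if no points were assigned to it).
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, assignments
```

The pairwise-distance computation broadcasts \(X\) against the centroids, so no explicit Python loop over data points is needed.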
Using the vectorized form, which sums up the squared distances for all data points in a more compact way, the distortion function can be defined as:
\[J(c,\mu) = \sum_{i=1}^{n} ||x^{(i)} - \mu_{c^{(i)}}||^2\]
\(J\) is a measure of how well the data points are clustered around their respective cluster centroids. For each training example \(x^{(i)}\), \(J\) adds up the squared distance between \(x^{(i)}\) and the cluster centroid \(\mu_{c^{(i)}}\) to which it has been assigned. This means that \(J\) quantifies how spread out the data is, i.e. how far each data point lies from its assigned centroid.
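The distortion can be computed in a fully vectorized way with NumPy; the function name `distortion` below is my own:

```python
import numpy as np

def distortion(X, centroids, assignments):
    # J(c, mu): sum of squared distances from each point x^(i)
    # to its assigned centroid mu_{c^(i)}.
    return np.sum((X - centroids[assignments]) ** 2)
```

Indexing with `centroids[assignments]` expands each point's assigned centroid into an (n, d) array, so the subtraction pairs every point with its own centroid.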
It can be proven that k-means is exactly “coordinate descent” on \(J\). Coordinate descent is an optimization technique where you iteratively minimize a function with respect to one variable while holding the others fixed. In the context of k-means: The inner-loop of k-means refers to the iterative steps of the algorithm. The algorithm repeats these steps until it converges. In each iteration of the inner-loop, the algorithm alternates between two tasks:
Minimizing \(J\) with respect to the cluster assignments (\(c\)) while keeping the cluster centroids (\(\mu\)) fixed.
Minimizing \(J\) with respect to the cluster centroids (\(\mu\)) while keeping the cluster assignments (\(c\)) fixed.
Because k-means is “coordinate descent”, \(J\) must monotonically decrease during the iterative process: each of the two steps can only lower (or preserve) the current value of \(J\). Since \(J\) is also bounded below by zero, its value must eventually converge to a stable value. The convergence of \(J\) implies that both the cluster assignments (\(c\)) and the cluster centroids (\(\mu\)) will also converge. As \(J\) decreases and converges, the algorithm refines the cluster assignments and centroids, moving them towards a better clustering solution.
Be aware, though, that \(J\) is a non-convex function, which means it is susceptible to local optima. More often than not, \(k\)-means will work fine and this should be no reason to worry. But if you are concerned that it may not come up with a very good clustering, there is always the possibility to run \(k\)-means many times (with varying random initial values for the cluster centroids \(\mu_j\)). The best clustering can then be determined by the lowest \(J(c,\mu)\) across all runs.
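Assuming scikit-learn is available, the multiple-restart strategy might look like the sketch below (the synthetic data is a stand-in); note that scikit-learn's `KMeans` already does this internally via its `n_init` parameter:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-cluster data standing in for real observations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Run k-means with several different random initializations and keep
# the run with the lowest distortion J (scikit-learn calls it inertia_).
runs = [KMeans(n_clusters=2, n_init=1, random_state=s).fit(X) for s in range(10)]
best = min(runs, key=lambda km: km.inertia_)
```

In practice, simply passing `n_init=10` to a single `KMeans` call achieves the same effect more concisely.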
Our subject study for this blog post will consist of this image:
Digital images consist of pixels arranged in a grid, with each pixel containing data about color and brightness. In color images, pixels use red, green, and blue (RGB) channels, each with values from 0 to 255, determining the pixel’s color. Grayscale images rely on a single brightness value per pixel, ranging from 0 (black) to 255 (white), producing various shades of gray.
Now let’s visualize this image in 3D. First we read the image and resize it. Then we split the image into red, green, and blue channels. After that we extract pixel data from each channel and plot the result:
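A sketch of those steps is shown below; the filename `image.jpg`, the target size, and the function name are placeholders, and PIL plus matplotlib are assumed to be installed:

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def plot_rgb_cloud(path, size=(64, 64)):
    """Read an image, downscale it, and scatter its pixels in RGB space."""
    img = Image.open(path).convert("RGB").resize(size)
    # Flatten the (H, W, 3) grid into one row of (R, G, B) values per pixel.
    pixels = np.asarray(img).reshape(-1, 3)
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # Color every point by its own RGB value so the cloud mirrors the image.
    ax.scatter(pixels[:, 0], pixels[:, 1], pixels[:, 2], c=pixels / 255.0, s=2)
    ax.set_xlabel("Red")
    ax.set_ylabel("Green")
    ax.set_zlabel("Blue")
    return fig, pixels

# fig, pixels = plot_rgb_cloud("image.jpg")  # then plt.show() or fig.savefig(...)
```

Downscaling first keeps the scatter plot responsive, since a full-resolution image can easily contain millions of pixels.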
Remember, in k-means clustering \(k\) is a parameter that denotes the number of well-defined clusters we want to find. There are various methods to help you determine the optimal value of \(k\), such as the “elbow method,” silhouette analysis, or even domain-specific knowledge. Each method has its strengths and limitations, and the choice of \(k\) often involves a balance between the desired granularity of clustering and the interpretability of the results.
The Gap statistic is one of the standard methods to determine the appropriate number of clusters in a data set. This method can be used for the output of the k-means clustering algorithm, where it compares the change in within-cluster dispersion \(W_k\) standardized as \(\log(W_k)\) with that expected under an appropriate reference null distribution.
Although this method outperforms others, it is sometimes not able to suggest the correct number of clusters, for example on data derived from exponential distributions. Mohajer et al. suggest using \(W_k\) instead of \(\log(W_k)\) and comparing the expectations of \(W_k\) under a null reference distribution.
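A simplified sketch of the Gap statistic is given below, assuming scikit-learn is available; it uses a uniform reference distribution over the bounding box of the data, which is the simplest of the reference choices discussed in the original paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=5, seed=0):
    """Gap(k): expected log(W_k) under a uniform reference distribution
    over the bounding box of X, minus log(W_k) on X itself."""
    rng = np.random.default_rng(seed)
    # Within-cluster dispersion W_k on the actual data (KMeans' inertia_).
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # W_k on several uniform reference data sets of the same shape.
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wk = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    return float(np.mean(ref_log_wk) - log_wk)
```

The \(k\) that maximizes the gap (or, per the original paper, the smallest \(k\) within one standard error of that maximum) is chosen as the number of clusters.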
The method goes back to speculations by Robert L. Thorndike in 1953.
The elbow method is a popular technique used to determine the optimal number of clusters \(k\). It helps you find a suitable value for \(k\) by analyzing the inertia as a function of the number of clusters.
In the process of determining the ideal number of clusters for your data, you start by running the algorithm with different cluster counts. Following this, you calculate the sum of squared distances, known as inertia, for each cluster count. These inertia values are then plotted against the corresponding cluster counts. Your goal is to pinpoint the ‘elbow point’ on the plot, which signifies the juncture where the rate of inertia decrease begins to slow down. This ‘elbow point’ serves as a crucial indicator, representing the most suitable number of clusters.
It’s important to note that while the elbow method is a useful heuristic, it might not always yield a clear elbow point, especially for complex datasets or when clusters have irregular shapes and densities like we have in this example.
To calculate the elbow point, a package called kneedle was used. In this example, I have marked the optimal number of clusters using the KneeLocator method with parameters curve=’convex’ and direction=’decreasing’ from this package.
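As an illustration with synthetic three-blob data, the sketch below computes the inertia curve and locates the elbow. Instead of the kneed package's `KneeLocator`, it uses a simplified stand-in for the same convex/decreasing case: the elbow is the point farthest below the straight line joining the curve's endpoints.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated blobs, so the true k is 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 5, 10)])

ks = np.arange(1, 10)
inertias = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
])

# Simplified stand-in for KneeLocator(curve='convex', direction='decreasing'):
# pick the k where the curve drops farthest below the endpoint-to-endpoint line.
line = np.interp(ks, [ks[0], ks[-1]], [inertias[0], inertias[-1]])
elbow = ks[np.argmax(line - inertias)]
```

For real use, the kneed package's `KneeLocator` is more robust, since it normalizes and smooths the curve before detecting the knee.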
Logically, since several of the clusters lie in close proximity to one another, the elbow method may suggest merging them: placing a single centroid between neighboring clusters still results in short distances from the data points to it.
Consequently, a need arises for a more reliable approach to determine the optimal number of clusters for our clustering task. This is where the Silhouette score comes into play.
The Silhouette Score was proposed in 1987 by the Belgian statistician Peter Rousseeuw.
The Silhouette score can generally help to determine the best number of clusters, although it’s not its primary purpose. The Silhouette score is primarily used to assess the quality of existing clusters, but it can be used in a heuristic manner to guide your choice of the number of clusters. It provides a measure by gauging how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
The silhouette coefficient ranges from -1 to 1, where: A coefficient close to +1 indicates that the object is well-matched to its own cluster and poorly-matched to neighboring clusters, implying a good separation. A coefficient around 0 indicates that the object is on or very close to the decision boundary between two neighboring clusters. A coefficient close to -1 indicates that the object is probably assigned to the wrong cluster.
In our example, setting \(k\) to 3 yields a silhouette coefficient of 0.58, while increasing \(k\) to 6 substantially improves the silhouette coefficient to 0.76.
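Assuming scikit-learn is available, scoring a range of candidate values of \(k\) by silhouette might look like this sketch (again on hypothetical three-blob data rather than the image):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs, so k = 3 should score best.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 4, 8)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all samples, in [-1, 1].
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```

Note that the silhouette score needs at least two clusters, so the loop starts at \(k = 2\).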
Combining the two methods for determining the optimal k, we can plot the silhouette score and inertia against the number of clusters. Here, we observe that 6 is a suitable choice for k according to both the elbow method and the silhouette score. We can confidently rule out 3 as a viable option for k.
Now that we have the appropriate k for our k-means clustering algorithm, we can proceed to cluster the data and plot it, assigning each point a color corresponding to its cluster. The resulting plot clearly displays the 6 clusters, each distinguished by its respective cluster centroid color.
After the successful clustering of our image data into 6 distinct groups, each representing one of the 6 most dominant colors, we can also pinpoint the primary color, the one that captures our visual perception.
This selection is achieved by calculating the frequency of data points assigned to each cluster using the np.bincount function. The cluster index with the highest count is designated as the most dominant cluster, and the color at the centroid of this cluster represents the primary, most dominant color in the image.
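This final step might look as follows; the pixel array is a hypothetical stand-in for the flattened image, with 70 reddish and 30 bluish pixels so that red should come out as the dominant color:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (n, 3) RGB pixel data standing in for the flattened image.
pixels = np.vstack([
    np.tile([200.0, 30.0, 30.0], (70, 1)),   # frequent reddish pixels
    np.tile([30.0, 30.0, 200.0], (30, 1)),   # less frequent bluish pixels
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
counts = np.bincount(km.labels_)                  # number of pixels per cluster
dominant = km.cluster_centers_[counts.argmax()]   # centroid of the largest cluster
```

Since `km.labels_` contains small non-negative integer cluster indices, `np.bincount` returns the per-cluster point counts directly.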