Data Intelligence, Business Analytics
Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.
Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?

Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.
Methodology:
This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).
Example:
1 2 3 4 5 6 7 8 9 ==> number of clusters
40 65 80 85 88 90 91 91 ==> percentage of variance explained by clusters
25 15 5 3 2 1 0 ==> 1st difference
-10 -10 -2 -1 -1 -1 ==> 2nd difference
0 8 1 0 0 ==> 3rd difference
The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.
Note:
If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.
Comment
Comment by Anand Singh on October 25, 2012 at 8:34am Another idea, the plot of 1/RSQ_ratio and RSQ itself can be used to select the optimal number of clusters because RSQ_ratio = Between cluster variance/within cluster variance( RSQ/1-RSQ) and will always increase with the increase in the number of clusters . Inverse of this would be a decreasing curve and RSQ will be an increasing curve. The intersection point of these two curves would give us the optimal number of clusters.
Hi there,
If you are interested about validity index for clustering, I have written 3 posts on my blog about it:
http://www.dataminingblog.com/cluster-validity-introduction-to-clus...
http://www.dataminingblog.com/cluster-validity-clustering-algorithms/
http://www.dataminingblog.com/cluster-validity-existing-indices/
Hope this helps!
Comment by Vincent Granville on February 29, 2012 at 9:28pm See also the gap statistics to determine the number of clusters: http://www.stanford.edu/~hastie/Papers/gap.pdf
Comment by Vincent Granville on October 4, 2011 at 9:07pm
Comment by Cristian Mesiano on October 4, 2011 at 10:54am Hi Vincent,
I believe that your approach has a missing preamble:
It optimizes the number of the cluster when the clustering method is maximizing the variance among the clusters.
In different scenarios it doesn't work.
Consider the following scenario where we have to clusters:
If you are using for example K-means as clustering algorithm, your method will fail for every number of cluster you try to use! Because the problem is not related to the number of the clusters, but to the function that the clustering is trying to maximize:
using 2 clusters:
using 4 clusters:
and using 8 clusters:
As you can see doesn't exist the right number of clusters, for this problem using the "naive" kmeans.
BTW I've seen for kmeans and density based clustering algo, methods based on EM (expectation and maximizazion) and Bayesian information criterion (BIC) that are a little bit more robust than this method.
...But as I like say: be practical: if your method solve your problem, then your method is good enough :).
just a question:
you plotted the points in red/blu: does it mean that are you looking to find a cluster containing only red points and another points containing only blu points?
Could you share the table of the points...just to play a little bit with them :)
Cheers
cristian
The solution to finding the number of clusters is as follows:
This is the solution to the min-angle rule proposed by Amy. SAS should implement it.
Comment by Vincent Granville on October 3, 2011 at 10:36am @Sandeep: I think 3rd or 4th derivative is usually not necessary, except in a few rare cases where elbow is barely apparent (and thus clusters not well defined).
I believe that Amy's solution, based on a discrete version of curvature, is even better. And curvature only involves 1st derivative, and the angle (or its sinus) discussed by Amy is also very easy to compute. What do you think?
Comment by Sandeep Sunkara on October 3, 2011 at 9:32am There is Cubic Clustering Criterion (CCC) available in PROC CLUSTER which helps in deciding the number of clusters.
@Vincent: I am little confused in going up to 4th derivative, since the plot is, when you smooth it, a second degree curve, which will have the second differences constant and the third differences vanish. As per my understanding "We can see the second differences being constant after certain point" and that point would be the no.of clusters".
Please correct me if wrong :)
Comment by Amy on October 3, 2011 at 9:08am
Comment by Rick Wicklin on October 3, 2011 at 7:40am This is reminiscent of a "scree plot," which can be used to estimate the number of principal components. I like graphical methods like this.
There has been a lot of work done on methods for estimating the number of clusters. The "percentage of variance explained by clusters" is the R-square value reported by SAS in the PROC Cluster output. PROC cluster also provides three methods for determining the number of clusters, one of which is the "Pseudo t-squared criterion," which is similar to yours in that it looks at "elbows" in the graph of a quantity. Like you, I've found that this criterion works well on a number of examples. If you want to compare your method, there are several examples in the PROC CLUSTER documentation that you can experiment with.
© 2013 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge