How do you pick "k" when running k-means clustering?
Posted on 2014-02-05
The more I Google, the more I get the sense that picking "k", the number of clusters for k-means clustering, is more of an art than a precise science. Even Wikipedia throws out many options with no clear winner: see "Determining the number of clusters in a data set" (Wikipedia).
I'd love to hear from my Big Data colleagues across the firm how they pick "k", the number of clusters, when running this very popular (and common) unsupervised machine learning algorithm. I've been more of a supervised learning classification type of guy up until now, so I'm hoping to benefit from your hard-earned best practices as I delve deeper into clustering.
Note: I'm using the "kmeans" tool in Mahout for Hadoop on a corpus of text documents transformed into sparse TF-IDF vectors. However, I suppose the technique to select a reasonable starting "k" should really be independent of the technology one uses to run the k-means clustering.