• Status: Solved

How do you pick "k" when running k-means clustering?

The more I Google, the more I get the sense that choosing the number of clusters "k" for k-means clustering is more of an art than a precise science. Even Wikipedia lists many options with no clear winner: Determining the number of clusters in a data set - Wikipedia, the free encyclopedia

 

I'd love to hear from my Big Data colleagues across the firm how they pick the number of clusters "k" when running this very popular (and common) unsupervised machine learning algorithm. I've been more of a supervised-learning classification type of guy up until now, so I'm hoping to benefit from your hard-earned best practices as I delve deeper into clustering.

 

Note: I'm using the "kmeans" tool in Mahout for Hadoop on a corpus of text documents transformed into sparse TF-IDF vectors. However, I suppose the technique for selecting a reasonable starting "k" should really be independent of the technology one uses to run the k-means clustering.
Asked by: AlHal2
1 Solution
 
TommySzalapski commented:
In the absence of any other experts, I'll throw in my 1.5 cents.

Yes, how you choose k should be independent of the tool.

How you pick k is really a combination of trial and error and what k means to your application.
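
One common way to structure the trial-and-error part is the "elbow" heuristic: run k-means for a range of k values and watch where the within-cluster sum of squares (WCSS) stops dropping sharply. Below is a minimal pure-Python sketch of that idea, not Mahout's implementation. The helper names (`kmeans`, `farthest_first`, `make_blobs`), the synthetic 2-D data, and the farthest-first seeding are all assumptions made only to keep the example small and repeatable.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))

def farthest_first(points, k, rng):
    """Illustrative seeding: start from a random point, then repeatedly
    add the point that is farthest from the centroids chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, wcss)."""
    centroids = farthest_first(points, k, random.Random(seed))
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [mean(m) if m else centroids[i]
                   for i, m in enumerate(clusters)]
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wcss

def make_blobs(centers, n_per_blob=30, spread=0.5, seed=1):
    """Synthetic test data: Gaussian blobs around the given 2-D centers."""
    rng = random.Random(seed)
    return [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
            for cx, cy in centers for _ in range(n_per_blob)]

# Three well-separated blobs, so the "right" k is known to be 3 here.
points = make_blobs([(0, 0), (10, 0), (5, 9)])
curve = {k: kmeans(points, k)[1] for k in range(1, 7)}
for k, wcss in curve.items():
    print(f"k={k}  WCSS={wcss:.1f}")
```

On data like this the WCSS falls steeply up to k=3 and only slightly afterwards; that bend is the elbow. On real TF-IDF vectors the bend is usually much less obvious, which is part of why this stays a judgment call.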

It also depends on what kind of performance you need. The higher k is, the longer the algorithm takes to run, because every iteration compares each point against each of the k centroids. In my research in sensor networks, we usually pick much lower values for k than you might use because the devices are more constrained.
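
To put rough numbers on the cost side, here is a back-of-the-envelope sketch. It assumes the standard Lloyd's iteration, where each point is compared with each centroid, and it ignores the fact that the number of iterations can itself change with k. The function name and the sizes plugged in are made up for illustration.

```python
def assignment_cost(n_points, k, n_dims, iterations):
    """Approximate coordinate operations spent in the assignment steps:
    each iteration compares every point with every one of the k centroids,
    and each comparison touches n_dims coordinates."""
    return n_points * k * n_dims * iterations

# Hypothetical sizes: 100,000 documents, 50 non-zero terms each, 10 iterations.
for k in (5, 20, 80):
    print(f"k={k:3d}  ~{assignment_cost(100_000, k, 50, 10):,} operations")
```

Under this model the work grows linearly with k, so going from k=5 to k=20 costs about four times as much per run.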

You just have to look at what you have and what you need and make a decision. Sometimes a higher k will give better results; other times it muddies things.
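
One way to check whether a higher k is helping or muddying things is the silhouette coefficient, which compares how close each point is to its own cluster versus the nearest other cluster (values near 1 are good, values near 0 mean overlapping clusters). The sketch below is a small pure-Python version using Euclidean distance. The toy points and the two hand-written labelings are assumptions chosen to show the effect, not output from any real clustering run.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient; expects at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance from p to the other members of its own cluster
        # (p's distance to itself is 0, hence the division by len - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance from p to the members of the nearest other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
labels_k2 = [0, 0, 0, 0, 1, 1, 1, 1]   # one cluster per group
labels_k4 = [0, 0, 1, 1, 2, 2, 3, 3]   # each group needlessly split in two

score_k2 = silhouette(points, labels_k2)
score_k4 = silhouette(points, labels_k4)
print(f"k=2 silhouette: {score_k2:.2f}")
print(f"k=4 silhouette: {score_k4:.2f}")
```

Here the k=4 labeling scores noticeably lower than the k=2 one, because the extra clusters cut through groups that belong together.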

You really just need to play around and see what you get.
 
AlHal2 (author) commented:
Thanks.