Solved

How do you pick "k" when running k-means clustering?

Posted on 2014-02-05
2
416 Views
Last Modified: 2016-03-23
The more I Google, the more I get the sense that picking the number of "k" clusters to run k-means clustering on is more of an art than a precise science.  Even Wikipedia throws out many options with no clear winner: Determining the number of clusters in a data set - Wikipedia, the free encyclopedia

 

I'd love to hear from my Big Data colleagues across the firm how they pick the number "k" clusters when running this very popular (and common) unsupervised machine learning algorithm.  I've been more of supervised learning classification type of guy up until now, so I'm hoping to benefit from your hard earned best practices as I delve deeper into clustering.

 

Note: I'm using the "kmeans" tool in Mahout for Hadoop on a corpus of text documents transformed into sparse TF-IDF vectors.  However, I suppose the  technique to select a reasonable starting "k" should really be independent of the technology one uses to run the k-means clustering.
0
Comment
Question by:AlHal2
2 Comments
 
LVL 37

Accepted Solution

by:
TommySzalapski earned 250 total points
ID: 39837095
In the absence of any other experts, I'll throw in my 1.5 cents.

Yes, how you choose k should be independent of the tool.

How you pick k is really a combination of trial and error and what k means to your application.

It also depends on what kind of performance you need. The higher k is, the longer it will take to run the algorithm. In my research in sensor networks, we usually pick much lower values for k than you might use because the devices are more constrained.

You just have to look at what you have and what you need and make a decision. Sometimes a higher k will give better results; other times it muddies things.

You really just need to play around and see what you get.
0
 

Author Closing Comment

by:AlHal2
ID: 39845959
thanks.
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Although it can be difficult to imagine, someday your child will have a career of his or her own. He or she will likely start a family, buy a home and start having their own children. So, while being a kid is still extremely important, it’s also …
This article will inform Clients about common and important expectations from the freelancers (Experts) who are looking at your Gig.
In this fourth video of the Xpdf series, we discuss and demonstrate the PDFinfo utility, which retrieves the contents of a PDF's Info Dictionary, as well as some other information, including the page count. We show how to isolate the page count in a…
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…

829 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question