Solved

How do you pick "k" when running k-means clustering?

Posted on 2014-02-05
2
403 Views
Last Modified: 2016-03-23
The more I Google, the more I get the sense that picking the number of "k" clusters to run k-means clustering on is more of an art than a precise science.  Even Wikipedia throws out many options with no clear winner: Determining the number of clusters in a data set - Wikipedia, the free encyclopedia

 

I'd love to hear from my Big Data colleagues across the firm how they pick the number "k" clusters when running this very popular (and common) unsupervised machine learning algorithm.  I've been more of supervised learning classification type of guy up until now, so I'm hoping to benefit from your hard earned best practices as I delve deeper into clustering.

 

Note: I'm using the "kmeans" tool in Mahout for Hadoop on a corpus of text documents transformed into sparse TF-IDF vectors.  However, I suppose the  technique to select a reasonable starting "k" should really be independent of the technology one uses to run the k-means clustering.
0
Comment
Question by:AlHal2
2 Comments
 
LVL 37

Accepted Solution

by:
TommySzalapski earned 250 total points
ID: 39837095
In the absence of any other experts, I'll throw in my 1.5 cents.

Yes, how you choose k should be independent of the tool.

How you pick k is really a combination of trial and error and what k means to your application.

It also depends on what kind of performance you need. The higher k is, the longer it will take to run the algorithm. In my research in sensor networks, we usually pick much lower values for k than you might use because the devices are more constrained.

You just have to look at what you have and what you need and make a decision. Sometimes a higher k will give better results; other times it muddies things.

You really just need to play around and see what you get.
0
 

Author Closing Comment

by:AlHal2
ID: 39845959
thanks.
0

Featured Post

Threat Intelligence Starter Resources

Integrating threat intelligence can be challenging, and not all companies are ready. These resources can help you build awareness and prepare for defense.

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
How to split this in C++ 4 77
Which School? 2 32
listing all functions in JavaScript 19 105
Windows Service to Receive TCP Packets 4 39
If you’re thinking to yourself “That description sounds a lot like two people doing the work that one could accomplish,” you’re not alone.
In this post we will learn how to connect and configure Android Device (Smartphone etc.) with Android Studio. After that we will run a simple Hello World Program.
An introduction to basic programming syntax in Java by creating a simple program. Viewers can follow the tutorial as they create their first class in Java. Definitions and explanations about each element are given to help prepare viewers for future …
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now