Solved

How do you pick "k" when running k-means clustering?

Posted on 2014-02-05
Last Modified: 2016-03-23
The more I Google, the more I get the sense that picking the number of clusters "k" for k-means clustering is more of an art than a precise science.  Even Wikipedia lists many options with no clear winner; see its article "Determining the number of clusters in a data set".

 

I'd love to hear from my Big Data colleagues across the firm how they pick the number of clusters "k" when running this very popular (and common) unsupervised machine learning algorithm.  I've been more of a supervised-learning classification type of guy up until now, so I'm hoping to benefit from your hard-earned best practices as I delve deeper into clustering.

 

Note: I'm using the "kmeans" tool in Mahout for Hadoop on a corpus of text documents transformed into sparse TF-IDF vectors.  However, I suppose the technique for selecting a reasonable starting "k" should really be independent of the technology one uses to run the k-means clustering.
Question by:AlHal2
2 Comments
 

Accepted Solution

by: TommySzalapski (earned 1000 total points)
ID: 39837095
In the absence of any other experts, I'll throw in my 1.5 cents.

Yes, how you choose k should be independent of the tool.

How you pick k is really a combination of trial and error and what k means to your application.

It also depends on what kind of performance you need. The higher k is, the longer it will take to run the algorithm. In my research in sensor networks, we usually pick much lower values for k than you might use because the devices are more constrained.

You just have to look at what you have and what you need and make a decision. Sometimes a higher k will give better results; other times it muddies things.

You really just need to play around and see what you get.
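To make the "play around and see what you get" advice a bit more concrete, here is a minimal sketch of one common way to do the trial and error: run k-means for several values of k and watch how the within-cluster sum of squares (often called inertia) drops. Everything in it is an assumption for illustration: the toy 2-D data with three blobs, the helper names (make_blobs, kmeans_inertia, largest_drop_k), the farthest-first initialization, and the "largest relative drop" rule, which is just one simple heuristic and not a standard answer. It is plain Python, not Mahout, and real TF-IDF data will rarely show such a clean elbow.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def make_blobs(rng, centers=((0, 0), (10, 10), (20, 0)), n=30, sigma=1.0):
    """Toy data (assumption): n Gaussian points around each center."""
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n)]

def farthest_first_init(points, k, rng):
    """Pick one random point, then repeatedly add the point farthest
    from all centers chosen so far (simple, but outlier-sensitive)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, rng, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def largest_drop_k(inertias):
    """Heuristic (assumption): the k whose inertia fell the most,
    relative to k - 1."""
    ks = sorted(inertias)
    return max(ks[1:], key=lambda k: inertias[k - 1] / inertias[k])

rng = random.Random(0)
points = make_blobs(rng)
inertias = {k: kmeans_inertia(points, k, rng) for k in range(1, 7)}

for k in sorted(inertias):
    print(f"k={k}  inertia={inertias[k]:.1f}")
print("largest relative drop at k =", largest_drop_k(inertias))
```

Inertia almost always keeps shrinking as k grows, so the raw number alone will never tell you to stop. What you look for is the point where adding another cluster stops buying much, and then you check whether the clusters at that k actually make sense for your application.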
 

Author Closing Comment

by:AlHal2
ID: 39845959
Thanks.

Question has a verified solution.