dadadude asked:

# Python: Kmeans, Within-class and Inter-class Variances

Hello,

Is there a library in Python that allows me to compute the inter- and intra-cluster variance?

I use k-means from SciPy but it doesn't help me.

Thank you


ASKER

What? Sorry, I didn't understand what you typed at all.
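In case it helps, both quantities can be computed directly from the cluster assignments, with no dedicated library. A minimal pure-Python sketch (the function names and the tiny example data are just illustrative; "inter" here is the between-cluster scatter, weighted by cluster size):

```python
def intra_cluster_variance(clusters, centers):
    # Sum of squared distances from every point to its own cluster center.
    return sum(sum((p - c) ** 2 for p, c in zip(point, center))
               for cluster, center in zip(clusters, centers)
               for point in cluster)

def inter_cluster_variance(clusters, centers):
    # Size-weighted sum of squared distances from each cluster center
    # to the overall mean of all the data points.
    points = [p for cluster in clusters for p in cluster]
    overall = tuple(sum(dim) / len(points) for dim in zip(*points))
    return sum(len(cluster) * sum((c - o) ** 2 for c, o in zip(center, overall))
               for cluster, center in zip(clusters, centers))

# Illustrative 2-D data: two well-separated clusters.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
centers = [(1.0, 0.0), (11.0, 0.0)]
print(intra_cluster_variance(clusters, centers))  # -> 4.0
print(inter_cluster_variance(clusters, centers))  # -> 100.0
```

You can feed it the assignments you get back from SciPy's k-means; the math is the same regardless of which library produced the clusters.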

ASKER CERTIFIED SOLUTION


ASKER

thank you for your help!! it worked perfectly!!

ASKER

good

<html>

<head>

<title>K-means clustering using Lloyd's algorithm</title>

<link rel='stylesheet' type='text/css' href='http://toolbox.lucifero.us/kmeans/main.css'>

</head>

<body>

<div id='title_bar'><h2>K-means</h2></div>

<div id='main_block'>

<div style='width: 200px;' class='caption'><img src='http://toolbox.lucifero.us/kmeans/v_ic.png' />

<div class='caption_text'>

<strong>intra-cluster variance</strong><br/>

C<sub>i</sub> is the i'th cluster, p is a data point, and m<sub>i</sub> is the center of the i'th cluster

</div>

</div>

<p>K-means is a technique used to partition <em>n</em> objects into <em>k</em> partitions, such that <em>k ≤ n</em>, where the objects

can be viewed as points in multidimensional space (each dimension being some attribute of the object). As the name might imply, this

process involves repeatedly computing centroids (i.e. means) of these <em>k</em> partitions. The heart of the k-means method is the

minimization of the intra-cluster variance, shown to the right, which is the sum of the distances from every point to its cluster's center.

Given this information, a valid

question would be how to determine the initial cluster centers. For now, let's just assume that these initial cluster centers are chosen

randomly. With that in mind, the partitioning process can be described with the following:</p>

<div class='code'>Obtain a set of data points and a value k
Randomly choose k cluster centers in the range of the given data points
Set 'previous' intra-cluster variance to infinity
Repeat:
    Assign each data point to the cluster whose center is closest
    Compute the current intra-cluster variance
    If there is no difference between the previous and current intra-cluster variance:
        Output the clusters and stop
    Compute new cluster centers as centroids over each cluster
    Store the current intra-cluster variance as the previous intra-cluster variance</div>

<p>Using Python, we can represent this process with the function <tt>getClusters()</tt>, as follows:</p>

<div class='code'>from random import uniform

INFINITY = float('inf')

def getClusters(data, k):
    "Return the clusters computed using Lloyd's algorithm"
    centers = getKRandomPoints(data, k)
    prevVariance = INFINITY
    while True:
        clusters = [[] for i in range(k)]
        for point in data:
            bestIndex = getBestClusterForPoint(point, centers)
            clusters[bestIndex].append(point)
        curVariance = getVariance(clusters, centers)
        if prevVariance - curVariance == 0:
            return clusters
        centers = computeCentroids(clusters)
        prevVariance = curVariance

def getKRandomPoints(data, k):
    "Returns k points, where the value of each dimension of the points is \
bounded by the minimum and maximum of that dimension over all points"
    bounds = [(min(dim), max(dim)) for dim in zip(*data)]
    return [tuple(uniform(lo, hi) for (lo, hi) in bounds) for i in range(k)]

def getBestClusterForPoint(point, centers):
    "Returns the index of the center that is closest to the given point"
    distances = [sum((p - c) ** 2 for (p, c) in zip(point, center))
                 for center in centers]
    return distances.index(min(distances))

def getVariance(clusters, centers):
    "Computes the intra-cluster variance over all clusters"
    return sum(sum((p - c) ** 2 for (p, c) in zip(point, center))
               for (cluster, center) in zip(clusters, centers)
               for point in cluster)

def computeCentroids(clusters):
    "Computes the centroids of each cluster (assumes no cluster is empty)"
    return [tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            for cluster in clusters]</div>

<div class='caption'><img style='margin: 10px 0 10px 0;' src='http://toolbox.lucifero.us/kmeans/maxima.png'/>

<div class='caption_text'>

A function representing the optimality of clusters for each 2D cluster center. Any choice in the red shaded
region will lead to a local maximum clustering result, whereas any choice in the green shaded region will
result in a global maximum.

</div></div>

<p>This method will continue to recalculate the cluster centers and reassign points until the intra-cluster

variance stops decreasing. Unfortunately, this does not always produce an optimal solution, which means that Lloyd's

algorithm is really a heuristic. Consider the function depicted to the right, representing the optimality of clusters

for a particular choice of cluster centers (the latter shown here in two dimensions). If the initial, random choice of cluster

centers lies somewhere within the red shaded region, the end-result will be a local maximum, not a global maximum.

This is due to the fact that the intra-cluster variance will decrease at each iteration -- when it stops

decreasing, we return the clusters at that point. In the case that the initial cluster center choice lies

in the red shaded region, the iterations will stop when the process reaches the highest point (the local maximum) in

that region, because any subsequent iteration would produce centers that result in a higher intra-cluster

variance.</p>

<p>Nonetheless, Lloyd's algorithm is still very effective at clustering multi-dimensional data, and is

widely used for this purpose. Notice that most of the clustering process (as described above) is independent of

the domain of the data being partitioned. Specifically, creating a templated class for clustering with Lloyd's

algorithm allows for use in any domain, where the only distinctions per subclass would be the k value, the

data points, and the distance measurement. For instance, one such subclass would include points determined

by the red, green, and blue values of each pixel in an image, and the distance measure would be the Euclidean

distance between each point. An illustrative example of this is shown below, using multiple k value clusterings.</p>

<div style='float:left; margin-right: 20px;'>

<h3>Original Image</h3>

<img src='http://toolbox.lucifero.us/kmeans/pumpkin.jpg' style='padding-top: 5px;' width='180px' />

</div>

<h3>Clustered Images</h3>

<img src='http://toolbox.lucifero.us/kmeans/kvalues.png' style='margin-bottom: 20px;' width='560px' />

<p>In this particular case, using color for the data being clustered could allow for some compression. Consider the space

typically used to store color information for a pixel in a bitmap image; each red, blue, and green value is in the range

[0, 255], making a total of 3 bytes = 24 bits per pixel. As can be seen above, the last several clusterings all represent

the original image fairly well, and if reduced in size, would be close to indistinguishable from the original. If we use

the k=8 clustering, there are only eight colors in the entire image, so we can represent the color information for each

pixel in 3 bits (an 87.5% reduction in size from the original!); this could be especially useful for services that

store large numbers of small images, like instant messaging services or web sites that store user icons. The following

two pictures illustrate that a small 8-color image (created using Lloyd's algorithm) can effectively replace the original. Can

you tell which image is the original?</p>

<center style='margin-bottom: 20px;'>

<img src='http://toolbox.lucifero.us/kmeans/red_starfish.jpg' width='90px' style='border: 1px solid #000; margin-right: 20px;' />

<img src='http://toolbox.lucifero.us/kmeans/red_starfish_clustered.jpg' width='90px' style='border: 1px solid #000;' />

</center>
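<p>The size arithmetic works out as follows; a quick sketch (the image dimensions are made up for illustration, and the small fixed cost of storing the 8-color palette is included for completeness):</p>

```python
width, height = 128, 128             # hypothetical image size
pixels = width * height

original_bits = pixels * 24          # 8 bits each for R, G, B
k = 8
index_bits = pixels * 3              # 3 bits address a palette of 2**3 = 8 colors
palette_bits = k * 24                # the 8 cluster-center colors, stored once

print(index_bits / original_bits)                   # -> 0.125 (an 87.5% reduction)
print((index_bits + palette_bits) / original_bits)  # only slightly more with the palette
```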

<p>Obviously color is not the only data that can be clustered on when dealing with images. In fact, image segmentation

(e.g. removing background objects from an image) can be done with this type of clustering by using additional features of pixels,

such as position, texture, and intensity, and assigning weights to certain dimensions that are considered more significant.</p>
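<p>A weighted distance of that kind could be sketched as follows (the feature layout and the weight values are purely illustrative):</p>

```python
def weighted_distance(p, q, weights):
    # Squared Euclidean distance with a per-dimension weight, so features
    # considered more significant contribute more to the clustering.
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q))

# Hypothetical feature vectors (x, y, intensity); intensity weighted highest.
weights = (0.5, 0.5, 2.0)
print(weighted_distance((0, 0, 10), (3, 4, 10), weights))  # -> 12.5
```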

<p>Another useful subclass would simply be on numerical values, where the distance measure is simply the absolute value of the

difference between two values. For instance:</p>

<h3>Values</h3>

<tt>{1, 3, 4, 12, 52, 22, 11, 12, 13, 567, 743, 423, 456, 8, 45, 60, 2, 11}</tt><br /><br />

<h3>k = 2</h3>

<tt>

{567, 743, 423, 456} <br/>

{1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11}

</tt><br /><br />

<h3>k = 3</h3>

<tt>
{743} <br />
{567, 423, 456} <br />
{1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11}
</tt><br /><br />
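<p>For plain numbers the whole loop from <tt>getClusters()</tt> collapses to a few lines; a sketch (the initial centers are fixed by hand here so the result is deterministic, rather than chosen at random):</p>

```python
def kmeans1d(values, centers):
    # Lloyd's algorithm on plain numbers: the distance measure is the
    # absolute value of the difference between two values.
    prevVariance = float('inf')
    while True:
        clusters = [[] for c in centers]
        for v in values:
            best = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[best].append(v)
        variance = sum(abs(v - c) for c, cl in zip(centers, clusters) for v in cl)
        if variance == prevVariance:
            return clusters
        centers = [sum(cl) / len(cl) for cl in clusters if cl]
        prevVariance = variance

values = [1, 3, 4, 12, 52, 22, 11, 12, 13, 567, 743, 423, 456, 8, 45, 60, 2, 11]
for cluster in kmeans1d(values, [0, 700]):
    print(cluster)
# -> [1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11]
# -> [567, 743, 423, 456]
```

With these starting centers the loop reproduces the k = 2 grouping shown above.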

<p>Other interesting subclasses include documents (where the objects are the distributions of words in the

document, and the distance would be the Kullback-Leibler Divergence between documents) and gene microarrays (where

the objects are gene expression intensities, and distance is Euclidean).</p>
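<p>The document-clustering distance could be sketched like so (the two word distributions are made up; note that KL divergence is not symmetric, so D(P&nbsp;||&nbsp;Q) generally differs from D(Q&nbsp;||&nbsp;P)):</p>

```python
from math import log

def kl_divergence(p, q):
    # D(P || Q) = sum over i of p_i * log(p_i / q_i).
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute 0.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # word distribution of one document (illustrative)
q = [0.4, 0.4, 0.2]   # word distribution of another
print(kl_divergence(p, q))
```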

<p>An obvious downside to k-means clustering is that you need to know the number of clusters in advance. However, the

intra-cluster variance of the final clustering can essentially be viewed as the amount of information lost by representing

all values in each cluster with the cluster's center. If an 'acceptable information loss' is defined, then the k value which

produces a clustering with intra-cluster variance just below this threshold can be viewed as the 'best' clustering of

the data. Since defining such an absolute threshold might be unreasonable, a relative threshold (information loss compared

to the previous one or two clusterings) could alternatively be used.</p>
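<p>That relative-threshold idea might be sketched as follows (the variance figures and the 10% threshold are invented for illustration):</p>

```python
def pick_k(variances, threshold=0.10):
    # variances maps each candidate k to the final intra-cluster variance
    # of its clustering; k values are examined in increasing order.
    ks = sorted(variances)
    for prev, k in zip(ks, ks[1:]):
        improvement = (variances[prev] - variances[k]) / variances[prev]
        if improvement < threshold:
            return prev   # the extra clusters no longer buy much
    return ks[-1]

# Hypothetical intra-cluster variances for k = 1..5:
variances = {1: 1000.0, 2: 300.0, 3: 120.0, 4: 110.0, 5: 108.0}
print(pick_k(variances))  # -> 3
```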

<a href='http://toolbox.lucifero.us/kmeans/kmeans-0.1.zip'>source code and sample images</a>

<div id='footer'>

<div>Questions or comments? Contact <a href='mailto:jtimmermann@g

</div>

</div>

</body>

</html>