dadadude asked:

Python: Kmeans, Within-class and Inter-class Variances

Is there a library in Python that allows me to compute the inter- and intra-cluster variance?

I use kmeans from scipy, but it doesn't give me these values.

Thank you.
mish33 replied:

This is taken from a page on K-means, with fixed formatting:
            <div id='title_bar'><h2>K-means clustering using Lloyd's algorithm</h2></div>
            <div id='main_block'>
      <div style='width: 200px;' class='caption'>
      <div class='caption_text'>
      <strong>intra-cluster variance:</strong> V = &sum;<sub>i=1</sub><sup>k</sup> &sum;<sub>p&isin;C<sub>i</sub></sub> |p &minus; m<sub>i</sub>|<sup>2</sup><br/>
      C<sub>i</sub> is the i'th cluster, p is a data point, and m<sub>i</sub> is the center of the i'th cluster
      <p>K-means is a technique used to partition <em>n</em> objects into <em>k</em> partitions, such that <em>k &le; n</em>, where the objects
      can be viewed as points in multidimensional space (each dimension being some attribute of the object). As the name might imply, this
      process involves repeatedly computing centroids (i.e. means) of these <em>k</em> partitions. The heart of the k-means method is the
      minimization of the intra-cluster variance, shown to the right, which is the sum of the squared distances from every point to its cluster's center.
      Given this information, a valid
      question would be how to determine the initial cluster centers. For now, let's just assume that these initial cluster centers are chosen
      randomly. With that in mind, the partitioning process can be described with the following:</p>
      <div class='code'>Obtain a set of data points and a value k
Randomly choose k cluster centers in the range of the given data points
Set 'previous' intra-cluster variance to infinity
Repeat:
   Assign each data point to the cluster whose center is closest
   Compute current intra-cluster variance
   If there is no difference between previous and current intra-cluster variance,
      output the clusters and stop
   Compute new cluster centers as centroids over each cluster
   Store current intra-cluster variance as previous intra-cluster variance</div>
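The "compute current intra-cluster variance" step above is just a double sum over clusters and points. A minimal standalone sketch (using squared Euclidean distance, the usual convention; the function and variable names here are mine, not from the original page):

```python
def intra_cluster_variance(clusters, centers):
    """Sum of squared distances from each point to its cluster's center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(point, center))
        for cluster, center in zip(clusters, centers)
        for point in cluster
    )

# Two 2-D clusters whose centers are (0, 0) and (4, 4);
# each point is at squared distance 1 from its center:
clusters = [[(0, 1), (1, 0)], [(4, 5), (5, 4)]]
centers = [(0, 0), (4, 4)]
print(intra_cluster_variance(clusters, centers))  # → 4
```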
      <p>Using Python, we can represent this process with the function <tt>getClusters()</tt>, as follows:</p>
      <div class='code'>import random

INFINITY = float('inf')

def getClusters(data, k):
   "Return the clusters computed using Lloyd's algorithm"
   centers = getKRandomPoints(data, k)
   prevVariance = INFINITY
   while True:
      # Reassign every point to the cluster with the nearest center
      clusters = [[] for _ in range(k)]
      for point in data:
         bestIndex = getBestClusterForPoint(point, centers)
         clusters[bestIndex].append(point)
      curVariance = getVariance(clusters, centers)
      if prevVariance - curVariance == 0:
         return clusters
      centers = computeCentroids(clusters)
      prevVariance = curVariance

def getKRandomPoints(data, k):
   "Return k points, where the value of each dimension of the points is \
    bounded by the minimum and maximum of that dimension over all points"
   dims = list(zip(*data))
   return [tuple(random.uniform(min(d), max(d)) for d in dims)
           for _ in range(k)]

def getBestClusterForPoint(point, centers):
   "Return the index of the center that is closest to the given point"
   return min(range(len(centers)),
              key=lambda i: distanceSquared(point, centers[i]))

def getVariance(clusters, centers):
   "Compute the intra-cluster variance over all clusters"
   return sum(distanceSquared(p, centers[i])
              for i, cluster in enumerate(clusters) for p in cluster)

def computeCentroids(clusters):
   "Compute the centroid of each cluster \
    (assumes no cluster becomes empty)"
   return [tuple(sum(d) / len(cluster) for d in zip(*cluster))
           for cluster in clusters]

def distanceSquared(p, q):
   "Squared Euclidean distance between two points"
   return sum((a - b) ** 2 for a, b in zip(p, q))</div>
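Since the question specifically asked about scipy: there is no single call that returns these variances, but scipy.cluster.vq gets you both in a few lines. A sketch, assuming scipy is installed; the variance arithmetic at the end is mine, with "inter-cluster variance" taken as the size-weighted spread of the centers around the overall mean (one common definition):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Two well-separated 2-D blobs as toy data
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.1, (50, 2)),
                  rng.normal(5, 0.1, (50, 2))])

centers, _ = kmeans(data, 2)          # codebook of cluster centers
labels, dists = vq(data, centers)     # nearest center and distance per point

intra = np.sum(dists ** 2)            # within-cluster (intra) variance

overall_mean = data.mean(axis=0)
sizes = np.bincount(labels, minlength=len(centers))
inter = np.sum(sizes * np.sum((centers - overall_mean) ** 2, axis=1))

print(intra, inter)                   # intra is far smaller for tight blobs
```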
      <div class='caption'>
      <div class='caption_text'>
      A function representing the optimality of clusters for each 2D cluster center. Any choice in the red shaded
      region will lead to a local maximum clustering result, whereas any choice in the green shaded region will
      result in a global maximum.
      <p>This method will continue to recalculate the cluster centers and reassign points until the intra-cluster
      variance stops decreasing. Unfortunately, this does not always produce an optimal solution, which means that Lloyd's
      algorithm is really a heuristic. Consider the function depicted to the right, representing the optimality of clusters
      for a particular choice of cluster centers (the latter shown here in two dimensions). If the initial, random choice of cluster
      centers lies somewhere within the red shaded region, the end result will be a local maximum, not a global maximum.
      This is due to the fact that the intra-cluster variance will decrease at each iteration -- when it stops
      decreasing, we return the clusters at that point. In the case that the initial cluster center choice lies
      in the red shaded region, the iterations will stop when the process reaches the highest point (the local maximum) in
      that region, because any subsequent iteration would produce centers that result in a higher intra-cluster variance.</p>
      <p>Nonetheless, Lloyd's algorithm is still very effective at clustering multi-dimensional data, and is
      widely used for this purpose. Notice that most of the clustering process (as described above) is independent of
      the domain of the data being partitioned. Specifically, creating a templated class for clustering with Lloyd's
      algorithm allows for use in any domain, where the only distinctions per subclass would be the k value, the
      data points, and the distance measurement. For instance, one such subclass would include points determined
      by the red, green, and blue values of each pixel in an image, and the distance measure would be the Euclidean
      distance between each point. An illustrative example of this is shown below, using multiple k value clusterings.</p>
      <div style='float:left; margin-right: 20px;'>
      <h3>Original Image</h3>
      <img src='' style='padding-top: 5px;' width='180px' />
      <h3>Clustered Images</h3>
      <img src='' style='margin-bottom: 20px;' width='560px' />
      <p>In this particular case, using color for the data being clustered could allow for some compression. Consider the space
      typically used to store color information for a pixel in a bitmap image; each red, blue, and green value is in the range
      [0, 255], making a total of 3 bytes = 24 bits per pixel. As can be seen above, the last several clusterings all represent
      the original image fairly well, and if reduced in size, would be close to indistinguishable from the original. If we use
      the k=8 clustering, there are only eight colors in the entire image, so we can represent the color information for each
      pixel in 3 bits (an 87.5% reduction in size from the original!); this could be especially useful for services that
      store large numbers of small images, like instant messaging services or web sites that store user icons. The following
      two pictures illustrate that a small 8-color image (created using Lloyd's algorithm) can effectively replace the original. Can
      you tell which image is the original?</p>
      <center style='margin-bottom: 20px;'>
            <img src='' width='90px' style='border: 1px solid #000; margin-right: 20px;' />
            <img src='' width='90px' style='border: 1px solid #000;' />
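The compression arithmetic above is easy to verify: indexing a palette of k colors takes ceil(log2 k) bits per pixel (a generic check, not from the original page):

```python
import math

def palette_bits(k):
    "Bits needed to index a palette of k colors."
    return math.ceil(math.log2(k))

original_bits = 24                    # 3 bytes of RGB per pixel
b = palette_bits(8)
print(b)                              # → 3
print(1 - b / original_bits)          # → 0.875, i.e. an 87.5% reduction
```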
      <p>Obviously color is not the only data that can be clustered on when dealing with images. In fact, image segmentation
      (e.g. removing background objects from an image) can be done with this type of clustering by using additional features of pixels,
      such as position, texture, and intensity, and assigning weights to certain dimensions that are considered more significant.</p>
      <p>Another useful subclass would operate on numerical values, where the distance measure is simply the absolute value of the
      difference between two values. For instance:</p>
      <tt>{1, 3, 4, 12, 52, 22, 11, 12, 13, 567, 743, 423, 456, 8, 45, 60, 2, 11}</tt><br /><br />
      <h3>k = 2</h3>
      {567, 743, 423, 456} <br/>
      {1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11}
      <br /><br />
      <h3>k = 3</h3>
      {743} <br />
      {567, 423, 456} <br />
      {1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11} <br /><br />
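The k = 2 split above can be checked for stability: compute each cluster's mean and reassign every value to the nearest mean; if the partition does not change, Lloyd's algorithm would stop there. A quick sketch using the numbers from the example:

```python
data = [1, 3, 4, 12, 52, 22, 11, 12, 13, 567, 743, 423, 456, 8, 45, 60, 2, 11]
clusters = [[567, 743, 423, 456],
            [1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11]]

centers = [sum(c) / len(c) for c in clusters]   # cluster means

# Reassign each value to the nearest center (distance = absolute difference)
reassigned = [[], []]
for x in data:
    best = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    reassigned[best].append(x)

print(sorted(reassigned[0]) == sorted(clusters[0]))  # → True: a stable split
```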
      <p>Other interesting subclasses include documents (where the objects are the distributions of words in the
      document, and the distance would be the Kullback-Leibler Divergence between documents) and gene microarrays (where
      the objects are gene expression intensities, and distance is Euclidean).</p>
      <p>An obvious downside to k-means clustering is that you need to know the number of clusters in advance. However, the
      intra-cluster variance of the final clustering can essentially be viewed as the amount of information lost by representing
      all values in each cluster with the cluster's center. If an 'acceptable information loss' is defined, then the k value which
      produces a clustering with intra-cluster variance just below this threshold can be viewed as the 'best' clustering of
      the data. Since defining such an absolute threshold might be unreasonable, a relative threshold (information loss compared
      to the previous one or two clusterings) could alternatively be used.</p>
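The relative-threshold idea can be sketched with scipy (the toy data and names are mine): compute the final intra-cluster variance for increasing k and look for the value where it stops dropping sharply.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(42)
# Three well-separated 1-D groups, so the 'right' k is 3
data = np.concatenate([rng.normal(0, 1, 100),
                       rng.normal(20, 1, 100),
                       rng.normal(40, 1, 100)]).reshape(-1, 1)

variances = {}
for k in range(1, 6):
    centers, _ = kmeans(data, k)
    _, dists = vq(data, centers)
    variances[k] = float(np.sum(dists ** 2))

# The variance falls steeply up to k = 3, then levels off;
# a relative threshold on successive drops would pick k = 3 here.
for k in sorted(variances):
    print(k, round(variances[k], 1))
```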
      <a href=''>source code and sample images</a>
      <div id='footer'>
         <div>Questions or comments? Contact <a href=''>jtimmermann [at] gmail (dot) com</a></div>
dadadude replied:
What? Sorry, I didn't understand what you posted at all.
Thank you for your help!! It worked perfectly!!