K-Means implementation in C#

Hello,

I'm looking for a working implementation of the k-means algorithm in C#.

The implementation needs to work with image data that I have normalized into a 2 dimensional float array:

float [PixelCount, 3] data;
   where
PixelCount is the number of pixels in the image and 3 holds the RGB or HSV/HSB values.

Thanks,

Khan
raheelasadkhanAsked:
 
AGBrown Commented:
Khan,

It's really not a problem; this was an interesting question and I got a lot out of it. I think that's why lots of people do this: to learn from answering other people's questions, and to teach someone else to make up for the fact that someone taught me at one point! It's nice when people appreciate it; some don't at all.

The exception you are getting is built into the code I've given. You have to decide what is acceptable to do when you come across a cluster with 0 points in it. You can't carry on assigning to this cluster as it has no mean, so you may choose to delete it, and so reduce NUM_CLUSTERS (you would then have to reassign clusterList as the remaining cluster indexes would change), or you might choose to randomly pick a pixel out of the "largest" cluster and use that as the new mean.

I was considering how you might do this in the code, and I think that if you get a cluster with 0 points, you have to exit the improvement cycle's for (c) loop, pick a point from the largest cluster, and start the improvement cycle again.

You tend to end up with 0 members assigned to a cluster if you have a significant number of clusters compared to points, or just if the initial conditions allow it, e.g. one of the initial random cluster centres lies in a region of your data-space where all the closest points are closer to another cluster. Either a large number of clusters or a grouped (i.e. not homogeneous) data space would tend to make this likely.

Anyway, to fix this, you might choose to do the following (there's a rough sketch after this list):
-Find the line "throw new Exception(String.Format("Cluster with 0 members found: cluster {0}", c));" in the code I gave you.
-Instead of that, get the number of points in each cluster.
-Then choose a random number between 0 and the number of points in the largest cluster.
-Assign the point at that position in the largest cluster to the current cluster (i.e. set that point's entry in clusterList to c).
-Return true, so that we go through the reassignment/improvement loop again.
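
Here's a minimal sketch of that fix in place of the throw, using the same names as the code below (clusterList, points, NUM_CLUSTERS, c); treat it as a starting point rather than a tested drop-in:

if (intNumInCluster == 0)
{
      //      Count the members of each cluster (clusterCounts is a new local, not in the original code)
      int[] clusterCounts = new int[NUM_CLUSTERS];
      for (int i = 0; i < points.Length; i++)
            clusterCounts[clusterList[i]]++;
      //      Find the cluster with the most members
      int largestCluster = 0;
      for (int k = 1; k < NUM_CLUSTERS; k++)
            if (clusterCounts[k] > clusterCounts[largestCluster])
                  largestCluster = k;
      //      Pick a random member of that cluster and hand it to the empty cluster 'c'
      Random rnd = new Random();
      int target = rnd.Next(clusterCounts[largestCluster]);
      for (int i = 0, seen = 0; i < points.Length; i++)
      {
            if (clusterList[i] == largestCluster && seen++ == target)
            {
                  clusterList[i] = c;
                  break;
            }
      }
      return true;      //      force another pass through the reassignment/improvement loop
}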

Bear in mind that in some situations you will always get this, and won't be able to do anything about it, because this is a non-sharing clustering algorithm; in that case, you would just end up going round in circles. That's why I also have the iterationNumber check in there.

An interesting second (kind of related) problem to look at would be a picture where all pixels are either white (255,255,255) or black (0,0,0). If you try and put this into 3 clusters, you will end up with two of them having the same mean, i.e. one cluster will be white, and one black, and the other black or white (not both). This is a problem you can't get around unless you recognise that two clusters have the same mean, and either merge them, or try and reset one of them to a random mean somewhere else.
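
If you wanted to detect that, a rough check against the oldMeans array from the code below might look like this (what you do inside the "identical" branch - merge, or re-seed one cluster - is up to you):

for (int a = 0; a < NUM_CLUSTERS; a++)
{
      for (int b = a + 1; b < NUM_CLUSTERS; b++)
      {
            bool identical = true;
            for (int n = 0; n < oldMeans[a].Count; n++)
            {
                  if (oldMeans[a][n] != oldMeans[b][n]) { identical = false; break; }
            }
            if (identical)
            {
                  //      clusters a and b have the same mean: merge them, or re-seed one
                  //      with a randomly chosen point, as in the empty-cluster fix above
            }
      }
}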

Does that help?

W.r.t. getting a particular point out of a cluster, that's pretty hard as I've only built information into the points about which cluster they belong to (through clusterList) and not into the clusters about which points they contain (that would require each cluster to have a list of the indexes of the points that it owns).

There is a way around this, but it requires either changing the code, or just looping through clusterList until you have picked up the indexes of all points in the cluster you want, and translating this to a continuous index for that cluster. I deliberately made all that code as simple as possible so you could extend it, but therefore it was bound to fall short when you wanted to do something like this.

The quick fix would be:
//    Note: ArrayList needs "using System.Collections;"
ArrayList alClusterListIndex = new ArrayList();
for (int i = 0; i < clusterList.Length; i++)
{
    if (clusterList[i] == intRequiredCluster)
    {
        alClusterListIndex.Add(i);
    }
}
//    Now get the "jth" point in that cluster
return points[(int)alClusterListIndex[j]];

Obviously this falls down terribly w.r.t. performance with repeated calls, in which case there are much better ways. One would be to maintain an array of pointsIndex arrays, one for each cluster; these would store the indexes of the points. The problem with that is that removing the point from one cluster and moving it to another is much harder. A better way would be to create an ICluster interface, and modify the IDataPoint interface:

interface ICluster
{
    IDataPoint this[int index] { get; } // if possible, base classes that implement ICluster on CollectionBase, or something similar
}
interface IDataPoint
{
    ... as before ...
    ICluster CurrentCluster {get;}
}

You would then, each time you assign a datapoint, make sure that it changes its CurrentCluster reference, and also make sure that the relevant old cluster loses the point out of its list, and the new cluster gets it added. You may even find that it's not necessary to ask the datapoint to reference its current cluster, and that making this cluster-centric instead of point-centric works better.
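
To make that concrete, reassignment in a cluster-centric design might look something like the sketch below. None of this exists in the code I posted - Cluster, Add and Remove are illustrative names, and IDataPoint would need a settable CurrentCluster:

void Reassign(IDataPoint point, Cluster newCluster)
{
      Cluster oldCluster = point.CurrentCluster;
      if (oldCluster == newCluster)
            return;                        //      nothing to do
      if (oldCluster != null)
            oldCluster.Remove(point);      //      old cluster loses the point from its list
      newCluster.Add(point);               //      new cluster gains it
      point.CurrentCluster = newCluster;
}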

Did that answer all the questions? Let me know.

A
 
MuhammadAdilCommented:
Hello Dear

Describe in detail what you want to do with the image.
 
Bob LearnedCommented:
 
AGBrownCommented:
Hi Bob,

I just tried it and it works fine for me, though the final article isn't there; it's at:
http://www.kdkeys.net/forums/3538/ShowPost.aspx

Hth,

Andy
 
Bob LearnedCommented:
Thanks, Andy, that is much better :D

Bob
 
raheelasadkhanAuthor Commented:
Hi,

I went through the above article but could not find the source code. Both pages have a broken link to the source code. I found another version at:
http://www.kdkeys.net/forums/3992/ShowPost.aspx
http://www.kdkeys.net/forums/6051/ShowPost.aspx

These samples, however, do not implement the algorithm quite the way the article explains it.

The article was more help since it explains the algorithm, but it is not possible for me to interpret and translate it into code. I could do that with code in another language or even pseudocode. Any help would be appreciated.

Thanks,

Khan
 
AGBrownCommented:
Hi Khan,

I'll help you work through writing it if you like. It doesn't look too hard. Does that sound helpful?

Andy
 
AGBrownCommented:
I'll kick off with the logic:
1) Get array of data objects
2) Initialise clusters: Randomly pick a number of records equal to the number of desired clusters
3) Calculate cluster mean
4) Assign each record to the cluster with the closest mean, based on a distance measure (see the sketch after this list)
5) Calculate cluster mean
6) Compare cluster mean with previous mean. If deviations are small, or we have got caught in some kind of minimum and are oscillating around it, then exit; otherwise go to (4).
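
In rough C#-style pseudocode (the names are placeholders; we'll firm them up as we go), that outline is:

//      Skeleton only - not working code yet
LoadDataPoints();                           // (1)
PickInitialClusterPoints(numClusters);      // (2) random records become the initial means
CalculateClusterMeans();                    // (3)
do
{
      AssignEachPointToNearestCluster();    // (4) "closest" measured by e.g. Euclidean distance
      CalculateClusterMeans();              // (5)
} while (MeansStillMoving());               // (6) stop when deviations are small or we start oscillating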

Does that sound good? Next step is to start defining our objects to help us do this.

Andy
 
raheelasadkhanAuthor Commented:
Andy: Thanks. That definitely helps. Before reading your post, I got frustrated, started reading the article, and wrote some skeleton code. It's taking good shape, and it is only after writing this code that I understand the KdKeys sample. Since you are familiar with algorithms, where do you stand on static vs instance methods with respect to performance? I use instance members only when contextual data storage is needed. But is there a downside to static methods when you are allocating lots of memory? This algorithm, for example, will be dealing with fairly large images resulting in hundreds of megabytes in float arrays. Also, with very large images, simply creating the array sometimes causes an OutOfMemoryException. The only option I see is to use malloc and work with unsafe code (pointers). Any suggestions?

public   const   int      NumberOfClusters   = 20;
public   const   int      VectorLength      = 3;

public struct Cluster
{
   public   float   []      Mean;
   public   float   [, ]   Rows;
   public   int            RowCount;
}

public static void ComputeVector (Clustering.Library.Image image)
{
   // Convert image data into the k-means algorithm-specific format:
   //   float[pixelCount][3]

   float   [, ]   data   = null;

   data   = new float [image.SizeInPixels, Clustering.Library.DominantColor.VectorLength];

   for (int i=0; i<data.GetLength(0); i++)   // GetLength(0) = pixel count; Length would be pixelCount * 3
   {
      data[i, 0]   = image.H[i];
      data[i, 1]   = image.S[i];
      data[i, 2]   = image.V[i];
   }

   ComputeVector(data);
}

public static void ComputeVector (float [, ] data)
{
   int            index      = 0;
   System.Random   random      = new System.Random();   // must be instantiated before it is used below

   // Initialize
   _Data      = data;
   _Clusters   = new Clustering.Library.DominantColor.Cluster [Clustering.Library.DominantColor.NumberOfClusters];
   for (int i=0; i<_Clusters.Length; i++)
   {
      _Clusters[i].RowCount   = 0;
      _Clusters[i].Mean      = new float [Clustering.Library.DominantColor.VectorLength];
      for (int j=0; j<_Clusters[i].Mean.Length; j++)
      {
         _Clusters[i].Mean[j]   = 0.0F;
      }
      _Clusters[i].Rows      = new float [Clustering.Library.DominantColor.NumberOfClusters, Clustering.Library.DominantColor.VectorLength];
      for (int j=0; j<_Clusters[i].Rows.GetLength(0); j++)   // GetLength(0): number of rows, not total elements
      {
         _Clusters[i].Rows[j, 0]   = 0.0F;
         _Clusters[i].Rows[j, 1]   = 0.0F;
         _Clusters[i].Rows[j, 2]   = 0.0F;
      }
   }

   // Create initial clusters by assigning one random row to each
   for (int i=0; i<Clustering.Library.DominantColor.NumberOfClusters; i++)
   {
      index   = random.Next(0, _Data.GetLength(0));   // GetLength(0) = pixel count

      for (int j=0; j<Clustering.Library.DominantColor.VectorLength; j++)
      {
         _Clusters[i].Rows[0, j]   = _Data[index, j];
      }
   }

   ComputeClusterMean(); // Already implemented
}
 
raheelasadkhanAuthor Commented:
Sorry, the code was not supposed to be there in my last message.
 
AGBrownCommented:
Very close to a working example, just give me a few minutes to test it.

Andy
 
AGBrownCommented:
OK, here's the first working example. Unfortunately, it sounds like it's going to take a lot of work to bring it up to the capacity you want. I'll post the code, and then my thoughts on it. It's a console application, so if you create a new console application and paste it over the top, you should be OK. I haven't used any .NET 2.0 specific features, so you could use it in either .NET 1.1 or 2.0 - what do you use?

using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplicationCS2
{
      class Program
      {

            const int NUM_CLUSTERS = 3;
            const int NUM_POINTS = 1000;
            const double MEANS_CUTOFF_TO_END = 0.001;

            static IDataPoint[] points = new IDataPoint[NUM_POINTS];
            //      Keep a track of which point is at which cluster through another array
            static int[] clusterList = new int[NUM_POINTS];
            static IDataPoint[] oldMeans = new IDataPoint[NUM_CLUSTERS];
            
            static void Main(string[] args)
            {
                  //      I have tried to keep the following loop variables reserved for particular cases:
                  //            c: the cluster index
                  //            n: the datapoint coordinate index
                  //            i: the points array index.

                  LoadDataPoints();
                  //      Load the data points into the "old means" array.
                  PickInitialClusterPoints(NUM_CLUSTERS);

                  double dblClosestClusterDistance = 0;
                  double dblCurrentClusterDistance = 0;
                  IDistanceCommand cmd = new EuclideanDistanceCommand();
                  //      Now start the cluster-improvement loop
                  int intNumLoops = 0;
                  do
                  {
                        //      for each point, find its closest cluster mean, and assign it to that cluster
                        //      so loop for each point
                        //      Note that at this point we could split the list into equal parts and send each section off to a different thread
                        //      and because of the way we do this, we wouldn't need any locks to slow things down as each thread would be handling separate parts
                        //      of all the arrays.
                        for (int i = 0; i < NUM_POINTS; i++)
                        {
                              //      for each cluster, find the distance between the cluster mean and the point
                              //      TODO: once tested and happy this works, optimise this:
                              //            Initialise the clusterList array with -1. Then if the value of the clusterList[i] != -1,
                              //            and c == 0, we check the old closest cluster first, and then compare to the current one only if different.
                              for (int c = 0; c < NUM_CLUSTERS; c++)
                              {
                                    if (c == 0)
                                    {
                                          //      For the initial cluster, just get the distance, assume it is closest, and log the cluster index
                                          //      Get the initial cluster distance
                                          dblClosestClusterDistance = cmd.CalculateDistance(points[i], oldMeans[c]);
                                          //      And note the cluster index in the clusterList array
                                          clusterList[i] = 0;
                                    }
                                    else
                                    {
                                          //      For the other clusters, get the distance
                                          dblCurrentClusterDistance = cmd.CalculateDistance(points[i], oldMeans[c]);
                                          //      then compare to the currently stored closest distance
                                          if (dblCurrentClusterDistance < dblClosestClusterDistance)
                                          {
                                                //      we have a new closest distance, so keep it
                                                dblClosestClusterDistance = dblCurrentClusterDistance;
                                                //      and note the new cluster index
                                                clusterList[i] = c;
                                          }
                                    }
                              }
                        }
                        intNumLoops++;
                  } while (CalculateNewMeansAndDecideOnContinuation(intNumLoops));
                  Console.ReadLine();
            }
            /// <summary>
            ///            This checks the old means against the new ones, assigns the new ones to the old ones if necessary
            /// </summary>
            /// <returns></returns>
            private static bool CalculateNewMeansAndDecideOnContinuation(int iterationNumber)
            {
                  bool blnContinue = false;

                  //      pretty certain that you could decide on the max number of iterations based on the number of points and the number of clusters
                  //      but we'll just settle for 10000 at the moment.
                  if (iterationNumber > 10000)
                        return false;

                  //      Create a datapoint array for storing the mean values
                  IDataPoint[] newMeans = new IDataPoint[oldMeans.Length];
                  //      Create an int array to keep a count of the number in each array
                  int intNumInCluster = 0;

                  //      Initialise means array to 0;
                  for (int c = 0; c < newMeans.Length; c++)
                  {
                        newMeans[c] = new RgbDataPoint(0d, 0d, 0d);
                  }

                  for (int c = 0; c < newMeans.Length; c++)
                  {
                        //      Now calculate the means array by summing all the values into the new means array
                        intNumInCluster = 0;
                        for (int i = 0; i < points.Length; i++)
                        {
                              if (clusterList[i] == c)
                              {
                                    //      The point is in the cluster 'c', so
                                    intNumInCluster++; // increment the number in the cluster
                                    //      add the points's values to the tally in the current means array
                                    for (int n = 0; n < newMeans[c].Count; n++)
                                    {
                                          newMeans[c][n] += points[i][n];
                                    }
                              }
                        }

                        //      We have found all the points in this cluster, so we can calculate the mean by dividing our sum by the number of points:
                        //      if there are no points in this cluster, we can't continue as we won't be able to assign points to our cluster in the next round.
                        //      If we wanted to, we could still calc the other clusters and then exit, but for the moment we will just exit.
                        if (intNumInCluster == 0)
                        {
                              throw new Exception(String.Format("Cluster with 0 members found: cluster {0}", c));
                        }
                        for (int n = 0; n < newMeans[c].Count; n++)
                        {
                              newMeans[c][n] /= intNumInCluster;
                              //      Now check to see if the difference between this number and the old mean is still large (using an arbitrary cutoff)
                              //      If blnContinue is true, it won't evaluate the equation, otherwise we just check the modulus of the means difference.
                              blnContinue = blnContinue || (Math.Sqrt(Math.Pow(newMeans[c][n] - oldMeans[c][n], 2)) > MEANS_CUTOFF_TO_END);
                        }
                        //      Finally write the new mean to the old mean.
                        oldMeans[c] = newMeans[c];
                        Console.WriteLine(String.Format("Iteration {0}, Cluster {1}, Member Count {2}", iterationNumber, c, intNumInCluster));
                  }
                  return blnContinue;
            }

            private static void PickInitialClusterPoints(int numClusters)
            {
                  int[] startCluster = new int[numClusters];
                  int rand;
                  bool blnAlreadyFound;
                  Random rnd = new Random();

                  //      initialise the initial cluster array so we can tell if our random point has already been picked in the random-selection loop
                  for (int c = 0; c < numClusters; c++)
                  {
                        startCluster[c] = -1;
                  }
                  //      Get the inital points:
                  for (int c = 0; c < numClusters; c++)
                  {
                        //      Now we have a random number, check it against the existing numbers
                        blnAlreadyFound = false;
                        do
                        {
                              //      Pick a random number
                              rand = rnd.Next(NUM_POINTS);      //      pick from the full range of point indexes rather than a hard-coded 1000
                              blnAlreadyFound = false;
                              //      only check for indexes less than our current index
                              for (int j = 0; j < c; j++)
                              {
                                    if (startCluster[j] == rand)
                                    {
                                          // if we matched then mark our flag as true and break out of the for loop
                                          blnAlreadyFound = true;
                                          break;
                                    }
                              }
                        } while (blnAlreadyFound);
                        //      if we get to here then the flag did not get set to true at all, so use that random number
                        startCluster[c] = rand;
                  }
                  //      Now that we have our initial point indexes, assign those points to the oldMeans
                  for (int c = 0; c < numClusters; c++)
                  {
                        oldMeans[c] = points[startCluster[c]];
                  }
            }

            private static void LoadDataPoints()
            {
                  //      Generate a bunch of "random pixels"
                  Random rnd = new Random();
                  for (int i = 0; i < points.Length; i++)
                  {
                        points[i] = new RgbDataPoint(rnd.Next(256), rnd.Next(256), rnd.Next(256));
                  }
            }
      }
      public interface IDistanceCommand
      {
            double CalculateDistance(IDataPoint from, IDataPoint to);
      }
      public interface IDataPoint
      {
            double this[int coordinateIndex] { get;set; }
            int Count { get; }
      }
      public struct RgbDataPoint : IDataPoint
      {
            private double[] coordinateArray;
            public double this[int coordinateIndex]
            {
                  get
                  {
                        return this.coordinateArray[coordinateIndex];
                  }
                  set
                  {
                        this.coordinateArray[coordinateIndex] = value;
                  }
            }
            public int Count
            {
                  get
                  {
                        return this.coordinateArray.Length;
                  }
            }
            public RgbDataPoint(double red, double green, double blue)
            {
                  this.coordinateArray = new double[3];
                  this.coordinateArray[0] = red;
                  this.coordinateArray[1] = green;
                  this.coordinateArray[2] = blue;
            }
      }
      public class EuclideanDistanceCommand : IDistanceCommand
      {
            public double CalculateDistance(IDataPoint from, IDataPoint to)
            {
                  double result = 0;
                  //      Performance may improve if this is removed.
                  if (from.Count != to.Count)
                        throw new ArgumentException("IDataPoint lengths for 'from' and 'to' parameters must match");
                  //      Go through the coordinates and calculate the distance
                  for (int i = 0; i < from.Count; i++)
                  {
                        result += System.Math.Pow(to[i] - from[i], 2d);      //      accumulate the squared difference over every coordinate
                  }
                  return System.Math.Sqrt(result);
            }
      }
}
 
AGBrownCommented:
Firstly, does it work? With 1000 "pixels" and 3 clusters I would expect roughly 333 in each cluster by the time it finishes, and we see that within very few iterations, so I would say yes, it works.

Importantly (and unfortunately), does it scale? The answer is a flat no. The first obvious problem is the use of integers as indexes for the arrays - this implies an upper limit on the array sizes. The second is memory. You should be able to reference up to 2 GB with a single program in Windows (3 GB with an, apparently, stable Windows startup switch). The GC is good; I've run tests before with algorithms using a Dictionary of up to 1.5 GB in size with no problems.

BUT, and this is the pretty crappy part: I couldn't even get decent performance out of 50 million data points, and 2,147,483,647 points fell over on the allocation of the points array. I would love to wax lyrical about the design of the algorithm, but the fact is it's basically useless :$

In terms of instance vs. static variables, it's all about performance particular to a given problem, and exactly what kind of variable you have. If you have a single array that never changes, then either is fine. This is quite a functional problem, though it can benefit from OOP in terms of multi-threading (then you would use instances), and patterns to let you be flexible with the algorithm and the type of data point. I don't think the GC will mind either way, but don't use static variables for things that should be disposed of quickly.

Now, other than us starting to work on ways of sorting this out in C#, what might be worth a look for doing this is F#. It's a C#-esque ML variant, so not completely horrible. It has Visual Studio support. It is also high-performance, according to what I've read (there are loads of Google links for it):
http://research.microsoft.com/fsharp/fsharp.aspx

So where do you think we go from here?

Andy
 
AGBrownCommented:
Actually, it's not as bad as I thought. I've just done 4 clusters on a 1 mega-pixel "image" and it collapsed pretty quickly. I'm doing more tests at the moment.

How big are your images?

A
 
AGBrownCommented:
And 20 clusters on a 5 mega-pixel "image" just went ok as well. Obviously the time it takes goes up with the number of clusters, which is where the "nearest cluster" bit could be improved by starting off with the existing assigned cluster instead of the 0th cluster each time.
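
That tweak would look roughly like this inside the assignment loop (it assumes clusterList is initialised to -1 before the first pass, which the posted code doesn't do yet):

//      Sketch: seed the "closest so far" from the point's current cluster, if it has one
int startCluster = clusterList[i];
if (startCluster >= 0)
{
      dblClosestClusterDistance = cmd.CalculateDistance(points[i], oldMeans[startCluster]);
}
else
{
      dblClosestClusterDistance = cmd.CalculateDistance(points[i], oldMeans[0]);
      clusterList[i] = 0;
      startCluster = 0;
}
for (int c = 0; c < NUM_CLUSTERS; c++)
{
      if (c == startCluster)
            continue;      //      already measured
      double d = cmd.CalculateDistance(points[i], oldMeans[c]);
      if (d < dblClosestClusterDistance)
      {
            dblClosestClusterDistance = d;
            clusterList[i] = c;
      }
}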

A.
 
AGBrownCommented:
Well I've just finished running it with 30,000,000 pixels, and 50 clusters, and it ran for long enough to show it was stable, and even gave up when it ended up with a 0-population cluster, as it should do. That took up 1.8 GB of memory. So this is, if a little basic, workable.

I'll wait to hear from you about how many pixels you have in each image. Obviously if you have more than 30 million we could implement some kind of indexing instead of storing all the pixels, but if every pixel was different you would be back to square 1 with worse performance.
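
For what it's worth, one way to do that indexing would be to bucket identical colours up front and weight each bucket by its pixel count, so the clustering only ever sees one entry per distinct colour; the mean calculation then multiplies each entry by its count instead of visiting every pixel. Very rough sketch (it assumes System.Collections is imported and that pixels is an int[n, 3] of RGB values, neither of which is in the code above):

Hashtable colourCounts = new Hashtable();
for (int i = 0; i < pixels.GetLength(0); i++)
{
      //      Pack R, G and B into a single int key
      int key = (pixels[i, 0] << 16) | (pixels[i, 1] << 8) | pixels[i, 2];
      if (colourCounts.ContainsKey(key))
            colourCounts[key] = (int)colourCounts[key] + 1;
      else
            colourCounts[key] = 1;
}
//      Each distinct colour now becomes one weighted data point for the clustering.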

A.
 
raheelasadkhanAuthor Commented:
Hi Andy,

To start off with, I just have to stop and appreciate the EE concept and what people such as yourself bring to it. This is absolutely invaluable and I hope I can contribute back to it in some way. That aside, the sheer lengths you went to in order to help me out on this are overwhelming. I'm sure this post will help others as well.

I've just started working with your code and it looks great. Will reply back in a bit after some tests.

Thanks,

Khan
 
raheelasadkhanAuthor Commented:
I tried the code and ran into one problem, and I also have a question.

QUESTION:
Once the clusters have been computed, I want to retrieve the following:
 - Clusters[clusterIndex, rowIndex, vectorIndex]
   WHERE
      clusterIndex ranges from 0 to (NUM_CLUSTERS - 1)
      rowIndex ranges from 0 to ((total number of points in current cluster) - 1)
      vectorIndex ranges 0 to 2 in our case (RGB)

Looking at the code, I think the following code should display something like:
Cluster: 1, Row: 1, RGB: #,#,#
Cluster: 2, Row: 1, RGB: #,#,#
Cluster: 1, Row: 2, RGB: #,#,#
...

// Used to display row index per cluster since the displayed values will be unsorted
int [] rowIndex = new int [NUM_CLUSTERS];
for (int i=0; i<NUM_CLUSTERS; i++)
   rowIndex[i] = 0;
for (int i=0; i<NUM_POINTS; i++)
{
   Console.WriteLine
   (
      "Cluster: "
      + clusterList[i].ToString()
      + ", Row: "
      + rowIndex[clusterList[i]].ToString()
      + ", RGB: "
      + points[i][0].ToString()
      + ", "
      + points[i][1].ToString()
      + ", "
      + points[i][2].ToString()
   );
   rowIndex[clusterList[i]]++;
}

PROBLEM:
I have customized the code to fit into my project by changing conventions and making everything instance instead of static. I am getting the exception "Cluster with 0 members found: cluster x". This error has come up in all of the images I have tried; they have been 100x100 and 640x480 bitmap files. Since I'm not familiar with clustering concepts, I'm not sure if this is the result of an error on my part. Please advise.
 
raheelasadkhanAuthor Commented:
In case it helps, the error seems to always occur in the first iteration with a varying cluster number depending on the image. In other words, CalculateNewMeansAndDecideOnContinuation() only gets called once and always throws that exception.

I'm a complete novice at this, so please bear with me. I don't know enough about the algorithm to figure out what factors would cause 0 members to be assigned to a cluster.
 
raheelasadkhanAuthor Commented:
Andy,

Thanks. That definitely did help.

I actually wrote the algorithm from scratch using your code as the logic template and came up with the class below. I've probably made a few mistakes in there, but I'd love to hear your opinion. In addition to the "clusterList[NUM_DATAPOINTS]" array, this approach uses an array of ArrayList objects, "ArrayList[CLUSTER_COUNT]". Each element in this array holds an ArrayList which in turn contains the indexes of the data points for that cluster.

I'm not sure if the terminating condition is right and this code is painfully slow with images as small as 640x480.

Since the logic is the same as your code, the only thing that may improve performance would be to have a 2D array instead of an array of ArrayLists. What do you think?

Thanks,

Khan

namespace Clustering.Library
{
   //****************************************************************************************************/
   // Structures
   //****************************************************************************************************/

   public struct DataPoint
   {
      public      double      H;
      public      double      S;
      public      double      V;
      public      int         ClusterIndex;

      public DataPoint (double h, double s, double v)
      {
         this.ClusterIndex   = -1;

         this.H            = h;
         this.S            = s;
         this.V            = v;
      }

      public void Initialize ()
      {
         this.ClusterIndex   = -1;

         this.H            = 0D;
         this.S            = 0D;
         this.V            = 0D;
      }
   }

   public class DominantColor
   {
      //****************************************************************************************************/
      // Constants
      //****************************************************************************************************/

      public   const   int         CLUSTER_COUNT         = 20;
      public   const   double      MEANS_CUTOFF_TO_END      = 0.001D;

      //****************************************************************************************************/
      // Members
      //****************************************************************************************************/

      // The main array of data points that will hold the image data
      private      Clustering.Library.DataPoint   []      _DataPointCollection;
      // Simple int array to hold the cluster index for each data point
      //   Length will be the same as _DataPointCollection
      private      int                        []      _ClusterIndexCollection;
      // An array of ArrayList objects. The array will be of length CLUSTER_COUNT and
      //   will hold one ArrayList for each cluster. Each ArrayList will store only the
      //   data point index from _DataPointCollection. I've used this approach since it
      //   saves memory and the BinarySearch feature in ArrayList objects is lightning fast.
      // This array is not necessary but I've included it since this approach increases
      //   performance dramatically
      private      System.Collections.ArrayList   []      _ClusterIndexCollectionCollection;
      // Simple data point array to hold the mean of each cluster
      //   Length will be the same as CLUSTER_COUNT
      private      Clustering.Library.DataPoint   []      _MeanCollection;
      // Simple data point array to hold the old mean of each cluster
      //   Length will be the same as CLUSTER_COUNT
      private      Clustering.Library.DataPoint   []      _MeanOldCollection;

      //****************************************************************************************************/
      // Properties
      //****************************************************************************************************/

      public Clustering.Library.DataPoint [] DataPointCollection
      {
         get   {return (this._DataPointCollection);}
      }

      public int [] ClusterIndexCollection
      {
         get   {return (this._ClusterIndexCollection);}
      }

      public System.Collections.ArrayList [] ClusterIndexCollectionCollection
      {
         get   {return (this._ClusterIndexCollectionCollection);}
      }

      public Clustering.Library.DataPoint [] MeanCollection
      {
         get   {return (this._MeanCollection);}
      }

      public Clustering.Library.DataPoint [] MeanOldCollection
      {
         get   {return (this._MeanOldCollection);}
      }

      //****************************************************************************************************/
      // Functions
      //****************************************************************************************************/

      public void Initialize (int rowCount)
      {
         this._DataPointCollection      = new Clustering.Library.DataPoint[rowCount];
         for (int i=0; i<this._DataPointCollection.Length; i++)
         {
            this._DataPointCollection[i].Initialize();
         }

         this._ClusterIndexCollection   = new int [rowCount];
         for (int i=0; i<this._ClusterIndexCollection.Length; i++)
         {
            this._ClusterIndexCollection[i]   = -1;
         }

         this._ClusterIndexCollectionCollection   = new System.Collections.ArrayList [CLUSTER_COUNT];
         for (int i=0; i<this._ClusterIndexCollectionCollection.Length; i++)
         {
            this._ClusterIndexCollectionCollection[i]   = new System.Collections.ArrayList();
         }

         this._MeanCollection   = new Clustering.Library.DataPoint [CLUSTER_COUNT];
         for (int i=0; i<this._MeanCollection.Length; i++)
         {
            this._MeanCollection[i].H   = 0D;
            this._MeanCollection[i].S   = 0D;
            this._MeanCollection[i].V   = 0D;
         }

         this._MeanOldCollection   = new Clustering.Library.DataPoint [CLUSTER_COUNT];
         for (int i=0; i<this._MeanOldCollection.Length; i++)
         {
            this._MeanOldCollection[i].H   = 0D;
            this._MeanOldCollection[i].S   = 0D;
            this._MeanOldCollection[i].V   = 0D;
         }

         // This forced clean up is useful if the Compute function is called multiple times on the same object
         //   Frees up space taken up by previous calls
         System.GC.Collect();
      }

      public void Compute (Clustering.Library.Image image)
      {
         int            indexDataPoint      = 0;
         double         distanceCurrent      = 0.0D;
         double         distanceClosest      = 0.0D;
         int            clusterIndex      = -1;
         int            displacement      = 0;
         int            clusterIndexTemp   = 0;
         System.Random   random            = null;

         // Initialize
         this.Initialize(image.SizeInPixels);
         //   Data
         for (int i=0; i<image.SizeInPixels; i++)
         {
            this._DataPointCollection[i].H   = image.H[i];
            this._DataPointCollection[i].S   = image.S[i];
            this._DataPointCollection[i].V   = image.V[i];
         }

         // Create initial clusters by assigning one random row to each
         random   = new System.Random();
         for (int i=0; i<CLUSTER_COUNT; i++)
         {
            indexDataPoint      = random.Next(0, this._DataPointCollection.Length);
            while (this._DataPointCollection[indexDataPoint].ClusterIndex >= 0)
            {
               indexDataPoint      = random.Next(0, this._DataPointCollection.Length);
            }

            this.AssignDataPointToCluster(i, indexDataPoint);
            this.ComputeClusterMean(i);
         }

         // Assign all unassigned data points to the relevant cluster
         for (int i=0; i<this._DataPointCollection.Length; i++)
         {
            // Filter out only unassigned data points
            switch (this._DataPointCollection[i].ClusterIndex)
            {
               case -1:
               {
                  clusterIndex   = -1;
                  for (int j=0; j<CLUSTER_COUNT; j++)
                  {
                     switch (j)
                     {
                        case 0:
                        {
                           distanceClosest   = this.EuclideanDistance(ref this._DataPointCollection[i], ref this._MeanCollection[j]);

                           clusterIndex   = j;

                           break;
                        }
                        default:
                        {
                           distanceCurrent   = this.EuclideanDistance(ref this._DataPointCollection[i], ref this._MeanCollection[j]);

                           if (distanceCurrent < distanceClosest)
                           {
                              distanceClosest   = distanceCurrent;

                              clusterIndex   = j;

                              break;
                           }

                           break;
                        }
                     }
                  }

                  // No validation checks needed since assigned data points have already been filtered out
                  //   and clusterIndex will always hold a valid value
                  displacement++;
                  this.AssignDataPointToCluster(clusterIndex, i);
                  this.ComputeClusterMean(clusterIndex);

                  break;
               }
            }
         }

         // All data points have already been assigned to some cluster
         //   This loop re-assigns each one
         while (true)
         {
            displacement   = 0;
            for (int i=0; i<this._DataPointCollection.Length; i++)
            {
               clusterIndex   = -1;
               for (int j=0; j<CLUSTER_COUNT; j++)
               {
                  switch (j)
                  {
                     case 0:
                     {
                        distanceClosest   = this.EuclideanDistance(ref this._DataPointCollection[i], ref this._MeanCollection[j]);

                        clusterIndex   = j;

                        break;
                     }
                     default:
                     {
                        distanceCurrent   = this.EuclideanDistance(ref this._DataPointCollection[i], ref this._MeanCollection[j]);

                        if (distanceCurrent < distanceClosest)
                        {
                           distanceClosest   = distanceCurrent;

                           clusterIndex   = j;

                           break;
                        }

                        break;
                     }
                  }
               }

               // Avoid redundant reassigning to same cluster
               if (this._DataPointCollection[i].ClusterIndex != clusterIndex)
               {
                  // The data point was already assigned to another cluster
                  //   Due to reassignment, the From and To cluster Mean values
                  //   will be affected so we need to recalculate new Mean values
                  //   for both
                  displacement++;
                  clusterIndexTemp   = this._DataPointCollection[i].ClusterIndex;
                  this.AssignDataPointToCluster(clusterIndex, i);
                  this.ComputeClusterMean(clusterIndex);
                  this.ComputeClusterMean(clusterIndexTemp);
               }
            }

            // I'm probably wrong but I figured that zero displacement could act as a potential
            //   terminating condition. Not sure what to do here
            if ((clusterIndex == -1) || (displacement == 0))
            {
               break;
            }
         }
      }

      private void AssignDataPointToCluster (int indexCluster, int indexDataPoint)
      {
         int   index   = 0;

         if (this._DataPointCollection[indexDataPoint].ClusterIndex >= 0)
         {
            // Remove data point index from old cluster array list
            index   = this._ClusterIndexCollectionCollection[this._DataPointCollection[indexDataPoint].ClusterIndex].BinarySearch(indexDataPoint);
            this._ClusterIndexCollectionCollection[this._DataPointCollection[indexDataPoint].ClusterIndex].RemoveAt(index);
         }

         this._DataPointCollection[indexDataPoint].ClusterIndex   = indexCluster;
         this._ClusterIndexCollection[indexDataPoint]         = indexCluster;   // one entry per data point, holding that point's cluster index

         index   = this._ClusterIndexCollectionCollection[indexCluster].BinarySearch(indexDataPoint);
         if (index < 0)
         {
            this._ClusterIndexCollectionCollection[indexCluster].Insert(~index, indexDataPoint);
         }
      }

      // Passing the parameters by reference is good to avoid unnecessary copying since we are using structs
      private double EuclideanDistance (ref Clustering.Library.DataPoint dataPoint1, ref Clustering.Library.DataPoint dataPoint2)
      {
         double   result   = 0.0D;

         result
            = System.Math.Pow(dataPoint1.H - dataPoint2.H, 2.0D)
            + System.Math.Pow(dataPoint1.S - dataPoint2.S, 2.0D)
            + System.Math.Pow(dataPoint1.V - dataPoint2.V, 2.0D)
            ;

         return (System.Math.Sqrt(result));
      }

      // NOTE: indexCluster is currently unused; every call recomputes the means of all
      //   CLUSTER_COUNT clusters from scratch, which is where most of the running time goes
      private void ComputeClusterMean (int indexCluster)
      {
         for (int i=0; i<CLUSTER_COUNT; i++)
         {
            this._MeanOldCollection[i].H   = this._MeanCollection[i].H;
            this._MeanOldCollection[i].S   = this._MeanCollection[i].S;
            this._MeanOldCollection[i].V   = this._MeanCollection[i].V;
         }

         for (int i=0; i<CLUSTER_COUNT; i++)
         {
            this._MeanCollection[i].H   = 0D;
            this._MeanCollection[i].S   = 0D;
            this._MeanCollection[i].V   = 0D;
         }

         for (int i=0; i<CLUSTER_COUNT; i++)
         {
            for (int j=0; j<this._ClusterIndexCollectionCollection[i].Count; j++)
            {
               this._MeanCollection[i].H   += this._DataPointCollection[(int) this._ClusterIndexCollectionCollection[i][j]].H;
               this._MeanCollection[i].S   += this._DataPointCollection[(int) this._ClusterIndexCollectionCollection[i][j]].S;
               this._MeanCollection[i].V   += this._DataPointCollection[(int) this._ClusterIndexCollectionCollection[i][j]].V;
            }
         }

         for (int i=0; i<CLUSTER_COUNT; i++)
         {
            switch (this._ClusterIndexCollectionCollection[i].Count)
            {
               case 0:
               {
                  this._MeanCollection[i].H   = 0D;
                  this._MeanCollection[i].S   = 0D;
                  this._MeanCollection[i].V   = 0D;

                  break;
               }
               default:
               {
                  this._MeanCollection[i].H   /= this._ClusterIndexCollectionCollection[i].Count;
                  this._MeanCollection[i].S   /= this._ClusterIndexCollectionCollection[i].Count;
                  this._MeanCollection[i].V   /= this._ClusterIndexCollectionCollection[i].Count;

                  break;
               }
            }
         }
      }

   }
}
 
raheelasadkhanAuthor Commented:
I tuned the code further to increase performance and it's still useless :(.

It took:
10 seconds for a 100x100 image into 20 clusters
116 MINUTES for a 640x480 image into 20 clusters!!!

How long do you reckon an average implementation should take to cluster a 640x480 image?
 
AGBrownCommented:
Khan,

I'll have a look at this when I get a chance, but it might not be until Friday as this week is pretty hectic.

Performance with this will be a problem, but it is an exceptionally easy problem to multithread, which would bring the times down almost proportionally to the number of CPUs running the problem (in the scenario that all the points are reasonably homogeneously distributed).
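
To give a feel for how that split works: each thread takes a contiguous slice of the points array, and because every thread writes only to its own slice of clusterList, no locking is needed during the assignment pass. A rough sketch (AssignRange is a hypothetical method that runs the existing "find closest mean" loop over a sub-range; the anonymous delegate needs C# 2.0, and System.Threading must be imported):

int threadCount = 2;                              //      e.g. the number of CPUs
Thread[] workers = new Thread[threadCount];
int chunk = NUM_POINTS / threadCount;
for (int t = 0; t < threadCount; t++)
{
      int start = t * chunk;
      int end = (t == threadCount - 1) ? NUM_POINTS : start + chunk;
      workers[t] = new Thread(new ThreadStart(delegate { AssignRange(start, end); }));
      workers[t].Start();
}
for (int t = 0; t < threadCount; t++)
      workers[t].Join();                          //      wait for every slice before recomputing the means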

What kind of performance do you need? How long would you like a 100x100 image into 20 clusters to take? Are we talking milliseconds, seconds, or minutes?

Andy
 
raheelasadkhanAuthor Commented:
Well, I'm not familiar with other working implementations of K-Means so I have nothing to compare to. If you have come across any, what kind of times are considered average?

You've already done a lot to help me out on this and the code is complete so I'm accepting your answer. Performance issues can be dealt with in another question.

Thanks again.

Khan
 
AGBrownCommented:
I'm not sure about performance for this. At a guess, I would imagine that it will run significantly slower in C# than it would in a lower-level language (such as assembler ;) ). If you look at a lot of data-processing algorithms, they are usually written in C or C++ (not managed C++), or have sections written in assembler (take GROMACS as an example) to take advantage of the performance that offers; they often interact directly with the particular hardware that they are using as well, so they might use SSE instructions etc. to really speed things up.

That said, your biggest initial gains would come from multi-threading, if that is something that is appropriate for your situation. If you are building something for customers to use on normal machines though, then that isn't going to work. Mind you, requiring 1.5 GB of RAM to process a few million points isn't going to be that practical either for a shrink-wrapped piece of software.

There may be a way to speed this up further, such as not having to iterate through every cluster to find the closest cluster for each point. As it stands, that part of the algorithm implies a first order complexity, something like O(c x n) (c = number of clusters, n = number of points), which is pretty nasty. Doing things such as setting your arraylists to the correct size from the beginning will help speed things up, as each increase in size of an arraylist is a pretty slow thing to do when your arraylists are already large.
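
For example, if you expect roughly pixelCount / CLUSTER_COUNT points per cluster, pre-sizing the lists avoids the repeated grow-and-copy as each one fills up (the names here are just illustrative):

int expectedPerCluster = pixelCount / CLUSTER_COUNT;
ArrayList clusterMembers = new ArrayList(expectedPerCluster);      //      capacity constructor: no regrowth until this is exceeded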

I'll get some time over the weekend and I'll take a look at what you've got. At a glance it looks pretty good already though, and there will be a limit to the performance you can get from managed code, which needs to be weighed up against the advantages of using managed code in the first place.

If you have any follow-up questions to this, could you paste links to them in this question? That way I'll get notified of them and can take a look for you, and it will also provide a complete trail for anyone interested in this in the future.

A
 
AGBrownCommented:
Khan,

Are you testing with random data sets, or particularly specific ones? If the latter then if you post a few representative data sets that you are using in plain text format (e.g. csv or tab-separated format) on the web somewhere, I can put them through my local implementation.

Andy.
 
raheelasadkhanAuthor Commented:
Andy,

Regarding your last question, I'm dealing with random images including bmp, jpg, gif, etc. In fact, to be on the safe side, I generate my own random images for each test. There is no specific size preference. The only constant element is that my code will target about 20 clusters.

Something rather strange happened while testing my code. I had imagined that using 10 clusters instead of 20 would reduce the time significantly, but it did not make a difference at all! I tested with images as small as 100x100 and up to 600x600, resulting in 8 seconds and 2 hours respectively.

Since you have yet to use the above code, use the one at this link instead (http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_21880086.html). It has all the dependent code, to save you time. We can continue discussing here or on that post; it's totally up to you.

You are absolutely right about the loss of performance with managed code, and I agree that we can only increase performance so much using C#. Given this, I am also considering a VC7 (unmanaged) implementation. I also wonder how much of an increase we can get using unsafe code from within C#. My guess is not much, since although the memory manager will not interfere, we will still have IL instead of native code.

With respect to multi-threading, I had thought about this but since this application is targeting the average home machine, it's pointless to optimize for multi-processor systems.

Khan
 
raheelasadkhanAuthor Commented:
Andy,

Some very interesting points came up in the other thread (http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_21880086.html). They will probably increase performance drastically.

Just thought I'd let you know in case you want to have a look.

Khan