Cluster Point

Hi Experts,

I just need to know how do I find cluster point from scattered points. i.e. which point have very less distance from other points.

Obviously using Excel.

Thanks
LVL 8
Who is Participating?

Commented:
Here is an example in Excel using formulas.
The Within D column is how many points are within that distance of the point. (You can change the max difference in cell F2 and it will update everything).

The Deviation is the distance from the point to the average of the cluster (lower is better).
The Score is the cluster size/(deviation+1). The higher the score, the better the cluster point. A high score means the point is in the middle of a large cluster.

In your sample data, 54 is clearly the best candidate "cluster point"
cluster.xlsx
0

Commented:
itjockey,
Could you please post a sample workbook and describe the results you hope to receive from the macro?

I am guessing that you have a scatter chart (as opposed to line chart). I am also guessing that you want to distinguish points that are clustered as opposed to scattered.

One approach you might use is to establish criteria for clustered points, such as having more than 6 other points within a distance of 10% of square root(X-axis scale^2 + Y--axis scale^2). This criteria assumes that the scale (by which I mean the difference between maximum and minimum scale values on the chart) has comparable values on X-axis and Y-axis.

If X-axis and Y-axis have drastically different scale values, then you might use an alternative criteria, such as more than 6 other points within both 7% of X-axis scale and 7% of Y-axis scale.

The code will be easy once we agree on how to distinguish between clustered and scattered. That said, the number of points on your scatter chart will greatly affect how long it takes to make the calculations. This is because each of the n points on a chart must be compared against all n-1 other points. So the difficulty increases as the square of the number of points on the chart.

0

There is no x axis, for understanding sake there is only y axis which represents prices I.e. Points.
I need to find out which price point is minimum distance from all prices. If I average it out then result will not perfect as say 7 points are at 3% near by and 3 points are 10% away from these 7 points. So my average give me false result.i guess you understood my point.
there is no sample spreadsheet it is just thought. Criteria for find cluster point 10% is enough. You can take any number of random numbers. Still you want sample pls let me know.
One personal question.if you wish to share.
What is the meaning of "byundt".i guess there is some meaning behind that.wild guess ancient Egyptian civilization - Pyramid and all that stuff ;-)

Thank You
0

Commented:
itjockey,
If you are looking at prices, then I assume each category item (X-axis) has to be considered separately. If so, you might consider calculating the mean (average) and standard deviation for the prices on each item. Prices that are within two standard deviations of the mean might be considered "clustered," while the ones farther from the mean than that would be considered "scattered."

While I could generate data using random numbers, it would certainly be considered "scattered." And if I tried to create two populations of data, one that are clustered and the other that are scattered, then whatever assumptions I made to create the data would be echoed in the criteria to distinguish clustered from scattered. For this reason, I am reluctant to generate test data and hope that you might have real data to better focus this discussion.

Rather than trying to distinguish clustered from scattered data, another question you might ask is whether certain points should be considered "outliers." There are well-established techniques for doing so in statistics. Wikipedia discusses various means for distinguishing outliers in http://en.wikipedia.org/wiki/Outlier.

Even if we can agree on how to distinguish clustered from scattered data, I'm still not sure how you want to use that distinction. What should the results be from any macro?

Vis-Ã -vis your question on my screen name, the truth is more prosaic than pyramids. You can probably figure it out from this link.

0

You took my question to new dimension. one quote comes in my mind "universe is not that big that you can imagine but it is that big that you cant  imagine". Sir you have too much of knowledge & may others have too. but best part of you is "you believe in sharing".Sir your each & every post is very  help full & knowledge added. it may obvious for you but for me it is add to knowledge part. frequently I saw your solutions which you solved for different authors.

Now back to question - there is no chart requirement but only for understanding. want to find cluster point (not cluster points). so In attached screen there is 2 Outlier cluster & one is actual cluster. from that cluster points want to find cluster point.

Criteria for finding cluster point - say X is cluster point then X have minimum % distance from all points excluding outlier (10 % variation).
Apology if I wrote anything wrong as I am not good writer & English is not my native language. and I am just Account graduate so may be some time I don't understand - statistic - coding which is obvious for you.

Thank You
Book1.xlsx
0

Commented:
Since you are doing this in one dimension, when you have a cluster, the "cluster point" that has the least total distance from all the other points will be the point closest to the average of all the values.

For finding clusters, if you have a minimum distance D to consider, then you can just test each point and see how many points are within D of the point.
0

Mr.TommySzalapski,

What if in case of i want count Difference in % not in point term. as in cell F2=10% rather than 8 points.

Thanks
0

Commented:
It depends on how you define 10% near.

Instead of B2-\$F\$2, you could use B2-(B2*\$F\$2) and put 10% in F2. (and do the same for the + case also). But then the farther a cluster is from 0 the more clustered you will think it is because the 10% is higher there. If that makes sense, then good.

If it should be consistent across the whole dataset, then maybe you should keep what I used but set F2 to be MAX(B:B)*.10 or something.

What do you mean by 10% near?
0

I just want calculate based on % least difference from points not numerical point base (Cell F2= 10% not like 8). as this is only example. if in case points values in column B is multiple of 1000 then max difference 8 in Cell F2 doesn't make any sense.

See attached I had done what you said. & produced proper result.

Thank You
cluster.xlsx
0

Sir byundt / Mr. TommySzalapski,

Anything more you want to share or anything which I have to take in account in future. as I thing I got my answer. may I accept this question?

Thanks
0

Commented:
Just remember that when you use percents in that way it will not find many clusters near 0.
You could use 3% of the max value instead of 3% of the current point. It all depends on which makes sense for the data.

0

Got it !!!

Thank You Very Much
0

Actually I am in dilemma how to split points.

Thanks
0

Sir byundt / Mr. TommySzalapski,

Any Views On This Question "Math Cross Check"

if have spare time then pls help me on this.

Thank You
0