Solved

# Cluster Point

Posted on 2014-03-08
365 Views
Hi Experts,

I just need to know how do I find cluster point from scattered points. i.e. which point have very less distance from other points.

Obviously using Excel.

Thanks
0
Question by:itjockey
• 8
• 4
• 2

LVL 80

Expert Comment

itjockey,
Could you please post a sample workbook and describe the results you hope to receive from the macro?

I am guessing that you have a scatter chart (as opposed to line chart). I am also guessing that you want to distinguish points that are clustered as opposed to scattered.

One approach you might use is to establish criteria for clustered points, such as having more than 6 other points within a distance of 10% of square root(X-axis scale^2 + Y--axis scale^2). This criteria assumes that the scale (by which I mean the difference between maximum and minimum scale values on the chart) has comparable values on X-axis and Y-axis.

If X-axis and Y-axis have drastically different scale values, then you might use an alternative criteria, such as more than 6 other points within both 7% of X-axis scale and 7% of Y-axis scale.

The code will be easy once we agree on how to distinguish between clustered and scattered. That said, the number of points on your scatter chart will greatly affect how long it takes to make the calculations. This is because each of the n points on a chart must be compared against all n-1 other points. So the difficulty increases as the square of the number of points on the chart.

0

LVL 8

Author Comment

There is no x axis, for understanding sake there is only y axis which represents prices I.e. Points.
I need to find out which price point is minimum distance from all prices. If I average it out then result will not perfect as say 7 points are at 3% near by and 3 points are 10% away from these 7 points. So my average give me false result.i guess you understood my point.
there is no sample spreadsheet it is just thought. Criteria for find cluster point 10% is enough. You can take any number of random numbers. Still you want sample pls let me know.
One personal question.if you wish to share.
What is the meaning of "byundt".i guess there is some meaning behind that.wild guess ancient Egyptian civilization - Pyramid and all that stuff ;-)

Thank You
0

LVL 80

Assisted Solution

byundt earned 150 total points
itjockey,
If you are looking at prices, then I assume each category item (X-axis) has to be considered separately. If so, you might consider calculating the mean (average) and standard deviation for the prices on each item. Prices that are within two standard deviations of the mean might be considered "clustered," while the ones farther from the mean than that would be considered "scattered."

While I could generate data using random numbers, it would certainly be considered "scattered." And if I tried to create two populations of data, one that are clustered and the other that are scattered, then whatever assumptions I made to create the data would be echoed in the criteria to distinguish clustered from scattered. For this reason, I am reluctant to generate test data and hope that you might have real data to better focus this discussion.

Rather than trying to distinguish clustered from scattered data, another question you might ask is whether certain points should be considered "outliers." There are well-established techniques for doing so in statistics. Wikipedia discusses various means for distinguishing outliers in http://en.wikipedia.org/wiki/Outlier.

Even if we can agree on how to distinguish clustered from scattered data, I'm still not sure how you want to use that distinction. What should the results be from any macro?

Vis-Ă -vis your question on my screen name, the truth is more prosaic than pyramids. You can probably figure it out from this link.

0

LVL 8

Author Comment

You took my question to new dimension. one quote comes in my mind "universe is not that big that you can imagine but it is that big that you cant  imagine". Sir you have too much of knowledge & may others have too. but best part of you is "you believe in sharing".Sir your each & every post is very  help full & knowledge added. it may obvious for you but for me it is add to knowledge part. frequently I saw your solutions which you solved for different authors.

Now back to question - there is no chart requirement but only for understanding. want to find cluster point (not cluster points). so In attached screen there is 2 Outlier cluster & one is actual cluster. from that cluster points want to find cluster point.

Criteria for finding cluster point - say X is cluster point then X have minimum % distance from all points excluding outlier (10 % variation).
Apology if I wrote anything wrong as I am not good writer & English is not my native language. and I am just Account graduate so may be some time I don't understand - statistic - coding which is obvious for you.

Thank You
Book1.xlsx
0

LVL 37

Expert Comment

Since you are doing this in one dimension, when you have a cluster, the "cluster point" that has the least total distance from all the other points will be the point closest to the average of all the values.

For finding clusters, if you have a minimum distance D to consider, then you can just test each point and see how many points are within D of the point.
0

LVL 37

Accepted Solution

TommySzalapski earned 350 total points
Here is an example in Excel using formulas.
The Within D column is how many points are within that distance of the point. (You can change the max difference in cell F2 and it will update everything).

The Deviation is the distance from the point to the average of the cluster (lower is better).
The Score is the cluster size/(deviation+1). The higher the score, the better the cluster point. A high score means the point is in the middle of a large cluster.

In your sample data, 54 is clearly the best candidate "cluster point"
cluster.xlsx
0

LVL 8

Author Comment

Mr.TommySzalapski,

What if in case of i want count Difference in % not in point term. as in cell F2=10% rather than 8 points.

Thanks
0

LVL 37

Expert Comment

It depends on how you define 10% near.

Instead of B2-\$F\$2, you could use B2-(B2*\$F\$2) and put 10% in F2. (and do the same for the + case also). But then the farther a cluster is from 0 the more clustered you will think it is because the 10% is higher there. If that makes sense, then good.

If it should be consistent across the whole dataset, then maybe you should keep what I used but set F2 to be MAX(B:B)*.10 or something.

What do you mean by 10% near?
0

LVL 8

Author Comment

I just want calculate based on % least difference from points not numerical point base (Cell F2= 10% not like 8). as this is only example. if in case points values in column B is multiple of 1000 then max difference 8 in Cell F2 doesn't make any sense.

See attached I had done what you said. & produced proper result.

Thank You
cluster.xlsx
0

LVL 8

Author Comment

Sir byundt / Mr. TommySzalapski,

Anything more you want to share or anything which I have to take in account in future. as I thing I got my answer. may I accept this question?

Thanks
0

LVL 37

Expert Comment

Just remember that when you use percents in that way it will not find many clusters near 0.
You could use 3% of the max value instead of 3% of the current point. It all depends on which makes sense for the data.

0

LVL 8

Author Comment

Got it !!!

Thank You Very Much
0

LVL 8

Author Closing Comment

Actually I am in dilemma how to split points.

Thanks
0

LVL 8

Author Comment

Sir byundt / Mr. TommySzalapski,

Any Views On This Question "Math Cross Check"

if have spare time then pls help me on this.

Thank You
0

## Featured Post

Introduction This Article briefly covers methods of calculating the NPV and IRR variants in Excel as well as the limitations in calculating and interpreting IRR results. Paraphrasing Richard Shockley, author of my favourite finance reference texâ€¦
Workbook link problems after copying tabs to a new workbook? David Miller (dlmille) Intro Have you either copied sheets to a new workbook, and after having saved and opened that workbook, you find that there are links back to the original souâ€¦
The viewer will learn how to use the =DISCRINV command to create a discrete random variable, use this command to model a set of probabilities and outcomes in a Monte Carlo simulation, and learn how to find the standard deviation of a set of probabilâ€¦
This Micro Tutorial demonstrate the bugs in Microsoft Excel for Mac with Pivot Charts.