# Statistical test to find within class similarity

Hi!

I need your help finding the name of a statistical test. I used the Welch t-test to compare the classes, because the class sizes were different. However, I now need to measure within-class similarity. I would appreciate it if you could advise on the name of a suitable test.

Thanks!

Commented:
Does correlation coefficient fit your needs?

https://en.m.wikipedia.org/wiki/Correlation_coefficient

Commented:
Sorry for the late response. Would this work with unequal sample sizes or variances? Thanks

Commented:
Sorry, the sample sizes are the same.

Commented:
>> I now need to find within class similarity
If this is the objective then what does "unequal sample size/variance" mean?
And, what sort of "similarity" measure is needed?

Commented:
@Fred,
Collect 200 apples from one field and weigh each one to get the mean weight and the weight variance. Do the same for another field, except that you collect only 150 apples.

For the two collections, the sample size, the mean weight, and the weight variance are all different.

Commented:
Thanks @phoffric and @Fred

@phoffric, would this by any chance help to reduce the variance? It seems so, since some normalisation is applied to the data so that it can be used. I found this in: http://geog.uoregon.edu/GeogR/topics/correlation.pdf

Commented:
@phoffric

Using your example, would it be possible to compute a correlation between 150 of the 200 apples collected from field 1 and the 150 apples from field 2? I would do this test twice: in round 2, I would again take 150 apples from field 1, making sure that the 50 apples not tested in round 1 are included, and compare them with the 150 apples from field 2. Is this statistically acceptable? Thanks!
Commented:
>> need to find within class similarity.
Could you clarify or expand upon this?

>> to reduce the variance

One way to reduce the variance of a random variable is to remove the outliers. For example, if your model is a Gaussian distribution, you pick a threshold factor and discard all points that lie further from the mean than the threshold factor times the standard deviation. You then recalculate the mean and variance and keep repeating until their changes fall below your desired tolerances.
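The iterative outlier removal described above could be sketched roughly as follows. This is only an illustration, not a recommended procedure: the function name `sigma_clip`, the default threshold factor `k`, the tolerance `tol`, and the apple weights are all made up for the example.

```python
import statistics

def sigma_clip(values, k=3.0, tol=1e-9, max_iter=100):
    """Repeatedly drop points further than k standard deviations from the mean."""
    kept = list(values)
    mean = statistics.fmean(kept)
    sd = statistics.pstdev(kept)
    for _ in range(max_iter):
        if sd == 0:
            break  # all remaining points are identical
        trimmed = [v for v in kept if abs(v - mean) <= k * sd]
        if len(trimmed) < 2:
            break  # refuse to clip away nearly everything
        new_mean = statistics.fmean(trimmed)
        new_sd = statistics.pstdev(trimmed)
        # Stop once the mean and standard deviation have settled.
        converged = abs(new_mean - mean) < tol and abs(new_sd - sd) < tol
        kept, mean, sd = trimmed, new_mean, new_sd
        if converged:
            break
    return kept, mean, sd

# Hypothetical apple weights in grams, with one obvious outlier (400).
weights = [150, 152, 148, 151, 149, 153, 147, 150, 151, 149, 400]
kept, mean, sd = sigma_clip(weights, k=2.0)
print(len(kept), mean, sd)
```

Note that the choice of `k` matters a lot in small samples, where a single extreme point inflates the standard deviation it is being compared against.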

>> Do this test twice, where round 2, I take 150 apples from field 1 by ensuring the previously 50 apples that was not tested in round 1.

This is close to a standard requirement to verify that cluster analysis is accurate. But you are expected to perform many tests by randomly permuting your input data and then make the selection as you described, to confirm that all the results are close. An inadequate separation of two clusters may be discovered by repeating the random permutation sample sets many times.
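As a rough sketch of the repeated random selection described above, the snippet below shuffles the field 1 sample many times, takes 150 apples each time, and recomputes a statistic against field 2. The Welch t statistic is used only because it was mentioned in the question; the data are synthetic, and the names `welch_t` and `resampled_stats`, the seeds, and the number of rounds are assumptions made for illustration.

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal sizes and variances."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / (va + vb) ** 0.5

def resampled_stats(field1, field2, subset_size, rounds, seed=0):
    """Recompute the statistic on many random subsets of field1."""
    rng = random.Random(seed)
    results = []
    for _ in range(rounds):
        shuffled = field1[:]          # copy so the input is left untouched
        rng.shuffle(shuffled)         # random permutation of the input data
        results.append(welch_t(shuffled[:subset_size], field2))
    return results

# Synthetic apple weights: 200 from field 1, 150 from field 2.
rng = random.Random(42)
field1 = [rng.gauss(150, 10) for _ in range(200)]
field2 = [rng.gauss(165, 12) for _ in range(150)]

t_values = resampled_stats(field1, field2, subset_size=150, rounds=500)
print(min(t_values), max(t_values))  # a narrow range suggests stable results
```

If the statistic swings widely from one random subset to the next, the conclusion depends on which apples happened to be picked, which is exactly what the repetition is meant to reveal.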

>> some normalisation is applied to the data

This normalization is part of the definition and is required to keep the correlation coefficient in the interval -1 to +1. In this way the units of the two variables can be different without one kind of units dominating the result.

A value of +1.0 means that as one random variable increases, the other increases too, in a perfectly linear relationship. It doesn't mean that they have the same mean and variance.
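Both points can be checked with a few lines of code. The sketch below computes the Pearson correlation coefficient by hand; the apple weights and diameters are invented values, and `pearson_r` is just an illustrative helper name.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical measurements for five apples.
grams = [120.0, 135.0, 150.0, 165.0, 180.0]
diameter_cm = [6.1, 6.4, 7.0, 7.1, 7.6]

# Changing units (grams to ounces) leaves the coefficient unchanged.
ounces = [g / 28.3495 for g in grams]
r_grams = pearson_r(grams, diameter_cm)
r_ounces = pearson_r(ounces, diameter_cm)

# A perfectly linear transform gives r = +1 despite a very different
# mean and variance.
shifted = [3 * g + 1000 for g in grams]
r_linear = pearson_r(grams, shifted)

print(r_grams, r_ounces, r_linear)
```

So a high correlation says the two variables move together; it says nothing about whether two groups have similar means or spreads.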

How did you identify the two classes (clusters)?
