wsyy

asked on

Correlation algorithm

Hi,

I have two sets of strings, some of which overlap. Say, at time T:

Set A = {"a", "b", "c", "d", "e"}
Set B = {"e", "f", "g"}
Overlap={"e"}

At time S, they are:

Set A = {"ab", "bd", "cga", "da1", "eda", "ka1", "ed2"}
Set B = {"cga", "fw2", "gae", "3e2"}
Overlap={"cga"}

The number of elements in each set varies by date, but over time the elements of each set should be relatively stable.

I want to design a Java program to determine, at time X, which set a string Y most likely belongs to. Let us say string Y appeared y1 times in set A and y2 times in set B in the past.

Any ideas?

Thanks!

for_yan

Sorry, I could not understand.

So you are saying that you have statistical data for previous times showing that a certain string appeared y1 times in A and y2 times in B.
Now you want to determine the probability of this string appearing in set A?
Well, that would probably be y1/(y1 + y2).
What is the significance of A and B having common strings?

Please, explain in a little bit more detail.
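
For reference, the y1/(y1 + y2) estimate suggested above can be sketched in Java as follows. This is only a minimal illustration: the class `SetMembershipEstimator` and its methods are hypothetical names, and returning 0.5 for a string that has never been seen is an assumption, not something stated in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not from the thread): counts past appearances per set.
class SetMembershipEstimator {
    // counts.get(s)[0] = times s was seen in A (y1); [1] = times seen in B (y2)
    private final Map<String, int[]> counts = new HashMap<>();

    // Record one dated snapshot of both sets.
    void observe(Set<String> setA, Set<String> setB) {
        for (String s : setA) counts.computeIfAbsent(s, k -> new int[2])[0]++;
        for (String s : setB) counts.computeIfAbsent(s, k -> new int[2])[1]++;
    }

    // Estimate P(y in A) as y1 / (y1 + y2).
    // Assumption: a string never seen before gets 0.5 (no evidence either way).
    double probabilityOfA(String y) {
        int[] c = counts.get(y);
        if (c == null) return 0.5;
        return (double) c[0] / (c[0] + c[1]);
    }
}
```

Fed only the time-T sets from the question, this gives 0.5 for "e" (once in each set) and 1.0 for "a" (seen only in A). It ignores any correlation between strings.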
wsyy

ASKER

How to define significance?

y1/(y1+y2) seems reasonable at first glance. However, similar words tend to appear together right?

I would think the likelihood has something to do with the other overlapping strings. So you are right.

But I don't know what to measure, or how to measure, the correlation between one specific string and the other strings that overlapped before and may or may not appear now.

Still, if the lists are formed independently, then the expectation of whether a certain string will appear in A or in B
will not be affected by the fact that it sometimes appears in both.

This requires more understanding of what kind of lists these are
and how they are being formed.
wsyy

ASKER

Sorry, for_yan, your response doesn't solve my issue.
I think you need to define the problem more clearly.

>However, similar words tend to appear together right?

Which words are similar, and do they really tend to appear together?
If those are arbitrary combinations of letters and digits, then similar words
would not appear together unless you impose a certain policy on their selection process.

If you know anything about the underlying operations (where these lists come from, etc.),
this may also help.

Otherwise, with this little information, it is very difficult to give you any
sensible recommendation.

 
TommySzalapski
If there is essentially a set number of words, you can build a correlation matrix. For each pair of words, count how many times they appear together and divide that by the total number of appearances to get a correlation score. That will be a very large matrix though.
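
A rough Java sketch of that idea follows. Two assumptions are made here that the post does not spell out: "total number of appearances" is read as the number of snapshots in which either word appears (which makes the score a Jaccard index), and the counts are kept in a sparse map rather than a dense matrix so that only pairs actually seen together take up memory. The class name `CooccurrenceScores` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: sparse pairwise co-occurrence counts over snapshots.
class CooccurrenceScores {
    private final Map<String, Integer> appearances = new HashMap<>();
    private final Map<String, Integer> together = new HashMap<>();

    // Order-independent key for a pair; '\u0000' is assumed not to occur in the words.
    private static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\u0000" + b : b + "\u0000" + a;
    }

    // Record one snapshot: every distinct pair of words in it "appears together" once.
    void observe(Collection<String> words) {
        List<String> distinct = new ArrayList<>(new TreeSet<>(words));
        for (int i = 0; i < distinct.size(); i++) {
            appearances.merge(distinct.get(i), 1, Integer::sum);
            for (int j = i + 1; j < distinct.size(); j++) {
                together.merge(pairKey(distinct.get(i), distinct.get(j)), 1, Integer::sum);
            }
        }
    }

    // Snapshots containing both words / snapshots containing either word.
    double score(String a, String b) {
        int both = together.getOrDefault(pairKey(a, b), 0);
        int either = appearances.getOrDefault(a, 0) + appearances.getOrDefault(b, 0) - both;
        return either == 0 ? 0.0 : (double) both / either;
    }
}
```

Each call to `observe` is still quadratic in the size of the snapshot, so this only helps with memory, not with the pairwise work itself.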

wsyy

ASKER

I don't think the matrix-based solution is feasible.

Are there any metrics that can replace the pair correlation?
There are certainly possible solutions that don't require pairwise comparisons. It all depends on how the correlation surfaces. If the strings can be grouped into buckets, and the presence of some number of strings in a bucket increases the chance of other strings in that bucket appearing, then the calculations could be very fast. It's all dependent on your application.
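
As a purely illustrative sketch of the bucket idea, assuming some grouping function exists: the code below counts how many strings from Y's bucket are currently in each set and picks the set with more of them. The `bucketOf` function, the class `BucketVote`, and the simple majority rule are all hypothetical choices; the thread does not say how strings would be bucketed.

```java
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: vote by how many of y's bucket-mates are in each current set.
class BucketVote {
    // Returns "A", "B", or "tie". bucketOf maps a string to its bucket label (assumed given).
    static String likelySet(String y, Set<String> currentA, Set<String> currentB,
                            Function<String, String> bucketOf) {
        String bucket = bucketOf.apply(y);
        long inA = currentA.stream().filter(s -> bucket.equals(bucketOf.apply(s))).count();
        long inB = currentB.stream().filter(s -> bucket.equals(bucketOf.apply(s))).count();
        if (inA == inB) return "tie";
        return inA > inB ? "A" : "B";
    }
}
```

This needs one pass over each set and no pairwise comparisons, which is where the speed would come from.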
wsyy

ASKER

Tommy, could you please provide some examples or point me to the right resources? thanks
Do the strings group naturally?
wsyy

ASKER

Sorry for the very late response!

No, I haven't grouped the strings in any natural way.
If there is no way to group them, then you'll have to do pairwise at some point. If the correlations remain consistent over time, then you will only need to do the big part once.
wsyy

ASKER

What does pairwise mean? How can I do so?

Sorry for the late response.
ASKER CERTIFIED SOLUTION
TommySzalapski