Correlation algorithm

Hi,

I have two sets of strings, some of which overlap. Say, at time T:

Set A = {"a", "b", "c", "d", "e"}
Set B = {"e", "f", "g"}
Overlap={"e"}

At time S, they are:

Set A = {"ab", "bd", "cga", "da1", "eda", "ka1", "ed2"}
Set B = {"cga", "fw2", "gae", "3e2"}
Overlap={"cga"}

The number of elements in each set varies by date, but given enough time we should see that the elements of each set are relatively stable.

I want to design a Java program to determine, at time X, which set a string Y most likely belongs to. Let us say string Y appeared y1 times in set A and y2 times in set B in the past.

Any ideas?

Thanks!

for_yan

Sorry, I could not understand.

So you are saying that you have statistical data from previous times showing that a certain string appeared y1 times in A and y2 times in B.
Now you want to determine the probability of this string appearing in set A?
Well, that would probably be y1/(y1 + y2).
What is the significance of A and B having common strings?

Please explain in a little more detail.
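
To make the y1/(y1 + y2) estimate above concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and returning 0.5 when the string has no history at all is an assumption, not something stated in the thread.

```java
class MembershipEstimate {
    /**
     * Estimates the chance that a string belongs to set A, given that it
     * was seen y1 times in A and y2 times in B in the past.
     * Assumption: with no history at all, return 0.5 (no preference).
     */
    static double probabilityOfA(int y1, int y2) {
        if (y1 < 0 || y2 < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        int total = y1 + y2;
        if (total == 0) {
            return 0.5; // assumed fallback when the string was never seen
        }
        return (double) y1 / total;
    }

    public static void main(String[] args) {
        // Seen 3 times in A and once in B.
        System.out.println(probabilityOfA(3, 1)); // prints 0.75
    }
}
```

This treats every past appearance as independent evidence and ignores which other strings were present at the time.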

wsyy

ASKER

How do you define significance?

y1/(y1+y2) seems reasonable at first glance. However, similar words tend to appear together right?

I would think the probability has something to do with the other overlapping strings. So you are right.

But I don't know what to measure, or how to measure the correlation between one specific string and the other strings that overlapped before and may or may not appear now.

for_yan

Still, if the lists are formed independently, then the expectation of whether a certain string will appear in A or in B
will not be affected by the fact that it sometimes appears in both.

This requires more understanding of what kind of lists these are
and how they are being formed.

wsyy

ASKER

Sorry, for_yan, your response doesn't solve my issue.

for_yan

I think you need to define the problem more clearly.

>However, similar words tend to appear together right?

Which words are similar, and do they really tend to appear together?
If those are arbitrary combinations of letters and digits, then similar words
would not appear together unless you impose a certain policy on their selection process.

If you know anything about the underlying operations (where these lists come from, etc.),
this may also help.

Otherwise, with this little information, it is very difficult to give you any
sensible recommendation.

TommySzalapski

If there is essentially a set number of words, you can build a correlation matrix. For each pair of words, count how many times they appear together and divide that by the total number of appearances to get a correlation score. That will be a very large matrix though.
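
A rough Java sketch of that pairwise counting, for illustration only. The class name and methods are hypothetical. Two assumptions go beyond what is said above: "total number of appearances" is read as the number of snapshots containing either word, and only pairs that actually co-occur are stored (a sparse map instead of a full matrix) to keep the size manageable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CoOccurrence {
    // Number of snapshots in which each word appeared.
    private final Map<String, Integer> wordCounts = new HashMap<>();
    // Number of snapshots in which both words of a pair appeared (sparse).
    private final Map<List<String>, Integer> pairCounts = new HashMap<>();

    // Order-independent key so (a, b) and (b, a) share one entry.
    private static List<String> key(String a, String b) {
        return a.compareTo(b) <= 0 ? Arrays.asList(a, b) : Arrays.asList(b, a);
    }

    /** Records one observed set (for example, set A at one point in time). */
    void addSnapshot(Set<String> snapshot) {
        List<String> words = new ArrayList<>(snapshot);
        for (int i = 0; i < words.size(); i++) {
            wordCounts.merge(words.get(i), 1, Integer::sum);
            for (int j = i + 1; j < words.size(); j++) {
                pairCounts.merge(key(words.get(i), words.get(j)), 1, Integer::sum);
            }
        }
    }

    /** Assumed score: snapshots with both words / snapshots with either word. */
    double score(String a, String b) {
        int together = pairCounts.getOrDefault(key(a, b), 0);
        int either = wordCounts.getOrDefault(a, 0)
                + wordCounts.getOrDefault(b, 0) - together;
        return either == 0 ? 0.0 : (double) together / either;
    }

    public static void main(String[] args) {
        CoOccurrence c = new CoOccurrence();
        c.addSnapshot(new HashSet<>(Arrays.asList("a", "b", "c")));
        c.addSnapshot(new HashSet<>(Arrays.asList("a", "b")));
        System.out.println(c.score("a", "b"));
    }
}
```

Each snapshot of n words costs O(n^2) pair updates, which is where the size concern comes from.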

wsyy

ASKER

I don't think the matrix-based solution is doable.

Are there any metrics that can replace the pair correlation?

TommySzalapski

There are certainly possible solutions that don't require pairwise comparisons. It all depends on how the correlation surfaces. If the strings can be grouped into buckets, and the presence of some number of strings in a bucket increases the chance of other strings in that bucket appearing, then the calculations could be very fast. It's all dependent on your application.
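
As one possible illustration of the bucket idea, here is a small Java sketch. Everything specific in it is an assumption: the class name, the scoring rule (the share of a string's bucket-mates currently seen in A), and above all the bucketing function, which the caller must supply because nothing above says how the strings would group. The example uses the first character purely as a placeholder.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class BucketScorer {
    private final Function<String, String> bucketOf; // hypothetical grouping rule
    private final Map<String, Integer> inA = new HashMap<>();
    private final Map<String, Integer> inB = new HashMap<>();

    BucketScorer(Function<String, String> bucketOf,
                 Set<String> currentA, Set<String> currentB) {
        this.bucketOf = bucketOf;
        // One pass over each set: count current members per bucket.
        for (String s : currentA) inA.merge(bucketOf.apply(s), 1, Integer::sum);
        for (String s : currentB) inB.merge(bucketOf.apply(s), 1, Integer::sum);
    }

    /** Share of y's bucket-mates currently in A; 0.5 if the bucket is empty. */
    double leanTowardA(String y) {
        String bucket = bucketOf.apply(y);
        int a = inA.getOrDefault(bucket, 0);
        int b = inB.getOrDefault(bucket, 0);
        return a + b == 0 ? 0.5 : (double) a / (a + b);
    }

    public static void main(String[] args) {
        Set<String> setA = new HashSet<>(Arrays.asList("ab", "ad", "ka1"));
        Set<String> setB = new HashSet<>(Arrays.asList("aq", "fw2"));
        // Placeholder bucketing: group by first character.
        BucketScorer scorer = new BucketScorer(s -> s.substring(0, 1), setA, setB);
        System.out.println(scorer.leanTowardA("ax"));
    }
}
```

Building the counts is linear in the size of the sets and each lookup is constant time, which is the speed advantage over comparing every pair.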

wsyy

ASKER

Tommy, could you please provide some examples or point me to the right resources? Thanks.

TommySzalapski

Do the strings group naturally?

wsyy

ASKER

Sorry for the very late response!

No, I haven't grouped the strings in any natural way.

TommySzalapski

If there is no way to group them, then you'll have to do pairwise at some point. If the correlations remain consistent over time, then you will only need to do the big part once.

wsyy

ASKER

What does pairwise mean? How can I do that?

Sorry for the late response.

ASKER CERTIFIED SOLUTION

TommySzalapski

This solution is only available to members.