asked on

Probability Question

Hi all,

This is a DNA-based question, but I would really like the general solution and to understand the process in which to get it.

Strings of length n are made up of four letters {GCAT}

I can assume the letters have equivalent (0.25) frequencies, but I would be interested in seeing a solution that doesn't assume that.

I would like to be able to calculate the probability of finding a specified substring within a string of length n. For instance, what is the probability of finding a given 8-letter string in a overall string of length 200?

There are three complicating factors:
1. the 8-letter string may be in either orientation -- forwards or reversed
2. I will have to deal with non-contiguous strings. For instance:
xxxNxxxNxx
where x = a specified letter and N = any letter whatsoever
3. Most problematically, I need to account for the pairing of the genetic code. That is for each string, a separate string exists with the following substitutions:
G <==> C
A <==> T
That is, if we have the string: GCGTG
The following additional string must also be considered: CGCAC

Can I analytically calculate the probability of finding some m-length string in an n-length string, taking into account all three complications? Gut instinct suggests that factors #1 and #3 each double the search size, while factor #2 can be ignored but I do not have enough probability-instinct to trust that! :)

-Tats

GwynforWeb

Hi Tats,
I am not sure I am going to have time to look at this today, there are other guys on this site who can do this. A of point of clarification is in xxxNxxxNxx can N also be any Number of letters (eg xxxNNNxNxxNNxxx). Point 2 can not be ignored and is the more difficult of the constrtaints to acount for.

I should also mention that there is a rapier fast algorithm for finding these strings in a given string, known as the longest common substring algorithm and can be found in most good algorithm books and on the web.

Cheers Gwynforweb

Itatsumaki

ASKER

Hi GfW,

In this case, an N really just refers to an ambiguous identification rather than an insertion/deletion (indel). Thus xxNxx has exactly five bases, but only the identity of the "outside" four are known.

I'll start reading up on the algorithm you mentioned. Currently I am doing matching in PERL via reg-ex's, which is slow but manageable in strings the size I'm using (~10000 letters in one string, ~20 in the other). The problem is, I am unable to assign probabilities of random occurrence to my results. I can always do so via permutation or simulation, but I would really prefer an analytical solution if one exists.

Regards,
Tats

GwynforWeb

Hi tats,
I am not clear about the xxNxx business, are the ambiguous N's in the smaller string or the larger string to be searched. If they are in the larger string then information about how they are distributed will be needed. This problem could be very complex.

If you are in essense you are looking for consequative matches then you want the Boyer-Moore string matching algorithm not the one I mentioned, but if reg-ex's are doing the job ok I would stick with that.

Cheers Gwynforweb

Itatsumaki

ASKER

Hi GfW,

The ambiguous Ns theoretically can be in either string, but I only wish to account for them in the shorter one. Thus the long string will be:
{GCAT} (10^5 letters)
The short string will be (for example):
TNCGAGTCTNT

I have written the software to go through and find all occurrences of the shorter string within the longer one, accounting for both directions and the fact that each string really has two meanings (points #1 and #3 in my initial question).

I just can't figure out how _significant_ my results are! If I find 10 occurrences of the short string in the longer one, is significantly more or less than might be expected by chance?

Does this help?
-Tats

GwynforWeb

Tats, I have not had time to give this question the time it deserves. The extra specifications are not a problem but even in its pure form, ie given a string what is the probability of occurance in the larger string, I can not decide if this is really simple or very very complicated (at the moment I think the latter)

GwynforWeb

Tats,
I gave this one some thought last night . This is very complicated and I ended up with a bunch of recurrance relationships (which I was not very confident of). I scanned the web and found nothing of any use. I suggest you run a simulation program to test your results.

The probality of the string occuring at all is a hard problem, the signifcance of a certain number of occurances is very complicated indeed. I find it difficult to believe that some thing has not been done on this some where. I would ask around Gnome/DNA people or scan some Journals if you have access to them. ( If nothing really has been done we might have a paper here, :-) )

GfW

mc_barron

Interesting discussion here. Not sure if my two cents will add value, but here goes:

Couldn't we look at the problem like this (starting out simple):

[Really simple example]
Search term: 'G'
Search space: 'ATCGAT'
Actual search terms: 'G' and 'C'
Actual search space: 'A', 'T', 'C', 'G', 'A', 'T'
2 search terms in a space of 6.
1/4 probability that a search term will match a search space. (0.25*length of search string)
12 matches to try (2 search terms * search space of 6)
12 * 1/4 = 3
There should be 3 matches on average.
[/End Really simple example]

[Slightly more complicated example]
Search term: 'GA'
Search space: 'ATCGAT'
Actual search terms: 'GA', 'AG', 'CT', 'TC'
Actual search space: 'AT', 'TC', 'CG', 'GA', 'AT'
4 search terms in a space of 5
1/8 probability that a search term will match a search space (0.25*length of search string)
20 matches to try (4 search terms * search space of 5)
20 * 1/8 = 2.5
There should be 2.5 matches on average
[/End Slightly more complicated example]

[Complicated example]
Search term: 'GAN'
Search space: 'ATCGAT'
Actual search terms: 'GAG', 'GAC', 'GAA', 'GAT' (*2 for reverse)(*2 for inverted letters)
Actual search space: 'ATC', 'TCG', 'CGA', 'GAT'
16 search terms in a space of 4
1/16 probability that a search term will match a search space (0.25*length of search string)
64 matches to try (16 search terms * search space of 4)
64 * 1/16 = 4
There should be 4 matches on average
[/End Complicated example]

I am not a mathematician (nor statistician), but in my mind this "feels" right.
None-the-less, a very interesting discussion...keep me posted on how this one turns out.

Mike
mc_barron at hotmail dot com

GwynforWeb

Mike,
A tempting idea but unfortunately a match or lack of one effects the probablity of a match in the next set of letters in the search space.

Itatsumaki

ASKER

Hi GfW,

Thanks for your consideration on this one. Let me ask this: what is the biggest problem in solving the problem? Is it the non-consecutive character of the probe string? In other words, is the general problem of "how many times would a substring be found within another string, given specified letter-frequencies" essentially unsolved?

Indeed, I haven't find anything on this in a literature search, though I felt that was because I didn't know the correct search terminology. I've begun implementing the permutation analysis. Any suggestions on how to fit my results against the observed null distribution? I could just do say 10k permutations and say "matches with the appropriate characteristics were only found in 0.1% of random permutations, therefore the significance level is p=0.999". Or is that overly simplistic?

-Tats

Itatsumaki

ASKER

Mike,

Indeed that was my first guess, too, but when I tried estimating things that way I got "expected numbers" that were massively out-of-whack with real-life data. That's when I noted GfW's point about overlapping search windows (where a lack of match in one window alters the probability for an adjacent one). Thanks for the idea, though.

-Tats

ASKER CERTIFIED SOLUTION

GwynforWeb

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Itatsumaki

ASKER

Thanks GfW, that makes sense. One last question, is there any way to calculate a probability that the experimental result is non-random based on the expected value that I simulate? In other words, can I assess the significance of the actual value in context of the expected (estimated) one?

-Tats

GwynforWeb

Tats,
To get a significance value of some sort you will need a distribution showing the probability of a given number of occurances of your string in a larger random string. This distribution will be a function of the smaller string and the length of the larger string. To produce this distribution computationally could be a time consuming task, (I would imagine this distribution is probably normal or binomial in nature)

If your longer string is length say 100,000 the you would have run a large number of tests on different random strings of length 100,000 counting the number of occurances of the test string. This will then allow you to build up the distribution. If your smaller string is a lot 'smaller' than the longer string then this distrubution will be very close to being a spike around the mean, (ie a delta function), and you will probably not have to run too many tests. The longer the smaller string is the more tests will be needed as the distribution will have a higher variance.

Once you have the distribution you can then do standard significance tests based on the number of standard deviation points you are from the mean. There are also what I call common sense tests based on intuition, all the mathematical tests are merely a formalisation of this anyway.

Computationally I think that using regex may not do the job, but give it a go you never know (it will save you time if it does) . You may have to write some more sophisticated code. You could get away with the string searching facitlities that come with most languages (eg Basic, C++, Lisp, Perl etc). The best search string algorithm is Boyer-Moore and there is a tons of canned code on the web for it, (I am not sure but most languages implement this as part of there string handling facilities anyway)

Cheers

GfW

GwynforWeb

I said "I would imagine this distribution is probably normal or binomial in nature" I think these distributions are more complex than this

jbuttery

>3. Most problematically, I need to account for the pairing of the genetic code. That is for each string, a separate string exists >with the following substitutions:
>G <==> C
>A <==> T
>That is, if we have the string: GCGTG
>The following additional string must also be considered: CGCAC

Won't you also need occurrences of:

GGCAC
GCCAC
CGCAC
CCCAC
etc.

or am I not understanding your substitution table? Do you mean that if a substitution occurs, all substitutions occur? Or can some, many or all substitutions occur?

I think if large amounts of historical data are available you will be able to get an approximate probability value to test the results of your formula. There are several approaches that will work. If the starting and ending sequences are known it will narrow down how many tests to run.

wytcom

Starting with the simplest part of this:

The probability of finding substring of length U in a string of length S is 1 - [1 - (1/4)^U]^(S - U + 1)

Each string is composed from an alphabet of 4 characters that are equally likely to occur at any position.

Example: The probability of finding a substring of length 8 in a string of length 200 is 1 - [1 - (1/4)^8]^(200 - 8 + 1) = 0.00294

Next we can take a look at the complications.

GwynforWeb

wytcom, Not that simple unfortunately, GfW

SOLUTION

wytcom

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

wytcom

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Itatsumaki

ASKER

Oh geez, I can't believe this was still unlocked. I'm so sorry guys.

First off, thanks for the answers. As best as I can determine, this problem is largely unsolved. The way people usually work on it is by simulating the underlying DNA with some model (e.g. HMM or conservation of higher-order statistics like pairs or triplets of letters). A more straight-forward probability approach like this one seems to work just as well as other methods, except in regions of highly irregular base-content. Why that might be, I'm not sure.

I'll post here again when I've done more work on this, if you guys are still interested.

Sorry again,
-Tats