Solved

compare two lists....

Posted on 2009-07-05
Last Modified: 2012-05-07
Hi guys, I have two files in the format below.


FILE-1:

>contig00001  length=11003   numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00002  length=110423   numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT

>contig00003  length=11023   numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC

FILE-2

>contig00001  length=15918   numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG


>contig00002  length=106210   numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT

>contig00003  length=106213   numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00004  length=1023433   numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC

The program must compare each sequence in FILE-1 with the sequences in FILE-2.

That is, take the first sequence in FILE-1 (only the sequence, not its name) and compare it with the sequences in FILE-2. If it matches any of the sequences in FILE-2, print both names: the name of the sequence being compared and the names of the sequences it matched.

One sequence can match many sequences.
One sequence may be part of another sequence in FILE-2.

Return all the matched sequences.

In the sample sequences here, the first sequence (contig00001) of FILE-1 exactly matches contig00003 of FILE-2, and it is also present in contig00004 as a subpart, so the output is

contig00001--------------------- contig00003, contig00004

contig00002 and contig00003 of FILE-1 do not match any sequence in FILE-2, so return

contig00002-------------- not matched
contig00003-------------- not matched



Guys, I asked a similar question before, but I did not state it clearly there, so to avoid confusion I deleted that question and am posting a new one.
Question by:shragi
16 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 24782531
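# Compare each sequence in FILE-1 against every sequence in FILE-2.
open F1,"<FILE-1" or  die "FILE-1 $!";
open F2,"<FILE-2" or  die "FILE-2 $!";
$/="";                      # paragraph mode: records are separated by blank lines
while( <F1> ){
    # $name is the contig name; $re is the sequence text after the header line
    my($name,$re)=/>(\w+).*?^(.*)^/ms;
    $re = qr/\Q$re/;        # match the sequence text literally
    seek F2,0,0;            # rewind FILE-2 for each FILE-1 sequence
    print "$name-------------- ";
    $,="";                  # output separator doubles as a "found a match" flag
    while( <F2> ){
        next unless /$re/;
        print "",/>(\w+)/;  # prints $, (empty or ", ") followed by the matching name
        $,=", ";
    }
    print $,?"\n":"not matched\n";
}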
 
LVL 6

Expert Comment

by:zlobcho
ID: 24783710
ozo, you are really a genius. good work.
 
LVL 5

Expert Comment

by:vikaskhoria
ID: 24783900
There is just one small problem in the solution.
You are reading F2 in a while loop inside the loop of F1, so once the file2 reaches end of file, it will stop reading.
For example say, the first line matches with only the last line (or towards end), then for the first line itself the entire second file will be read.
<F2> will reach EOF.

One way around this is open the file in the loop itself, just before while(<F2>). So that for every line from 1st file, all the lines from second file are read.
 
LVL 39

Expert Comment

by:Adam314
ID: 24785532
vikaskhoria - Notice this line inside the first while loop (looping over F1):
    seek F2,0,0;
This resets the file pointer to the beginning of the file for F2.
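
A minimal sketch of the rewind behaviour described above. It uses an in-memory filehandle with made-up data as a stand-in for FILE-2, so the names and contents here are illustrative only:

```perl
use strict;
use warnings;

# In-memory stand-in for FILE-2 (hypothetical data, not the poster's file).
my $data = "line1\nline2\n";
open my $fh, '<', \$data or die "open: $!";

my @first_pass  = <$fh>;    # reads every line up to EOF
my @at_eof      = <$fh>;    # empty list: the handle is still at EOF
seek $fh, 0, 0;             # rewind to byte 0, as ozo's code does for F2
my @second_pass = <$fh>;    # reads every line again

printf "%d %d %d\n", scalar @first_pass, scalar @at_eof, scalar @second_pass;
```

Without the seek, every pass after the first would see an empty file, which is the problem vikaskhoria was worried about.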
 

Author Comment

by:shragi
ID: 24785849
Hi ozo,

Your code works well for my sample sequences, but it's not working for my files.

I am attaching the files; can you please check with these files?


FILE-1.txt
FILE-2.txt
 

Author Comment

by:shragi
ID: 24788161
@ozo

Your code works fine if all the sequences are in one case, either upper or lower.

But I found an error: if it finds one match, it stops without searching for other matches.

Whether or not it finds a match, it should search the entire FILE-2.
 
LVL 84

Expert Comment

by:ozo
ID: 24788223
where did you find the error?
when I tested it on your example, it found both
contig00001-------------- contig00003, contig00004
 
LVL 84

Expert Comment

by:ozo
ID: 24788360
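# Unlike the file format in the question, FILE-1.txt and FILE-2.txt don't seem to have blank lines separating the sequences.
# Here is a revision that separates sequences on ">" instead.
# If you want to match either case, you can change qr/\Q$re/ to qr/\Q$re/i
# (F1 and F2 are opened as in the first version.)
$/=">";                     # each record now ends at the next ">"
while( <F1> ){
    # skip the empty first record; the name now sits at the start of the record
    next unless  my($name,$re)=/\A(\w+).*?^(.*)^/ms;
    $re = qr/\Q$re/;        # match the sequence text literally
    seek F2,0,0;            # rewind FILE-2 for each FILE-1 sequence
    print "$name-------------- ";
    $,="";                  # output separator doubles as a "found a match" flag
    while( <F2> ){
        next unless /$re/;
        print "",/^(\w+)/;  # prints $, (empty or ", ") followed by the matching name
        $,=", ";
    }
    print $,?"\n":"not matched\n";
}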
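
To illustrate what setting $/ to ">" does to the records, here is a small sketch on made-up data (the contig names c1 and c2 and their sequences are hypothetical). It assumes every header line starts with ">" and that ">" does not appear anywhere else:

```perl
use strict;
use warnings;

# Hypothetical two-record input with no blank lines between records.
my $fasta = ">c1 length=4\nACGT\n>c2 length=4\nTTGA\n";
open my $fh, '<', \$fasta or die "open: $!";

my @names;
{
    local $/ = ">";                      # read up to and including the next ">"
    while (my $rec = <$fh>) {
        chomp $rec;                      # chomp strips the trailing ">" (the current $/)
        next unless $rec =~ /\A(\w+)/;   # the first record is empty, so skip it
        push @names, $1;
    }
}
print join(", ", @names), "\n";
```

The first read returns only the leading ">", which is why the revised code above needs the "next unless" guard.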

 

Author Comment

by:shragi
ID: 24789209
Yes, it worked, but what about matching the subpart?


In the sample sequences, contig00001 of FILE-1 matches contig00004 of FILE-2 because contig00001 is a subpart of contig00004 of FILE-2.

But when I use this code on the files I uploaded, it searches for exact matches but not for subparts.

I want both exact matches and subpart matches.
 
LVL 84

Expert Comment

by:ozo
ID: 24789322
It looks to me like
contig00004 of FILE-2.txt is smaller than contig00001 of FILE-1.txt
how then can contig00001 be a sub part of contig00004 ?
 

Author Comment

by:shragi
ID: 24789439
Hey, I am talking about the sample sequences that I gave in the question,

not the ones in the files.

Below are the sample sequences from the question. Your code is not working for one case: contig00001 should match contig00003 and contig00004, because contig00001 of FILE-1 is identical to contig00003 of FILE-2 and it is a subpart of contig00004 of FILE-2 (the 2nd and 3rd lines of contig00004 are the same as contig00001 of FILE-1).

So the output should be

contig00001------------contig00003, contig00004

but I am getting

contig00001 -----------contig00003

I am not getting contig00004.

FILE-1:

>contig00001  length=11003   numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00002  length=110423   numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT

>contig00003  length=11023   numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC

FILE-2

>contig00001  length=15918   numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG


>contig00002  length=106210   numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT

>contig00003  length=106213   numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00004  length=1023433   numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC



To be precise, while searching I need to get exact matches and also matches where the sequence is part of another sequence.

 
LVL 84

Expert Comment

by:ozo
ID: 24789544
with the sample in the question, I get
contig00001-------------- contig00003, contig00004
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 24789725
If you make the change to separate sequences with > instead of on blank lines, the blank line at the end of
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

in the question sample does not match
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC
which has no blank line there.

If you want to ignore line breaks in the sequences, you can do
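# (F1 and F2 are opened as in the first version.)
$/=">";                     # each record ends at the next ">"
while( <F1> ){
    # skip the empty first record; capture the name and the sequence lines
    next unless  my($name,$re)=/\A(\w+).*?^(.*)^/ms;
    $re=~s/\s+//g;          # drop line breaks and blank lines from the FILE-1 sequence
    $re = qr/$re/ix;        # /i matches either case; /x ignores whitespace in the pattern
    seek F2,0,0;            # rewind FILE-2 for each FILE-1 sequence
    print "$name-------------- ";
    $,="";                  # output separator doubles as a "found a match" flag
    while( <F2> ){
        s/[\r\n]+//g;       # join the lines of the FILE-2 record into one string
        next unless /$re/;
        print "",/^(\w+)/;  # prints $, (empty or ", ") followed by the matching name
        $,=", ";
    }
    print $,?"\n":"not matched\n";
}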
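
As an alternative sketch of the same idea (not ozo's code), the sequences can be normalized once and compared with index() instead of a regex. The helper names parse_fasta and find_matches and the short sample data are hypothetical. It assumes each file fits in memory and that header lines are the only lines starting with ">":

```perl
use strict;
use warnings;

# Parse FASTA-style text into [name, sequence] pairs,
# with all whitespace removed from the sequence and letters upper-cased.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $chunk (split /^>/m, $text) {
        next unless $chunk =~ /\A(\w+)[^\n]*\n(.*)\z/s;
        my ($name, $seq) = ($1, $2);
        $seq =~ s/\s+//g;
        next unless length $seq;         # an empty query would match everything
        push @records, [$name, uc $seq];
    }
    return @records;
}

# For each query sequence, list every target that contains it as a substring.
sub find_matches {
    my ($query_text, $target_text) = @_;
    my @targets = parse_fasta($target_text);
    my @report;
    for my $q (parse_fasta($query_text)) {
        my @hits = map  { $_->[0] }
                   grep { index($_->[1], $q->[1]) >= 0 } @targets;
        push @report, "$q->[0]-------------- "
                    . (@hits ? join(", ", @hits) : "not matched");
    }
    return @report;
}

# Made-up sample data: q1 equals t1 (ignoring case and line breaks)
# and is a subpart of t2; q2 matches nothing.
my $file1 = ">q1 length=8\nACGT\nTTGA\n\n>q2 length=4\nGGGG\n";
my $file2 = ">t1 length=8\nacgtttga\n\n>t2 length=12\nCCAC\nGTTT\nGACC\n";
print "$_\n" for find_matches($file1, $file2);
```

Parsing FILE-2 once avoids re-reading it for every FILE-1 sequence, at the cost of holding all FILE-2 sequences in memory.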

 

Author Comment

by:shragi
ID: 24797815
Yes, it worked for my sample sequences, but not for the sequences in the files that I attached.
 
LVL 84

Expert Comment

by:ozo
ID: 24799197
In what way did it not work?
 
LVL 84

Expert Comment

by:ozo
ID: 24799211
Can you show me two sequences that should match that did not, or two matches that did match that should not?