Solved

Compare two lists

Posted on 2009-07-05
Last Modified: 2012-05-07
Hi guys, I have two files in the format below.


FILE-1:

>contig00001  length=11003   numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00002  length=110423   numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT

>contig00003  length=11023   numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC

FILE-2

>contig00001  length=15918   numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG


>contig00002  length=106210   numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT

>contig00003  length=106213   numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00004  length=1023433   numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC

The program must compare each sequence in FILE-1 with the sequences in FILE-2.

That is, take the first sequence in FILE-1 (only the sequence, not its name) and compare it with every sequence in FILE-2. If it matches any of them, print both names: the name of the sequence being compared and the names of the sequences it matched.

One sequence can match many sequences.
One sequence may be part of another sequence in FILE-2.

Return all the matched sequences.

In the sample sequences above:

The first sequence (contig00001) of FILE-1 exactly matches contig00003 of FILE-2, and it is also present in contig00004 as a subpart, so the output is

contig00001--------------------- contig00003, contig00004

contig00002 and contig00003 of FILE-1 do not match any sequence in FILE-2, so return

contig00002-------------- not matched
contig00003-------------- not matched



I asked a similar question earlier, but I did not state it clearly there, so to avoid confusion I deleted that question and am posting this new one.
Question by:shragi
16 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 24782531
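open F1,"<FILE-1" or  die "FILE-1 $!";
open F2,"<FILE-2" or  die "FILE-2 $!";
$/="";                              # paragraph mode: records are separated by blank lines
while( <F1> ){
    # $name: first word after ">"; $re: the sequence lines that follow the header
    my($name,$re)=/>(\w+).*?^(.*)^/ms;
    $re = qr/\Q$re/;                # match the sequence literally
    seek F2,0,0;                    # rewind FILE-2 for each FILE-1 sequence
    print "$name-------------- ";
    $,="";                          # output field separator, also used as the "found a match" flag
    while( <F2> ){
        next unless /$re/;
        print "",/>(\w+)/;          # prints the name, preceded by ", " after the first match
        $,=", ";
    }
    print $,?"\n":"not matched\n";
}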
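To see what the regex in that loop captures, here is a small sketch that applies it to a single record. The record text is made up for illustration; it is shaped like what paragraph mode would hand back for a sequence followed by a blank line.

```perl
use strict;
use warnings;

# One record shaped like a paragraph-mode read (made-up data).
my $record = ">contigA  length=8   numreads=2\nACGT\nTTGA\n\n";

# /m lets ^ match at the start of each line, /s lets . match newlines.
# The lazy .*? skips the rest of the header line; the greedy (.*) then
# grabs the sequence lines up to the last line start in the record.
my ($name, $seq) = $record =~ />(\w+).*?^(.*)^/ms;

print "name: $name\n";
print "sequence lines:\n$seq";
```

The captured sequence keeps its internal line breaks, so a later match against FILE-2 only succeeds when the line wrapping is the same on both sides.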
 
LVL 6

Expert Comment

by:zlobcho
ID: 24783710
ozo, you are really a genius. Good work.
 
LVL 5

Expert Comment

by:vikaskhoria
ID: 24783900
There is just one small problem with the solution.
You are reading F2 in a while loop inside the loop over F1, so once FILE-2 reaches end of file, it will stop reading.
For example, if the first line matches only the last line (or one towards the end), then the entire second file will be read for the first line alone.
<F2> will reach EOF.

One way around this is to open the file inside the loop, just before while(<F2>), so that for every line from the first file, all the lines from the second file are read.
 
LVL 39

Expert Comment

by:Adam314
ID: 24785532
vikaskhoria - Notice this line inside the first while loop (looping over F1):
    seek F2,0,0;
This resets the file pointer to the beginning of the file for F2.
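
A minimal sketch of what that seek call does, using an in-memory filehandle as a stand-in for FILE-2. The count_lines helper and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Count the lines still available on a filehandle, reading until EOF.
sub count_lines {
    my ($fh) = @_;
    my $n = 0;
    while (defined(my $line = <$fh>)) {
        $n++;
    }
    return $n;
}

# In-memory stand-in for FILE-2 (hypothetical data).
my $data = "line one\nline two\nline three\n";
open my $fh, '<', \$data or die "in-memory open: $!";

my $first  = count_lines($fh);   # reads everything; the handle is now at EOF
my $second = count_lines($fh);   # nothing left to read
seek $fh, 0, 0;                  # rewind to byte 0, as the loop over F1 does
my $third  = count_lines($fh);   # the whole content is readable again

print "$first $second $third\n";
```

Rewinding an already open handle avoids reopening the file for every sequence in FILE-1.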
 

Author Comment

by:shragi
ID: 24785849
Hi ozo,

Your code works well for my sample sequences, but it is not working for my files.

I am attaching the files. Can you please check with these files?


FILE-1.txt
FILE-2.txt
 

Author Comment

by:shragi
ID: 24788161
@ozo

Your code works fine if all the sequences are in one case, either upper or lower.

But I found an error: if it finds one match, it stops without searching for other matches.

Whether or not it finds a match, it should search the entire FILE-2.
 
LVL 84

Expert Comment

by:ozo
ID: 24788223
Where did you find the error?
When I tested it on your example, it found both:
contig00001-------------- contig00003, contig00004
 
LVL 84

Expert Comment

by:ozo
ID: 24788360
# Unlike the file format in the question, FILE-1.txt and FILE-2.txt don't seem to have blank lines separating the sequences.
# Here is a revision that separates sequences with >
# If you want to match either case, you can change qr/\Q$re/ to qr/\Q$re/i
$/=">";
while( <F1> ){
    next unless my($name,$re)=/\A(\w+).*?^(.*)^/ms;
    $re = qr/\Q$re/;
    seek F2,0,0;
    print "$name-------------- ";
    $,="";
    while( <F2> ){
        next unless /$re/;
        print "",/^(\w+)/;
        $,=", ";
    }
    print $,?"\n":"not matched\n";
}
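
A small sketch of how setting $/ to ">" splits the input into records, run on made-up data through an in-memory filehandle. The read_records helper is hypothetical, and unlike the loop above it chomps each record to drop the trailing ">".

```perl
use strict;
use warnings;

# Split FASTA-style text into records the way a read with $/ = ">" does.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "in-memory open: $!";
    local $/ = ">";                # each read now ends at the next ">"
    my @records;
    while (defined(my $rec = <$fh>)) {
        chomp $rec;                # chomp removes the trailing ">" (the current $/)
        next unless length $rec;   # the read before the first ">" is empty
        push @records, $rec;
    }
    return @records;
}

my $sample  = ">contigA  length=8\nACGT\nTTGA\n\n>contigB  length=4\nGGCC\n";
my @records = read_records($sample);
print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Each record starts with the contig name, which is why the name regex changes from />(\w+)/ to /^(\w+)/. The blank line between sequences stays inside the record it follows.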
 

Author Comment

by:shragi
ID: 24789209
Yes, it worked, but what about matching the subpart?


In the sample sequences, contig00001 of FILE-1 matches contig00004 of FILE-2 because contig00001 is a subpart of contig00004 of FILE-2.

But when I use this code on the files I uploaded, it searches for exact matches but not for subparts.

I want both exact matches and subpart matches.
 
LVL 84

Expert Comment

by:ozo
ID: 24789322
It looks to me like
contig00004 of FILE-2.txt is smaller than contig00001 of FILE-1.txt.
How then can contig00001 be a subpart of contig00004?
 

Author Comment

by:shragi
ID: 24789439
I am talking about the sample sequences that I gave in the question, not the ones in the files.

With the sample sequences from the question, your code is not working for one case. contig00001 of FILE-1 should match contig00003 and contig00004, because it is identical to contig00003 of FILE-2 and it is a subpart of contig00004 of FILE-2 (the 2nd and 3rd lines of contig00004 are the same as contig00001 of FILE-1).

So the output should be

contig00001------------contig00003, contig00004

but I am getting

contig00001 -----------contig00003

I am not getting contig00004.

To be precise: while searching, I need to get exact matches and also matches where the sequence is part of another sequence.

 
LVL 84

Expert Comment

by:ozo
ID: 24789544
with the sample in the question, I get
contig00001-------------- contig00003, contig00004
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 24789725
If you make the change to separate sequences with > instead of on blank lines, the blank line at the end of
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

in the question sample does not match
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC
which has no blank line there.
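
The mismatch can be reproduced with short made-up sequences standing in for the contigs above. This is only a sketch: the query ends with a blank line, and the target contains the same two lines followed directly by more sequence.

```perl
use strict;
use warnings;

# Hypothetical short sequences standing in for the contigs above.
my $query  = "AAAA\nCCCC\n\n";          # FILE-1 style: ends with a blank line
my $target = "GGGG\nAAAA\nCCCC\nTT\n";  # FILE-2 style: same lines in mid-sequence

# Literal match including line breaks: the extra "\n" has no counterpart.
my $raw = $target =~ /\Q$query\E/ ? 1 : 0;

# Remove all whitespace from both sides, then compare again.
(my $q = $query)  =~ s/\s+//g;
(my $t = $target) =~ s/\s+//g;
my $stripped = $t =~ /\Q$q\E/ ? 1 : 0;

print "with line breaks: $raw, without: $stripped\n";
```

Stripping whitespace also lets a sequence match when the two files wrap their lines at different positions.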

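If you want to ignore line breaks in the sequences, you can do
$/=">";
while( <F1> ){
    chomp;                     # drop the trailing ">" record separator
    # $name: first word of the header line; $re: everything after the header line,
    # captured through the end of the record so the last line is never cut off
    next unless my($name,$re)=/\A(\w+).*?^(.*)/ms;
    $re=~s/\s+//g;             # remove line breaks from the FILE-1 sequence
    next unless length $re;    # skip records with no sequence data
    $re = qr/$re/ix;           # case-insensitive match
    seek F2,0,0;               # rewind FILE-2 for each FILE-1 sequence
    print "$name-------------- ";
    $,="";
    while( <F2> ){
        s/[\r\n]+//g;          # remove line breaks from the FILE-2 record
        next unless /$re/;
        print "",/^(\w+)/;
        $,=", ";
    }
    print $,?"\n":"not matched\n";
}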
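
An alternative sketch, not ozo's code: parse both inputs into memory, normalize each sequence (whitespace removed, upper case), and test containment with index instead of a regex. It assumes both files fit in memory; parse_fasta, find_matches, and the sample data are hypothetical names and values chosen for illustration.

```perl
use strict;
use warnings;

# Parse FASTA-style text into a list of [name, sequence] pairs.
# Each sequence is upper-cased with all whitespace removed.
sub parse_fasta {
    my ($text) = @_;
    my @contigs;
    for my $rec (split /^>/m, $text) {
        my ($header, $body) = split /\n/, $rec, 2;
        next unless defined $header && $header =~ /^(\w+)/;
        my $name = $1;
        $body = '' unless defined $body;
        $body =~ s/\s+//g;
        push @contigs, [ $name, uc $body ];
    }
    return @contigs;
}

# For each query contig, list the target contigs that contain it
# (an exact match is just the case where the two are equally long).
sub find_matches {
    my ($queries, $targets) = @_;
    my @report;
    for my $q (@$queries) {
        my ($qname, $qseq) = @$q;
        my @hits = grep { length($qseq) && index($_->[1], $qseq) >= 0 } @$targets;
        my $names = @hits ? join(', ', map { $_->[0] } @hits) : 'not matched';
        push @report, "$qname-------------- $names";
    }
    return @report;
}

my @file1 = parse_fasta(">c1 length=8\nACGT\nttga\n\n>c2 length=4\nGGGG\n");
my @file2 = parse_fasta(">t1 length=8\nACGTTTGA\n>t2 length=12\nCCAC\nGTTT\nGACC\n");
print "$_\n" for find_matches(\@file1, \@file2);
```

Reading real files would mean slurping each one into a string first. Keeping the header out of the searched text also means a query can never match part of a header line.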

 

Author Comment

by:shragi
ID: 24797815
Yes, it worked for my sample sequences, but not for the sequences in the files that I attached.
 
LVL 84

Expert Comment

by:ozo
ID: 24799197
In what way did it not work?
 
LVL 84

Expert Comment

by:ozo
ID: 24799211
Can you show me two sequences that should match that did not, or two matches that did match that should not?

Question has a verified solution.
