shragi

Compare two lists

Hi guys, I have two files in the format below.


FILE-1:

>contig00001  length=11003   numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00002  length=110423   numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT

>contig00003  length=11023   numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC

FILE-2

>contig00001  length=15918   numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG


>contig00002  length=106210   numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT

>contig00003  length=106213   numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00004  length=1023433   numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC

The program must compare each sequence in FILE-1 with the sequences in FILE-2.

That is, take the first sequence in FILE-1 (only the sequence, not its name) and compare it with the sequences in FILE-2. If it matches any of the sequences in FILE-2, print both sequence names, i.e., the name of the sequence being compared and the names of the sequences it matched.

One sequence can match many sequences.
One sequence may be part of another sequence in FILE-2.

Return all the matched sequences.

Here, in the sample sequences:

The first sequence (contig00001) of FILE-1 exactly matches contig00003 and is also present in contig00004 as a subpart, so the output is

contig00001--------------------- contig00003, contig00004

contig00002 and contig00003 of FILE-1 do not match any sequence in FILE-2, so return

contig00002-------------- not matched
contig00003-------------- not matched



Guys, I asked a similar question before, but I did not put it clearly there, so to avoid confusion I deleted that question and am posting a new one.
ozo

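# Compare each sequence in FILE-1 against every sequence in FILE-2.
open F1,"<FILE-1" or  die "FILE-1 $!";
open F2,"<FILE-2" or  die "FILE-2 $!";
$/="";                                  # paragraph mode: records are separated by blank lines
while( <F1> ){
    # $name is the contig name, $re the sequence lines that follow the header
    my($name,$re)=/>(\w+).*?^(.*)^/ms;
    $re = qr/\Q$re/;                    # match the sequence text literally
    seek F2,0,0;                        # rewind FILE-2 for this FILE-1 record
    print "$name-------------- ";
    $,="";                              # no separator before the first match
    while( <F2> ){
        next unless /$re/;
        print "",/>(\w+)/;              # prints $, and then the matching name
        $,=", ";
    }
    print $,?"\n":"not matched\n";      # $, is still empty if nothing matched
}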

ozo, you are really a genius. Good work.
There is just one small problem in the solution.
You are reading F2 in a while loop inside the loop over F1, so once FILE-2 reaches end of file, it will stop reading.
For example, say the first line matches only the last line (or one towards the end); then for the first line itself the entire second file will be read, and <F2> will reach EOF.

One way around this is to open the file inside the loop itself, just before while(<F2>), so that for every line from the first file, all the lines from the second file are read.
Adam314

vikaskhoria - Notice this line inside the first while loop (looping over F1):
    seek F2,0,0;
This resets the file pointer to the beginning of the file for F2.
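
A minimal sketch of that rewind behaviour, for anyone who wants to check it in isolation. It uses an in-memory filehandle as a stand-in for FILE-2; the data and the helper name count_lines_per_pass are made up for illustration and are not part of ozo's script.

```perl
use strict;
use warnings;

# In-memory stand-in for FILE-2 (illustrative data, not the real file).
my $data = "first\nsecond\nthird\n";
open my $fh, '<', \$data or die "in-memory open failed: $!";

# Read the handle to EOF $passes times, rewinding before each pass.
sub count_lines_per_pass {
    my ($handle, $passes) = @_;
    my @counts;
    for my $pass (1 .. $passes) {
        seek $handle, 0, 0 or die "seek failed: $!";   # back to byte 0
        my $n = 0;
        while (defined(my $line = <$handle>)) {
            $n++;
        }
        push @counts, $n;
    }
    return @counts;
}

my @counts = count_lines_per_pass($fh, 3);
print "lines per pass: @counts\n";   # prints "lines per pass: 3 3 3"
```

Every pass sees all three lines, so the inner loop does not need to reopen the file.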
shragi

ASKER

Hi ozo,

Your code works well for my sample sequences, but it is not working for my files.

I am attaching the files; can you please check with these?


FILE-1.txt
FILE-2.txt
shragi

ASKER

@ozo

Your code works fine if all the sequences are in one case, either upper or lower.

But I found an error: once it finds one match, it stops without searching for other matches.

Whether or not it finds a match, it should search the entire FILE-2.

Where did you find the error?
When I tested it on your example, it found both:
contig00001-------------- contig00003, contig00004
# Unlike the file format in the question, FILE-1.txt and FILE-2.txt don't seem to have blank lines separating the sequences.
# Here is a revision that separates sequences with >
# If you want to match either case, you can change qr/\Q$re/ to qr/\Q$re/i
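# F1 and F2 are the filehandles opened in the earlier script.
$/=">";                                 # each record now ends at the next ">"
while( <F1> ){
    # skips the empty chunk before the first ">"; the name now starts the record
    next unless  my($name,$re)=/\A(\w+).*?^(.*)^/ms;
    $re = qr/\Q$re/;                    # match the sequence text literally
    seek F2,0,0;                        # rewind FILE-2 for this FILE-1 record
    print "$name-------------- ";
    $,="";                              # no separator before the first match
    while( <F2> ){
        next unless /$re/;
        print "",/^(\w+)/;              # prints $, and then the matching name
        $,=", ";
    }
    print $,?"\n":"not matched\n";      # $, is still empty if nothing matched
}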
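
To see what setting $/ to ">" does to the records, here is a small self-contained sketch. It is not ozo's code: parse_fasta is a hypothetical helper, the input is made-up data without blank lines between records, and joining the sequence lines into one string is an extra step added here for illustration.

```perl
use strict;
use warnings;

# Parse FASTA text into a list of [name, sequence] pairs,
# reading records that end at each ">".
sub parse_fasta {
    my ($text) = @_;
    open my $fh, '<', \$text or die "in-memory open failed: $!";
    local $/ = ">";
    my @records;
    while (defined(my $rec = <$fh>)) {
        chomp $rec;                      # drop the trailing ">" separator
        next unless $rec =~ /\S/;        # the chunk before the first ">" is empty
        my ($header, @lines) = split /\n/, $rec;
        my ($name) = $header =~ /^(\w+)/ or next;
        my $seq = join '', @lines;       # join the wrapped sequence lines
        $seq =~ s/\s+//g;                # remove any stray whitespace
        push @records, [ $name, $seq ];
    }
    return @records;
}

# Illustrative input with no blank lines between records.
my $sample = ">contig00001  length=8\nACGT\nACGT\n>contig00002  length=4\nTTGA\n";
my @records = parse_fasta($sample);
print "$_->[0]: $_->[1]\n" for @records;
```

The first read returns only the leading ">", which is why both this sketch and the revision above skip a record that yields no name.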
shragi

ASKER

Yes, it worked, but how about matching the subpart?


In the sample sequences, contig00001 of FILE-1 matches contig00004 of FILE-2 because contig00001 is a subpart of contig00004 of FILE-2.

But when I use this code on the files I uploaded, it searches for exact matches but not for subparts.

I want both exact matches and subpart matches.

It looks to me like
contig00004 of FILE-2.txt is smaller than contig00001 of FILE-1.txt.
How then can contig00001 be a subpart of contig00004?
shragi

ASKER

Hey dude, I am talking about the sample sequences that I gave in the question, not the ones in the files.

Below are the sample sequences from the question. Your code is not working for one case: contig00001 should match contig00003 and contig00004, because contig00001 of FILE-1 is identical to contig00003 of FILE-2 and is a subpart of contig00004 of FILE-2 (the 2nd and 3rd lines of contig00004 are the same as contig00001 of FILE-1).

So the output should be

contig00001------------contig00003, contig00004

but I am getting

contig00001 -----------contig00003

I am not getting contig00004.

FILE-1:

>contig00001  length=11003   numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00002  length=110423   numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT

>contig00003  length=11023   numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC

FILE-2

>contig00001  length=15918   numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG


>contig00002  length=106210   numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT

>contig00003  length=106213   numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC

>contig00004  length=1023433   numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC



To be precise, while searching I need to get the exact match and also any match where the sequence is part of another sequence.

with the sample in the question, I get
contig00001-------------- contig00003, contig00004
ASKER CERTIFIED SOLUTION
ozo

This solution is only available to members.
shragi

ASKER

Yes, it worked for my sample sequences but not for the sequences in the files that I attached.
In what way did it not work?
Can you show me two sequences that should match that did not, or two matches that did match that should not?
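
One thing that may be worth ruling out, although nothing in the thread confirms it is the cause: the patterns in the scripts above are built from the raw record text, newlines included, so a subpart is only found when both files wrap and space their lines the same way. The sketch below compares sequences after their lines have been joined, using index and uc for a case-insensitive substring test. find_matches is a hypothetical helper and the sequences are made-up data.

```perl
use strict;
use warnings;

# For each [name, sequence] query, list the targets whose sequence
# contains the query sequence, ignoring case.
sub find_matches {
    my ($queries, $targets) = @_;
    my @report;
    for my $query (@$queries) {
        my ($qname, $qseq) = @$query;
        my @hits = map  { $_->[0] }
                   grep { index(uc($_->[1]), uc($qseq)) >= 0 } @$targets;
        push @report, "$qname-------------- "
                    . (@hits ? join(', ', @hits) : 'not matched');
    }
    return @report;
}

# Illustrative sequences, already joined into single strings.
my @file1 = ( [ 'query1', 'cccGGG' ], [ 'query2', 'TTTT' ] );
my @file2 = ( [ 'target1', 'AAACCCGGGTTT' ],
              [ 'target2', 'CCCGGG' ],
              [ 'target3', 'ACGTACGT' ] );

my @report = find_matches(\@file1, \@file2);
print "$_\n" for @report;
```

With this data, query1 is reported against target1 (as a subpart) and target2 (as an exact match), and query2 is reported as not matched.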