shragi
asked on
compare two lists....
Hi guys, I have two files in the format below.
FILE-1:
>contig00001 length=11003 numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
>contig00002 length=110423 numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT
>contig00003 length=11023 numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC
FILE-2
>contig00001 length=15918 numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG
>contig00002 length=106210 numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
>contig00003 length=106213 numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
>contig00004 length=1023433 numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC
The program must compare each sequence in FILE-1 with the sequences in FILE-2.
That is, take the first sequence in FILE-1 (only the sequence, not its name) and compare it with the sequences in FILE-2. If it matches any of them, print both names: the name of the sequence being compared and the names of the sequences it matched.
One sequence can match many sequences.
A sequence may also be part of a longer sequence in FILE-2.
Return all the matched sequences.
In the sample sequences here, the first sequence (contig00001) of FILE-1 exactly matches contig00003 and is also present in contig00004 as a subpart, so the output is
contig00001-------------- contig00003, contig00004
contig00002 and contig00003 of FILE-1 do not match any sequence in FILE-2, so
return
contig00002-------------- not matched
contig00003-------------- not matched
Guys, I asked a similar question before, but I did not put it clearly there, so to avoid confusion I deleted that question and am posting a new one.
ozo, you are really a genius. Good work.
There is just one small problem in the solution.
You are reading F2 in a while loop inside the loop over F1, so once F2 reaches end of file, it will stop reading.
For example, say the first line matches only the last line of the second file (or one towards the end); then the entire second file will be read for that first line alone.
<F2> will reach EOF.
One way around this is to open the file inside the loop itself, just before while(<F2>), so that all the lines from the second file are read for every line from the first file.
vikaskhoria - Notice this line inside the first while loop (looping over F1):
seek F2,0,0;
This resets the file pointer to the beginning of the file for F2.
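To see what that seek buys you, here is a minimal, self-contained sketch. It uses in-memory filehandles and made-up data; the `$file1`/`$file2` strings and the `count_inner_reads` helper are assumptions for illustration, not part of ozo's script. It simply counts how many lines the inner loop gets to read with and without the rewind.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two files (not the asker's data).
my $file1 = "a\nb\nc\n";
my $file2 = "x\ny\n";

# Count how many lines the inner loop reads in total.
sub count_inner_reads {
    my ($rewind) = @_;
    open my $f1, '<', \$file1 or die "file1: $!";
    open my $f2, '<', \$file2 or die "file2: $!";
    my $reads = 0;
    while ( my $outer = <$f1> ) {
        # Rewind the second handle so the inner loop starts from the top again.
        seek $f2, 0, 0 if $rewind;
        while ( my $inner = <$f2> ) {
            $reads++;
        }
    }
    return $reads;
}

print "without seek: ", count_inner_reads(0), "\n";
print "with seek:    ", count_inner_reads(1), "\n";
```

Without the seek the inner loop only ever sees the second file once (2 reads here); with it, every outer line gets a full pass (3 x 2 = 6 reads).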
ASKER
Hi ozo,
Your code works well for my sample sequences, but it is not working for my files.
I am attaching the files; can you please check with these?
FILE-1.txt
FILE-2.txt
ASKER
@ozo
Your code works fine if all the sequences are in one case, either upper or lower,
but I found an error: once it finds one match, it stops without searching for other matches.
Whether or not it finds a match, it should search the entire FILE-2.
Where did you find the error?
When I tested it on your example, it found both:
contig00001-------------- contig00003, contig00004
#Unlike the file format in the question, FILE-1.txt and FILE-2.txt don't seem to have blank lines separating the sequences.
#here is a revision that separates sequences with >
#if you want to match either case, you can change qr/\Q$re/ to qr/\Q$re/i
$/=">";
while( <F1> ){
  #note: the capture ends on \n instead of a trailing ^, because under /m a ^ cannot
  #match after the final newline, which would cut the last record of F1 short
  next unless my($name,$re)=/\A(\w+).*?^(.*\n)/ms;
  $re = qr/\Q$re/;
  seek F2,0,0;
  print "$name-------------- ";
  $,="";
  while( <F2> ){
    next unless /$re/;
    print "",/^(\w+)/;
    $,=", ";
  }
  print $,?"\n":"not matched\n";
}
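On the upper/lower case point, here is a quick illustration of what the /i mentioned in the comment changes. The two strings are made up for the example (note the lowercase g, as in the sample contigs), and `\Q...\E` only quotes the sequence so that it is matched literally.

```perl
use strict;
use warnings;

# Hypothetical sequences: same bases, different case in one position.
my $query  = "GAGTGGGGTCGT";
my $target = "ACCTGAGTGGGgTCGTCA";

my $exact  = qr/\Q$query\E/;     # case-sensitive, as in the script
my $nocase = qr/\Q$query\E/i;    # case-insensitive variant

print "case-sensitive:   ", ( $target =~ $exact  ? "match" : "no match" ), "\n";
print "case-insensitive: ", ( $target =~ $nocase ? "match" : "no match" ), "\n";
```

The first test reports no match and the second reports a match, so mixed-case files need the /i form (or both sides folded to one case before comparing).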
ASKER
Yes, it worked, but what about matching the subpart?
In the sample sequences, contig00001 of FILE-1 matches contig00004 of FILE-2 because contig00001 is a subpart of contig00004 of FILE-2.
But when I use this code on the files I uploaded, it searches for exact matches but not for subparts.
I want both exact matches and subpart matches.
It looks to me like contig00004 of FILE-2.txt is smaller than contig00001 of FILE-1.txt.
How then can contig00001 be a subpart of contig00004?
ASKER
Hey dude, I am talking about the sample sequences that I gave in the question, not the ones in the files.
Below are the sample sequences from the question. Your code is not working for one case: contig00001 should match contig00003 and contig00004, because contig00001 of FILE-1 is identical to contig00003 of FILE-2 and is a subpart of contig00004 of FILE-2 (the 2nd and 3rd lines of contig00004 are the same as contig00001 of FILE-1).
So the output should be
contig00001-------------- contig00003, contig00004
but I am getting
contig00001-------------- contig00003
I am not getting contig00004.
FILE-1:
>contig00001 length=11003 numreads=3312
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
>contig00002 length=110423 numreads=3323
GCAGCGCCAGCAGGAGCGTGGCAAACGAACGCGTCATCGAAGGGTTTtCACCGCCGTACC
GCTTACGCTCACCATCAGCATGCTGGCATCTTtCCCGACCGTTTCGTAGTCGATATCAAT
>contig00003 length=11023 numreads=33233
GCACAGACTTATCCACAATGATACGAAAAAGTGAAATTGTGCGAGCGTTGCGCAAACGTT
TTCGTTAAAATGCTCGCGCTTAACAGGCATGCCCCGCCAGGTGTGTTAGATGAGTTTTTC
FILE-2
>contig00001 length=15918 numreads=6266
GCGGGCGCGGCTACTGCCCGCTGGGGCGCGAGACCGGCATCGCGCGCATTGCGTGGCGCG
ACGGCTGGCCGTTTGTCGAAGGCGGCAAACACGCGCAGCTGACTGTACCTGGCCCGCAGG
>contig00002 length=106210 numreads=27839
ACCCTCACCCCGGCCCTCTCCCTGAGGGAGAGGGGgTAAACATCAGCAGATGTTAAGCGG
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
>contig00003 length=106213 numreads=26839
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
>contig00004 length=1023433 numreads=23465
GAGTGGGgTCGTCACCCTTCATCTCGCGCTTCACTTCCGTATATTCCTCACCTTTTTCAT
ACACAGACTTATCCACAATCGGGCCTGCCCGCGCTGCGCGATCCTACATTTAGCGAGACA
AAATCGACTATACTGGCGAAAAATACTCCCCGGCAGGCCACCCCATGACAACACAACCTC
GACTATACTGGCGAAAAATACTCCCCGGCAGGCC
To be precise, while searching I need to get exact matches and also the cases where the sequence is part of another sequence.
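For what it's worth, one way to get both exact and subpart matches regardless of how the lines are wrapped is to join each sequence into a single string first and then use a plain substring test. This is only a sketch under stated assumptions: `parse_fasta` and `find_matches` are hypothetical helper names, the records are passed in as strings rather than read from FILE-1 and FILE-2, case is folded to upper, and the test data is shortened and made up. It is not the code from the accepted answer, and ignoring line breaks is an assumption about what "subpart" should mean.

```perl
use strict;
use warnings;

# Parse FASTA text into [name, sequence] pairs, with each sequence joined
# into one uppercase string (line breaks and spaces removed).
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $chunk ( split /^>/m, $text ) {
        next unless $chunk =~ /\S/;
        my ( $header, @lines ) = split /\n/, $chunk;
        my ($name) = $header =~ /^(\S+)/ or next;
        my $seq = uc join '', @lines;
        $seq =~ s/\s+//g;
        push @records, [ $name, $seq ];
    }
    return @records;
}

# For each query, list the names of the targets that contain it as a substring.
sub find_matches {
    my ( $queries, $targets ) = @_;
    my @report;
    for my $q (@$queries) {
        my @hits;
        if ( length( $q->[1] ) ) {
            for my $t (@$targets) {
                push @hits, $t->[0] if index( $t->[1], $q->[1] ) >= 0;
            }
        }
        push @report,
            "$q->[0]-------------- " . ( @hits ? join( ', ', @hits ) : 'not matched' );
    }
    return @report;
}

# Shortened, made-up sequences in the same layout as the sample files.
my $file1 = ">contig00001 length=8 numreads=1\nACAC\nAGAC\n"
          . ">contig00002 length=4 numreads=1\nTTTT\n";
my $file2 = ">contig00003 length=8 numreads=1\nACACAGAC\n"
          . ">contig00004 length=12 numreads=1\nGGAC\nACAG\nACGG\n";

my @q = parse_fasta($file1);
my @t = parse_fasta($file2);
print "$_\n" for find_matches( \@q, \@t );
```

With this toy data contig00001 is reported against both contig00003 (exact) and contig00004 (subpart, even though the line breaks differ), and contig00002 is reported as not matched.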
with the sample in the question, I get
contig00001-------------- contig00003, contig00004
ASKER
Yes, it worked for my sample sequences, but not for the sequences in the files that I attached.
In what way did it not work?
Can you show me two sequences that should match but did not, or two sequences that did match but should not?
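#F1 has to be opened too; that line is missing here, so this open is added by analogy with F2
open F1,"<FILE-1" or die "FILE-1 $!";
open F2,"<FILE-2" or die "FILE-2 $!";
#paragraph mode: read one blank-line-separated record at a time
$/="";
while( <F1> ){
  #the original line is cut off after "^("; the rest of this pattern is a reconstruction
  #that captures the sequence lines without the trailing newlines
  my($name,$re)=/>(\w+).*?^(.*\S)/ms;
  $re = qr/\Q$re/;
  seek F2,0,0;
  print "$name-------------- ";
  $,="";
  while( <F2> ){
    next unless /$re/;
    print "",/>(\w+)/;
    $,=", ";
  }
  print $,?"\n":"not matched\n";
}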