haravallabhan
asked on
Pattern search in a text file, Perl script
Hi,
I want to search a FASTA-format text file for a specific sequence pattern. The output should give the positions at which the pattern is found.
For example, take the sequence below. Please note that the line starting with > should be ignored, and position 1 is the first sequence character (C in this case, but not always):
>gi|12222253|gb|AL438840.1 |AL438840 AL438840 XBC0AA Debaryomyces hansenii var.AL438840 AL438840 XBC0AA Debaryomyces hansenii var.
CGGTTTTAATTTAGCAGTTAACAGGC AAGTTTAGCT TAATACAACC ATTTAGCCAT CAGCAGTTAA TTGATAGATT AATATC
GACAATCTCTGAGTTACGTTCTATAA CATTTTTTCT TTTTTAGACG ATTTTACCGA AATTGCAGGC AATA
AATTTTCTTTTTCACGCTTAGCACAG AACAGTAGCT GACGAGGCAA TTGTTGATTT AGGGAAGAAA TACG
AAGATAAAAGAAGATGACAAGCACAC CTAGTAATGA GGTGATTGAA CAATTGAATC AGGCCCGTAA TTTA
GCGTTTTCGAGTAAAGAAACATTTCC ACAGGTATTA AGACAAATCT TGCAATTTGC AAGCAATCCA GATA
TCCAGATCCAAAGATGGTGTTCTAAA TTCTTTAAGG AATCGTTTTT GGCTGACGAA ACAGTGTTAA GCAG
AGCCGATAAGGTTGACTTGGCGATAG ACTCGATCGA CAGTTTGATA ATCTTGTTAG AAATTCGTGA TGCG
GAAATATTTAAAGATTGTATTGATAC AGCGATAGTA GTATTTAGAC TAGTATTTCG CTACGTTGCT GAAA
ACGATGGATGTGGTGATGTATGGCAG AAATTGAATG AGTTAAAGAA TACGTTAACT AATAAGTTTC AAAG
CACATTTCCTCTAGCACCATCTGACG ATGAAGAACA TGATATGGTA CGCAGCATAG ATTCTAAGTT GGAA
ATCTTGAAATTTGTGATACTAGTAAT TGACTATCAG TCTAAATCCC CCTCCAATAT AACCAGCTTT TCTT
TGTCACAAGTCCCACCAAATCATTCA CTCATCAAAC AGTCAATAGA GGCTGAAGCA TACGGCCTAG TGGA
CGTATGTGTGAAAGTTATTACCAATG ATATACTCAT ACCGCCATTG GTCACTGCCG TATTTAACCA TTTT
TCAGTTCTAGCAAGAAGAAAACCCCA ATTCGTTTCA AAAATGTTAA ATGTGATAGA GAATTTTTGA CACC
AATACAAAATTACAGTCAAATTATCA GACGATCGAT GAATATAAGC TATCTAAAAA ATATGTTGAT AGAG
TCTTGAGARTTTCTATTTAAAGATTG TATTGATACA GCGATAGTAG TATTTAGACT AGTATTTCGC TACG
TTGCTGAAAACGATGGATGTGGTGAT GTATGGCAGA AATTGAATGA GTTAAAGAAT ACGTTAACTA ATAA
GTTTCAAAGCACATTTCCTCTAGCAC CATCTGACGA TGAAGAACAT GATATGGTAC GCAGCATAGA TTCT
ATCTTGAAATTTGTGATACTAGTAAT TGACTATCAG TCTAAATCCC CCTCCAATAT AACCAGCTTT TCTT
TGTCACAAGTCCCACCAAATCATTCA CTCATCAAAC AGTCAATAGA GGCTGAAGCA TACGGCCTAG TGGA
CGTATGTGTGAAAGTTATTACCAATG ATATACTCAT ACCGCCATTG GTCACTGCCG TATTTAACCA TTTT
TCAGTTCTAGCAAGAAGAAAACCCCA ATTCGTTTCA AAAATGTTAA ATGTGATAGA GAATTTTTGA CACC
AATACAAAATTACAGTCAAATTATCA GACGATCGAT GAATATAAGC TATCTAAAAA ATATGTTGAT AGAG
TCTTGAGARTTTC
Output
Pattern1 TTTAGC 10 15
Pattern2 TTTAGC 30 35
Pattern3 TTTAGC 48 53
However, the trick is that I need to use wildcard characters in the input pattern when searching the sequences.
The wildcard characters are:
A for A
C for C
G for G
T for T
U for U
R for either A or G
Y for C or T
M for A or C
K for G or T or U
S for C or G
W for A or T or U
B for C or G or T or U
D for A or G or T or U
H for A or C or T or U
V for A or C or G
N for any nucleotide (ie A/C/G/T/U)
* for any character
For instance, I should be able to search for all sequences matching the patterns TTTAGC, TTNABV and TTDRY.
If there is any software or Perl code that does this, it would be very useful.
Thank you
Sure.
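#!/usr/local/bin/perl
use warnings;
$patrn = shift; # search pattern from the command line
#$patrn = "TTTAR*";
# Translate each ambiguity code into a regex alternation.
$patrn =~ s/R/(?:A|G)/g;
$patrn =~ s/Y/(?:C|T)/g;
$patrn =~ s/M/(?:A|C)/g;
$patrn =~ s/K/(?:G|T|U)/g;
$patrn =~ s/S/(?:C|G)/g;
$patrn =~ s/W/(?:A|T|U)/g;
$patrn =~ s/B/(?:C|G|T|U)/g;
$patrn =~ s/D/(?:A|G|T|U)/g;
$patrn =~ s/H/(?:A|C|T|U)/g;
$patrn =~ s/V/(?:A|C|G)/g;
$patrn =~ s/N/(?:A|C|G|T|U)/g;
$patrn =~ s/\*/./g; # * matches any character
open IN, "V:\\Doc\\KB\\Tech\\ExpExch\\perl\\gnome2.txt" or die "Can't open gnome2.txt: $!";
<IN>; #skip first line
$/ = undef; # slurp the rest of the file in one read
$lines = <IN>;
$lines =~ s/\s//g; # drop whitespace so positions count sequence characters only
# Overwrite each match with underscores: the next search then finds the
# following match, and the offsets of the remaining text do not shift.
while($lines =~ s/($patrn)/'_'x(length($1))/e)
{
print "Pattern: $& ", $-[$_]+1," ", $+[$_], "\n" foreach (1..$#+);
}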
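The chain of substitutions above can also be written as a single table lookup. What follows is only a sketch: the helper name pattern_to_regex is hypothetical (it is not from this thread), and the table simply copies the wildcard list from the question. Because it translates one code at a time, replacement text is never rescanned, and an unknown code is reported instead of being passed through to the regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Nucleotide codes as listed in the question, mapped to regex character classes.
my %IUPAC = (
    A => 'A', C => 'C', G => 'G', T => 'T', U => 'U',
    R => '[AG]',   Y => '[CT]',   M => '[AC]',   K => '[GTU]',
    S => '[CG]',   W => '[ATU]',  B => '[CGTU]', D => '[AGTU]',
    H => '[ACTU]', V => '[ACG]',  N => '[ACGTU]',
    '*' => '.',    # "any character", as specified in the question
);

# pattern_to_regex: hypothetical helper name, not from the thread.
# Builds the regex text one code at a time and dies on an unknown code.
sub pattern_to_regex {
    my ($pattern) = @_;
    my $re = '';
    for my $code (split //, uc $pattern) {
        die "Unknown code '$code' in pattern '$pattern'\n"
            unless exists $IUPAC{$code};
        $re .= $IUPAC{$code};
    }
    return $re;
}

# The three example patterns from the question.
print pattern_to_regex($_), "\n" for qw(TTTAGC TTNABV TTDRY);
```

For the three patterns in the question this prints TTTAGC, TT[ACGTU]A[CGTU][ACG] and TT[AGTU][AG][CT].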
For your "AAC" test, did you change the input? I used the sample input from your OP and it gives me different results from what you posted in your last comment. In the code below I also fixed the requirement I missed in my last post: "the position 1 starts from where the sequence characters start".
#!C:\Perl\bin\perl.exe
$string = '';
$argc = @ARGV;
if ($argc != 2) { die "Usage:\n\t$0 code_sequence source_file\n" };
# Read the whole file into $string in one go.
{
local $/=undef;
open FILE, $ARGV[1] or die "Couldn't open file: $ARGV[1]";
$string = <FILE>;
close FILE;
}
# Ambiguity code => regex character class.
%map = ( 'A', 'A', 'C', 'C', 'G', 'G', 'T', 'T', 'U', 'U', 'R', '[AG]', 'Y', '[CT]', 'M', '[AC]', 'K', '[GTU]', 'S', '[CG]', 'W', '[ATU]', 'B', '[CGTU]', 'D', '[AGTU]', 'H', '[ACTU]', 'V', '[ACG]', 'N', '[ACGTU]', '\*', '.' );
while (($key, $value) = each(%map))
{
$ARGV[0] =~ s/$key/$value/g;
}
print "replaced string is: $ARGV[0]\n";
$i = 1;
# Drop the header line so that offsets start at the first sequence character.
# Whitespace inside the sequence is not removed, so the offsets still count it.
$string = substr($string, index($string, "\n") + 1);
while ($string =~ /$ARGV[0]/g)
{
# @- holds the 0-based start offset; @+ holds the offset just past the match.
print "Pattern$i: $& @- @+\n";
$i++;
}
ASKER
@kaufmed Thanks, this is good. Just one more thing: I guess the string positions start at 0 here and not 1. How do I change this?
i.e. it should be 1 3
8 10 etc.
replaced string is: AAC
Pattern1: AAC 0 3
Pattern2: AAC 7 10
Pattern3: AAC 24 27
Pattern4: AAC 88 91
Pattern5: AAC 161 164
Pattern6: AAC 250 253
Pattern7: AAC 294 297
Pattern8: AAC 403 406
Pattern9: AAC 613 616
Pattern10: AAC 664 667
Pattern11: AAC 759 762
Pattern12: AAC 807 810
Pattern13: AAC 906 909
Pattern14: AAC 935 938
Pattern15: AAC 1137 1140
Pattern16: AAC 1190 1193
Pattern17: AAC 1241 1244
Pattern18: AAC 1327 1330
Pattern19: AAC 1375 1378
Pattern20: AAC 1474 1477
Pattern21: AAC 1503 1506
Change line 30 in my last post to:
$start = $-[0] + 1;
print "Pattern$i: $& $start ", $+[0], "\n";
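To show how the 1-based positions come out once the header and all whitespace are removed, here is a minimal sketch. It assumes a single-record FASTA file and an already-translated regex. The helper name find_matches is hypothetical, not code from this thread. The sample is the first sequence line from the question, with its embedded spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_matches: hypothetical helper, not code from the thread.
# $fasta is the raw text of a single-record FASTA file; $regex is an
# already-translated pattern. Returns a list of [matched_text, start, end],
# with start and end 1-based and inclusive.
sub find_matches {
    my ($fasta, $regex) = @_;
    # Drop header lines (those starting with '>'), then all whitespace,
    # so that position 1 is the first sequence character.
    my $seq = join '', grep { !/^>/ } split /\n/, $fasta;
    $seq =~ s/\s+//g;
    my @hits;
    while ($seq =~ /$regex/g) {
        # $-[0] is the 0-based start offset; $+[0] is the offset just past
        # the match, which equals the 1-based inclusive end position.
        push @hits, [ $&, $-[0] + 1, $+[0] ];
    }
    return @hits;
}

# First sequence line of the sample in the question.
my $sample = ">gi|12222253|gb|AL438840.1\n"
           . "CGGTTTTAATTTAGCAGTTAACAGGC AAGTTTAGCT TAATACAACC ATTTAGCCAT\n";

my $n = 0;
for my $hit (find_matches($sample, 'TTTAGC')) {
    printf "Pattern%d %s %d %d\n", ++$n, @$hit;
}
```

On this sample it prints positions 10 15, 30 35 and 48 53, the same values as the expected output in the question. Like the loop in the thread, it reports non-overlapping matches only.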
ASKER
eg output
Pattern1 TTTAGC 10 15
Pattern2 TTTAGC 30 35
Pattern3 TTTAGC 48 53
@kaufmed
The program works great, but the values don't seem to match up. For example, given the string 'AAC', which is at position 1-3, the program gives me these values:
replaced string is: AAC
Pattern: AAC 245 248
Pattern: AAC 252 255
Pattern: AAC 269 272
Pattern: AAC 333 336
Pattern: AAC 406 409
Pattern: AAC 495 498
Pattern: AAC 539 542
Pattern: AAC 648 651
Pattern: AAC 858 861
Pattern: AAC 909 912
Pattern: AAC 1004 1007
Pattern: AAC 1052 1055
Pattern: AAC 1151 1154
Pattern: AAC 1180 1183
Pattern: AAC 1382 1385
Pattern: AAC 1435 1438
Pattern: AAC 1486 1489
Pattern: AAC 1572 1575
Pattern: AAC 1620 1623
Pattern: AAC 1719 1722
Pattern: AAC 1748 1751
whereas it should be
1 3
8 10
25 27
88 90
160 162
248 250
291 293
399 401
553 555
606 608
656 658
750 752
797 799
895 897
923 925
1122 1124
1175 1177
1225 1227
1310 1312
1357 1359
1455 1457
1483 1485
I would like to use your program too, if its output can be made to match what it should be. Can you please modify it for me?
Thank you