asked on

Pattern search in a text file, Perl script

Hi,
I want to do a pattern searching on a FASTA format file (text), for eg given a FASTA format file, I would like to search for a specific pattern of sequences in the file, the output needs to give me the position to which the pattern is found

For eg in the sequence here (pls note the line following > should be ignored and the position 1 starts from where the sequence charaters starts in this case C (but not limited to)

>gi|12222253|gb|AL438840.1|AL438840 AL438840 XBC0AA Debaryomyces hansenii var.AL438840 AL438840 XBC0AA Debaryomyces hansenii var.
CGGTTTTAATTTAGCAGTTAACAGGCAAGTTTAGCTTAATACAACCATTTAGCCATCAGCAGTTAATTGATAGATTAATATC
GACAATCTCTGAGTTACGTTCTATAACATTTTTTCTTTTTTAGACGATTTTACCGAAATTGCAGGCAATA
AATTTTCTTTTTCACGCTTAGCACAGAACAGTAGCTGACGAGGCAATTGTTGATTTAGGGAAGAAATACG
AAGATAAAAGAAGATGACAAGCACACCTAGTAATGAGGTGATTGAACAATTGAATCAGGCCCGTAATTTA
GCGTTTTCGAGTAAAGAAACATTTCCACAGGTATTAAGACAAATCTTGCAATTTGCAAGCAATCCAGATA
TCCAGATCCAAAGATGGTGTTCTAAATTCTTTAAGGAATCGTTTTTGGCTGACGAAACAGTGTTAAGCAG
AGCCGATAAGGTTGACTTGGCGATAGACTCGATCGACAGTTTGATAATCTTGTTAGAAATTCGTGATGCG
GAAATATTTAAAGATTGTATTGATACAGCGATAGTAGTATTTAGACTAGTATTTCGCTACGTTGCTGAAA
ACGATGGATGTGGTGATGTATGGCAGAAATTGAATGAGTTAAAGAATACGTTAACTAATAAGTTTCAAAG
CACATTTCCTCTAGCACCATCTGACGATGAAGAACATGATATGGTACGCAGCATAGATTCTAAGTTGGAA
ATCTTGAAATTTGTGATACTAGTAATTGACTATCAGTCTAAATCCCCCTCCAATATAACCAGCTTTTCTT
TGTCACAAGTCCCACCAAATCATTCACTCATCAAACAGTCAATAGAGGCTGAAGCATACGGCCTAGTGGA
CGTATGTGTGAAAGTTATTACCAATGATATACTCATACCGCCATTGGTCACTGCCGTATTTAACCATTTT
TCAGTTCTAGCAAGAAGAAAACCCCAATTCGTTTCAAAAATGTTAAATGTGATAGAGAATTTTTGACACC
AATACAAAATTACAGTCAAATTATCAGACGATCGATGAATATAAGCTATCTAAAAAATATGTTGATAGAG
TCTTGAGARTTTCTATTTAAAGATTGTATTGATACAGCGATAGTAGTATTTAGACTAGTATTTCGCTACG
TTGCTGAAAACGATGGATGTGGTGATGTATGGCAGAAATTGAATGAGTTAAAGAATACGTTAACTAATAA
GTTTCAAAGCACATTTCCTCTAGCACCATCTGACGATGAAGAACATGATATGGTACGCAGCATAGATTCT
ATCTTGAAATTTGTGATACTAGTAATTGACTATCAGTCTAAATCCCCCTCCAATATAACCAGCTTTTCTT
TGTCACAAGTCCCACCAAATCATTCACTCATCAAACAGTCAATAGAGGCTGAAGCATACGGCCTAGTGGA
CGTATGTGTGAAAGTTATTACCAATGATATACTCATACCGCCATTGGTCACTGCCGTATTTAACCATTTT
TCAGTTCTAGCAAGAAGAAAACCCCAATTCGTTTCAAAAATGTTAAATGTGATAGAGAATTTTTGACACC
AATACAAAATTACAGTCAAATTATCAGACGATCGATGAATATAAGCTATCTAAAAAATATGTTGATAGAG
TCTTGAGARTTTC

Output

Pattern1 TTTAGC 10 15
Pattern2 TTTAGC 30 35
Pattern3 TTTAGC 48 53

However the trick is that I need to use wildcharacters in the pattern as Input to search for the sequences
the wild characters being

A for A
C for C
G for G
T for T
U for U
R for either A or G
Y for C or T
M for A or C
K for G or T or U
S for C or G
W for A or T or U
B for C or G or T or U
D for A or G or T or U
H for A or C or T or U
V for A or C or G
N for any nucleotide (ie A/C/G/T/U)
* for any character

for instance I should be able to search for all sequences with the following patterns , TTTAGC, TTNABV,TTDRY
If there are any softwares to do this or any perl code to do this it will be quite useful.
Thank you

ASKER CERTIFIED SOLUTION

Justin Mathews

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

kaufmed

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

haravallabhan

ASKER

@jmatix - Thanks the programme works fine, is there a way I can get the output to look like this with the string aswell given
eg output
Pattern1 TTTAGC 10 15
Pattern2 TTTAGC 30 35
Pattern3 TTTAGC 48 53

@kaufmed
The program works great but the values dont seem to match up for eg given the string 'AAC' which is at the position 1-3 the programme gives me these values

replaced string is: AAC
Pattern: AAC 245 248
Pattern: AAC 252 255
Pattern: AAC 269 272
Pattern: AAC 333 336
Pattern: AAC 406 409
Pattern: AAC 495 498
Pattern: AAC 539 542
Pattern: AAC 648 651
Pattern: AAC 858 861
Pattern: AAC 909 912
Pattern: AAC 1004 1007
Pattern: AAC 1052 1055
Pattern: AAC 1151 1154
Pattern: AAC 1180 1183
Pattern: AAC 1382 1385
Pattern: AAC 1435 1438
Pattern: AAC 1486 1489
Pattern: AAC 1572 1575
Pattern: AAC 1620 1623
Pattern: AAC 1719 1722
Pattern: AAC 1748 1751

where as it should be
1 3
8 10
25 27
88 90
160 162
248 250
291 293
399 401
553 555
606 608
656 658
750 752
797 799
895 897
923 925
1122 1124
1175 1177
1225 1227
1310 1312
1357 1359
1455 1457
1483 1485

I would want to use your progr too if it will given the output it will generate to what its should be, can you please modify this for me
Thank you

Justin Mathews

Sure.

#!/usr/local/bin/perl
use warnings;

$patrn = shift;
#$patrn = "TTTAR*";
$patrn =~ s/R/(?:A|G)/g;
$patrn =~ s/Y/(?:C|T)/g;
$patrn =~ s/M/(?:A|C)/g;
$patrn =~ s/K/(?:G|T|U)/g;
$patrn =~ s/S/(?:C|G)/g;
$patrn =~ s/W/(?:A|T|U)/g;
$patrn =~ s/B/(?:C|G|T|U)/g;
$patrn =~ s/D/(?:A|G|T|U)/g;
$patrn =~ s/H/(?:A|C|T|U)/g;
$patrn =~ s/V/(?:A|C|G)/g;
$patrn =~ s/N/(?:A|C|G|T|U)/g;
$patrn =~ s/\*/./g;

open IN, "V:\\Doc\\KB\\Tech\\ExpExch\\perl\\gnome2.txt" or die "Can't open gnome2.txt: $!";
<IN>; #skip first line
$/ = undef;
$lines = <IN>;
$lines =~ s/\s//g;

while($lines =~ s/($patrn)/'_'x(length($1))/e)
{
print "Pattern: $& ", $-[$_]+1," ", $+[$_], "\n" foreach (1..$#+);
}

kaufmed

For your "AAC" test, did you change the input? I used the sample input from your OP and it gives me different results than what you posted in your last comment. I did correct the missing "the position 1 starts from where the sequence charaters starts" from my last post in the code below.

#!C:\Perl\bin\perl.exe

$string = '';

$argc = @ARGV;

if ($argc != 2) { die "Usage:\n\t$0 code_sequence source_file\n" };

{
  local $/=undef;
  open FILE, $ARGV[1] or die "Couldn't open file: $ARGV[1]";
  $string = <FILE>;
  close FILE;
}

%map = ( 'A', 'A', 'C', 'C', 'G', 'G', 'T', 'T', 'U', 'U', 'R', '[AG]', 'Y', '[CT]', 'M', '[AC]', 'K', '[GTU]', 'S', '[CG]', 'W', '[ATU]', 'B', '[CGTU]', 'D', '[AGTU]', 'H', '[ACTU]', 'V', '[ACG]', 'N', '[ACGTU]', '\*', '.' );

while (($key, $value) = each(%map))
{
    $ARGV[0] =~ s/$key/$value/g;
}

print "replaced string is: $ARGV[0]\n";
$i = 1;

$string = substr($string, index($string, "\n") + 1);

while ($string =~ /$ARGV[0]/g)
{
    print "Pattern$i: $& @- @+\n";
    $i++;
}

Open in new window

haravallabhan

ASKER

@kaufmed Thanks, this is good just one more thing I guess the string starts with 0 here and not 1, how do I change this ?
ie it should be 1 3
8 10 etc

replaced string is: AAC
Pattern1: AAC 0 3
Pattern2: AAC 7 10
Pattern3: AAC 24 27
Pattern4: AAC 88 91
Pattern5: AAC 161 164
Pattern6: AAC 250 253
Pattern7: AAC 294 297
Pattern8: AAC 403 406
Pattern9: AAC 613 616
Pattern10: AAC 664 667
Pattern11: AAC 759 762
Pattern12: AAC 807 810
Pattern13: AAC 906 909
Pattern14: AAC 935 938
Pattern15: AAC 1137 1140
Pattern16: AAC 1190 1193
Pattern17: AAC 1241 1244
Pattern18: AAC 1327 1330
Pattern19: AAC 1375 1378
Pattern20: AAC 1474 1477
Pattern21: AAC 1503 1506

kaufmed

Change line 30 in my last post to:

$start = @- + 1;
print "Pattern$i: $& $start @+\n";

Open in new window