Parsing and extracting from text file (FASTA format)

haravallabhan
haravallabhan used Ask the Experts™
on
Hi,

This question is from a specialised field but guess could be solved by any expert in Perl programing.
I have an genome file in a FASTA format (text) and a file with some coordinates (i.e positions giving the start and end position) in an excel file. Given the coordinates I would like to extract the sequence information from the FASTA file (text file) and then additionally also calculate the number of A's, G's, T's and C's in the sequence extracted in a seperate file (csv file)

For example the FASTA file looks like this

>gi|12222253|gb|AL438840.1|AL438840 AL438840 XBC0AA Debaryomyces hansenii var. hansenii genomic clone XBC0AA002G05 T7 similar to Saccharomyces cerevisiae ORF YAL043c [ PTA1 ; pre-tRNA processing protein / PF I subunit ], genomic survey sequence
CGGTTTTAACAGTTAACAGGCAAGTTAATACAACCACATCAGCAGTTAATTGATAGATTAATATCAAGTA
GACAATCTCTGAGTTACGTTCTATAACATTTTTTCTTTTTTAGACGATTTTACCGAAATTGCAGGCAATA
AATTTTCTTTTTCACGCTTAGCACAGAACAGTAGCTGACGAGGCAATTGTTGATTTAGGGAAGAAATACG
AAGATAAAAGAAGATGACAAGCACACCTAGTAATGAGGTGATTGAACAATTGAATCAGGCCCGTAATTTA
GCGTTTTCGAGTAAAGAAACATTTCCACAGGTATTAAGACAAATCTTGCAATTTGCAAGCAATCCAGATA
TCCAGATCCAAAGATGGTGTTCTAAATTCTTTAAGGAATCGTTTTTGGCTGACGAAACAGTGTTAAGCAG
AGCCGATAAGGTTGACTTGGCGATAGACTCGATCGACAGTTTGATAATCTTGTTAGAAATTCGTGATGCG
GAAATATTTAAAGATTGTATTGATACAGCGATAGTAGTATTTAGACTAGTATTTCGCTACGTTGCTGAAA
ACGATGGATGTGGTGATGTATGGCAGAAATTGAATGAGTTAAAGAATACGTTAACTAATAAGTTTCAAAG
CACATTTCCTCTAGCACCATCTGACGATGAAGAACATGATATGGTACGCAGCATAGATTCTAAGTTGGAA
ATCTTGAAATTTGTGATACTAGTAATTGACTATCAGTCTAAATCCCCCTCCAATATAACCAGCTTTTCTT
TGTCACAAGTCCCACCAAATCATTCACTCATCAAACAGTCAATAGAGGCTGAAGCATACGGCCTAGTGGA
CGTATGTGTGAAAGTTATTACCAATGATATACTCATACCGCCATTGGTCACTGCCGTATTTAACCATTTT
TCAGTTCTAGCAAGAAGAAAACCCCAATTCGTTTCAAAAATGTTAAATGTGATAGAGAATTTTTGACACC
AATACAAAATTACAGTCAAATTATCAGACGATCGATGAATATAAGCTATCTAAAAAATATGTTGATAGAG
TCTTGAGARTTTCTATTTAAAGATTGTATTGATACAGCGATAGTAGTATTTAGACTAGTATTTCGCTACG
TTGCTGAAAACGATGGATGTGGTGATGTATGGCAGAAATTGAATGAGTTAAAGAATACGTTAACTAATAA
GTTTCAAAGCACATTTCCTCTAGCACCATCTGACGATGAAGAACATGATATGGTACGCAGCATAGATTCT
ATCTTGAAATTTGTGATACTAGTAATTGACTATCAGTCTAAATCCCCCTCCAATATAACCAGCTTTTCTT
TGTCACAAGTCCCACCAAATCATTCACTCATCAAACAGTCAATAGAGGCTGAAGCATACGGCCTAGTGGA
CGTATGTGTGAAAGTTATTACCAATGATATACTCATACCGCCATTGGTCACTGCCGTATTTAACCATTTT
TCAGTTCTAGCAAGAAGAAAACCCCAATTCGTTTCAAAAATGTTAAATGTGATAGAGAATTTTTGACACC
AATACAAAATTACAGTCAAATTATCAGACGATCGATGAATATAAGCTATCTAAAAAATATGTTGATAGAG
TCTTGAGARTTTC


In the fasta file the attributes following > doesnt matter and the position starts from the
CGGTTTTA
where C is position 1, G is position 2, G is position 3, T is position 4, T is position 5 and so on..

The output of the file should be like this (output1)

>MID1, Chr1, +strand, position 10-20, length 10
CAGTTAACAG
>MID2 chr2, +strand, position 30-35 length 5
CAACC


Output2 -No of AGCT
            A G C T
MID1    4 2 2 2
MID2    2 0 3 0


Can someone point me to a resource where I can do this automatically, like if there is any software or program already available or could some expert help solve this in perl.

Thank you
markposition.xls
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
You will need Perl module Spreadsheet::ParseExcel (if on unix) or Win32::OLE (if on windows) to parse Excel spreadsheet using Perl. The code in Perl would be simple. Let me know which OS platform you are running it on.

Author

Commented:
Hi, I have these perl modules, but if the excel sheet is converted to CSV format too I am okay with it. I am having Windows Vista.
Thanks for looking at it.
I am using Spreadsheet::ParseExcel. Assuming your FASTA file is gnome.txt and the spreadsheet in markposition.xls following code should work provided I understood your requirements correctly. Outputs will be in output.txt and AGCTcounts.csv. Replace file paths appropriately:

#!/usr/local/bin/perl
use warnings;
#use strict;
use Spreadsheet::ParseExcel;

my $Map = Spreadsheet::ParseExcel::Workbook->Parse('V:\\Doc\\KB\\Tech\\ExpExch\\xl\\markposition.xls');
$oWkS = $Map->Worksheet("Sheet1");

open IN, "V:\\Doc\\KB\\Tech\\ExpExch\\perl\\gnome.txt" or die "Can't open gnome.txt: $!";
<IN>; #skip first line
$/ = undef;
$lines = <IN>;
$lines =~ s/\s//g;

open OUT1, ">V:\\Doc\\KB\\Tech\\ExpExch\\perl\\output.txt" or die "Can't open output.txt: $!";
open OUT2, ">V:\\Doc\\KB\\Tech\\ExpExch\\perl\\AGCTcounts.csv" or die "Can't open AGCTcounts.csv: $!";
print OUT2 ", A, G, C, T\n";
$c = 0;
for(my $iR = $oWkS->{MinRow}+1 ;
        defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++)
{
    $c++;
      $mid = $oWkS->{Cells}[$iR][0]->Value;
    $dir = $oWkS->{Cells}[$iR][2]->Value;
    $strt = $oWkS->{Cells}[$iR][3]->Value;
    $end = $oWkS->{Cells}[$iR][4]->Value;
      $str = substr($lines, $strt-1, $end-$strt);
      print OUT1 ">$mid, Chr$c, ${dir}strand, position $strt-$end\n$str\n";
      $Acount = ($str=~s/A/A/g)||0;
      $Gcount = ($str=~s/G/G/g)||0;
      $Ccount = ($str=~s/C/C/g)||0;
      $Tcount = ($str=~s/T/T/g)||0;
      print OUT2 "$mid, $Acount, $Gcount, $Ccount, $Tcount\n";
}
Commented:
use Data::Dumper;

my $file = shift;
my $start_coordinate = shift;
my $end_coordinate = shift;

my $first_char_pos = $start_coordinate - 1;
my $offset = $end_coordinate - $start_coordinate;


undef $/;
open FILE,"<$file";
my $file_content=<FILE>;
close FILE;

if($file_content=~m/^>(.*?)\n.{$first_char_pos}(.{$offset})/gis){
  my $metadata = $1;
  my $content = $2;
  my %counts;
  map {$counts{$_}++} split //, $content;
  map {print "$_ -> ",$counts{$_}," ocurrences.\n"} keys %counts;
  print "\n".$2;
}

This should be called with:
perl dna.pl <input file> <start coordinate> <end coordinate>
perl dna.pl dna.txt 2 6

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial