
Parsing and extracting from text file (FASTA format)

Hi,

This question comes from a specialised field, but I guess it could be solved by anyone experienced in Perl programming.
I have a genome file in FASTA format (plain text) and an Excel file with coordinates (i.e. the start and end position of each region). Given the coordinates, I would like to extract the corresponding sequence from the FASTA file into a text file, and additionally count the number of A's, G's, T's and C's in each extracted sequence and write those counts to a separate CSV file.

For example the FASTA file looks like this

>gi|12222253|gb|AL438840.1|AL438840 AL438840 XBC0AA Debaryomyces hansenii var. hansenii genomic clone XBC0AA002G05 T7 similar to Saccharomyces cerevisiae ORF YAL043c [ PTA1 ; pre-tRNA processing protein / PF I subunit ], genomic survey sequence
CGGTTTTAACAGTTAACAGGCAAGTTAATACAACCACATCAGCAGTTAATTGATAGATTAATATCAAGTA
GACAATCTCTGAGTTACGTTCTATAACATTTTTTCTTTTTTAGACGATTTTACCGAAATTGCAGGCAATA
AATTTTCTTTTTCACGCTTAGCACAGAACAGTAGCTGACGAGGCAATTGTTGATTTAGGGAAGAAATACG
AAGATAAAAGAAGATGACAAGCACACCTAGTAATGAGGTGATTGAACAATTGAATCAGGCCCGTAATTTA
GCGTTTTCGAGTAAAGAAACATTTCCACAGGTATTAAGACAAATCTTGCAATTTGCAAGCAATCCAGATA
TCCAGATCCAAAGATGGTGTTCTAAATTCTTTAAGGAATCGTTTTTGGCTGACGAAACAGTGTTAAGCAG
AGCCGATAAGGTTGACTTGGCGATAGACTCGATCGACAGTTTGATAATCTTGTTAGAAATTCGTGATGCG
GAAATATTTAAAGATTGTATTGATACAGCGATAGTAGTATTTAGACTAGTATTTCGCTACGTTGCTGAAA
ACGATGGATGTGGTGATGTATGGCAGAAATTGAATGAGTTAAAGAATACGTTAACTAATAAGTTTCAAAG
CACATTTCCTCTAGCACCATCTGACGATGAAGAACATGATATGGTACGCAGCATAGATTCTAAGTTGGAA
ATCTTGAAATTTGTGATACTAGTAATTGACTATCAGTCTAAATCCCCCTCCAATATAACCAGCTTTTCTT
TGTCACAAGTCCCACCAAATCATTCACTCATCAAACAGTCAATAGAGGCTGAAGCATACGGCCTAGTGGA
CGTATGTGTGAAAGTTATTACCAATGATATACTCATACCGCCATTGGTCACTGCCGTATTTAACCATTTT
TCAGTTCTAGCAAGAAGAAAACCCCAATTCGTTTCAAAAATGTTAAATGTGATAGAGAATTTTTGACACC
AATACAAAATTACAGTCAAATTATCAGACGATCGATGAATATAAGCTATCTAAAAAATATGTTGATAGAG
TCTTGAGARTTTCTATTTAAAGATTGTATTGATACAGCGATAGTAGTATTTAGACTAGTATTTCGCTACG
TTGCTGAAAACGATGGATGTGGTGATGTATGGCAGAAATTGAATGAGTTAAAGAATACGTTAACTAATAA
GTTTCAAAGCACATTTCCTCTAGCACCATCTGACGATGAAGAACATGATATGGTACGCAGCATAGATTCT
ATCTTGAAATTTGTGATACTAGTAATTGACTATCAGTCTAAATCCCCCTCCAATATAACCAGCTTTTCTT
TGTCACAAGTCCCACCAAATCATTCACTCATCAAACAGTCAATAGAGGCTGAAGCATACGGCCTAGTGGA
CGTATGTGTGAAAGTTATTACCAATGATATACTCATACCGCCATTGGTCACTGCCGTATTTAACCATTTT
TCAGTTCTAGCAAGAAGAAAACCCCAATTCGTTTCAAAAATGTTAAATGTGATAGAGAATTTTTGACACC
AATACAAAATTACAGTCAAATTATCAGACGATCGATGAATATAAGCTATCTAAAAAATATGTTGATAGAG
TCTTGAGARTTTC

In the FASTA file the attributes following > don't matter, and the positions are counted from the start of the sequence itself, i.e. from
CGGTTTTA
where C is position 1, G is position 2, G is position 3, T is position 4, T is position 5 and so on.
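
The numbering described above can be sketched in Perl roughly as follows. This is only a minimal illustration, not a complete solution: the subroutine names `read_fasta_sequence` and `extract_region` are made up for the example, the file is assumed to hold a single record, and the end position is assumed to be exclusive, because that is what makes "position 10-20, length 10" come out as CAGTTAACAG. The MID2 example appears to be offset by one from this convention, so the arithmetic may need adjusting once the intended convention is confirmed.

```perl
use strict;
use warnings;

# Read a single-record FASTA from a filehandle and return the sequence
# as one string, with the header line and line breaks removed.
# (Hypothetical helper; a multi-record file would need splitting on '>'.)
sub read_fasta_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;       # strip trailing newline and any CR
        next if $line =~ /^>/;    # the attributes after '>' are ignored
        $seq .= $line;
    }
    return $seq;
}

# Return the bases from 1-based position $start up to, but not
# including, $end. The exclusive end is an assumption.
sub extract_region {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start);
}

# Demo on the first sequence line of the example (header shortened).
my $fasta = ">example header\n"
          . "CGGTTTTAACAGTTAACAGGCAAGTTAATACAACCACATCAGCAGTTAATTGATAGATTAATATCAAGTA\n";
open my $fh, '<', \$fasta or die "cannot open in-memory FASTA: $!";
my $seq = read_fasta_sequence($fh);
close $fh;

print substr($seq, 0, 1), "\n";              # position 1, prints C
print extract_region($seq, 10, 20), "\n";    # prints CAGTTAACAG
```

For a real genome file, the in-memory string would be replaced by `open my $fh, '<', $path`, where `$path` is the location of the FASTA file.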

The output of the file should be like this (output1)

>MID1, Chr1, +strand, position 10-20, length 10
CAGTTAACAG
>MID2 chr2, +strand, position 30-35 length 5
CAACC

Output 2 (number of A, G, C and T in each extracted sequence):
A G C T
MID1 4 2 2 2
MID2 2 0 3 0
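
The counting step could look something like the sketch below, which uses Perl's `tr///` operator to count characters without modifying the string. The subroutine name `count_bases` is hypothetical, and ambiguity codes such as the R that appears in the example sequence are simply not counted here.

```perl
use strict;
use warnings;

# Count A, G, C and T in a sequence (case-insensitive).
# tr/X// with an empty replacement list returns the number of matches
# and leaves the string unchanged.
sub count_bases {
    my ($seq) = @_;
    my $s = uc $seq;
    return (
        A => ($s =~ tr/A//),
        G => ($s =~ tr/G//),
        C => ($s =~ tr/C//),
        T => ($s =~ tr/T//),
    );
}

# The two extracted sequences from the example output.
my %regions = (MID1 => 'CAGTTAACAG', MID2 => 'CAACC');

# Write the counts as CSV rows to standard output.
print join(',', 'ID', qw(A G C T)), "\n";
for my $id (sort keys %regions) {
    my %n = count_bases($regions{$id});
    print join(',', $id, @n{qw(A G C T)}), "\n";
}
# prints:
# ID,A,G,C,T
# MID1,4,2,2,2
# MID2,2,0,3,0
```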

Can someone point me to a resource for doing this automatically, such as existing software or a ready-made program? Alternatively, could an expert help me solve this in Perl?

Thank you
markposition.xls

Justin Mathews

You will need the Perl module Spreadsheet::ParseExcel (if on Unix) or Win32::OLE (if on Windows) to parse an Excel spreadsheet from Perl. The Perl code itself would be simple. Let me know which OS platform you are running it on.

haravallabhan

ASKER

Hi, I have these Perl modules, but I am also fine with converting the Excel sheet to CSV format. I am running Windows Vista.
Thanks for looking at it.
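
If the sheet is saved as CSV, reading the coordinates needs only core Perl. The sketch below is an illustration under stated assumptions: the layout of markposition.xls is not shown in the thread, so the column order (id, chromosome, strand, start, end) and the header row are guesses, and the names `parse_region` and `fasta_header` are hypothetical. The length is taken as end minus start, which matches the example headers in the question. Fields containing quoted commas would need a proper CSV parser such as the CPAN module Text::CSV.

```perl
use strict;
use warnings;

# Parse one CSV line into a region record.
# Assumed column order: id, chromosome, strand, start, end.
sub parse_region {
    my ($line) = @_;
    $line =~ s/\s+\z//;    # strip newline and any Windows CR
    my ($id, $chr, $strand, $start, $end) = split /,/, $line;
    return { id => $id, chr => $chr, strand => $strand,
             start => $start, end => $end };
}

# Build a FASTA header line in the style of the example output.
sub fasta_header {
    my ($r) = @_;
    return sprintf '>%s, %s, %sstrand, position %d-%d, length %d',
        $r->{id}, $r->{chr}, $r->{strand},
        $r->{start}, $r->{end}, $r->{end} - $r->{start};
}

# Demo data standing in for the exported CSV file (assumed layout).
my $csv = "id,chr,strand,start,end\n"
        . "MID1,Chr1,+,10,20\n"
        . "MID2,Chr2,+,30,35\n";
open my $fh, '<', \$csv or die "cannot open in-memory CSV: $!";
my $column_names = <$fh>;              # skip the assumed header row
while (my $line = <$fh>) {
    next unless $line =~ /\S/;         # ignore blank lines
    print fasta_header(parse_region($line)), "\n";
}
close $fh;
```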

ASKER CERTIFIED SOLUTION

Justin Mathews


SOLUTION

nervokid
