haravallabhan
asked on
Parsing and extracting from text file (FASTA format)
Hi,
This question is from a specialised field, but I guess it could be solved by any expert in Perl programming.
I have a genome file in FASTA format (text) and an Excel file with some coordinates (i.e. start and end positions). Given the coordinates, I would like to extract the corresponding sequence from the FASTA file (text file) and also count the number of A's, G's, T's and C's in each extracted sequence, written to a separate file (CSV file).
For example, the FASTA file looks like this:
>gi|12222253|gb|AL438840.1 |AL438840 AL438840 XBC0AA Debaryomyces hansenii var. hansenii genomic clone XBC0AA002G05 T7 similar to Saccharomyces cerevisiae ORF YAL043c [ PTA1 ; pre-tRNA processing protein / PF I subunit ], genomic survey sequence
CGGTTTTAACAGTTAACAGGCAAGTT AATACAACCA CATCAGCAGT TAATTGATAG ATTAATATCA AGTA
GACAATCTCTGAGTTACGTTCTATAA CATTTTTTCT TTTTTAGACG ATTTTACCGA AATTGCAGGC AATA
AATTTTCTTTTTCACGCTTAGCACAG AACAGTAGCT GACGAGGCAA TTGTTGATTT AGGGAAGAAA TACG
AAGATAAAAGAAGATGACAAGCACAC CTAGTAATGA GGTGATTGAA CAATTGAATC AGGCCCGTAA TTTA
GCGTTTTCGAGTAAAGAAACATTTCC ACAGGTATTA AGACAAATCT TGCAATTTGC AAGCAATCCA GATA
TCCAGATCCAAAGATGGTGTTCTAAA TTCTTTAAGG AATCGTTTTT GGCTGACGAA ACAGTGTTAA GCAG
AGCCGATAAGGTTGACTTGGCGATAG ACTCGATCGA CAGTTTGATA ATCTTGTTAG AAATTCGTGA TGCG
GAAATATTTAAAGATTGTATTGATAC AGCGATAGTA GTATTTAGAC TAGTATTTCG CTACGTTGCT GAAA
ACGATGGATGTGGTGATGTATGGCAG AAATTGAATG AGTTAAAGAA TACGTTAACT AATAAGTTTC AAAG
CACATTTCCTCTAGCACCATCTGACG ATGAAGAACA TGATATGGTA CGCAGCATAG ATTCTAAGTT GGAA
ATCTTGAAATTTGTGATACTAGTAAT TGACTATCAG TCTAAATCCC CCTCCAATAT AACCAGCTTT TCTT
TGTCACAAGTCCCACCAAATCATTCA CTCATCAAAC AGTCAATAGA GGCTGAAGCA TACGGCCTAG TGGA
CGTATGTGTGAAAGTTATTACCAATG ATATACTCAT ACCGCCATTG GTCACTGCCG TATTTAACCA TTTT
TCAGTTCTAGCAAGAAGAAAACCCCA ATTCGTTTCA AAAATGTTAA ATGTGATAGA GAATTTTTGA CACC
AATACAAAATTACAGTCAAATTATCA GACGATCGAT GAATATAAGC TATCTAAAAA ATATGTTGAT AGAG
TCTTGAGARTTTCTATTTAAAGATTG TATTGATACA GCGATAGTAG TATTTAGACT AGTATTTCGC TACG
TTGCTGAAAACGATGGATGTGGTGAT GTATGGCAGA AATTGAATGA GTTAAAGAAT ACGTTAACTA ATAA
GTTTCAAAGCACATTTCCTCTAGCAC CATCTGACGA TGAAGAACAT GATATGGTAC GCAGCATAGA TTCT
ATCTTGAAATTTGTGATACTAGTAAT TGACTATCAG TCTAAATCCC CCTCCAATAT AACCAGCTTT TCTT
TGTCACAAGTCCCACCAAATCATTCA CTCATCAAAC AGTCAATAGA GGCTGAAGCA TACGGCCTAG TGGA
CGTATGTGTGAAAGTTATTACCAATG ATATACTCAT ACCGCCATTG GTCACTGCCG TATTTAACCA TTTT
TCAGTTCTAGCAAGAAGAAAACCCCA ATTCGTTTCA AAAATGTTAA ATGTGATAGA GAATTTTTGA CACC
AATACAAAATTACAGTCAAATTATCA GACGATCGAT GAATATAAGC TATCTAAAAA ATATGTTGAT AGAG
TCTTGAGARTTTC
In the FASTA file the attributes following > don't matter, and positions are counted from the first base of the sequence:
CGGTTTTA
where C is position 1, G is position 2, G is position 3, T is position 4, T is position 5 and so on.
The output file should look like this (output1):
>MID1, Chr1, +strand, position 10-20, length 10
CAGTTAACAG
>MID2, chr2, +strand, position 30-35, length 5
CAACC
Output2 - number of A, G, C and T
ID A G C T
MID1 4 2 2 2
MID2 2 0 3 0
Can someone point me to existing software or a program that does this automatically, or could an expert help me solve this in Perl?
Thank you
markposition.xls
You will need the Perl module Spreadsheet::ParseExcel (if on Unix) or Win32::OLE (if on Windows) to parse the Excel spreadsheet using Perl. The code in Perl would be simple. Let me know which OS platform you are running it on.
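As a rough illustration of the Spreadsheet::ParseExcel route, here is a minimal sketch that reads the first worksheet of an .xls file and prints each row as a comma-separated line. The sub name read_xls_rows is made up for this example, and the layout of markposition.xls is not shown in the thread, so the code makes no assumption about which column holds what. Spreadsheet::ParseExcel is a CPAN module written in Perl, so it should also be usable on Windows, though that has not been checked here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the rows of the first worksheet as array references of cell values.
# (read_xls_rows is a hypothetical helper name, not part of any module.)
sub read_xls_rows {
    my ($path) = @_;
    require Spreadsheet::ParseExcel;    # CPAN module, loaded only when called

    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse($path);
    die $parser->error(), "\n" unless defined $workbook;

    my ($sheet) = $workbook->worksheets();
    my ($row_min, $row_max) = $sheet->row_range();
    my ($col_min, $col_max) = $sheet->col_range();

    my @rows;
    for my $r ($row_min .. $row_max) {
        my @cells;
        for my $c ($col_min .. $col_max) {
            my $cell = $sheet->get_cell($r, $c);
            # Empty cells come back undefined; store them as empty strings.
            push @cells, defined $cell ? $cell->value() : '';
        }
        push @rows, \@cells;
    }
    return @rows;
}

# With a file name on the command line, dump the sheet as CSV-style lines.
# Cell values containing commas would need proper quoting, omitted here.
if (@ARGV) {
    print join(',', @$_), "\n" for read_xls_rows($ARGV[0]);
}
```

Run as, for example, `perl xls2csv.pl markposition.xls > markposition.csv` (the script name is only a placeholder) to get a plain-text copy of the coordinates.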
ASKER
Hi, I have these Perl modules, but I am also fine with converting the Excel sheet to CSV format. I am running Windows Vista.
Thanks for looking at it.
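Since CSV is acceptable, the whole task can be sketched with core Perl only. The following is a minimal sketch under several assumptions, not a definitive solution: the FASTA file holds a single record; the coordinate file has rows of the form id,chromosome,strand,start,end with no quoted fields; and the length of a region is end minus start, which is inferred from the MID1 sample ("position 10-20, length 10" giving CAGTTAACAG). The sub names and the output file names output1.fasta and output2.csv are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a single-record FASTA file: skip the '>' header line and join the
# sequence lines, removing blanks so that positions count bases only.
sub read_fasta {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;
        $line =~ s/\s+//g;
        $seq .= uc $line;
    }
    return $seq;
}

# Positions are 1-based. The region length is end - start, an assumption
# inferred from the sample record "position 10-20, length 10".
sub extract_region {
    my ($seq, $start, $end) = @_;
    die "bad range $start-$end\n"
        if $start < 1 || $end < $start || $end - 1 > length $seq;
    return substr($seq, $start - 1, $end - $start);
}

# Count bases with tr///, which returns the number of matching characters
# and leaves the string unchanged when the replacement list is empty.
# Other symbols (such as the R in the sample sequence) are not counted.
sub count_bases {
    my ($s) = @_;
    return (
        A => ($s =~ tr/A//),
        G => ($s =~ tr/G//),
        C => ($s =~ tr/C//),
        T => ($s =~ tr/T//),
    );
}

# Write the extracted sequences (output1) and the base counts (output2).
sub process {
    my ($fasta_fh, $coord_fh, $fa_out, $csv_out) = @_;
    my $seq = read_fasta($fasta_fh);
    print {$csv_out} "ID,A,G,C,T\n";
    while (my $row = <$coord_fh>) {
        $row =~ s/\s+\z//;
        my ($id, $chr, $strand, $start, $end) = split /,/, $row;
        # Skip blank lines and any header row without numeric positions.
        next unless defined $end && $start =~ /^\d+$/ && $end =~ /^\d+$/;
        my $sub = extract_region($seq, $start, $end);
        my %n   = count_bases($sub);
        printf {$fa_out} ">%s, %s, %sstrand, position %d-%d, length %d\n%s\n",
            $id, $chr, $strand, $start, $end, length($sub), $sub;
        print {$csv_out} join(',', $id, @n{qw(A G C T)}), "\n";
    }
}

# Usage: perl script.pl genome.fasta coordinates.csv
if (@ARGV == 2) {
    open my $fasta_fh, '<', $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $coord_fh, '<', $ARGV[1] or die "$ARGV[1]: $!\n";
    open my $fa_out,  '>', 'output1.fasta' or die "output1.fasta: $!\n";
    open my $csv_out, '>', 'output2.csv'   or die "output2.csv: $!\n";
    process($fasta_fh, $coord_fh, $fa_out, $csv_out);
}
```

With the sample FASTA above and a coordinate row MID1,Chr1,+,10,20, this writes the MID1 record exactly as in the requested output1 and the line MID1,4,2,2,2 to the counts file. If the end position should instead be inclusive, change the length passed to substr to $end - $start + 1.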