Sequence splitter in BioPython

Hello everybody!

I'm a beginner in BioPython and I have to write my first program for my Master's in Biology.

The plan is to parse a big FASTA file (containing more than 10,000 SeqRecords) and slice each sequence into pieces of 200 base pairs: the first piece from 0 to 200, then 50 to 250, and so on until the end of the sequence.
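To make the windowing concrete, here is a minimal sketch of the coordinates involved. The name window_bounds is hypothetical, and two choices are assumptions not stated in the post: windows start every 50 bases, and the last window is allowed to be shorter than 200 so the tail of the sequence is still covered.

```python
def window_bounds(length, size=200, step=50):
    """Return (start, end) pairs for overlapping windows over a sequence.

    Assumption: a new window starts every `step` bases, and the last
    window is clipped to the sequence end instead of being dropped.
    """
    if length <= 0:
        return []
    # The last start is the first multiple of `step` whose window
    # reaches the end of the sequence.
    last_start_limit = max(length - size, 0) + step
    return [(start, min(start + size, length))
            for start in range(0, last_start_limit, step)]


# A 320 bp sequence gives three full windows and one clipped window.
print(window_bounds(320))
```

A sequence shorter than 200 bp yields a single window covering the whole sequence.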

Here is my current template (code at the end of this post).

It reads the sequences from one file and copies them to another (devoir_out.faa). The problems:
- The first sequence is missing from the output
- I now want to replace each parent sequence with a list of subsequences split as described above (using, I guess, the built-in slice function)

From what I learned in the tutorials, a SeqRecord object in BioPython has three main elements: the sequence (seq), the id and the description.

Thanks in advance!

# -*- coding: utf-8 -*-

from Bio.SeqRecord import SeqRecord 
from Bio import SeqIO
# import what we need

def seq_splitter(iterator, size):
    entry= True
    while entry:
        batch = []
        entry=iterator.next()
        batch.append(entry)
        if batch:  # if batch is not empty, yield it and wait for the next request
            yield batch

        # take the next sequence, add it to batch and return batch

handle = open("/Users/nikedon/Documents/python/CDS_Danio.txt") 

records = SeqIO.parse(handle, "fasta") 


out_handle = open("devoir_out.faa", "w") 

for i, item in enumerate(seq_splitter(records, 123)):
    #print "Found %i" %(i)
    #print "There is %i characters in the sequence" %(len(item[0].seq))
    SeqIO.write(records, out_handle, "fasta") 


out_handle.close() 

handle.close()
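A likely cause of the missing first record: seq_splitter pulls one record off the records iterator, and the loop body then passes that same, partly consumed iterator to SeqIO.write, which writes only what is left. The sketch below avoids this by handing SeqIO.write a generator of sliced records instead. It assumes Python 3 and a Biopython version where SeqIO.parse accepts a file path. The function names, the id_start_end naming of the pieces, and the choice to let the last piece be shorter than 200 are illustrative assumptions, not part of the original post.

```python
def split_record(record, size=200, step=50):
    """Yield (start, end, piece) for overlapping slices of one record.

    Works on anything that supports len() and slicing; slicing a
    SeqRecord returns a new SeqRecord with the same id and description.
    """
    start = 0
    while start < len(record):
        end = min(start + size, len(record))
        yield start, end, record[start:end]
        if end == len(record):
            break  # this window reached the end of the sequence
        start += step


def split_records(records, size=200, step=50):
    """Yield every window of every record, with coordinates in the id."""
    for record in records:
        for start, end, piece in split_record(record, size, step):
            # Hypothetical naming scheme: <original id>_<start>_<end>
            piece.id = "%s_%d_%d" % (record.id, start, end)
            yield piece


def write_windows(in_path, out_path, size=200, step=50):
    """Read a FASTA file and write all windows to a new FASTA file."""
    # Imported here so the helpers above can be tried without Biopython.
    from Bio import SeqIO

    records = SeqIO.parse(in_path, "fasta")
    with open(out_path, "w") as out_handle:
        # SeqIO.write consumes the generator and returns the record count.
        return SeqIO.write(split_records(records, size, step),
                           out_handle, "fasta")
```

Calling write_windows("CDS_Danio.txt", "devoir_out.faa") would then write every window of every record, first record included, and return the number of pieces written.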

ASKER CERTIFIED SOLUTION

Nikedon

Markus Fischer

Well done! -- (^v°)