Asked by Nikedon (Switzerland)

Sequence splitter in BioPython

Hello everybody!

I'm a beginner with Biopython and I have to write my first program for my Master's in Biology.

The plan is to parse a large FASTA file (containing more than 10,000 SeqRecords) and slice each sequence into overlapping pieces of 200 base pairs: the first piece from 0 to 200, the next from 50 to 250, and so on in steps of 50 until the end of the sequence.
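To make the window arithmetic concrete, here is a minimal sketch in plain Python 3 that only computes the (start, end) coordinates. The name window_bounds and the handling of the sequence end are assumptions, not part of the original post: only full-size windows are kept, so a short tail may stay uncovered, and a sequence shorter than the window is returned once, whole.

```python
def window_bounds(length, size=200, step=50):
    """Yield (start, end) pairs for overlapping windows over `length` bases.

    Assumption: only full-size windows are produced, so a few bases at the
    very end may not be covered. A sequence shorter than `size` yields a
    single window spanning the whole sequence.
    """
    if length <= 0:
        return
    last_start = max(length - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, min(start + size, length)


print(list(window_bounds(300)))
# [(0, 200), (50, 250), (100, 300)]
```

If the tail must always be covered, one option would be to add a final window anchored at the end of the sequence; which behavior is wanted depends on the assignment.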

Here is my current template (code below).

It reads the sequences from one file and copies them to another (devoir_out.faa). The problems:
- The first sequence is missing from the output.
- I now want to replace each parent sequence with a list of subsequences split as described above (using, I guess, the built-in slice function).
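A likely cause of the missing first sequence: the generator pulls the first record out of the shared iterator, and the loop body then writes the iterator itself (records) instead of the batch it just received, so only the remaining records reach the output. The sketch below reproduces this with plain Python 3 iterators standing in for SeqIO.parse and SeqIO.write; the function names are hypothetical.

```python
def one_at_a_time(iterator):
    """Yield each entry of `iterator` as a one-element batch."""
    for entry in iterator:
        yield [entry]


def buggy_copy(items):
    records = iter(items)           # stand-in for SeqIO.parse(...)
    written = []
    for batch in one_at_a_time(records):
        # Mirrors SeqIO.write(records, ...): drains the shared iterator,
        # which no longer contains the entry already placed in `batch`.
        written.extend(records)
    return written


def fixed_copy(items):
    records = iter(items)
    written = []
    for batch in one_at_a_time(records):
        written.extend(batch)       # write the batch, not the iterator
    return written


print(buggy_copy(["seq1", "seq2", "seq3"]))  # ['seq2', 'seq3']
print(fixed_copy(["seq1", "seq2", "seq3"]))  # ['seq1', 'seq2', 'seq3']
```

In the posted script, the equivalent change would be to pass item rather than records to SeqIO.write.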

From what I learned in the tutorials, a SeqRecord object in Biopython has three main attributes: the sequence (seq), the id, and the description.

Thanks in advance!

# -*- coding: utf-8 -*-

from Bio.SeqRecord import SeqRecord 
from Bio import SeqIO
# import what we need

def seq_splitter(iterator, size):
    entry= True
    while entry:
        batch = []
        entry=iterator.next()
        batch.append(entry)
        if batch:  # if batch is not empty, yield it and wait for the next call
            yield batch

        # take the next sequence, add it to batch, and yield batch

handle = open("/Users/nikedon/Documents/python/CDS_Danio.txt") 

records = SeqIO.parse(handle, "fasta") 


out_handle = open("devoir_out.faa", "w") 

for i, item in enumerate(seq_splitter(records, 123)):
    #print "Found %i" %(i)
    #print "There is %i characters in the sequence" %(len(item[0].seq))
    SeqIO.write(records, out_handle, "fasta") 


out_handle.close() 

handle.close()
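For comparison, here is one possible way to get the sliding-window output, written for Python 3. It is a sketch under stated assumptions, not the accepted solution: the helper names sliding_windows and split_fasta are hypothetical, only full-size windows are kept (a sequence shorter than the window is written once, whole), and each piece gets an id built from the parent id plus its coordinates. It relies on the fact that a SeqRecord supports len() and slicing, and that slicing returns a new SeqRecord.

```python
def sliding_windows(record, size=200, step=50):
    """Yield (start, end, piece) for each window of `record`.

    `record` can be any sliceable object with a length, such as a str or a
    Biopython SeqRecord. Assumption: only full-size windows are kept,
    except that a record shorter than `size` is yielded once, whole.
    """
    length = len(record)
    if length == 0:
        return
    last_start = max(length - size, 0)
    for start in range(0, last_start + 1, step):
        end = min(start + size, length)
        yield start, end, record[start:end]


def split_fasta(in_path, out_path, size=200, step=50):
    """Write every window of every record in `in_path` to `out_path`.

    Returns the number of pieces written.
    """
    from Bio import SeqIO  # imported lazily; requires Biopython

    def pieces():
        for record in SeqIO.parse(in_path, "fasta"):
            for start, end, sub in sliding_windows(record, size, step):
                # Hypothetical naming scheme: parent id plus coordinates.
                sub.id = "%s_%d_%d" % (record.id, start, end)
                sub.description = ""
                yield sub

    return SeqIO.write(pieces(), out_path, "fasta")
```

With the file names from the post, this would be called as split_fasta("/Users/nikedon/Documents/python/CDS_Danio.txt", "devoir_out.faa"). Because both the parser and the writer work on iterators, the 10,000+ records are processed one at a time rather than loaded into memory together.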

