Puneet Arora


Find Repeated Values in Python

Please check the code below and read the comments. How can we avoid writing to a file
when the ID and chain value are repeated, for the respective file objects file1 and file2?



            if chn == chain1:
                chA.append(chn)
                idA.append(id)
                #print (chA)
                # chA and idA are lists, so convert them to strings before joining
                print (str(chA) + "  " + str(idA))
                # this is the place:
                # if the value in chA and the value of ID are repeating
                #     ignore
                # else
                #     write

                #print (line1)
                x = line1
                file1.write(x)
            if chn == chain2:
                # this is the place:
                # if the value in chB and the value of ID are repeating
                #     ignore
                # else
                #     write
                chB.append(chn)
                idB.append(id)
                print (idB)
                #print (chB)
                #print (line1)
                file2.write(line1)

    counterA = counterA + 1
    counterB = counterB + 1
   
import os
import urllib.request  # Python 3; on Python 2 use urllib.urlopen instead

sample = open("D:/dataset/transient.txt", 'r')
chA = []
chB = []
idA = []
idB = []
print (os.getcwd())
for a in sample:
    aa = a.split()
    id = aa[0]
    chain = aa[1].split(':')
    chain1 = chain[0]
    chain2 = chain[1]

    #print ("ID: " + id + " chain 1: "+ chain1 + " chain 2: " + chain2)

    try:
        lnk = "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=" + id
        page = urllib.request.urlopen(lnk)
        fname1 = id + '.pdb'
        # the response body is bytes, so the file is written in binary mode
        with open(fname1, "wb") as output_file:
            output_file.write(page.read())
        print (fname1 + " file downloaded")
        page.close()
    except Exception:
        print (id + " reading Error ")

    filepdb = open(id + ".pdb", 'r')

    counterB = 0
    counterA = 0

    # open the chain files once per transient.txt line, not once per ATOM line
    file1 = open(id + "_" + chain1 + ".pdb", 'a+')
    file2 = open(id + "_" + chain2 + ".pdb", 'a+')

    for line1 in filepdb:
        if line1.startswith('ATOM') or line1.startswith('HETATM'):

            #lig =chain[0:1].strip()
            #rec=chain[2:3].strip()
            chn = line1[21:22].strip()

            #print (chn)

            if chn == chain1:
                chA.append(chn)
                idA.append(id)
                #print (chA)
                # chA and idA are lists, so convert them to strings before joining
                print (str(chA) + "  " + str(idA))
                # ignore for the first time,
                # if id=
                # this is the place:
                # if the value in chA and the value of ID are repeating
                #     ignore
                # else
                #     write

                #print (line1)
                x = line1
                file1.write(x)
            if chn == chain2:
                chB.append(chn)
                idB.append(id)
                print (idB)
                #print (chB)
                #print (line1)
                file2.write(line1)

    counterA = counterA + 1
    counterB = counterB + 1

    # close (and flush) the chain files before reading them back
    file1.close()
    file2.close()
    filepdb.close()

    file11 = open(id + "_" + chain1 + ".pdb", 'r')
    file22 = open(id + "_" + chain2 + ".pdb", 'r')

    file_com = open(id + "_" + chain1 + "_" + chain2 + ".pdb", 'w+')

    for abc in file11:
        file_com.write(abc)
        #print(abc)

    for xyz in file22:
        file_com.write(xyz)
        #print(xyz)

    # close the files of this ID before moving on to the next one
    file_com.close()
    file11.close()
    file22.close()

print ('program end')
sample.close()

pepr

Can you write something more about the chA and idA lists?  What is the purpose of each of them?  If you want to check for duplicates, you may want to build a set of IDs and check the id against the set -- see  http://docs.python.org/library/stdtypes.html#set-types-set-frozenset.  A less efficient but basically equivalent way is to test the id against the list (like "id in idA").  The only difference is the time complexity of the operation (O(n) for lists vs. O(1) on average for sets, which are hash-based).
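
The set-based check suggested above could be sketched as follows. This is only a minimal illustration; the function name first_occurrences and the sample IDs are assumptions made for the example, not part of the original program.

```python
def first_occurrences(items):
    """Return the items in their original order, skipping repeats."""
    seen = set()      # values already encountered; membership tests are O(1) on average
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Hypothetical sample: the ID 1bj1 appears twice.
print(first_occurrences(["1bj1", "1ib1", "1bj1", "1efx"]))
# prints ['1bj1', '1ib1', '1efx']
```

Replacing the set with a list would give the same result; only the cost of each "in" test changes.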

A note for reading the lines from the file: remove the .readlines().  Use just:

    for line1 in filepdb:
        if line1.startswith('ATOM') or line1.startswith('HETATM'):
            ...



File objects can be iterated directly.  readlines() reads the whole file into memory (into a list) and then you iterate through the list.  You usually do not want to do that.

It would also be good to show a big picture of the problem.  There can be a different solution to your problem.
Puneet Arora

ASKER

 

The format of D:/dataset/transient.txt is {ID, ChainA, ChainB}:
1bj1 H:V
1ib1 B:E
1bj1 L:W
1efx C:D
1qfu A:L
1qfw A:M
1qkz A:L
1i4d B:D
1is8 C:M
1is8 B:L
1is8 E:O
1is8 D:N
1is8 A:K
1rlb A:E


Now, if {ID, ChainA, ChainB} is repeated, e.g.

                      1rlb A:E
                      1rlb A:C

Here the ID is the same and ChainA is repeated.

Therefore, the write to file1 or file2 should be skipped in the respective case:
                x = line1
                file1.write(x)
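
A possible way to decide, for each line of transient.txt, which chain files still need to be written is sketched below. It assumes that a chain file is identified by the (ID, chain) pair; the function name plan_writes is hypothetical.

```python
def plan_writes(entries):
    """For each 'ID chainA:chainB' entry, list the chains not yet written for that ID."""
    written = set()   # (ID, chain) pairs that were already scheduled for writing
    plan = []
    for entry in entries:
        pdb_id, pair = entry.split()
        new_chains = []
        for chain in pair.split(':'):
            if (pdb_id, chain) not in written:
                written.add((pdb_id, chain))
                new_chains.append(chain)
        plan.append((pdb_id, new_chains))
    return plan


for pdb_id, new_chains in plan_writes(["1rlb A:E", "1rlb A:C"]):
    print(pdb_id, new_chains)
# prints:
# 1rlb ['A', 'E']
# 1rlb ['C']
```

For the second entry only chain C is new, so the write for chain A would be skipped while the order of the entries is preserved.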



Is the order of items in the transient.txt file important?  Is the first non-repeated item better than some of the repeated items?
The order of items is important (we need to maintain it), and non-repeated items are better.
Let me reformulate the problem to see if I understand it well.  You have some prescription in transient.txt that contains some lines.  Each line contains a "well known" identifier (like 1bj1) which describes a database file (text form) like:

HEADER    COMPLEX (ANTIBODY/ANTIGEN)              30-JUN-98   1BJ1              
TITLE     VASCULAR ENDOTHELIAL GROWTH FACTOR IN COMPLEX WITH A                  
TITLE    2 NEUTRALIZING ANTIBODY
...



You are interested in lines starting with ATOM or HETATM where the character at position 22 (counted from 1, or 21 counted from 0) is a letter that is somehow interesting for you -- it determines the chain (I know nothing about this kind of data, so you should check whether it makes sense ;)
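
The extraction of the chain letter described above might look like the sketch below. The helper names chain_of and make_record are hypothetical, and the sample record is built artificially only so that the chain letter lands at index 21; it is not taken from a real PDB file.

```python
def chain_of(line):
    """Return the chain letter of an ATOM/HETATM line, or None for any other line."""
    if line.startswith(('ATOM', 'HETATM')):
        return line[21:22].strip()   # character 22 counted from 1
    return None


def make_record(record, chain):
    """Build an artificial record line: 21 characters of fields, then the chain letter."""
    return "{:<6}{:>5} {:<4} {:>3} {}".format(record, 1, "N", "MET", chain)


print(chain_of(make_record("ATOM", "A")))      # prints A
print(chain_of(make_record("HETATM", "E")))    # prints E
print(chain_of("HEADER    COMPLEX (ANTIBODY/ANTIGEN)"))   # prints None
```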

I can only guess that there is some reason for writing the transient.txt in the form like:

...
1rlb A:E
...
1rlb A:C
...



If I understand it well, the above two lines mean that you want to extract the interesting lines (ATOM or HETATM) from the 1rlb database file to separate chain files 1rlb_A.pdb, 1rlb_E.pdb, and 1rlb_C.pdb.  You want to avoid generating the 1rlb_A.pdb as it would produce the same content.

From your code above, I am not sure, whether you want to generate the 1rlb_C.pdb. It seems to me that you are not writing the C-lines anywhere (but I may have overlooked something).

Then you write the chains back to a file like 1rlb_A_E.pdb (not sure again whether you wanted to add the C, thus getting 1rlb_A_E_C.pdb or possibly 1rlb_A_C_E.pdb).

In the last case, it looks as if you wanted to sort the 1rlb-lines by the column that marks the chain.

Can you comment on that?
I am a bit tired. I guess that you guess what I wanted to write :)
Dear Pepr,

Your reformulation of the problem and your guess are very close, and in fact right.

I simply want to avoid writing content that would be duplicated when the ID and chain combination is repeated.

I really appreciate your efforts. Let us find the solution: I think some data structure like a dictionary or a set
can be used to keep track of the ID and chain combinations, and before writing to the respective files the code must check in that set/dict whether it has already written that combination.

Dear

My real issue is Python as well. I am very new to it and I find it hard to write the code.
How big is transient.txt usually?  When it contains

1rlb A:E
...
1rlb A:C



what should the result be?  Would the reversed order in transient.txt make a difference?  If the items in transient.txt must be ordered for the same ID, the items with different IDs probably need not be ordered.  That leads to a dictionary with the ID as a key.  The value of the dictionary item can then be a single A:E, or the list [A:E, A:C] (symbolically), or a set of the pairs, or even only the set {A, E, C}.  The question is whether you want to combine all three chains together or not.  Your first approach also changes the order of lines in the reduced .pdb file.  You may prefer not to change their order and only leave out the lines that you do not want.

If transient.txt is small enough, it should be loaded into a dictionary with the information about the chains stored in a set value.  Then the ID.pdb file should be opened and read only once, and the output file should be generated on the fly in one pass.

Does it make sense?
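
The one-pass idea described above could be sketched roughly like this. It is only an illustration under stated assumptions: the function names load_chains and filter_pdb_lines are hypothetical, the input is passed as lists of lines instead of real files, and a list is used alongside the set so that the first-seen order of the chains is kept.

```python
def load_chains(entries):
    """Map each ID to the distinct chains requested for it, in first-seen order."""
    chains_by_id = {}
    for entry in entries:
        parts = entry.split()
        if len(parts) != 2:
            continue                      # skip blank or malformed lines
        pdb_id, pair = parts
        wanted = chains_by_id.setdefault(pdb_id, [])
        for chain in pair.split(':'):
            if chain not in wanted:       # a repeated (ID, chain) is recorded only once
                wanted.append(chain)
    return chains_by_id


def filter_pdb_lines(pdb_lines, chains):
    """Yield ATOM/HETATM lines of the wanted chains, keeping the original line order."""
    wanted = set(chains)
    for line in pdb_lines:
        if line.startswith(('ATOM', 'HETATM')) and line[21:22] in wanted:
            yield line


chains_by_id = load_chains(["1rlb A:E", "1bj1 H:V", "1rlb A:C"])
print(chains_by_id["1rlb"])   # prints ['A', 'E', 'C']
```

With such a dictionary, each ID.pdb file would be read once and every matching line written once, so a repeated ID and chain combination cannot produce duplicated output. Whether all chains of one ID should end up in a single file is still the open question from above.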




ASKER CERTIFIED SOLUTION
pepr

Not a complete solution, but any good work is appreciated.
Well, I was expecting more information from you to know what is inside your head ;) Anyway, thanks for the points.
Did you solve the problem?  Or do you want to continue?  (I don't mind that the question is closed, you can still continue. ;)