• Status: Solved

Find Repeated Values in Python

Please check the code below and read the comments. How can we avoid writing to a file
when the ID and chain value are repeated, for the respective file objects file1 and file2?



            if chn == chain1:
                chA.append(chn)
                idA.append(id)
                #print (chA)
                # chA and idA are lists, so convert them before joining with a string
                print (str(chA) + "  " + str(idA))
                # this is the place:
                # if the value in chA and the value of ID are repeating
                #     ignore
                # else write

                #print (line1)
                x = line1
                file1.write(x)
            if chn == chain2:
                # this is the place:
                # if the value in chB and the value of ID are repeating
                #     ignore
                # else write
                chB.append(chn)
                idB.append(id)
                print (idB)
                #print (chB)
                #print (line1)
                file2.write(line1)

    counterA=counterA+1
    counterB=counterB+1

The complete script:

import urllib
import os
import math
import sys

sample = open("D:/dataset/transient.txt", 'r')
chA = []
chB = []
idA = []
idB = []
print (os.getcwd())
for a in sample.readlines():
    aa = a.split()
    id = aa[0]
    chain = aa[1].split(':')
    chain1 = chain[0]
    chain2 = chain[1]
    
    #print ("ID: " + id + " chain 1: "+ chain1 + " chain 2: " + chain2)
    
    try:
        lnk="http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="+id
        page=urllib.urlopen(lnk)
        fname1=id+'.pdb'
        with open (id+".pdb", "w") as output_file:
            output_file.write(page.read())
        print (fname1 + " file downloaded")
        page.close()
    except IOError:    # download or file error; do not hide unrelated bugs
        print (id + " reading Error ")
    
    filepdb = open(id+".pdb", 'r')

    counterB =0;
    counterA =0;
    
    # Open the per-chain output files once per ID instead of once per ATOM line.
    file1 = open(id+ "_"+ chain1 + ".pdb",'a+')
    file2 = open(id+ "_"+ chain2 + ".pdb",'a+')

    for line1 in filepdb.readlines():
        if line1.startswith('ATOM') or line1.startswith('HETATM'):

            #lig =chain[0:1].strip()
            #rec=chain[2:3].strip()
            chn = line1[21:22].strip()    # chain identifier column

            #print (chn)
            if chn == chain1:
                chA.append(chn)
                idA.append(id)
                #print (chA)
                # chA and idA are lists, so convert them before joining with a string
                print (str(chA) + "  " + str(idA))
                # ignore for the first time,
                # if id= 
                # this is the place:
                # if the value in chA and the value of ID are repeating
                #     ignore
                # else write

                #print (line1)
                file1.write(line1)
            if chn == chain2:
                chB.append(chn)
                idB.append(id)
                print (idB)
                #print (chB)
                #print (line1)
                file2.write(line1)

    # Close the chain files so their content is flushed before they are re-read below.
    file1.close()
    file2.close()

    counterA=counterA+1
    counterB=counterB+1

    file11 = open(id+ "_"+ chain1 + ".pdb",'r')
    file22 = open(id+ "_"+ chain2 + ".pdb",'r')
    
    file_com = open(id+"_"+chain1+"_"+chain2+".pdb", 'w+')
    
    for abc in file11:
        file_com.write(abc)
        #print(abc)
        
    for xyz in file22:
        file_com.write(xyz)
        #print(xyz)
        
    # Close the files of this ID before moving on to the next line of transient.txt.
    file_com.close()
    file11.close()
    file22.close()
    filepdb.close()

sample.close()
print('program end')

Asked by: Puneet Arora

1 Solution
 
pepr commented:
Can you say more about the chA and idA lists?  What is the purpose of each of them?  If you want to check for duplicates, you may want to build a set of IDs and check each ID against the set -- see  http://docs.python.org/library/stdtypes.html#set-types-set-frozenset.  A less efficient but otherwise equivalent way is to test the ID against the list (like "id in idA").  The only difference is the time complexity of the operation (O(n) for lists vs. O(1) on average for sets, which are hash-based).
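
A minimal sketch of the two membership tests, assuming Python 3 (the thread itself uses Python 2). The names ids_list, ids_set and the helper functions are made up for illustration; they are not part of the script in the question.

```python
# Hypothetical sample data: IDs that have already been processed.
ids_list = ["1bj1", "1ib1", "1efx"]
ids_set = set(ids_list)          # same content, hash-based lookup

def seen_in_list(pdb_id):
    # Scans the list element by element: O(n).
    return pdb_id in ids_list

def seen_in_set(pdb_id):
    # Hash lookup: O(1) on average.
    return pdb_id in ids_set

print(seen_in_list("1bj1"), seen_in_set("1bj1"))   # True True
print(seen_in_list("1rlb"), seen_in_set("1rlb"))   # False False
```

Both tests give the same answer; the set only pays off when many IDs have to be checked.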

A note for reading the lines from the file: remove the .readlines().  Use just:

    for line1 in filepdb:
        if line1.startswith('ATOM') or line1.startswith('HETATM'):
            ...

A file object is already iterable.  readlines() reads the whole file into memory (into a list) and only then do you iterate through that list.  You usually do not want to do that.
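
A small sketch of iterating over a file object directly, assuming Python 3. io.StringIO stands in for a real open file here, and both the sample text and the function name count_atom_lines are invented for the example.

```python
import io

# Stand-in for an open file; io.StringIO behaves like a text file object.
fake_pdb = io.StringIO(
    "HEADER    EXAMPLE\n"
    "ATOM      1  N   GLU H   1\n"
    "HETATM    2  O   HOH H 201\n"
    "END\n"
)

def count_atom_lines(fileobj):
    """Count ATOM/HETATM lines, reading one line at a time."""
    count = 0
    for line in fileobj:             # no .readlines(): lines are read lazily
        if line.startswith(("ATOM", "HETATM")):
            count += 1
    return count

atom_count = count_atom_lines(fake_pdb)
print(atom_count)   # 2
```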

It would also help to see the bigger picture of the problem.  There may be a different solution.

Puneet Arora (Author) commented:

The format of D:/dataset/transient.txt is this (ID ChainA:ChainB):
1bj1 H:V
1ib1 B:E
1bj1 L:W
1efx C:D
1qfu A:L
1qfw A:M
1qkz A:L
1i4d B:D
1is8 C:M
1is8 B:L
1is8 E:O
1is8 D:N
1is8 A:K
1rlb A:E


Now, if an ID and chain combination is repeated, e.g.

                      1rlb A:E
                      1rlb A:C

the ID is the same and ChainA is repeated.

Therefore, the repeated combination should not be written again to file1 or file2; in that case this write should be skipped:
                x = line1
                file1.write(x)
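
One way to express this rule, as a hedged sketch in Python 3: remember every (ID, chain) pair in a set and plan a write only the first time a pair appears. The function name plan_writes and the shortened sample list are assumptions for illustration.

```python
# A few lines in the transient.txt format, including the repeated example.
transient_lines = [
    "1bj1 H:V",
    "1bj1 L:W",
    "1rlb A:E",
    "1rlb A:C",
]

def plan_writes(lines):
    """Return (id, chain) pairs in first-seen order, skipping repeats."""
    seen = set()        # (id, chain) combinations already handled
    to_write = []       # keeps the original order
    for line in lines:
        pdb_id, couple = line.split()
        for chain in couple.split(":"):
            key = (pdb_id, chain)
            if key in seen:
                continue            # repeated combination: ignore
            seen.add(key)
            to_write.append(key)
    return to_write

writes = plan_writes(transient_lines)
for pdb_id, chain in writes:
    print(pdb_id + "_" + chain + ".pdb")
```

With this input the pair ("1rlb", "A") is planned once, so 1rlb_A.pdb would be written a single time while 1rlb_E.pdb and 1rlb_C.pdb are still produced.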



pepr commented:
Is the order of items in the transient.txt file important?  Is the first non-repeated item better than any of the repeated items?

Puneet Arora (Author) commented:
The order of items is important (we need to maintain it), and non-repeated items are better.

pepr commented:
Let me reformulate the problem to see if I understand it well.  You have some prescription in transient.txt that contains some lines.  Each line contains a "well known" identifier (like 1bj1) which describes a database file (in text form) like:

HEADER    COMPLEX (ANTIBODY/ANTIGEN)              30-JUN-98   1BJ1              
TITLE     VASCULAR ENDOTHELIAL GROWTH FACTOR IN COMPLEX WITH A                  
TITLE    2 NEUTRALIZING ANTIBODY
...

You are interested in lines starting with ATOM or HETATM where the character at position 22 (counted from 1, or 21 counted from 0) contains a letter that is somehow interesting for you -- it determines the chain (I know nothing about this kind of data, so you should check whether it makes sense ;)
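
A sketch of that column lookup in Python 3. The helper make_record is hypothetical and only builds shortened test lines with the chain letter at index 21; real PDB lines carry more columns.

```python
def make_record(record, serial, atom, residue, chain, res_seq):
    """Hypothetical helper: build the first 26 columns of an ATOM/HETATM line."""
    return "{:<6}{:>5} {:<4} {:>3} {}{:>4}".format(
        record, serial, atom, residue, chain, res_seq)

def chain_of(line):
    """Return the chain letter of an ATOM/HETATM line, or None for other lines."""
    if line.startswith(("ATOM", "HETATM")):
        return line[21:22].strip()   # position 22 counted from 1
    return None

atom_line = make_record("ATOM", 1, " N", "GLU", "H", 1)
het_line = make_record("HETATM", 2, " O", "HOH", "V", 201)

print(chain_of(atom_line))             # H
print(chain_of(het_line))              # V
print(chain_of("HEADER    COMPLEX"))   # None
```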

I can only guess that there is some reason for writing the transient.txt in the form like:

...
1rlb A:E
...
1rlb A:C
...

If I understand it well, the above two lines mean that you want to extract the interesting lines (ATOM or HETATM) from the 1rlb database file into separate chain files 1rlb_A.pdb, 1rlb_E.pdb, and 1rlb_C.pdb.  You want to avoid generating 1rlb_A.pdb a second time, as it would produce the same content.

From your code above, I am not sure whether you want to generate 1rlb_C.pdb.  It seems to me that you are not writing the C-lines anywhere (but I may have overlooked something).

Then you write the chains back to a file like 1rlb_A_E.pdb (again, I am not sure whether you wanted to add the C, thus getting 1rlb_A_E_C.pdb or possibly 1rlb_A_C_E.pdb).

In the last case, it looks as if you wanted to sort the 1rlb lines by the column that marks the chain.

Can you comment on that?

pepr commented:
I am a bit tired.  I guess that you can guess what I wanted to write :)

Puneet Arora (Author) commented:
Dear Pepr,

Your reformulation of the problem and your guess are very close, and in fact right.

I simply want to avoid writing content that would be duplicated when the ID and chain combination is repeated.

I really appreciate your efforts; let us find the solution.  I think some data structure like a dictionary or a set
can be used to keep track of the ID and chain combinations, and before writing to the respective files the code must check that set/dict to see whether it has already written that combination.

Puneet Arora (Author) commented:
Dear pepr,

My real issue is also Python itself: I am very new to it and I find it hard to write the code.

pepr commented:
How big is transient.txt usually?  When it contains the

1rlb A:E
...
1rlb A:C

what should the result be?  Would the reversed order in transient.txt make a difference?  If the items in transient.txt must stay ordered within the same ID, the items with different IDs probably need not be ordered.  That leads to a dictionary with the ID as the key.  The value of a dictionary item can then be a single A:E, or the list [A:E, A:C] (symbolically), or a set of the couples, or even only the set {A, E, C}.  The question is whether you want to combine the lines of all three chains together or not.  Your first approach also changes the order of lines in the reduced .pdb file.  You may prefer not to change their order and only leave out the lines that you do not want.

If transient.txt is small enough, it should be loaded into a dictionary, with the chain information stored in a set value.  Then the ID.pdb file should be opened and read only once, and the output file should be generated on the fly in one pass.
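
Since a set value loses the order of the chains, here is a hedged Python 3 sketch of the list variant mentioned above: a dictionary keyed by ID whose values are lists of chains in first-seen order. The function name chains_by_id and the sample lines are assumptions for illustration.

```python
from collections import OrderedDict

transient_lines = [
    "1bj1 H:V",
    "1rlb A:E",
    "1bj1 L:W",
    "1rlb A:C",
]

def chains_by_id(lines):
    """Map each ID to its chains in first-seen order, without repeats."""
    result = OrderedDict()             # remembers the order in which IDs first appear
    for line in lines:
        pdb_id, couple = line.split()
        chains = result.setdefault(pdb_id, [])
        for chain in couple.split(":"):
            if chain not in chains:    # short lists, so a linear scan is fine
                chains.append(chain)
    return result

table = chains_by_id(transient_lines)
for pdb_id, chains in table.items():
    print(pdb_id, chains)
```

A list keeps A before E before C for 1rlb, at the cost of a linear membership test, which hardly matters for a handful of chains per ID.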

Does it make sense?




pepr commented:
Try the following script on your transient.txt (modify the path to the file).  Does it make sense to you?

a.py
# Open the transient.txt file and build the dictionary of sets for filtering the files.
d = {}                            # empty dictionary
f = open('transient.txt')
for line in f:
    pdbId, couple = line.split()
    chainSet = set(couple.split(':')) # order of A:E is lost
    
    # If the pdbId already exists in the dictionary, you may want to update its set.
    # If it does not exist, you may want to pretend that the empty set is there.
    setForId = d.setdefault(pdbId, set())
    setForId.update(chainSet)
f.close()

# Let's have a look what is inside the dictionary.
for pdbId in d:
    print pdbId, d[pdbId]
    
# How the filtered pdb files can be named...
for pdbId in d:
    lst = [pdbId]     # the list with a single string element
    lst.extend(sorted(d[pdbId]))  # more elements, chain letters sorted
    ##print lst
    pdbname = '_'.join(lst) + '.pdb'
    print pdbname



It prints on my console:

c:\tmp\_Python\puneetarora2000\Q_27405950>python a.py
1rlb set(['A', 'E'])
1qfu set(['A', 'L'])
1is8 set(['A', 'C', 'B', 'E', 'D', 'K', 'M', 'L', 'O', 'N'])
1i4d set(['B', 'D'])
1efx set(['C', 'D'])
1ib1 set(['B', 'E'])
1qfw set(['A', 'M'])
1qkz set(['A', 'L'])
1bj1 set(['H', 'L', 'W', 'V'])
1rlb_A_E.pdb
1qfu_A_L.pdb
1is8_A_B_C_D_E_K_L_M_N_O.pdb
1i4d_B_D.pdb
1efx_C_D.pdb
1ib1_B_E.pdb
1qfw_A_M.pdb
1qkz_A_L.pdb
1bj1_H_L_V_W.pdb



The first half is the information collected in the dictionary, the second half shows the constructed filenames for the filtered output.  This way for example 1bj1.pdb input should be filtered into 1bj1_H_L_V_W.pdb containing only the lines for chains for H, L, V, W.  This can be done in one pass.  
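
The one-pass filtering step is not shown in the script, so the following is only a sketch of how it might look in Python 3. The names filter_chains and record are hypothetical, io.StringIO stands in for the real input and output files, and the test lines are shortened to the columns that matter.

```python
import io

def filter_chains(pdb_lines, wanted_chains):
    """Yield only ATOM/HETATM lines whose chain (column 22) is wanted."""
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21:22] in wanted_chains:
            yield line

def record(kind, serial, chain):
    """Hypothetical helper: a shortened ATOM/HETATM line with the chain at index 21."""
    return "{:<6}{:>5} {:<4} {:>3} {}\n".format(kind, serial, " CA", "GLY", chain)

# Chain set for one ID, in the shape the dictionary d[pdbId] has.
chains_for_id = {"1bj1": {"H", "L", "V", "W"}}

pdb_text = io.StringIO(
    "HEADER    EXAMPLE\n"
    + record("ATOM", 1, "H")
    + record("ATOM", 2, "X")       # chain X is not listed, so it is dropped
    + record("HETATM", 3, "W")
)

output = io.StringIO()              # stands in for 1bj1_H_L_V_W.pdb
for line in filter_chains(pdb_text, chains_for_id["1bj1"]):
    output.write(line)

kept = output.getvalue().splitlines()
print(len(kept))                    # 2
```

Because the lines are written in the order they are read, the filtered file keeps the line order of the original .pdb file.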

Is that what you want?

Puneet Arora (Author) commented:
Not a complete solution, but any good work is appreciated.

pepr commented:
Well, I was expecting more information from you to know what is inside your head ;) Anyway, thanks for the points.

pepr commented:
Did you solve the problem?  Or do you want to continue?  (I don't mind that the question is closed, you can still continue. ;)
