Solved

# Find Repeated Values in Python

Posted on 2011-10-19
Please check the code below and read the comments. How can we avoid writing to the respective file objects, file1 and file2,
when an ID and chain value combination is repeated?

    if chn == chain1:
        chA.append(chn)
        idA.append(id)
        #print (chA)
        print (str(chA) + "  " + str(idA))
        # this is the place
        # if the value in chA and the value of ID are repeating
        #     ignore
        # else write

        #print (line1)
        x = line1
        file1.write(x)
    if chn == chain2:
        # this is the place
        # if the value in chB and the value of ID are repeating
        #     ignore
        # else write

        chB.append(chn)
        idB.append(id)
        print (idB)
        #print (chB)
        #print (line1)
        file2.write(line1)

    counterA = counterA + 1
    counterB = counterB + 1

``````
import urllib
import os
import math
import sys

sample = open("D:/dataset/transient.txt", 'r')
chA = []
chB = []
idA = []
idB = []
print (os.getcwd())
for a in sample:    # loop header missing from the post; reconstructed from context
    aa = a.split()
    id = aa[0]
    chain = aa[1].split(':')
    chain1 = chain[0]
    chain2 = chain[1]

    #print ("ID: " + id + " chain 1: "+ chain1 + " chain 2: " + chain2)

    try:
        page = urllib.urlopen(lnk)    # lnk (the download URL) is not defined in the post
        fname1 = id + '.pdb'
        with open(id + ".pdb", "w") as output_file:
            output_file.write(page.read())    # body missing from the post; reconstructed
        page.close()
    except:
        print (id + " reading Error ")

    filepdb = open(id + ".pdb", 'r')

    counterB = 0
    counterA = 0

    for line1 in filepdb.readlines():    # loop header missing from the post; reconstructed
        if line1.startswith('ATOM') or line1.startswith('HETATM'):

            #lig =chain[0:1].strip()
            #rec=chain[2:3].strip()
            chn = line1[21:22].strip()

            #print (chn)
            file1 = open(id + "_" + chain1 + ".pdb", 'a+')
            file2 = open(id + "_" + chain2 + ".pdb", 'a+')

            if chn == chain1:
                chA.append(chn)
                idA.append(id)
                #print (chA)
                print (str(chA) + "  " + str(idA))
                # ignore for the first time,
                # if id=
                # this is the place
                # if the value in chA and the value of ID are repeating
                #     ignore
                # else write

                #print (line1)
                x = line1
                file1.write(x)
            if chn == chain2:
                chB.append(chn)
                idB.append(id)
                print (idB)
                #print (chB)
                #print (line1)
                file2.write(line1)

            # Close both files so the data is flushed before they are re-read below.
            file1.close()
            file2.close()

            counterA = counterA + 1
            counterB = counterB + 1

    file11 = open(id + "_" + chain1 + ".pdb", 'r')
    file22 = open(id + "_" + chain2 + ".pdb", 'r')

    file_com = open(id + "_" + chain1 + "_" + chain2 + ".pdb", 'w+')

    for abc in file11:
        file_com.write(abc)
        #print(abc)

    for xyz in file22:
        file_com.write(xyz)
        #print(xyz)

    file_com.close()
    file11.close()
    file22.close()
    filepdb.close()

print('program end')
sample.close()
``````
Question by:Puneet Arora

Expert Comment

Can you write something more about the chA and idA lists?  What is the purpose of each of them?  If you want to check for duplicates, you may want to build a set of IDs and check each ID against the set; see http://docs.python.org/library/stdtypes.html#set-types-set-frozenset.  A less efficient but basically equivalent way is to test the ID against the list (like "id in idA").  The only difference is the time complexity of the operation (O(n) for lists vs. O(1) on average for sets).
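As a minimal sketch of the set-based check, assuming the thing to track is the (ID, chain) combination rather than the ID alone: the helper name `is_first_time` and the module-level `seen` set are hypothetical, chosen only for illustration.

```python
# Track which (ID, chain) combinations have already been written.
seen = set()

def is_first_time(pdb_id, chain):
    """Return True the first time a (pdb_id, chain) pair is seen, False afterwards."""
    key = (pdb_id, chain)
    if key in seen:        # average O(1) lookup in a set
        return False
    seen.add(key)
    return True

results = [is_first_time("1rlb", "A"),   # first time chain A of 1rlb appears
           is_first_time("1rlb", "E"),
           is_first_time("1rlb", "A"),   # repeated combination
           is_first_time("1rlb", "C")]
print(results)   # [True, True, False, True]
```

In the posted loop, a check like this would guard each `file1.write(...)` or `file2.write(...)` call, although it would have to be made once per transient.txt entry rather than once per ATOM line.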

A note for reading the lines from the file: remove the .readlines().  Use just:

``````
for line1 in filepdb:
    if line1.startswith('ATOM') or line1.startswith('HETATM'):
        ...
``````

File objects support iteration directly.  readlines() reads the whole file into memory (as a list), and only then do you iterate through the list.  You usually do not want to do that.
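A small sketch of the lazy iteration, using `io.StringIO` as a stand-in for an open .pdb file so that it runs without any file on disk (the three sample lines are made up):

```python
import io

# io.StringIO stands in for an open .pdb file so the example runs without one.
fake_pdb = io.StringIO(
    "HEADER    EXAMPLE\n"
    "ATOM      1  N   GLU A   1\n"
    "HETATM    2  O   HOH E   2\n"
)

kept = []
for line in fake_pdb:                        # one line at a time, no list is built first
    if line.startswith(("ATOM", "HETATM")):  # startswith accepts a tuple of prefixes
        kept.append(line)

print(len(kept))   # 2
```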

It would also be good to show a big picture of the problem.  There can be a different solution to your problem.

Author Comment

The format of D:/dataset/transient.txt is as follows (ID, ChainA:ChainB):
1bj1 H:V
1ib1 B:E
1bj1 L:W
1efx C:D
1qfu A:L
1qfw A:M
1qkz A:L
1i4d B:D
1is8 C:M
1is8 B:L
1is8 E:O
1is8 D:N
1is8 A:K
1rlb A:E

Now, if (ID, ChainA, ChainB) is repeated, e.g.

1rlb A:E
1rlb A:C

Here the ID is the same and ChainA is repeated.

Therefore, writing to file1 and file2 should be avoided in the respective cases:
x = line1
file1.write(x)


Expert Comment

Is the order of items in the transient.txt file important?  Is the first non-repeated item better than some of the repeated items?

Author Comment

The order of items is important (we need to maintain it), and non-repeated items are better.

Expert Comment

Let me reformulate the problem to see if I understand it well.  You have some prescription in transient.txt that contains some lines.  Each line contains a "well known" identifier (like 1bj1) which describes a database file (in text form) like:

``````HEADER    COMPLEX (ANTIBODY/ANTIGEN)              30-JUN-98   1BJ1
TITLE     VASCULAR ENDOTHELIAL GROWTH FACTOR IN COMPLEX WITH A
TITLE    2 NEUTRALIZING ANTIBODY
...
``````

You are interested in lines starting with ATOM or HETATM where the character at position 22 (counted from 1, or 21 counted from 0) contains a letter that is somehow interesting for you: it determines the chain (I know nothing about this kind of data, so you should check whether it makes sense ;)
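To make the column arithmetic concrete, here is a short sketch. The helper name `chain_id` is hypothetical, and the record is assembled field by field with made-up atom values so that the position of each column is visible.

```python
def chain_id(line):
    """Return the chain letter of an ATOM/HETATM record, or None for other lines."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return line[21:22].strip()    # column 22 counted from 1 == index 21 counted from 0

# Build a record field by field so the column positions are explicit
# (the atom values themselves are made up for illustration).
record = ("ATOM".ljust(6)     # columns 1-6: record name
          + "1".rjust(5)      # columns 7-11: atom serial number
          + " "               # column 12
          + " N  "            # columns 13-16: atom name
          + " "               # column 17
          + "GLU"             # columns 18-20: residue name
          + " "               # column 21
          + "H")              # column 22: chain identifier

print(chain_id(record))                                   # H
print(chain_id("HEADER    COMPLEX (ANTIBODY/ANTIGEN)"))   # None
```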

I can only guess that there is some reason for writing the transient.txt in the form like:

``````...
1rlb A:E
...
1rlb A:C
...
``````

If I understand it well, the above two lines mean that you want to extract the interesting lines (ATOM or HETATM) from the 1rlb database file into separate chain files 1rlb_A.pdb, 1rlb_E.pdb, and 1rlb_C.pdb.  You want to avoid generating 1rlb_A.pdb a second time, as that would produce the same content.

From your code above, I am not sure, whether you want to generate the 1rlb_C.pdb. It seems to me that you are not writing the C-lines anywhere (but I may have overlooked something).

Then you write the chains back to a file like 1rlb_A_E.pdb (again, I am not sure whether you wanted to add the C as well, thus getting 1rlb_A_E_C.pdb or possibly 1rlb_A_C_E.pdb).

In the last case, it looks as if you wanted to sort the 1rlb-lines by the column that marks the chain.

Can you comment on that?

Expert Comment

I am a bit tired. I guess that you guess what I wanted to write :)

Author Comment

Dear Pepr,

Your reformulation of the problem and your guess are very close, and in fact right.

I simply want to avoid writing content that would be duplicated when the ID and chain combination is repeated.

I really appreciate your efforts; let us find the solution. I think some data structure like a dictionary or a set
can be used to keep track of the ID and chain combinations, and before writing to the respective files the code must check against that set/dict whether it has already written that combination.

Author Comment

Dear

My real issue is also Python itself: I am very new to it and I find it hard to write the code.

Expert Comment

How big is transient.txt usually?  When it contains

``````1rlb A:E
...
1rlb A:C
``````

what should the result be?  Would the reversed order in transient.txt make a difference?  If the items in transient.txt must keep their order within the same ID, the items with different IDs probably need not be ordered.  That leads to a dictionary with the ID as the key.  The value of a dictionary item can then be a single A:E, or the list [A:E, A:C] (symbolically), or a set of the couples, or even only the set {A, E, C}.  The question is whether you want to combine the lines of all three chains together or not.  Your first approach also changes the order of lines in the reduced .pdb file.  You may prefer not to change their order and only leave out the lines that you do not want.
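If the order of the chains within one ID should be kept, one option among those listed is a dictionary whose values are lists in first-seen order. The sketch below assumes that reading of the requirement; `load_chains` is a hypothetical name, and the sample lines are taken from the thread.

```python
def load_chains(lines):
    """Map each ID to its chain letters in first-seen order, without repeats."""
    chains_by_id = {}
    for line in lines:
        pdb_id, couple = line.split()
        chains = chains_by_id.setdefault(pdb_id, [])
        for chain in couple.split(":"):
            if chain not in chains:    # a linear scan is fine for a handful of chains
                chains.append(chain)
    return chains_by_id

sample = ["1bj1 H:V", "1ib1 B:E", "1bj1 L:W", "1rlb A:E", "1rlb A:C"]
result = load_chains(sample)
print(result["1bj1"])   # ['H', 'V', 'L', 'W']
print(result["1rlb"])   # ['A', 'E', 'C']
```

A set value would be faster to query but would lose the A-before-E ordering, so the choice depends on whether that order matters in the output.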

If transient.txt is small enough, it should be loaded into a dictionary, with the chain information for each ID stored in a set value.  Then each ID.pdb file should be opened and read only once, and the output file generated on the fly in one pass.
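The one-pass idea could look roughly like this. It is only a sketch: `filter_chains` and `fake_record` are hypothetical helpers, and the records are simplified stand-ins that fill in just the columns the filter looks at.

```python
def filter_chains(pdb_lines, wanted):
    """Yield ATOM/HETATM lines whose chain letter is in `wanted`, in input order."""
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21:22] in wanted:
            yield line

def fake_record(kind, serial, chain):
    """Simplified stand-in for a PDB record: only the columns used here are filled."""
    return kind.ljust(6) + str(serial).rjust(5) + " " * 10 + chain + "\n"

lines = [
    "HEADER    EXAMPLE\n",
    fake_record("ATOM", 1, "A"),
    fake_record("ATOM", 2, "B"),
    fake_record("HETATM", 3, "E"),
    fake_record("ATOM", 4, "A"),
]
kept = list(filter_chains(lines, {"A", "E"}))
print([line[6:11].strip() for line in kept])   # ['1', '3', '4']
```

Because the function is a generator, an open input file can be passed in place of `lines` and the result handed to `writelines()` on the output file, so each .pdb file is read once and the original line order is preserved.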

Does it make sense?

Accepted Solution

Try the following script for your transient.txt (modify the path to the file).  Does it make sense to you?

a.py
``````
# Open the transient.txt file and build the dictionary of sets for filtering the files.
d = {}                            # empty dictionary
f = open('transient.txt')
for line in f:
    pdbId, couple = line.split()
    chainSet = set(couple.split(':')) # order of A:E is lost

    # If the pdbId already exists in the dictionary, you may want to update its set.
    # If it does not exist, you may want to pretend that the empty set is there.
    setForId = d.setdefault(pdbId, set())
    setForId.update(chainSet)
f.close()

# Let's have a look at what is inside the dictionary.
for pdbId in d:
    print pdbId, d[pdbId]

# How the filtered pdb files can be named...
for pdbId in d:
    lst = [pdbId]     # the list with a single string element
    lst.extend(sorted(d[pdbId]))  # more elements, chain letters sorted
    ##print lst
    pdbname = '_'.join(lst) + '.pdb'
    print pdbname
``````

It prints on my console:

``````c:\tmp\_Python\puneetarora2000\Q_27405950>python a.py
1rlb set(['A', 'E'])
1qfu set(['A', 'L'])
1is8 set(['A', 'C', 'B', 'E', 'D', 'K', 'M', 'L', 'O', 'N'])
1i4d set(['B', 'D'])
1efx set(['C', 'D'])
1ib1 set(['B', 'E'])
1qfw set(['A', 'M'])
1qkz set(['A', 'L'])
1bj1 set(['H', 'L', 'W', 'V'])
1rlb_A_E.pdb
1qfu_A_L.pdb
1is8_A_B_C_D_E_K_L_M_N_O.pdb
1i4d_B_D.pdb
1efx_C_D.pdb
1ib1_B_E.pdb
1qfw_A_M.pdb
1qkz_A_L.pdb
1bj1_H_L_V_W.pdb
``````

The first half is the information collected in the dictionary; the second half shows the constructed filenames for the filtered output.  This way, for example, the 1bj1.pdb input should be filtered into 1bj1_H_L_V_W.pdb containing only the lines for chains H, L, V, and W.  This can be done in one pass.

Is that what you want?

Author Closing Comment

Not a complete solution, but the good work is appreciated.

Expert Comment

Did you solve the problem?  Or do you want to continue?  (I don't mind that the question is closed, you can still continue. ;)