Puneet Arora
asked on
Find Repeated Values in Python
Please check the code below and read the comments. How can we avoid writing to a file
if the ID and chain value are repeated for the respective file objects file1 and file2?
if chn == chain1:
    chA.append(chn);
    idA.append(id);
    #print (chA)
    print (chA + " " + idA)
    # this is the place
    #if value in chA and value of ID are repeating
    # ignore
    #else Write
    #print (line1)
    x = line1
    file1.write(x)
if chn == chain2:
    # this is the place
    #if value in chB and value of ID are repeating
    # ignore
    #else Write
    chB.append(chn);
    idB.append(id);
    print (idB)
    #print (chB)
    #print (line1)
    file2.write(line1)
counterA=counterA+1;
counterB=counterB+1;
import urllib
import os
import math
import sys
sample = open("D:/dataset/transient.txt", 'r')
chA =[];
chB =[];
idA=[];
idB=[];
print (os.getcwd())
for a in sample.readlines():
    aa = a.split()
    id = aa[0]
    chain = aa[1].split(':')
    chain1 = chain[0]
    chain2 = chain[1]
    #print ("ID: " + id + " chain 1: "+ chain1 + " chain 2: " + chain2)
    try:
        lnk="http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="+id
        page=urllib.urlopen(lnk)
        fname1=id+'.pdb'
        with open (id+".pdb", "w") as output_file:
            output_file.write(page.read())
        print (fname1 + " file downloaded")
        page.close()
    except:
        print (id + " reading Error ")
    filepdb = open(id+".pdb", 'r')
    counterB =0;
    counterA =0;
    for line1 in filepdb.readlines():
        if line1.startswith('ATOM') or line1.startswith('HETATM'):
            #lig =chain[0:1].strip()
            #rec=chain[2:3].strip()
            chn = line1[21:22].strip()
            #print (chn)
            file1 = open(id+ "_"+ chain1 + ".pdb",'a+')
            file2 = open(id+ "_"+ chain2 + ".pdb",'a+')
            if chn == chain1:
                chA.append(chn);
                idA.append(id);
                #print (chA)
                print (chA + " " + idA)
                # ignore for first tym,
                # if id=
                # this is the place
                #if value in chA and value of ID are repeating
                # ignore
                #else Write
                #print (line1)
                x = line1
                file1.write(x)
            if chn == chain2:
                chB.append(chn);
                idB.append(id);
                print (idB)
                #print (chB)
                #print (line1)
                file2.write(line1)
            counterA=counterA+1;
            counterB=counterB+1;
    file11 = open(id+ "_"+ chain1 + ".pdb",'r')
    file22 = open(id+ "_"+ chain2 + ".pdb",'r')
    file_com = open(id+"_"+chain1+"_"+chain2+".pdb", 'w+')
    for abc in file11:
        file_com.write(abc)
        #print(abc)
    for xyz in file22:
        file_com.write(xyz)
        #print(xyz)
    print('program end')
    file_com.close()
    file11.close()
    file22.close()
    filepdb.close()
sample.close()
ASKER
The format of D:/dataset/transient.txt is this: {ID, ChainA, ChainB}
1bj1 H:V
1ib1 B:E
1bj1 L:W
1efx C:D
1qfu A:L
1qfw A:M
1qkz A:L
1i4d B:D
1is8 C:M
1is8 B:L
1is8 E:O
1is8 D:N
1is8 A:K
1rlb A:E
Now, if {ID, ChainA, ChainB} is repeated, e.g.
1rlb A:E
1rlb A:C
Here the ID is the same and ChainA is repeated.
Therefore, the repeated chain should not be written again to file1 and file2 in the respective cases:
x = line1
file1.write(x)
Is the order of items in the transient.txt file important? Is the first non-repetitive item better than some of the repeated items?
ASKER
The order of items is important (we need to maintain it), and non-repetitive items are better.
Let me reformulate the problem to see if I understand it well. You have some prescription in the transient.txt that contains some lines. Each line contains a "well known" identifier (like 1bj1) which describes a database file (text form) like:
HEADER COMPLEX (ANTIBODY/ANTIGEN) 30-JUN-98 1BJ1
TITLE VASCULAR ENDOTHELIAL GROWTH FACTOR IN COMPLEX WITH A
TITLE 2 NEUTRALIZING ANTIBODY
...
You are interested in lines starting with ATOM or HETATM where the character at position 22 (counted from 1, or 21 counted from 0) contains a letter that is somehow interesting for you: it determines the chain. (I know nothing about this kind of data, so you should check whether it makes sense ;)
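A minimal sketch of that column check, written for Python 3. The helper name chain_of and the padded demo line are made up for illustration; the demo line is a placeholder, not real PDB data.

```python
def chain_of(line):
    """Return the chain letter of an ATOM/HETATM line, or None otherwise.

    The chain identifier sits at index 21 (column 22 when counting from 1).
    """
    if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
        return line[21]
    return None


# Placeholder line: "ATOM" padded to 21 characters, then the chain letter.
demo_line = "ATOM".ljust(21) + "H"
print(chain_of(demo_line))
print(chain_of("HEADER    COMPLEX"))
```

The first call prints H and the second prints None, because only ATOM and HETATM lines carry a chain letter that matters here.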
I can only guess that there is some reason for writing the transient.txt in the form like:
...
1rlb A:E
...
1rlb A:C
...
If I understand it well, the above two lines mean that you want to extract the interesting lines (ATOM or HETATM) from the 1rlb database file to separate chain files 1rlb_A.pdb, 1rlb_E.pdb, and 1rlb_C.pdb. You want to avoid generating the 1rlb_A.pdb a second time, as it would produce the same content.
From your code above, I am not sure whether you want to generate the 1rlb_C.pdb. It seems to me that you are not writing the C-lines anywhere (but I may have overlooked something).
Then you write the chains back to a file like 1rlb_A_E.pdb (not sure again whether you wanted to add the C, thus getting 1rlb_A_E_C.pdb or possibly 1rlb_A_C_E.pdb).
In the last case, it looks as if you wanted to sort the 1rlb-lines by the column that marks the chain.
Can you comment on that?
I am a bit tired. I guess that you guess what I wanted to write :)
ASKER
Dear Pepr,
Your reformulation of the problem and your guess are very close, and in fact right.
I simply want to avoid writing content that would be duplicated if the ID and chain combination is repeated.
I really appreciate your efforts. Let us find the solution: I think some data structure like a dictionary or a set
can be used to keep track of the ID and chain combinations, and before writing to the respective files the code must check in that set/dict whether it has already written that combination.
ASKER
Dear
My real issue is also Python itself. I am very new to it and I find it hard to write the code.
How big is the transient.txt usually? When it contains the
1rlb A:E
...
1rlb A:C
What should the result be? Would reversing the order in transient.txt make a difference? If the items in transient.txt must be ordered for the same ID, the items with different IDs probably need not be ordered. That leads to a dictionary with the ID as the key. The value of the dictionary item can then be a single A:E, or the list [A:E, A:C] (symbolically), or a set of the couples, or even only the set {A, E, C}. The question is whether you want to combine all three chain lines together or not. Your first approach also changes the order of lines in the reduced .pdb file. You may prefer not to change their order and only leave out the lines that you do not want.
If the transient.txt is small enough, it should be loaded into the dictionary, with the chain information inserted into a set value. Then the ID.pdb file should be opened and read only once, and the output file should be generated on the fly in one pass.
Does it make sense?
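The idea above could be sketched roughly as follows, for Python 3.7+ (where a plain dict keeps insertion order). The function names load_chains and select_lines are made up for this illustration, and the sketch stores the chains in a list rather than a set so that their first-seen order is kept; a set is only built for the membership test. It assumes every line of transient.txt looks like "1rlb A:E". The demo lines are padded placeholders, not real PDB records.

```python
def load_chains(lines):
    """Map each ID to its chains, in first-seen order, without repeats."""
    chains_by_id = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        pdb_id, pair = parts[0], parts[1]
        chains = chains_by_id.setdefault(pdb_id, [])
        for chain in pair.split(":"):
            if chain not in chains:  # a repeated (ID, chain) is ignored
                chains.append(chain)
    return chains_by_id


def select_lines(pdb_lines, chains):
    """Keep ATOM/HETATM lines of the wanted chains, in their file order."""
    wanted = set(chains)
    return [ln for ln in pdb_lines
            if ln.startswith(("ATOM", "HETATM")) and ln[21:22] in wanted]


transient = ["1rlb A:E", "1is8 C:M", "1rlb A:C"]
chains_by_id = load_chains(transient)
print(chains_by_id)

# Placeholder .pdb lines: record name padded to 21 chars, then the chain.
fake_pdb = ["ATOM".ljust(21) + c + "\n" for c in "AEXC"]
kept = select_lines(fake_pdb, chains_by_id["1rlb"])
print(len(kept))
```

With this layout each ID.pdb is read once, every line of a wanted chain is written once, and the original line order of the .pdb file is not changed. Whether the C lines belong in the combined output is still the open question from above.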
ASKER
Not a complete solution; any good work is appreciated.
Well, I was expecting more information from you to know what is inside your head ;) Anyway, thanks for the points.
Did you solve the problem? Or do you want to continue? (I don't mind that the question is closed, you can still continue. ;)
A note on reading the lines from the file: remove the .readlines() and iterate over the file object itself, i.e. just "for line1 in filepdb:".
File objects are ready for iteration. The readlines() call reads the whole file into memory (into a list) and then you iterate through that list. You usually do not want to do that.
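A small sketch of the difference, for Python 3. The function name count_atom_lines and the throwaway file contents are invented for the demonstration.

```python
import os
import tempfile


def count_atom_lines(path):
    """Count ATOM/HETATM lines by iterating over the file object directly."""
    count = 0
    with open(path, "r") as pdb_file:
        for line in pdb_file:  # one line at a time, no .readlines()
            if line.startswith(("ATOM", "HETATM")):
                count += 1
    return count


# Throwaway demo file with placeholder content.
with tempfile.NamedTemporaryFile("w", suffix=".pdb", delete=False) as tmp:
    tmp.write("HEADER demo\nATOM  1\nHETATM 2\nEND\n")
print(count_atom_lines(tmp.name))
os.remove(tmp.name)
```

The with statement also closes the file for you, so the separate close() calls at the end of the script are no longer needed.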
It would also be good to show a big picture of the problem. There can be a different solution to your problem.