Merge text files in Python

Hi,

I want to merge two files into one.

file1.
genename expression
SPC043.c  0.4
SPC45.c   0.12
etc....

file2
genename geneid
SPC043.c  gid45
SPC45.c    gid32
etc...

Create file:
genename geneid expression
SPC043.c  gid45   0.4
SPC45.c    gid32   0.12

What's the fastest and most efficient way to do that in Python?


Asked by: dfernan
 
ghostdog74 commented:
You can use a dictionary. NB: dictionaries are not sorted.


h={}
for line in open("file1"):
    line=line.strip().split()
    h[line[0]]=line[1]
for line in open("file2"):
    line=line.strip()
    l=line.strip().split()
    print line,h[l[0]]

 
ghostdog74 commented:

h={}
for line in open("file1"):
    line=line.strip().split()
    h[line[0]]=line[1]
for line in open("file2"):
    line=line.strip()
    l=line.split()
    print line,h[l[0]]

 
pepr commented:
Basically, ghostdog74's idea is correct. There are some minor issues: you should never use a lowercase L as an identifier, because it is easily confused with 1 (one). An opened file should always be closed explicitly, so the open() call should not appear directly in the for construct.

It is not clear whether the two files always have the same number of related lines in the same order (for example, because they were generated by the same process). If they do, the merge can be simpler and faster. Another question is whether the files are small or extremely big; if they are small, any kind of solution will do. Another question is whether the files start with a descriptive header line. And the last question is whether the values are tab-separated (this matters mainly for the output).

See the snippet below for how it could be done if the lines in the files are related.
f1 = open('file2')          # genename and geneid come first
f2 = open('file1')          # expression is appended after them
fout = open('file3', 'w')

# At EOF, .readline() returns an empty string. It can be
# called repeatedly after that without raising an exception.
# A line that exists is never empty: it contains at least '\n'
# (only the very last line of a file may lack the trailing '\n').
line1 = f1.readline().rstrip()
line2 = f2.readline().rstrip()

# Here we get the headings and combine them.
lst1 = line1.split() # the two heading elements from file2 (read via f1)
lst2 = line2.split()
lst1.append(lst2[1]) # plus the second heading element from file1 (read via f2)
 
# You can also format the output string differently
format = '%-10s %-10s %-10s\n'
fout.write(format % tuple(lst1))
 
# Loop through the other lines that contain the values. Check
# the assumption about the same genename.
while True:
    
    # Split both lines.
    # If both lines are empty, break the loop (i.e. finished).
    line1 = f1.readline()
    line2 = f2.readline()
    
    if line1 == ''  and line2 == '':
        break
        
    # Split the lines.  We assume that the lines are related.
    # There is no need to strip '\n'. The .split() will do it.
    lst1 = line1.split()
    lst2 = line2.split()
    assert lst1[0] == lst2[0] # raise exception if not true
    
    # Append the second element from lst2 to lst1 to get the 
    # list of values for one output line.
    lst1.append(lst2[1])
    
    # Output the formatted values.
    fout.write(format % tuple(lst1))
    
    
# Do not forget to close all files.
fout.close()
f1.close()
f2.close()


 
ghostdog74 commented:
>> The opened file should always be closed explicitly. This way the open() should not appear in the for construct.
Using an implicit variable for the file object is perfectly fine. We already discussed this way back.
 
pepr commented:
If I ever agreed with that, then sorry. It is not correct. Even the explicit .close() is not correct in all cases. See the docs: http://docs.python.org/library/stdtypes.html#file.close
 
ghostdog74 commented:
>>  It is not correct
I would still use it this way if all I want is to iterate over the files, so I say it's correct.
 
pepr commented:
ghostdog74: You may be used to doing that. Still, it does not mean that it is correct. At the very least, you should keep doing it secretly, and you should not advise others to do the same ;)
 
Roger Baklund commented:
A major point in this discussion is that Python will close the file automatically when the script exits. I agree it is good advice to always close the file explicitly, but nothing will break if you don't in short-running scripts like this. For longer-running scripts, for instance daemons/services, it may be a problem if you forget to close the file. Other processes can be denied access to the file, and you can leak file handles if you open multiple files without closing them. Data loss is a possibility in such cases.
 
dfernan (author) commented:
Hi... well, I am not going to take part in this discussion. Thanks a lot for the help anyway...

I tried

h={}
for line in open("file1"):
    line=line.strip().split()
    h[line[0]]=line[1]

but I have the problem that it is only reading the first line of the file, and that's all it's storing in h... Why is it not reading the whole file?
 
dfernan (author) commented:
Oh, by the way, the txt is a tab-delimited file (that's what I think...)
 
dfernan (author) commented:
Well, it's actually doing the following: it reads the first line and stores it in the h dictionary, and then it reads the whole text and stores the whole text in line...
 
dfernan (author) commented:
OK. Now I am using pepr's solution. I am just checking the reading part and it is actually working, but it's storing the whole file in a variable... If the files are too large, isn't that consuming a lot of memory? How can I do it without storing the entire file in a list? Is there any way to do that?
 
pepr commented:
The http:#24248282 solution does not read the file into memory. It reads both files line by line and holds only the current line from each file (.readline() reads one line from a text file, while .readlines() returns the list of all lines).
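
To make the difference concrete, here is a minimal sketch in Python 3 syntax (the snippets in this thread are Python 2). The helper name count_lines_streaming and the in-memory sample data are made up for illustration; io.StringIO only stands in for a real file so the example is self-contained.

```python
import io

def count_lines_streaming(handle):
    """Count lines while keeping only one line in memory at a time."""
    count = 0
    for _line in handle:      # a file object yields one line per iteration
        count += 1
    return count

# Hypothetical sample data standing in for file1.
sample = io.StringIO("genename\texpression\nSPC043.c\t0.4\nSPC45.c\t0.12\n")

first = sample.readline()     # one line, including its trailing '\n'
rest = sample.readlines()     # ALL remaining lines at once, as a list
```

Only the .readlines() call grows with the size of the file; the for loop and .readline() hold a single line at a time.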

The problem with that solution is that the two files must have their lines ordered the same way. The ghostdog74 solution is more general and works when that assumption does not hold. The first file is read into memory -- into a dictionary where the key is the genename and the value is the expression. The order of lines is lost but the content is captured.

The order of the output lines is prescribed by file2. That one is read line by line and you get genename and the geneid. Using the genename as the key to the above filled dictionary, you get the expression. All the three parts are put together and sent to the output.
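
The dictionary approach described above could be sketched like this in Python 3 syntax. The function names merge_by_key and merge_files are hypothetical, and the "NA" placeholder for a genename that is missing from the first file is an assumption of this sketch, not something requested in the question.

```python
def merge_by_key(expression_lines, geneid_lines):
    """Yield (genename, geneid, expression) in the order of geneid_lines."""
    expression = {}
    for line in expression_lines:          # file1: genename expression
        fields = line.split()
        if len(fields) >= 2:               # skip blank or malformed lines
            expression[fields[0]] = fields[1]

    for line in geneid_lines:              # file2: genename geneid
        fields = line.split()
        if len(fields) < 2:
            continue
        name, geneid = fields[0], fields[1]
        # .get() avoids a KeyError when file1 has no entry for this gene;
        # "NA" is an assumed placeholder.
        yield name, geneid, expression.get(name, "NA")

def merge_files(path1, path2, out_path):
    """Write the merged, tab-separated result to out_path."""
    with open(path1) as f1, open(path2) as f2, open(out_path, "w") as out:
        for row in merge_by_key(f1, f2):
            out.write("\t".join(row) + "\n")
```

Only the first file is held in memory (as the dictionary); the second file and the output are processed line by line. A call such as merge_files("file1", "file2", "file3") would use the file names from the question.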

For the tab delimiter: it works nicely for the .split(). The output format should be modified (line 18) to

format = '%s\t%s\t%s\n'

Or you can use the csv module to both read and write the files using csv.reader() and csv.writer(). However, it is probably overkill for this purpose.
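
For comparison, a sketch of what the standard-library csv variant might look like for tab-delimited data, in Python 3 syntax. The function name merge_tsv is hypothetical. The sketch assumes every genename in the second file also appears in the first (otherwise the dictionary lookup raises KeyError).

```python
import csv
import io

def merge_tsv(expr_handle, id_handle, out_handle):
    """Merge two tab-delimited inputs on their first column."""
    expr_rows = csv.reader(expr_handle, delimiter="\t")
    expression = {row[0]: row[1] for row in expr_rows if len(row) >= 2}

    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    for row in csv.reader(id_handle, delimiter="\t"):
        if len(row) >= 2:
            # Assumes the genename exists in the first input.
            writer.writerow([row[0], row[1], expression[row[0]]])

# Hypothetical in-memory data standing in for the two files.
out = io.StringIO()
merge_tsv(
    io.StringIO("genename\texpression\nSPC043.c\t0.4\n"),
    io.StringIO("genename\tgeneid\nSPC043.c\tgid45\n"),
    out,
)
```

With real files, the csv documentation recommends opening them with newline="" so the module can handle line endings itself.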

For the memory consumption, just have a look at the file size and think about the memory size of your computer. One gigabyte file is usually considered quite huge. One gigabyte of free memory is not very unusual these days. It depends.
 
pepr commented:
Instead of the modified format, you can use str.join() to put the elements together, like this:

    # Output the formatted values.
    fout.write('\t'.join(lst1) + '\n')
 
ghostdog74 commented:
>>  it does not mean that it is correct.
Well, if it's not correct, then maybe you should suggest that the Python authors modify the interpreter not to accept such syntax? I will still suggest that people use this method, sorry to disappoint you :)
 
pepr commented:
I am not disappointed. You are right, it is correct syntactically. However, it is not correct from other points of view. Files are not part of Python. They belong to the operating system, and they have to be closed. It is simply a fact that some OS kernel operations must be used in pairs (the open/close kind). When you forget to do the second of the pair, your program may be buggy.

This kind of problem cannot be solved syntactically.

Often, the runtime releases system resources and performs the close operations when the process ends. Open files are also released at that time -- as crx pointed out. This means that it may be fine for a quick hack. But once you take the source and use it in a bigger project, you risk problems. So it is a good idea to get used to doing things more correctly.

If you want, you have the syntactic support for the solution of the problem -- see http://docs.python.org/reference/compound_stmts.html#the-with-statement.
    with open('file1') as f:
         do_the_commands
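
As a fuller sketch of the with statement applied to this question (Python 3 syntax), the line-by-line merge of two files with identically ordered lines might look as follows. The names merge_in_step and merge_files_in_step are hypothetical. The sketch assumes no blank lines, and zip() stops silently at the end of the shorter input.

```python
def merge_in_step(id_lines, expr_lines):
    """Yield merged, tab-separated lines from two inputs in the same order."""
    for id_line, expr_line in zip(id_lines, expr_lines):
        id_fields = id_line.split()        # genename, geneid
        expr_fields = expr_line.split()    # genename, expression
        if id_fields[0] != expr_fields[0]:
            raise ValueError(
                "lines out of step: %r vs %r" % (id_fields[0], expr_fields[0])
            )
        yield "\t".join(id_fields + [expr_fields[1]]) + "\n"

def merge_files_in_step(id_path, expr_path, out_path):
    # All three files are closed when the block is left,
    # even if merge_in_step() raises an exception.
    with open(id_path) as ids, open(expr_path) as exprs, \
            open(out_path, "w") as out:
        out.writelines(merge_in_step(ids, exprs))
```

Here only one line from each input is held in memory at a time, and no explicit .close() calls are needed.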

Question has a verified solution.
