Solved

merge textfiles in python

Posted on 2009-04-27
Last Modified: 2012-06-21
Hi,

I want to merge two files into one.

file1.
genename expression
SPC043.c  0.4
SPC45.c   0.12
etc....

file2
genename geneid
SPC043.c  gid45
SPC45.c    gid32
etc...

Create file:
genename geneid expression
SPC043.c  gid45   0.4
SPC45.c    gid32   0.12

What's the fastest and most efficient way to do that in Python?


Question by:dfernan

Accepted Solution

by:
ghostdog74 earned 640 total points
ID: 24247920
You can use a dictionary. NB: dictionaries are not sorted.


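h={}
# Map each genename in file1 to its expression value.
for line in open("file1"):
    line=line.strip().split()
    h[line[0]]=line[1]
# Print each file2 line followed by the matching expression.
for line in open("file2"):
    line=line.strip()
    l=line.strip().split()
    print line,h[l[0]]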


Assisted Solution

by:ghostdog74
ghostdog74 earned 640 total points
ID: 24247924

h={}
for line in open("file1"):
    line=line.strip().split()
    h[line[0]]=line[1]
for line in open("file2"):
    line=line.strip()
    l=line.split()
    print line,h[l[0]]


Assisted Solution

by:pepr
pepr earned 480 total points
ID: 24248282
Basically, ghostdog74's idea is correct. There are some minor issues: you should never use a lowercase L (l) as an identifier, because it is easily confused with 1 (one). An opened file should always be closed explicitly, so the open() should not appear directly in the for construct.

It is not clear whether the files always have the same number of corresponding lines (that is, whether they are generated by the same process). If so, the merge could be simpler and faster. Another question is whether the files are small or extremely big; if they are small, any kind of solution will do. A further question is whether the files start with a descriptive header line. And the last question is whether the values are tab-separated (this matters mainly for the output).

See the snippet below for how it could be done if the lines in the files are related.
f1 = open('file2')          # genename and geneid first
f2 = open('file1')          # expression after
fout = open('file3', 'w')
 
# The .readline() returns an empty string on EOF. It can
# even be called repeatedly without raising an exception.
# An existing line is never empty; it contains at least '\n'
# (except possibly the last line of a file with no final newline).
line1 = f1.readline().rstrip()
line2 = f2.readline().rstrip()
 
# Here we get the headings and combine them.
lst1 = line1.split() # the first two elements come from file2
lst2 = line2.split()
lst1.append(lst2[1]) # and the second element comes from file1
 
# You can also format the output string differently
format = '%-10s %-10s %-10s\n'
fout.write(format % tuple(lst1))
 
# Loop through the other lines that contain the values. Check
# the assumption about the same genename.
while True:
    
    # Read the next line from each file.
    # If both lines are empty, break the loop (i.e. finished).
    line1 = f1.readline()
    line2 = f2.readline()
    
    if line1 == ''  and line2 == '':
        break
        
    # Split the lines.  We assume that the lines are related.
    # There is no need to strip '\n'. The .split() will do it.
    lst1 = line1.split()
    lst2 = line2.split()
    assert lst1[0] == lst2[0] # raise exception if not true
    
    # Append the second element from lst2 to lst1 to get the 
    # list of values for one output line.
    lst1.append(lst2[1])
    
    # Output the formatted values.
    fout.write(format % tuple(lst1))
    
    
# Do not forget to close all files.
fout.close()
f1.close()
f2.close()


Assisted Solution

by:ghostdog74
ghostdog74 earned 640 total points
ID: 24249740
>> The opened file should always be closed explicitly. This way the open() should not appear in the for construct.
Using an implicit variable for the file object is perfectly fine. We already discussed this a while back.

Assisted Solution

by:pepr
pepr earned 480 total points
ID: 24249866
If I ever agreed with that, then I am sorry. It is not correct. Even the explicit .close() is not correct in all cases. See the documentation: http://docs.python.org/library/stdtypes.html#file.close
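
One way to read the point about .close(): if an exception is raised between open() and .close(), the close call is never reached unless it sits in a finally clause. The following is only a minimal sketch of that idea, written for Python 3 (the snippets in this thread are Python 2). The helper first_value() and the file name demo_file1.txt are made up for the illustration.

```python
import os
import tempfile

def first_value(path):
    """Return the second column of the first line (hypothetical helper)."""
    f = open(path)
    try:
        # Raises IndexError if the line has fewer than two columns.
        return f.readline().split()[1]
    finally:
        # Runs on a normal return and when an exception propagates.
        f.close()

# Build a small throwaway input file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo_file1.txt')
out = open(path, 'w')
try:
    out.write('SPC043.c\t0.4\n')
finally:
    out.close()

value = first_value(path)
```

Without the try/finally, an IndexError inside first_value() would leave the file open until the garbage collector or the end of the process cleaned it up.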

Assisted Solution

by:ghostdog74
ghostdog74 earned 640 total points
ID: 24251275
>>  It is not correct
I would still use it this way if all I want is to iterate over the files, so I say it's correct.

Assisted Solution

by:pepr
pepr earned 480 total points
ID: 24251991
ghostdog74: You may be used to doing that. Still, that does not mean it is correct. At the very least, you should keep doing it secretly, and you should not advise others to do the same ;)

Assisted Solution

by:Roger Baklund
Roger Baklund earned 80 total points
ID: 24252297
A major point in this discussion is that Python will close the file automatically when the script exits. I agree it is good advice to always close the file explicitly, but nothing will break if you don't in short-running scripts like this. For longer-running scripts, for instance daemons/services, it may be a problem if you forget to close the file. Other processes can be denied access to the file, and you can leak file handles if you open multiple files without closing them. Data loss is a possibility in such cases.

Author Comment

by:dfernan
ID: 24253295
Hi... well, I am not going to take any part in this discussion. Thanks a lot for the help anyway...

I tried

h={}
for line in open("file1"):
    line=line.strip().split()
    h[line[0]]=line[1]

but I have the problem that it is only reading the first line of the file, and that's all it is storing in h... Why is it not reading the whole file?

Author Comment

by:dfernan
ID: 24253334
Oh, by the way, the txt is a tab-delimited file (that's what I think...).

Author Comment

by:dfernan
ID: 24253440
Well, it's actually doing the following: it reads the first line and stores it in the h dictionary, and then it reads the whole text and stores the whole text in line...

Author Comment

by:dfernan
ID: 24253602
OK. Now I am using pepr's solution. I am just checking the reading part and it is actually working, but it's storing the whole file in a variable... If the files are too large, isn't that consuming a lot of memory? How can I do it without storing the entire file in a list? Is there any way to do that?

Assisted Solution

by:pepr
pepr earned 480 total points
ID: 24255608
The http:#24248282 solution does not read the whole file into memory. It reads both files line by line and holds only one line from each file at a time (.readline() reads one line from a text file; .readlines() returns the list of all lines).
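
To illustrate the line-by-line point, here is a sketch in Python 3 (the thread itself uses Python 2) that walks two files in step without loading either of them fully. Iterating over a file object yields one line at a time, and zip() pairs the lines lazily. The function name merge_in_step and the temporary files are made up for the example, and it assumes both files list the same genenames in the same order with no blank lines.

```python
import os
import tempfile

def merge_in_step(ids_path, expr_path, out_path):
    """Merge two files whose lines are in the same order (hypothetical helper)."""
    with open(ids_path) as ids, open(expr_path) as expr, \
            open(out_path, 'w') as out:
        # zip() pulls one line from each file per iteration and stops at the
        # end of the shorter file.
        for id_line, expr_line in zip(ids, expr):
            id_fields = id_line.split()
            expr_fields = expr_line.split()
            if id_fields[0] != expr_fields[0]:
                raise ValueError('lines out of step: %r vs %r'
                                 % (id_fields[0], expr_fields[0]))
            out.write('\t'.join(id_fields + expr_fields[1:]) + '\n')

# Throwaway copies of the sample data from the question.
tmp = tempfile.mkdtemp()
ids_path = os.path.join(tmp, 'file2')
expr_path = os.path.join(tmp, 'file1')
out_path = os.path.join(tmp, 'file3')
with open(ids_path, 'w') as f:
    f.write('genename\tgeneid\nSPC043.c\tgid45\nSPC45.c\tgid32\n')
with open(expr_path, 'w') as f:
    f.write('genename\texpression\nSPC043.c\t0.4\nSPC45.c\t0.12\n')

merge_in_step(ids_path, expr_path, out_path)
with open(out_path) as f:
    merged = f.read().splitlines()
```

At any moment only the current pair of lines is held in memory, so the file size does not matter for this variant.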

The problem with that solution is that the two files must have their lines in the same order. The ghostdog74 solution is more general and works when that assumption does not hold. The first file is read into memory, into a dictionary where the key is the genename and the value is the expression. The order of lines is lost but the content is captured.

The order of the output lines is determined by file2. That file is read line by line, giving the genename and the geneid. Using the genename as the key into the dictionary filled above, you get the expression. All three parts are put together and sent to the output.
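
The dictionary approach described above could be sketched as follows in Python 3 (the thread's own snippets are Python 2). This is not ghostdog74's code: the function name merge_by_name, the 'NA' placeholder for a genename that is missing from file1, and the extra SPC99.c row are all assumptions added for the illustration.

```python
import os
import tempfile

def merge_by_name(expr_path, ids_path, out_path):
    """Join geneid and expression on genename (hypothetical helper)."""
    expression = {}
    with open(expr_path) as expr:
        for line in expr:
            fields = line.split()
            if len(fields) >= 2:              # skip blank or short lines
                expression[fields[0]] = fields[1]
    with open(ids_path) as ids, open(out_path, 'w') as out:
        for line in ids:
            fields = line.split()
            if len(fields) < 2:
                continue
            # Assumption: a genename missing from file1 gets 'NA'.
            value = expression.get(fields[0], 'NA')
            out.write('%s\t%s\t%s\n' % (fields[0], fields[1], value))

tmp = tempfile.mkdtemp()
expr_path = os.path.join(tmp, 'file1')
ids_path = os.path.join(tmp, 'file2')
out_path = os.path.join(tmp, 'file3')
with open(expr_path, 'w') as f:
    # Deliberately in a different order than file2.
    f.write('genename\texpression\nSPC45.c\t0.12\nSPC043.c\t0.4\n')
with open(ids_path, 'w') as f:
    # SPC99.c is a made-up row with no match in file1.
    f.write('genename\tgeneid\nSPC043.c\tgid45\nSPC45.c\tgid32\nSPC99.c\tgid7\n')

merge_by_name(expr_path, ids_path, out_path)
with open(out_path) as f:
    merged = f.read().splitlines()
```

Only file1 is held in memory; file2 and the output are streamed. The header line needs no special handling here because 'genename' is simply treated as one more key.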

As for the tab delimiter: .split() handles it nicely. The output format (line 18 of the snippet) should be changed to

format = '%s\t%s\t%s\n'

Or you can use the csv module to both read and write the files, using csv.reader() and csv.writer(). However, it is probably overkill for this purpose.
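
For reference, a small sketch of what the csv route might look like for tab-delimited data, in Python 3. The file names are throwaway temporary files invented for the example; delimiter and lineterminator are standard csv formatting parameters.

```python
import csv
import os
import tempfile

tmp = tempfile.mkdtemp()
in_path = os.path.join(tmp, 'file2')
out_path = os.path.join(tmp, 'file2_copy')
with open(in_path, 'w', newline='') as f:
    f.write('genename\tgeneid\nSPC043.c\tgid45\n')

# The csv documentation recommends newline='' for files handed to
# csv.reader() and csv.writer().
with open(in_path, newline='') as src, \
        open(out_path, 'w', newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst, delimiter='\t', lineterminator='\n')
    rows = []
    for row in reader:          # each row is a list of column strings
        rows.append(row)
        writer.writerow(row)

with open(out_path, newline='') as f:
    copied = f.read()
```

Unlike .split(), the reader splits only on the tab character, so a field that contains spaces stays in one piece.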

As for memory consumption, just look at the file size and think about how much memory your computer has. A one-gigabyte file is usually considered quite huge. One gigabyte of free memory is not very unusual these days. It depends.

Assisted Solution

by:pepr
pepr earned 480 total points
ID: 24255628
Instead of the modified format, you can use str.join() to put the elements together, like this:

    # Output the formatted values.
    fout.write('\t'.join(lst1) + '\n')

Assisted Solution

by:ghostdog74
ghostdog74 earned 640 total points
ID: 24256313
>>  it does not mean that it is correct.
Well, if it's not correct, then maybe you should suggest that the Python authors modify the interpreter not to accept such syntax? I will still suggest that people use this method, sorry to disappoint you :)

Assisted Solution

by:pepr
pepr earned 480 total points
ID: 24257853
I am not disappointed. You are right, it is correct syntactically. However, it is not correct from other points of view. Files are not a part of Python; they belong to the operating system, and they have to be closed. It is simply a fact that some OS kernel operations must be used in pairs (the open/close kind). When you forget to do the second of the pair, your program may be buggy.

This kind of problem cannot be solved syntactically.

Often, the runtime releases the system resources and performs the close operations when the process ends. Open files are also released at that time, as crx pointed out. This means that it may be fine for a quick hack. But once you take the code and use it in a bigger project, you risk problems. So it is a good idea to get used to doing things more correctly.

If you want, there is syntactic support for solving the problem; see http://docs.python.org/reference/compound_stmts.html#the-with-statement.
    with open('file1') as f:
         do_the_commands
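
A slightly fuller sketch of the with statement, in Python 3, showing that the file is closed even when the body raises an exception. The temporary file and the deliberately failing float() conversion are set up only for this demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'file1')
with open(path, 'w') as f:
    f.write('genename\texpression\nSPC043.c\t0.4\n')

h = {}
try:
    with open(path) as f:
        for line in f:
            fields = line.split()
            # float('expression') fails on the header line on purpose.
            h[fields[0]] = float(fields[1])
except ValueError:
    pass

# The with block closed the file although the loop was interrupted.
closed_after_error = f.closed
```

The same guarantee holds on a normal exit from the block, so no explicit .close() call is needed.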
