gelonida
asked on
find all lines existing in file1 but not in file2 (for very huge files)
In fact these are really three questions:
1.) Is grep -f memory friendly? (i.e. could I avoid writing any Python code?)
2.) Does somebody already have a working solution for the problem below?
3.) Could you make suggestions or review my code as soon as it starts working?
I'm trying to find a solution that works for really huge files (i.e. one that will not use more RAM as the files grow). The problem: identify all lines that exist in the first file but not in the second one. In neither file can the same line occur twice.
Here a small example:
file1 (f1.txt)
aba
abb
abd
abf
one
two
three
four
five
six
seven
file2 (f2.txt)
abd
abe
eleven
four
eight
one
six
The expected result (order is not relevant)
aba
abb
abf
five
seven
three
two
I already have three approaches and an idea for a fourth one:
1.) use grep -v -f f2.txt f1.txt, but I know nothing at all about its memory consumption
2.) both files are read into memory (so not suitable for huge files)
3.) the first file is streamed while the second file is read into memory (so not suitable if the second file is huge)
4.) I'm working on it; at the moment the solution does not work, as I didn't finish it. The idea is explained below.
I have an idea of how to implement a solution, but wanted to know whether anybody has already written such code or knows of a standard solution.
My idea is to sort both files (the default Linux sort is quite smart and works even if the files don't fit into RAM; I think it uses temporary files and performs a merge sort) and then use code that reads both files chunk-wise and compares them.
As soon as I have implemented the solution I will post it and ask for advice on how to optimize further.
Here is the code to test the three solutions:
def run_tst(func, fname1="f1.txt", fname2="f2.txt"):
    """
    perform the test with one of the three functions and
    show the first 20 results
    """
    print("result for %s with %s and %s:" % (func, fname1, fname2))
    with open("notinf2.txt", "w") as fout:
        for flag, line in func(fname1, fname2):
            if not flag:  # line is not a member of fname2
                fout.write(line)
    # display the first few lines of the result
    with open("notinf2.txt") as fin:
        for linecount, line in enumerate(fin):
            print(line.strip())
            if linecount == 20:
                break

import small_data
import only_one_huge_file
import both_files_huge

run_tst(small_data.check_membership_small)
run_tst(only_one_huge_file.check_membership_one_huge_file)
run_tst(both_files_huge.check_membership_two_huge)
Here is the code for the 'all in RAM' solution (small_data.py):
def check_membership_small(fname1, fname2):
    with open(fname1) as fin1, open(fname2) as fin2:
        lines1 = set(fin1)
        lines2 = set(fin2)
    result = lines1 - lines2
    # sorting is just there to make results easier to compare
    result = sorted(result)
    for line in result:
        yield False, line
Here is a solution where the second file is read into memory but the first isn't (only_one_huge_file.py):
def check_membership_one_huge_file(fname1, fname2):
    with open(fname1) as fin1, open(fname2) as fin2:
        lines2 = set(fin2)
        for line in fin1:
            yield line in lines2, line
Here is the beginning of my attempt to read only chunks of both files into RAM (as the test data is short, the chunk size is set to two lines) (both_files_huge.py):
import subprocess

def check_membership_two_huge(fname1, fname2, chunk_size=2):
    subprocess.call("sort < '%s' > '%s'" % (fname1, fname1 + ".sorted"), shell=True)
    subprocess.call("sort < '%s' > '%s'" % (fname2, fname2 + ".sorted"), shell=True)
    with open(fname1 + ".sorted") as fin1, open(fname2 + ".sorted") as fin2:
        chunk1 = []
        chunk2 = []
        all_read1 = False
        all_read2 = False
        while True:
            # read a 'reasonable' chunk into chunk1
            if not all_read1 and len(chunk1) < chunk_size / 2:
                for i in range(chunk_size):
                    try:
                        chunk1.append(next(fin1))
                    except StopIteration:
                        all_read1 = True
            # read a 'reasonable' chunk into chunk2
            if not all_read2 and len(chunk2) < chunk_size / 2:
                for i in range(chunk_size):
                    try:
                        chunk2.append(next(fin2))
                    except StopIteration:
                        all_read2 = True
            # do some smart stuff here; at the moment the next four
            # lines are just a placeholder for the real code
            if chunk1:
                yield False, chunk1[0]
            chunk1 = []
            chunk2 = []
            if all_read1 and all_read2:
                break
Potentially, you can insert each unique line of the two files into two database tables. Then you can do a left-join query and let the database engine do the heavy lifting of the comparison.
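A minimal sketch of this left-join idea with Python's built-in sqlite3 module (the table names, column layout, and database file name are my own choices for illustration, not part of the suggestion):

```python
import sqlite3

def lines_only_in_first(fname1, fname2, dbname="lines.db"):
    # an on-disk database keeps the RAM footprint bounded;
    # ":memory:" would defeat the purpose for huge files
    con = sqlite3.connect(dbname)
    con.execute("DROP TABLE IF EXISTS t1")
    con.execute("DROP TABLE IF EXISTS t2")
    con.execute("CREATE TABLE t1 (line TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE t2 (line TEXT PRIMARY KEY)")
    with open(fname1) as f1:
        con.executemany("INSERT OR IGNORE INTO t1 VALUES (?)", ((l,) for l in f1))
    with open(fname2) as f2:
        con.executemany("INSERT OR IGNORE INTO t2 VALUES (?)", ((l,) for l in f2))
    con.commit()
    # left join: t1 rows without a matching t2 row come back with t2.line NULL
    cur = con.execute(
        "SELECT t1.line FROM t1 LEFT JOIN t2 ON t1.line = t2.line"
        " WHERE t2.line IS NULL")
    for (line,) in cur:
        yield line
    con.close()
```

The PRIMARY KEY together with INSERT OR IGNORE also enforces the "no line occurs twice" assumption from the question.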
You can use a set instead of a dictionary. It is a bit faster than a dictionary and is actually better suited to the member-of test this algorithm needs.
One option is:
sort and uniq can be used to determine whether either file contains duplicates, and which they are:
sort -f file1.txt | uniq -c | sort -rn | awk '($1 > 1) {print $2}'
Repeat for the other file. Then:
sort -fu file1.txt > sorted_file1.txt
Repeat for the other file.
comm -3 sorted_file1.txt sorted_file2.txt
will output the lines unique to file1 in the first column and those unique to file2 in the second column; see https://linux.die.net/man/1/comm. You can add -2 as an option to suppress the items unique to file2.
ASKER
@aikimark:
Let's assume the files are really huge and that even the hashes would not fit into RAM. However, with hashes I could probably deal with files that contain (in my context) at least 5 times more lines in the second file.
@aikimark:
I'm afraid creating the database will be quite slow, but true, this will not overflow RAM and will save me the sorting.
@aikimark:
Agreed, sets should be faster (as you see, I'm using sets in my all-in-RAM solution).
ASKER
I think I now have a working solution. Any suggestions for improvement?
import subprocess

def check_membership_two_huge(fname1, fname2):
    subprocess.call("sort < '%s' > '%s'" % (fname1, fname1 + ".sorted"), shell=True)
    subprocess.call("sort < '%s' > '%s'" % (fname2, fname2 + ".sorted"), shell=True)

    def nextline(fin, eof, line):
        """ fetch the next line; at EOF keep returning the last line """
        if eof:
            return True, line
        try:
            return False, next(fin)
        except StopIteration:
            return True, line

    with open(fname1 + ".sorted") as fin1, open(fname2 + ".sorted") as fin2:
        eof1, line1 = nextline(fin1, False, None)
        eof2, line2 = nextline(fin2, False, None)
        while not (eof1 and eof2):
            if eof2:
                # file2 is exhausted: the rest of file1 is unique to file1
                yield False, line1
                eof1, line1 = nextline(fin1, eof1, line1)
            elif eof1:
                # file1 is exhausted and its last line has already been
                # handled; without this branch the loop would never terminate
                break
            elif line1 == line2:
                yield True, line1
                eof1, line1 = nextline(fin1, eof1, line1)
                eof2, line2 = nextline(fin2, eof2, line2)
            elif line1 > line2:
                # line2 can no longer match anything: advance file2
                eof2, line2 = nextline(fin2, eof2, line2)
            else:
                # line1 < line2, so line1 cannot be in file2
                yield False, line1
                eof1, line1 = nextline(fin1, eof1, line1)
Finally I did not use chunking but a line-by-line algorithm.
Perhaps I could accelerate it with chunking (reading a few thousand lines in one go), but the code might be more difficult.
I guess that by not using a closure function but repeating the code I might accelerate it a little.
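One possible simplification (an illustrative sketch only, not a tested drop-in replacement): if a sentinel that sorts after every real line is appended to the second stream, all the eof bookkeeping disappears. The sketch assumes both inputs are already sorted and that no real line contains the character '\uffff':

```python
from itertools import chain

def sorted_diff(fin1, fin2):
    # both iterables must already be sorted, e.g. by an external sort
    sentinel = "\uffff"              # assumed to sort after every real line
    lines2 = chain(fin2, [sentinel])
    line2 = next(lines2)
    for line1 in fin1:
        while line2 < line1:         # skip file2 lines that can no longer match
            line2 = next(lines2)
        if line1 != line2:           # line1 is not in file2
            yield line1
```

The sentinel guarantees the inner while loop always terminates, so no eof flags are needed.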
ASKER
@arnold:
Yes, indeed comm -2 -3 f1.txt.sorted f2.txt.sorted yields the correct result.
I did not know about this command.
ASKER
Probably arnold's suggestion will perform best, though it will not work on Windows (and it is not Python, but in my context that is not essential).
Even the sorting of huge files will probably not work on Windows, unless Windows has a command-line sort with features similar to the Linux sort.
If you guys don't object, I'd like to keep the question open a little longer just to get a little more feedback on the Python code.
ASKER
@arnold:
Yes, I thought about grep -v -f but never stumbled across comm.
And yes, I could also check the DBM solution, but I think the solution I posted is algorithmically close to the one of comm and should thus be quite efficient.
Though DBM would save me the sorting time. I have to try it out.
Oh, in the above, since only file1 is important, you would have to use logic within to evaluate which file has more lines, and then set a marker to know which one to output.
On Windows, you can install Cygwin (cygwin.com), a 'Unix virtual' environment,
or
http://unxutils.sourceforge.net/ (which seemingly reached maturity, with no further updates).
Rather than showing us a "small example", please post a representative example of the lines in these two files.
How many lines are there in file1 and file2?
ASKER
@aikimark:
The contents are work-related (confidential).
The typical line length is about 80-140 (ASCII) characters (a few UTF-8 characters might show up, but haven't so far).
One file will contain about 20 million lines, the other about 4 million.
About 10% of the lines in file2 might not be in file1; all the others are.
The machines on which I have to run the code are not very well equipped and I need the RAM for other server tasks running on them. I might have about 500 MB to 1 GB of RAM available for the task, but I'd like to avoid at all costs that the machine starts swapping because of my script.
I need to perform this processing perhaps once per week in order to identify some server malfunction.
When the issue is fixed, the number of lines might be lower.
In fact, at the moment I am using the solution where only file2 is stored as a set in memory, but I wanted (out of curiosity, and prophylactically) to implement a less RAM-greedy solution.
For your situation you need only one file as a reference.
It would be inefficient to load both files into memory.
When RAM is the constraint, I would not care how long it takes.
So load the smaller of the files and reverse the logic, i.e. if file2 is the shorter one:
load it into memory or a DBM.
Now go line by line through file1:
if the entry does not exist, output it;
if the entry exists, delete it;
if the DBM is empty, output the rest, as it is unique to file1.
The logic can be simplified if your initial DB build uses an alphabetical tree.
If file1 is still on the A's but there is no more content under A, you can output all remaining A's from the file...
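A rough sketch of this delete-as-you-match idea with Python's dbm module (the database name, and using the line text itself as the key, are my own assumptions; the alphabetical-tree shortcut is not implemented here):

```python
import dbm

def diff_with_dbm(fname1, fname2, dbname="f2keys"):
    # build an on-disk key store from the (assumed smaller) second file
    with dbm.open(dbname, "n") as db:
        with open(fname2) as f2:
            for line in f2:
                db[line.encode("utf-8")] = b""
        # stream the first file; only the dbm lives on disk, not in RAM
        with open(fname1) as f1:
            for line in f1:
                key = line.encode("utf-8")
                if key in db:
                    del db[key]      # matched: shrink the store
                else:
                    yield line       # unique to file1
```

dbm.open picks whichever backend is installed (gdbm, ndbm, or the pure-Python dbm.dumb fallback), so performance will vary.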
Well, an MD5 hash is 16 bytes, so a set of file1 hashes has a 64 MB memory footprint and a set of file2 hashes would require a 320 MB memory footprint. Both of those are well under your 500 MB cap. Don't worry about sorting and merge logic.
And the memory footprint only exists for the length of time it takes to read, hash, and check (for set inclusion) each line of the other file. So, even if you had the larger (320 MB) set in memory, it wouldn't be for very long; perhaps a minute or two.
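The hash-set variant could look roughly like this (a sketch; it silently ignores the astronomically unlikely case of an MD5 collision, and reads in binary mode so the digests are byte-exact):

```python
import hashlib

def diff_by_hash(fname1, fname2):
    # one 16-byte digest per line of file2 instead of the full line text
    hashes2 = set()
    with open(fname2, "rb") as f2:
        for line in f2:
            hashes2.add(hashlib.md5(line).digest())
    # stream file1; memory use is bounded by the file2 digest set
    with open(fname1, "rb") as f1:
        for line in f1:
            if hashlib.md5(line).digest() not in hashes2:
                yield line
```

Note the yielded lines are bytes, since the files are opened in binary mode.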
>> could I avoid writing any python code
Yes:
sort f1.txt | uniq > sort_f1.txt
sort f2.txt | uniq > sort_f2.txt
comm -23 sort_f1.txt sort_f2.txt
Output:
aba
abb
abf
five
seven
three
two
Sorry Arnold, I just realized that my solution is very close to yours.
I was thinking about two slightly different solutions. One uses a set of file2 line hashes. That is the one I had in mind when I posted earlier comments.
Yesterday, I thought about a dictionary of file1 hashes, where the line number is the value for the (line content) hash key. This is the minimal memory configuration (~64+ MB). As you read file2 lines, you look for inclusion in the (file1) dictionary. When you find a match, you delete that item from the dictionary. After iterating this process for all lines in file2, the dictionary contains data about unmatched lines.
* convert the dictionary.values() into a set with the set() function for fastest lookup
* remove the dictionary to free up memory
* iterate the file1 lines again, when the line number is in the set, output the actual line text.
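A sketch of this two-pass variant (the function and variable names are my own; the dictionary maps each file1 line digest to its line number, and the second pass restores the actual line text):

```python
import hashlib

def diff_two_pass(fname1, fname2):
    # pass 1: digest of every file1 line -> its line number
    lineno_by_hash = {}
    with open(fname1, "rb") as f1:
        for lineno, line in enumerate(f1):
            lineno_by_hash[hashlib.md5(line).digest()] = lineno
    # delete every digest that also occurs in file2
    with open(fname2, "rb") as f2:
        for line in f2:
            lineno_by_hash.pop(hashlib.md5(line).digest(), None)
    # keep only the surviving line numbers and free the dictionary
    unmatched = set(lineno_by_hash.values())
    del lineno_by_hash
    # pass 2: emit the unmatched lines in file1 order
    with open(fname1, "rb") as f1:
        for lineno, line in enumerate(f1):
            if lineno in unmatched:
                yield line
```

The trade-off versus the single-pass set approach: file1 is read twice, but only digests and line numbers are ever held in memory.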
ASKER
Thanks a lot everybody:
All of the answers were very instructive and interesting. Finally I chose 2.) as the 'correct' answer, as it can be implemented in pure Python and will not increase RAM consumption even if the files grow bigger.
However, I still didn't have time to check whether the performance/runtime is reasonable (similar to or better than solution 1).
There are at least three correct answers:
1.) @arnold
sorting both files with an out-of-memory sort (e.g. Linux's sort), then
comm -2 -3 f1.txt.sorted f2.txt.sorted (or my Python version of it)
2.) @arnold
a version where the smaller file is stored in a database (e.g. DBM) or just in a Python shelve (which might use DBM)
3.) @aikimark
As aikimark pointed out, for not-so-extreme cases I might still live with an in-memory solution by using a set of hashes.
However, I cannot confirm the suggested memory consumption: Python has a huge overhead for many data structures.
I made a small test and it seems that an entry of 16 bytes in a set consumes about 100 bytes, not 16 as assumed.
#!/usr/bin/env python
import os
import random

import psutil

def get_rss():
    """ return the RSS of the current process """
    return psutil.Process(os.getpid()).memory_info().rss

def random_id(length=16):
    hexdigit = '0123456789abcdef'
    rid = ""
    for i in range(length):
        rid += random.choice(hexdigit)
    return rid

def main():
    bef = get_rss()
    print(bef, "memory used before set creation")
    s = set()
    N = 1000000
    for v in range(N):
        s.add(random_id())
    aft = get_rss()
    print(aft, "memory used after set creation")
    print(aft - bef, "bytes consumed by set")
    print((aft - bef) / N, "bytes per entry")

if __name__ == "__main__":
    main()
On my machine I get
13307904 memory used before set creation
120147968 memory used after set creation
106840064 bytes consumed by set
106.840064 bytes per entry
I expected you to use hashlib. It didn't make much difference, though. There is still a significant memory overhead :-(
121487360 memory used before set creation
211959808 memory used after set creation
90472448 bytes consumed by set
90.472448 bytes per entry
import os
import random
import hashlib

import psutil

def get_rss():
    """ return the RSS of the current process """
    return psutil.Process(os.getpid()).memory_info().rss

def random_text(minlength=80, maxlength=140):
    hexdigit = '0123456789abcdef'
    rndtxt = ''
    length = random.randrange(minlength, maxlength)
    for i in range(length):
        rndtxt += random.choice(hexdigit)
    return rndtxt

def main():
    bef = get_rss()
    print(bef, "memory used before set creation")
    s = set()
    N = 1000000
    for v in range(N):
        # hash each line separately (a fresh md5 object per line;
        # reusing one object would hash the concatenation of all lines)
        md5 = hashlib.md5(random_text().encode('utf-8'))
        s.add(md5.digest())
    aft = get_rss()
    print(aft, "memory used after set creation")
    print(aft - bef, "bytes consumed by set")
    print((aft - bef) / N, "bytes per entry")

if __name__ == "__main__":
    main()
ASKER
@aikimark: Yes, I know you meant hashlib, but I just wanted to simulate a 16-byte hash with a 16-byte string, which is of course not quite right, as a digest is a byte string rather than a str and is thus about 16 bytes smaller (106 - 16 = 90). The other ~74 bytes are probably just the overhead of a set entry, the Python object header, and the bookkeeping for memory allocation/deallocation.
It is possible to use ctypes, structs and arrays to store data more efficiently, but that works best for fixed-size arrays, not for arrays of unknown size or for hashes.
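That per-entry overhead can also be cross-checked without psutil using sys.getsizeof (a rough sketch; the exact sizes are CPython implementation details and will vary between versions):

```python
import sys

# 10,000 distinct 16-byte values standing in for md5 digests
digests = {i.to_bytes(16, "big") for i in range(10000)}

table = sys.getsizeof(digests)                    # the set's hash table itself
objects = sum(sys.getsizeof(d) for d in digests)  # the bytes objects it points to
per_entry = (table + objects) / len(digests)
print("about %.0f bytes per 16-byte entry" % per_entry)
```

Most of the overhead is the bytes object header plus the set's partially empty hash table, which matches the measured ~90 bytes per entry reasonably well.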
I was going to do some detailed calculations.
1. Read file2, storing the hash of each line in a dictionary.
2. For each line in file1, hash the line and check whether the hash is in your dictionary. If not, write the line.