gelonida asked:

find all lines existing in file1 but not in file2 (for very huge files)

In fact these are really three questions:
1.) Is grep -f memory friendly? (so could I avoid writing any Python code?)
2.) Does somebody already have a working solution for the problem below?
3.) Could you make suggestions or review my code as soon as it starts working?


I'm trying to find a solution that works for really huge files (one that will not use more RAM as the files grow).
The problem: identify all lines that exist in the first file but do not exist
in the second one. The same line never occurs twice within either file.

Here is a small example:

file1 (f1.txt)
aba
abb
abd
abf
one
two
three
four
five
six
seven



File2 (f2.txt)
abd
abe
eleven
four
eight
one
six



The expected result (order is not relevant)
aba
abb
abf
five
seven
three
two



I already have three approaches and an idea for a fourth one:
1.) use grep -v -f f2.txt f1.txt, but I know nothing at all about its memory consumption
2.) read both files into memory (so not suitable for huge files)
3.) the first file may be huge, but the second file must fit into memory (so not suitable if the second file is huge)
4.) I'm still working on it. At the moment the solution does not work, as I didn't finish it. The idea is explained below.

I have an idea of how to implement a solution, but wanted to know whether anybody has already written such code or knows of a standard solution.

My idea is to sort both files (the default Linux sort is quite smart and works even if the files don't fit into RAM; I think it uses temporary files and performs an external merge sort) and then use code that reads both files chunk-wise and compares them.

As soon as I have implemented the solution I will post it and ask for advice on how to optimize it further.


Here is the code to test the three solutions:

import small_data
import only_one_huge_file
import both_files_huge


def run_tst(func, fname1="f1.txt", fname2="f2.txt"):
    """
        run one membership function, write every line that is not in
        fname2 to notinf2.txt and show the first 20 or so result lines
    """

    print("result for %s with %s and %s:" % (func, fname1, fname2))

    with open("notinf2.txt", "w") as fout:
        for flag, line in func(fname1, fname2):
            if not flag:  # line is not a member of fname2
                fout.write(line)

    # display the first few lines
    with open("notinf2.txt") as fin:
        for linecount, line in enumerate(fin):
            print(line.strip())
            if linecount == 20:
                break


run_tst(small_data.check_membership_small)
run_tst(only_one_huge_file.check_membership_one_huge_file)
run_tst(both_files_huge.check_membership_two_huge)



Here is the code for the 'all in RAM' solution (small_data.py):
def check_membership_small(fname1, fname2):
    with open(fname1) as fin1, open(fname2) as fin2:
        lines1 = set(fin1)
        lines2 = set(fin2)
    result = lines1 - lines2

    # the next line is only there to make comparing results easier
    result = sorted(result)

    for line in result:
        yield False, line



Here is a solution where the second file is read into memory but the first isn't (only_one_huge_file.py):
def check_membership_one_huge_file(fname1, fname2):
    with open(fname1) as fin1, open(fname2) as fin2:
        lines2 = set(fin2)
        for line in fin1:
            yield line in lines2, line



Here is the beginning of my attempt to read only chunks of both files into RAM (as the test data is short, the chunk size is set to two lines) (both_files_huge.py):
import subprocess


def check_membership_two_huge(fname1, fname2, chunk_size=2):
    subprocess.call("sort < '%s' > '%s'" % (fname1, fname1 + ".sorted"), shell=True)
    subprocess.call("sort < '%s' > '%s'" % (fname2, fname2 + ".sorted"), shell=True)
    with open(fname1 + ".sorted") as fin1, open(fname2 + ".sorted") as fin2:
        chunk1 = []
        chunk2 = []
        all_read1 = False
        all_read2 = False
        while True:
            # read a 'reasonable' chunk into chunk1
            if not all_read1 and len(chunk1) < chunk_size / 2:
                for i in range(chunk_size):
                    try:
                        chunk1.append(next(fin1))
                    except StopIteration:
                        all_read1 = True

            # read a 'reasonable' chunk into chunk2
            if not all_read2 and len(chunk2) < chunk_size / 2:
                for i in range(chunk_size):
                    try:
                        chunk2.append(next(fin2))
                    except StopIteration:
                        all_read2 = True

            # do some smart stuff here; at the moment the next four lines
            # are just a placeholder for the real comparison code
            if chunk1:
                yield False, chunk1[0]
                chunk1 = []
                chunk2 = []

            if all_read1 and all_read2:
                break

aikimark:
I think string hashing will be key to the solution of your problem.

1. Read file2, storing the hash of each line in a dictionary.
2. For each line in file1, hash the line and check whether the hash is in your dictionary.  If not, write the line.
Potentially, you can insert each unique line in the two files into two database tables. Then you can do a left-join query and let the database engine do the heavy lifting for the comparison.
You can use a set instead of a dictionary.  It is a bit faster than a dictionary and is actually better suited to the member-of test in this algorithm.
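A minimal sketch of that set-of-hashes idea (the function name is mine, not from the thread), assuming MD5 digests of file2's lines are kept in an in-memory set while file1 is streamed against it:

import hashlib

def lines_not_in_file2(fname1, fname2):
    # illustrative sketch of the suggestion above, not code from the thread
    # keep only the 16-byte digests of file2's lines in memory
    with open(fname2) as fin2:
        seen = {hashlib.md5(line.encode("utf-8")).digest() for line in fin2}
    # stream file1 and emit every line whose digest is unknown
    with open(fname1) as fin1:
        for line in fin1:
            if hashlib.md5(line.encode("utf-8")).digest() not in seen:
                yield line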
arnold:

One option is:
sort and uniq can be used to determine whether either file contains duplicates, and which ones they are:
sort -f file1.txt | uniq -c | sort -rn | awk ' ($1>1) {print $2 } '
Repeat for the other file. Then sort each file, dropping duplicates:
sort -fu file1.txt > sorted_file1.txt
Repeat for the other file.

comm -3 sorted_file1.txt sorted_file2.txt
will output the lines unique to file1 in the first column and the lines unique to file2 in the second column (see https://linux.die.net/man/1/comm). Or you can add -2 as an option to suppress the items unique to file2.
gelonida (Asker):

@aikimark:

Let's assume the files are really huge and that even the hashes would not fit into RAM.
However, with hashes I could probably deal with a second file containing at least five times more lines than in my current context.

@aikimark:

I'm afraid creating the database would be quite slow, but true, it would not overflow RAM and would save me the sorting.

@aikimark:
Agreed, sets should be faster (as you can see, I'm using sets in my all-in-RAM solution).

I think I now have a working solution. Any suggestions for improvement?

import subprocess

def check_membership_two_huge(fname1, fname2):
    subprocess.call("sort < '%s' > '%s'" % (fname1, fname1 + ".sorted"), shell=True)
    subprocess.call("sort < '%s' > '%s'" % (fname2, fname2 + ".sorted"), shell=True)

    def nextline(fin, eof, line):
        """return (eof_flag, line); keep the last line once the file is exhausted"""
        if eof:
            return True, line
        try:
            return False, next(fin)
        except StopIteration:
            return True, line

    with open(fname1 + ".sorted") as fin1, open(fname2 + ".sorted") as fin2:
        eof1, line1 = nextline(fin1, False, None)
        eof2, line2 = nextline(fin2, False, None)
        while not (eof1 and eof2):
            if eof1:
                # every line of file1 has already been handled; the rest of
                # file2 is irrelevant, so stop (without this branch the loop
                # could repeat the last file1 line forever)
                break
            elif eof2:
                # file2 is exhausted: every remaining file1 line is unmatched
                yield False, line1
                eof1, line1 = nextline(fin1, eof1, line1)
            elif line1 == line2:
                yield True, line1
                eof1, line1 = nextline(fin1, eof1, line1)
                eof2, line2 = nextline(fin2, eof2, line2)
            elif line1 > line2:
                # line2 can no longer match anything: skip it
                eof2, line2 = nextline(fin2, eof2, line2)
            else:
                # line1 < line2: line1 cannot be in file2
                yield False, line1
                eof1, line1 = nextline(fin1, eof1, line1)



In the end I did not use chunking but a line-by-line algorithm.
Perhaps I could speed it up with chunking (reading a few thousand lines in one go), but the code might be more complicated.


I guess that by inlining the code instead of calling the nested nextline() helper I might gain a little speed.
@arnold:

Yes, indeed comm -2 -3 f1.txt.sorted f2.txt.sorted yields the correct result.

I did not know about this command.
Probably arnold's suggestion will perform best, though it will not work on Windows (and it is not Python, but in my context that is not essential).

Sorting huge files will probably not work on Windows anyway, unless Windows has a command-line tool with features similar to the Linux sort.

If you guys don't object I'd like to keep the question open a little longer, just to get a little more feedback on the Python code.
ASKER CERTIFIED SOLUTION
arnold
(solution text only available to Experts Exchange members)
@arnold:

Yes, I thought about grep -v -f but never stumbled across comm.

Yep, I could also check the DBM solution, but I think the solution I posted is algorithmically close to comm and should thus be quite efficient.

Though DBM would save me the sorting time. I'll have to try it out.
Oh, in the above, since only file1 matters, you would have to add logic to evaluate which file has more lines, and then set a marker to know which lines to output.

On Windows you can install Cygwin (cygwin.com), a Unix-like environment, or use http://unxutils.sourceforge.net/, which has seemingly reached maturity with no further updates.
SOLUTION
(solution text only available to Experts Exchange members)
Rather than showing us a "small example", please post a representative example of the lines in these two files.
How many lines are there in file1 and file2?
@aikimark:

The contents are work-related (confidential).

The typical line length is about 80-140 (ASCII) characters (a few UTF-8 characters might show up, but haven't so far).
One file will contain about 20 million lines, the other about 4 million.
About 10% of the lines in file2 might not be in file1; all the others are.

The machines on which I have to run the code are not very well equipped, and I need the RAM for other server tasks running on them. I might have about 500 MB to 1 GB of RAM available for the task, but I'd like to avoid at all costs that the machine starts swapping because of my script.

I need to perform this processing perhaps once per week in order to identify some server malfunction.
When the issue is fixed, the number of lines might be lower.

In fact, at the moment I am using the solution where only file2 is stored as a set in memory, but I wanted (out of curiosity, and prophylactically) to implement a less RAM-hungry solution.
For your situation you only need one file as a reference.
It would be inefficient to load both files into memory.

When RAM is the constraint I would not care how long it takes, so load the smaller of the files and reverse the logic.

I.e. if file2 is the shorter one:
load it into memory or a DBM;
then go line by line through file1;
if the entry does not exist in the DBM, output it;
if the entry exists, delete it from the DBM;
when the DBM is empty, output the rest, as it is unique to file1.

The logic can be simplified if your initial DB build uses an alphabetical tree.

If file1 is still on the A's but there is no content starting with A left in the DB, you can output all remaining A lines from the file directly.
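A minimal sketch of that DBM idea (the function and file names are mine, not from the thread), assuming Python's dbm module and that file2 is the smaller file:

import dbm

def lines_only_in_file1(fname1, fname2, dbpath="f2_lines.db"):
    # illustrative sketch of the suggestion above, not code from the thread
    # build an on-disk key store from the smaller file
    with dbm.open(dbpath, "n") as db, open(fname2) as fin2:
        for line in fin2:
            db[line.rstrip("\n")] = ""
    # stream the big file and emit every line whose key is absent
    with dbm.open(dbpath, "w") as db, open(fname1) as fin1:
        for line in fin1:
            key = line.rstrip("\n")
            if key in db:
                del db[key]  # shrink the store as matches are found
            else:
                yield line

Since the keys live on disk, RAM usage stays roughly constant regardless of file size, at the cost of one database lookup per line.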
Well, an MD5 hash is 16 bytes, so a set of file1 hashes has a 64 MB memory footprint and a set of file2 hashes would require a 320 MB memory footprint.  Both of those are well under your 500 MB cap.  Don't worry about sorting and merge logic.
And the memory footprint only exists for the length of time it takes to read, hash, and check (for set inclusion) each line of the other file.  So, even if you had the larger (320 MB) set in memory, it wouldn't be for very long; perhaps a minute or two.
phoffric:

>> could I avoid writing any python code
Yes.
sort f1.txt | uniq > sort_f1.txt
sort f2.txt | uniq > sort_f2.txt
comm -23 sort_f1.txt sort_f2.txt


Output:
aba
abb
abf
five
seven
three
two


Sorry Arnold, I just realized that my solution is very close to yours.
I was thinking about two slightly different solutions.  One uses a set of file2 line hashes.  That is the one I had in mind when I posted earlier comments.

Yesterday, I thought about a dictionary of file1 hashes, where the line number is the value for the (line content) hash key.  This is the minimal memory configuration (~64+ MB).  As you read file2 lines, you look for inclusion in the (file1) dictionary.  When you find a match, you delete that item from the dictionary.  After iterating this process for all lines in file2, the dictionary contains data about unmatched lines.
* convert the dictionary.values() into a set with the set() function for fastest lookup
* remove the dictionary to free up memory
* iterate the file1 lines again; when the line number is in the set, output the actual line text.
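A minimal sketch of that two-pass dictionary idea (function names are mine, not from the thread), again assuming MD5 digests:

import hashlib

def md5_of(line):
    # illustrative helper, not code from the thread
    return hashlib.md5(line.encode("utf-8")).digest()

def unmatched_lines(fname1, fname2):
    # pass 1: map the 16-byte digest of every file1 line to its line number
    remaining = {}
    with open(fname1) as fin1:
        for lineno, line in enumerate(fin1):
            remaining[md5_of(line)] = lineno

    # drop every digest that also occurs in file2
    with open(fname2) as fin2:
        for line in fin2:
            remaining.pop(md5_of(line), None)

    # pass 2: the surviving line numbers identify the lines unique to file1
    wanted = set(remaining.values())
    del remaining  # free the dictionary before re-reading file1
    with open(fname1) as fin1:
        for lineno, line in enumerate(fin1):
            if lineno in wanted:
                yield line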
Thanks a lot everybody!

All of the answers were very instructive and interesting. Finally I chose 2.) as the 'correct' answer, as it can be implemented in pure Python and will not increase RAM consumption even if the files grow bigger.

However, I still didn't have time to check whether the performance / runtime is reasonable (similar to or better than solution 1).

There are at least three correct answers:

1.) @arnold
Sort both files with an out-of-memory sort (e.g. Linux's sort), then
comm -2 -3 f1.txt.sorted f2.txt.sorted (or my Python version of it).

2.) @arnold
A version where the smaller file is stored in a DB (e.g. DBM) or just in a Python shelve (which might use DBM underneath).

3.) @aikimark
As aikimark pointed out, for not-so-extreme cases I might still live with an in-memory solution using a set of hashes.
However, I cannot confirm the suggested memory consumption; Python has a huge overhead for many data structures.

I made a small test and it seems that a 16-byte entry in a set consumes about 100 bytes, not 16 as assumed.

#!/usr/bin/env python
import os, random, psutil

def get_rss():
    """ returns RSS of current process """
    return psutil.Process(os.getpid()).memory_info().rss

def random_id(length=16):
    hexdigit = '0123456789abcdef'
    rid = ""
    for i in range(length):
        rid += random.choice(hexdigit)
    return rid

def main():
    bef = get_rss()
    print(bef, "memory used before set creation")
    s = set()
    N = 1000000
    for v in range(N):
        s.add(random_id())
    aft = get_rss()
    print(aft, "memory used after set creation")
    print(aft - bef, "bytes consumed by set")
    print((aft - bef) / N, "bytes per entry")

if __name__ == "__main__":
    main()



On my machine I get
13307904 memory used before set creation
120147968 memory used after set creation
106840064 bytes consumed by set
106.840064 bytes per entry


I expected you to use hashlib.  It didn't make much difference, though.  There is still a significant memory overhead :-(
121487360 memory used before set creation
211959808 memory used after set creation
90472448 bytes consumed by set
90.472448 bytes per entry

import os, random, psutil
import hashlib

def get_rss():
    """ returns RSS of current process """
    return psutil.Process(os.getpid()).memory_info().rss

def random_text(minlength=80, maxlength=140):
    hexdigit = '0123456789abcdef'
    rndtxt = ''
    length = random.randrange(minlength, maxlength)
    for i in range(length):
        rndtxt += random.choice(hexdigit)
    return rndtxt

def main():
    bef = get_rss()
    print(bef, "memory used before set creation")
    s = set()
    N = 1000000
    md5 = hashlib.md5()
    for v in range(N):
        md5.update(random_text().encode('utf-8'))
        s.add(md5.digest())
    aft = get_rss()
    print(aft, "memory used after set creation")
    print(aft - bef, "bytes consumed by set")
    print((aft - bef) / N, "bytes per entry")

if __name__ == "__main__":
    main()


@aikimark: Yes, I know you meant to use hashlib, but I just wanted to simulate a 16-byte hash with a 16-character string, which is of course not quite right, as a hash is a byte string and not a str and is thus about 16 bytes smaller (106 - 16 = 90). The other ~74 bytes are probably just the overhead of a set entry or a Python object, or bookkeeping required for memory allocation/deallocation.

It is possible to use ctypes, structs and arrays to store the data more efficiently, but that works best for fixed-size records and not for collections whose size isn't known in advance.
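For what it is worth, here is a minimal sketch (mine, not from the thread) of packing fixed-size MD5 digests into one contiguous sorted bytes buffer and probing it with a binary search, which avoids the per-entry overhead of a set (building the sorted list still needs temporary memory for the individual digest objects):

import hashlib

DIGEST_SIZE = 16  # an MD5 digest is 16 bytes

def build_digest_blob(fname):
    """Return one bytes object holding the sorted digests of all lines (illustrative sketch)."""
    with open(fname) as fin:
        digests = [hashlib.md5(line.encode("utf-8")).digest() for line in fin]
    digests.sort()
    return b"".join(digests)

def blob_contains(blob, digest):
    """Binary-search a fixed-size digest inside the sorted blob."""
    lo, hi = 0, len(blob) // DIGEST_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        if blob[mid * DIGEST_SIZE:(mid + 1) * DIGEST_SIZE] < digest:
            lo = mid + 1
        else:
            hi = mid
    return blob[lo * DIGEST_SIZE:(lo + 1) * DIGEST_SIZE] == digest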
I was going to do some detailed calculations.