Compare two files using python script

Hi All,
I want to compare two files which contain the following Ids (fields).

----------file01--------
6100100013
6110010003
6120010001
6120010002
-------------------------

----------file02---------
6120120001
6130040001
6130070001
6130070005
-------------------------

Two outputs are required:
01.) file01 Ids which are not in file02.
02.) file02 Ids which are not in file01.

BR Dushan

SOLUTION

Haris V

This solution is only available to members.

SOLUTION

woolmilkporc

This solution is only available to members.

Dushan Silva

ASKER

Thanks! But it gives the following error. These files contain more than 150,000 records.
----------------------------------------------------------
Traceback (most recent call last):
  File "compare.py", line 10, in <module>
    if i != fileTwo[x]:
IndexError: list index out of range
----------------------------------------------------------

Dushan Silva

ASKER

Thanks woolmilkporc! But I specifically need a solution as a Python script.

SOLUTION

ghostdog74

This solution is only available to members.

Dushan Silva

ASKER

Thanks ghostdog74! But that is different from what I want, and it will also definitely say "index out of range", because I have more than 150,000 records in my files.

BR Dushan

HonorGod

Is the data ordered, or are we talking about 150,000 random records?

Dushan Silva

ASKER

It's ordered :)

Kamaraj Subramanian

This script will compare two files line by line and write the differences to a third file.

f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
 
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
outFile = open("results.txt", "w")
x = 0
for i in fileOne:
   if i != fileTwo[x]:
      outFile.write(i+" <> "+fileTwo[x])
   x <strong class="highlight">+</strong>= 1
 
outFile.close()

Kamaraj Subramanian

you can check this too

http://docs.python.org/library/difflib.html#differ-examples
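
As a rough illustration of the difflib route (a sketch, not taken from the linked page): difflib.ndiff prefixes lines found only in the first sequence with "- " and lines found only in the second with "+ ". The helper name diff_ids and the sample ids below are assumptions made for the example, and difflib may be slow on files with 150,000 lines.

```python
import difflib


def diff_ids(ids1, ids2):
    """Return (only_in_first, only_in_second) for two sorted lists of id strings."""
    only1, only2 = [], []
    for line in difflib.ndiff(ids1, ids2):
        if line.startswith("- "):
            only1.append(line[2:])   # present only in the first list
        elif line.startswith("+ "):
            only2.append(line[2:])   # present only in the second list
        # lines starting with "  " (common) or "? " (hints) are ignored
    return only1, only2


# Hypothetical sample data, shaped like the ids in the question.
file01 = ["6100100013", "6110010003", "6120010001", "6120010002"]
file02 = ["6120010001", "6120120001", "6130040001"]

only1, only2 = diff_ids(file01, file02)
print("in 01, not in 02:", only1)
print("in 02, not in 01:", only2)
```

To run it on real files, the two lists could be built with something like [line.strip() for line in open("file1.txt")].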

Dushan Silva

ASKER

Thanks itkamaraj!
But it gives the following error.
--------------------------------------------------------------------------------------------------------------
File "compare.py", line 32
x <strong class="highlight">+</strong>= 1
^
SyntaxError: invalid syntax
--------------------------------------------------------------------------------------------------------------

BR Dushan

SOLUTION

Kamaraj Subramanian

This solution is only available to members.

Dushan Silva

ASKER

Thanks itkamaraj! It's working.
But it shows the results as
-----------------
630260002
<> 630260001
630260004
<> 630260002
630260005
<> 630260004
630260006
<> 630260005
630260007
-------------------

Example: 630260002 is present in both files. Both files are sorted, but because some values are missing from one file or the other, matching values do not sit on the same line number in both files.

Please provide a way to output only the values which are in file02 but not in file01,
and
the values which are in file01 but not in file02.

BR Dushan

Roger Baklund

How does this work:

# filecompare.py
in1 = file('file01.txt')
in2 = file('file02.txt')
out1 = file('in01notin02.txt','w')
out2 = file('in02notin01.txt','w')
f1_line = in1.readline().strip()
f2_line = in2.readline().strip()
while f1_line or f2_line:
  if f1_line==f2_line: 
    f1_line = in1.readline().strip()
    f2_line = in2.readline().strip()
  while f1_line and f1_line < f2_line:
    print 'in01notin02',f1_line
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  while f2_line and f1_line > f2_line:
    print 'in02notin01',f2_line
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
  while f1_line and not f2_line:
    print 'in01notin02',f1_line
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  while f2_line and not f1_line:
    print 'in02notin01',f2_line
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
in1.close()
in2.close()
out1.close()
out2.close()

ASKER CERTIFIED SOLUTION

Roger Baklund

This solution is only available to members.

Dushan Silva

ASKER

Thanks! But this is giving the greater-than and less-than values. I only want the fields which are in file01 but not in file02,
and
the fields which are in file02 but not in file01.

BR Dushan

Roger Baklund

That is what my script is doing... did you try to run it?

It reads both files in parallel. While the value from one file is smaller than the value from the other file, it does not exist in the other file (because the files are sorted), so it is written to the output file. Then the next row from the input is checked, and so on. Just test it. :)
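
The parallel walk described above can be sketched in Python 3 as a function. This is a minimal sketch, not the posted script: the names read_ids and sorted_diff are made up for illustration, and the ids are converted to int on the assumption that the files are sorted numerically (string comparison agrees with numeric order only when every id has the same number of digits).

```python
def read_ids(path):
    """Yield each non-blank line of the file as an int (assumes numeric ids)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield int(line)


def sorted_diff(ids1, ids2):
    """Walk two ascending iterables in step; return (only_in_1, only_in_2)."""
    it1, it2 = iter(ids1), iter(ids2)
    a, b = next(it1, None), next(it2, None)
    only1, only2 = [], []
    while a is not None or b is not None:
        if b is None or (a is not None and a < b):
            only1.append(a)          # a cannot appear later in the second input
            a = next(it1, None)
        elif a is None or b < a:
            only2.append(b)          # b cannot appear later in the first input
            b = next(it2, None)
        else:                        # equal: present in both, skip it
            a, b = next(it1, None), next(it2, None)
    return only1, only2


# Small in-memory demo with hypothetical ids.
ids01 = [6100100013, 6110010003, 6120010001, 6120010002, 6130040001, 6130070005]
ids02 = [6120010001, 6120120001, 6130040001, 6130070001, 6130070005]
only1, only2 = sorted_diff(ids01, ids02)
print(only1)
print(only2)
```

With real files the call would look like sorted_diff(read_ids("file01.txt"), read_ids("file02.txt")), which reads both inputs lazily and never holds a whole file in memory.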

Dushan Silva

ASKER

Thanks! These two files are sorted, but this script is outputting the greater-than values from the files, which I don't want. I want only the missing ids.

BR Dushan

Dushan Silva

ASKER

Yes, I've executed this script, and the output contained ids which are already in both files.

Roger Baklund

>> I want only missing ids

That is what the script is supposed to do.

>> Yes I've executed this script and I got output ids which are already in both files..

That did not happen with my test files. Could you provide some example files? Preferably not with 150,000 rows, but two smaller files that fail. These are my test files and my results:

# file01.txt
6100100013
6110010003
6120010001
6120010002
6130040001
6130070005
 
# file02.txt
6120010001
6120120001
6130040001
6130070001
6130070005
 
# output files:
 
# in01notin02.txt
6100100013
6110010003
6120010002
 
# in02notin01.txt
6120120001
6130070001

Dushan Silva

ASKER

Hi All,
Thanks a lot for your kind help!
I found the following solution using the powerful "Sets" class.

http://www.daniweb.com/code/snippet708.html
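
For reference, a minimal sketch of a set-based comparison. This is not the code from the linked snippet: it uses Python's built-in set type, the function names read_ids and compare_files are hypothetical, and the demo writes two small temporary files with made-up ids so that it runs on its own.

```python
import os
import tempfile


def read_ids(path):
    """Return the set of non-blank, stripped lines in the file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def compare_files(path1, path2):
    """Return (ids only in path1, ids only in path2), each as a sorted list."""
    ids1, ids2 = read_ids(path1), read_ids(path2)
    return sorted(ids1 - ids2), sorted(ids2 - ids1)


# Demo: create two small temporary files with hypothetical ids.
tmp_dir = tempfile.mkdtemp()
path01 = os.path.join(tmp_dir, "file01.txt")
path02 = os.path.join(tmp_dir, "file02.txt")
with open(path01, "w") as f:
    f.write("6100100013\n6110010003\n6120010001\n6120010002\n")
with open(path02, "w") as f:
    f.write("6120010001\n6120120001\n6130040001\n")

only_in_01, only_in_02 = compare_files(path01, path02)
print("in file01, not in file02:", only_in_01)
print("in file02, not in file01:", only_in_02)
```

Because sets ignore order, this works whether or not the files are sorted, at the cost of loading all ids from both files into memory. The results are sorted as strings here; sorting with key=int would give numeric order if the ids differ in length.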

BR Dushan