Dushan Silva asked:

Compare two files using python script

Hi All,
I want to compare two files that contain the following IDs (fields).

----------file01--------
6100100013
6110010003
6120010001
6120010002
-------------------------

----------file02---------
6120120001
6130040001
6130070001
6130070005
-------------------------


Two outputs are required:
01.) file01 IDs that are not in file02.
02.) file02 IDs that are not in file01.

BR Dushan
SOLUTION
Haris V
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Thanks! But it gives the following error. These files contain more than 150,000 records.
----------------------------------------------------------
Traceback (most recent call last):
  File "compare.py", line 10, in <module>
    if i != fileTwo[x]:
IndexError: list index out of range
----------------------------------------------------------
Thanks woolmilkporc! But I specifically need a solution written as a Python script.
Thanks ghostdog74! But I want something different, and it will also definitely say "index out of range", because I have more than 150,000 records in my files.

BR Dushan
Is the data ordered, or are we talking about 150,000 random records?
It's ordered :)
This script will compare two files line by line and write the differences to a third file.
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
 
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
outFile = open("results.txt", "w")
x = 0
for i in fileOne:
   if i != fileTwo[x]:
      outFile.write(i+" <> "+fileTwo[x])
   x += 1
 
outFile.close()

Thanks itkamaraj!
But it gives the following error.
--------------------------------------------------------------------------------------------------------------
  File "compare.py", line 32
    x <strong class="highlight">+</strong>= 1
                  ^
SyntaxError: invalid syntax
--------------------------------------------------------------------------------------------------------------

BR Dushan
Thanks itkamaraj! It's working.
But it shows the results as
-----------------
630260002
 <> 630260001
630260004
 <> 630260002
630260005
 <> 630260004
630260006
 <> 630260005
630260007
-------------------

Example: 630260002 is present in both files. Both files are sorted, but because some values are missing from one file or the other, matching values are not always on the same line number.

Please provide a way to list only the values that are in file02 but not in file01,
and
the values that are in file01 but not in file02.

BR Dushan
How does this work:
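# filecompare.py
# Assumes both files are sorted and every ID has the same length, so that
# comparing the lines as strings gives the same order as comparing numbers.
# A blank line is treated as the end of a file.
in1 = open('file01.txt')
in2 = open('file02.txt')
out1 = open('in01notin02.txt', 'w')
out2 = open('in02notin01.txt', 'w')
f1_line = in1.readline().strip()
f2_line = in2.readline().strip()
while f1_line or f2_line:
  # Same ID in both files: advance both.
  if f1_line == f2_line:
    f1_line = in1.readline().strip()
    f2_line = in2.readline().strip()
  # IDs in file01 that are smaller than the current file02 ID are missing from file02.
  while f1_line and f1_line < f2_line:
    print('in01notin02', f1_line)
    out1.write(f1_line + "\n")
    f1_line = in1.readline().strip()
  # IDs in file02 that are smaller than the current file01 ID are missing from file01.
  while f2_line and f1_line > f2_line:
    print('in02notin01', f2_line)
    out2.write(f2_line + "\n")
    f2_line = in2.readline().strip()
  # file02 is exhausted: everything left in file01 is missing from file02.
  while f1_line and not f2_line:
    print('in01notin02', f1_line)
    out1.write(f1_line + "\n")
    f1_line = in1.readline().strip()
  # file01 is exhausted: everything left in file02 is missing from file01.
  while f2_line and not f1_line:
    print('in02notin01', f2_line)
    out2.write(f2_line + "\n")
    f2_line = in2.readline().strip()
in1.close()
in2.close()
out1.close()
out2.close()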

Thanks! But this gives greater-than and less-than values. I only want the fields that are in file01 but not in file02,
and
the fields that are in file02 but not in file01.

BR Dushan
That is what my script is doing... did you try to run it?

It reads both files in parallel. While the value from one file is smaller than the value from the other file, it cannot exist in the other file (because the files are sorted), so it is written to the output file. Then the next row from the input is checked, and so on. Just test it. :)
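
For reference, the parallel walk described here can be sketched on in-memory lists. This is only an illustration under stated assumptions, not the script posted earlier: the function name `sorted_diff` is made up for the example, both inputs are assumed to be sorted in ascending order, and the sample IDs are taken from the test files shown in this thread.

```python
def sorted_diff(ids_a, ids_b):
    """Return (only_in_a, only_in_b) for two ascending-sorted lists.

    Hypothetical helper for illustration; assumes both lists are sorted.
    """
    only_a, only_b = [], []
    i = j = 0
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            # Present in both lists: skip it on both sides.
            i += 1
            j += 1
        elif ids_a[i] < ids_b[j]:
            # Smaller than anything left in ids_b, so it cannot be there.
            only_a.append(ids_a[i])
            i += 1
        else:
            # Smaller than anything left in ids_a, so it cannot be there.
            only_b.append(ids_b[j])
            j += 1
    # Whatever remains once one list is exhausted is missing from the other.
    only_a.extend(ids_a[i:])
    only_b.extend(ids_b[j:])
    return only_a, only_b


file01 = [6100100013, 6110010003, 6120010001, 6120010002, 6130040001, 6130070005]
file02 = [6120010001, 6120120001, 6130040001, 6130070001, 6130070005]
only_in_01, only_in_02 = sorted_diff(file01, file02)
print(only_in_01)  # [6100100013, 6110010003, 6120010002]
print(only_in_02)  # [6120120001, 6130070001]
```

Each list is visited once, so the work grows linearly with the number of IDs, which should be comfortable for 150,000 records.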
Thanks! These two files are sorted, but this script gives the greater-than values from the files, which I don't want. I want only missing ids.

BR Dushan
Yes I've executed this script and I got output ids which are already in both files..
>> I want only missing ids

That is what the script is supposed to do.

>> Yes I've executed this script and I got output ids which are already in both files..

That did not happen with my test files. Could you provide some example files? Preferably not with 150,000 rows, but two smaller files that fail. These are my test files and my results:
# file01.txt
6100100013
6110010003
6120010001
6120010002
6130040001
6130070005
 
# file02.txt
6120010001
6120120001
6130040001
6130070001
6130070005
 
# output files:
 
# in01notin02.txt
6100100013
6110010003
6120010002
 
# in02notin01.txt
6120120001
6130070001

Hi All,
Thanks a lot for your kind help!
I found the following solution, which uses the powerful "Sets" class.

http://www.daniweb.com/code/snippet708.html

BR Dushan
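
The linked snippet is not reproduced here. As a rough sketch of the set-based idea, assuming Python's built-in `set` type rather than whatever the snippet itself uses, something like the following would produce both outputs. The helper names `load_ids`, `diff_ids`, and `diff_id_files` are made up for illustration, and the sample lines are the test data from earlier in this thread.

```python
def load_ids(lines):
    """Build a set of IDs from an iterable of lines, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def diff_ids(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sorted lists of ID strings."""
    ids_a = load_ids(lines_a)
    ids_b = load_ids(lines_b)
    # Set difference keeps the elements of the left set that are absent from the right.
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a)


def diff_id_files(path_a, path_b):
    """Same comparison, reading the IDs from two text files (hypothetical paths)."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return diff_ids(file_a, file_b)


# In-memory demo with the thread's test data.
file01_lines = ["6100100013\n", "6110010003\n", "6120010001\n",
                "6120010002\n", "6130040001\n", "6130070005\n"]
file02_lines = ["6120010001\n", "6120120001\n", "6130040001\n",
                "6130070001\n", "6130070005\n"]
only_in_01, only_in_02 = diff_ids(file01_lines, file02_lines)
print(only_in_01)  # ['6100100013', '6110010003', '6120010002']
print(only_in_02)  # ['6120120001', '6130070001']
```

Unlike the line-by-line and parallel-read approaches, this does not depend on the files being sorted, at the cost of holding both ID sets in memory. The results are sorted as strings, which matches numeric order only when all IDs have the same length.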