Solved

Compare two files using python script

Posted on 2009-07-01
Last Modified: 2013-12-26
Hi All,
I want to compare two files that contain the following IDs (fields).

----------file01--------
6100100013
6110010003
6120010001
6120010002
-------------------------

----------file02---------
6120120001
6130040001
6130070001
6130070005
-------------------------


Two outputs are required:
01.) file01 IDs that are not in file02.
02.) file02 IDs that are not in file01.

BR Dushan
0
Comment
Question by:Dushan De Silva
21 Comments
 
LVL 8

Assisted Solution

by:Haris V
Haris V earned 150 total points
ID: 24752673
0
 
LVL 68

Assisted Solution

by:woolmilkporc
woolmilkporc earned 150 total points
ID: 24752767
Why use python?
If the files are sorted, use 'comm'.
01.) comm -2 -3 file01 file02
02.) comm -1 -3 file01 file02
Two-column format -
comm -3 file01 file02
0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24752783
Thanks! But it gives the following error. These files contain more than 150,000 records.
----------------------------------------------------------
Traceback (most recent call last):
  File "compare.py", line 10, in <module>
    if i != fileTwo[x]:
IndexError: list index out of range
----------------------------------------------------------
0
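The traceback above is what positional indexing produces when the first file has more lines than the second: `fileTwo[x]` runs past the end of the shorter list. Below is a minimal sketch of the failure and one way to guard against it, assuming the lines are already loaded into lists. The function name `positional_diff` and the sample values are made up for illustration.

```python
from itertools import zip_longest

def positional_diff(lines_one, lines_two):
    """Pair lines by position; None marks a position missing from the shorter input."""
    pairs = []
    for a, b in zip_longest(lines_one, lines_two):
        if a != b:
            pairs.append((a, b))
    return pairs

file_one = ["6100100013", "6110010003", "6120010001"]
file_two = ["6100100013", "6110010003"]

# Indexing by position fails once the shorter list is exhausted.
try:
    for x, line in enumerate(file_one):
        if line != file_two[x]:
            pass
except IndexError as exc:
    print("IndexError:", exc)

# zip_longest pads the shorter input instead of raising.
print(positional_diff(file_one, file_two))
```

This only removes the crash. Comparing by position still does not answer the question of which IDs are missing from the other file.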
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24752789
Thanks woolmilkporc! But I specifically need a solution as a Python script.
0
 
LVL 9

Assisted Solution

by:ghostdog74
ghostdog74 earned 150 total points
ID: 24752823
See here (example 3) for a small example.
0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24752859
Thanks ghostdog74! But that is different from what I want, and it will also definitely say "index out of range", because I have more than 150,000 records in my files.

BR Dushan
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 24753302
Is the data ordered, or are we talking about 150,000 random records?
0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24753430
It's ordered :)
0
 
LVL 23

Expert Comment

by:Kamaraj Subramanian
ID: 24753454
This script compares two files and writes the differences to a third file.
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
 
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
outFile = open("results.txt", "w")
x = 0
for i in fileOne:
   if i != fileTwo[x]:
      outFile.write(i+" <> "+fileTwo[x])
   x <strong class="highlight">+</strong>= 1
 
outFile.close()


0
 
LVL 23

Expert Comment

by:Kamaraj Subramanian
ID: 24753459
0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24753633
Thanks itkamaraj!
But it gives the following error.
--------------------------------------------------------------------------------------------------------------
  File "compare.py", line 32
    x <strong class="highlight">+</strong>= 1
                  ^
SyntaxError: invalid syntax
--------------------------------------------------------------------------------------------------------------

BR Dushan
0
 
LVL 23

Assisted Solution

by:Kamaraj Subramanian
Kamaraj Subramanian earned 300 total points
ID: 24753721
Check this:
# Compare the two files line by line, by position.
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
outFile = open("results.txt", "w")
# zip() stops at the shorter file, so a length mismatch cannot raise IndexError.
for a, b in zip(fileOne, fileTwo):
  if a != b:
     outFile.write(a+" <> "+b)
outFile.close()


0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24753836
Thanks itkamaraj! It's working.
But it shows the results as
-----------------
630260002
 <> 630260001
630260004
 <> 630260002
630260005
 <> 630260004
630260006
 <> 630260005
630260007
-------------------

Example: 630260002 is present in both files. Both files are sorted, but because some values are missing from one file or the other, matching values are not always on the same line number.

Please provide a script that outputs only the values that are in file02 but not in file01,
and
the values that are in file01 but not in file02.

BR Dushan
0
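The output above pairs lines by position, so a single missing ID shifts every later line and each pair is reported as different. The small sketch below uses made-up data patterned after the IDs in the post. It shows the effect, and how testing membership instead of position avoids it.

```python
file_one = ["630260001", "630260002", "630260004", "630260005"]
file_two = ["630260002", "630260004", "630260005", "630260006"]

# Positional comparison: every pair differs because file_two lacks the first ID.
positional = [(a, b) for a, b in zip(file_one, file_two) if a != b]

# Membership comparison: only IDs truly absent from the other file are reported.
ids_one = set(file_one)
ids_two = set(file_two)
only_in_one = [i for i in file_one if i not in ids_two]
only_in_two = [i for i in file_two if i not in ids_one]

print(len(positional))
print(only_in_one)
print(only_in_two)
```

With these sample lists the positional comparison flags all four pairs, while the membership test reports one ID per file.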
 
LVL 39

Expert Comment

by:Roger Baklund
ID: 24754535
How does this work:
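# filecompare.py
# Both input files must be sorted, with one ID per line and no blank lines.
in1 = open('file01.txt')
in2 = open('file02.txt')
out1 = open('in01notin02.txt','w')
out2 = open('in02notin01.txt','w')
f1_line = in1.readline().strip()
f2_line = in2.readline().strip()
# An empty string means the end of that file has been reached.
while f1_line or f2_line:
  # Same ID in both files: skip it.
  if f1_line==f2_line:
    f1_line = in1.readline().strip()
    f2_line = in2.readline().strip()
  # IDs smaller than the current file02 ID cannot appear in file02.
  while f1_line and f1_line < f2_line:
    print('in01notin02 '+f1_line)
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  # IDs smaller than the current file01 ID cannot appear in file01.
  while f2_line and f1_line > f2_line:
    print('in02notin01 '+f2_line)
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
  # file02 is exhausted: everything left in file01 is missing from file02.
  while f1_line and not f2_line:
    print('in01notin02 '+f1_line)
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  # file01 is exhausted: everything left in file02 is missing from file01.
  while f2_line and not f1_line:
    print('in02notin01 '+f2_line)
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
in1.close()
in2.close()
out1.close()
out2.close()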


0
 
LVL 39

Accepted Solution

by:
Roger Baklund earned 750 total points
ID: 24754554
Sorry, you should remove the print statements, that was just for debugging:
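# filecompare.py
# Both input files must be sorted, with one ID per line and no blank lines.
in1 = open('file01.txt')
in2 = open('file02.txt')
out1 = open('in01notin02.txt','w')
out2 = open('in02notin01.txt','w')
f1_line = in1.readline().strip()
f2_line = in2.readline().strip()
# An empty string means the end of that file has been reached.
while f1_line or f2_line:
  # Same ID in both files: skip it.
  if f1_line==f2_line:
    f1_line = in1.readline().strip()
    f2_line = in2.readline().strip()
  # IDs smaller than the current file02 ID cannot appear in file02.
  while f1_line and f1_line < f2_line:
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  # IDs smaller than the current file01 ID cannot appear in file01.
  while f2_line and f1_line > f2_line:
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
  # file02 is exhausted: everything left in file01 is missing from file02.
  while f1_line and not f2_line:
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  # file01 is exhausted: everything left in file02 is missing from file01.
  while f2_line and not f1_line:
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
in1.close()
in2.close()
out1.close()
out2.close()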


0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24754880
Thanks! But this is giving greater-than and less-than values. I only want the fields that are in file01 but not in file02,
and
the fields that are in file02 but not in file01.

BR Dushan
0
 
LVL 39

Expert Comment

by:Roger Baklund
ID: 24754911
That is what my script is doing... did you try to run it?

It reads both files in parallel, and while the value from one file is smaller than the value from the other file, it does not exist in the other file (because the files are sorted), so it is written to the output file. Then the next row from the input is checked. And so on. Just test it. :)
0
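The parallel read described above is a standard merge over two sorted sequences. Below is a compact sketch of the same idea, written as a generator over any two sorted iterables. It is a restatement under assumptions, not the script posted above: IDs are compared as integers (so ordering does not depend on string length), and each input is assumed to be sorted ascending with no duplicates. The name `sorted_diff` and the sample values are made up for illustration.

```python
def sorted_diff(ids_one, ids_two):
    """Yield ('one', id) or ('two', id) for IDs present in only one sorted input."""
    it_one, it_two = iter(ids_one), iter(ids_two)
    a = next(it_one, None)
    b = next(it_two, None)
    while a is not None and b is not None:
        if a == b:            # present in both inputs: skip it
            a = next(it_one, None)
            b = next(it_two, None)
        elif a < b:           # a can no longer appear in the second input
            yield ("one", a)
            a = next(it_one, None)
        else:                 # b can no longer appear in the first input
            yield ("two", b)
            b = next(it_two, None)
    # Whatever remains in either input has no counterpart in the other.
    while a is not None:
        yield ("one", a)
        a = next(it_one, None)
    while b is not None:
        yield ("two", b)
        b = next(it_two, None)

first = [6100100013, 6110010003, 6120010001, 6120010002]
second = [6120010001, 6120120001, 6130040001]
for source, value in sorted_diff(first, second):
    print(source, value)
```

To run it on files, each input could be passed as a generator such as `(int(line) for line in handle if line.strip())`, which keeps only one line per file in memory at a time.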
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24755122
Thanks! These two files are sorted, but this script is giving the greater-than values from the files, which I don't want. I want only missing ids.

BR Dushan
0
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24755141
Yes I've executed this script and I got output ids which are already in both files..
0
 
LVL 39

Expert Comment

by:Roger Baklund
ID: 24755715
>> I want only missing ids

That is what the script is supposed to do.

>> Yes I've executed this script and I got output ids which are already in both files..

That did not happen with my test files. Could you provide some example files? Preferably not with 150,000 rows, but two smaller files that fail. These are my test files and my results:
# file01.txt
6100100013
6110010003
6120010001
6120010002
6130040001
6130070005
 
# file02.txt
6120010001
6120120001
6130040001
6130070001
6130070005
 
# output files:
 
# in01notin02.txt
6100100013
6110010003
6120010002
 
# in02notin01.txt
6120120001
6130070001


0
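If a run on real data does report IDs that exist in both files, one thing worth ruling out is that the inputs break the assumptions of the merge. The script compares stripped lines as strings, so it relies on strictly ascending string order, with no blank lines and no duplicates. The check below is a guess at possible causes, not a diagnosis of the files in this thread. `find_order_problems` is a hypothetical helper, and the sample lines are made up.

```python
def find_order_problems(lines):
    """Return (line_number, text, reason) for lines that break strict ascending string order."""
    problems = []
    previous = None
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            # The merge script treats an empty string as end of file.
            problems.append((number, text, "blank line"))
            continue
        if previous is not None:
            if text == previous:
                problems.append((number, text, "duplicate"))
            elif text < previous:
                problems.append((number, text, "out of order"))
        previous = text
    return problems

sample = ["630260002\n", "630260004\n", "630260004\n", "\n", "6100100013\n"]
for problem in find_order_problems(sample):
    print(problem)
```

The last sample line shows one way this can happen: a 10-digit ID follows a 9-digit ID in numeric order, but as a string "6100100013" sorts before "630260004". An open file can be passed directly, for example `find_order_problems(open("file01.txt"))`.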
 
LVL 17

Author Comment

by:Dushan De Silva
ID: 24760186
Hi All,
Thanks a lot for your kind help!
I found the following solution using the powerful "Sets" class.

http://www.daniweb.com/code/snippet708.html

BR Dushan
0
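The linked snippet is not reproduced here. As a rough sketch of a set-based approach, assuming one ID per line and files small enough to hold in memory, Python's built-in `set` type gives both outputs directly through set difference. The function names are illustrative and not taken from the snippet. Unlike the merge scripts above, this does not need the input to be sorted.

```python
def read_ids(path):
    """Return the set of non-empty, whitespace-stripped lines in a file."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

def write_ids(path, ids):
    """Write IDs one per line, in sorted (string) order."""
    with open(path, "w") as handle:
        for item in sorted(ids):
            handle.write(item + "\n")

def compare_files(path_one, path_two, out_one, out_two):
    """Write the IDs unique to each input file to its own output file."""
    ids_one = read_ids(path_one)
    ids_two = read_ids(path_two)
    write_ids(out_one, ids_one - ids_two)  # in the first file but not the second
    write_ids(out_two, ids_two - ids_one)  # in the second file but not the first
```

A call such as `compare_files("file01.txt", "file02.txt", "in01notin02.txt", "in02notin01.txt")` would produce the two requested outputs, reusing the file names from the scripts earlier in the thread.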
