Solved

Compare two files using python script

Posted on 2009-07-01
Last Modified: 2013-12-26
Hi All,
I want to compare two files that contain the following IDs (fields).

----------file01--------
6100100013
6110010003
6120010001
6120010002
-------------------------

----------file02---------
6120120001
6130040001
6130070001
6130070005
-------------------------


Two outputs are required:
01.) file01 IDs that are not in file02.
02.) file02 IDs that are not in file01.

BR Dushan
Question by:Dushan911
21 Comments
 
LVL 8

Assisted Solution

by:Haris V
Haris V earned 50 total points
ID: 24752673
 
LVL 68

Assisted Solution

by:woolmilkporc
woolmilkporc earned 50 total points
ID: 24752767
Why use python?
If the files are sorted, use 'comm'.
01.) comm -2 -3 file01 file02
02.) comm -1 -3 file01 file02
Two-column format -
comm -3 file01 file02
 
LVL 17

Author Comment

by:Dushan911
ID: 24752783
Thanks! But it gives the following error. These files contain more than 150,000 records.
----------------------------------------------------------
Traceback (most recent call last):
  File "compare.py", line 10, in <module>
    if i != fileTwo[x]:
IndexError: list index out of range
----------------------------------------------------------
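A note on this traceback: the failing line, `if i != fileTwo[x]`, indexes the second list by the position reached in the first one. That suggests the error comes from the two files having different lengths rather than from the record count itself. The sketch below reproduces this under that assumption; `compare_by_position` and the short sample lists are hypothetical names and data used only for illustration.

```python
def compare_by_position(lines_one, lines_two):
    """Compare line x of the first list with line x of the second list."""
    mismatches = []
    for x, line in enumerate(lines_one):
        # Raises IndexError as soon as x reaches len(lines_two)
        if line != lines_two[x]:
            mismatches.append((line, lines_two[x]))
    return mismatches


shorter = ["6100100013", "6110010003"]
longer = ["6100100013", "6110010003", "6120010001"]

# Works: the first list is not longer than the second one
no_error_result = compare_by_position(shorter, longer)

# Fails: the first list has one more line than the second one
try:
    compare_by_position(longer, shorter)
    raised_index_error = False
except IndexError:
    raised_index_error = True
```

Even when it does not crash, a positional comparison reports every line after the first missing ID as different, so it does not answer the "which IDs are missing" question.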
 
LVL 17

Author Comment

by:Dushan911
ID: 24752789
Thanks woolmilkporc! But I specifically need a solution written as a Python script.
 
LVL 9

Assisted Solution

by:ghostdog74
ghostdog74 earned 50 total points
ID: 24752823
see here(example 3) for a small example
 
LVL 17

Author Comment

by:Dushan911
ID: 24752859
Thanks ghostdog74! But that is different from what I want, and it will also definitely say "index out of range", because I have more than 150,000 records in my files.

BR Dushan
 
LVL 41

Expert Comment

by:HonorGod
ID: 24753302
Is the data ordered, or are we talking about 150,000 random records?
 
LVL 17

Author Comment

by:Dushan911
ID: 24753430
It's ordered :)
 
LVL 23

Expert Comment

by:Kamaraj Subramanian
ID: 24753454
This script compares two files line by line and writes the differences to a third file.
f1 = open("file1.txt", "r")

f2 = open("file2.txt", "r")
 

fileOne = f1.readlines()

fileTwo = f2.readlines()

f1.close()

f2.close()

outFile = open("results.txt", "w")

x = 0

for i in fileOne:

   if i != fileTwo[x]:

      outFile.write(i+" <> "+fileTwo[x])

   x <strong class="highlight">+</strong>= 1
 

outFile.close()

 
LVL 23

Expert Comment

by:Kamaraj Subramanian
ID: 24753459

 
LVL 17

Author Comment

by:Dushan911
ID: 24753633
Thanks itkamaraj!
But it gives the following error.
--------------------------------------------------------------------------------------------------------------
  File "compare.py", line 32
    x <strong class="highlight">+</strong>= 1
                  ^
SyntaxError: invalid syntax
--------------------------------------------------------------------------------------------------------------

BR Dushan
 
LVL 23

Assisted Solution

by:Kamaraj Subramanian
Kamaraj Subramanian earned 100 total points
ID: 24753721
check this
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")

fileOne = f1.readlines()
fileTwo = f2.readlines()

f1.close()
f2.close()

outFile = open("results.txt", "w")

x = 0
for i in fileOne:
    if i != fileTwo[x]:
        outFile.write(i+" <> "+fileTwo[x])
    x += 1

outFile.close()

 
LVL 17

Author Comment

by:Dushan911
ID: 24753836
Thanks itkamaraj! It's working.
But it shows the results as
-----------------
630260002
 <> 630260001
630260004
 <> 630260002
630260005
 <> 630260004
630260006
 <> 630260005
630260007
-------------------

Example: 630260002 is available in both files. Both files are sorted, but because some values are missing from one file or the other, matching values are not on the same line number in both files.

Please provide a way to output only the values that are not in file01 but are in file02,
and
the values that are not in file02 but are in file01.
 
BR Dushan

 
LVL 39

Expert Comment

by:Roger Baklund
ID: 24754535
How does this work:
# filecompare.py

in1 = file('file01.txt')
in2 = file('file02.txt')
out1 = file('in01notin02.txt','w')
out2 = file('in02notin01.txt','w')

f1_line = in1.readline().strip()
f2_line = in2.readline().strip()

while f1_line or f2_line:
  if f1_line==f2_line:
    f1_line = in1.readline().strip()
    f2_line = in2.readline().strip()
  while f1_line and f1_line < f2_line:
    print 'in01notin02',f1_line
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  while f2_line and f1_line > f2_line:
    print 'in02notin01',f2_line
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()
  while f1_line and not f2_line:
    print 'in01notin02',f1_line
    out1.write(f1_line+"\n")
    f1_line = in1.readline().strip()
  while f2_line and not f1_line:
    print 'in02notin01',f2_line
    out2.write(f2_line+"\n")
    f2_line = in2.readline().strip()

in1.close()
in2.close()
out1.close()
out2.close()

 
LVL 39

Accepted Solution

by:
Roger Baklund earned 250 total points
ID: 24754554
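Sorry, you should remove the print statements; they were only for debugging:
# filecompare.py
# Both input files must be sorted in the order that string comparison uses.

in1 = open('file01.txt')
in2 = open('file02.txt')
out1 = open('in01notin02.txt', 'w')
out2 = open('in02notin01.txt', 'w')

f1_line = in1.readline().strip()
f2_line = in2.readline().strip()

# An empty string marks end of file (a blank line would look the same)
while f1_line or f2_line:
    # Same ID in both files: skip it and advance both
    if f1_line == f2_line:
        f1_line = in1.readline().strip()
        f2_line = in2.readline().strip()
    # IDs smaller than the current file02 ID cannot occur in file02
    while f1_line and f1_line < f2_line:
        out1.write(f1_line + "\n")
        f1_line = in1.readline().strip()
    # IDs smaller than the current file01 ID cannot occur in file01
    while f2_line and f1_line > f2_line:
        out2.write(f2_line + "\n")
        f2_line = in2.readline().strip()
    # file02 is exhausted: the rest of file01 is missing from file02
    while f1_line and not f2_line:
        out1.write(f1_line + "\n")
        f1_line = in1.readline().strip()
    # file01 is exhausted: the rest of file02 is missing from file01
    while f2_line and not f1_line:
        out2.write(f2_line + "\n")
        f2_line = in2.readline().strip()

in1.close()
in2.close()
out1.close()
out2.close()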

 
LVL 17

Author Comment

by:Dushan911
ID: 24754880
Thanks! But this is giving greater-than and less-than values. I only want the fields that are in file01 but not in file02,
and
the fields that are in file02 but not in file01.

BR Dushan
 
LVL 39

Expert Comment

by:Roger Baklund
ID: 24754911
That is what my script is doing... did you try to run it?

It reads both files in parallel. While the value from one file is smaller than the value from the other file, that value cannot exist in the other file (because the files are sorted), so it is written to the output file. Then the next row from the input is checked, and so on. Just test it. :)
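The parallel read described here is a standard merge over two sorted inputs. A minimal in-memory sketch of the same idea follows, assuming both lists are already stripped of newlines and sorted in string order; `sorted_diff` is a hypothetical name, and the sample IDs are test values that appear in this thread.

```python
def sorted_diff(ids_one, ids_two):
    """Return (only_in_one, only_in_two) for two sorted lists of ID strings."""
    only_in_one, only_in_two = [], []
    i = j = 0
    while i < len(ids_one) and j < len(ids_two):
        a, b = ids_one[i], ids_two[j]
        if a == b:
            # Present in both lists: skip it
            i += 1
            j += 1
        elif a < b:
            # a is smaller than everything left in ids_two, so it is missing there
            only_in_one.append(a)
            i += 1
        else:
            # b is smaller than everything left in ids_one, so it is missing there
            only_in_two.append(b)
            j += 1
    # Whatever remains in either list has no counterpart in the other
    only_in_one.extend(ids_one[i:])
    only_in_two.extend(ids_two[j:])
    return only_in_one, only_in_two


file01 = ["6100100013", "6110010003", "6120010001",
          "6120010002", "6130040001", "6130070005"]
file02 = ["6120010001", "6120120001", "6130040001",
          "6130070001", "6130070005"]

in01notin02, in02notin01 = sorted_diff(file01, file02)
```

Each list is walked once, so the cost grows linearly with the number of records, which matters little at 150,000 lines but keeps the approach usable for much larger files if the lists are replaced by line-by-line reads.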
 
LVL 17

Author Comment

by:Dushan911
ID: 24755122
Thanks! These two files are sorted, but this script is giving values that are simply greater than the ones in the other file, which I don't want. I want only the missing IDs.

BR Dushan
 
LVL 17

Author Comment

by:Dushan911
ID: 24755141
Yes, I've executed this script, and the output contains IDs that are already in both files.
 
LVL 39

Expert Comment

by:Roger Baklund
ID: 24755715
>> I want only missing ids

That is what the script is supposed to do.

>> Yes I've executed this script and I got output ids which are already in both files..

That did not happen with my test files. Could you provide some example files? Preferably not with 150.000 rows, but two smaller files that fail. These are my test files and my results:
# file01.txt
6100100013
6110010003
6120010001
6120010002
6130040001
6130070005

# file02.txt
6120010001
6120120001
6130040001
6130070001
6130070005

# output files:

# in01notin02.txt
6100100013
6110010003
6120010002

# in02notin01.txt
6120120001
6130070001
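One possible explanation for the differing results, offered as an assumption that the thread does not confirm: the script compares IDs as strings, while the IDs quoted earlier in the thread have both 9 and 10 digits. A file sorted numerically is then not sorted in string order, and a merge-style comparison can report IDs that exist in both files. The hypothetical helper below, `first_out_of_order`, is a sketch for checking a file's ordering before running the comparison.

```python
def first_out_of_order(lines):
    """Return the index of the first line that sorts before its predecessor
    under string comparison, or None if the lines are in string order."""
    previous = None
    for index, raw in enumerate(lines):
        current = raw.strip()
        if previous is not None and current < previous:
            return index
        previous = current
    return None


# Numerically ascending, but the lengths differ (9 digits, then 10 digits)
numeric_order = ["630260002\n", "6100100013\n"]
# Same length throughout, so numeric order and string order agree
same_length = ["6100100013\n", "6110010003\n", "6120010001\n"]

problem_index = first_out_of_order(numeric_order)
clean_index = first_out_of_order(same_length)
```

If the check reports a position, comparing `int(line)` values instead of raw strings, or re-sorting both files as strings, would bring the sort order and the comparison back in line. Trailing spaces are another candidate cause, although `strip()` in the script already removes them.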

 
LVL 17

Author Comment

by:Dushan911
ID: 24760186
Hi All,
Thanks a lot for your kind help!
I found the following solution, which uses the powerful "Sets" class.

http://www.daniweb.com/code/snippet708.html

BR Dushan
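For reference, a set-based comparison can be sketched as follows. This is not the code from the linked snippet, whose contents are not reproduced in this thread; it is a minimal illustration using Python's built-in `set` type, and `unique_ids` and `compare_ids` are hypothetical names. Unlike the merge approach, it does not depend on the input being sorted.

```python
def unique_ids(lines):
    """Strip whitespace from each line and collect the non-empty IDs in a set."""
    return {line.strip() for line in lines if line.strip()}


def compare_ids(lines_one, lines_two):
    """Return (only_in_one, only_in_two) as sorted lists of ID strings."""
    ids_one = unique_ids(lines_one)
    ids_two = unique_ids(lines_two)
    # Set difference keeps the IDs that have no match in the other set
    return sorted(ids_one - ids_two), sorted(ids_two - ids_one)


# Deliberately unsorted sample data, with trailing newlines as read from a file
file01_lines = ["6120010002\n", "6100100013\n", "6120010001\n", "6130040001\n"]
file02_lines = ["6130040001\n", "6120120001\n", "6120010001\n", "\n"]

in01notin02, in02notin01 = compare_ids(file01_lines, file02_lines)
```

With real files, the two arguments can be open file objects, for example `with open("file01.txt") as f1, open("file02.txt") as f2: compare_ids(f1, f2)`. Both files are held in memory, which is unlikely to be a concern at around 150,000 short IDs per file.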