Solved

two csv files to two dictionaries and compare

Posted on 2009-07-05
20
1,185 Views
Last Modified: 2013-11-13
Hi Experts,
I have following three files.  
------------------------------------file01.txt---------------------------
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,Health News Daily
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
--------------------------------------------------------------------------

------------------------------------file02.txt----------------------------
000105d2,,,Health News Daily
00650260048_c,,,The Pink Sheet
000105d5,,,Health News Daily
05040390072,,,The Tan Sheet
000106d1,,,Health News Daily
000106d3,,,Health News Daily
000106d4,,,Health News Daily
000106d6,,,Health News Daily
--------------------------------------------------------------------------

-------------------------both.txt----------------------------------------
00650260048_c
05040390072
020108d4
02260160016_b
00630360023
000105d2
00650260048_c
000105d5
05040390072
000106d1
000106d3
000106d4
000106d6
--------------------------------------------------------------------------
1.) Load file01.txt into a dictionary, with the key being id (the first value before the comma) and the value being the entire line of the file.  
2.) Load file02.txt into a dictionary, with the key being the id (the first value before the comma) and the value being the entire line of the file.
3.) For all of the ids in both.txt, determine which ids(keys) have different values between file01.txt and file02.txt.

I want python script.
0
Comment
Question by:Dushan911
  • 10
  • 6
  • 4
20 Comments
 
LVL 17

Author Comment

by:Dushan911
ID: 24780966
import csv
import sys

f1 = open(sys.argv[1], 'rt')
f2 = open(sys.argv[2], 'rt')
f3 = open(sys.argv[3], 'rt')
#f1 = csv.reader(open(sys.argv[1]),delimiter=',')
h={}
try:
    reader1 = csv.reader(f1, delimiter=',')
    reader2 = csv.reader(f2, delimiter=',')
    reader3 = csv.reader(f3, delimiter=',')
    for row1 in reader1:
#       print row1[0]
#       print ','.join(row1[1:])
#       h[row1[0]]=','.join(row1[1:])
#       print h
            print row1[1:]
#       if row1[0]!= "":
            for row2 in reader2:
#               if row2[0]!= "":
                    if row2[0] in row1[0]:
#                       if row2[1:] in row1[1:]:
                           print row2[0]
#           print row2[1:]

finally:
    f1.close()
    f2.close()
    f3.close()
0
 
LVL 17

Author Comment

by:Dushan911
ID: 24780967
plz help me !
0
 
LVL 41

Accepted Solution

by:
HonorGod earned 100 total points
ID: 24781027
Something like this perhaps?
import csv, sys, os.path
 

def csvToDict( filename ):

  result = {};

  if os.path.exists( filename ) :

    try :

      handle = open( filename );

      for line in csv.reader( handle, delimiter=',' ) :

        inx = line[ 0 ];

        data = ','.join( line );

        result[ inx ] = data

      handle.close();

    except :

      ( kind, val ) = sys.exc_info()[ :2 ];

      ( kind, val ) = str( kind ), str( val );

      print 'Error: ' + kind + '\nValue: ' + val;

  return result;
 

def CSVcompare( file1, file2 ):

  print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )

  d1 = csvToDict( file1 );

  d2 = csvToDict( file2 );

  n1 = d1.keys();

  n1.sort();

  for n in n1[ : ] :

    if d2.has_key( n ) :

      if d1[ n ] == d2[ n ] :

        del( d1[ n ] );

        del( d2[ n ] );

      else :

        print d1[ n ];

    else :

      print d1[ n ];

  print '-' * 50

  n2 = d2.keys();

  n2.sort();

  for n in n2[ : ] :

    if d1.has_key( n ) :

      if d1[ n ] != d2[ n ] :

        print d2[ n ];

    else :

      print d2[ n ];
 
 

def Usage( cmd = 'CSVcompare' ):

  print 'Uage: %s.py file1.csv file2.csv'

  sys.exit()
 

if __name__ == '__main__' or __name__ == 'main' :

  if len( sys.argv ) == 3 :

    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ] );

  else :

    Usage();

else :

  Usage( __name__ );

Open in new window

0
 
LVL 17

Author Comment

by:Dushan911
ID: 24782282
Thanks lot HonorGod!!
That's really close to which I'm looking. And I need to compare ids in both.txt against file01.txt and file02.txt 's ids. Then if thoese ids avialable on file01.txt and file02.txt  then I should match the rest of the line part in those files. If you could help for solve that also it will be great! :)

BR Dushan
0
 
LVL 17

Author Comment

by:Dushan911
ID: 24782769
Can u plz explain following code part.

       inx = line[ 0 ];

        data = ','.join( line );

        result[ inx ] = data

Open in new window

0
 
LVL 17

Author Comment

by:Dushan911
ID: 24782778
and :)
      ( kind, val ) = sys.exc_info()[ :2 ];

      ( kind, val ) = str( kind ), str( val );

      print 'Error: ' + kind + '\nValue: ' + val;

Open in new window

0
 
LVL 41

Expert Comment

by:HonorGod
ID: 24785962
Certainly

inx = line[ 0 ]


assigns the value of the 1st element of the list variable named line to the local variable named inx (for "index", but you don't want to use that as a variable name, since it occurs as a built-in string, and list method).

data = ','.join( line );

Takes the list variable named line, and converts it into a string with a comma as the delimiter between entries, and assigns this string result to the local variable named data.

result[ inx ] = data;

Uses the local variable named inx as an index, or key, into the dictionary variable named result, and assigns this the value currently in the local variable named data.
Since data isn't really used anywhere else, we could combine these two statements into 1:

result[ inx ] = ','.join( line );

This statement is a little more interesting:

( kind, val ) = sys.exc_info()[ :2 ];

To understand this, we need to know that sys.exc_info() is a thread safe way to access the following 3 exception state values.

- sys.exc_type  = Exception type
- sys.exc_value = Exception value
- sys.exc_traceback = Exception stack trace

The [ :2 ] after this function call is called slice notation that is used to specify that we (only) want the values up to, but not including [ 2 ] (i.e., the stack trace).  So this:

sys.exc_info()[ :2

returns a list of two values.

An assignment statement that looks like this:

( kind, val ) = sys.exc_info()[ :2 ];

takes the sequence on the right side of the assignment statement, and using sequence unpacking, assigns the values to the variables in the tuple on the left.
So the first value (i.e., sys.exc_info()[ 0 ] ) will be assigned (bound) to the local variable named "kind", and the second value (i.e., sys.exc_info()[ 1 ] ) will be assigned (bound) to the local variable named "val".

However, the data types of these values aren't strings.  It is much easier to deal with these values if they are strings.  Hence the statement:

( kind, val ) = str( kind ), str( val );

Logically what happens is this:
- the value in "kind" is converted to a string
- the value in "val" is converted to a string
- these two string values are combined (using tuple packing) into a tuple
- Using sequence unpacking the tuple values are bound (assigned) to the values on the left side of the assignment statement, so this single statement is equivalent to:
  kind = str( kind )
  val   = str( val )

The last statement:

print 'Error: ' + kind + '\nValue: ' + val;

Is used to display the exception type and value in a string using string concatenation because two strings may be concatenated using the + operator.

Does this help your understanding?
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 24785984
I went back to reread the sequence of updates, and don't quite understand for what you are asking in this update: http:#a24782282

Can you clarify it, and perhaps state it a different way?
0
 
LVL 16

Assisted Solution

by:t0t0
t0t0 earned 25 total points
ID: 24790877
I can do it using DOS batch files.

The batch file first creates it's own BOTH.TXT based on the contents of FILE01.TXT and FILE02.TXT. The resultant BOTH.TXT differs from yours in that it does not contain duplicate entries.

The first FOR loop get data from FILE01.TXT into BOTH.TXT.

The second FOR loop gets data from FILE02.TXT into BOTH.TXT providing that data does not already exist in BOTH.TXT.

The third FOR loop is the actual loop which processes BOTH.TXT. It compares each entry against FILE01.TXT and FILE02.TXT and if the entry is found in both files, it compares the whole lines for any differences. If there are differences, both lines are printed to the screen.

@echo off
(for /f "tokens=1 delims=," %%a in ('type file01.txt') do (
   echo %%a
))>both.txt

(for /f "tokens=1 delims=," %%a in ('type file02.txt') do (
   findstr /i "%%a" both.txt>nul
   if ERRORLEVEL 1 (
      echo %%a
   )
))>>both.txt

for /f %%a in ('type both.txt') do (
   for /f "tokens=*" %%b in ('findstr /i /b "%%a" file01.txt') do (
      if ERRORLEVEL 0 (
         for /f "tokens=*" %%c in ('findstr /i /b "%%a" file02.txt') do (
            if ERRORLEVEL 0 (
               if not "%%b"=="%%c" (
                  echo File01.txt: %%b
                  echo File02.txt: %%c
               )
            )
         )
      )
   )
)
0
 
LVL 17

Author Comment

by:Dushan911
ID: 24793058
Thannk you very much HonorGod!
and it's really helpful to me about the code explaination, I think I should look more python commands from documentation.
on http://www.experts-exchange.com/Programming/Languages/Scripting/Python/Q_24545046.html#a24782282
I tired to explain below point. :)
3.) For all of the ids in both.txt, determine which ids(keys) have different values between file01.txt and file02.txt.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 17

Author Comment

by:Dushan911
ID: 24793070
Thanks t0t0! but I'm runing on this code on unix.
0
 
LVL 17

Author Comment

by:Dushan911
ID: 24793192
I tried below code, but it's not giving correct results :) .. sorry for bothering u HonorGod.
import csv, sys, os.path
 

def csvToDict( filename ):

  result = {};

  if os.path.exists( filename ) :

    try :

      handle = open( filename );

      for line in csv.reader( handle, delimiter=',' ) :

        inx = line[ 0 ];

        data = ','.join( line );

        result[ inx ] = data

      handle.close();

    except :

      ( kind, val ) = sys.exc_info()[ :2 ];

      ( kind, val ) = str( kind ), str( val );

      print 'Error: ' + kind + '\nValue: ' + val;

  return result;
 

def CSVcompare( file1, file2, file3 ):

  print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )

  d1 = csvToDict( file1 );

  d2 = csvToDict( file2 );

  d3 = csvToDict( file3 );

  n1 = d1.keys();

  n1.sort();

  for n in n1[ : ] :

    if d2.has_key( n ) :                        #compare keys

     if d3.has_key( n ) :

      if d2[ n ] == d3[ n ] :                   #compare values

#       print d2[n];

#        print d3[n];

        del( d2[ n ] );

        del( d3[ n ] );

      else :

        print d2[ n ];

     else :

      print d3[ n ];

    else :

      print d2[ n ];
 

#  print '-' * 50

#  n2 = d2.keys();

#  n2.sort();

#  for n in n2[ : ] :

#    if d1.has_key( n ) :

#      if d1[ n ] != d2[ n ] :

#        print d2[ n ];

#    else :

#      print d2[ n ];
 

def Usage( cmd = 'CSVcompare' ):

  print 'Uage: %s.py both.csv file1.csv file2.csv'

  sys.exit()
 

if __name__ == '__main__' or __name__ == 'main' :

  if len( sys.argv ) == 4 :

    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ], sys.argv[ 3 ]);

  else :

    Usage();

else :

  Usage( __name__ );

Open in new window

0
 
LVL 16

Expert Comment

by:t0t0
ID: 24793525
No probs... it's an interesting question which is why I gqve it a shot.
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 24793548
What data are you using for input, and what problems are you seeing?
0
 
LVL 17

Author Comment

by:Dushan911
ID: 24802211
I'm using imput texgt file as input. Problem is first I should check ids in both.txt, against file01.txt and file02.txt, if that id avialable on file01.txt and file02.txt , then should compare that perticular id's perticular field's rest of the line part (that is in file01.txt and file02.txt)

Example : steps
1.) we should read values in both.txt
2.) the first value is 00650260048_c
3.) Then should check that perticular value is available on file01.txt and file02.txt ' on first field. (yes on file01.txt it's 1st line and on file02.txt it's 2nd value), if it not available we should print that ids
4.) If it avilable on file01.txt and file02.txt, then we could skip it, if it not avialable , then we should print it.
5.) If that id avialable on both file01.txt and file02.txt , then we should match that perticular line's other part, if it matching we should skip it, if it not matching we should print it.
(yes it's matching and should skip  "  ,,,The Pink Sheet "    on file01.txt 1st line and file02.txt 2nd line)
0
 
LVL 16

Expert Comment

by:t0t0
ID: 24802658
FOR DOS USERS

Apologies for coming back to you on this one. Quite frankly, I'm surprised you haven't received a solution to this problem.

In DOS, this would be done as follows (modified to output just ID if lines match):


@echo off
(for /f "tokens=1 delims=," %%a in ('type file01.txt') do (
   echo %%a
))>both.txt

(for /f "tokens=1 delims=," %%a in ('type file02.txt') do (
   findstr /i "%%a" both.txt>nul
   if ERRORLEVEL 1 (
      echo %%a
   )
))>>both.txt

for /f %%a in ('type both.txt') do (
   for /f "tokens=*" %%b in ('findstr /i /b "%%a" file01.txt') do (
      if ERRORLEVEL 0 (
         for /f "tokens=*" %%c in ('findstr /i /b "%%a" file02.txt') do (
            if ERRORLEVEL 0 (
               if not "%%b"=="%%c" (
                  echo File01.txt: %%b
                  echo File02.txt: %%c
               ) else (
                  echo %%a
               )
            )
         )
      )
   )
)
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 24803754
ok, I went back, and re-read the question.

Please forgive me for my misunderstanding.
The code, as written will:

- read file1 into a dictionary (D1), indexed by the first csv value.  so far, so good.
- read file2 into a dictionary (D2), indexed by the first csv value.  so far, so good.
- "compare" the two dicationaries.
  For each key in dictionary #1 (D1):
  - if the key exists in both dictionaries, and the values agree, we're fine, ignore the value (i.e., by removing them)
  - if the key exists in both dictionaries, but the values differ, display the value from D1 (mismatch)
  - if the key doesn't exist in D2, display the value because it's unique to D1.
  For each key in dictionary #2 (D2):
  - if the key exists in both dictionaries, but the values differ, display the value from D2 (mismatch)
  - if the key doesn't exist in D1, display the value because it's unique to D2.
 
So, what is the problem?
- is it the fact that the comparison will display the mismatch entries from both dictionaries?
- or, it it the fact that values unique to each dictionary will be displayed?

My presumption is that you don't want the second.  In which case,  the comparison script should be written something like this:

- Sample output:
----------------------------------------------------------------------
C:\Programs\Python\csv>type file01.txt
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,Health News Daily
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
----------------------------------------------------------------------
C:\Programs\Python\csv>type file02.txt
000105d2,,,Health News Daily
00650260048_c,,,The Pink Sheet
000105d5,,,Health News Daily
05040390072,,,The Tan Sheet
000106d1,,,Health News Daily
000106d3,,,Health News Daily
000106d4,,,Health News Daily
000106d6,,,Health News Daily
----------------------------------------------------------------------
C:\Programs\Python\csv>CSVcompare.py file01.txt file02.txt
----------------------------------------------------------------------

C:\Programs\Python\csv>type file01.txt
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,Health News Daily
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
C:\Programs\Python\csv>type file03.txt
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,<Different value>
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
C:\Programs\Python\csv>CSVcompare.py file01.txt file03.txt
020108d4,,,Health News Daily
020108d4,,,<Different value>

C:\Programs\Python\csv>


Is that what you wanted?
import csv, sys, os.path
 

def csvToDict( filename ):

  result = {};

# print 'filename: ' + filename;

  if os.path.exists( filename ) :

    try :

      handle = open( filename );

      for line in csv.reader( handle, delimiter=',' ) :

        inx = line[ 0 ];

        data = ','.join( line );

#       print '["%s"] = "%s"' % ( inx, data );

        result[ inx ] = data

      handle.close();

    except :

      ( kind, val ) = sys.exc_info()[ :2 ];

      ( kind, val ) = str( kind ), str( val );

      print 'Error: ' + kind + '\nValue: ' + val;

  return result;
 

def CSVcompare( file1, file2 ):

# print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )

  d1 = csvToDict( file1 );

  d2 = csvToDict( file2 );

  n1 = d1.keys();

  n1.sort();

  for n in n1[ : ] :

    if d2.has_key( n ) :

      if d1[ n ] == d2[ n ] :

        del( d1[ n ] );

        del( d2[ n ] );

      else :

        print d1[ n ];

    else :

#      print d1[ n ];

       pass;

# print '-' * 50

  n2 = d2.keys();

  n2.sort();

  for n in n2[ : ] :

    if d1.has_key( n ) :

      if d1[ n ] != d2[ n ] :

        print d2[ n ];

    else :

#     print d2[ n ];

      pass;
 
 

def Usage( cmd = 'CSVcompare' ):

  print 'Uage: %s.py file1.csv file2.csv'

  sys.exit()
 

if __name__ == '__main__' or __name__ == 'main' :

  if len( sys.argv ) == 3 :

    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ] );

  else :

    Usage();

else :

  Usage( __name__ );

Open in new window

0
 
LVL 16

Expert Comment

by:t0t0
ID: 24805673
>> "is it the fact that the comparison will display the mismatch entries from both dictionaries?"
>> "or, it it the fact that values unique to each dictionary will be displayed?"

BOTH! - Neither of these are part of the asker's requirements.
0
 
LVL 17

Author Comment

by:Dushan911
ID: 24811222
Hi HonorGod and t0t0,

Thanks lot for your really great effort!
Finally I cameup following code with looking at your  code :)

Thanks so much!!!!!!!!!
import csv, sys, os.path
 

def csvToDict( filename ):

  result = {};                 

  if os.path.exists( filename ) :

    try :

      handle = open( filename );

      for line in csv.reader( handle, delimiter=',' ) :
 

        inx = line[ 0 ];

        data = ','.join( line );

        result[ inx ] = data

      handle.close();

    except :

      ( kind, val ) = sys.exc_info()[ :2 ];

      ( kind, val ) = str( kind ), str( val );

      print 'Error: ' + kind + '\nValue: ' + val;

  return result;
 

def CSVcompare( file1, file2, file3 ):

  print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )

  d1 = csvToDict( file1 );

  d2 = csvToDict( file2 );

  d3 = csvToDict( file3 );

  n1 = d1.keys();

  n1.sort();

  for n in n1:

    if n in d1 and n in d2:                                         

      if d2[n] != d3[n]:                                            

        print "values differ, ", n, "---", d2[n], "---", d3[n]
 

  print '-' * 50
 
 

def Usage( cmd = 'CSVcompare' ):

  print 'Uage: %s.py both.csv file1.csv file2.csv'

  sys.exit()
 

if __name__ == '__main__' or __name__ == 'main' :

  if len( sys.argv ) == 4 :

    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ], sys.argv[ 3 ]);

  else :

    Usage();

else :

  Usage( __name__ );

Open in new window

0
 
LVL 41

Expert Comment

by:HonorGod
ID: 24812750
Thanks for the grade & points.

Good luck & have a great day.
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

This article is meant to give a basic understanding of how to use R Sweave as a way to merge LaTeX and R code seamlessly into one presentable document.
Does the idea of dealing with bits scare or confuse you? Does it seem like a waste of time in an age where we all have terabytes of storage? If so, you're missing out on one of the core tools in every professional programmer's toolbox. Learn how to …
Learn the basics of lists in Python. Lists, as their name suggests, are a means for ordering and storing values. : Lists are declared using brackets; for example: t = [1, 2, 3]: Lists may contain a mix of data types; for example: t = ['string', 1, T…
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now