Link to home
Start Free TrialLog in
Avatar of Dushan Silva
Dushan SilvaFlag for Australia

asked on

two csv files to two dictionaries and compare

Hi Experts,
I have following three files.  
------------------------------------file01.txt---------------------------
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,Health News Daily
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
--------------------------------------------------------------------------

------------------------------------file02.txt----------------------------
000105d2,,,Health News Daily
00650260048_c,,,The Pink Sheet
000105d5,,,Health News Daily
05040390072,,,The Tan Sheet
000106d1,,,Health News Daily
000106d3,,,Health News Daily
000106d4,,,Health News Daily
000106d6,,,Health News Daily
--------------------------------------------------------------------------

-------------------------both.txt----------------------------------------
00650260048_c
05040390072
020108d4
02260160016_b
00630360023
000105d2
00650260048_c
000105d5
05040390072
000106d1
000106d3
000106d4
000106d6
--------------------------------------------------------------------------
1.) Load file01.txt into a dictionary, with the key being id (the first value before the comma) and the value being the entire line of the file.  
2.) Load file02.txt into a dictionary, with the key being the id (the first value before the comma) and the value being the entire line of the file.
3.) For all of the ids in both.txt, determine which ids(keys) have different values between file01.txt and file02.txt.

I want python script.
Avatar of Dushan Silva
Dushan Silva
Flag of Australia image

ASKER

import csv
import sys

f1 = open(sys.argv[1], 'rt')
f2 = open(sys.argv[2], 'rt')
f3 = open(sys.argv[3], 'rt')
#f1 = csv.reader(open(sys.argv[1]),delimiter=',')
h={}
try:
    reader1 = csv.reader(f1, delimiter=',')
    reader2 = csv.reader(f2, delimiter=',')
    reader3 = csv.reader(f3, delimiter=',')
    for row1 in reader1:
#       print row1[0]
#       print ','.join(row1[1:])
#       h[row1[0]]=','.join(row1[1:])
#       print h
            print row1[1:]
#       if row1[0]!= "":
            for row2 in reader2:
#               if row2[0]!= "":
                    if row2[0] in row1[0]:
#                       if row2[1:] in row1[1:]:
                           print row2[0]
#           print row2[1:]

finally:
    f1.close()
    f2.close()
    f3.close()
plz help me !
ASKER CERTIFIED SOLUTION
Avatar of HonorGod
HonorGod
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks lot HonorGod!!
That's really close to which I'm looking. And I need to compare ids in both.txt against file01.txt and file02.txt 's ids. Then if thoese ids avialable on file01.txt and file02.txt  then I should match the rest of the line part in those files. If you could help for solve that also it will be great! :)

BR Dushan
Can u plz explain following code part.

       inx = line[ 0 ];
        data = ','.join( line );
        result[ inx ] = data

Open in new window

and :)
      ( kind, val ) = sys.exc_info()[ :2 ];
      ( kind, val ) = str( kind ), str( val );
      print 'Error: ' + kind + '\nValue: ' + val;

Open in new window

Certainly

inx = line[ 0 ]


assigns the value of the 1st element of the list variable named line to the local variable named inx (for "index", but you don't want to use that as a variable name, since it occurs as a built-in string, and list method).

data = ','.join( line );

Takes the list variable named line, and converts it into a string with a comma as the delimiter between entries, and assigns this string result to the local variable named data.

result[ inx ] = data;

Uses the local variable named inx as an index, or key, into the dictionary variable named result, and assigns this the value currently in the local variable named data.
Since data isn't really used anywhere else, we could combine these two statements into 1:

result[ inx ] = ','.join( line );

This statement is a little more interesting:

( kind, val ) = sys.exc_info()[ :2 ];

To understand this, we need to know that sys.exc_info() is a thread safe way to access the following 3 exception state values.

- sys.exc_type  = Exception type
- sys.exc_value = Exception value
- sys.exc_traceback = Exception stack trace

The [ :2 ] after this function call is called slice notation that is used to specify that we (only) want the values up to, but not including [ 2 ] (i.e., the stack trace).  So this:

sys.exc_info()[ :2

returns a list of two values.

An assignment statement that looks like this:

( kind, val ) = sys.exc_info()[ :2 ];

takes the sequence on the right side of the assignment statement, and using sequence unpacking, assigns the values to the variables in the tuple on the left.
So the first value (i.e., sys.exc_info()[ 0 ] ) will be assigned (bound) to the local variable named "kind", and the second value (i.e., sys.exc_info()[ 1 ] ) will be assigned (bound) to the local variable named "val".

However, the data types of these values aren't strings.  It is much easier to deal with these values if they are strings.  Hence the statement:

( kind, val ) = str( kind ), str( val );

Logically what happens is this:
- the value in "kind" is converted to a string
- the value in "val" is converted to a string
- these two string values are combined (using tuple packing) into a tuple
- Using sequence unpacking the tuple values are bound (assigned) to the values on the left side of the assignment statement, so this single statement is equivalent to:
  kind = str( kind )
  val   = str( val )

The last statement:

print 'Error: ' + kind + '\nValue: ' + val;

Is used to display the exception type and value in a string using string concatenation because two strings may be concatenated using the + operator.

Does this help your understanding?
I went back to reread the sequence of updates, and don't quite understand for what you are asking in this update: http:#a24782282

Can you clarify it, and perhaps state it a different way?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thannk you very much HonorGod!
and it's really helpful to me about the code explaination, I think I should look more python commands from documentation.
on https://www.experts-exchange.com/questions/24545046/two-csv-files-to-two-dictionaries-and-compare.html?anchorAnswerId=24782282#a24782282
I tired to explain below point. :)
3.) For all of the ids in both.txt, determine which ids(keys) have different values between file01.txt and file02.txt.
Thanks t0t0! but I'm runing on this code on unix.
I tried below code, but it's not giving correct results :) .. sorry for bothering u HonorGod.
import csv, sys, os.path
 
def csvToDict( filename ):
  result = {};
  if os.path.exists( filename ) :
    try :
      handle = open( filename );
      for line in csv.reader( handle, delimiter=',' ) :
        inx = line[ 0 ];
        data = ','.join( line );
        result[ inx ] = data
      handle.close();
    except :
      ( kind, val ) = sys.exc_info()[ :2 ];
      ( kind, val ) = str( kind ), str( val );
      print 'Error: ' + kind + '\nValue: ' + val;
  return result;
 
def CSVcompare( file1, file2, file3 ):
  print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )
  d1 = csvToDict( file1 );
  d2 = csvToDict( file2 );
  d3 = csvToDict( file3 );
  n1 = d1.keys();
  n1.sort();
  for n in n1[ : ] :
    if d2.has_key( n ) :                        #compare keys
     if d3.has_key( n ) :
      if d2[ n ] == d3[ n ] :                   #compare values
#       print d2[n];
#        print d3[n];
        del( d2[ n ] );
        del( d3[ n ] );
      else :
        print d2[ n ];
     else :
      print d3[ n ];
    else :
      print d2[ n ];
 
#  print '-' * 50
#  n2 = d2.keys();
#  n2.sort();
#  for n in n2[ : ] :
#    if d1.has_key( n ) :
#      if d1[ n ] != d2[ n ] :
#        print d2[ n ];
#    else :
#      print d2[ n ];
 
def Usage( cmd = 'CSVcompare' ):
  print 'Uage: %s.py both.csv file1.csv file2.csv'
  sys.exit()
 
if __name__ == '__main__' or __name__ == 'main' :
  if len( sys.argv ) == 4 :
    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ], sys.argv[ 3 ]);
  else :
    Usage();
else :
  Usage( __name__ );

Open in new window

No probs... it's an interesting question which is why I gqve it a shot.
What data are you using for input, and what problems are you seeing?
I'm using imput texgt file as input. Problem is first I should check ids in both.txt, against file01.txt and file02.txt, if that id avialable on file01.txt and file02.txt , then should compare that perticular id's perticular field's rest of the line part (that is in file01.txt and file02.txt)

Example : steps
1.) we should read values in both.txt
2.) the first value is 00650260048_c
3.) Then should check that perticular value is available on file01.txt and file02.txt ' on first field. (yes on file01.txt it's 1st line and on file02.txt it's 2nd value), if it not available we should print that ids
4.) If it avilable on file01.txt and file02.txt, then we could skip it, if it not avialable , then we should print it.
5.) If that id avialable on both file01.txt and file02.txt , then we should match that perticular line's other part, if it matching we should skip it, if it not matching we should print it.
(yes it's matching and should skip  "  ,,,The Pink Sheet "    on file01.txt 1st line and file02.txt 2nd line)
FOR DOS USERS

Apologies for coming back to you on this one. Quite frankly, I'm surprised you haven't received a solution to this problem.

In DOS, this would be done as follows (modified to output just ID if lines match):


@echo off
(for /f "tokens=1 delims=," %%a in ('type file01.txt') do (
   echo %%a
))>both.txt

(for /f "tokens=1 delims=," %%a in ('type file02.txt') do (
   findstr /i "%%a" both.txt>nul
   if ERRORLEVEL 1 (
      echo %%a
   )
))>>both.txt

for /f %%a in ('type both.txt') do (
   for /f "tokens=*" %%b in ('findstr /i /b "%%a" file01.txt') do (
      if ERRORLEVEL 0 (
         for /f "tokens=*" %%c in ('findstr /i /b "%%a" file02.txt') do (
            if ERRORLEVEL 0 (
               if not "%%b"=="%%c" (
                  echo File01.txt: %%b
                  echo File02.txt: %%c
               ) else (
                  echo %%a
               )
            )
         )
      )
   )
)
ok, I went back, and re-read the question.

Please forgive me for my misunderstanding.
The code, as written will:

- read file1 into a dictionary (D1), indexed by the first csv value.  so far, so good.
- read file2 into a dictionary (D2), indexed by the first csv value.  so far, so good.
- "compare" the two dicationaries.
  For each key in dictionary #1 (D1):
  - if the key exists in both dictionaries, and the values agree, we're fine, ignore the value (i.e., by removing them)
  - if the key exists in both dictionaries, but the values differ, display the value from D1 (mismatch)
  - if the key doesn't exist in D2, display the value because it's unique to D1.
  For each key in dictionary #2 (D2):
  - if the key exists in both dictionaries, but the values differ, display the value from D2 (mismatch)
  - if the key doesn't exist in D1, display the value because it's unique to D2.
 
So, what is the problem?
- is it the fact that the comparison will display the mismatch entries from both dictionaries?
- or, it it the fact that values unique to each dictionary will be displayed?

My presumption is that you don't want the second.  In which case,  the comparison script should be written something like this:

- Sample output:
----------------------------------------------------------------------
C:\Programs\Python\csv>type file01.txt
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,Health News Daily
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
----------------------------------------------------------------------
C:\Programs\Python\csv>type file02.txt
000105d2,,,Health News Daily
00650260048_c,,,The Pink Sheet
000105d5,,,Health News Daily
05040390072,,,The Tan Sheet
000106d1,,,Health News Daily
000106d3,,,Health News Daily
000106d4,,,Health News Daily
000106d6,,,Health News Daily
----------------------------------------------------------------------
C:\Programs\Python\csv>CSVcompare.py file01.txt file02.txt
----------------------------------------------------------------------

C:\Programs\Python\csv>type file01.txt
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,Health News Daily
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
C:\Programs\Python\csv>type file03.txt
00650260048_c,,,The Pink Sheet
05040390072,,,The Tan Sheet
020108d4,,,<Different value>
02260160016_b,,,The Rose Sheet
00630360023,relatedDocs,00630220023,The Pink Sheet
C:\Programs\Python\csv>CSVcompare.py file01.txt file03.txt
020108d4,,,Health News Daily
020108d4,,,<Different value>

C:\Programs\Python\csv>


Is that what you wanted?
import csv, sys, os.path
 
def csvToDict( filename ):
  result = {};
# print 'filename: ' + filename;
  if os.path.exists( filename ) :
    try :
      handle = open( filename );
      for line in csv.reader( handle, delimiter=',' ) :
        inx = line[ 0 ];
        data = ','.join( line );
#       print '["%s"] = "%s"' % ( inx, data );
        result[ inx ] = data
      handle.close();
    except :
      ( kind, val ) = sys.exc_info()[ :2 ];
      ( kind, val ) = str( kind ), str( val );
      print 'Error: ' + kind + '\nValue: ' + val;
  return result;
 
def CSVcompare( file1, file2 ):
# print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )
  d1 = csvToDict( file1 );
  d2 = csvToDict( file2 );
  n1 = d1.keys();
  n1.sort();
  for n in n1[ : ] :
    if d2.has_key( n ) :
      if d1[ n ] == d2[ n ] :
        del( d1[ n ] );
        del( d2[ n ] );
      else :
        print d1[ n ];
    else :
#      print d1[ n ];
       pass;
# print '-' * 50
  n2 = d2.keys();
  n2.sort();
  for n in n2[ : ] :
    if d1.has_key( n ) :
      if d1[ n ] != d2[ n ] :
        print d2[ n ];
    else :
#     print d2[ n ];
      pass;
 
 
def Usage( cmd = 'CSVcompare' ):
  print 'Uage: %s.py file1.csv file2.csv'
  sys.exit()
 
if __name__ == '__main__' or __name__ == 'main' :
  if len( sys.argv ) == 3 :
    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ] );
  else :
    Usage();
else :
  Usage( __name__ );

Open in new window

>> "is it the fact that the comparison will display the mismatch entries from both dictionaries?"
>> "or, it it the fact that values unique to each dictionary will be displayed?"

BOTH! - Neither of these are part of the asker's requirements.
Hi HonorGod and t0t0,

Thanks lot for your really great effort!
Finally I cameup following code with looking at your  code :)

Thanks so much!!!!!!!!!
import csv, sys, os.path
 
def csvToDict( filename ):
  result = {};                 
  if os.path.exists( filename ) :
    try :
      handle = open( filename );
      for line in csv.reader( handle, delimiter=',' ) :
 
        inx = line[ 0 ];
        data = ','.join( line );
        result[ inx ] = data
      handle.close();
    except :
      ( kind, val ) = sys.exc_info()[ :2 ];
      ( kind, val ) = str( kind ), str( val );
      print 'Error: ' + kind + '\nValue: ' + val;
  return result;
 
def CSVcompare( file1, file2, file3 ):
  print 'file1: "%s"\nfile2: "%s"' % ( file1, file2 )
  d1 = csvToDict( file1 );
  d2 = csvToDict( file2 );
  d3 = csvToDict( file3 );
  n1 = d1.keys();
  n1.sort();
  for n in n1:
    if n in d1 and n in d2:                                         
      if d2[n] != d3[n]:                                            
        print "values differ, ", n, "---", d2[n], "---", d3[n]
 
  print '-' * 50
 
 
def Usage( cmd = 'CSVcompare' ):
  print 'Uage: %s.py both.csv file1.csv file2.csv'
  sys.exit()
 
if __name__ == '__main__' or __name__ == 'main' :
  if len( sys.argv ) == 4 :
    CSVcompare( sys.argv[ 1 ], sys.argv[ 2 ], sys.argv[ 3 ]);
  else :
    Usage();
else :
  Usage( __name__ );

Open in new window

Thanks for the grade & points.

Good luck & have a great day.