Link to home
Start Free TrialLog in
Avatar of mtnseeker
mtnseeker

asked on

compare csv on exact columns

this is a fairly common question I know from searching; however, none of what I found really seemed to be quite what I wanted. I've looked at trying powershell, but have never messed with it; looking for the best and quickest solution. This is something I'd like to automate.

csv1: need to look at column 3
csv2: need to look at column 2

columns 3 and 2 should match; if there are differences, such as an extra I need it outputted to a file
Avatar of soostibi
soostibi
Flag of Hungary image

Could you make it a bit clearer?

CSV1:

Col1, Col2, Col3
a,a,a
b,b,b
c,c,c

CSV2:
Col1, Col2
x,a
y,b
z,d

What would you like to see in the output?

c,c,c?
z,d?
Both?

Do I get the problem right?
Avatar of mtnseeker
mtnseeker

ASKER

csv1:
data, data, AccountNumber, data, data

csv2:
data, AccountNumber, data, data, data, data

output:
Just the AccoutNumber that does not exist in one of the two CSV's. They should both be in there, but there may be an account number that is found in one and not the other.
Check this:
$csv1 = Import-Csv C:\ee\CSV1.csv -UseCulture
$csv2 = Import-Csv C:\ee\CSV2.csv -UseCulture
Compare-Object $csv1 $csv2 -Property Accountnumber | Select-Object -Property accountnumber

Open in new window

@aikimark: yes... unfortunately I didn't tag enough zones. My question was meant to be more general than specific to powershell (I was guessing it was the best solution). I am also looking at any utilities (freeware) that may do the job as well.
Avatar of aikimark
my approach would be to run two queries against an outer-joined view of the two tables.  Let the database engine do the heavy lifting.
* left join where the right table value is null
* right join where the left table value is null

====
If the column data is unique, you could do the following:
In a VBScript environment, I would populate two "scripting.dictionary" objects with the contents of the two columns.  Iterate through each of the objects once, invoking the Exists() method on the other dictionary object.

If the column data isn't unique, you could invoke that dictionary object's Exists() method before adding the column data.

If this is a simple CSV file, your parsing of the data should be straight forward.
Will the values in the columns to compare be in the same order, or do we essentially have to search the whole file looking for a row from the first file?

~bp
@bp

That's the advantage of using a hashed lookup (dictionary)
==> aikimark

Yes, after i posted my question I saw your post and we are thinking along similar lines...

~bp
Before you guys invent the wheel, please check my PowerShell solution, please.
>>check my PowerShell solution

@mtnseeker

* Have you tried the posted PS code?
* Do the CSV files have a header row?
the csv files do not have a header row so that code didn't run...

also with this powershell solution how would I output the results to a third csv or other text file?
@mtnseeker

please post some sample rows from your two CSV files.  You should obscure the actual data if it is sensitive.  However, it should be an accurate representation of the data that is in the actual CSV files.
see the attachment... I am inclined to go with powershell on this since it seems lightweight and I don't want to have to build an entire application...just haven't used it before
 example1.csv
 example2.csv
@soostibi

1. you need to use the -Header parameter when doing your CSV-Import function
2. you need to add something, perhaps a -differenceobject parameter, that will compare the imported data for differences.  Even then, I think you will need two of such comparisons, swapping the left/right order of the CSV objects.
Yes, I need the header, but I do not need any other thing.

The output:

accountnumber                                                                                                                          
-------------                                                                                                                          
70002300                                                                                                                                
40006700                                                                                                                                
10011400                                                                                                                                
10022800  
PS C:\Windows\system32> $csv1= Import-Csv C:\ee\example1.csv -Header (("c1","c2","accountnumber") + (1..6 | %{"d$_"}))
$csv2= Import-Csv C:\ee\example2.csv -Header (("e1","accountnumber") + (1..11 | %{"f$_"}))

Compare-Object $csv1 $csv2 -Property accountnumber | Select-Object -Property accountnumber

Open in new window

And you can also export the result by this:
Compare-Object $csv1 $csv2 -Property accountnumber | Select-Object -Property accountnumber | Export-Csv -notypeinformation -path c:\result.csv

Open in new window

@soostibi

the problem, as stated, is to find the differences between the two columns of accountnumber values
here's a VBS solution that should do the job.  Save as a VBS and adjust the file paths at the top as needed.

Option Explicit

' Define Needed constants
Const cForReading = 1
Const cForWriting = 2
Const cInFile1 = "X:\EE\EE26910135\example1.csv"
Const cInFile2 = "X:\EE\EE26910135\example2.csv"
Const cOutFile = "X:\EE\EE26910135\missing.csv"
Const cFieldNum1 = 3
Const cFieldNum2 = 2

' Define needed variables
Dim oFSO, oOutFile, dCount, sLine

' Create file system object
Set oFSO = CreateObject("Scripting.FileSystemObject")

' Create dictionary to accumulate counts in
Set dCount = CreateObject("Scripting.Dictionary")

' Process two incoming CSV files
LoadFile cInFile1, cFieldNum1
LoadFile cInFile2, cFieldNum2

' Open output file, print header lines
Set oOutFile = oFSO.OpenTextFile(cOutFile, cForWriting, True)
oOutFile.WriteLine "AccountNumber"
oOutFile.WriteLine "-------------"

' Process each entry in counts dictionary, and output count
For Each sLine In dCount
    If dCount.Item(sLine) = 1 oOutFile.WriteLine sLine
Next

' Close output file
oOutFile.Close

Sub LoadFile (sFile, iField)
    Dim oFile, sLine, aLine, sKey

    ' Open file for reading
    Set oFile = oFSO.OpenTextFile(sFile, cForReading)

    ' Read input lines, accumulate counts in dictionary for this account
    do Until oFile.AtEndOfStream
        sLine = oFile.Readline
        aLine = Split(sLine, ",")
        sKey = aLine(iField-1)
        If dCount.Exists(sKey) Then
            dCount.Item(sKey) = dCount.Item(sKey) + 1
        Else
            dCount.Add sKey, 1
        End If
    Loop
    oFile.Close
End Sub

Open in new window

~bp
I do not exactly see what output you exactly expect, but here is another try:

accountnumber                                                        file                                                              
-------------                                                        ----                                                              
70002300                                                             example2.csv                                                      
40006700                                                             example2.csv                                                      
10011400                                                             example1.csv                                                      
10022800                                                             example1.csv    
$file1 = "C:\ee\example1.csv"
$file2 = "C:\ee\example2.csv"

$csv1= Import-Csv -Path $file1 -Header (("c1","c2","accountnumber") + (1..6 | %{"d$_"}))
$csv2= Import-Csv -path $file2 -Header (("e1","accountnumber") + (1..11 | %{"f$_"}))

Compare-Object $csv1 $csv2 -Property accountnumber | select-object -Property accountnumber, @{n="file";e={
    if($_.sideindicator -eq "=>"){split-path $file2 -Leaf}else{split-path $file1 -Leaf}
}}

Open in new window

@soostibi

edit one or both CSV files, adding values from the other list and reordering the list values.  What mtnseeker needs is to ONLY see those values that are not both lists.
This exactly what my list is: those values that are not in both list.
My comment at 'Date:25/03/11 12:16 PM ' exactly resulted in 'ONLY those values that are not both lists'. What was the problem with that then? I just do not understand...
In my test of you PS script and my reordered and partially-same values, the same values were output.
Looking at the VBS Script from @billprew..

This doesn't seem to compile, receiving a "Expected Then" statement. Suggestions on that script?
accountnumbers of CSV1:
-------------
10011400
10022800
10041600

accountnumbers of CSV2:
-------------
70002300
40006700
10041600

Only the last numbers in both CSV are the same, so - if I understand correctly the request - I have to output the first 2 numbers from both CSVs, because they do not match. And this is exactly what I did. These numbers are the different ones.
then I must have done something wrong with my tests
So consider this again:
$csv1= Import-Csv C:\ee\example1.csv -Header (("c1","c2","accountnumber") + (1..6 | %{"d$_"}))  
$csv2= Import-Csv C:\ee\example2.csv -Header (("e1","accountnumber") + (1..11 | %{"f$_"}))  
  
Compare-Object $csv1 $csv2 -Property accountnumber | Select-Object -Property accountnumber | 
Export-Csv -notypeinformation -path c:\result.csv

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Bill Prew
Bill Prew

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@billprew

how would I concatenate the AccountNumber with the respective csv, so if it is csv1 it specifies accountnumber, example1 on the output?

I've been trying to add it in the for each, but I don't think I've got it quite right
How about this?

Option Explicit

' Define Needed constants
Const cForReading = 1
Const cForWriting = 2
Const cInFile1 = "X:\EE\EE26910135\example1.csv"
Const cInFile2 = "X:\EE\EE26910135\example2.csv"
Const cOutFile = "X:\EE\EE26910135\missing.csv"
Const cFieldNum1 = 3
Const cFieldNum2 = 2

' Define needed variables
Dim oFSO, oOutFile, dCount, sLine

' Create file system object
Set oFSO = CreateObject("Scripting.FileSystemObject")

' Create dictionary to accumulate counts in
Set dCount = CreateObject("Scripting.Dictionary")

' Process two incoming CSV files
LoadFile cInFile1, cFieldNum1
LoadFile cInFile2, cFieldNum2

' Open output file, print header lines
Set oOutFile = oFSO.OpenTextFile(cOutFile, cForWriting, True)
oOutFile.WriteLine "AccountNumber"
oOutFile.WriteLine "-------------"

' Process each entry in counts dictionary, and output count
For Each sLine In dCount
    If dCount.Item(sLine) <> "BOTH" Then oOutFile.WriteLine sLine & "    " & dCount.Item(sLine)
Next

' Close output file
oOutFile.Close

Sub LoadFile (sFile, iField)
    Dim oFile, oGetFile, sLine, aLine, sKey

    ' Open file for reading
    Set oFile = oFSO.OpenTextFile(sFile, cForReading)
    set oGetFile = oFSO.GetFile(sFile)

    ' Read input lines, accumulate counts in dictionary for this account
    do Until oFile.AtEndOfStream
        sLine = oFile.Readline
        aLine = Split(sLine, ",")
        sKey = aLine(iField-1)
        If dCount.Exists(sKey) Then
            dCount.Item(sKey) = "BOTH"
        Else
            dCount.Add sKey, oGetFile.Name
        End If
    Loop
    oFile.Close
End Sub

Open in new window

~bp
Dear mtnseeker,

Why did you totally ignored my solutions? You requested PowerShell, and I gave you two solutions, one with the all the different AccountNumbers indicating which CSV they ceme from, and a second one with only the non matching Account numbers...
I understood the other solution a bit better and it was just easier for me to make modifications I needed such as adding a email send feature. Originally I thought powershell would be the way to go but I needed something famailar to me and the vbs script worked great. Thank you for your solution and I probably should start using powershell a bit more.

thanks again all,
Kevin