Solved

compare csv on exact columns

Posted on 2011-03-24
33
1,718 Views
Last Modified: 2012-05-11
this is a fairly common question I know from searching; however, none of what I found really seemed to be quite what I wanted. I've looked at trying powershell, but have never messed with it; looking for the best and quickest solution. This is something I'd like to automate.

csv1: need to look at column 3
csv2: need to look at column 2

columns 3 and 2 should match; if there are differences, such as an extra I need it outputted to a file
0
Comment
Question by:mtnseeker
  • 10
  • 9
  • 7
  • +1
33 Comments
 
LVL 13

Expert Comment

by:soostibi
ID: 35211287
Could you make it a bit clearer?

CSV1:

Col1, Col2, Col3
a,a,a
b,b,b
c,c,c

CSV2:
Col1, Col2
x,a
y,b
z,d

What would you like to see in the output?

c,c,c?
z,d?
Both?

Do I get the problem right?
0
 

Author Comment

by:mtnseeker
ID: 35211318
csv1:
data, data, AccountNumber, data, data

csv2:
data, AccountNumber, data, data, data, data

output:
Just the AccoutNumber that does not exist in one of the two CSV's. They should both be in there, but there may be an account number that is found in one and not the other.
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35211374
Check this:
$csv1 = Import-Csv C:\ee\CSV1.csv -UseCulture
$csv2 = Import-Csv C:\ee\CSV2.csv -UseCulture
Compare-Object $csv1 $csv2 -Property Accountnumber | Select-Object -Property accountnumber

Open in new window

0
 

Author Comment

by:mtnseeker
ID: 35211940
@aikimark: yes... unfortunately I didn't tag enough zones. My question was meant to be more general than specific to powershell (I was guessing it was the best solution). I am also looking at any utilities (freeware) that may do the job as well.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35212067
my approach would be to run two queries against an outer-joined view of the two tables.  Let the database engine do the heavy lifting.
* left join where the right table value is null
* right join where the left table value is null

====
If the column data is unique, you could do the following:
In a VBScript environment, I would populate two "scripting.dictionary" objects with the contents of the two columns.  Iterate through each of the objects once, invoking the Exists() method on the other dictionary object.

If the column data isn't unique, you could invoke that dictionary object's Exists() method before adding the column data.

If this is a simple CSV file, your parsing of the data should be straight forward.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 35212137
Will the values in the columns to compare be in the same order, or do we essentially have to search the whole file looking for a row from the first file?

~bp
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35212204
@bp

That's the advantage of using a hashed lookup (dictionary)
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 35212283
==> aikimark

Yes, after i posted my question I saw your post and we are thinking along similar lines...

~bp
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35214281
Before you guys invent the wheel, please check my PowerShell solution, please.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35214523
>>check my PowerShell solution

@mtnseeker

* Have you tried the posted PS code?
* Do the CSV files have a header row?
0
 

Author Comment

by:mtnseeker
ID: 35216858
the csv files do not have a header row so that code didn't run...

also with this powershell solution how would I output the results to a third csv or other text file?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35216916
@mtnseeker

please post some sample rows from your two CSV files.  You should obscure the actual data if it is sensitive.  However, it should be an accurate representation of the data that is in the actual CSV files.
0
 

Author Comment

by:mtnseeker
ID: 35217239
see the attachment... I am inclined to go with powershell on this since it seems lightweight and I don't want to have to build an entire application...just haven't used it before
 example1.csv
 example2.csv
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35217406
@soostibi

1. you need to use the -Header parameter when doing your CSV-Import function
2. you need to add something, perhaps a -differenceobject parameter, that will compare the imported data for differences.  Even then, I think you will need two of such comparisons, swapping the left/right order of the CSV objects.
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35218127
Yes, I need the header, but I do not need any other thing.

The output:

accountnumber                                                                                                                          
-------------                                                                                                                          
70002300                                                                                                                                
40006700                                                                                                                                
10011400                                                                                                                                
10022800  
PS C:\Windows\system32> $csv1= Import-Csv C:\ee\example1.csv -Header (("c1","c2","accountnumber") + (1..6 | %{"d$_"}))
$csv2= Import-Csv C:\ee\example2.csv -Header (("e1","accountnumber") + (1..11 | %{"f$_"}))

Compare-Object $csv1 $csv2 -Property accountnumber | Select-Object -Property accountnumber

Open in new window

0
How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

 
LVL 13

Expert Comment

by:soostibi
ID: 35218187
And you can also export the result by this:
Compare-Object $csv1 $csv2 -Property accountnumber | Select-Object -Property accountnumber | Export-Csv -notypeinformation -path c:\result.csv

Open in new window

0
 
LVL 45

Expert Comment

by:aikimark
ID: 35218253
@soostibi

the problem, as stated, is to find the differences between the two columns of accountnumber values
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 35218277
here's a VBS solution that should do the job.  Save as a VBS and adjust the file paths at the top as needed.

Option Explicit

' Define Needed constants
Const cForReading = 1
Const cForWriting = 2
Const cInFile1 = "X:\EE\EE26910135\example1.csv"
Const cInFile2 = "X:\EE\EE26910135\example2.csv"
Const cOutFile = "X:\EE\EE26910135\missing.csv"
Const cFieldNum1 = 3
Const cFieldNum2 = 2

' Define needed variables
Dim oFSO, oOutFile, dCount, sLine

' Create file system object
Set oFSO = CreateObject("Scripting.FileSystemObject")

' Create dictionary to accumulate counts in
Set dCount = CreateObject("Scripting.Dictionary")

' Process two incoming CSV files
LoadFile cInFile1, cFieldNum1
LoadFile cInFile2, cFieldNum2

' Open output file, print header lines
Set oOutFile = oFSO.OpenTextFile(cOutFile, cForWriting, True)
oOutFile.WriteLine "AccountNumber"
oOutFile.WriteLine "-------------"

' Process each entry in counts dictionary, and output count
For Each sLine In dCount
    If dCount.Item(sLine) = 1 oOutFile.WriteLine sLine
Next

' Close output file
oOutFile.Close

Sub LoadFile (sFile, iField)
    Dim oFile, sLine, aLine, sKey

    ' Open file for reading
    Set oFile = oFSO.OpenTextFile(sFile, cForReading)

    ' Read input lines, accumulate counts in dictionary for this account
    do Until oFile.AtEndOfStream
        sLine = oFile.Readline
        aLine = Split(sLine, ",")
        sKey = aLine(iField-1)
        If dCount.Exists(sKey) Then
            dCount.Item(sKey) = dCount.Item(sKey) + 1
        Else
            dCount.Add sKey, 1
        End If
    Loop
    oFile.Close
End Sub

Open in new window

~bp
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35218358
I do not exactly see what output you exactly expect, but here is another try:

accountnumber                                                        file                                                              
-------------                                                        ----                                                              
70002300                                                             example2.csv                                                      
40006700                                                             example2.csv                                                      
10011400                                                             example1.csv                                                      
10022800                                                             example1.csv    
$file1 = "C:\ee\example1.csv"
$file2 = "C:\ee\example2.csv"

$csv1= Import-Csv -Path $file1 -Header (("c1","c2","accountnumber") + (1..6 | %{"d$_"}))
$csv2= Import-Csv -path $file2 -Header (("e1","accountnumber") + (1..11 | %{"f$_"}))

Compare-Object $csv1 $csv2 -Property accountnumber | select-object -Property accountnumber, @{n="file";e={
    if($_.sideindicator -eq "=>"){split-path $file2 -Leaf}else{split-path $file1 -Leaf}
}}

Open in new window

0
 
LVL 45

Expert Comment

by:aikimark
ID: 35218736
@soostibi

edit one or both CSV files, adding values from the other list and reordering the list values.  What mtnseeker needs is to ONLY see those values that are not both lists.
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35219095
This exactly what my list is: those values that are not in both list.
My comment at 'Date:25/03/11 12:16 PM ' exactly resulted in 'ONLY those values that are not both lists'. What was the problem with that then? I just do not understand...
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35219408
In my test of you PS script and my reordered and partially-same values, the same values were output.
0
 

Author Comment

by:mtnseeker
ID: 35219493
Looking at the VBS Script from @billprew..

This doesn't seem to compile, receiving a "Expected Then" statement. Suggestions on that script?
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35219496
accountnumbers of CSV1:
-------------
10011400
10022800
10041600

accountnumbers of CSV2:
-------------
70002300
40006700
10041600

Only the last numbers in both CSV are the same, so - if I understand correctly the request - I have to output the first 2 numbers from both CSVs, because they do not match. And this is exactly what I did. These numbers are the different ones.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 35219522
then I must have done something wrong with my tests
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35219549
So consider this again:
$csv1= Import-Csv C:\ee\example1.csv -Header (("c1","c2","accountnumber") + (1..6 | %{"d$_"}))  
$csv2= Import-Csv C:\ee\example2.csv -Header (("e1","accountnumber") + (1..11 | %{"f$_"}))  
  
Compare-Object $csv1 $csv2 -Property accountnumber | Select-Object -Property accountnumber | 
Export-Csv -notypeinformation -path c:\result.csv

Open in new window

0
 
LVL 51

Accepted Solution

by:
Bill Prew earned 500 total points
ID: 35219871
Sorry, lost a "Then" in a cut and paste, this does run.

Option Explicit

' Define Needed constants
Const cForReading = 1
Const cForWriting = 2
Const cInFile1 = "X:\EE\EE26910135\example1.csv"
Const cInFile2 = "X:\EE\EE26910135\example2.csv"
Const cOutFile = "X:\EE\EE26910135\missing.csv"
Const cFieldNum1 = 3
Const cFieldNum2 = 2

' Define needed variables
Dim oFSO, oOutFile, dCount, sLine

' Create file system object
Set oFSO = CreateObject("Scripting.FileSystemObject")

' Create dictionary to accumulate counts in
Set dCount = CreateObject("Scripting.Dictionary")

' Process two incoming CSV files
LoadFile cInFile1, cFieldNum1
LoadFile cInFile2, cFieldNum2

' Open output file, print header lines
Set oOutFile = oFSO.OpenTextFile(cOutFile, cForWriting, True)
oOutFile.WriteLine "AccountNumber"
oOutFile.WriteLine "-------------"

' Process each entry in counts dictionary, and output count
For Each sLine In dCount
    If dCount.Item(sLine) = 1 Then oOutFile.WriteLine sLine
Next

' Close output file
oOutFile.Close

Sub LoadFile (sFile, iField)
    Dim oFile, sLine, aLine, sKey

    ' Open file for reading
    Set oFile = oFSO.OpenTextFile(sFile, cForReading)

    ' Read input lines, accumulate counts in dictionary for this account
    do Until oFile.AtEndOfStream
        sLine = oFile.Readline
        aLine = Split(sLine, ",")
        sKey = aLine(iField-1)
        If dCount.Exists(sKey) Then
            dCount.Item(sKey) = dCount.Item(sKey) + 1
        Else
            dCount.Add sKey, 1
        End If
    Loop
    oFile.Close
End Sub

Open in new window

~bp
0
 

Author Comment

by:mtnseeker
ID: 35236876
@billprew

how would I concatenate the AccountNumber with the respective csv, so if it is csv1 it specifies accountnumber, example1 on the output?

I've been trying to add it in the for each, but I don't think I've got it quite right
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 35237532
How about this?

Option Explicit

' Define Needed constants
Const cForReading = 1
Const cForWriting = 2
Const cInFile1 = "X:\EE\EE26910135\example1.csv"
Const cInFile2 = "X:\EE\EE26910135\example2.csv"
Const cOutFile = "X:\EE\EE26910135\missing.csv"
Const cFieldNum1 = 3
Const cFieldNum2 = 2

' Define needed variables
Dim oFSO, oOutFile, dCount, sLine

' Create file system object
Set oFSO = CreateObject("Scripting.FileSystemObject")

' Create dictionary to accumulate counts in
Set dCount = CreateObject("Scripting.Dictionary")

' Process two incoming CSV files
LoadFile cInFile1, cFieldNum1
LoadFile cInFile2, cFieldNum2

' Open output file, print header lines
Set oOutFile = oFSO.OpenTextFile(cOutFile, cForWriting, True)
oOutFile.WriteLine "AccountNumber"
oOutFile.WriteLine "-------------"

' Process each entry in counts dictionary, and output count
For Each sLine In dCount
    If dCount.Item(sLine) <> "BOTH" Then oOutFile.WriteLine sLine & "    " & dCount.Item(sLine)
Next

' Close output file
oOutFile.Close

Sub LoadFile (sFile, iField)
    Dim oFile, oGetFile, sLine, aLine, sKey

    ' Open file for reading
    Set oFile = oFSO.OpenTextFile(sFile, cForReading)
    set oGetFile = oFSO.GetFile(sFile)

    ' Read input lines, accumulate counts in dictionary for this account
    do Until oFile.AtEndOfStream
        sLine = oFile.Readline
        aLine = Split(sLine, ",")
        sKey = aLine(iField-1)
        If dCount.Exists(sKey) Then
            dCount.Item(sKey) = "BOTH"
        Else
            dCount.Add sKey, oGetFile.Name
        End If
    Loop
    oFile.Close
End Sub

Open in new window

~bp
0
 
LVL 13

Expert Comment

by:soostibi
ID: 35238636
Dear mtnseeker,

Why did you totally ignored my solutions? You requested PowerShell, and I gave you two solutions, one with the all the different AccountNumbers indicating which CSV they ceme from, and a second one with only the non matching Account numbers...
0
 

Author Comment

by:mtnseeker
ID: 35238787
I understood the other solution a bit better and it was just easier for me to make modifications I needed such as adding a email send feature. Originally I thought powershell would be the way to go but I needed something famailar to me and the vbs script worked great. Thank you for your solution and I probably should start using powershell a bit more.

thanks again all,
Kevin
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

It was really hard time for me to get the understanding of Delegates in C#. I went through many websites and articles but I found them very clumsy. After going through those sites, I noted down the points in a easy way so here I am sharing that unde…
If you need to start windows update installation remotely or as a scheduled task you will find this very helpful.
The viewer will learn how to create a normally distributed random variable in Excel, use a normal distribution to simulate the return on an investment over a period of years, Create a Monte Carlo simulation using a normal random variable, and calcul…
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now