Compare multiple columns to determine matches

Posted on 2014-03-07
Medium Priority
Last Modified: 2014-03-12
I have been given an audit that I do every month that requires comparing different sources to determine which items from each source are not in the others.  Essentially, I am given a directory dump with files in it, and told to validate that each of the files in the directory is also present in a spreadsheet.  The spreadsheet has 2 different columns that I have to validate the directory list against, but it only has to exist in one of the columns in order to pass the audit.  So, what I have been doing is scraping the .txt document that contains the files, and copying that into a column in Excel.  Then, I take the 2 columns out of the other Excel spreadsheet and compare them one at a time against the new sheet that I created.  It ends up working, but it's really not repeatable, or consistent, and it takes forever.  I'm looking for a more elegant solution.  It doesn't matter to me if it's in Excel, Access, VBA, SQL, or whatever.  I just need some way to do it better than I've been doing.
Question by:SarahDaisy8
LVL 16

Expert Comment

ID: 39913883
What do you have for tools?

If SQL Server, Access, or something that accepts SQL commands, you could import and keep the two columns from the Excel file, and also import the text file (is it a "dir >> textfile.txt ?) into another table. If that's possible then it's simple to write a query like

select * from DirectoryListTable
where FileNameField in (select FileNameField from ExcelTableOne)

to see what filenames are present. Change the "in" to "not in" to see the extras!


LVL 41

Assisted Solution

PatHartman earned 1000 total points
ID: 39913984
If you can see the directory from your computer, I would ignore the word document and get the file list directly from windows using FSO (file system object).  Then I would link to the spreadsheet and use an append query to append column 1 to a work table and a second query to append column 2.  You can add an extra column if you need to retain the fact that the name came from column 1 or column 2  Now you have one table with a list of files from the directory and a second table with a list of files from the spreadsheet.  So far all of these steps are repeatable and can be automated.

To do the match, I would use a union query and tag the results
qry1 is an inner join and will return all rows in both tables.
qry2 is a left join and returns rows in the directory list but not the spreadsheet
qry3 is a right join and returns rows in the spreadsheet but not in the directory list (I know you didn't ask for this but you should account for typos)

Then you can make a report that lists and counts the three situations from the union query.
LVL 17

Accepted Solution

dbaSQL earned 1000 total points
ID: 39914143
You know, I have done something very similar.  I had an ftp process that ran every day, to upload new files to our system.  The files were triggered by events, so there was no way to schedule, or know when to expect new files.  

I created a 'ProcessedFiles' table that just held FileName and Processed, where Processed was a bit flag, with a DEFAULT of 0.  Every day I would check the target directory for new files.  I don't have the full code on hand right now, but it was basically like the statement below.  Scan the directory, load all filenames to a working table, then compare to ProcessedFiles.  Anything where NOT EXISTS was then uploaded with a SSIS data flow task.  Upon completion, the Processed flag gets changed to 1.

You could use this type of approach to gather your filenames from the directory, and then compare to the spreadsheet.  

-- change this to your target directory path and run it.  you'll see what I mean.
DECLARE @path VARCHAR(100) = 'dir C:\tmp\'   --- your target directory
DECLARE @cmd VARCHAR(1024) = @path + ' /A-D /B'

DECLARE @Files TABLE (Filename VARCHAR(255))
INSERT @Files (Filename)
EXEC master..xp_cmdshell @cmd

-- Look at your file list
A proven path to a career in data science

At Springboard, we know how to get you a job in data science. With Springboard’s Data Science Career Track, you’ll master data science  with a curriculum built by industry experts. You’ll work on real projects, and get 1-on-1 mentorship from a data scientist.

LVL 27

Expert Comment

ID: 39914753
One advice: use Python. By far the easiest way to accomplish this kind of tasks. It will be a well spent time to learn and get familiar with it. There is a module to deal with any situation you might come across and they are all FREE!

CSV with Python: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
LVL 47

Expert Comment

ID: 39915146
1. is the file comparison for a single directory's files or is there a tree involved?
2. do you need to compare any of the file attributes (modified date, size, hash) or just the names?
3. do you only need to know the exceptions or something else? (new and missing, for instance)
4. at the end of the process, what is to become of the Excel worksheet contents?
5. how many files are in the directory?
6. how many file names are in the worksheet?
7. is there a directory of files that matches the list in the worksheet?
LVL 47

Expert Comment

ID: 39915161
If the files from the prior month (time period) exist in a directory, you should also be able to use Robocopy to do the comparison.
For testing purposes, I created two folders "new" and "old", copied some files to each folder and ran the following command.
robocopy /L /X new old

Open in new window

I then deleted a file and added a file in the "new" folder and reran the command.  This was the result.
C:\Users\Mark\Downloads>robocopy /L /X new old

   ROBOCOPY     ::     Robust File Copy for Windows


  Started : Sat Mar 08 17:21:32 2014

   Source : C:\Users\Mark\Downloads\new\
     Dest : C:\Users\Mark\Downloads\old\

    Files : *.*

  Options : *.* /X /L /COPY:DAT /R:1000000 /W:30


                           3    C:\Users\Mark\Downloads\new\
          *EXTRA File              84784        fciv.exe
            New File                   0        New Text Document.txt


                Total    Copied   Skipped  Mismatch    FAILED    Extras
     Dirs :         1         0         1         0         0         0
    Files :         3         1         2         0         0         1
    Bytes :    47.6 k         0    47.6 k         0         0    82.7 k
    Times :   0:00:00   0:00:00                       0:00:00   0:00:00

    Ended : Sat Mar 08 17:21:32 2014

Open in new window

Note that it reported the new and missing files.
LVL 47

Expert Comment

ID: 39915187
If you only have the list in the worksheet, you can most quickly do this comparison in the Excel VBA or VBScript environment using a dictionary object.
Example with parameter for directory.  The list of files is assumed to be in column F, starting on row 2.
Option Explicit

Public Sub Q_28383341(parmFolder As String)
    Dim oDic As Object
    Dim vFiles() As Variant
    Dim strFilename As String
    Dim vItem As Variant
    Set oDic = CreateObject("scripting.dictionary")
    vFiles = ActiveSheet.Range(ActiveSheet.Range("F2"), ActiveSheet.Range("F2").End(xlDown)).Value
    For Each vItem In vFiles
        oDic.Add CStr(vItem), 0
    strFilename = Dir(parmFolder & "\*.*", vbNormal)
    Do Until Len(strFilename) = 0
        If oDic.exists(strFilename) Then
            oDic(strFilename) = 1
            Debug.Print "New file: " & strFilename
        End If
        strFilename = Dir
    For Each vItem In oDic
        If oDic(vItem) = 0 Then
            Debug.Print "Missing file: " & vItem
        End If
End Sub

Open in new window

Invoking this in the Immediate Window gives:
Q_28383341 "C:\Users\Mark\Downloads\New"
New file: New Text Document.txt
Missing file: fciv.exe

Open in new window


Author Comment

ID: 39918086
Let me expand on my topic to add clarification, and to talk about my difficulties.  Right now, I receive a .txt file that contains files in directories that have been backed up.  The location of the files doesn't matter, just the name of the files, so I strip out only the names of the files from this .txt file and create a table in SQL.  I end up with one column with items named [username].txt.  The [username] portion can come from 2 different variables.  Those being the 2 columns in Excel that I spoke of previously.  The first column in the email address of the user, and the 2nd column is the user's login name.  Therefore, the corresponding .txt file for the user can either be named [email].txt or [login].txt.  Technically a user can have an email and a username, and the .txt file could be named either one, but I wouldn't know this in advance.  So, basically I need to compare the .txt file to see which ones don't match either an email address OR login name.  The queries I've been using aren't working because I'm getting duplicate data, because if there is an email, it will still check the login name, and if it doesn't find one, it will tell me it's missing.  I need to combine login + email into one check against the .txt file name.
LVL 47

Expert Comment

ID: 39918306
Please post a sample of both the database table and the text file contents.

This question is in the Excel zone as well as some database zones.  Are these appropriate to the environment of the solution you seek?  If not, we can change the zones.

What do you think of my posted code solution?
LVL 47

Expert Comment

ID: 39918313
If you're working in a database environment I would normally import the text file and let the database do the comparison with left and right (outer) joins.

Author Closing Comment

ID: 39924597
Thanks.  Between the two different solutions, I came up with something that works for me.

Featured Post

Easily Design & Build Your Next Website

Squarespace’s all-in-one platform gives you everything you need to express yourself creatively online, whether it is with a domain, website, or online store. Get started with your free trial today, and when ready, take 10% off your first purchase with offer code 'EXPERTS'.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Microsoft's Excel has many features that most people will never need nor take advantage of.  Conditional formatting is one feature that you may find a necessity once you start using it.
In a use case, a user needs to close an opened report by simply pressing the Escape (Esc) key. This can be done by adding macro code in Report_KeyPress or Report_KeyDown event.
Have you created a query with information for a calendar? ... and then, abra-cadabra, the calendar is done?! I am going to show you how to make that happen. Visualize your data!  ... really see it To use the code to create a calendar from a q…
The Relationships Diagram is a good way to get an overall view of what a database is keeping track of. It is also where relationships are defined. A relationship specifies how two tables connect to each other. As you build tables in Microsoft Ac…

586 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question