Compare multiple columns to determine matches

I have been given an audit that I do every month that requires comparing different sources to determine which items from each source are not in the others.  Essentially, I am given a directory dump with files in it, and told to validate that each of the files in the directory is also present in a spreadsheet.  The spreadsheet has 2 different columns that I have to validate the directory list against, but it only has to exist in one of the columns in order to pass the audit.  So, what I have been doing is scraping the .txt document that contains the files, and copying that into a column in Excel.  Then, I take the 2 columns out of the other Excel spreadsheet and compare them one at a time against the new sheet that I created.  It ends up working, but it's really not repeatable, or consistent, and it takes forever.  I'm looking for a more elegant solution.  It doesn't matter to me if it's in Excel, Access, VBA, SQL, or whatever.  I just need some way to do it better than I've been doing.
Who is Participating?
dbaSQLConnect With a Mentor Commented:
You know, I have done something very similar.  I had an ftp process that ran every day, to upload new files to our system.  The files were triggered by events, so there was no way to schedule, or know when to expect new files.  

I created a 'ProcessedFiles' table that just held FileName and Processed, where Processed was a bit flag, with a DEFAULT of 0.  Every day I would check the target directory for new files.  I don't have the full code on hand right now, but it was basically like the statement below.  Scan the directory, load all filenames to a working table, then compare to ProcessedFiles.  Anything where NOT EXISTS was then uploaded with a SSIS data flow task.  Upon completion, the Processed flag gets changed to 1.

You could use this type of approach to gather your filenames from the directory, and then compare to the spreadsheet.  

-- change this to your target directory path and run it.  you'll see what I mean.
DECLARE @path VARCHAR(100) = 'dir C:\tmp\'   --- your target directory
DECLARE @cmd VARCHAR(1024) = @path + ' /A-D /B'

DECLARE @Files TABLE (Filename VARCHAR(255))
INSERT @Files (Filename)
EXEC master..xp_cmdshell @cmd

-- Look at your file list
What do you have for tools?

If SQL Server, Access, or something that accepts SQL commands, you could import and keep the two columns from the Excel file, and also import the text file (is it a "dir >> textfile.txt ?) into another table. If that's possible then it's simple to write a query like

select * from DirectoryListTable
where FileNameField in (select FileNameField from ExcelTableOne)

to see what filenames are present. Change the "in" to "not in" to see the extras!


PatHartmanConnect With a Mentor Commented:
If you can see the directory from your computer, I would ignore the word document and get the file list directly from windows using FSO (file system object).  Then I would link to the spreadsheet and use an append query to append column 1 to a work table and a second query to append column 2.  You can add an extra column if you need to retain the fact that the name came from column 1 or column 2  Now you have one table with a list of files from the directory and a second table with a list of files from the spreadsheet.  So far all of these steps are repeatable and can be automated.

To do the match, I would use a union query and tag the results
qry1 is an inner join and will return all rows in both tables.
qry2 is a left join and returns rows in the directory list but not the spreadsheet
qry3 is a right join and returns rows in the spreadsheet but not in the directory list (I know you didn't ask for this but you should account for typos)

Then you can make a report that lists and counts the three situations from the union query.
Cloud Class® Course: Microsoft Azure 2017

Azure has a changed a lot since it was originally introduce by adding new services and features. Do you know everything you need to about Azure? This course will teach you about the Azure App Service, monitoring and application insights, DevOps, and Team Services.

One advice: use Python. By far the easiest way to accomplish this kind of tasks. It will be a well spent time to learn and get familiar with it. There is a module to deal with any situation you might come across and they are all FREE!

CSV with Python:
1. is the file comparison for a single directory's files or is there a tree involved?
2. do you need to compare any of the file attributes (modified date, size, hash) or just the names?
3. do you only need to know the exceptions or something else? (new and missing, for instance)
4. at the end of the process, what is to become of the Excel worksheet contents?
5. how many files are in the directory?
6. how many file names are in the worksheet?
7. is there a directory of files that matches the list in the worksheet?
If the files from the prior month (time period) exist in a directory, you should also be able to use Robocopy to do the comparison.
For testing purposes, I created two folders "new" and "old", copied some files to each folder and ran the following command.
robocopy /L /X new old

Open in new window

I then deleted a file and added a file in the "new" folder and reran the command.  This was the result.
C:\Users\Mark\Downloads>robocopy /L /X new old

   ROBOCOPY     ::     Robust File Copy for Windows


  Started : Sat Mar 08 17:21:32 2014

   Source : C:\Users\Mark\Downloads\new\
     Dest : C:\Users\Mark\Downloads\old\

    Files : *.*

  Options : *.* /X /L /COPY:DAT /R:1000000 /W:30


                           3    C:\Users\Mark\Downloads\new\
          *EXTRA File              84784        fciv.exe
            New File                   0        New Text Document.txt


                Total    Copied   Skipped  Mismatch    FAILED    Extras
     Dirs :         1         0         1         0         0         0
    Files :         3         1         2         0         0         1
    Bytes :    47.6 k         0    47.6 k         0         0    82.7 k
    Times :   0:00:00   0:00:00                       0:00:00   0:00:00

    Ended : Sat Mar 08 17:21:32 2014

Open in new window

Note that it reported the new and missing files.
If you only have the list in the worksheet, you can most quickly do this comparison in the Excel VBA or VBScript environment using a dictionary object.
Example with parameter for directory.  The list of files is assumed to be in column F, starting on row 2.
Option Explicit

Public Sub Q_28383341(parmFolder As String)
    Dim oDic As Object
    Dim vFiles() As Variant
    Dim strFilename As String
    Dim vItem As Variant
    Set oDic = CreateObject("scripting.dictionary")
    vFiles = ActiveSheet.Range(ActiveSheet.Range("F2"), ActiveSheet.Range("F2").End(xlDown)).Value
    For Each vItem In vFiles
        oDic.Add CStr(vItem), 0
    strFilename = Dir(parmFolder & "\*.*", vbNormal)
    Do Until Len(strFilename) = 0
        If oDic.exists(strFilename) Then
            oDic(strFilename) = 1
            Debug.Print "New file: " & strFilename
        End If
        strFilename = Dir
    For Each vItem In oDic
        If oDic(vItem) = 0 Then
            Debug.Print "Missing file: " & vItem
        End If
End Sub

Open in new window

Invoking this in the Immediate Window gives:
Q_28383341 "C:\Users\Mark\Downloads\New"
New file: New Text Document.txt
Missing file: fciv.exe

Open in new window

SarahDaisy8Author Commented:
Let me expand on my topic to add clarification, and to talk about my difficulties.  Right now, I receive a .txt file that contains files in directories that have been backed up.  The location of the files doesn't matter, just the name of the files, so I strip out only the names of the files from this .txt file and create a table in SQL.  I end up with one column with items named [username].txt.  The [username] portion can come from 2 different variables.  Those being the 2 columns in Excel that I spoke of previously.  The first column in the email address of the user, and the 2nd column is the user's login name.  Therefore, the corresponding .txt file for the user can either be named [email].txt or [login].txt.  Technically a user can have an email and a username, and the .txt file could be named either one, but I wouldn't know this in advance.  So, basically I need to compare the .txt file to see which ones don't match either an email address OR login name.  The queries I've been using aren't working because I'm getting duplicate data, because if there is an email, it will still check the login name, and if it doesn't find one, it will tell me it's missing.  I need to combine login + email into one check against the .txt file name.
Please post a sample of both the database table and the text file contents.

This question is in the Excel zone as well as some database zones.  Are these appropriate to the environment of the solution you seek?  If not, we can change the zones.

What do you think of my posted code solution?
If you're working in a database environment I would normally import the text file and let the database do the comparison with left and right (outer) joins.
SarahDaisy8Author Commented:
Thanks.  Between the two different solutions, I came up with something that works for me.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.