Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium


Compare multiple columns to determine matches

Posted on 2014-03-07
Medium Priority
Last Modified: 2014-03-12
I have been given an audit that I do every month that requires comparing different sources to determine which items from each source are not in the others.  Essentially, I am given a directory dump with files in it, and told to validate that each of the files in the directory is also present in a spreadsheet.  The spreadsheet has 2 different columns that I have to validate the directory list against, but it only has to exist in one of the columns in order to pass the audit.  So, what I have been doing is scraping the .txt document that contains the files, and copying that into a column in Excel.  Then, I take the 2 columns out of the other Excel spreadsheet and compare them one at a time against the new sheet that I created.  It ends up working, but it's really not repeatable, or consistent, and it takes forever.  I'm looking for a more elegant solution.  It doesn't matter to me if it's in Excel, Access, VBA, SQL, or whatever.  I just need some way to do it better than I've been doing.
Question by:SarahDaisy8
LVL 16

Expert Comment

ID: 39913883
What do you have for tools?

If SQL Server, Access, or something that accepts SQL commands, you could import and keep the two columns from the Excel file, and also import the text file (is it a "dir >> textfile.txt ?) into another table. If that's possible then it's simple to write a query like

select * from DirectoryListTable
where FileNameField in (select FileNameField from ExcelTableOne)

to see what filenames are present. Change the "in" to "not in" to see the extras!


LVL 40

Assisted Solution

PatHartman earned 1000 total points
ID: 39913984
If you can see the directory from your computer, I would ignore the word document and get the file list directly from windows using FSO (file system object).  Then I would link to the spreadsheet and use an append query to append column 1 to a work table and a second query to append column 2.  You can add an extra column if you need to retain the fact that the name came from column 1 or column 2  Now you have one table with a list of files from the directory and a second table with a list of files from the spreadsheet.  So far all of these steps are repeatable and can be automated.

To do the match, I would use a union query and tag the results
qry1 is an inner join and will return all rows in both tables.
qry2 is a left join and returns rows in the directory list but not the spreadsheet
qry3 is a right join and returns rows in the spreadsheet but not in the directory list (I know you didn't ask for this but you should account for typos)

Then you can make a report that lists and counts the three situations from the union query.
LVL 17

Accepted Solution

dbaSQL earned 1000 total points
ID: 39914143
You know, I have done something very similar.  I had an ftp process that ran every day, to upload new files to our system.  The files were triggered by events, so there was no way to schedule, or know when to expect new files.  

I created a 'ProcessedFiles' table that just held FileName and Processed, where Processed was a bit flag, with a DEFAULT of 0.  Every day I would check the target directory for new files.  I don't have the full code on hand right now, but it was basically like the statement below.  Scan the directory, load all filenames to a working table, then compare to ProcessedFiles.  Anything where NOT EXISTS was then uploaded with a SSIS data flow task.  Upon completion, the Processed flag gets changed to 1.

You could use this type of approach to gather your filenames from the directory, and then compare to the spreadsheet.  

-- change this to your target directory path and run it.  you'll see what I mean.
DECLARE @path VARCHAR(100) = 'dir C:\tmp\'   --- your target directory
DECLARE @cmd VARCHAR(1024) = @path + ' /A-D /B'

DECLARE @Files TABLE (Filename VARCHAR(255))
INSERT @Files (Filename)
EXEC master..xp_cmdshell @cmd

-- Look at your file list
Nothing ever in the clear!

This technical paper will help you implement VMware’s VM encryption as well as implement Veeam encryption which together will achieve the nothing ever in the clear goal. If a bad guy steals VMs, backups or traffic they get nothing.

LVL 27

Expert Comment

ID: 39914753
One advice: use Python. By far the easiest way to accomplish this kind of tasks. It will be a well spent time to learn and get familiar with it. There is a module to deal with any situation you might come across and they are all FREE!

CSV with Python: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
LVL 46

Expert Comment

ID: 39915146
1. is the file comparison for a single directory's files or is there a tree involved?
2. do you need to compare any of the file attributes (modified date, size, hash) or just the names?
3. do you only need to know the exceptions or something else? (new and missing, for instance)
4. at the end of the process, what is to become of the Excel worksheet contents?
5. how many files are in the directory?
6. how many file names are in the worksheet?
7. is there a directory of files that matches the list in the worksheet?
LVL 46

Expert Comment

ID: 39915161
If the files from the prior month (time period) exist in a directory, you should also be able to use Robocopy to do the comparison.
For testing purposes, I created two folders "new" and "old", copied some files to each folder and ran the following command.
robocopy /L /X new old

Open in new window

I then deleted a file and added a file in the "new" folder and reran the command.  This was the result.
C:\Users\Mark\Downloads>robocopy /L /X new old

   ROBOCOPY     ::     Robust File Copy for Windows


  Started : Sat Mar 08 17:21:32 2014

   Source : C:\Users\Mark\Downloads\new\
     Dest : C:\Users\Mark\Downloads\old\

    Files : *.*

  Options : *.* /X /L /COPY:DAT /R:1000000 /W:30


                           3    C:\Users\Mark\Downloads\new\
          *EXTRA File              84784        fciv.exe
            New File                   0        New Text Document.txt


                Total    Copied   Skipped  Mismatch    FAILED    Extras
     Dirs :         1         0         1         0         0         0
    Files :         3         1         2         0         0         1
    Bytes :    47.6 k         0    47.6 k         0         0    82.7 k
    Times :   0:00:00   0:00:00                       0:00:00   0:00:00

    Ended : Sat Mar 08 17:21:32 2014

Open in new window

Note that it reported the new and missing files.
LVL 46

Expert Comment

ID: 39915187
If you only have the list in the worksheet, you can most quickly do this comparison in the Excel VBA or VBScript environment using a dictionary object.
Example with parameter for directory.  The list of files is assumed to be in column F, starting on row 2.
Option Explicit

Public Sub Q_28383341(parmFolder As String)
    Dim oDic As Object
    Dim vFiles() As Variant
    Dim strFilename As String
    Dim vItem As Variant
    Set oDic = CreateObject("scripting.dictionary")
    vFiles = ActiveSheet.Range(ActiveSheet.Range("F2"), ActiveSheet.Range("F2").End(xlDown)).Value
    For Each vItem In vFiles
        oDic.Add CStr(vItem), 0
    strFilename = Dir(parmFolder & "\*.*", vbNormal)
    Do Until Len(strFilename) = 0
        If oDic.exists(strFilename) Then
            oDic(strFilename) = 1
            Debug.Print "New file: " & strFilename
        End If
        strFilename = Dir
    For Each vItem In oDic
        If oDic(vItem) = 0 Then
            Debug.Print "Missing file: " & vItem
        End If
End Sub

Open in new window

Invoking this in the Immediate Window gives:
Q_28383341 "C:\Users\Mark\Downloads\New"
New file: New Text Document.txt
Missing file: fciv.exe

Open in new window


Author Comment

ID: 39918086
Let me expand on my topic to add clarification, and to talk about my difficulties.  Right now, I receive a .txt file that contains files in directories that have been backed up.  The location of the files doesn't matter, just the name of the files, so I strip out only the names of the files from this .txt file and create a table in SQL.  I end up with one column with items named [username].txt.  The [username] portion can come from 2 different variables.  Those being the 2 columns in Excel that I spoke of previously.  The first column in the email address of the user, and the 2nd column is the user's login name.  Therefore, the corresponding .txt file for the user can either be named [email].txt or [login].txt.  Technically a user can have an email and a username, and the .txt file could be named either one, but I wouldn't know this in advance.  So, basically I need to compare the .txt file to see which ones don't match either an email address OR login name.  The queries I've been using aren't working because I'm getting duplicate data, because if there is an email, it will still check the login name, and if it doesn't find one, it will tell me it's missing.  I need to combine login + email into one check against the .txt file name.
LVL 46

Expert Comment

ID: 39918306
Please post a sample of both the database table and the text file contents.

This question is in the Excel zone as well as some database zones.  Are these appropriate to the environment of the solution you seek?  If not, we can change the zones.

What do you think of my posted code solution?
LVL 46

Expert Comment

ID: 39918313
If you're working in a database environment I would normally import the text file and let the database do the comparison with left and right (outer) joins.

Author Closing Comment

ID: 39924597
Thanks.  Between the two different solutions, I came up with something that works for me.

Featured Post

Veeam Disaster Recovery in Microsoft Azure

Veeam PN for Microsoft Azure is a FREE solution designed to simplify and automate the setup of a DR site in Microsoft Azure using lightweight software-defined networking. It reduces the complexity of VPN deployments and is designed for businesses of ALL sizes.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Excel can be a tricky bit of software to get your head around. Whilst you’ll be able to eventually get to grips with the basic understanding of how to get by, there are a few Excel tips that not everybody will even know about let alone know how to d…
Beware when using the ListIndex and the Column() properties of a listbox in Access 2007.  A bug has been identified in the Access 2007 listbox code which can cause the .ListIndex property to return a -1, and the .Columns(#) property to return a NULL…
Viewers will learn how the fundamental information of how to create a table.
This lesson discusses how to use a Mainform + Subforms in Microsoft Access to find and enter data for payments on orders. The sample data comes from a custom shop that builds and sells movable storage structures that are delivered to your property. …

580 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question