Solved

Compare multiple columns to determine matches

Posted on 2014-03-07
11
488 Views
Last Modified: 2014-03-12
I have been given an audit that I do every month that requires comparing different sources to determine which items from each source are not in the others.  Essentially, I am given a directory dump with files in it, and told to validate that each of the files in the directory is also present in a spreadsheet.  The spreadsheet has 2 different columns that I have to validate the directory list against, but it only has to exist in one of the columns in order to pass the audit.  So, what I have been doing is scraping the .txt document that contains the files, and copying that into a column in Excel.  Then, I take the 2 columns out of the other Excel spreadsheet and compare them one at a time against the new sheet that I created.  It ends up working, but it's really not repeatable, or consistent, and it takes forever.  I'm looking for a more elegant solution.  It doesn't matter to me if it's in Excel, Access, VBA, SQL, or whatever.  I just need some way to do it better than I've been doing.
0
Comment
Question by:SarahDaisy8
11 Comments
 
LVL 16

Expert Comment

by:DcpKing
Comment Utility
What do you have for tools?

If SQL Server, Access, or something that accepts SQL commands, you could import and keep the two columns from the Excel file, and also import the text file (is it a "dir >> textfile.txt ?) into another table. If that's possible then it's simple to write a query like

select * from DirectoryListTable
where FileNameField in (select FileNameField from ExcelTableOne)

to see what filenames are present. Change the "in" to "not in" to see the extras!

hth

Mike
0
 
LVL 34

Assisted Solution

by:PatHartman
PatHartman earned 250 total points
Comment Utility
If you can see the directory from your computer, I would ignore the word document and get the file list directly from windows using FSO (file system object).  Then I would link to the spreadsheet and use an append query to append column 1 to a work table and a second query to append column 2.  You can add an extra column if you need to retain the fact that the name came from column 1 or column 2  Now you have one table with a list of files from the directory and a second table with a list of files from the spreadsheet.  So far all of these steps are repeatable and can be automated.

To do the match, I would use a union query and tag the results
qry1 is an inner join and will return all rows in both tables.
qry2 is a left join and returns rows in the directory list but not the spreadsheet
qry3 is a right join and returns rows in the spreadsheet but not in the directory list (I know you didn't ask for this but you should account for typos)

Then you can make a report that lists and counts the three situations from the union query.
0
 
LVL 17

Accepted Solution

by:
dbaSQL earned 250 total points
Comment Utility
You know, I have done something very similar.  I had an ftp process that ran every day, to upload new files to our system.  The files were triggered by events, so there was no way to schedule, or know when to expect new files.  

I created a 'ProcessedFiles' table that just held FileName and Processed, where Processed was a bit flag, with a DEFAULT of 0.  Every day I would check the target directory for new files.  I don't have the full code on hand right now, but it was basically like the statement below.  Scan the directory, load all filenames to a working table, then compare to ProcessedFiles.  Anything where NOT EXISTS was then uploaded with a SSIS data flow task.  Upon completion, the Processed flag gets changed to 1.

You could use this type of approach to gather your filenames from the directory, and then compare to the spreadsheet.  

-- change this to your target directory path and run it.  you'll see what I mean.
DECLARE @path VARCHAR(100) = 'dir C:\tmp\'   --- your target directory
DECLARE @cmd VARCHAR(1024) = @path + ' /A-D /B'

DECLARE @Files TABLE (Filename VARCHAR(255))
INSERT @Files (Filename)
EXEC master..xp_cmdshell @cmd

-- Look at your file list
SELECT * FROM @Files
0
 
LVL 26

Expert Comment

by:Zberteoc
Comment Utility
One advice: use Python. By far the easiest way to accomplish this kind of tasks. It will be a well spent time to learn and get familiar with it. There is a module to deal with any situation you might come across and they are all FREE!

CSV with Python: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
1. is the file comparison for a single directory's files or is there a tree involved?
2. do you need to compare any of the file attributes (modified date, size, hash) or just the names?
3. do you only need to know the exceptions or something else? (new and missing, for instance)
4. at the end of the process, what is to become of the Excel worksheet contents?
5. how many files are in the directory?
6. how many file names are in the worksheet?
7. is there a directory of files that matches the list in the worksheet?
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 45

Expert Comment

by:aikimark
Comment Utility
If the files from the prior month (time period) exist in a directory, you should also be able to use Robocopy to do the comparison.
For testing purposes, I created two folders "new" and "old", copied some files to each folder and ran the following command.
robocopy /L /X new old

Open in new window

I then deleted a file and added a file in the "new" folder and reran the command.  This was the result.
C:\Users\Mark\Downloads>robocopy /L /X new old

-------------------------------------------------------------------------------
   ROBOCOPY     ::     Robust File Copy for Windows

-------------------------------------------------------------------------------

  Started : Sat Mar 08 17:21:32 2014

   Source : C:\Users\Mark\Downloads\new\
     Dest : C:\Users\Mark\Downloads\old\

    Files : *.*

  Options : *.* /X /L /COPY:DAT /R:1000000 /W:30

------------------------------------------------------------------------------

                           3    C:\Users\Mark\Downloads\new\
          *EXTRA File              84784        fciv.exe
            New File                   0        New Text Document.txt

------------------------------------------------------------------------------

                Total    Copied   Skipped  Mismatch    FAILED    Extras
     Dirs :         1         0         1         0         0         0
    Files :         3         1         2         0         0         1
    Bytes :    47.6 k         0    47.6 k         0         0    82.7 k
    Times :   0:00:00   0:00:00                       0:00:00   0:00:00

    Ended : Sat Mar 08 17:21:32 2014

Open in new window

Note that it reported the new and missing files.
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
If you only have the list in the worksheet, you can most quickly do this comparison in the Excel VBA or VBScript environment using a dictionary object.
Example with parameter for directory.  The list of files is assumed to be in column F, starting on row 2.
Option Explicit

Public Sub Q_28383341(parmFolder As String)
    Dim oDic As Object
    Dim vFiles() As Variant
    Dim strFilename As String
    Dim vItem As Variant
    Set oDic = CreateObject("scripting.dictionary")
    vFiles = ActiveSheet.Range(ActiveSheet.Range("F2"), ActiveSheet.Range("F2").End(xlDown)).Value
    For Each vItem In vFiles
        oDic.Add CStr(vItem), 0
    Next
    strFilename = Dir(parmFolder & "\*.*", vbNormal)
    Do Until Len(strFilename) = 0
        If oDic.exists(strFilename) Then
            oDic(strFilename) = 1
        Else
            Debug.Print "New file: " & strFilename
        End If
        strFilename = Dir
    Loop
    For Each vItem In oDic
        If oDic(vItem) = 0 Then
            Debug.Print "Missing file: " & vItem
        End If
    Next
End Sub

Open in new window

Invoking this in the Immediate Window gives:
Q_28383341 "C:\Users\Mark\Downloads\New"
New file: New Text Document.txt
Missing file: fciv.exe

Open in new window

0
 
LVL 1

Author Comment

by:SarahDaisy8
Comment Utility
Let me expand on my topic to add clarification, and to talk about my difficulties.  Right now, I receive a .txt file that contains files in directories that have been backed up.  The location of the files doesn't matter, just the name of the files, so I strip out only the names of the files from this .txt file and create a table in SQL.  I end up with one column with items named [username].txt.  The [username] portion can come from 2 different variables.  Those being the 2 columns in Excel that I spoke of previously.  The first column in the email address of the user, and the 2nd column is the user's login name.  Therefore, the corresponding .txt file for the user can either be named [email].txt or [login].txt.  Technically a user can have an email and a username, and the .txt file could be named either one, but I wouldn't know this in advance.  So, basically I need to compare the .txt file to see which ones don't match either an email address OR login name.  The queries I've been using aren't working because I'm getting duplicate data, because if there is an email, it will still check the login name, and if it doesn't find one, it will tell me it's missing.  I need to combine login + email into one check against the .txt file name.
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
Please post a sample of both the database table and the text file contents.

This question is in the Excel zone as well as some database zones.  Are these appropriate to the environment of the solution you seek?  If not, we can change the zones.

What do you think of my posted code solution?
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
If you're working in a database environment I would normally import the text file and let the database do the comparison with left and right (outer) joins.
0
 
LVL 1

Author Closing Comment

by:SarahDaisy8
Comment Utility
Thanks.  Between the two different solutions, I came up with something that works for me.
0

Featured Post

Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

Join & Write a Comment

Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
Describes a method of obtaining an object variable to an already running instance of Microsoft Access so that it can be controlled via automation.
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now