Spreadsheet Comparison Tools - comparing cell data between spreadsheets

I am trying to do a comparison in Excel between 6 very large spreadsheets (approximately 1 million rows each) to try to find duplicate entries of the same files between two different devices.  Essentially, there's a column for file path, and a column for the file size for each device.  The major problem that I'm encountering is that the file paths are often different.  

For example, one of the entries in Column A may be "2017 backup/To Sort/MT Cleanup/mt/2018-04-08.jpg", and in Column B  it might have a file size of  5177Kb,  In a different cell in Column C there may be an entry for "2017 SummerPhotos/To Sort/MT Cleanup/mt/2018-04-08.jpg" with Column D also showing a file size of  5177Kb.  

Is there a tool that can do a comparison by partial cell name (as in matching the last two or three sections of the path), or do a comparison of the entries by the file size?
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Martin LissOlder than dirtCommented:
Well the following will yield the filename

Dim strParts() As String
strParts = Split(Range("A1"), "/")
MsgBox strParts(UBound(strParts))

Open in new window

and so if you do that and then do a Find on the other sheet you should be able find the duplicates, although it won't be quick.
Martin LissOlder than dirtCommented:
Is it possible that there might be duplicate photo names in the same sheet?
peralotAuthor Commented:
It is very possible- there are a massive number of rows, and there are multiple copies of the same file.  Some of them are the original raw file, some were copied from a different drive, and some were enhanced or touched up before being re-saved in another directory.  The biggest problem is trying to find any variant of the same file!
Big Business Goals? Which KPIs Will Help You

The most successful MSPs rely on metrics – known as key performance indicators (KPIs) – for making informed decisions that help their businesses thrive, rather than just survive. This eBook provides an overview of the most important KPIs used by top MSPs.

Martin LissOlder than dirtCommented:
Can you supply a sample file with a couple of thousand rows that has duplicates? Also I just noticed that when referring to duplicates you say "as in matching the last two or three sections of the path". For any given photo name is there any way to tell if I need to search for say two sections rather than three, or do I always have to do it twice? Or perhaps an explanation of when 2018-04-08.jpg in one sheet would not be a duplicate of 2018-04-08.jpg in the other sheet if both file sizes were the same?
peralotAuthor Commented:
The problem is that the data is across six different spreadsheets.  For example, rows 1 - 10,000 in Column A on spreadsheet 1 may have entries that match rows 100,000 to 110,000 in Column A on spreadsheet 3.
peralotAuthor Commented:
Here's a sample from two of the spreadsheets that show comparable data.
Martin LissOlder than dirtCommented:
What is your goal, a single list of unique partial-path plus file names, or is it to merely indicate (how?) that it's a duplicate?

Can a duplicate be defined as: if the date folder path and the file name and the file sizes are the same?
peralotAuthor Commented:
The goal is (if possible) the latter- to determine which files are duplicates given their relative path and file size.  The biggest hurdle I'e seen is that the relative path is not exact for each set of data, since the duplicates were saved in different subdirectories.
Martin LissOlder than dirtCommented:
How would you like the duplicates to be identified?

Please verify if this is true or not.
Can a duplicate be defined as: if the date folder path and the file name and the file sizes are the same?
Martin LissOlder than dirtCommented:
Two more questions:
  • If I compare sheet1 with sheets 2, 3, 4, 5 and 6, do I then have to compare sheet2 with sheets 3, 4, 5 and 6, etc because there may be images in sheet2 that are not in sheet1?
  • What version of Excel are you using?
peralotAuthor Commented:
Answer 1: Yes, the first 3 spreadsheets are a comparison between device 1 and device 3.  The next 3 spreadsheets are a comparison between device 2 and device 3.

Answer 2: Excel 2016 (Office 365).
Martin LissOlder than dirtCommented:
We're close but you haven't answered two of my questions.

1. Does the same date + the same image name + the same size = a duplicate?
2. How should I indicate that an image is a duplicate?
Martin LissOlder than dirtCommented:
Open the two sample files (at least) and the attached workbook. When that workbook is open and you are in the 'Duplicates' sheet, press Ctrl+Shift+D. The images in column 'A' that are pink are duplicates according to my "date + the same image name + the same size" rule.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
peralotAuthor Commented:
Sorry - I missed those questions.  Also, thank you for all your assistance thus far!

Basically, it's not always the same name - for example, there may be an IMG00068 in one set of images (Like Yellowstone Trip), and there may be an IMG00068 in another set of images (e.g., Grand Canyon).  However, the spreadsheet you provided should hopefully help us have a more effective comparison!

I was originally just highlighting matching sets of files that I found.
Martin LissOlder than dirtCommented:
My code includes the size of the images in the duplicate determination and so it doesn't seem likely to me that two images taken in two different locations will both be the same size, and so they won't be seen as duplicates.

If you are satisfied with the results then please close this question. If you don't know how to do that then here's an explanation.
peralotAuthor Commented:
Thanks again for all the assistance - the spreadsheet provided should at least let us narrow down what is duplicated.
Martin LissOlder than dirtCommented:
You’re welcome and I’m glad I was able to help.

If you expand the “Full Biography” section of my profile you’ll find links to some articles I’ve written that may interest you.

Marty - Microsoft MVP 2009 to 2017
              Experts Exchange Most Valuable Expert (MVE) 2015, 2017
              Experts Exchange Top Expert Visual Basic Classic 2012 to 2017
              Experts Exchange Top Expert VBA (current)
Martin LissOlder than dirtCommented:
You contacted me via an EE message saying that you had a problem with my solution. Please describe it here.
peralotAuthor Commented:
Hi Martin,

I tried using the spreadsheet comparison spreadsheet you provided (29098636.xlsm) for a question I'd posted recently, and it had a Visual Basic error:
Run-time error '9': Subscript out of range.

I did open both the spreadsheet you'd sent (enabling macros), and the two spreadsheets I was trying to compare.

I was just wondering if it's something I'm doing wrong or if it's just not liking the data it's trying to compare?
Martin LissOlder than dirtCommented:
See if this corrects the problem.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.