Solved

Software / script advice: merge folders from different sources with many identical but sometimes changed files

Posted on 2016-11-14
Last Modified: 2016-11-21
Hi Experts

Background: Customer has been using a folder syncing tool that has not worked for many weeks. The tool was (supposedly) keeping a set of files & folders in sync between several computers. As it hasn't worked for a while, many files were changed on one or more computers without syncing to the other computers, so both versions might need to be kept. Now we need to clean up the mess.

Question: I am looking for a (paid or free) tool that can compare the files in the folders, then merge the folders taking some predetermined actions like
1. If file content (file hash) is the same (even if the date is different), do not copy
2. If the file content is different, rename each file, adding the file owner & last modified date to the file names
3. and (can I dream...), if the files are identical but located in different folders (same file hash, different folder) then ... do xyz (not sure what yet!) ... as we know that some users took it upon themselves to re-organise their version of the files & folders in a more logical manner (Sigh)

Have you used a script / or application that can help us clean this mess?

Alexandre
Question by:Alexandre Michel
9 Comments
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41887456
unison can do this locally or remotely using a variety of protocols such as ftp, ssh, ...
it features a gui ( but you can script it as well ) that will display the changed files and course of action, and you can change the action or ignore files that you want to handle manually.
... but the first sync is one-way so it won't help with an EXISTING mess
https://www.google.fr/search?q=unison+sync&tbm=isch

for the existing mess, you can use md5sum to grab the checksums of all your files on one side, then run the same tool on the other side, erase all files whose md5 is unchanged, move all new files to the first side, and manually handle the remaining files. then set up unison or a similar tool so such a mess does not occur again
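
For machines without md5sum, the checksum step can be approximated in Python. This is a minimal sketch, not the poster's exact method: the function names `file_md5` and `build_manifest` are hypothetical, and it assumes paths are compared relative to each side's root folder.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root (POSIX style) to its MD5 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = file_md5(path)
    return manifest
```

Running `build_manifest` once per computer (or per copied folder tree) gives one dictionary per side, and two sides can then be compared without touching the files again.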
 
LVL 25

Expert Comment

by:NVIT
ID: 41888512
... many files were changed on one or more computers without syncing to the other computers.
... So both versions might need to be kept.
Or more than 2 versions.

Also, can you be certain that supposedly matching files to sync have not been renamed? If not, that makes comparison more difficult. And with multiple computers and users in play... yikes.
 
LVL 4

Author Comment

by:Alexandre Michel
ID: 41888870
Anyone for a practical solution that has worked for them?

Thanks @skullnobrains for the idea, but that is just a start. After I create a table of hashes, then ... what? How do I go through 10s of thousands of files and delete those that have the same hashes?

I looked on http://alternativeto.net/software/md5sums/ and found http://alternativeto.net/software/duplicate-cleaner/. This looks promising and might be able to do what I need to do

 
LVL 27

Expert Comment

by:skullnobrains
ID: 41889850
you can use a script, comm or even grep to get the lines common to both locations ( the checksum lists must use the same relative paths on each side )
grep -Fxf side2.md5sum side1.md5sum > commonfiles.md5sum

you can remove the files with something like the following ( the tr / xargs -0 pair keeps file names containing spaces intact )
cut -c35- commonfiles.md5sum | tr '\n' '\0' | xargs -0 rm --
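
A Python counterpart of the removal step might look like the sketch below. It assumes each side is already described by a dictionary mapping relative path to checksum (however that was produced); the function name `remove_identical` and the `dry_run` switch are hypothetical additions for safety, not something from the commands above.

```python
from pathlib import Path


def remove_identical(side1_manifest, side2_manifest, side2_root, dry_run=True):
    """Delete from side2_root every file whose path and hash also exist on side 1.

    Returns the sorted relative paths that were (or, in a dry run, would be) removed.
    """
    side2_root = Path(side2_root)
    removed = []
    for rel_path, digest in sorted(side2_manifest.items()):
        # Only an exact match on both path and content counts as "already synced".
        if side1_manifest.get(rel_path) == digest:
            if not dry_run:
                (side2_root / rel_path).unlink()
            removed.append(rel_path)
    return removed
```

Leaving `dry_run=True` on the first pass gives a list to review before anything is deleted, which seems prudent on a working copy of customer data.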


you could handle renamed files with md5 in a similar way but that would not work if the files were edited. as pointed out above, no easy way there. if you think you have lots of such files and know the file types, there is probably a way to identify them nevertheless with some kind of fingerprinting technique. i believe it is most likely a waste of time.
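
Spotting renamed or moved (but unedited) files from checksums could be sketched as follows. The function name `find_relocated` is hypothetical, and the inputs are assumed to be dictionaries of relative path to checksum, one per side. It only reports candidates; deciding what to do with them is left open, as in the original question.

```python
from collections import defaultdict


def find_relocated(side1_manifest, side2_manifest):
    """Return (digest, side1_paths, side2_paths) for content found on both sides
    under entirely different paths, i.e. files that were renamed or moved."""

    def by_digest(manifest):
        index = defaultdict(set)
        for rel_path, digest in manifest.items():
            index[digest].add(rel_path)
        return index

    index1, index2 = by_digest(side1_manifest), by_digest(side2_manifest)
    relocated = []
    for digest in sorted(index1.keys() & index2.keys()):
        paths1, paths2 = index1[digest], index2[digest]
        # A shared path means the file is simply identical in place, not relocated.
        if paths1.isdisjoint(paths2):
            relocated.append((digest, sorted(paths1), sorted(paths2)))
    return relocated
```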

and you need to handle the remaining files manually anyway.

if i were you, i'd use md5 first in order to determine the number of
- identical files ( see the grep above )
- files present on one side only ( strip the md5 column so only paths remain, then use grep with the -v flag. do it twice, exchanging side1 and side2 )
- files different on both sides ( remove the identical files from both lists, strip the md5 column, then use the same grep as for identical files )
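
The three counts listed above can be sketched in Python in one pass. The function name `classify` and the result keys are hypothetical; the inputs are assumed to be dictionaries mapping relative path to checksum for each side.

```python
def classify(side1_manifest, side2_manifest):
    """Split paths into identical, different, and present-on-one-side-only groups."""
    paths1, paths2 = set(side1_manifest), set(side2_manifest)
    shared = paths1 & paths2
    return {
        # Same path, same checksum: nothing to merge.
        "identical": sorted(p for p in shared if side1_manifest[p] == side2_manifest[p]),
        # Same path, different checksum: a real conflict to resolve.
        "different": sorted(p for p in shared if side1_manifest[p] != side2_manifest[p]),
        # Path exists on one side only: either a new file or a deletion.
        "only_side1": sorted(paths1 - paths2),
        "only_side2": sorted(paths2 - paths1),
    }
```

Printing `len()` of each group gives a quick idea of how much manual work is left before choosing a policy.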

if you have too many files and grep eats up your memory, you can sort the files and use comm instead. grep should happily handle a few hundred thousand lines and do it in a reasonable amount of time so this might not be required, but that depends on your hardware.

then devise a reasonable way to handle things
you can decide for example to solve conflicts manually if you have a few hundred conflicts, or possibly to keep the most recent copy and back up the other file with an explicit suffix such as ".CONFLICT.locationX.YYMMDD" and let the users know about that policy. btw, rsync is able to apply a similar policy in a single command. tell it to never erase the destination and store the different files with a suffix for example...
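
The "keep the most recent copy, back up the other with a suffix" policy could be sketched like this. Everything named here is hypothetical: the function `merge_with_conflict_backup`, the two label parameters, and the choice to stamp the backup with the older file's own modification date. It assumes the caller has already established that the two files differ, and it does not check whether a backup with the same name already exists.

```python
import shutil
import time
from pathlib import Path


def merge_with_conflict_backup(source, destination, source_label, destination_label):
    """Keep the newer of two differing files at destination and store the older one
    next to it as "<name>.CONFLICT.<label>.<YYMMDD>". Returns the backup path."""
    source, destination = Path(source), Path(destination)
    source_newer = source.stat().st_mtime > destination.stat().st_mtime
    older, label = (destination, destination_label) if source_newer else (source, source_label)
    # Date the backup with the older file's last-modified time (local time zone).
    stamp = time.strftime("%y%m%d", time.localtime(older.stat().st_mtime))
    backup = destination.with_name(f"{destination.name}.CONFLICT.{label}.{stamp}")
    if source_newer:
        destination.rename(backup)          # set the older destination copy aside
        shutil.copy2(source, destination)   # copy2 preserves the modification time
    else:
        shutil.copy2(source, backup)        # destination already holds the newer copy
    return backup
```

Nothing is ever overwritten without a copy being kept, so users can still pick the other version by hand.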
 
LVL 4

Author Comment

by:Alexandre Michel
ID: 41890703
Thanks @skullnobrains
But ... I am sorry if I look ignorant, but are these Microsoft commands? I don't recognize them...?
 
LVL 27

Assisted Solution

by:skullnobrains
skullnobrains earned 150 total points
ID: 41891066
no. i figured your shares were available from non-win stuff. windows does not come with handy tools for such tasks.

the above will work if you install the required tools : md5sum has a number of windows ports or equivalents and diff as well. cygwin would contain all of the above and more.

for basic comparisons, rsync --dry-run could give you an idea of the volume of different files and let you apply some rules. nb rsync is available in windows with a standard installer but afaik, it is actually the linux rsync bundled with pieces of cygwin.

if you're looking for a graphical tool,
http://www.ultraedit.com/products/ultracompare/feature-map/compare_folders.html
http://winmerge.org/?lang=en
... i guess google and try will be faster than asking here, if that is what you expect.
 
LVL 56

Accepted Solution

by:Bill Prew
Bill Prew earned 350 total points
ID: 41891387
I can't solve your problem easily, sorry, but I did want to add a few thoughts in case there is any value to them.

I don't think you will find a clever tool that can easily get things back into shape.  There are too many unknowns in a situation like this to be addressed, and each client copy can pose different and new scenarios.  The time it would take you to identify and codify all of these in scripts or settings is probably longer than just some brute force work.

I would suggest approaching it as organized as possible.  Think about the different scenarios you envision, you mentioned a few in your original post but there are more.  What about deleted files from either the server or client, how should those be handled?  What about new files, and how you differentiate a newly added file on one copy from a deleted file situation (in either case there is an orphan file on one copy of the data that has no match on the other copy).  Come up with all the possible situations you can, and the action plans for each.

Then I would approach it working easiest, or largest expected numbers of files, to hardest.  I would also suggest working each client copy one by one, syncing it with the master, then moving on to the next.  In that approach perhaps eliminating all the exact matching files first from the client copy is the best first step, hopefully getting a lot of "noise" out of the way quickly.  There are tools that can do that pretty easily.

If there are only several clients as you mentioned then I would probably work through a lot of it manually, using a tool like Beyond Compare or something that can visually show you matches and differences based on rules.

http://www.scootersoftware.com/

It can also be scripted, but that takes time to learn and do, so will depend on how many changes have been made and how many client copies you have, etc.

I understand the idea of renaming files when you find them out of sync, but I'd suggest that working through those as they are found may be better than renaming them.  Once you rename, it gets a little harder to use off-the-shelf tools to compare the files.  Whereas something like Beyond Compare can show differences between two files right in the same interface that highlights files that don't match, and you can decide how to merge the changes etc. right there.

I sort of assumed this earlier, but I'll state it explicitly.  My approach would be to assume the server copy is the master, and then one by one sync each set of client copy differences to it.  It might mean you waste a little work, or revisit the same file more than once, but I think it's the best way to approach this, and in line with what I expect your goal is: to get a single unified copy of all the files on the server to move forward with.

~bp
 
LVL 4

Author Closing Comment

by:Alexandre Michel
ID: 41895124
Thanks @skullnobrains & @Bill Prew.
It is Bill's approach that I will take, but skullnobrains gave a couple of hints that I will use too. I will do that over a weekend
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41895878
whatever the approach in the end, you probably had better find a way to remove/identify whatever does not produce a conflict as soon as possible so you can properly evaluate what is left.

i second working on clients and syncing to the server.
but i'm unsure about identifying files changed by multiple clients. it might be simpler to handle all copies at once, or even to mark them or list them somehow and provide enough information to the users so they can collaborate and solve the problem themselves. that would depend on the types of files and how knowledgeable you are regarding said content in order to figure out how to merge the changes.

note that in some cases, scripting is possible ( log files are easy, office documents with tracked changes can be handled by merging, ... ) but it is worth the hassle only if it would solve a good number of your problems.

good luck... it's not going to be easy. feel free to post in this thread if you need a hint to handle whatever specific case.
hopefully most files were changed by a single client and a "client copy wins if newer" policy will do the trick.
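
A "client copy wins if newer" pass could be sketched as below. The function name `client_wins_if_newer` and its `dry_run` switch are hypothetical. One assumption is made on purpose: files that exist only on the client are reported but not copied, since a missing server file may be either a new client file or a deliberate server-side deletion.

```python
import shutil
from pathlib import Path


def client_wins_if_newer(client_root, server_root, dry_run=True):
    """Overwrite server files with strictly newer client copies.

    Returns (updated, missing_on_server) as sorted lists of relative paths.
    Files missing on the server are only reported, never copied.
    """
    client_root, server_root = Path(client_root), Path(server_root)
    updated, missing_on_server = [], []
    for source in sorted(client_root.rglob("*")):
        if not source.is_file():
            continue
        rel_path = source.relative_to(client_root)
        target = server_root / rel_path
        if not target.exists():
            missing_on_server.append(rel_path.as_posix())
        elif source.stat().st_mtime > target.stat().st_mtime:
            if not dry_run:
                shutil.copy2(source, target)  # copy2 keeps the client's timestamp
            updated.append(rel_path.as_posix())
    return updated, missing_on_server
```

Modification times are only trustworthy if the clocks on the machines were roughly in sync, so a dry run and a spot check of the `updated` list are worth doing first.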