
Need Script or Macro to Identify and Move Duplicate Files (Based on Size and Date)

Asked by Jerry L (United States of America)
Tags: Linux, Windows OS, Windows 10, Windows Macro
BACKGROUND
I host my domain on a shared HostGator Linux server.
I have about 10,000 emails, of which 20 percent are duplicates.

I have created a cPanel backup,
downloaded the domain.tar.gz file, and
unzipped it to my Windows 10 file system.

The emails I need are in one folder.
Each email is in plain text format.

DUPLICATE FILES ALGORITHM
In Windows 10 File Explorer, when you sort the files by size, suspected duplicates end up next to each other, and their date stamps are usually within one or a few minutes of each other.

When you open two such files and look at the headers, you will see that the Return-Path bounce numbers are identical, e.g., "bounce-mc.us5_12385835.1708185".
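
As a rough sketch of the size-and-date part of this algorithm, the Python below pairs up files that have exactly the same size and modification times within a few minutes of each other. The function name `find_suspect_pairs`, the 300-second window, and the exact-size requirement are assumptions made for illustration; real duplicates may need a looser match.

```python
from pathlib import Path


def find_suspect_pairs(folder, max_seconds=300):
    """Return (path_a, path_b) pairs with equal size and close timestamps.

    Assumption: duplicates have identical file sizes and modification
    times no more than max_seconds apart.
    """
    files = []
    for entry in Path(folder).iterdir():
        if entry.is_file():
            info = entry.stat()
            files.append((info.st_size, info.st_mtime, entry))

    # Same idea as sorting by size in File Explorer: equal sizes become
    # neighbours, and within one size the files are ordered by time.
    files.sort(key=lambda item: (item[0], item[1]))

    pairs = []
    for (size_a, time_a, path_a), (size_b, time_b, path_b) in zip(files, files[1:]):
        if size_a == size_b and (time_b - time_a) <= max_seconds:
            pairs.append((path_a, path_b))
    return pairs
```

Calling `find_suspect_pairs(r"C:\mail\Z")` (a hypothetical path) only lists candidates; it does not move or delete anything, so the result can be reviewed first.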

QUESTION
I am trying to find a script, macro, or automated solution that will help me find and move duplicate text files.

Perhaps a simple workaround would be to identify suspected duplicate files in folder Z (using the above-mentioned algorithm), then move one of them to folder A and the other to folder B. That alone might be sufficient. I can use Beyond Compare to confirm the results.
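
The move step could look something like the following Python sketch. It takes a list of suspected pairs (however they were found) and moves the first file of each pair to folder A and the second to folder B. The name `split_pairs` and the skip rules are assumptions for illustration, not a tested tool.

```python
import shutil
from pathlib import Path


def split_pairs(pairs, folder_a, folder_b):
    """Move the first file of each pair to folder_a, the second to folder_b.

    Returns the list of (new_path_a, new_path_b) that were actually moved.
    """
    folder_a, folder_b = Path(folder_a), Path(folder_b)
    folder_a.mkdir(parents=True, exist_ok=True)
    folder_b.mkdir(parents=True, exist_ok=True)

    moved = []
    for first, second in pairs:
        first, second = Path(first), Path(second)
        # Skip pairs where a file is gone, e.g. it was already moved
        # because it also appeared in an earlier pair.
        if not (first.exists() and second.exists()):
            continue
        dest_a = folder_a / first.name
        dest_b = folder_b / second.name
        # Never overwrite something already sitting in A or B.
        if dest_a.exists() or dest_b.exists():
            continue
        shutil.move(str(first), str(dest_a))
        shutil.move(str(second), str(dest_b))
        moved.append((dest_a, dest_b))
    return moved
```

Because both files keep their original names, folders A and B can then be compared side by side in Beyond Compare.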

Adding a more sophisticated text compare on the headers might give full confirmation.
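
A header comparison along these lines might be sketched with Python's standard `email` package. The regular expression is an assumption based only on the single Return-Path example in this post ("bounce-", then an id, then a hyphen), and the names `bounce_id` and `same_bounce_id` are hypothetical; other senders will likely need a different pattern.

```python
import re
from email.parser import HeaderParser

# Assumed pattern: "<bounce-" + id made of letters, digits, dots and
# underscores, ending at the next hyphen.
BOUNCE_RE = re.compile(r"<(bounce-[A-Za-z0-9._]+)-")


def bounce_id(raw_text):
    """Return the bounce id from the first Return-Path header, or None."""
    headers = HeaderParser().parsestr(raw_text, headersonly=True)
    return_path = str(headers.get("Return-Path", ""))
    match = BOUNCE_RE.search(return_path)
    return match.group(1) if match else None


def same_bounce_id(raw_a, raw_b):
    """True only when both messages have a bounce id and the ids match."""
    id_a, id_b = bounce_id(raw_a), bounce_id(raw_b)
    return id_a is not None and id_a == id_b
```

Each file's text could be loaded with something like `Path(name).read_text(errors="replace")` before being passed in. A matching bounce id is a strong hint rather than proof, so a final check in Beyond Compare still makes sense.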

LINUX
I prefer working on the Windows machine, but if you can only suggest an automated solution in Linux, I could build a Linux machine to accomplish the task.

In case it's relevant, the headers look like this:

Return-Path: <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>
Delivered-To: user-name@domain-name.com
Received: from xyz.mail-server.com
	by xyz.mail-server.com with LMTP id gIXMHuz3yFt+ig8AnVq7BA
	for <user-name@domain-name.com>; Thu, 18 Oct 2018 16:15:24 -0500
Return-path: <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>
Envelope-to: user-name@domain-name.com
Delivery-date: Thu, 18 Oct 2018 16:15:24 -0500
Received: from mail9.atl11.rsgsv.net ([205.201.133.9]:23373)
	by xyz.mail-server.com with esmtp (Exim 4.91)
	(envelope-from <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>)
	id 1gDFdX-004GnL-KP
	for user-name@domain-name.com; Thu, 18 Oct 2018 16:15:24 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=k1; d=mailchimpapp.net;
 h=Subject:From:Reply-To:To:Date:Message-ID:List-ID:List-Unsubscribe:
 Content-Type:MIME-Version;
 bh=pBva/JYKDv2xyI56QPk/T0DeSTahOwekPH4RrgV9vIA=;
 b=I/LU1CavXSDgLl62LRDFde4sCpPEBsO/N9dCEgVb3qNSFwL/6VGDXLvRfcsTxbGRXDvdb15elFsV
   oQbhi1KiDmeVrWsXNwbd6WakaOaixuJ8vyOlQnbKi8xr5kW6QiqXFqjRYUh+Ge3fbFfGOiufe07H
   XPXwMsJJ6WW7n6tKbF4=
Received: from (127.0.0.1) by mail9.atl11.rsgsv.net id hp3rue2ddl4j for <user-name@domain-name.com>; Thu, 18 Oct 2018 21:15:04 +0000 (envelope-from <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>)
Subject: =?utf-8?Q?The=20world=20wobbles=20and=20Baidu=20sits=20pretty?=
From: =?utf-8?Q?PingWest?= <newsletter@pingwest.com>
Reply-To:  <us5-1a4f91c391-3870f368f8@conversation01.mailchimpapp.com>
To: <user-name@domain-name.com>
Date: Thu, 18 Oct 2018 21:15:04 +0000
Message-ID: <87ff9eecfa738064ccd0c1c28.e75d109c61.20181018211449.62efb20649.3abbb3d0@mail9.atl11.rsgsv.net>
X-Mailer: MailChimp Mailer - **CID62efb20649e75d109c61**
X-Campaign: mailchimp87ff9eecfa738064ccd0c1c28.62efb20649
X-campaignid: mailchimp87ff9eecfa738064ccd0c1c28.62efb20649
X-Report-Abuse: Please report abuse for this campaign here: 
X-MC-User: 87ff9eecfa738064ccd0c1c28
Feedback-ID: 12385835:12385835.1708185:us5:mc
List-ID: 87ff9eecfa738064ccd0c1c28mc list <87ff9eecfa738064ccd0c1c28.221565.list-id.mcsv.net>
Precedence: bulk
X-Auto-Response-Suppress: OOF, AutoReply
X-Accounttype: pd
List-Unsubscribe: 
List-Unsubscribe-Post: List-Unsubscribe=One-Click
Content-Type: multipart/alternative; boundary="_----------=_MCPart_1118978084"
MIME-Version: 1.0
X-Spam-Status: No, score=-3.2
X-Spam-Score: -31
X-Spam-Bar: ---
X-Spam-Flag: NO