Need Script or Macro to Identify and Move Duplicate Files (Based on Size and Date)

Jerry L
BACKGROUND
I am hosting my domain on a shared Hostgator Linux Server.
I have about 10,000 emails of which 20 percent are duplicates.

I have created a cPanel backup,
downloaded the domain.tar.gz file, and
unzipped it to my Windows 10 file system.

The emails I need are in one folder.
Each email is in plain text format.

DUPLICATE FILES ALGORITHM
In Windows 10 File Explorer, when you sort the files by Size, suspected duplicates end up next to each other, and their date stamps are usually within a minute (or a few minutes) of each other.

When you open two such files and look at the headers, the Return-Path bounce numbers are identical, e.g., "bounce-mc.us5_12385835.1708185".
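The algorithm described above could be sketched in Python, which runs on both Windows and Linux. This is only a sketch under assumptions that are not confirmed by the question: that duplicates have exactly the same size, that the bounce number always follows the Mailchimp-style pattern shown in the example, and that a 300-second window is a reasonable reading of "a few minutes". The names bounce_number, find_suspected_duplicates, and max_seconds are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches the bounce number in a header such as
# "Return-Path: <bounce-mc.us5_12385835.1708185-user-name=...>"
# (assumption: every bounce address follows this Mailchimp-style pattern).
BOUNCE_RE = re.compile(r"^Return-Path:\s*<bounce-[^_>]*_([0-9.]+)-", re.IGNORECASE)


def bounce_number(path):
    """Return the bounce number from the first Return-Path header, or None."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.strip():  # blank line marks the end of the headers
                break
            match = BOUNCE_RE.match(line)
            if match:
                return match.group(1)
    return None


def find_suspected_duplicates(folder, max_seconds=300):
    """Group files that share size and bounce number and whose modification
    times lie within max_seconds of the earliest file in the group."""
    by_key = defaultdict(list)
    for path in Path(folder).iterdir():
        if path.is_file():
            number = bounce_number(path)
            if number is not None:  # files without a bounce number are skipped
                by_key[(path.stat().st_size, number)].append(path)

    groups = []
    for paths in by_key.values():
        paths.sort(key=lambda p: (p.stat().st_mtime, p.name))
        first = paths[0]
        close = [p for p in paths
                 if p.stat().st_mtime - first.stat().st_mtime <= max_seconds]
        if len(close) > 1:
            groups.append(close)
    return groups
```

Calling find_suspected_duplicates on the email folder returns a list of groups, each holding the paths of files that look like copies of one message. Emails that carry no bounce number are ignored by this sketch and would need a different key, such as Message-ID.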

QUESTION
I am trying to find a script, macro, or automated solution that will help me find and move duplicate text files.

Perhaps a simple workaround would be to identify suspected duplicate files in folder Z (using the above-mentioned algorithm), then move one of them to folder A and the other to folder B. That alone might be sufficient. I can use Beyond Compare to confirm the results.
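The folder A / folder B workaround could look roughly like this in Python. It is a sketch, not a finished tool: split_pair and its dry_run parameter are illustrative names, and the function assumes the pair of suspected duplicates has already been identified by some other step. By default it only reports what it would do, so the plan can be reviewed before anything is moved.

```python
import shutil
from pathlib import Path


def split_pair(first, second, folder_a, folder_b, dry_run=True):
    """Move `first` into folder_a and `second` into folder_b.

    With dry_run=True nothing is moved; the planned (source, target)
    pairs are only returned so they can be reviewed first.
    """
    planned = [
        (Path(first), Path(folder_a) / Path(first).name),
        (Path(second), Path(folder_b) / Path(second).name),
    ]
    # Refuse to overwrite a file that already exists in a target folder.
    for _, target in planned:
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
    if not dry_run:
        for source, target in planned:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))
    return planned
```

After a real run, folders A and B hold one file of each suspected pair under the same file name, which is a layout that a folder comparison in Beyond Compare can work through side by side.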

Adding a more sophisticated text compare on the headers might give full confirmation.
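One possible form of that header comparison, sketched in Python with the standard library email parser. The choice of headers in COMPARED_HEADERS is an assumption: these are headers set by the sender, so they should match in true duplicates, while delivery headers such as Received are left out because they can differ between two deliveries of the same message. The names header_fingerprint and same_headers are illustrative.

```python
from email.parser import BytesParser

# Assumption: these sender-set headers are identical in true duplicates.
# Delivery headers such as Received are deliberately not compared.
COMPARED_HEADERS = ("Message-ID", "Subject", "From", "To", "Date")


def header_fingerprint(path):
    """Return the compared header values of one email file as a tuple."""
    with open(path, "rb") as fh:
        # headersonly=True stops parsing at the end of the header block.
        message = BytesParser().parse(fh, headersonly=True)
    values = []
    for name in COMPARED_HEADERS:
        raw = str(message.get(name, ""))
        values.append(" ".join(raw.split()))  # normalize folded whitespace
    return tuple(values)


def same_headers(path_a, path_b):
    """True if both files agree on every header in COMPARED_HEADERS."""
    return header_fingerprint(path_a) == header_fingerprint(path_b)
```

A check like same_headers(file_1, file_2) would only be run on pairs that already match on size and date, so the more expensive parsing is limited to the suspected duplicates.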

LINUX
I prefer working on the Windows machine, but if you can only suggest an automated solution in Linux, I could build a Linux machine to accomplish the task.

In case it's relevant, the headers look like this:

Return-Path: <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>
Delivered-To: user-name@domain-name.com
Received: from xyz.mail-server.com
	by xyz.mail-server.com with LMTP id gIXMHuz3yFt+ig8AnVq7BA
	for <user-name@domain-name.com>; Thu, 18 Oct 2018 16:15:24 -0500
Return-path: <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>
Envelope-to: user-name@domain-name.com
Delivery-date: Thu, 18 Oct 2018 16:15:24 -0500
Received: from mail9.atl11.rsgsv.net ([205.201.133.9]:23373)
	by xyz.mail-server.com with esmtp (Exim 4.91)
	(envelope-from <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>)
	id 1gDFdX-004GnL-KP
	for user-name@domain-name.com; Thu, 18 Oct 2018 16:15:24 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=k1; d=mailchimpapp.net;
 h=Subject:From:Reply-To:To:Date:Message-ID:List-ID:List-Unsubscribe:
 Content-Type:MIME-Version;
 bh=pBva/JYKDv2xyI56QPk/T0DeSTahOwekPH4RrgV9vIA=;
 b=I/LU1CavXSDgLl62LRDFde4sCpPEBsO/N9dCEgVb3qNSFwL/6VGDXLvRfcsTxbGRXDvdb15elFsV
   oQbhi1KiDmeVrWsXNwbd6WakaOaixuJ8vyOlQnbKi8xr5kW6QiqXFqjRYUh+Ge3fbFfGOiufe07H
   XPXwMsJJ6WW7n6tKbF4=
Received: from (127.0.0.1) by mail9.atl11.rsgsv.net id hp3rue2ddl4j for <user-name@domain-name.com>; Thu, 18 Oct 2018 21:15:04 +0000 (envelope-from <bounce-mc.us5_12385835.1708185-user-name=domain-name.com@mail9.atl11.rsgsv.net>)
Subject: =?utf-8?Q?The=20world=20wobbles=20and=20Baidu=20sits=20pretty?=
From: =?utf-8?Q?PingWest?= <newsletter@pingwest.com>
Reply-To:  <us5-1a4f91c391-3870f368f8@conversation01.mailchimpapp.com>
To: <user-name@domain-name.com>
Date: Thu, 18 Oct 2018 21:15:04 +0000
Message-ID: <87ff9eecfa738064ccd0c1c28.e75d109c61.20181018211449.62efb20649.3abbb3d0@mail9.atl11.rsgsv.net>
X-Mailer: MailChimp Mailer - **CID62efb20649e75d109c61**
X-Campaign: mailchimp87ff9eecfa738064ccd0c1c28.62efb20649
X-campaignid: mailchimp87ff9eecfa738064ccd0c1c28.62efb20649
X-Report-Abuse: Please report abuse for this campaign here: 
X-MC-User: 87ff9eecfa738064ccd0c1c28
Feedback-ID: 12385835:12385835.1708185:us5:mc
List-ID: 87ff9eecfa738064ccd0c1c28mc list <87ff9eecfa738064ccd0c1c28.221565.list-id.mcsv.net>
Precedence: bulk
X-Auto-Response-Suppress: OOF, AutoReply
X-Accounttype: pd
List-Unsubscribe: 
List-Unsubscribe-Post: List-Unsubscribe=One-Click
Content-Type: multipart/alternative; boundary="_----------=_MCPart_1118978084"
MIME-Version: 1.0
X-Spam-Status: No, score=-3.2
X-Spam-Score: -31
X-Spam-Bar: ---
X-Spam-Flag: NO

Commented:
Hi Jerry, given the complexity of the question, I think this should be a paid project. However, here's a starting point with the logic:

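# Folder that holds the email files (placeholder; set this to your own folder).
$path = "path where the files are"
# Folder that receives the duplicates (placeholder; it must already exist).
$dupPath = "path where the duplicates go"

# Return the bounce number from the first Return-Path header of a file,
# e.g. "12385835.1708185", or $null if the file has no such header.
# The function is defined before the loop because a script must define
# a function before calling it.
function Get-Line {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true, Position=0)]$file
    )
    $line = Get-Content -LiteralPath $file.FullName |
        Where-Object { $_ -like "Return-Path:*_*" } |
        Select-Object -First 1
    if (-not $line) { return $null }
    return $line.Split('_')[1].Split('-')[0]
}

$allfiles = @(Get-ChildItem -Path $path -File -Recurse)
$moved = @{}   # full paths of files already handled as duplicates

# Compare each file with every later file, so each pair is visited once.
for ($i = 0; $i -lt $allfiles.Count; $i++) {
    if ($moved.ContainsKey($allfiles[$i].FullName)) { continue }
    $line1 = Get-Line $allfiles[$i]
    if (-not $line1) { continue }
    for ($j = $i + 1; $j -lt $allfiles.Count; $j++) {
        if ($moved.ContainsKey($allfiles[$j].FullName)) { continue }
        # Assumption: files of different sizes are different emails; skip them.
        if ($allfiles[$i].Length -ne $allfiles[$j].Length) { continue }
        $line2 = Get-Line $allfiles[$j]
        # Same size and same bounce number: treat file j as the duplicate.
        if ($line1 -eq $line2) {
            # -WhatIf only previews the move; remove it to actually move the file.
            Move-Item -LiteralPath $allfiles[$j].FullName -Destination $dupPath -WhatIf
            $moved[$allfiles[$j].FullName] = $true
        }
    }
}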

Jerry L, Operations Manager (Author)
Commented:
Thank you for looking into this for me.
I'm going to look for ready-built solutions, such as MapiLab Duplicate Remover.
