Need Script or Macro to Identify and Move Duplicate Files (Based on Size and Date)

Jerry L
I am hosting my domain on a shared Hostgator Linux Server.
I have about 10,000 emails of which 20 percent are duplicates.

I have created a cPanel backup,
downloaded the domain.tar.gz file, and
unzipped it to my Windows 10 file system.

The emails I need are in one folder.
Each email is in plain text format.

In Windows 10 File Explorer, when you sort the files by Size, you can see that the Date Stamps of same-size files are usually within one or a few minutes of each other.

When you open two such files and look at the headers, you will see that the Return-Path bounce numbers are identical, e.g., "bounce-mc.us5_12385835.1708185".

I am trying to find a script, macro, or automated solution that will help me find and move duplicate text files.

Perhaps a simple workaround would be to identify suspected duplicate files that reside in folder Z (using the above-mentioned size and date check), then move one of them to folder A and the other to folder B. That alone might be sufficient. I can use Beyond Compare to confirm results.
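
The workaround described above could be sketched roughly as follows in Python, which runs on both Windows and Linux. This is only a sketch under stated assumptions: the function name `split_suspected_duplicates`, the folder arguments, and the five-minute window are hypothetical, and two files count as suspected duplicates when they have the same size and modification times within that window.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def split_suspected_duplicates(folder_z, folder_a, folder_b, window_seconds=300):
    """Move suspected duplicate pairs out of folder_z: one file to folder_a,
    the other to folder_b. Returns the list of (name_a, name_b) pairs moved.

    Assumption: two files are suspected duplicates when they have the same
    size and their modification times differ by at most window_seconds.
    """
    folder_z, folder_a, folder_b = Path(folder_z), Path(folder_a), Path(folder_b)
    folder_a.mkdir(parents=True, exist_ok=True)
    folder_b.mkdir(parents=True, exist_ok=True)

    # Bucket files by size so only same-size files are ever compared.
    by_size = defaultdict(list)
    for path in folder_z.iterdir():
        if path.is_file():
            info = path.stat()
            by_size[info.st_size].append((info.st_mtime, path))

    pairs = []
    for entries in by_size.values():
        entries.sort(key=lambda e: (e[0], e[1].name))  # oldest first
        i = 0
        while i + 1 < len(entries):
            (time_1, first), (time_2, second) = entries[i], entries[i + 1]
            if time_2 - time_1 <= window_seconds:
                shutil.move(str(first), str(folder_a / first.name))
                shutil.move(str(second), str(folder_b / second.name))
                pairs.append((first.name, second.name))
                i += 2  # both files of the pair are handled
            else:
                i += 1
    return pairs
```

Files left in folder Z afterwards had no same-size neighbour within the window. Size and date alone can produce false positives, so the pairs in folders A and B are only candidates until they are compared.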

Adding a more sophisticated text compare on the headers might give full confirmation.
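
One way such a header comparison might look, sketched in Python with the standard `email` package. The choice of headers in `COMPARE_HEADERS` and the helper names are assumptions for illustration: the idea is to compare only headers that should be the same in every copy of a message, and to skip ones such as Received and Delivery-date that can differ per delivery.

```python
from email.parser import BytesParser

# Assumption: these headers are the same in every delivered copy of one
# message, unlike Received or Delivery-date, which can vary per delivery.
COMPARE_HEADERS = ("Return-Path", "Message-ID", "Date", "Subject", "Feedback-ID")


def header_fingerprint(path, names=COMPARE_HEADERS):
    """Return the selected header values of one email file as a tuple."""
    with open(path, "rb") as handle:
        message = BytesParser().parse(handle, headersonly=True)
    # get() is case-insensitive and returns the first matching header, or None.
    # Whitespace is collapsed so folded header lines compare equal.
    return tuple(" ".join(str(message.get(name) or "").split()) for name in names)


def headers_match(path_1, path_2, names=COMPARE_HEADERS):
    """True when the selected headers are identical and not all empty."""
    print_1 = header_fingerprint(path_1, names)
    print_2 = header_fingerprint(path_2, names)
    return any(print_1) and print_1 == print_2
```

Calling `headers_match(file_in_a, file_in_b)` on each suspected pair would then confirm or reject it without opening the files by hand.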

I prefer working on the Windows machine, but if you can only suggest an automated solution in Linux, I could build a Linux machine to accomplish the task.

In case it's relevant, the headers look like this:

Return-Path: <>
Received: from
	by with LMTP id gIXMHuz3yFt+ig8AnVq7BA
	for <>; Thu, 18 Oct 2018 16:15:24 -0500
Return-path: <>
Delivery-date: Thu, 18 Oct 2018 16:15:24 -0500
Received: from ([]:23373)
	by with esmtp (Exim 4.91)
	(envelope-from <>)
	id 1gDFdX-004GnL-KP
	for; Thu, 18 Oct 2018 16:15:24 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=k1;;
Received: from ( by id hp3rue2ddl4j for <>; Thu, 18 Oct 2018 21:15:04 +0000 (envelope-from <>)
Subject: =?utf-8?Q?The=20world=20wobbles=20and=20Baidu=20sits=20pretty?=
From: =?utf-8?Q?PingWest?= <>
Reply-To:  <>
To: <>
Date: Thu, 18 Oct 2018 21:15:04 +0000
Message-ID: <>
X-Mailer: MailChimp Mailer - **CID62efb20649e75d109c61**
X-Campaign: mailchimp87ff9eecfa738064ccd0c1c28.62efb20649
X-campaignid: mailchimp87ff9eecfa738064ccd0c1c28.62efb20649
X-Report-Abuse: Please report abuse for this campaign here: 
X-MC-User: 87ff9eecfa738064ccd0c1c28
Feedback-ID: 12385835:12385835.1708185:us5:mc
List-ID: 87ff9eecfa738064ccd0c1c28mc list <>
Precedence: bulk
X-Auto-Response-Suppress: OOF, AutoReply
X-Accounttype: pd
List-Unsubscribe-Post: List-Unsubscribe=One-Click
Content-Type: multipart/alternative; boundary="_----------=_MCPart_1118978084"
MIME-Version: 1.0
X-Spam-Status: No, score=-3.2
X-Spam-Score: -31
X-Spam-Bar: ---
X-Spam-Flag: NO

Hi Jerry, given the complexity of the question, I think this should be a paid project. However, here's a starting point with the logic:

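# Folder that holds the email files (placeholder; set this yourself)
$path = "path where the files are"
# Folder that receives suspected duplicates (placeholder; must already exist)
$dupPath = "path where the duplicates should go"

# Return the bounce number from the first Return-Path header of a file,
# e.g. "12385835.1708185" from "bounce-mc.us5_12385835.1708185-..."
function Get-Line {
    param([string]$File)
    $line = Get-Content -LiteralPath $File |
        Where-Object { $_ -like "Return-Path:*" } |
        Select-Object -First 1
    if ($line -and $line.Contains('_')) {
        return $line.Split('_')[1].Split('-')[0]
    }
}

$allfiles = @(Get-ChildItem -Path $path -File -Recurse)
$moved = @{}   # full names of files that were already moved

for ($i = 0; $i -lt $allfiles.Count; $i++) {
    if ($moved.ContainsKey($allfiles[$i].FullName)) { continue }
    $line1 = Get-Line $allfiles[$i].FullName
    if (-not $line1) { continue }
    for ($j = $i + 1; $j -lt $allfiles.Count; $j++) {
        if ($moved.ContainsKey($allfiles[$j].FullName)) { continue }
        # Compare files i and j; if the sizes differ, skip the pair
        if ($allfiles[$i].Length -ne $allfiles[$j].Length) { continue }
        # Otherwise check the line of interest for both
        $line2 = Get-Line $allfiles[$j].FullName
        # If both match, treat it as the same email and do the move
        if ($line1 -eq $line2) {
            Move-Item -LiteralPath $allfiles[$j].FullName -Destination $dupPath
            $moved[$allfiles[$j].FullName] = $true
        }
    }
}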
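
The nested loops above compare every file with every other file, which for 10,000 files is roughly 50 million pairs. As a possible alternative, here is a single-pass sketch in Python (not PowerShell) that groups files by size and bounce number and moves all but one file of each group. The function names, the non-recursive folder scan, and the exact layout of the Return-Path address after the bounce number are assumptions.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Matches the bounce number in a Return-Path line, e.g. "12385835.1708185"
# in "Return-Path: <bounce-mc.us5_12385835.1708185-...>". The address layout
# after the number is an assumption.
BOUNCE_RE = re.compile(rb"^return-path:.*?_(\d+\.\d+)", re.IGNORECASE)


def bounce_number(path):
    """Return the bounce number from the first matching Return-Path line, or None."""
    with open(path, "rb") as handle:
        for line in handle:
            if not line.strip():
                break  # blank line: end of the headers
            found = BOUNCE_RE.match(line)
            if found:
                return found.group(1).decode("ascii")
    return None


def move_duplicates(source, destination):
    """Keep one file per (size, bounce number); move the rest to destination."""
    source, destination = Path(source), Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    groups = defaultdict(list)
    for path in sorted(source.iterdir()):
        if path.is_file():
            number = bounce_number(path)
            if number is not None:
                groups[(path.stat().st_size, number)].append(path)
    moved = []
    for paths in groups.values():
        for extra in paths[1:]:  # the first file of each group stays put
            shutil.move(str(extra), str(destination / extra.name))
            moved.append(extra.name)
    return moved
```

Files without a recognisable bounce number are left untouched, so they would still need a manual check.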


Jerry L, Operations Manager


Thank you for looking into this for me.
I'm going to look for ready-built solutions, such as MapiLab Duplicate Remover.
