jay bro
 asked on

Trying to find a way to delete duplicate CSV files.

Hello,

I have a list of around 45,000 files, and I am trying to find a way to delete the duplicates.
A file would be named like the following:

2day.uk_p443-20171122-1104.csv
2day.uk_p443-20171123-1720.csv

So there could be duplicate files, and I need a way to remove them and keep only one file from each set, say by the last modified date.

I am looking for a script that can run on either Windows or Linux (Ubuntu).

I have tried different duplicate-finder programs that compare hashes (MD5, etc.), but they do not detect these duplicates.

arnold

Duplicate files, or duplicate entries within the content?
You can use various checksum-related tools.
The checksum approach could be easier: get a digest of each file, then delete the extras.
diff compares files, but that would mean comparing every file against every other, which gets expensive quickly.
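For what it's worth, a minimal untested sketch of the digest idea in Perl (the folder name "files" is a placeholder):

#!/usr/bin/perl
# Group files by MD5 digest; files sharing a digest are byte-identical duplicates.
use strict;
use warnings;
use Digest::MD5;

my $directory = "files";   # placeholder folder
my %seen;                  # digest => first file seen with that digest

opendir my $dh, $directory or die "Unable to open '$directory': $!\n";
for my $name (sort readdir $dh) {
    next unless $name =~ /\.csv$/;
    open my $fh, '<', "$directory/$name" or die "Cannot read '$name': $!\n";
    binmode $fh;   # byte-exact hashing, also on Windows
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    if (exists $seen{$digest}) {
        print "duplicate of $seen{$digest}: $name\n";   # candidate for deletion
    } else {
        $seen{$digest} = $name;
    }
}
closedir $dh;

Note this only catches byte-identical files; if the contents differ at all, the digests will differ.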
NVIT

How do you define a duplicate?

Is it a same-named file that may be in a different folder?

Or is it a differently named file with the same contents?
jay bro

ASKER
All the CSV files are in the same folder.

Any file for the same domain would have the same contents:
2day.uk_p443-20171122-1104.csv
2day.uk_p443-20171123-1720.csv

So there could be multiple files that look similar but have a slightly different structure, for example:
test.com_p443-2017173456-6540.csv
test.com_p443-2017175435-3401.csv
test.co.uk_p443-201753533-6465.csv
test.co.uk_p443-201753533-6440.csv

The naming hierarchy is the following:
domainname_p443-date-randomnumber.csv
Some files could have a different date (a few days + or -) but still belong to the same domain name, and the random number at the end will be different.

The files all look something like this and are stored in one folder.

I need a method to detect the files using a pattern match, put them into groups, and then delete the files in each group except the one that was modified last.

I have tried those checksum tools and they do not find the matches; I used Duplicate Cleaner Pro and it does not return all the duplicates.
arnold

How do you identify a duplicate?

What is the pattern you want to match?

Perl is a good text-handling scripting language.
The issue is defining the criteria on which you build the logic to analyze the data, and the process for acting on it.

Listing the directory, then building a pattern from the name; file creation/modification time via stat()...

The current obstacle I see is undefined criteria.
jay bro

ASKER
A duplicate is any file with the same domain name in the full filename; I only want to keep one of those files.

So an example of a pattern would be ([a-zA-Z0-9]*\.[a-zA-Z]*_p443-[0-9]*-[0-9]*\.csv), but then it would have to look further into the filenames.

Up to 2day.uk_p443- the filename is fixed; after that, the only differences are the date, which could be + or - a few days, and the last number, which is a 4-digit random number; the .csv extension is always the same.

So the domain name, as an example, could be 999mitsi.com, 999mitis.co.uk, 9999mitis.co.za, or productionstudios.co.uk; the _p443- part will always be the same.

Then once it finds those files, it should put them into a group, search the group for the file with the latest modified date to keep, and delete the other duplicate files.
arnold

The simple method, when the contents of the files are of no consequence, is to use Perl, along these lines:
#!/usr/bin/perl
use strict;
use warnings;

my $directory = "folder_of_files";
my %filehash;

# Pipe from ls -t so the files arrive newest-first by modification time.
open my $dir, '-|', "ls -t $directory"
    or die "Unable to open '$directory' for listing: $!\n";
while (<$dir>) {
    chomp;
    if ( /^([A-Za-z0-9\-]+\.[a-z]+)_p443\-(\d+)\-(\d*)\.csv/ ) {
        $filehash{$1}->{"$2-$3"} = $_;
    }
}
close $dir;



One way, since the listing is sorted by date newest to oldest: check whether the domain key already exists in the hash and, if it does, discard (delete) the file being evaluated.
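Putting that together, a sketch of the approach (untested; folder_of_files is a placeholder, and the actual delete is commented out so you can dry-run it first):

#!/usr/bin/perl
# Keep the newest file per domain; report (or delete) the rest.
use strict;
use warnings;

my $directory = "folder_of_files";   # placeholder folder
my %seen;                            # domain => file being kept

# ls -t lists newest-first, so the first file seen per domain is the keeper.
open my $dir, '-|', "ls -t $directory"
    or die "Unable to list '$directory': $!\n";
while (my $file = <$dir>) {
    chomp $file;
    next unless $file =~ /^([A-Za-z0-9\-]+\.[a-z]+).*\.csv$/;
    my $domain = $1;
    if (exists $seen{$domain}) {
        print "would delete: $file (keeping $seen{$domain})\n";
        # unlink "$directory/$file" or warn "Cannot delete '$file': $!\n";
    } else {
        $seen{$domain} = $file;
    }
}
close $dir;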
jay bro

ASKER
Yes, the problem with that approach is that I am not able to use file hashing, as many of the domain files have different hashes. I already tested this with several programs; they get rid of some duplicates, but not all. For some reason, some of the files, even when I compare their MD5 checksums, are different, so the tools will not match them as duplicates.
arnold

The hash in the perl example is keyed on the domain name.

Is the content within the files of any importance?
The point being: list the files based on mtime, newest to oldest;
then, if the "domain" key already exists in the hash, the file being evaluated can be deleted.
Can you post a listing example?
ls -t
jay bro

ASKER
The content within the files is all the same; it is just a matter of deleting the duplicates and keeping one of them. As an example of a listing for one domain name, I attached a screenshot.

I tried your perl script but I get errors (Missing operator before $Filehash? Syntax error at line 9, and something about an unmatched right curly bracket at line 11).
listingofiles.PNG
ASKER CERTIFIED SOLUTION
arnold

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
jay bro

ASKER
Hello,

Perfect, thanks, that is just what I needed.

The only question I have is with the script.

Some of my domain names contain two dots, for example tucows.co.uk. Would I add another regex pattern to the existing one, /^([A-Za-z0-9\-]+\.[a-z]+).*\.csv$/, which you have in the example? Something like:

/^([A-Za-z0-9\-]+\.[a-z].[a-z]+).*\.csv$/
arnold

You can change the first pattern match to include a period inside the character class, which will cover the extra separator:

/^([A-Za-z0-9.\-]+\.[a-z]+).*\.csv$/

The way you have it, if the name is not a three-part domain name (country TLD), it will not match.
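A quick way to verify the change (illustrative only):

#!/usr/bin/perl
# The dot inside the character class lets the pattern capture both
# two-part and three-part domain names.
use strict;
use warnings;

for my $file ("2day.uk_p443-20171122-1104.csv",
              "tucows.co.uk_p443-20171123-1720.csv") {
    if ($file =~ /^([A-Za-z0-9.\-]+\.[a-z]+).*\.csv$/) {
        print "$file -> domain: $1\n";
    } else {
        print "$file -> no match\n";
    }
}

This prints 2day.uk for the first name and tucows.co.uk for the second.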
jay bro

ASKER
This script works fine and I was able to detect duplicate files with it.