Trying to find a way to delete duplicate CSV files.

Hello,

I have a list of around 45,000 files. I am trying to find a way to delete the duplicates.
A file would be named like the following:

2day.uk_p443-20171122-1104.csv
2day.uk_p443-20171123-1720.csv

So there could be duplicate files, and I need a way to remove them and keep only one of the files, say by the last modified date.

I am looking for a script that can be run on either Windows or Linux (Ubuntu).

I have tried different duplicate-finder programs that compare hashes or MD5 checksums, but they don't detect these duplicates.
jay broit Asked:

arnold Commented:
Duplicate files, or duplicate entries within the content?
You can use various checksum-related tools.
The checksum approach could be easier: get a digest of each file, group the files that share a digest, and then delete the extras (a rough sketch is below).
diff is a comparison of two files, but that would mean comparing the files to each other pairwise, which does not scale to 45,000 files.
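As an illustration of the digest idea, a minimal Perl sketch that groups the files in one folder by MD5 digest could look like the following; the folder name csv_folder is only a placeholder, and this assumes "duplicate" means byte-identical content:

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $directory = "csv_folder";   # placeholder folder name for this sketch
my %by_digest;                  # digest => list of files with that exact content

opendir my $dh, $directory or die "Cannot open '$directory': $!\n";
for my $file (grep { /\.csv$/ } readdir $dh) {
    open my $fh, '<', "$directory/$file" or die "Cannot read '$file': $!\n";
    binmode $fh;
    push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
    close $fh;
}
closedir $dh;

# Any digest with more than one file means byte-identical content
for my $files (values %by_digest) {
    print "Identical content: @$files\n" if @$files > 1;
}

As noted further down in the thread, checksum comparisons did not catch these particular duplicates, so this only helps when the duplicates really are byte-identical.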
NVIT Commented:
How do you define a duplicate?

Is it a file with the same name that may be in a different folder?

Or is it a file with a different name but the same contents?
jay broit Author Commented:
All the csv files are in the same folder.

Any files for the same domain would have the same contents:
2day.uk_p443-20171122-1104.csv
2day.uk_p443-20171123-1720.csv

So there could be multiple files that look similar but have a slightly different structure, for example:
test.com_p443-2017173456-6540.csv
test.com_p443-2017175435-3401.csv
test.co.uk_p443-201753533-6465.csv
test.co.uk_p443-201753533-6440.csv

The naming hierarchy is the following:
domainname_p443-date-randomnumber.csv
Some files could have a different date (some days + or -) but still be for the same domain name, and the random number at the end would be different.

Files looking something like this are all stored in one folder.

I need a method to detect the files with a pattern match, put them into groups, and then, within each group, keep the file that was modified last and delete the rest.

I have tried those checksum tools and they do not provide any matches; I have used Duplicate Cleaner Pro and it does not return all the duplicates.

arnold Commented:
How do you identify a duplicate?

What is the pattern you want to match?

Perl is a good text-handling scripting language.
The issue is defining the criteria on which you build the logic to analyze the data, and the process that acts on it.

The approach would be to list the directory, build a pattern match on the name, and get each file's creation/modification time using stat() (a rough sketch of that is below).

The current obstacle I see is an undefined set of criteria.
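To make the directory-listing idea concrete, a minimal sketch (again assuming the files sit in a single folder, here called csv_folder as a placeholder) that pairs each filename with its mtime from stat() could look like:

#!/usr/bin/perl
use strict;
use warnings;

my $directory = "csv_folder";   # placeholder folder name for this sketch

opendir my $dh, $directory or die "Cannot open '$directory': $!\n";
for my $file (grep { /\.csv$/ } readdir $dh) {
    # Element 9 of stat() is the last-modified time (mtime)
    my $mtime = (stat "$directory/$file")[9];
    printf "%s  %s\n", scalar localtime($mtime), $file;
}
closedir $dh;

Sorting by that mtime, or simply relying on ls -t as in the scripts further down, gives the newest-to-oldest ordering the rest of the approach depends on.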
jay broit Author Commented:
A duplicate is any file with the same domain name in the full filename, as I only want to keep one file per domain.

So an example of a pattern would be ([a-zA0-9]*.[a-zA-Z]*_p443-[0-9]*-[0-9]*.csv), but then it would have to search further within the matched files.

Up to 2day.uk_p443- the filenames are identical; the only differences would be the date, which could vary by + or - a few days, and the last number, which is a 4-digit random number; the .csv extension is always the same.

So the domain name, as an example, could be 999mitsi.com, 999mitis.co.uk, 9999mitis.co.za, or productionstudios.co.uk; the _p443- will always be the same.

Then, once it finds those files, it puts them into a group, keeps the file in that group with the latest modified date, and deletes the other duplicate files.
arnold Commented:
The simple method, when the contents of the files are of no consequence,
is to use Perl with something like:
#!/usr/bin/perl

my $directory="folder_of_files";
my %Filehash
open DIR, "ls -t $directory" || die "Uanble to open '$directory' for listing:$!\n";
while (<DIR>) {
chomp();
if ( /^([A-Za-z0-9\-]+\.[a-z]+)_p443\-(\d+)\-(\d*)\.csv/ ) 
           $filehash{$1}->{"$2-$3"}=$_;
}
}



One way, since the listing is sorted by date from newest to oldest, is to check whether the domain's hash entry already exists and discard the subsequent, older files.
jay broit Author Commented:
Yes, the problem with that approach is that I am not able to use file hashing, as a lot of the same-domain files have different hashes. I already tested this with several programs; they get rid of some duplicates but not all. For some reason, with some of the filenames I have, even the MD5 checksums are different, so those duplicates won't match.
arnold Commented:
The hash in the Perl example is a hash keyed on the domain name, not a file checksum.

Is the content within the files of any importance?
The point is to list the files by mtime, newest to oldest;
then, if the "domain" already exists in the hash, the file being evaluated can be deleted.
Can you post a listing example?
ls -t
jay broit Author Commented:
The content within the files is all the same; it is just a matter of deleting the duplicates and keeping one of them. As an example listing for just one domain name, I attached a screenshot.

I tried your Perl script but I get errors: "Missing operator before $Filehash?", a syntax error at line 9, and something about an unmatched right curly bracket at line 11.
listingofiles.PNG
arnold Commented:
Let's try this: run the ls -t listing above and pipe it into the Perl script below.
NOTE: the example relies on the data being sorted from newest to oldest.

#!/usr/bin/perl
use strict;
use warnings;

my %hash;
# Read the ls -t listing (newest first) from STDIN
while (<STDIN>) {
    chomp();
    # Capture the domain-name portion of the filename
    if ( /^([A-Za-z0-9\-]+\.[a-z]+).*\.csv$/ ) {
        if ( exists $hash{$1} ) {
            # A newer file for this domain was already seen
            print "$1 already seen, $_ needs to be deleted\n";
            # when ready, uncomment to delete the file:
            # unlink "$_";
        }
        else {
            $hash{$1} = 1;
        }
    }
}




The output of ls -t | perl check_listing.pl:
2day.uk already seen, 2day.uk_20171123-1811.csv needs to be deleted
2day.uk already seen, 2day.uk_20171123-1720.csv needs to be deleted
2day.uk already seen, 2day.uk_20171123-1715.csv needs to be deleted
2day.uk already seen, 2day.uk_20171122-2259.csv needs to be deleted
2day.uk already seen, 2day.uk_20171122-2205.csv needs to be deleted
2day.uk already seen, 2day.uk_20171122-1104.csv needs to be deleted
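Assuming the script above is saved as check_listing.pl and all the CSVs sit in one folder (folder_of_files here is just the placeholder name from the earlier example), once the unlink line is uncommented it could be run for real from inside that folder:

cd folder_of_files
ls -t | perl check_listing.pl

Running it from inside the folder matters because the script unlinks the bare filenames it reads from the listing.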

jay broit Author Commented:
Hello,

Perfect, thanks, just what I needed.

The only question I have is about the script.

Some of my domain names have two dots in them, for example tucows.co.uk.

Would I add another part to the existing regex pattern /^([A-Za-z0-9\-]+\.[a-z]+).*\.csv$/, for example:

/^([A-Za-z0-9\-]+\.[a-z].[a-z]+).*\.csv$/
arnold Commented:
You can change the character class in the first capture group to include a period, which will cover the extra separator:

/^([A-Za-z0-9.\-]+\.[a-z]+).*\.csv$/
The way you have it, any name that is not a three-part domain name (country TLD) will not match.
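A quick way to sanity-check the adjusted pattern is a small throwaway Perl test against both kinds of names; the 2day.uk filename is taken from the question, and the tucows.co.uk one is a hypothetical name in the same format:

#!/usr/bin/perl
use strict;
use warnings;

# Filenames in the format from this question; tucows.co.uk is the
# hypothetical two-dot example mentioned above
my @names = (
    "2day.uk_p443-20171122-1104.csv",
    "tucows.co.uk_p443-20171123-1720.csv",
);

for my $name (@names) {
    if ( $name =~ /^([A-Za-z0-9.\-]+\.[a-z]+).*\.csv$/ ) {
        print "$name  =>  grouped under: $1\n";
    }
    else {
        print "$name  =>  no match\n";
    }
}

Both names should print with the full domain (2day.uk and tucows.co.uk) as the captured group, which is what the grouping key in the main script relies on.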
jay broit Author Commented:
This script works fine, and I was able to detect the duplicate files with it.