  • Status: Solved
  • Priority: Medium
  • Security: Public

Script to find duplicate files by file size rather than by filename

Hi,

oBdA kindly wrote this script to sort picture files by the date the camera took them:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Batch/Q_28271436.html


I've determined now that this leaves me with a dilemma of duplicate files.  Many exist in the same folder so I can't use a filename comparison, so I was thinking of traversing through the folder structure looking for files in the same folder as itself and marking it as a dupe.

I'm not sure how to mark it as a dupe.   I don't like the idea of moving it out of the folder, so I was thinking of just suffixing '_dupe' at the end of the file and before the jpg.


So say two files named 'filea.jpg' and 'filea1.jpg' have the exact same filesize.

The script would rename 'filea1.jpg' to 'filea1_dupe.jpg'.


Is this possible?

Asked by: TheDadCoder

knightEknight commented:
Do these duplicate files also have the same create date/time?

TheDadCoder (Author) commented:
Hi knightEknight,

Yes, they should have.  I would imagine they were once the exact same file on the computer (as opposed to imported twice from camera).

They would have been created as a duplicate through crazy backups and copies of the media folder...

So, the exact same file with same DateCreated timestamp...

knightEknight commented:
I have an article that may be interesting with regard to your previous question about sorting, and I believe it can be modified to flag duplicates as you suggest as well.  In the meantime, please try this over a set of test files from your camera.  If you like the result then I will use this as a basis for a "dupe-flagger" script:

http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/A_268-Rename-files-to-the-file-date.html

TheDadCoder (Author) commented:
Hi knightEknight,

That's an interesting script and complements the previous sorting into folders script i think.

One question, how does your script deal with dupes?

Dupes as in:

1.  The exact same file

2.  but also pictures that were created the same second; such as cameras which take 20-30 continuous pictures, which could mean circa 5 pictures per second.  Or does it go to millisecond?

knightEknight commented:
The script has a little trick for dealing with dupes (with respect to date/time, but not size - at least not yet).  If two or more pictures are taken within one second of one another, it simply increments the seconds, which the script always starts at zero for any given minute.  So unless you take more than 60 pictures a minute, it works!

So, for example, if you took three pictures today at 2:15:33 pm, the files would be named:
  20131022_141501.jpg
  20131022_141502.jpg
  20131022_141503.jpg
respectively.

But, when such dupes are found, I believe I can add code to also check the size, and if they are the same, I'll add "_dupe" in the filename.

TheDadCoder (Author) commented:
Hi knightEknight,

Here's the output to the article script:

ren "C:\Users\admin\Downloads\picstorename\Image (19).JPG" "20002501_112300.JPG"
ren "C:\Users\admin\Downloads\picstorename\DSCF4715.JPG" "20082509_033000.JPG"
ren "C:\Users\admin\Downloads\picstorename\100_1941.JPG" "20091912_071900.JPG"
ren "C:\Users\admin\Downloads\picstorename\050112 001.JPG" "20111310_062200.JPG"
ren "C:\Users\admin\Downloads\picstorename\050112 139.JPG" "20120501_023400.JPG"
ren "C:\Users\admin\Downloads\picstorename\same filename different date\2004\100_0689.JPG" "20101804_051700.JPG"
ren "C:\Users\admin\Downloads\picstorename\same filename different date\2006\100_0689.JPG" "20062701_021800.JPG"

These are the same test files from the previous script with oBda.

Test file 1
'Image (19).JPG' - this file was renamed to: '20002501_112300.JPG'

This is incorrect, as the script isn't taking into account the DateCreated metadata, as oBdA's script does. :)    We found that your script in fact uses the date modified, which isn't necessarily the actual datetime the photo was snapped by the camera.

So oBdA used exiftool to find the DateCreated metadata and parsed that. If it existed he used it; if not, he fell back to the file's date modified (non-metadata).


Are you able to update your script to utilise the meta data, using exiftool?

A note about the last two test files - these are in fact different photos, taken years apart, that have the same sequenced filename [from the camera].  I included these when testing oBdA's script to see what the csv lookup table would do with the same filename, and it filed each one correctly in the correct folder.

knightEknight commented:
Please let me know which, if any, of these commands shows the date(s) you want to use for those two test files:

  dir /tc  100_0689.JPG
  dir /ta  100_0689.JPG
  dir /tw  100_0689.JPG

(I suspect the /tc option will not be right if you run it on a copy of the original file.)

knightEknight commented:
You are right though, the article assumes un-modified files on the camera.  That's how I use it anyway - I mean it's the first thing I do - I run this script on the images before I even take them off the camera, then I touch them up afterwards.

knightEknight commented:
It also assumes the clock in the camera is accurate!  :)

TheDadCoder (Author) commented:
Here's the results:

dir /tc  100_0689.JPG
dir /ta  100_0689.JPG
This returns: 22/10/2013 09:15



dir /tw  100_0689.JPG
This returns 18/04/2010 23:17

This is the date your script renames it to, however, it should be using the date: 31/12/2004 20:09.  The 2004 date is the actual datetime the picture was taken on camera.

TheDadCoder (Author) commented:
I agree, going forward from now I'd imagine the script working as is, but the >100,000 images in the existing library will be erratic i guess. :)

knightEknight commented:
what is the result of this command?

date/t

I may need to adjust the default date format in the script.

TheDadCoder (Author) commented:
date/t returns:
22/10/2013

knightEknight commented:
At least I can do this much, if you will adjust line 18 in the script, it will put the filename in proper yyyymmdd format:

   set fn=!fn:~6,4!!fn:~3,2!!fn:~0,2!_!hour!!fn:~14,2!

I can incorporate the exiftool output, but I am being pulled away at work at the moment and I may not be able to get back to you for a few hours.

In the meantime I have polluted your thread with all this stuff, so if you want to re-post this question new and get a point refund on this one, that would be fine with me.  Now that I have a good idea of what needs to be done, I'll respond either in this question or the new one (if you go that route) when I have something to show off.

TheDadCoder (Author) commented:
Hi knighteknight,

I've updated the script with the new line18 and that works a treat, thanks.

knightEknight commented:
After looking again at this, I think what you are asking for is a way to identify duplicate files that have the same file name and size (and perhaps date), but that exist in different sub-folders, correct?

If so, then the article script is not a good basis on which to build a solution for this, because it generally works on one distinct filename at a time.

However, I'm working on another approach that may do just as well.

wilcoxon commented:
Rather than using filesize and/or date/time to detect duplicates, have you considered using the MD5 hash of the file (or other hashing algorithm)?  Filesize will almost certainly result in false positives for "dupe".

knightEknight commented:
Agreed, MD5 is a good alternative, I was using the fc utility on same-sized files in my new solution, but MD5 might be cleaner.  But I will finish with the fc solution first because it requires no third-party software.
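
For reference, an MD5 comparison doesn't strictly need third-party software either: Windows 7 and later include certutil, whose -hashfile verb prints a file's hash. The following is a minimal sketch under that assumption, not part of the scripts in this thread; the variable name hash is just for illustration.

```batch
@echo off
rem Sketch only: print the MD5 hash of the file given as the first parameter.
setlocal
if "%~1"=="" exit /b 1
if not exist "%~1" exit /b 1
set "hash="
rem certutil prints a header line, then the hash, then a status line.
rem Skip the header and keep only the first line after it.
for /f "skip=1 delims=" %%H in ('certutil -hashfile "%~1" MD5') do if not defined hash set "hash=%%H"
if not defined hash exit /b 1
rem Older Windows versions print the hash with spaces between the bytes, so strip them.
set "hash=%hash: =%"
echo %hash%  "%~1"
```

Two files that print the same hash can be treated as duplicates. Checking the file size first, as the fc approach does, stays worthwhile because it avoids hashing files that cannot possibly match.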

knightEknight commented:
Here's a stab at it.  If there is more than one duplicate it will increment a dupe counter in the filename until there are no more dupes of that file.  Let me know how it works for you.

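@echo off
setlocal enabledelayedexpansion

rem Pass a filespec such as *.jpg as the first parameter. It defaults to all files.
set filespec=%1
set dupecount=1

if "%filespec%"=="" set filespec=*

rem List files only, sorted by size, then date, then name, including sub-folders
for /f "delims=" %%F in ('dir/a-d/b/os-dn/s %filespec%') do (

   rem While no dupe is pending, the file from the previous pass becomes the candidate original
   if !dupecount! equ 1 (
      set pFN=!FN!
      set pFD=!FD!
      set pNAME=!NAME!
      set pEXT=!EXT!
      set pSIZE=!SIZE!
   )

   rem Details of the current file
   set FN=%%F
   set FD=%%~tF
   set NAME=%%~nF
   set EXT=%%~xF
   set SIZE=%%~zF

   rem Same size as the candidate original, so confirm with a binary compare
   if defined pSIZE if !pSIZE! equ !SIZE! (
      fc /B /LN=1 "!FN!" "!pFN!" 1>nul 2>nul
      if !errorlevel! equ 0 (
         rem Identical content, so rename the current file after the original plus a dupe counter
         @echo ren "!FN!" "!pNAME!_dupe!dupecount!!pEXT!"
         ren "!FN!" "!pNAME!_dupe!dupecount!!pEXT!"
         set/a dupecount+=1
      )
   ) else (
      rem Different size, so start again with a fresh counter
      set dupecount=1
   )
)

exit/b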

knightEknight commented:
I forgot to mention that this is a post-sorting script to be used after your original script sorts them to separate sub-folders.  Run this script in the parent folder using *.jpg as the filespec parameter.

I suppose it would probably work even before the sorting script is run, but I tested it as if it were after.  It is based first on filesize and then on the results of fc.exe, so it should work either way.

TheDadCoder (Author) commented:
Hi knightEknight,

This looks promising :)    

I did this:

Test 1 (I only did this by mistake because I forgot to copy some test files in there, but thought I'd include the result)
No files in destination folder - gave an error saying no files found.
That's fine and expected.

Test 2
Two unique files
No errors or changes; nothing was written to disk.  Some echoes would be nice lol :)


Test 3
Same unique files, but duplicated one of them.
The unique file was left alone.

The other 2 files are the same file with different filenames but the same size.

+ 'Image (19).JPG'
+ 'Image (19) - green.JPG'

After the script was run:

+ 'Image (19) - green.JPG'
+ 'Image (19) - green_dupe1.JPG'

This looks promising as it picked one up and marked it '_dupe1'.

However, I'm concerned that it lost the original filename of 'one of them'.  They both took on the dupe's filename, adding the 'green' word in this case.

Could one of the files names be left intact, and the dupe be updated with '_dupex'?

Not sure how you're working out which is the dupe when both are identical? :)

Test 4
I left the unique single file in there.

However, I added another duplicate of the existing 2 files, so there are now 3 copies of the same file with different filenames but the same size.

+ 'Image (19).JPG'
+ 'Image (19) - green.JPG'
+ 'Image (19) - blue.JPG'

After the script was run:

+ 'Image (19) - blue.JPG'
+ 'Image (19) - blue_dupe1.JPG'
+ 'Image (19) - blue_dupe2.JPG'

So incrementing the count works nicely in this test; however, we're losing the original filename of each of the files and taking the last duplicate file's filename? :)

I'm not sure how much of an issue this is if I ran your other renaming script to the datetime first or afterwards (not forgetting the metadata issue). However, it feels wrong to me, and each file should keep its filename intact, apart from the added suffix '_dupex'.

What do you think?
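
One way to get the behaviour asked for here is to build the new name from the duplicate's own name rather than from the original's. Below is a minimal sketch of just the rename step, assuming it lives in a separate helper file (the helper and its two parameters are made up for illustration and are not part of knightEknight's script). It only echoes the ren command, so nothing changes on disk until the echo is removed.

```batch
@echo off
rem Hypothetical helper: the first parameter is the duplicate file,
rem the second is an optional counter that defaults to 1.
setlocal
if "%~1"=="" exit /b 1
set "n=%~2"
if not defined n set "n=1"
rem Keep the file's own name and extension and put the suffix in between
echo ren "%~1" "%~n1_dupe%n%%~x1"
```

Called with "Image (19) - green.JPG" and a counter of 1, the echoed command would rename the file to "Image (19) - green_dupe1.JPG" and leave 'Image (19).JPG' untouched.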

TheDadCoder (Author) commented:
Hi, I've just seen your last comment, sorry, missed it whilst testing :)

Taking your advice I ran the rename to date script first, then the dedupe.

This is the result:
(It would have to be run in this order, not dedupe then rename, because the rename would remove the dupe mark on the duplicate files)

Before both scripts were run:

25/09/2008  15:30         2,948,905 DSCF4715 - Copy (2).JPG
25/09/2008  15:30         2,948,905 DSCF4715 - Copy.JPG
25/09/2008  15:30         2,948,905 DSCF4715.JPG
25/01/2000  11:23           177,690 Image (19).JPG

After rename to date script was run:
25/01/2000  11:23           177,690 20000125_112300.JPG
25/09/2008  15:30         2,948,905 20080925_033000.JPG
25/09/2008  15:30         2,948,905 20080925_033001.JPG
25/09/2008  15:30         2,948,905 20080925_033002.JPG

After dedupe script was run:
25/01/2000  11:23           177,690 20000125_112300.JPG
25/09/2008  15:30         2,948,905 20080925_033000.JPG
25/09/2008  15:30         2,948,905 20080925_033001.JPG
25/09/2008  15:30         2,948,905 20080925_033002.JPG

knightEknight commented:
In the third scenario, it all depends on which one you consider the original and which one you consider the dupe.  If  'Image (19) - green.JPG'  is the original, then it worked!  lol

The way it determines which one is the original is purely by sort order, first by size (obviously), then date (earliest first, to be considered the original), then if they are equal then by filename.  The last of these criteria is somewhat arbitrary, but it is why you see the result you do in scenario 3 above.  You can change this specific scenario by changing the order clause on the dir command from this:  os-dn  to this:  os-d-n ... but by doing this, you are fixing one and breaking another.

For example, as is, if the script encounters two duplicates called ABC.jpg and XYZ.jpg with the same file size and date/time, it will use ABC as the original.  But if you make the change in the dir cmd as described above it will use XYZ as the original.  Which is right?  I can't tell by filename alone.

Now, another alternative is to simply tag the latter file with the "dupe" suffix, so you would be left with ABC.jpg and XYZ_dupe.jpg - but I figured you would want to know which file XYZ is a duplicate of.
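
A note on the sort order, going by the built-in help (dir /?): the letters after /o are applied left to right, and a minus sign reverses only the letter that follows it. Read that way, os-dn would mean size (smallest first), then date (newest first), then name, so it may be worth checking on a test folder which copy actually ends up being treated as the original. This is a small sketch for comparing the two orders side by side; it only lists files and changes nothing.

```batch
@echo off
rem Sketch only: compare two sort orders for files of the same size.
rem Size ascending, then date with the minus prefix, then name
dir /a-d /os-dn /tw *.jpg
rem Size ascending, then date without the prefix, then name
dir /a-d /osdn /tw *.jpg
```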

knightEknight commented:
Oh yeah, and per my last comment (ID: 39592462), I think I had it just backwards ... it should be run before the sort script - sorry.  And the script I was talking about was the one from your previous question, not the one from my article.  In other words, I think what you did in your first test was probably correct.

TheDadCoder (Author) commented:
>> Now, another alternative is to simply tag the latter file with the "dupe" suffix, so you would be left with ABC.jpg and XYZ_dupe.jpg - but I figured you would want to know which file XYZ is a duplicate of.

This is a fair point; without that I'd just be left with dupes, and I could just search and delete those... but yes, I agree it'd be nice to know the original filename.

But does the original filename need to change to match the dupe?

Can the dupe be renamed to match the original filename?

Or have I just ignored your fine example from a moment ago?! :)

knightEknight commented:
The script could be made to delete the dupes instead of renaming them, but again, which one it considers the dupe and which one it considers the original depend only on their alphabetical order if they have the same date/time stamp.

TheDadCoder (Author) commented:
Any idea why the renaming script is not placing the correct time of the day:

25/09/2008  15:30         2,948,905 20080925_033000.JPG

datetime is 1530, but filename is 0330, lost 12 hours here?

knightEknight commented:
>> Can the dupe be renamed to match the original filename?

That is what the script does now ... but which one it considers the original is based only on alphabetic order by filename (all else being equal).

TheDadCoder (Author) commented:

>> The script could be made to delete the dupes instead of renaming them, but again, which one it considers the dupe and which one it considers the original depend only on their alphabetical order if they have the same date/time stamp.

I'm hoping to use your renaming script, so this isn't really an issue i think.

Other than the 12hour difference issue, as per previous post (ID: 39592524), is it possible to use the exiftool for the metadata datetime?

knightEknight commented:
>> 20080925_033000.JPG

This is where the system date format has burned me again.  The script looks for the presence of "pm" (which exists in my local date format) and if it doesn't find it, it assumes "am".  I believe I can fix this...

knightEknight commented:
what is the output of this command on your system?

time/t

TheDadCoder (Author) commented:
time/t gives:

22:37

I'm based in UK.

knightEknight commented:
Change to line 12 of the rename script:

   if /i not "!fn:~17,2!"=="AM" set/a hour+=12

knightEknight commented:
wait ... scratch that...

knightEknight commented:
I think you should just comment out (or remove) lines 11 and 12 and see if that works.

TheDadCoder (Author) commented:
I commented out the below two lines, 11 and 12 i believe:
@echo   if !hour! GTR 10 set/a hour=(!hour!-6^) %% 12
@echo   if "!fn:~17,2!"=="PM" set/a hour+=12

This left me with a filename:
20080925_213000.JPG

where it should be 1530

knightEknight commented:
hmmm ... please run this command directly from the command prompt and let me know the output:

for /f "delims=" %F in ('dir/a-d/b/od DSCF4715.jpg') do @echo %~tF

TheDadCoder (Author) commented:
It returns:

25/09/2008 15:30

knightEknight commented:
okay, let's try this change to line 10 (after commenting out 11 & 12):

set/a hour=!fn:~11,2!


This one makes me nervous because I don't know why I was doing that on line 10 to begin with, so I don't know what other ramifications this will have.  The article is 4 years old and I haven't looked at it much since - lol.

TheDadCoder (Author) commented:
That works fine now, see screenshot.

Based on your concern, I'll do some other files to test this further.


See the screenshot: the top file is being named 25 Jan 2000, but was taken on 22 Jan 2000. This is the metadata issue again.
Screen-Shot-2013-10-22-at-23.33..png

knightEknight commented:
understood, now the change I just made won't work in every circumstance (I remembered why on my way home).  I'm sure it's late where you are, so if you will check back in the morning I will hopefully have it wrapped up.

knightEknight commented:
Well, while investigating how to incorporate the meta-data, I discovered that the exiftool will do what the rename script does:

   exiftool "-FileName<CreateDate" -d %Y%m%d_%H%M%S%%-c.%%e .

the last "." represents the current directory, or you can specify a path like C:\dir\sub

If "CreateDate" is blank, you can use "FileCreateDate" instead.

knightEknight commented:
anyway, regarding the rename script, after making the aforementioned change to line 10, and commenting out lines 11 & 12, line 13 is no longer necessary, so comment it out too.  I think that should do it with respect to that script - at least in your timezone.  I need to republish this article with some enhancements to cover things like this.  Either that, or learn to use the exiftool to do the same thing!

Speaking of that, I didn't get to finish that part of the dupe finder last night.

TheDadCoder (Author) commented:
Hi knightEknight, no problem I can imagine you're busy with work and things.

knightEknight commented:
Check this!  I found a neat little tool that will rename .jpg files according to their meta create date:  http://www.sentex.net/~mwandel/jhead/

download here: http://www.sentex.net/~mwandel/jhead/jhead.exe

The command to do the re-naming is simple:
 
jhead.exe  -n%Y%m%d_%H%M%S  *.jpg

I think this will do in one step everything we were trying to do with respect to the file naming.  After you run this, then run the dupe finder script and let me know how it goes!


From the jhead.exe help screen:

DATE / TIME MANIPULATION:
  -ft        Set file modification time to Exif time
  -dsft      Set Exif time to file modification time
  -n[format-string]
             Rename files according to date.  Uses exif date if present, file
             date otherwise.  If the optional format-string is not supplied,
             the format is mmdd-hhmmss.  If a format-string is given, it is
             is passed to the 'strftime' function for formatting
             %d Day of month    %H Hour (24-hour)
             %m Month number    %M Minute    %S Second
             %y Year (2 digit 00 - 99)        %Y Year (4 digit 1980-2036)
             For more arguments, look up the 'strftime' function.
             In addition to strftime format codes:
             '%f' as part of the string will include the original file name
             '%i' will include a sequence number, starting from 1. You can
             You can specify '%03i' for example to get leading zeros.

knightEknight commented:
Based on that very last stuff, it might be prudent to include a sequence number also:

  jhead.exe  -n%Y%m%d_%H%M%S_%03i  *.jpg

TheDadCoder (Author) commented:
ok, thanks. I'll try this.

TheDadCoder (Author) commented:
Hi knightEknight,

So i created a test image folder, with the same images, but one duplicated.


I ran the below command:
jhead.exe  -n%Y%m%d_%H%M%S_%03i  C:\Users\admin\Downloads\jughead\test\*.jpg > output.txt

With the output.txt containing:
File 'C:\Users\admin\Downloads\jughead\test\050112 001.JPG' contains no exif date stamp.  Using file date

Which is fine.

However, the test files' filenames did not change, they stayed the same.

What am i missing?  Is there a test flag in the command?

knightEknight commented:
Not sure.  Intuitively I'd say you did everything right, at least by looking at it.  The only difference I can see (but I don't believe it's significant) is that I ran the command from the same folder, therefore requiring no path on the filespec, and I didn't redirect the output.

Maybe the help screen will shed some light:  jhead.exe -h > jhead.txt


Edit:  The only other difference I notice is that my test files did not have spaces in the name - but I can't believe this would be significant either.

TheDadCoder (Author) commented:
Hi knightEknight,

I got the jhead script working in the end.  It just didn't like being run in a bat file, but works fine as you had it and straight from cmd.

However, it doesn't traverse the directory structure - any idea how to get it to go down into the subfolders?


Thanks,

knightEknight commented:
I didn't realize you were trying to run this in a batch file.  That being the case, you will need to double-up the percent symbols when running it in a batch:

jhead.exe  -n%%Y%%m%%d_%%H%%M%%S  *.jpg


I'm looking for an answer to your question about traversing....

TheDadCoder (Author) commented:
That double up on the % works fine, thanks.

knightEknight commented:
Per the jhead documentation, try this to traverse sub-directories:

jhead.exe  -n%%Y%%m%%d_%%H%%M%%S_%%03i  C:\Users\admin\Downloads\jughead\test\**\*.jpg > output.txt

knightEknight commented:
Do this to get the help text:

  jhead.exe -h > jhead.txt

Then open jhead.txt in notepad and search for "recurse".

TheDadCoder (Author) commented:
That's perfect, thanks knightEknight.

knightEknight commented:
Does the dupe finder script work after the files are all renamed?

TheDadCoder (Author) commented:
Hi - it worked on the smaller test set just fine.

1. move to correct meta data date or file date folder
2. rename based upon jhead exiftool data
3. dedupe marking

So I'm currently part way through testing my entire library.

The first stage worked for a while but then just wouldn't work. The cmd window would just close, with no error in the piped output.txt.  Upon investigation I noticed bracket characters in the filenames it was trying to process.  It seems it doesn't like ( ) and/or ! chars in filenames.

I cleared those out and it worked fine to the end.

Stage 2 I'm currently doing.
The immediate problem with jhead is that it just doesn't like subfolders.  Even though stage 1 only produced 1 level of folders, the sheer number of folders gives it problems.  If you can imagine, stage 1 gives me nearly 1 folder for every day I have taken a picture over the last 13 years or so - many folders!
So I split it down by year, but it still has issues with more recent years where there are >200 folders (days).
So I'm going to have to cut the year folders down again, it seems.


Stage 3 - not started on the entire library yet.   I'll post back soon.
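
The trouble with ! in filenames reported for stage 1 is a common side effect of delayed expansion: while it is enabled, cmd treats ! as a variable marker when it expands %%F, so the character is dropped from the name. A usual workaround is to copy the name into a variable while delayed expansion is off and only switch it on afterwards; quoting the whole assignment also helps with brackets. This is a minimal sketch of that pattern, not a patch for the stage 1 script (which isn't shown in this thread).

```batch
@echo off
setlocal disabledelayedexpansion
for /f "delims=" %%F in ('dir /a-d /b /s *.jpg') do (
   rem Delayed expansion is off here, so any exclamation mark in the name survives
   set "FN=%%F"
   setlocal enabledelayedexpansion
   rem Inside this inner scope the name can be read safely with the delayed syntax
   echo Processing "!FN!"
   endlocal
)
```

The endlocal inside the loop matters: without it every file would open another nested scope, and cmd only allows a limited number of them.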

knightEknight commented:
If jhead doesn't traverse sub-folders well, I suggest using it in a for-loop instead.  Assuming you are running this in a batch file, all % symbols are doubled-up:

for /f "delims=" %%D in ('dir/s/b/ad "C:\parent\folder"') do @echo jhead.exe  -n%%Y%%m%%d_%%H%%M%%S_%%03i  "%%D\*.jpg"

The above command will only echo what it is about to do.  To actually do the work, remove the @echo

TheDadCoder (Author) commented:
Thanks for that, what will it do though?

When jhead.exe crashes out, will the batch file keep running and spawn a new jhead.exe?

If all goes well and it doesn't crash out, will this just loop indefinitely?  How do I know when it's finished?

Thanks,

knightEknight commented:
Assuming jhead crashes only when it is traversing sub-folders (because of the number of files involved), I'm hoping this will prevent that entirely by creating a new instance of jhead for each folder.

The script runs linearly, so it will do one folder at a time [until each folder has been traversed once], so you will see when the script finishes.

You will know if it crashes or otherwise doesn't finish a particular folder because there will be files in it that haven't been renamed yet.

edit: If jhead crashes in one folder, I suspect the script will continue and create a new instance of jhead for the next folder.
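
To make it easier to see where the loop is and whether it finished, the same idea can be written as a block that announces each folder and prints a message at the end. This is only a sketch: the log file name jhead_errors.txt is made up, and it assumes jhead returns a non-zero exit code when it fails, which hasn't been verified here.

```batch
@echo off
setlocal
set "root=C:\parent\folder"
for /f "delims=" %%D in ('dir /s /b /ad "%root%"') do (
   echo Processing "%%D"
   jhead.exe -n%%Y%%m%%d_%%H%%M%%S_%%03i "%%D\*.jpg"
   rem Assumption: a non-zero exit code means jhead had a problem with this folder
   if errorlevel 1 echo Problem in "%%D" >> jhead_errors.txt
)
echo Finished.
```

Note that dir /s /b /ad lists only the sub-folders, so any pictures sitting directly in the root folder would need a separate jhead call.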

TheDadCoder (Author) commented:
ok, thanks I'll report back once I've tried it.

knightEknight commented:
Either way, it wouldn't hurt to run it with the @echo still in it.  :)

TheDadCoder (Author) commented:
This is looking good :)

I'm still processing it through, but have found that the process leaves me with lots of empty folders after the dupes are deleted!

I've posted a new question here:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Batch/Q_28281260.html

:)

TheDadCoder (Author) commented:
Hi knightEknight,

so I have this in a batch file:

for /f %%D in ('dir/s/b/ad "u:\"') do  jhead.exe  -n%%Y%%m%%d_%%H%%M%%S  "%%D\*.*"

Originally it just scanned for .jpg, but it seems I have some png files in the library too.

So I edited it and went with *.*.

However, it seems jhead doesn't support png, at least not with this batch file.

When it comes across a png file it returns:

Not JPEG: C:\media/filename.png



Any ideas how this can be overcome?

knightEknight commented:
Do .png files have metadata in them?
If not, we can revert back to the (modified) rename script in the article I referenced early on in this thread - but it will use the file date.

TheDadCoder (Author) commented:
Hi - I can check the metadata question; however, reviewing the png files, it seems they are all duplicates of a jpg in the same folder (after running sorting script 1).

So I could just delete them via the dedupe script... hmmm, not possible, since the png file is 6MB, whereas the jpg is 1.7MB!


So there's a duplicate png file for some jpg files; not sure why I have png and jpg of the same images, however.

knightEknight commented:
How goes the processing?

TheDadCoder (Author) commented:
Awesome assistance from knightEknight, I couldn't recommend his skill highly enough!

Thanks a lot.

knightEknight commented:
Thanks "Dad"!  :p

I'm still on board here, so let me know how things are going with the processing.

TheDadCoder (Author) commented:
Hi knightEknight,

Thanks again, but I've found problems, on my side :)   I've found tons of PNG files in the same folders as the jpg files.  They are dupes of the jpg files, but the dedupe script won't mark them as dupes as they are different file sizes.

I think I have them because of the samsung camera I use, and my chosen method to get the pics off the camera.

The camera allows me to sync directly from the camera to dropbox, and then dropbox syncs them onto my PC - which is a really great way to get to see them easily, etc.  But as it's also a camera, it saves the pics to the sd card too, as PNG!

I must have emptied the png files into a folder which was then merged in with the jpg files in script 1, the folder sorting.

I've gone through jan - feb 2013 and have matched all the png to jpg, so I'm confident to delete all png in the 2013 year folder (hmmm, delete! :S).

I can then confirm step 3, the dedupe. :)
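
Since the size-based dedupe can't pair a png with its jpg, one option is to pair them by name instead. This is a minimal sketch that assumes each png sits next to a jpg with exactly the same base name, which would only hold before the jpg files are renamed in step 2; the root folder is a made-up example. It only echoes the del commands so the list can be checked first.

```batch
@echo off
rem Sketch only: list every png that has a jpg with the same base name beside it.
setlocal
set "root=U:\2013"
for /r "%root%" %%P in (*.png) do (
   if exist "%%~dpnP.jpg" echo del "%%P"
)
```

Removing the echo would make it delete for real, so it is worth running it as shown first and reading through the output.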

TheDadCoder (Author) commented:
Script-wise, we’re done - all works as needed - thank you knightEknight.


Here’s a quick write up and notes:


Problem to be solved:

>100,000 image files, taken as photos over the last 15 years, not really in great shape.
Many duplicate files, because I’ve performed various backups but then accidentally :( copied them back into the media library as unprocessed files that needed to be added to the main sorted date folders.

Also, how to easily folderize new pictures going forward into a consistent structure and format.


nb.  The scripts work as needed, but they solve the problem as I see it and the way I wanted it solved.  I’m confident not everyone will need/use these files in the same way!

All scripts are in a folder called scripts, which is mapped to w: drive
(all directory mapping to drives is optional)


Solution:
Step 1 - sort image files by date

Script: '1_SortFiles.bat'
This needs ‘exiftool.exe’ to be in the same folder.

Take all images and put into folders, using folder format YYYY-MM-DD

I have two folders in use:
Media_incoming mapped to v: drive
Media_processed mapped to u: drive

I limited the file format search to just jpg, jpeg, and png to only select pictures, and not gifs, or bmp, mov, mp4, tif, pdf, etc.  I managed to root out over 10GB of diskspace which was nice!


Step 2 - rename images file by their date
script: ‘2_RenameFiles_RunMe.bat'
This needs jhead.exe in the same folder.

Rename all pictures taking the date to be the filename


Step 3 - dedupe any duplicates
Script ‘3_MarkDupes_RunMe.bat’, which passes in files into the main script from u: drive.
Main script: '3_MarkDupes_Code.bat'

Mark all duplicates, within the folder, as ‘_dupe’.


Step 4 - delete the duplicates (step 3b?)
I was just going to search for ‘_dupe’ and manually delete them, but I found this file size compare program and decided to use that.

They also have a shareware type program for about $10 that compares images, but I decided not to buy it, and use the file size instead.

This makes step 3 questionably redundant; however I do like step 3, because it gives me confidence when deleting files in step 4 :)  So I will keep using it.


File size program:
http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm


Step 5 - remove empty folders
I then had an issue with dozens of empty folders from the source folder. I can’t just assume they are empty as they may contain other file formats.

So I’m now investigating a way to recursively remove empty folders:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Batch/Q_28281260.html
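
For what it's worth, a common batch idiom for this relies on rd refusing to remove a folder that still has anything in it. The sketch below is not the answer from the linked question, and the root path is a made-up example. It lists all sub-folders, reverses the list so the deepest ones come first, and tries rd on each; folders that still contain files of any format simply fail and are left alone.

```batch
@echo off
rem Sketch only: remove empty folders below a root, deepest first.
setlocal
set "root=C:\parent\folder"
for /f "delims=" %%D in ('dir /s /b /ad "%root%" ^| sort /r') do rd "%%D" 2>nul
```

Processing the deepest folders first means a parent that held nothing but empty sub-folders is itself empty by the time rd reaches it.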


I did try to add all into a master runme.bat, but this failed to run!
runme.bat
1_SortFiles_RunMe.bat

2_RenameFiles_RunMe.bat

3_MarkDupes_RunMe.bat
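
One likely reason the master file stops early: when a batch file starts another batch file by name alone, control passes to it and never comes back, so only the first script runs. Prefixing each line with call returns control to runme.bat after each step. This is a minimal sketch using the file names listed above, assuming all three sit in the same folder as runme.bat.

```batch
@echo off
rem Run each step in turn. call hands control back when the step ends.
call 1_SortFiles_RunMe.bat
call 2_RenameFiles_RunMe.bat
call 3_MarkDupes_RunMe.bat
```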

All files are attached in a zip for reference (except the step 4 program).


and a final nb, this was intended for pictures, but I will use it for movies from cameras as well.

Thanks to oBdA and knightEknight for their time and scripts.

This fixes a problem that i have had for many years!
ee.zip