Solved

Script to find duplicate files by file size instead of filename

Posted on 2013-10-22
Last Modified: 2013-11-01
Hi,

oBdA kindly wrote this script to sort pictures by the date the camera took them:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Batch/Q_28271436.html


I've now found that this leaves me with a dilemma of duplicate files.  Many exist in the same folder under different names, so I can't use a filename comparison. Instead I was thinking of traversing the folder structure, looking for files of the same size within the same folder and marking them as dupes.

I'm not sure how to mark a dupe.   I don't like the idea of moving it out of the folder, so I was thinking of just adding '_dupe' to the end of the filename, before the '.jpg'.


So say two files named 'filea.jpg' and 'filea1.jpg' have the exact same filesize.

The script would rename 'filea1.jpg' to 'filea1_dupe.jpg'.


Is this possible?
Question by:TheDadCoder
73 Comments
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591070
Do these duplicate files also have the same create date/time?
 

Author Comment

by:TheDadCoder
ID: 39591101
Hi knightEknight,

Yes, they should have.  I would imagine they were once the exact same file on the computer (as opposed to imported twice from camera).

They would have been created as a duplicate through crazy backups and copies of the media folder...

So, the exact same file with same DateCreated timestamp...
 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 500 total points
ID: 39591120
I have an article that may be interesting with regard to your previous question about sorting, and I believe it can be modified to flag duplicates as you suggest.  In the meantime, please try it on a set of test files from your camera.  If you like the result, I will use it as the basis for a "dupe-flagger" script:

http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/A_268-Rename-files-to-the-file-date.html
 

Author Comment

by:TheDadCoder
ID: 39591178
Hi knightEknight,

That's an interesting script, and I think it complements the previous sort-into-folders script.

One question, how does your script deal with dupes?

Dupes as in:

1.  The exact same file

2.  but also pictures that were created the same second; such as cameras which take 20-30 continuous pictures, which could mean circa 5 pictures per second.  Or does it go to millisecond?
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591286
The script has a little trick for dealing with dupes (with respect to date/time, but not size - at least not yet).  If two or more pictures are taken within the same minute, it simply increments the seconds, which the script always starts at zero for any given minute.  So unless you take more than 60 pictures a minute, it works!

So, for example, if you took three pictures today at 2:15:33 pm, the files would be named:
  20131022_141500.jpg
  20131022_141501.jpg
  20131022_141502.jpg
respectively.

But, when such dupes are found, I believe I can add code to also check the size, and if they are the same, I'll add "_dupe" in the filename.
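
For anyone who wants to check the naming logic outside cmd, here is a minimal Python sketch of the seconds-increment trick. It is not the article's batch code: the function name unique_names is made up for illustration, and it assumes the counter starts at 00 for each minute, which is what the rename output later in this thread suggests.

```python
from collections import defaultdict
from datetime import datetime

def unique_names(timestamps, ext=".jpg"):
    """Name each file yyyymmdd_hhmmSS, where SS counts up from 00 within a minute."""
    next_slot = defaultdict(int)  # minute prefix -> next free "seconds" value
    names = []
    for ts in sorted(timestamps):
        prefix = ts.strftime("%Y%m%d_%H%M")
        slot = next_slot[prefix]
        if slot > 59:
            # The trick runs out of names after 60 files in one minute.
            raise ValueError("more than 60 files in minute " + prefix)
        names.append(f"{prefix}{slot:02d}{ext}")
        next_slot[prefix] = slot + 1
    return names

# Three pictures taken in the same second on 22 Oct 2013 at 14:15:33
shots = [datetime(2013, 10, 22, 14, 15, 33)] * 3
print(unique_names(shots))
# ['20131022_141500.jpg', '20131022_141501.jpg', '20131022_141502.jpg']
```

The real seconds are deliberately thrown away here, so the resulting names only preserve the order of the files, not the exact capture time.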
 

Author Comment

by:TheDadCoder
ID: 39591366
Hi knightEknight,

Here's the output to the article script:

ren "C:\Users\admin\Downloads\picstorename\Image (19).JPG" "20002501_112300.JPG"
ren "C:\Users\admin\Downloads\picstorename\DSCF4715.JPG" "20082509_033000.JPG"
ren "C:\Users\admin\Downloads\picstorename\100_1941.JPG" "20091912_071900.JPG"
ren "C:\Users\admin\Downloads\picstorename\050112 001.JPG" "20111310_062200.JPG"
ren "C:\Users\admin\Downloads\picstorename\050112 139.JPG" "20120501_023400.JPG"
ren "C:\Users\admin\Downloads\picstorename\same filename different date\2004\100_0689.JPG" "20101804_051700.JPG"
ren "C:\Users\admin\Downloads\picstorename\same filename different date\2006\100_0689.JPG" "20062701_021800.JPG"



These are the same test files from the previous script with oBda.

Test file 1
'Image (19).JPG' - this file was renamed to: '20002501_112300.JPG'

This is incorrect, as the script isn't taking into account the metadata DateCreated, as oBdA's script does. :)    We found that the script in fact uses the date-modified timestamp, which isn't necessarily the actual datetime the photo was snapped by the camera.

oBdA used exiftool to look up and parse the metadata DateCreated. If it existed he used that; if not, he fell back to the file's date modified (non-metadata).


Are you able to update your script to utilise the metadata, using exiftool?

A note about the last two test files - these are in fact different photos, taken years apart, that have the same sequenced filename [from the camera].  I included these in oBdA's tests to see what the csv lookup table would do with the same filename, and it filed each one correctly in the correct folder.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591512
Please let me know which, if any, of these commands shows the date(s) you want to use for those two test files:

  dir /tc  100_0689.JPG
  dir /ta  100_0689.JPG
  dir /tw  100_0689.JPG

(I suspect the /tc option will not be right if you run it on a copy of the original file.)
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591521
You are right though, the article assumes unmodified files on the camera.  That's how I use it anyway - I mean it's the first thing I do - I run this script on the images before I even take them off the camera, then I touch them up afterwards.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591533
It also assumes the clock in the camera is accurate!  :)
 

Author Comment

by:TheDadCoder
ID: 39591542
Here's the results:

dir /tc  100_0689.JPG
dir /ta  100_0689.JPG
This returns: 22/10/2013 09:15



dir /tw  100_0689.JPG
This returns 18/04/2010 23:17

This is the date your script renames it to, however, it should be using the date: 31/12/2004 20:09.  The 2004 date is the actual datetime the picture was taken on camera.
 

Author Comment

by:TheDadCoder
ID: 39591549
I agree, going forward from now I'd imagine the script working as is, but the >100,000 images in the existing library will be erratic i guess. :)
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591552
what is the result of this command?

date/t

I may need to adjust the default date format in the script.
 

Author Comment

by:TheDadCoder
ID: 39591564
date/t returns:
22/10/2013
 
LVL 33

Expert Comment

by:knightEknight
ID: 39591598
At least I can do this much, if you will adjust line 18 in the script, it will put the filename in proper yyyymmdd format:

   set fn=!fn:~6,4!!fn:~3,2!!fn:~0,2!_!hour!!fn:~14,2!

I can incorporate the exiftool output, but I am being pulled away at work at the moment and I may not be able to get back to you for a few hours.

In the meantime I have polluted your thread with all this stuff, so if you want to re-post this question as a new one and get a point refund on this one, that would be fine with me.  Now that I have a good idea of what needs to be done, I'll respond either in this question or the new one (if you go that route) when I have something to show off.
 

Author Comment

by:TheDadCoder
ID: 39591707
Hi knighteknight,

I've updated the script with the new line18 and that works a treat, thanks.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592123
After looking again at this, I think what you are asking for is a way to identify duplicate files that have the same file name and size (and perhaps date), but that exist in different sub-folders, correct?

If so, then the article script is not a good basis on which to build a solution for this, because it generally works on one distinct filename at a time.

However, I'm working another approach that may do just as well.
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39592134
Rather than using filesize and/or date/time to detect duplicates, have you considered using the MD5 hash of the file (or other hashing algorithm)?  Filesize will almost certainly result in false positives for "dupe".
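
A minimal Python sketch of the hash idea, under stated assumptions: the helper names file_digest and find_duplicates are hypothetical, MD5 is used only because it is the algorithm named above, and files are grouped by size first so that only same-sized files are ever hashed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    """Return lists of files under root that share both size and MD5 digest."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            groups[(size, file_digest(path))].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Matching size alone would flag two different photos that happen to be the same length; requiring the digest to match as well removes those false positives.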
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592141
Agreed, MD5 is a good alternative, I was using the fc utility on same-sized files in my new solution, but MD5 might be cleaner.  But I will finish with the fc solution first because it requires no third-party software.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592328
Here's a stab at it.  If there is more than one duplicate it will increment a dupe counter in the filename until there are no more dupes of that file.  Let me know how it works for you.

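@echo off
 setlocal enabledelayedexpansion

 rem Optional argument: a filespec such as *.jpg. Defaults to all files.
 set filespec=%1
 set dupecount=1

 if "%filespec%"=="" set filespec=*

 rem List all files recursively, sorted by size, then date, then name,
 rem so that files of the same size end up next to each other.
 for /f "delims=" %%F in ('dir/a-d/b/os-dn/s %filespec%') do (

   rem Outside a run of dupes, the previous file becomes the candidate original
   if !dupecount! equ 1 (
      set pFN=!FN!
      set pFD=!FD!
      set pNAME=!NAME!
      set pEXT=!EXT!
      set pSIZE=!SIZE!
   )

   rem Details of the current file
   set FN=%%F
   set FD=%%~tF
   set NAME=%%~nF
   set EXT=%%~xF
   set SIZE=%%~zF

   rem Same size as the candidate original: confirm with a binary compare
   if defined pSIZE if !pSIZE! equ !SIZE! (
     fc /B /LN=1 "!FN!" "!pFN!" 1>nul 2>nul
     if !errorlevel! equ 0 (
        @echo ren "!FN!" "!pNAME!_dupe!dupecount!!pEXT!"
        ren "!FN!" "!pNAME!_dupe!dupecount!!pEXT!"
        set/a dupecount+=1
     )
   ) else (
     rem Size changed, so start a new run
     set dupecount=1
   )
 )

 exit/b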


 
LVL 33

Expert Comment

by:knightEknight
ID: 39592462
I forgot to mention that this is a post-sorting script to be used after your original script sorts them to separate sub-folders.  Run this script in the parent folder using *.jpg as the filespec parameter.

I suppose it would probably work even before the sorting script is run, but I tested it as if it were after.  It is based first on filesize and then on the results of fc.exe, so it should work either way.
 

Author Comment

by:TheDadCoder
ID: 39592476
Hi knightEknight,

This looks promising :)    

I did this:

Test 1 (I only did this by mistake because I forgot to copy some test files in there, but thought I'd include the result)
No files in destination folder - gave an error saying no files found.
That's fine and expected.

Test 2
Two unique files
No errors or changes, nothing was written to disk.  Some echoes would be nice lol :)


Test 3
Same unique files, but duplicated one of them.
The unique file was left alone.

The other 2 files, which are the same file but with different filenames, but same size.

+ 'Image (19).JPG'
+ 'Image (19) - green.JPG'

After the script was run:

+ 'Image (19) - green.JPG'
+ 'Image (19) - green_dupe1.JPG'

This looks promising as it picked one up and marked it '_dupe1'.

However, I'm concerned that it lost the original filename of 'one of them'.  They both took on the dupe's filename, adding the 'green' word in this case.

Could one of the files names be left intact, and the dupe be updated with '_dupex'?

Not sure how you're working out which is the dupe when both are identical? :)

Test 4
I left the unique single file in there.

However, I added another duplicate of the existing 2 files, which are the same file but with different filenames, but same size.

+ 'Image (19).JPG'
+ 'Image (19) - green.JPG'
+ 'Image (19) - blue.JPG'

After the script was run:

+ 'Image (19) - blue.JPG'
+ 'Image (19) - blue_dupe1.JPG'
+ 'Image (19) - blue_dupe2.JPG'

So incrementing the count works nicely in this test; however, we're losing the original filename of each file and taking the last duplicate file's filename? :)

I'm not sure how much of an issue this is if I run your other renaming script to the datetime first or afterwards (not forgetting the metadata issue). However, it feels wrong to me, and each file should keep its filename intact, apart from the added suffix '_dupex'.

What do you think?
 

Author Comment

by:TheDadCoder
ID: 39592501
Hi, I've just seen your last comment, sorry I missed it whilst testing :)

Taking your advice I ran the rename-to-date script first, then the dedupe.

This is the result:
(It would have to be run in this order, not dedupe then rename, because the rename would remove the dupe mark on the duplicate files)

Before both scripts were run:

25/09/2008  15:30         2,948,905 DSCF4715 - Copy (2).JPG
25/09/2008  15:30         2,948,905 DSCF4715 - Copy.JPG
25/09/2008  15:30         2,948,905 DSCF4715.JPG
25/01/2000  11:23           177,690 Image (19).JPG



After rename to date script was run:
25/01/2000  11:23           177,690 20000125_112300.JPG
25/09/2008  15:30         2,948,905 20080925_033000.JPG
25/09/2008  15:30         2,948,905 20080925_033001.JPG
25/09/2008  15:30         2,948,905 20080925_033002.JPG



After dedupe script was run:
25/01/2000  11:23           177,690 20000125_112300.JPG
25/09/2008  15:30         2,948,905 20080925_033000.JPG
25/09/2008  15:30         2,948,905 20080925_033001.JPG
25/09/2008  15:30         2,948,905 20080925_033002.JPG


 
LVL 33

Expert Comment

by:knightEknight
ID: 39592503
In the third scenario, it all depends on which one you consider the original and which one you consider the dupe.  If  'Image (19) - green.JPG'  is the original, then it worked!  lol

The way it determines which one is the original is purely by sort order, first by size (obviously), then date (earliest first, to be considered the original), then if they are equal then by filename.  The last of these criteria is somewhat arbitrary, but it is why you see the result you do in scenario 3 above.  You can change this specific scenario by changing the order clause on the dir command from this:  os-dn  to this:  os-d-n ... but by doing this, you are fixing one and breaking another.

For example, as is, if the script encounters two duplicates called ABC.jpg and XYZ.jpg with the same file size and date/time, it will use ABC as the original.  But if you make the change in the dir cmd as described above it will use XYZ as the original.  Which is right?  I can't tell by filename alone.

Now, another alternative is to simply tag the latter file with the "dupe" suffix, so you would be left with ABC.jpg and XYZ_dupe.jpg - but I figured you would want to know which file XYZ is a duplicate of.
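
The "tag only the later file" alternative can be sketched in Python as below. This is a rough illustration under assumptions, not a port of the batch script: mark_dupes is a hypothetical name, it works on a single folder, it treats the first file in (size, modified time, name) order as the original, and it confirms matches with a byte-for-byte compare.

```python
import filecmp
from pathlib import Path

def mark_dupes(folder, dry_run=True):
    """Append _dupe to files that are byte-identical to an earlier file in folder."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    # Sort by size, then oldest modification time, then name. The first file
    # of each identical set is treated as the original (an arbitrary choice).
    files.sort(key=lambda p: (p.stat().st_size, p.stat().st_mtime, p.name))
    originals = []  # files that keep their own name
    renames = []    # (old path, new path) pairs
    for p in files:
        is_dupe = any(o.stat().st_size == p.stat().st_size
                      and filecmp.cmp(o, p, shallow=False)
                      for o in originals)
        if not is_dupe:
            originals.append(p)
            continue
        target = p.with_name(f"{p.stem}_dupe{p.suffix}")
        if target.exists():
            continue  # never overwrite an existing file
        renames.append((p, target))
        if not dry_run:
            p.rename(target)
    return renames
```

With the default dry_run=True nothing is renamed and the function only returns the planned (old, new) pairs, so a call such as mark_dupes(r"C:\test") can be reviewed before running it for real. The trade-off is the one described above: XYZ_dupe.jpg no longer says which file it duplicates.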
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592513
Oh yeah, and per my last comment (ID: 39592462), I think I had it just backwards ... it should be run before the sort script - sorry.  And the script I was talking about was the one from your previous question, not the one from my article.  In other words, I think what you did in your first test was probably correct.
 

Author Comment

by:TheDadCoder
ID: 39592517
>> Now, another alternative is to simply tag the latter file with the "dupe" suffix, so you would be left with ABC.jpg and XYZ_dupe.jpg - but I figured you would want to know which file XYZ is a duplicate of.

This is fair point, without that I'd just be left with dupes, and I could just search and delete those... but yes, i agree it'd be nice to know the original filename.

But does the original filename need to change to match the dupe?

Can the dupe be renamed to match the original filename?

Or have I just ignored your fine example from a moment ago?! :)
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592521
The script could be made to delete the dupes instead of renaming them, but again, which one it considers the dupe and which one it considers the original depend only on their alphabetical order if they have the same date/time stamp.
 

Author Comment

by:TheDadCoder
ID: 39592524
Any idea why the renaming script is not placing the correct time of the day:

25/09/2008  15:30         2,948,905 20080925_033000.JPG



datetime is 1530, but filename is 0330, lost 12 hours here?
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592526
>> Can the dupe be renamed to match the original filename?

That is what the script does now ... but which one it considers the original is based only on alphabetic order by filename (all else being equal).
 

Author Comment

by:TheDadCoder
ID: 39592535

>> The script could be made to delete the dupes instead of renaming them, but again, which one it considers the dupe and which one it considers the original depend only on their alphabetical order if they have the same date/time stamp.

I'm hoping to use your renaming script, so this isn't really an issue i think.

Other than the 12hour difference issue, as per previous post (ID: 39592524), is it possible to use the exiftool for the metadata datetime?
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592538
>> 20080925_033000.JPG

This is where the system date format has burned me again.  The script looks for the presence of "pm" (which exists in my local date format) and if it doesn't find it, it assumes "am".  I believe I can fix this...
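
To illustrate the locale problem, here is a small Python sketch that accepts both styles of timestamp instead of guessing from the presence of "pm". The UK 24-hour format is the one shown in this thread; the US-style 12-hour format is an assumption about what the script's author sees, and parse_file_time is a made-up name.

```python
from datetime import datetime

# Tried in order. An ambiguous date such as 01/02/2008 is read as UK style
# (day first), because that format is listed first.
FORMATS = (
    "%d/%m/%Y %H:%M",     # e.g. 25/09/2008 15:30 (as reported in this thread)
    "%m/%d/%Y %I:%M %p",  # e.g. 09/25/2008 03:30 PM (assumed US-style output)
)

def parse_file_time(text):
    """Parse a file timestamp string in either supported format."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

print(parse_file_time("25/09/2008 15:30").strftime("%Y%m%d_%H%M%S"))
# 20080925_153000
```

Parsing into a real datetime first and formatting afterwards avoids the fixed character offsets that broke when the date format changed.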
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592551
what is the output of this command on your system?

time/t
 

Author Comment

by:TheDadCoder
ID: 39592566
time/t gives:

22:37

I'm based in UK.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592569
Change to line 12 of the rename script:

   if /i not "!fn:~17,2!"=="AM" set/a hour+=12
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592572
wait ... scratch that...
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592575
I think you should just comment out (or remove) lines 11 and 12 and see if that works.
 

Author Comment

by:TheDadCoder
ID: 39592589
I commented out the below two lines, 11 and 12 i believe:
@echo   if !hour! GTR 10 set/a hour=(!hour!-6^) %% 12
@echo   if "!fn:~17,2!"=="PM" set/a hour+=12



This left me with a filename:
20080925_213000.JPG

where it should be 1530

 
LVL 33

Expert Comment

by:knightEknight
ID: 39592597
hmmm ... please run this command directly from the command prompt and let me know the output:

for /f "delims=" %F in ('dir/a-d/b/od DSCF4715.jpg') do @echo %~tF
 

Author Comment

by:TheDadCoder
ID: 39592622
It returns:

25/09/2008 15:30
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592651
okay, let's try this change to line 10:  (after commenting-out 11 & 12)

set/a hour=!fn:~11,2!


This one makes me nervous because I don't know why I was doing that on line 10 to begin with, so I don't know what other ramifications this will have.  The article is 4 years old and I haven't looked at it much since - lol.
 

Author Comment

by:TheDadCoder
ID: 39592716
That works fine now, see screenshot.

Based on your concern, I'll do some other files to test this further.


See the screenshot: the top file is being named 25 Jan 2000, but was taken on 22 Jan 2000. This is the metadata issue again.
Screen-Shot-2013-10-22-at-23.33..png
 
LVL 33

Expert Comment

by:knightEknight
ID: 39592820
understood, now the change I just made won't work in every circumstance (I remembered why on my way home).  I'm sure it's late where you are, so if you will check back in the morning I will hopefully have it wrapped up.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39593046
Well, while investigating how to incorporate the meta-data, I discovered that the exiftool will do what the rename script does:

   exiftool "-FileName<CreateDate" -d %Y%m%d_%H%M%S%%-c.%%e .

the last "." represents the current directory, or you can specify a path like C:\dir\sub

If "CreateDate" is blank, you can use "FileCreateDate" instead.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39594030
anyway, regarding the rename script, after making the aforementioned change to line 10, and commenting out lines 11 & 12, line 13 is no longer necessary, so comment it out too.  I think that should do it with respect to that script - at least in your timezone.  I need to republish this article with some enhancements to cover things like this.  Either that, or learn to use the exiftool to do the same thing!

Speaking of that, I didn't get to finish that part of the dupe finder last night.
 

Author Comment

by:TheDadCoder
ID: 39595016
Hi knightEknight, no problem I can imagine you're busy with work and things.
 
LVL 33

Accepted Solution

by:
knightEknight earned 500 total points
ID: 39596282
Check this!  I found a neat little tool that will rename .jpg files according to their meta create date:  http://www.sentex.net/~mwandel/jhead/

download here: http://www.sentex.net/~mwandel/jhead/jhead.exe

The command to do the re-naming is simple:
 
jhead.exe  -n%Y%m%d_%H%M%S  *.jpg


I think this will do in one step everything we were trying to do with respect to the file naming.  After you run this, then run the dupe finder script and let me know how it goes!


From the jhead.exe help screen:

DATE / TIME MANIPULATION:
  -ft        Set file modification time to Exif time
  -dsft      Set Exif time to file modification time
  -n[format-string]
             Rename files according to date.  Uses exif date if present, file
             date otherwise.  If the optional format-string is not supplied,
             the format is mmdd-hhmmss.  If a format-string is given, it is
             is passed to the 'strftime' function for formatting
             %d Day of month    %H Hour (24-hour)
             %m Month number    %M Minute    %S Second
             %y Year (2 digit 00 - 99)        %Y Year (4 digit 1980-2036)
             For more arguments, look up the 'strftime' function.
             In addition to strftime format codes:
             '%f' as part of the string will include the original file name
             '%i' will include a sequence number, starting from 1. You can
             You can specify '%03i' for example to get leading zeros.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39596289
Based on that very last stuff, it might be prudent to include a sequence number also:

  jhead.exe  -n%Y%m%d_%H%M%S_%03i  *.jpg
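
To see what that format string produces, here is a tiny Python sketch. The date codes are ordinary strftime codes, which Python shares with C; the %03i sequence number is specific to jhead, so it is only emulated here with normal string formatting. The function name jhead_style_name is made up.

```python
from datetime import datetime

def jhead_style_name(taken, seq, ext=".jpg"):
    """Mimic the name produced by -n%Y%m%d_%H%M%S_%03i for one file."""
    # %Y%m%d_%H%M%S are standard strftime codes. The _%03i part is a jhead
    # extension, emulated here as a zero-padded three-digit counter.
    return f"{taken.strftime('%Y%m%d_%H%M%S')}_{seq:03d}{ext}"

# The 2004 photo from earlier in the thread, as the first file in the run
print(jhead_style_name(datetime(2004, 12, 31, 20, 9, 0), 1))
# 20041231_200900_001.jpg
```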
 

Author Comment

by:TheDadCoder
ID: 39596445
ok, thanks. I'll try this.
 

Author Comment

by:TheDadCoder
ID: 39597297
Hi knightEknight,

So i created a test image folder, with the same images, but one duplicated.


I ran the below command:
jhead.exe  -n%Y%m%d_%H%M%S_%03i  C:\Users\admin\Downloads\jughead\test\*.jpg > output.txt



With the output.txt containing:
File 'C:\Users\admin\Downloads\jughead\test\050112 001.JPG' contains no exif date stamp.  Using file date


Which is fine.

However, the test files' filenames did not change, they stayed the same.

What am i missing?  Is there a test flag in the command?
 
LVL 33

Expert Comment

by:knightEknight
ID: 39597409
Not sure.  Intuitively I'd say you did everything right, at least by looking at it.  The only difference I can see (but I don't believe it's significant) is that I ran the command from the same folder, therefore requiring no path on the filespec, and I didn't redirect the output.

Maybe the help screen will shed some light:  jhead.exe -h > jhead.txt


Edit:  The only other difference I notice is that my test files did not have spaces in the name - but I can't believe this would be significant either.
 
LVL 26

Expert Comment

by:arober11
ID: 39602870
 

Author Comment

by:TheDadCoder
ID: 39606698
Hi knightEknight,

I got the jhead script working in the end.  It just didn't like being run in a bat file, but works fine as you had it, straight from cmd.

However, it doesn't traverse the directory structure - any idea how to get it to go down into the subfolders?


Thanks,
 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 500 total points
ID: 39606717
I didn't realize you were trying to run this in a batch file.  That being the case, you will need to double-up the percent symbols when running it in a batch:

jhead.exe  -n%%Y%%m%%d_%%H%%M%%S  *.jpg


I'm looking for an answer to your question about traversing....
 

Author Comment

by:TheDadCoder
ID: 39606726
That double up on the % works fine, thanks.
 
LVL 33

Assisted Solution

by:knightEknight
knightEknight earned 500 total points
ID: 39606731
Per the jhead documentation, try this to traverse sub-directories:

jhead.exe  -n%%Y%%m%%d_%%H%%M%%S_%%03i  C:\Users\admin\Downloads\jughead\test\**\*.jpg > output.txt
 
LVL 33

Expert Comment

by:knightEknight
ID: 39606738
Do this to get the help text:

  jhead.exe -h > jhead.txt

Then open jhead.txt in notepad and search for "recurse".
 

Author Comment

by:TheDadCoder
ID: 39607026
That's perfect, thanks knightEknight.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39610568
Does the dupe finder script work after the files are all renamed?
 

Author Comment

by:TheDadCoder
ID: 39610905
Hi - it worked on the smaller test set just fine.

1. move to correct meta data date or file date folder
2. rename based upon jhead exiftool data
3. dedupe marking

So I'm currently part way through testing my entire library.

The first stage worked for a while but then just stopped working: the cmd window would just close, with no error in the piped output.txt.  Upon investigation I noticed bracket characters in the filenames it was trying to process.  It seems it doesn't like ( ) and/or ! chars in filenames.

I cleared those out and it worked fine to the end.

Stage 2 I'm currently doing.
The immediate problem with jhead is that it just doesn't like subfolders.  Even though stage 1 only produces 1 level of folders, the large number of folders gives it problems.  If you can imagine, stage 1 gives me nearly 1 folder for every day I have taken a picture, for the last 13 years or so - many folders!
So I split it down by year, but it still has issues with more recent years where there are >200 folders (days).
So I'm going to have to cut the year folders down again, it seems.


Stage 3 - not started on the entire library yet.   I'll post back soon.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39611512
If jhead doesn't traverse sub-folders well, I suggest using it in a for-loop instead.  Assuming you are running this in a batch file, all % symbols are doubled-up, and "delims=" keeps folder names that contain spaces in one piece:

for /f "delims=" %%D in ('dir/s/b/ad "C:\parent\folder"') do @echo jhead.exe  -n%%Y%%m%%d_%%H%%M%%S_%%03i  "%%D\*.jpg"

The above command will only echo what it is about to do.  To actually do the work, remove the @echo
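
The same dry-run idea can be sketched in Python, which makes it easy to review the generated command lines before running anything. This is only an illustration under assumptions: jhead_commands is a hypothetical helper, it builds the command text without executing jhead, and unlike the for-loop above it also covers the parent folder and skips folders that contain no .jpg files.

```python
import os

def jhead_commands(parent, fmt="%Y%m%d_%H%M%S_%03i", for_batch_file=False):
    """Build one jhead command line per folder under parent that holds .jpg files."""
    if for_batch_file:
        fmt = fmt.replace("%", "%%")  # inside a .bat file, %% stands for one literal %
    commands = []
    for dirpath, dirnames, filenames in os.walk(parent):
        dirnames.sort()  # visit sub-folders in a predictable order
        if any(name.lower().endswith(".jpg") for name in filenames):
            pattern = os.path.join(dirpath, "*.jpg")
            commands.append(f'jhead.exe -n{fmt} "{pattern}"')
    return commands

# Dry run: print the commands only. The path is the placeholder used above.
for line in jhead_commands(r"C:\parent\folder"):
    print(line)
```

Passing for_batch_file=True doubles the percent signs, so the printed lines can be pasted into a .bat file; the default output is meant for typing at the command prompt.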
 

Author Comment

by:TheDadCoder
ID: 39611519
Thanks for that, what will it do though?

When jhead.exe crashes out, will the batch file keep running and spawn a new jhead.exe?

If all goes well and it doesn't crash out, will this just loop indefinitely?  How do I know when it's finished?

Thanks,
 
LVL 33

Expert Comment

by:knightEknight
ID: 39611544
Assuming jhead crashes only when it is traversing sub-folders (because of the number of files involved), I'm hoping this will prevent that entirely by creating a new instance of jhead for each folder.

The script runs linearly, so it will do one folder at a time [until each folder has been traversed once], so you will see when the script finishes.

You will know if it crashes or otherwise doesn't finish a particular folder because there will be files in it that haven't been renamed yet.

edit: If jhead crashes in one folder, I suspect the script will continue and create a new instance of jhead for the next folder.
 

Author Comment

by:TheDadCoder
ID: 39611562
ok, thanks I'll report back once I've tried it.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39611566
Either way, it wouldn't hurt to run it with the @echo still in it.  :)
 

Author Comment

by:TheDadCoder
ID: 39612065
This is looking good :)

I'm still processing it through, but have found that the process does leave me with lots of empty folders, after the dedupe delete!

I've posted a new question here:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Batch/Q_28281260.html

:)
 

Author Comment

by:TheDadCoder
ID: 39612501
Hi knightEknight,

so I have this in a batch file:

for /f %%D in ('dir/s/b/ad "u:\"') do  jhead.exe  -n%%Y%%m%%d_%%H%%M%%S  "%%D\*.*"



Originally it just scanned for .jpg, but it seems I have some png files in the library too.

So I edited it and went with *.*.

However, it seems jhead doesn't support png, at least not with this batch file.

When it comes across a png file it returns:

Not JPEG: C:\media/filename.png



Any ideas how this can be overcome?
 
LVL 33

Expert Comment

by:knightEknight
ID: 39612515
Do .png files have metadata in them?
If not, we can revert back to the (modified) rename script in the article I referenced early on in this thread - but it will use the file date.
 

Author Comment

by:TheDadCoder
ID: 39612540
Hi - I can check the meta question, however reviewing the png files, it seems they are all duplicate of jpg in the same folder (after running sorting script 1).

So I either just delete them via the dedupe script....hmmm, not possible, since png file is 6MB, whereas the jpg is 1.7MB!.


so it's a duplicate png file for some jpg files, not sure why i have png and jpg of the same images however.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39614852
How goes the processing?
 

Author Closing Comment

by:TheDadCoder
ID: 39616446
Awesome assistance from knightEknight, I couldn't recommend his skill highly enough!

Thanks a lot.
 
LVL 33

Expert Comment

by:knightEknight
ID: 39616618
Thanks "Dad"!  :p

I'm still on board here, so let me know how things are going with the processing.
 

Author Comment

by:TheDadCoder
ID: 39616637
Hi knightEknight,

Thanks again, but I've found problems, on my side :)   I've found tons of PNG files in the same folders as the jpg files.  They are dupes of the jpgs, but the dedupe script won't mark them as dupes because they have different file sizes.

I think I have them because of the Samsung camera I use and my chosen method of getting the pics off the camera.

The camera allows me to sync directly from the camera to Dropbox, and then Dropbox syncs them onto my PC - which is a really great way to get to see them easily, etc.  But the camera also saves the pics to the SD card, as PNG!

I must have emptied the png files into a folder which was then merged in with the jpgs in script 1, the folder sorting.

I've gone through Jan - Feb 2013 and have matched all the png files to jpgs, so I'm confident to delete all png files in the 2013 year folder (hmmm, delete! :S).
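
A possible way to double-check that matching before deleting anything, sketched in Python. It assumes each PNG still shares its base name with the JPG next to it, which this thread does not confirm (jhead renames the jpg files but not the png files), and pngs_with_jpg_twin is a made-up name. It only lists candidates and deletes nothing.

```python
from pathlib import Path

def pngs_with_jpg_twin(folder):
    """Return PNG files in folder that have a same-named JPG/JPEG beside them."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    # Compare base names case-insensitively, e.g. IMG_1.JPG matches img_1.png
    jpg_stems = {p.stem.lower() for p in files
                 if p.suffix.lower() in (".jpg", ".jpeg")}
    return sorted(p for p in files
                  if p.suffix.lower() == ".png" and p.stem.lower() in jpg_stems)
```

Any PNG that is not in the returned list has no JPG twin by name and would need a manual look before it is removed.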

I can then confirm step 3, the dedupe. :)
 

Author Comment

by:TheDadCoder
ID: 39617227
Script-wise, we’re done - all works as needed - thank you knightEknight.


Here’s a quick write up and notes:


Problem to be solved:

>100,000 image files not really in great shape, taken as photos over the last 15 years.
Many duplicate files, because I’ve performed various backups but then accidentally :( copied them back into the media library as unprocessed files that still needed to be added to the main sorted date folders.

Also, how to easily folderize new pictures going forward into a consistent structure and format.


nb.  The scripts work as needed, but they solve the problem as I see it, in the way I wanted it solved.  I’m sure not everyone will need/use these files in the same way!

All scripts are in a folder called scripts, which is mapped to w: drive
(all directory mapping to drives is optional)


Solution:
Step 1 - sort image files by date

Script: '1_SortFiles.bat'
This needs ‘exiftool.exe’ to be in the same folder.

Take all images and put into folders, using folder format YYYY-MM-DD

I have two folders in use:
Media_incoming mapped to v: drive
Media_processed mapped to u: drive

I limited the file format search to just jpg, jpeg, and png to only select pictures, and not gifs, bmp, mov, mp4, tif, pdf, etc.  I managed to root out over 10GB of disk space, which was nice!


Step 2 - rename image files by their date
script: ‘2_RenameFiles_RunMe.bat'
This needs jhead.exe in the same folder.

Rename all pictures taking the date to be the filename


Step 3 - dedupe any duplicates
Script ‘3_MarkDupes_RunMe.bat’, which passes in files into the main script from u: drive.
Main script: '3_MarkDupes_Code.bat'

Mark all duplicates, within the folder, as ‘_dupe’.


Step 4 - delete the duplicates (step 3b?)
I was just going to search for ‘_dupe’ and manually delete them, but I found this file size compare prog and decided to use that.

They also have a shareware type prog for about $10 that compares images, but I decided not to buy it, and use the file size instead.

This makes step 3 questionably redundant; however I do like step 3, because it gives me confidence when deleting files in step 4 :)  So I will keep using it.


filesize prog:
http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm


Step 5 - remove empty folders
I then had an issue with dozens of empty folders left in the source folder. I can’t just assume they are empty, as they may contain other file formats.

So I’m now investigating a way to recursively remove empty folders:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Batch/Q_28281260.html
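
As a rough illustration of what such a clean-up has to do (this is not the answer from the linked question), here is a minimal Python sketch. The name remove_empty_dirs is hypothetical. It walks the tree bottom-up, never touches the root folder itself, and treats a folder as empty only if it holds no files of any kind.

```python
import os

def remove_empty_dirs(root, dry_run=True):
    """Remove (or, in a dry run, just list) empty folders below root."""
    root = os.fspath(root)
    removed = set()
    # Bottom-up, so a folder holding only empty sub-folders is handled after them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # keep the top-level folder itself
        remaining = [entry for entry in os.listdir(dirpath)
                     if os.path.join(dirpath, entry) not in removed]
        if not remaining:
            removed.add(dirpath)
            if not dry_run:
                os.rmdir(dirpath)  # only succeeds on a truly empty folder
    return sorted(removed)
```

Because os.rmdir refuses to delete a folder that still has content, a folder holding other file formats (or a hidden file such as Thumbs.db) is left alone.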


I did try to add all into a master runme.bat, but this failed to run!
runme.bat
1_SortFiles_RunMe.bat

2_RenameFiles_RunMe.bat

3_MarkDupes_RunMe.bat



All files are attached in the zip for reference, except the step 4 prog.


And a final nb: this was intended for pictures, but I will use it for movies from cameras as well.

Thanks to oBdA and knightEknight for their time and scripts.

This fixes a problem that i have had for many years!
ee.zip