?
Solved

Identifying orphan files in a folder

Posted on 2009-04-02
9
Medium Priority
?
746 Views
Last Modified: 2013-12-20
Supposing you had a folder with say 60,000 files in it.  Each file is supposed to relate to an entry in a database table.  However the table only has 50,000 rows in it.  i.e. there are 10,000 orphan files.  

What you want to do is identify and probably delete those 10,000 orphans.  What would be the best way of doing it?  Ideally something that's reasonably quick; I've tried a few things that work, but take a long time.  I usually kill them before they get more than a few hundred files into it, after several minutes.  I've tried the following ideas:

1. Do a <cfdirectory> and a database query to get all the filenames.  Then do a Query-of-Query to get everything from the cfdirectory WHERE Name NOT IN (#QuotedValueList(dbQuery.filename)#)

2. Do a <cfdirectory> and a database query to get all the filenames.  Loop round the cfdirectory, using ListFindNoCase(ValueList(dbQuery.filename), cfdirectory.name) to identify files that aren't in the query.

3. Do a <cfdirectory>, then loop round that, doing a query for each file to see if it's in the database.


Any other suggestions?

At this stage I'd probably want to do it in two steps:
1. Identify the files
2. Do the process again, but this time delete the files.  I'd expect step 1 would produce a list of names (write to a file or db table) that I could then loop through to delete.  

Database is SQL Server.  Using CF 7 on Windows.  Willing to consider non-ColdFusion solutions.
0
Comment
Question by:duncancumming
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 2
  • 2
  • +1
9 Comments
 
LVL 16

Author Comment

by:duncancumming
ID: 24047754
Just had an idea:
  1. do the db query to get the filenames, save that to a file.  
  2. use DOS to save a DIR on the folder to a file
  3. Use WinMerge to compare the two files to identify differences.
0
 
LVL 27

Expert Comment

by:azadisaryev
ID: 24047779
see if this works faster for you:

<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>

<cfset arrOrphans = arraynew(1)>
<cfset dir = "C:\path\to\directory\">

<cfoutput query="getFiles">
 <cfif NOT fileexists(dir & filenamecolumn)>
  <cfset arrayappend(arrOrphans, filenamecolumn)>
 </cfif>
</cfoutput>

<!--- do whatever you want with the orphan files array here, i.e. dump it to screen, write to file/db, save into session scope var, etc etc --->
<cfdump var="#arrOrphans#">

Azadi
0
 
LVL 52

Expert Comment

by:_agx_
ID: 24049021
Duncan,

Nothing to add here except to say that your idea of writing the results to a file and comparing them sounds like a good approach.  I have done something similar when dealing with very large directories/tables.  
0
Use Case: Protecting a Hybrid Cloud Infrastructure

Microsoft Azure is rapidly becoming the norm in dynamic IT environments. This document describes the challenges that organizations face when protecting data in a hybrid cloud IT environment and presents a use case to demonstrate how Acronis Backup protects all data.

 
LVL 16

Author Comment

by:duncancumming
ID: 24049246
I think that'll give me the opposite of what I'm after.  That will give me a list of all the files that are in the database but don't exist in the folder.  That might also be useful (it's beginning to look like there's mismatches on both sides!) but it's really the list of files that don't exist in the database that I'm after.  

I wonder if I could reverse your code though, and subtract matching elements from the cfdirectory results until I'm left with just the orphan files?  There's a QueryDeleteRows UDF, but I suspect it would be quite slow on a query object of 60,000 rows.



0
 
LVL 16

Author Comment

by:duncancumming
ID: 24049255
sorry, previous comment aimed at Azadi, not agx.
0
 
LVL 39

Expert Comment

by:gdemaria
ID: 24049892
Here's an idea to query both folder and database.  Be sure both queries are ordered by the file name!

This will walk through both queries comparing the record with each other.  Because both queries are in alphatical order by file name, you can tell which query has the missing value.  So this should tell you if the file is missing in the database or the folder.

Note the "cfloop"  is intended to loop forever because the cfbreak will stop the loop.
But I set the cfloop to loop only 1000 records for testing, just so it doesn't run forever if there is a problem.   When ready, you can change the cfloop to run for a very high number of records or with a true condition so it runs forever (until cfbreak).



<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
  SELECT filenamecolumn
  FROM sometable
  ORDER BY filenamecolumn
</cfquery>
 
<cfdirectory name="getDir" ...>
 
<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1> 
 
<!--- how many records to test? 
      in production set to very high number or condition="true" 
      the loop will break when no more files to test  ---->
<cfloop index="kk" from="1" to="1000"> 
  <!--- in case the database or folder have more records, keep going when one is at the end --->
  <cfif dataCt lte getFiles.recordCount>
     <cfset databaseFile = getFiles[fileName][dataCt]>
  <cfelse>
     <cfset databaseFile = ""> <!--- end of the database records --->
  </cfif>
  <cfif diskFile lte getDir.recordCount>
     <cfset diskFile = getDir[filename][fileCt]>
  <cfelse>
     <cfset diskFile = "">
  </cfif>
 
  <!--- if both database and folder have reached the end, you are done, break... ---->
  <cfif len(databasefile) eq 0 and len(diskFile) eq 0>
     The End.
     <cfbreak>
  <cfelseif databaseFile is diskFile> 
     #databaseFile# match!<br>
     <cfset dataCt = dataCt + 1>
     <cfset fileCt  = fileCt + 1>
  <cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
     #databasefile# does not have a physical file
     <cfset dataCt = dataCt + 1>
  <cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
     #diskFile# does not have a database record
     <cfset fileCt = fileCt + 1>
  </cfif>
</cfloop>

Open in new window

0
 
LVL 39

Expert Comment

by:gdemaria
ID: 24049935
oops, I saw a type-o.

Fixed it, and tried to clear up the comments a bit..

<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
  SELECT filenamecolumn
  FROM sometable
  ORDER BY filenamecolumn
</cfquery>
 
<cfdirectory name="getDir" ...>
 
<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1> 
 
<!--- how many records to test? 
      in production set to very high number or condition="true" 
      the loop will break when no more files to test  ---->
<cfloop index="kk" from="1" to="1000"> 
  <!--- get the file name from the database ---->
  <cfif dataCt lte getFiles.recordCount>
     <cfset databaseFile = getFiles[fileName][dataCt]>
  <cfelse> <!--- no more database records, but keep going in case there are more files in the folder --->
     <cfset databaseFile = ""> 
  </cfif>
  <!--- get the file name from the folder ---->
  <cfif fileCt lte getDir.recordCount>
     <cfset diskFile = getDir[filename][fileCt]>
  <cfelse> <!--- no more folder files, but keep going in case more database records ---->
     <cfset diskFile = "">
  </cfif>
 
  <!--- if both database and folder have reached the end, you are done, break... ---->
  <cfif len(databasefile) eq 0 and len(diskFile) eq 0>
     The End.
     <cfbreak>
  <!--- now compare the two queries to see which record is missing --->
  <cfelseif databaseFile is diskFile> 
     #databaseFile# match!<br>
     <cfset dataCt = dataCt + 1>
     <cfset fileCt  = fileCt + 1>
  <cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
     #databasefile# does not have a physical file
     <cfset dataCt = dataCt + 1>
  <cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
     #diskFile# does not have a database record
     <cfset fileCt = fileCt + 1>
  </cfif>
</cfloop>

Open in new window

0
 
LVL 52

Accepted Solution

by:
_agx_ earned 2000 total points
ID: 24050049
Duncan,

I think the biggest time drag will be the looping.  I just noticed you are using sql server.  Another option might be to:

1. use DOS to save a DIR on the folder to a file  (that should be quick)
2. Bulk insert the directory list into a temp table
3. Do an outer join between your two tables to find the orphaned files. Insert the orphans into another table
4. Then delete the orphaned files in sets/chunks (to avoid timeouts)
0
 
LVL 16

Author Closing Comment

by:duncancumming
ID: 31565687
This is probably the solution I'll go down, or something similar.  Thanks all for the suggestions.
0

Featured Post

Supports up to 4K resolution!

The VS192 2-Port 4K DisplayPort Splitter is perfect for anyone who needs to send one source of DisplayPort high definition video to two or four DisplayPort displays. The VS192 can split and also expand DisplayPort audio/video signal on two or four DisplayPort monitors.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I spent nearly three days trying to figure out how incorporate OAuth in Coldfusion for the Eventful API. Hopefully, this article will allow Coldfusion Programmers to buzz through the API when they need to. Basically, what this script does is authori…
This article provides a case study on how our local youth baseball league deployed a new website, including the platform selection, implementation and benefits to the league.
The purpose of this video is to demonstrate how to create a Printer Friendly PDF on a WordPress Page. This will be demonstrated using a Windows 8 PC. Tools Used are Photoshop, Awesome Screenshot” Google Chrome Extension, and SmallPDF.com Log…
The purpose of this video is to demonstrate how to integrate Mailchimp with Facebook. This will be demonstrated using a Windows 8 PC. Mailchimp and Facebook will be used. Log into your Mailchimp account. : Click on your name. Go to Account Setti…
Suggested Courses

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question