• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 770
  • Last Modified:

Identifying orphan files in a folder

Supposing you had a folder with say 60,000 files in it.  Each file is supposed to relate to an entry in a database table.  However the table only has 50,000 rows in it.  i.e. there are 10,000 orphan files.  

What you want to do is identify and probably delete those 10,000 orphans.  What would be the best way of doing it?  Ideally something that's reasonably quick; I've tried a few things that work, but take a long time.  I usually kill them before they get more than a few hundred files into it, after several minutes.  I've tried the following ideas:

1. Do a <cfdirectory> and a database query to get all the filenames.  Then do a Query-of-Query to get everything from the cfdirectory WHERE Name NOT IN (#QuotedValueList(dbQuery.filename)#)

2. Do a <cfdirectory> and a database query to get all the filenames.  Loop round the cfdirectory, using ListFindNoCase(ValueList(dbQuery.filename), cfdirectory.name) to identify files that aren't in the query.

3. Do a <cfdirectory>, then loop round that, doing a query for each file to see if it's in the database.


Any other suggestions?

At this stage I'd probably want to do it in two steps:
1. Identify the files
2. Do the process again, but this time delete the files.  I'd expect step 1 would produce a list of names (write to a file or db table) that I could then loop through to delete.  

Database is SQL Server.  Using CF 7 on Windows.  Willing to consider non-ColdFusion solutions.
0
duncancumming
Asked:
duncancumming
  • 4
  • 2
  • 2
  • +1
1 Solution
 
duncancummingAuthor Commented:
Just had an idea:
  1. do the db query to get the filenames, save that to a file.  
  2. use DOS to save a DIR on the folder to a file
  3. Use WinMerge to compare the two files to identify differences.
0
 
azadisaryevCommented:
see if this works faster for you:

<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>

<cfset arrOrphans = arraynew(1)>
<cfset dir = "C:\path\to\directory\">

<cfoutput query="getFiles">
 <cfif NOT fileexists(dir & filenamecolumn)>
  <cfset arrayappend(arrOrphans, filenamecolumn)>
 </cfif>
</cfoutput>

<!--- do whatever you want with the orphan files array here, i.e. dump it to screen, write to file/db, save into session scope var, etc etc --->
<cfdump var="#arrOrphans#">

Azadi
0
 
_agx_Commented:
Duncan,

Nothing to add here except to say that your idea of writing the results to a file and comparing them sounds like a good approach.  I have done something similar when dealing with very large directories/tables.  
0
Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

 
duncancummingAuthor Commented:
I think that'll give me the opposite of what I'm after.  That will give me a list of all the files that are in the database but don't exist in the folder.  That might also be useful (it's beginning to look like there's mismatches on both sides!) but it's really the list of files that don't exist in the database that I'm after.  

I wonder if I could reverse your code though, and subtract matching elements from the cfdirectory results until I'm left with just the orphan files?  There's a QueryDeleteRows UDF, but I suspect it would be quite slow on a query object of 60,000 rows.



0
 
duncancummingAuthor Commented:
sorry, previous comment aimed at Azadi, not agx.
0
 
gdemariaCommented:
Here's an idea to query both folder and database.  Be sure both queries are ordered by the file name!

This will walk through both queries comparing the record with each other.  Because both queries are in alphatical order by file name, you can tell which query has the missing value.  So this should tell you if the file is missing in the database or the folder.

Note the "cfloop"  is intended to loop forever because the cfbreak will stop the loop.
But I set the cfloop to loop only 1000 records for testing, just so it doesn't run forever if there is a problem.   When ready, you can change the cfloop to run for a very high number of records or with a true condition so it runs forever (until cfbreak).



<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
  SELECT filenamecolumn
  FROM sometable
  ORDER BY filenamecolumn
</cfquery>
 
<cfdirectory name="getDir" ...>
 
<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1> 
 
<!--- how many records to test? 
      in production set to very high number or condition="true" 
      the loop will break when no more files to test  ---->
<cfloop index="kk" from="1" to="1000"> 
  <!--- in case the database or folder have more records, keep going when one is at the end --->
  <cfif dataCt lte getFiles.recordCount>
     <cfset databaseFile = getFiles[fileName][dataCt]>
  <cfelse>
     <cfset databaseFile = ""> <!--- end of the database records --->
  </cfif>
  <cfif diskFile lte getDir.recordCount>
     <cfset diskFile = getDir[filename][fileCt]>
  <cfelse>
     <cfset diskFile = "">
  </cfif>
 
  <!--- if both database and folder have reached the end, you are done, break... ---->
  <cfif len(databasefile) eq 0 and len(diskFile) eq 0>
     The End.
     <cfbreak>
  <cfelseif databaseFile is diskFile> 
     #databaseFile# match!<br>
     <cfset dataCt = dataCt + 1>
     <cfset fileCt  = fileCt + 1>
  <cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
     #databasefile# does not have a physical file
     <cfset dataCt = dataCt + 1>
  <cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
     #diskFile# does not have a database record
     <cfset fileCt = fileCt + 1>
  </cfif>
</cfloop>

Open in new window

0
 
gdemariaCommented:
oops, I saw a type-o.

Fixed it, and tried to clear up the comments a bit..

<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
  SELECT filenamecolumn
  FROM sometable
  ORDER BY filenamecolumn
</cfquery>
 
<cfdirectory name="getDir" ...>
 
<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1> 
 
<!--- how many records to test? 
      in production set to very high number or condition="true" 
      the loop will break when no more files to test  ---->
<cfloop index="kk" from="1" to="1000"> 
  <!--- get the file name from the database ---->
  <cfif dataCt lte getFiles.recordCount>
     <cfset databaseFile = getFiles[fileName][dataCt]>
  <cfelse> <!--- no more database records, but keep going in case there are more files in the folder --->
     <cfset databaseFile = ""> 
  </cfif>
  <!--- get the file name from the folder ---->
  <cfif fileCt lte getDir.recordCount>
     <cfset diskFile = getDir[filename][fileCt]>
  <cfelse> <!--- no more folder files, but keep going in case more database records ---->
     <cfset diskFile = "">
  </cfif>
 
  <!--- if both database and folder have reached the end, you are done, break... ---->
  <cfif len(databasefile) eq 0 and len(diskFile) eq 0>
     The End.
     <cfbreak>
  <!--- now compare the two queries to see which record is missing --->
  <cfelseif databaseFile is diskFile> 
     #databaseFile# match!<br>
     <cfset dataCt = dataCt + 1>
     <cfset fileCt  = fileCt + 1>
  <cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
     #databasefile# does not have a physical file
     <cfset dataCt = dataCt + 1>
  <cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
     #diskFile# does not have a database record
     <cfset fileCt = fileCt + 1>
  </cfif>
</cfloop>

Open in new window

0
 
_agx_Commented:
Duncan,

I think the biggest time drag will be the looping.  I just noticed you are using sql server.  Another option might be to:

1. use DOS to save a DIR on the folder to a file  (that should be quick)
2. Bulk insert the directory list into a temp table
3. Do an outer join between your two tables to find the orphaned files. Insert the orphans into another table
4. Then delete the orphaned files in sets/chunks (to avoid timeouts)
0
 
duncancummingAuthor Commented:
This is probably the solution I'll go down, or something similar.  Thanks all for the suggestions.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Cloud Class® Course: Microsoft Office 2010

This course will introduce you to the interfaces and features of Microsoft Office 2010 Word, Excel, PowerPoint, Outlook, and Access. You will learn about the features that are shared between all products in the Office suite, as well as the new features that are product specific.

  • 4
  • 2
  • 2
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now