Solved

Identifying orphan files in a folder

Posted on 2009-04-02
9
727 Views
Last Modified: 2013-12-20
Supposing you had a folder with say 60,000 files in it.  Each file is supposed to relate to an entry in a database table.  However the table only has 50,000 rows in it.  i.e. there are 10,000 orphan files.  

What you want to do is identify and probably delete those 10,000 orphans.  What would be the best way of doing it?  Ideally something that's reasonably quick; I've tried a few things that work, but take a long time.  I usually kill them before they get more than a few hundred files into it, after several minutes.  I've tried the following ideas:

1. Do a <cfdirectory> and a database query to get all the filenames.  Then do a Query-of-Query to get everything from the cfdirectory WHERE Name NOT IN (#QuotedValueList(dbQuery.filename)#)

2. Do a <cfdirectory> and a database query to get all the filenames.  Loop round the cfdirectory, using ListFindNoCase(ValueList(dbQuery.filename), cfdirectory.name) to identify files that aren't in the query.

3. Do a <cfdirectory>, then loop round that, doing a query for each file to see if it's in the database.


Any other suggestions?

At this stage I'd probably want to do it in two steps:
1. Identify the files
2. Do the process again, but this time delete the files.  I'd expect step 1 would produce a list of names (write to a file or db table) that I could then loop through to delete.  

Database is SQL Server.  Using CF 7 on Windows.  Willing to consider non-ColdFusion solutions.
0
Comment
Question by:duncancumming
  • 4
  • 2
  • 2
  • +1
9 Comments
 
LVL 16

Author Comment

by:duncancumming
ID: 24047754
Just had an idea:
  1. do the db query to get the filenames, save that to a file.  
  2. use DOS to save a DIR on the folder to a file
  3. Use WinMerge to compare the two files to identify differences.
0
 
LVL 27

Expert Comment

by:azadisaryev
ID: 24047779
see if this works faster for you:

<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>

<cfset arrOrphans = arraynew(1)>
<cfset dir = "C:\path\to\directory\">

<cfoutput query="getFiles">
 <cfif NOT fileexists(dir & filenamecolumn)>
  <cfset arrayappend(arrOrphans, filenamecolumn)>
 </cfif>
</cfoutput>

<!--- do whatever you want with the orphan files array here, i.e. dump it to screen, write to file/db, save into session scope var, etc etc --->
<cfdump var="#arrOrphans#">

Azadi
0
 
LVL 52

Expert Comment

by:_agx_
ID: 24049021
Duncan,

Nothing to add here except to say that your idea of writing the results to a file and comparing them sounds like a good approach.  I have done something similar when dealing with very large directories/tables.  
0
 
LVL 16

Author Comment

by:duncancumming
ID: 24049246
I think that'll give me the opposite of what I'm after.  That will give me a list of all the files that are in the database but don't exist in the folder.  That might also be useful (it's beginning to look like there's mismatches on both sides!) but it's really the list of files that don't exist in the database that I'm after.  

I wonder if I could reverse your code though, and subtract matching elements from the cfdirectory results until I'm left with just the orphan files?  There's a QueryDeleteRows UDF, but I suspect it would be quite slow on a query object of 60,000 rows.



0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 16

Author Comment

by:duncancumming
ID: 24049255
sorry, previous comment aimed at Azadi, not agx.
0
 
LVL 39

Expert Comment

by:gdemaria
ID: 24049892
Here's an idea to query both folder and database.  Be sure both queries are ordered by the file name!

This will walk through both queries comparing the record with each other.  Because both queries are in alphatical order by file name, you can tell which query has the missing value.  So this should tell you if the file is missing in the database or the folder.

Note the "cfloop"  is intended to loop forever because the cfbreak will stop the loop.
But I set the cfloop to loop only 1000 records for testing, just so it doesn't run forever if there is a problem.   When ready, you can change the cfloop to run for a very high number of records or with a true condition so it runs forever (until cfbreak).




<!--- be certain the two queries are order by the file name! ---->

<cfquery name="getFiles" ...>

  SELECT filenamecolumn

  FROM sometable

  ORDER BY filenamecolumn

</cfquery>
 

<cfdirectory name="getDir" ...>
 

<!--- loop counters to walk through the database/directory queries --->

<cfset dataCt = 1>

<cfset fileCt = 1> 
 

<!--- how many records to test? 

      in production set to very high number or condition="true" 

      the loop will break when no more files to test  ---->

<cfloop index="kk" from="1" to="1000"> 

  <!--- in case the database or folder have more records, keep going when one is at the end --->

  <cfif dataCt lte getFiles.recordCount>

     <cfset databaseFile = getFiles[fileName][dataCt]>

  <cfelse>

     <cfset databaseFile = ""> <!--- end of the database records --->

  </cfif>

  <cfif diskFile lte getDir.recordCount>

     <cfset diskFile = getDir[filename][fileCt]>

  <cfelse>

     <cfset diskFile = "">

  </cfif>
 

  <!--- if both database and folder have reached the end, you are done, break... ---->

  <cfif len(databasefile) eq 0 and len(diskFile) eq 0>

     The End.

     <cfbreak>

  <cfelseif databaseFile is diskFile> 

     #databaseFile# match!<br>

     <cfset dataCt = dataCt + 1>

     <cfset fileCt  = fileCt + 1>

  <cfelseif len(diskFile) eq 0 or databasefile lt diskFile>

     #databasefile# does not have a physical file

     <cfset dataCt = dataCt + 1>

  <cfelseif len(databasefile) eq 0 or diskFile lt databasefile>

     #diskFile# does not have a database record

     <cfset fileCt = fileCt + 1>

  </cfif>

</cfloop>

Open in new window

0
 
LVL 39

Expert Comment

by:gdemaria
ID: 24049935
oops, I saw a type-o.

Fixed it, and tried to clear up the comments a bit..

<!--- be certain the two queries are order by the file name! ---->

<cfquery name="getFiles" ...>

  SELECT filenamecolumn

  FROM sometable

  ORDER BY filenamecolumn

</cfquery>
 

<cfdirectory name="getDir" ...>
 

<!--- loop counters to walk through the database/directory queries --->

<cfset dataCt = 1>

<cfset fileCt = 1> 
 

<!--- how many records to test? 

      in production set to very high number or condition="true" 

      the loop will break when no more files to test  ---->

<cfloop index="kk" from="1" to="1000"> 

  <!--- get the file name from the database ---->

  <cfif dataCt lte getFiles.recordCount>

     <cfset databaseFile = getFiles[fileName][dataCt]>

  <cfelse> <!--- no more database records, but keep going in case there are more files in the folder --->

     <cfset databaseFile = ""> 

  </cfif>

  <!--- get the file name from the folder ---->

  <cfif fileCt lte getDir.recordCount>

     <cfset diskFile = getDir[filename][fileCt]>

  <cfelse> <!--- no more folder files, but keep going in case more database records ---->

     <cfset diskFile = "">

  </cfif>
 

  <!--- if both database and folder have reached the end, you are done, break... ---->

  <cfif len(databasefile) eq 0 and len(diskFile) eq 0>

     The End.

     <cfbreak>

  <!--- now compare the two queries to see which record is missing --->

  <cfelseif databaseFile is diskFile> 

     #databaseFile# match!<br>

     <cfset dataCt = dataCt + 1>

     <cfset fileCt  = fileCt + 1>

  <cfelseif len(diskFile) eq 0 or databasefile lt diskFile>

     #databasefile# does not have a physical file

     <cfset dataCt = dataCt + 1>

  <cfelseif len(databasefile) eq 0 or diskFile lt databasefile>

     #diskFile# does not have a database record

     <cfset fileCt = fileCt + 1>

  </cfif>

</cfloop>

Open in new window

0
 
LVL 52

Accepted Solution

by:
_agx_ earned 500 total points
ID: 24050049
Duncan,

I think the biggest time drag will be the looping.  I just noticed you are using sql server.  Another option might be to:

1. use DOS to save a DIR on the folder to a file  (that should be quick)
2. Bulk insert the directory list into a temp table
3. Do an outer join between your two tables to find the orphaned files. Insert the orphans into another table
4. Then delete the orphaned files in sets/chunks (to avoid timeouts)
0
 
LVL 16

Author Closing Comment

by:duncancumming
ID: 31565687
This is probably the solution I'll go down, or something similar.  Thanks all for the suggestions.
0

Featured Post

Threat Intelligence Starter Resources

Integrating threat intelligence can be challenging, and not all companies are ready. These resources can help you build awareness and prepare for defense.

Join & Write a Comment

CFGRID Custom Functionality Series -  Part 1 Hi Guys, I was once asked how it is possible to to add a hyperlink in the cfgrid and open the window to show the data. Now this is quite simple, I have to use the EXT JS library for this and I achiev…
I spent nearly three days trying to figure out how incorporate OAuth in Coldfusion for the Eventful API. Hopefully, this article will allow Coldfusion Programmers to buzz through the API when they need to. Basically, what this script does is authori…
The purpose of this video is to demonstrate how to make a WordPress Site faster and smaller in size by cleaning up the database. This will be demonstrated using a Windows 8 PC. Plugin WP Optimize will be used. Go to your WordPress login page. T…
The purpose of this video is to demonstrate how to Import and export files in WordPress. This will be demonstrated using a Windows 8 PC. Go to your WordPress login page. This will look like the following: mywebsite.com/wp-login.php : Click on Too…

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now