Solved

Identifying orphan files in a folder

Posted on 2009-04-02
737 Views
Supposing you had a folder with say 60,000 files in it.  Each file is supposed to relate to an entry in a database table.  However the table only has 50,000 rows in it.  i.e. there are 10,000 orphan files.

What you want to do is identify and probably delete those 10,000 orphans.  What would be the best way of doing it?  Ideally something that's reasonably quick; I've tried a few things that work, but take a long time.  I usually kill them before they get more than a few hundred files into it, after several minutes.  I've tried the following ideas:

1. Do a <cfdirectory> and a database query to get all the filenames.  Then do a Query-of-Query to get everything from the cfdirectory WHERE Name NOT IN (#QuotedValueList(dbQuery.filename)#)

2. Do a <cfdirectory> and a database query to get all the filenames.  Loop round the cfdirectory, using ListFindNoCase(ValueList(dbQuery.filename), cfdirectory.name) to identify files that aren't in the query.

3. Do a <cfdirectory>, then loop round that, doing a query for each file to see if it's in the database.

Any other suggestions?

At this stage I'd probably want to do it in two steps:
1. Identify the files
2. Do the process again, but this time delete the files.  I'd expect step 1 would produce a list of names (write to a file or db table) that I could then loop through to delete.

Database is SQL Server.  Using CF 7 on Windows.  Willing to consider non-ColdFusion solutions.
0
Question by:duncancumming
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

• Help others & share knowledge
• Earn cash & points
• 4
• 2
• 2
• +1

LVL 16

Author Comment

ID: 24047754
1. do the db query to get the filenames, save that to a file.
2. use DOS to save a DIR on the folder to a file
3. Use WinMerge to compare the two files to identify differences.
0

LVL 27

Expert Comment

ID: 24047779
see if this works faster for you:

<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>

<cfset arrOrphans = arraynew(1)>
<cfset dir = "C:\path\to\directory\">

<cfoutput query="getFiles">
<cfif NOT fileexists(dir & filenamecolumn)>
<cfset arrayappend(arrOrphans, filenamecolumn)>
</cfif>
</cfoutput>

<!--- do whatever you want with the orphan files array here, i.e. dump it to screen, write to file/db, save into session scope var, etc etc --->
<cfdump var="#arrOrphans#">

0

LVL 52

Expert Comment

ID: 24049021
Duncan,

Nothing to add here except to say that your idea of writing the results to a file and comparing them sounds like a good approach.  I have done something similar when dealing with very large directories/tables.
0

LVL 16

Author Comment

ID: 24049246
I think that'll give me the opposite of what I'm after.  That will give me a list of all the files that are in the database but don't exist in the folder.  That might also be useful (it's beginning to look like there's mismatches on both sides!) but it's really the list of files that don't exist in the database that I'm after.

I wonder if I could reverse your code though, and subtract matching elements from the cfdirectory results until I'm left with just the orphan files?  There's a QueryDeleteRows UDF, but I suspect it would be quite slow on a query object of 60,000 rows.

0

LVL 16

Author Comment

ID: 24049255
sorry, previous comment aimed at Azadi, not agx.
0

LVL 39

Expert Comment

ID: 24049892
Here's an idea to query both folder and database.  Be sure both queries are ordered by the file name!

This will walk through both queries comparing the record with each other.  Because both queries are in alphatical order by file name, you can tell which query has the missing value.  So this should tell you if the file is missing in the database or the folder.

Note the "cfloop"  is intended to loop forever because the cfbreak will stop the loop.
But I set the cfloop to loop only 1000 records for testing, just so it doesn't run forever if there is a problem.   When ready, you can change the cfloop to run for a very high number of records or with a true condition so it runs forever (until cfbreak).


<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>

<cfdirectory name="getDir" ...>

<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1>

<!--- how many records to test?
in production set to very high number or condition="true"
the loop will break when no more files to test  ---->
<cfloop index="kk" from="1" to="1000">
<!--- in case the database or folder have more records, keep going when one is at the end --->
<cfif dataCt lte getFiles.recordCount>
<cfset databaseFile = getFiles[fileName][dataCt]>
<cfelse>
<cfset databaseFile = ""> <!--- end of the database records --->
</cfif>
<cfif diskFile lte getDir.recordCount>
<cfset diskFile = getDir[filename][fileCt]>
<cfelse>
<cfset diskFile = "">
</cfif>

<!--- if both database and folder have reached the end, you are done, break... ---->
<cfif len(databasefile) eq 0 and len(diskFile) eq 0>
The End.
<cfbreak>
<cfelseif databaseFile is diskFile>
#databaseFile# match!<br>
<cfset dataCt = dataCt + 1>
<cfset fileCt  = fileCt + 1>
<cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
#databasefile# does not have a physical file
<cfset dataCt = dataCt + 1>
<cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
#diskFile# does not have a database record
<cfset fileCt = fileCt + 1>
</cfif>
</cfloop>

0

LVL 39

Expert Comment

ID: 24049935
oops, I saw a type-o.

Fixed it, and tried to clear up the comments a bit..

<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>

<cfdirectory name="getDir" ...>

<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1>

<!--- how many records to test?
in production set to very high number or condition="true"
the loop will break when no more files to test  ---->
<cfloop index="kk" from="1" to="1000">
<!--- get the file name from the database ---->
<cfif dataCt lte getFiles.recordCount>
<cfset databaseFile = getFiles[fileName][dataCt]>
<cfelse> <!--- no more database records, but keep going in case there are more files in the folder --->
<cfset databaseFile = "">
</cfif>
<!--- get the file name from the folder ---->
<cfif fileCt lte getDir.recordCount>
<cfset diskFile = getDir[filename][fileCt]>
<cfelse> <!--- no more folder files, but keep going in case more database records ---->
<cfset diskFile = "">
</cfif>

<!--- if both database and folder have reached the end, you are done, break... ---->
<cfif len(databasefile) eq 0 and len(diskFile) eq 0>
The End.
<cfbreak>
<!--- now compare the two queries to see which record is missing --->
<cfelseif databaseFile is diskFile>
#databaseFile# match!<br>
<cfset dataCt = dataCt + 1>
<cfset fileCt  = fileCt + 1>
<cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
#databasefile# does not have a physical file
<cfset dataCt = dataCt + 1>
<cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
#diskFile# does not have a database record
<cfset fileCt = fileCt + 1>
</cfif>
</cfloop>

0

LVL 52

Accepted Solution

_agx_ earned 500 total points
ID: 24050049
Duncan,

I think the biggest time drag will be the looping.  I just noticed you are using sql server.  Another option might be to:

1. use DOS to save a DIR on the folder to a file  (that should be quick)
2. Bulk insert the directory list into a temp table
3. Do an outer join between your two tables to find the orphaned files. Insert the orphans into another table
4. Then delete the orphaned files in sets/chunks (to avoid timeouts)
0

LVL 16

Author Closing Comment

ID: 31565687
This is probably the solution I'll go down, or something similar.  Thanks all for the suggestions.
0

Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction In this tutorial, I'll explain how to create an animated progress meter in a wireframe prototype developed using Axure RP 7.0 - a leading prototyping tool for designing web sites and software. (For more information about Axure and gett…
Lease-to-own eliminates the expenditure of hardware replacement and allows you to pay off the server over time. Usually, this is much cheaper than leasing servers. Think of lease-to-own as credit without interest.
The purpose of this video is to demonstrate how to properly insert a Vimeo Video into a WordPress site or Blog. This will be demonstrated using a Windows 8 PC. Go to your WordPress login page. This will look like the following: mywebsite.com/wp…
The purpose of this video is to demonstrate how to Test the speed of a WordPress Website. Site Speed is an important metric of a site’s health. Slow site speed can result in viewers leaving your site quickly and not seeing your content. This…
Suggested Courses
Course of the Month6 days, 9 hours left to enroll