duncancumming
asked on
Identifying orphan files in a folder
Supposing you had a folder with say 60,000 files in it. Each file is supposed to relate to an entry in a database table. However the table only has 50,000 rows in it. i.e. there are 10,000 orphan files.
What you want to do is identify and probably delete those 10,000 orphans. What would be the best way of doing it? Ideally something that's reasonably quick; I've tried a few things that work, but take a long time. I usually kill them before they get more than a few hundred files into it, after several minutes. I've tried the following ideas:
1. Do a <cfdirectory> and a database query to get all the filenames. Then do a Query-of-Query to get everything from the cfdirectory WHERE Name NOT IN (#QuotedValueList(dbQuery. filename)# )
2. Do a <cfdirectory> and a database query to get all the filenames. Loop round the cfdirectory, using ListFindNoCase(ValueList(d bQuery.fil ename), cfdirectory.name) to identify files that aren't in the query.
3. Do a <cfdirectory>, then loop round that, doing a query for each file to see if it's in the database.
Any other suggestions?
At this stage I'd probably want to do it in two steps:
1. Identify the files
2. Do the process again, but this time delete the files. I'd expect step 1 would produce a list of names (write to a file or db table) that I could then loop through to delete.
Database is SQL Server. Using CF 7 on Windows. Willing to consider non-ColdFusion solutions.
What you want to do is identify and probably delete those 10,000 orphans. What would be the best way of doing it? Ideally something that's reasonably quick; I've tried a few things that work, but take a long time. I usually kill them before they get more than a few hundred files into it, after several minutes. I've tried the following ideas:
1. Do a <cfdirectory> and a database query to get all the filenames. Then do a Query-of-Query to get everything from the cfdirectory WHERE Name NOT IN (#QuotedValueList(dbQuery.
2. Do a <cfdirectory> and a database query to get all the filenames. Loop round the cfdirectory, using ListFindNoCase(ValueList(d
3. Do a <cfdirectory>, then loop round that, doing a query for each file to see if it's in the database.
Any other suggestions?
At this stage I'd probably want to do it in two steps:
1. Identify the files
2. Do the process again, but this time delete the files. I'd expect step 1 would produce a list of names (write to a file or db table) that I could then loop through to delete.
Database is SQL Server. Using CF 7 on Windows. Willing to consider non-ColdFusion solutions.
see if this works faster for you:
<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>
<cfset arrOrphans = arraynew(1)>
<cfset dir = "C:\path\to\directory\">
<cfoutput query="getFiles">
<cfif NOT fileexists(dir & filenamecolumn)>
<cfset arrayappend(arrOrphans, filenamecolumn)>
</cfif>
</cfoutput>
<!--- do whatever you want with the orphan files array here, i.e. dump it to screen, write to file/db, save into session scope var, etc etc --->
<cfdump var="#arrOrphans#">
Azadi
<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>
<cfset arrOrphans = arraynew(1)>
<cfset dir = "C:\path\to\directory\">
<cfoutput query="getFiles">
<cfif NOT fileexists(dir & filenamecolumn)>
<cfset arrayappend(arrOrphans, filenamecolumn)>
</cfif>
</cfoutput>
<!--- do whatever you want with the orphan files array here, i.e. dump it to screen, write to file/db, save into session scope var, etc etc --->
<cfdump var="#arrOrphans#">
Azadi
Duncan,
Nothing to add here except to say that your idea of writing the results to a file and comparing them sounds like a good approach. I have done something similar when dealing with very large directories/tables.
Nothing to add here except to say that your idea of writing the results to a file and comparing them sounds like a good approach. I have done something similar when dealing with very large directories/tables.
ASKER
I think that'll give me the opposite of what I'm after. That will give me a list of all the files that are in the database but don't exist in the folder. That might also be useful (it's beginning to look like there's mismatches on both sides!) but it's really the list of files that don't exist in the database that I'm after.
I wonder if I could reverse your code though, and subtract matching elements from the cfdirectory results until I'm left with just the orphan files? There's a QueryDeleteRows UDF, but I suspect it would be quite slow on a query object of 60,000 rows.
I wonder if I could reverse your code though, and subtract matching elements from the cfdirectory results until I'm left with just the orphan files? There's a QueryDeleteRows UDF, but I suspect it would be quite slow on a query object of 60,000 rows.
ASKER
sorry, previous comment aimed at Azadi, not agx.
Here's an idea to query both folder and database. Be sure both queries are ordered by the file name!
This will walk through both queries comparing the record with each other. Because both queries are in alphatical order by file name, you can tell which query has the missing value. So this should tell you if the file is missing in the database or the folder.
Note the "cfloop" is intended to loop forever because the cfbreak will stop the loop.
But I set the cfloop to loop only 1000 records for testing, just so it doesn't run forever if there is a problem. When ready, you can change the cfloop to run for a very high number of records or with a true condition so it runs forever (until cfbreak).
This will walk through both queries comparing the record with each other. Because both queries are in alphatical order by file name, you can tell which query has the missing value. So this should tell you if the file is missing in the database or the folder.
Note the "cfloop" is intended to loop forever because the cfbreak will stop the loop.
But I set the cfloop to loop only 1000 records for testing, just so it doesn't run forever if there is a problem. When ready, you can change the cfloop to run for a very high number of records or with a true condition so it runs forever (until cfbreak).
<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>
<cfdirectory name="getDir" ...>
<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1>
<!--- how many records to test?
in production set to very high number or condition="true"
the loop will break when no more files to test ---->
<cfloop index="kk" from="1" to="1000">
<!--- in case the database or folder have more records, keep going when one is at the end --->
<cfif dataCt lte getFiles.recordCount>
<cfset databaseFile = getFiles[fileName][dataCt]>
<cfelse>
<cfset databaseFile = ""> <!--- end of the database records --->
</cfif>
<cfif diskFile lte getDir.recordCount>
<cfset diskFile = getDir[filename][fileCt]>
<cfelse>
<cfset diskFile = "">
</cfif>
<!--- if both database and folder have reached the end, you are done, break... ---->
<cfif len(databasefile) eq 0 and len(diskFile) eq 0>
The End.
<cfbreak>
<cfelseif databaseFile is diskFile>
#databaseFile# match!<br>
<cfset dataCt = dataCt + 1>
<cfset fileCt = fileCt + 1>
<cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
#databasefile# does not have a physical file
<cfset dataCt = dataCt + 1>
<cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
#diskFile# does not have a database record
<cfset fileCt = fileCt + 1>
</cfif>
</cfloop>
oops, I saw a type-o.
Fixed it, and tried to clear up the comments a bit..
Fixed it, and tried to clear up the comments a bit..
<!--- be certain the two queries are order by the file name! ---->
<cfquery name="getFiles" ...>
SELECT filenamecolumn
FROM sometable
ORDER BY filenamecolumn
</cfquery>
<cfdirectory name="getDir" ...>
<!--- loop counters to walk through the database/directory queries --->
<cfset dataCt = 1>
<cfset fileCt = 1>
<!--- how many records to test?
in production set to very high number or condition="true"
the loop will break when no more files to test ---->
<cfloop index="kk" from="1" to="1000">
<!--- get the file name from the database ---->
<cfif dataCt lte getFiles.recordCount>
<cfset databaseFile = getFiles[fileName][dataCt]>
<cfelse> <!--- no more database records, but keep going in case there are more files in the folder --->
<cfset databaseFile = "">
</cfif>
<!--- get the file name from the folder ---->
<cfif fileCt lte getDir.recordCount>
<cfset diskFile = getDir[filename][fileCt]>
<cfelse> <!--- no more folder files, but keep going in case more database records ---->
<cfset diskFile = "">
</cfif>
<!--- if both database and folder have reached the end, you are done, break... ---->
<cfif len(databasefile) eq 0 and len(diskFile) eq 0>
The End.
<cfbreak>
<!--- now compare the two queries to see which record is missing --->
<cfelseif databaseFile is diskFile>
#databaseFile# match!<br>
<cfset dataCt = dataCt + 1>
<cfset fileCt = fileCt + 1>
<cfelseif len(diskFile) eq 0 or databasefile lt diskFile>
#databasefile# does not have a physical file
<cfset dataCt = dataCt + 1>
<cfelseif len(databasefile) eq 0 or diskFile lt databasefile>
#diskFile# does not have a database record
<cfset fileCt = fileCt + 1>
</cfif>
</cfloop>
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
This is probably the solution I'll go down, or something similar. Thanks all for the suggestions.
ASKER