Link to home
Start Free TrialLog in
Avatar of cmistre
cmistre

asked on

Wildcard cfquery & delete files?

I have a situation where I have a very large directory of filenames, that are regulary imported via FTP automatically.  

Since they are not imported via coldfusion, there isnt a database associated with which files belong to which records.  The filenames themselves do contain names that allow me to associate them with their corresponding records, so I can query & retrieved the files as needed, but I am having trouble figuring out how to delete the files when the records become outdated.

I need to be able to do a an entire directory search of what is in the FTP folder against the records that are actually active in my database.  

Is there a way I can wildcard delete files? For instance if my records are labeled from 1000-2000.  The associated records for record 1000 are labeled 1000-1, 1000-2, 1000-3, etc.

What I need to do is check the files in the directory against record numbers and delete any files in the directory that do not have a corresponding record in the database.  So if I run the code every week, it can wildcard delete any file that contains a record number in the filename that no longer exists in the live database.
Avatar of cmistre
cmistre

ASKER

Here is the concept for what I need to do, but I am not sure how to query based on a filename?

Or the best way to loop through and delete the records that are basically outdated.  The first 10 characters of the filename will always match the record id of the active records.  This is the only way I could think of to do this, but of course it doesnt work since I dont know how to actually reference the name of the file in the directory.

 I used **filename** to show where I need to reference this name.
<cfloop>
 
<cfquery name="get_filename">
select **filename** from
**directory**
</cfquery>
 
<cfquery name="active_records">
select id
from active_records
where id like '%filename%'
</cfquery>
 
<cfif #active_records.id# not like #trim(left(**Filename**,10))#>
DELETE FILE FROM DIRECTORY
(i know the code to delete a file)
</cfif>
 
</cfloop>

Open in new window

Avatar of _agx_
Assuming the files are on a local drive, use cfdirectory to get a listing of the files in a particular directory. You can use "filter" to restrict which files are returned.

Example: Find all files starting with:  "1000-"
<cfdirectory directory="c:\someFolder\yourFiles"
     action="list"
    filter="1000-*"
    name="fileList"
    type="file">

The result is a query and the file name is stored in the "Name" column
<!--- show all files found --->
<cfoutput query="fileList">
      File name = #Name#<br>
</cfoutput>
Avatar of cmistre

ASKER

That shows how to display the file names, but how to delete?

Can you give an example with code on how to delete "non-matching" records?

The directory itself may have as many as a few hundred thousand files in it, and the filter way looks like a good way to only bring up those files that match a certain file number, but I need to go a step further and delete ALL files of a record number that do NOT match a record id in the active_records database.

So I need to

1. query the id#'s of all records in the current working database.
2. query the filenames in the directory
3. delete files in the directory that do not have a matching record in the current database any longer.
What database type are you using?
Avatar of cmistre

ASKER

Its mysql 5.0
Also, do you have full control over the server or are you using a shared server (ie with limited access to tags like cfexecute, etc..)?
Since there are potentially hundreds of thousands of records you should definitely do this in chunks.  Here is an outline of one way to do it, that would most certainly be more efficient than looping w/CF

https://www.experts-exchange.com/questions/24287703/Identifying-orphan-files-in-a-folder.html

Also, going forward you might consider adding a column to your table that would store the related file name.  You could set up a scheduled task that would run periodically. It could check all records with a NULL file name and update the record.  That would drastically reduce the problem of orphaned files, as you could delete related  physical files at the same time as the database records.
Avatar of cmistre

ASKER

This is on a dedicated server so I do have full control.  However I don't have a really good way to change the way I recieve the information so I have to modify the way I update rather than the way the information is stored.

I recieve information 2 different updates.  One update completely replaces the old records, while the other update is the images associated with the records, but its only the latest images.  When the text update no longer contains that product, I am left with these "orphaned" image files which eats up alot of space.

I have read the article you linked but I dont know how to work with dos directories etc.  I have thought of looping through text database, and moving matching image files into different folder and deleting the old, but since these images are actually used for display on a website, if they were moved, I would also have to always update the images url in the website everytime I updated.

Does this new information give you any other ideas?  Thanks for your help

The image updates make sense, but I am not quite clear on how your database is updated yet.  So I will focus on the idea in the other thread.

Using DOS to get a listing of the files in a directory is very simple.  You can use the "dir" (directory list) command and send the output to a file.   You can either run it from a command line, or from cfexecute. An example would be something like this.  

--- the "/b" means show only the file names, not the extra information
c:\>  DIR /b  c:\yourDirectory

Then you will have a file containing all of the image names.  You can create a table in your MySQL database and use LOAD DATA INFILE to import the file names into the new table.  Then do a JOIN with your other table, and "flag" all of the images that have no associated record.  It is not much different than a regular update.  Then you can do a select to find all image records that are outdated and loop to delete them.  

Of course you should do this in chunks as it may take a while with so many files and records.

Correction:  I forgot the "output to file" syntax

--- Save results to a file named "myFileList.txt"
c:\>  DIR /b  c:\yourDirectory > myFileList.txt
Avatar of cmistre

ASKER

I am hoping to be able to automate this cleanup process, so I'm probably going to want to set this up via cfexecute to build the table from filenames.  I've never used CFExecute, and not familiar with outer join either.  I'll do some looking up but hope you can help me write a little code if I can't figure it out.

If the process of how the site updates will help sort this out, I'll give you some more details.

There is one master database for the information my website needs and I use a client that is scheduled  to automatically grab the data on a pretty frequent basis.  Coldfusion also has a scheduled task which completely truncates the entire data table, and re-imports the new data download from scratch.  This was so I didn't have to cross-reference the old table with the new table for changes. It was easier to simply drop the old and re-import the data for the freshest update.

The images however are just FTP'd directly into the directory that displays them on the website.  I can't drop that whole images folder and re-download the matching images every time, so this is how I end up with old stored image files that wouldn't belong to any current records.

I'm gonna give the dos / cfexecute / outer join a try.  I played with cfdirectory a bit, but I think its going to crash my server, its been unable to resolve for hours and still hasnt stopped as of now.
Avatar of cmistre

ASKER

Hmm, OK I am pretty confused just with cfexecute already.  I really want this to be a scheduled procedure or something I can just click inside the admin panel to begin the cleanup.

Can you please give me an example of how to run the dos command via cfexecute?
>  I played with cfdirectory a bit, but I think its going to crash my server, its been unable
> to resolve for hours and still hasnt stopped as of now.

Yes, that is why I said you have to do this in chunks ;-)  Hundreds of thousands of files is just too much information to process at one time.  It would take a while - even using a command tool.  Adding CF on top of that definitely won't help.

> and re-imports the new data download from scratch.  This was so I didn't have to
> cross-reference the old table with the new table for changes.

That may be easier in terms of data, but what about the file information? I really think you need to find a way a way to do this incrementally.  To try and run this full routine on 100K+ every time you get an update will take too long and is just not feasible.

The best approach it is to:

1)  Create a process to do the initial cleanup.  This will be less efficient, because it only needs to run once.  But with this much information, you are probably better off doing this at the o/s level .. rather than adding CF to the mix.  There is absolutely NO way you will be able to process 100K files/records from a single cfm script.  The server will time out or crash first.

2)  Create a more efficient method (schedule task) to do the incremental updates



Avatar of cmistre

ASKER

Even incrementally, I would always need to compare the old table to the new one and delete the image files that exist in the directory that don't exist in the new table.  Wouldnt it still have to reference the same exact directory with hundreds of thousands of records?  Its not like they give me a list of photos that need to be deleted...

So... I know the procedure is long and ugly, but it has to be done.  Can you help me with cfexecute?
> Even incrementally, I would always need to compare the old table to the new
> one and delete the image files that exist in the directory that don't exist in the new table.

That is only because you are truncating the table every time.  If you used a more efficient method to identify

1) New
2) Changed and
3) Deleted Items

You would have a much smaller set of information to work with each time.  

Unless you have a _LOT_ of hardware on the server, you will not be able to run this kind of process at all, let alone on a regular basis.  At worst it will bring down the server, at best the site will be unusable during that time.  Not to sound harsh, but it is just not feasible. You have to break up the process into smaller pieces.

Can you post a screen shot of the database table structure and a sample of the data?  Maybe I can help you figure out a better way to update the table.
Avatar of cmistre

ASKER

that would probably complicate things even more since the tables are pretty huge with alot of information and remarks about the items.

The best way to understand is this...  Each record does have an ID and thats how I associate the images with the record.  All that is given to me is a record id and a photo count.  So if record ID is 1000, and the photo count is 3, then I know that the filenames are 1000-1.jpeg, 1000-2.jpeg, 1000-3.jpeg.

Since the new image files are automatically imported directly into the folder, I dont think there is a way to keep a table of filenames current since the downloading client cant really run a mysql procedure to update my table when it adds a file.  This leaves me always having to compare the directory files to the record ID's and look for the orphans.
So the relationship between ID's and images is one-to-many, not one-to-one (ie One record ID can have many images, not just one).

But if you at least knew which ID's were deleted when you get the updates, you would know exactly which files you needed to delete. You would not have to do a full comparison each time.  

How many database records are you talking about?
> But if you at least knew which ID's were deleted when you get the updates, you would know
> exactly which files you needed to delete. You would not have to do a full comparison each time.

Again, think two stages: initial cleanup and maintenance actions going forward.

(Unfortunately, I have to head out soon.  Holiday weekend)
Avatar of cmistre

ASKER

If I didnt truncate the data table, it too would grow out of control in the same way the images are.  The cumulative update option will only add new records.  It has no capabilty to cross-reference my personal MySql table and delete the old records that are no longer in existence.

One thought is if I didnt truncate, then I would need to do this...

1.  remove data records as well as their associated images in the current database that do not exist in the new update file.
2.  do NOT insert records that are already in existence in the database.

Aside from that, I'm pretty sure I can come up with a way to keep an accurate table of the filenames in the directory so I can avoid having to use some type of directory list function after I make the initial table.

Avatar of cmistre

ASKER

We kinda crossed paths in the last response.

There are somewhere between 30,000 & 50,000 total records in the database, and each record has anywhere from 1-10 images, some have even more.  Managing the images is a real pain.

Im thinking of the possibilities of the cumulative update...

My thought process is that I have import the new data into a totally different table, but then perform some type of loop by the record ID to flag records that meet one of 3 criteria.

1.  Records that exist in update that dont exist in current table - ADD
2.  Records that exist in both places - OVERWRITE OLD new since it could contain changes to price, etc.
3.  Records that do NOT exist in UPDATE that exist in table - DELETE record and all filenames that begin with the prefix of the id#

This sounds more manageable but I had originally thought the loop idea was not a good idea for the 50,000ish data records, but its definitely a better alternative than the hundreds of thousands of images.

ASKER CERTIFIED SOLUTION
Avatar of _agx_
_agx_
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>  We kinda crossed paths in the last response.

And in this one too ;-)

Our timing was off again, but yes. Now we are on the same page.  EXCEPT - do not do the sql updates in a loop.  SQL is much better at processing whole sets of data. Plus looping will take much longer.   The only place you must use looping is for the delete of the image files.  But again, you can do that in a scheduled task, which will make the processing smoother.

I have to head out now.  I have about 5 hours to get some sleep before traveling ;-)
Avatar of cmistre

ASKER

have fun, man thx for the help so far.

I'm still lost on how to write the real code for the joins but will play around.
> I'm still lost on how to write the real code for the joins but will play around.

The examples posted above should be pretty close to the real code.  I just said "psuedo-sql" because they were not tested.