rincewind666

Search through list of HTML files to find file names

I have a list of HTML files in listbox1. I need Delphi code to search through these files, find any file names mentioned in their code, and add them to grid1 as follows:

First column: the HTML file that the file name is mentioned in.
Second column: the found file name.
Third column: position where the file name was found.

I am using Delphi 6. I am giving the maximum 500 points for this.  Many thanks for your help.
esoftbg

> mentioned in the code ?
rincewind666

ASKER

I mean the HTML code of the HTML files.
Wim ten Brink

What's the definition of a filename? Do you consider \\Server\Share\Folder\File.txt a filename? Is C:\Folder\File.txt a filename? Is http://Server/Folder/File.txt a filename? Is filename.txt a filename?

Depending on what you're looking for, search for the characters '//', '\\' or just ':\' and use some smart algorithm to determine whether they are part of what you consider to be a filename.

Of course, if you're just looking for links, search for the <a> tag or the <img> tag in the file, since they will refer to other files. But first make clear what exactly it is you're looking for since the text 'filename' itself could already be a valid filename while 'c:\folder.txt' could actually be a foldername instead of a filename...
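As a rough sketch of the tag-based approach, the console program below collects the quoted value of a given attribute (href, src, action) together with the position where the value starts. It avoids PosEx, which as far as I know is not in the Delphi 6 RTL. The names ExtractAttr and AttrScan and the sample path images/logo.gif are made up for illustration, and it assumes attributes are written as name="value" or name='value' with no spaces around the equals sign.

```delphi
program AttrScan;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Adds the value of every Attr="..." or Attr='...' found in Html to Found.
// The 1-based position of the value is stored in Objects as an integer.
// Assumption: no whitespace around '=' and the value is quoted.
procedure ExtractAttr(const Html, Attr: string; Found: TStrings);
var
  Lower, Key: string;
  I, ValStart, ValEnd: Integer;
  Quote: Char;
begin
  Lower := LowerCase(Html);        // same length as Html, so positions match
  Key := LowerCase(Attr) + '=';
  I := 1;
  while I <= Length(Lower) - Length(Key) do
  begin
    if (Copy(Lower, I, Length(Key)) = Key) and
       (Lower[I + Length(Key)] in ['"', '''']) then
    begin
      Quote := Lower[I + Length(Key)];
      ValStart := I + Length(Key) + 1;
      ValEnd := ValStart;
      // Walk forward to the matching closing quote (or the end of the text).
      while (ValEnd <= Length(Html)) and (Html[ValEnd] <> Quote) do
        Inc(ValEnd);
      Found.AddObject(Copy(Html, ValStart, ValEnd - ValStart),
        TObject(ValStart));
      I := ValEnd + 1;
    end
    else
      Inc(I);
  end;
end;

const
  // Sample input; images/logo.gif is a hypothetical path.
  Sample = '<A href="http://www.demo.com/purchase.cgi">Order</A>' +
           '<IMG SRC="images/logo.gif">';

var
  Found: TStringList;
  K: Integer;
begin
  Found := TStringList.Create;
  try
    ExtractAttr(Sample, 'href', Found);
    ExtractAttr(Sample, 'src', Found);
    ExtractAttr(Sample, 'action', Found);
    // One line per hit: value, tab, position.
    for K := 0 to Found.Count - 1 do
      WriteLn(Found[K], #9, Integer(Found.Objects[K]));
  finally
    Found.Free;
  end;
end.
```

This is a plain substring match, so it will also pick up things like data-src="..." or attribute-like text inside page content, and it misses unquoted values. A real HTML parser would be more reliable if that matters.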
I have HTML files that do not contain their own file name anywhere in their content.
So how will this be determined? > First column: the HTML file that the file name is mentioned in.
Even worse: c:\<b>folder</b>\Filename.txt
When the user sees the above text, he sees a valid filename with the 'folder' part in bold. But would you consider it a valid filename or not? :-)
As an example, I need to go through the HTML code of each HTML file in the listbox and extract the links such as (for example):

<A href="http://www.demo.com/purchase.cgi">Order</A>

In the above example, I would want the grid to display:

index.htm        http://www.demo.com/purchase.cgi           100

"index.htm" is the name of the file in the listbox. "http://www.demo.com/purchase.cgi" is the link in the searched HTML file. 100 is the character position where the link starts.

IMPORTANT: it will not always be an absolute link as above. It could be just the filename (purchase.cgi), or have folder names before it (cgi-bin/shop/purchase.cgi) or (.../cgi-bin/purchase.cgi).  It could also be images, pl files, relative links, etc.  However, all filenames will end with an extension (.gif, .jpg, .cgi, .pl, .htm or whatever).

Also it will not always be an actual link ("A href" in the HTML code).  It could be an image (the HTML code is different), a POST to a cgi file, or whatever.

Hope this clarifies things.  Thanks.
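One possible way to handle "anything that ends with an extension", sketched as a Delphi 6 compatible console program: split the text into runs of characters that can plausibly appear in a path or URL, and keep each run that ends in a dot plus a short extension. The helper names (HasExtension, ScanForFileNames), the character set, and the 2 to 4 character extension limit are all my own assumptions, not anything from the question, so adjust them to your files.

```delphi
program LinkScan;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  AlphaNum = ['A'..'Z', 'a'..'z', '0'..'9'];
  // Characters assumed to be allowed inside a path or URL (an assumption).
  NameChars = ['A'..'Z', 'a'..'z', '0'..'9',
               '.', '/', '\', ':', '-', '_', '~', '%'];

// True when S ends in a dot followed by a 2..4 character alphanumeric
// extension that starts with a letter, e.g. 'purchase.cgi' or 'logo.gif'.
function HasExtension(const S: string): Boolean;
var
  Dot, I: Integer;
begin
  Result := False;
  Dot := Length(S);
  while (Dot > 0) and (S[Dot] <> '.') do
    Dec(Dot);
  // Need at least one character before the dot and 2..4 after it.
  if (Dot < 2) or (Length(S) - Dot < 2) or (Length(S) - Dot > 4) then
    Exit;
  if not (S[Dot + 1] in ['A'..'Z', 'a'..'z']) then
    Exit;
  for I := Dot + 1 to Length(S) do
    if not (S[I] in AlphaNum) then
      Exit;
  Result := True;
end;

// Adds every run of NameChars that looks like a file name to Found and
// stores its 1-based start position in Objects as an integer.
procedure ScanForFileNames(const Html: string; Found: TStrings);
var
  I, First, Last: Integer;
  Candidate: string;
begin
  I := 1;
  while I <= Length(Html) do
  begin
    if Html[I] in NameChars then
    begin
      First := I;
      while (I <= Length(Html)) and (Html[I] in NameChars) do
        Inc(I);
      Last := I - 1;
      // Drop trailing punctuation such as the full stop in 'see index.htm.'
      while (Last >= First) and (Html[Last] in ['.', ':']) do
        Dec(Last);
      Candidate := Copy(Html, First, Last - First + 1);
      if HasExtension(Candidate) then
        Found.AddObject(Candidate, TObject(First));
    end
    else
      Inc(I);
  end;
end;

var
  Found: TStringList;
  K: Integer;
begin
  Found := TStringList.Create;
  try
    ScanForFileNames(
      '<A href="http://www.demo.com/purchase.cgi">Order</A>', Found);
    // One line per hit: found name, tab, position.
    for K := 0 to Found.Count - 1 do
      WriteLn(Found[K], #9, Integer(Found.Objects[K]));
  finally
    Found.Free;
  end;
end.
```

In the real application you would presumably loop over listbox1.Items, load each file into a TStringList with LoadFromFile, pass its Text to the scan routine, and write the file name, found name and position into grid1.Cells[0..2, Row]. Expect some false positives: a bare domain such as www.demo.com ends in ".com" and would be reported as a file name, and ordinary text that happens to contain a dot and a short word will match too.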
ASKER CERTIFIED SOLUTION
moorhouselondon
Many thanks for your help.