Search through list of HTML files to find file names

I have a list of HTML files in listbox1. I need Delphi code that searches through these files, finds any file names mentioned in their HTML code, and adds them to grid1 as follows:

First column: the HTML file that the file name is mentioned in.
Second column: the found file name.
Third column: position where the file name was found.

I am using Delphi 6. I am giving the maximum 500 points for this. Many thanks for your help.

esoftbg

> mentioned in the code ?

rincewind666

ASKER

I mean the HTML code of the HTML files.

Wim ten Brink

What's the definition of a filename? Do you consider \\Server\Share\Folder\File.txt a filename? Is C:\Folder\File.txt a filename? Is http://Server/Folder/File.txt a filename? Is filename.txt a filename?

Depending on what you're looking for, search for the characters '//', '\\' or just ':\' and use some smart algorithm to determine whether the match is part of what you consider to be a filename.

Of course, if you're just looking for links, search for the <a> tag or the <img> tag in the file, since they will refer to other files. But first make clear what exactly it is you're looking for since the text 'filename' itself could already be a valid filename while 'c:\folder.txt' could actually be a foldername instead of a filename...
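To illustrate the tag-based approach, here is a minimal sketch that scans HTML text for quoted attribute values and reports the 1-based position where each value starts. The helper name FindAttrValue, the attribute list, and the img part of the sample string are assumptions made for the example, not something from this thread. It only handles the form name="value" with no spaces around the equals sign, and it will also match the attribute name if it happens to appear in ordinary text. It avoids PosEx, which as far as I know is not available in Delphi 6.

```delphi
program FindLinks;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: finds the next quoted value of attribute AttrName
// (for example 'href') at or after StartPos. Returns the 1-based position of
// the first character of the value, or 0 when nothing more is found.
function FindAttrValue(const Html, AttrName: string; StartPos: Integer;
  out Value: string): Integer;
var
  Lower, Key: string;
  I, J, N: Integer;
  Quote: Char;
begin
  Result := 0;
  Value := '';
  Lower := LowerCase(Html);          // case-insensitive match on the name
  Key := LowerCase(AttrName) + '=';
  N := Length(Html);
  I := StartPos;
  if I < 1 then
    I := 1;
  while I <= N - Length(Key) do
  begin
    if Copy(Lower, I, Length(Key)) = Key then
    begin
      J := I + Length(Key);
      if (Html[J] = '"') or (Html[J] = '''') then
      begin
        Quote := Html[J];
        Inc(J);
        Result := J;                 // value starts just after the quote
        while (J <= N) and (Html[J] <> Quote) do
          Inc(J);
        Value := Copy(Html, Result, J - Result);
        Exit;
      end;
    end;
    Inc(I);
  end;
end;

const
  // Assumed attribute names; extend as needed (background, codebase, ...).
  Attrs: array[0..2] of string = ('href', 'src', 'action');
  // The anchor comes from the question; the img tag is made up for the demo.
  Sample = '<A href="http://www.demo.com/purchase.cgi">Order</A>' +
           '<img src="images/logo.gif">';
var
  A, P, NextPos: Integer;
  V: string;
begin
  for A := Low(Attrs) to High(Attrs) do
  begin
    P := FindAttrValue(Sample, Attrs[A], 1, V);
    while P > 0 do
    begin
      Writeln(Attrs[A], #9, V, #9, P);   // attribute, value, position
      NextPos := P + Length(V);
      P := FindAttrValue(Sample, Attrs[A], NextPos, V);
    end;
  end;
end.
```

In a form, the same loop could run once per entry in the listbox and add one grid row per hit instead of calling Writeln. Note that if a file is loaded with TStringList and read back through its Text property, the line breaks may be rewritten, so the reported positions are relative to that string rather than to the bytes on disk.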

esoftbg

I have HTML files that do not contain their own file name anywhere in their content.
How will this be determined: > First column: the HTML file that the file name is mentioned in. ?

Wim ten Brink

Even worse: c:\<b>folder</b>\Filename.txt
A user viewing the text above would see a valid filename, with the 'folder' part in a bold font. But would you consider it a valid filename or not? :-)

rincewind666

ASKER

As an example, I need to go through the HTML code of each HTML file in the listbox and extract the links such as (for example):

<A href="http://www.demo.com/purchase.cgi">Order</A>

In the above example, I would want the grid to display:

index.htm http://www.demo.com/purchase.cgi 100

"index.htm" is the name of the file in the listbox. "http://www.demo.com/purchase.cgi" is the link in the searched HTML file. 100 is the character position where the link starts.

IMPORTANT: it will not always be absolute links as above. It could be just the filename (purchase.cgi) or any folder names before it (cgi-bin/shop/purchase.cgi) or (.../cgi-bin/purchase.cgi). It could also be images, pl files, relative links, etc. However, all filenames will end with an extension (.gif, .jpg, .cgi, .pl, .htm or whatever).

Also it will not always be an actual link ("A href" in the HTML code). It could be an image (the HTML code is different), a POST to a cgi file, or whatever.
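Since every wanted filename ends with an extension, one possible filter for candidate strings is sketched below. HasFileExtension is a made-up name, and the rule it applies (a dot inside the last path segment, with something on both sides of it) is an assumption about what counts as an extension. It deliberately avoids ExtractFileExt, which on Windows does not treat '/' as a path separator. A bare host name such as http://www.demo.com would still pass this check, so it is a rough filter only.

```delphi
program ExtCheck;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: True when the last path segment of Ref (the part after
// the final '/' or '\', ignoring any '?query' or '#fragment') contains a dot
// with at least one character before it and one after it.
function HasFileExtension(const Ref: string): Boolean;
var
  S: string;
  I, Cut, DotPos: Integer;
begin
  S := Ref;
  // Drop everything from the first '?' or '#' onward.
  Cut := 0;
  for I := 1 to Length(S) do
    if (S[I] = '?') or (S[I] = '#') then
    begin
      Cut := I;
      Break;
    end;
  if Cut > 0 then
    S := Copy(S, 1, Cut - 1);
  // Keep only the last path segment.
  I := LastDelimiter('/\', S);
  S := Copy(S, I + 1, MaxInt);
  // Look for the last dot inside that segment.
  DotPos := LastDelimiter('.', S);
  Result := (DotPos > 1) and (DotPos < Length(S));
end;

begin
  // The first three strings are taken from the examples above;
  // the last one is made up to show a reference without a file name.
  Writeln(HasFileExtension('purchase.cgi'));
  Writeln(HasFileExtension('cgi-bin/shop/purchase.cgi'));
  Writeln(HasFileExtension('http://www.demo.com/purchase.cgi'));
  Writeln(HasFileExtension('http://www.demo.com/shop/'));
end.
```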

Hope this clarifies things. Thanks.

ASKER CERTIFIED SOLUTION