Search through list of HTML files to find file names

I have a list of HTML files in listbox1. I need Delphi code to search through these files, find any file names mentioned in their code, and add them to grid1 as follows:

First column: the HTML file that the file name is mentioned in.
Second column: the found file name.
Third column: position where the file name was found.

I am using Delphi 6. I am giving the maximum 500 points for this.  Many thanks for your help.

> mentioned in the code ?
rincewind666 (Author) commented:
I mean the HTML code of the HTML files.
Wim ten Brink (Self-employed developer) commented:
What's the definition of a filename? Do you consider \\Server\Share\Folder\File.txt a filename? Is C:\Folder\File.txt a filename? Is http://Server/Folder/File.txt a filename? Is filename.txt a filename?

Depending on what you're looking for, search for the characters '//', '\\' or just ':\' and use some smart algorithm to determine if it is part of what you consider to be a filename.

Of course, if you're just looking for links, search for the <a> tag or the <img> tag in the file, since they will refer to other files. But first make clear what exactly it is you're looking for since the text 'filename' itself could already be a valid filename while 'c:\folder.txt' could actually be a foldername instead of a filename...

I have HTML files that do not contain their own file name anywhere in their content.
So how will this be determined: > First column:  the HTML file that the file name is mentioned in. ?
Wim ten Brink (Self-employed developer) commented:
Even worse: c:\<b>folder</b>\Filename.txt
A user looking at the text above would see a valid filename, with the 'folder' part in a bold font. But would you consider it a valid filename or not? :-)
rincewind666 (Author) commented:
As an example, I need to go through the HTML code of each HTML file in the listbox and extract the links such as (for example):

<A href="">Order</A>

In the above example, I would want the grid to display:

index.htm           100

"index.htm" is the name of the file in the listbox. "" is the link in the searched HTML file. 100 is the character position where the link starts.

IMPORTANT: it will not always be absolute links as above. It could be just the filename (purchase.cgi) or any folder names before it (cgi-bin/shop/purchase.cgi) or (.../cgi-bin/purchase.cgi).  It could also be images, pl files, relative links, etc.  However, all filenames will end with an extension (.gif, .jpg, .cgi, .pl, .htm or whatever).

Also, it will not always be an actual link ("A href" in the HTML code).  It could be an image (the HTML code is different), a POST to a cgi file, or whatever.

Hope this clarifies things.  Thanks.
Workshop_Alex is of course correct in asking what a filename is.  Assuming you mean a location or a file name, try this technique:-  

It is probably best to look at the underlying code rather than looking at what is visible.  For example, if you look at the Source for this page, you will see Site News as a link, the file/location name is declared invisibly.  So you need to go through character by character treating the file as a text stream.

One way to approach this problem of yours is to do various tests on the character stream and to see if any one of them produces a file name or location name.  For example:

<a href=""

Set state to 0
To parse this particular construct you would search for '<' (state 1)
If this were followed by 'a'  (state 2)
and then by ' ' (state 3)
then by 'h' (state 4)
etc. when you've hit the inverted comma sign you know that everything up to the next inverted commas is a file/location name.

So that is one construct you can test for.  You can do exactly the same with other constructs.  You will of course need to deal with unexpected input - someone's html may not follow the formatting rules, you want to be able to recover from invalid input without crashing.
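A minimal Delphi sketch of the character-by-character walk described above, assuming only the single construct <a href=" with exactly one space and a double-quoted value. FindHref and its parameters are hypothetical names, not VCL routines, and the sample text simply reuses the relative link from the question. It avoids PosEx, which Delphi 6 does not have.

```delphi
program HrefStateDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// FindHref is a hypothetical helper. State counts how many characters of
// Pattern have been matched so far; 0 is the initial state.
// StartPos is assumed to be >= 1.
function FindHref(const S: string; StartPos: Integer;
  var Link: string; var LinkPos: Integer): Boolean;
const
  Pattern: string = '<a href="';
var
  Lower: string;
  I, J, State: Integer;
begin
  Result := False;
  Link := '';
  LinkPos := 0;
  Lower := LowerCase(S);        // tag and attribute names are case-insensitive
  State := 0;
  for I := StartPos to Length(Lower) do
  begin
    if Lower[I] = Pattern[State + 1] then
      Inc(State)                // expected character: move to the next state
    else if Lower[I] = '<' then
      State := 1                // a new tag may start right here
    else
      State := 0;               // unexpected input: back to the initial state
    if State = Length(Pattern) then
    begin
      // Everything up to the next double quote is the file/location name.
      J := I + 1;
      while (J <= Length(S)) and (S[J] <> '"') do
        Inc(J);
      if J > Length(S) then
        Exit;                   // unterminated value: give up without crashing
      LinkPos := I + 1;
      Link := Copy(S, LinkPos, J - LinkPos);
      Result := True;
      Exit;
    end;
  end;
end;

var
  Html, Link: string;
  P, LinkPos: Integer;
begin
  Html := 'See <A href="cgi-bin/shop/purchase.cgi">Order</A> ' +
          'or <a href="index.htm">Home</a>';
  P := 1;
  while FindHref(Html, P, Link, LinkPos) do
  begin
    Writeln(LinkPos, ': ', Link);
    P := LinkPos + Length(Link);  // resume the search after this value
  end;
end.
```

For this sample the first link is reported at position 14, the character right after the opening quote. Single-quoted or unquoted values, extra whitespace and other tags would each need their own test.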

The technique we are talking about here is State Table analysis.  You build a table with these columns:

Current State.
Character.
Next State.

For example

Current State=2
Character=' '
Next State=3

If each character is put through this list and nothing matches it then you say that State reverts back to its initial state of 0.

This technique will work for situations where there is a choice part way through the parsing, and with many entry points.  You can build up your tests gradually as you begin to find out things which you need which haven't been caught by your existing parses.
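The table idea, with a choice between two entry points, could be sketched in Delphi roughly as follows. The state numbers, the TTransition record and the two attributes chosen (href and src) are assumptions made for the illustration. Unlike a plain reset, the loop retries the current character from state 0 after a mismatch, so input such as "hhref" is not missed.

```delphi
program StateTableDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TTransition = record
    State: Integer;   // current state
    Ch: Char;         // character read
    Next: Integer;    // state to move to
  end;

const
  AcceptState = 8;    // reached just after  href="  or  src="
  // Hypothetical table: states 1..3 spell "href", states 4..5 spell "src",
  // and both branches join at state 6, which expects  ="
  Table: array[0..8] of TTransition = (
    (State: 0; Ch: 'h'; Next: 1),
    (State: 1; Ch: 'r'; Next: 2),
    (State: 2; Ch: 'e'; Next: 3),
    (State: 3; Ch: 'f'; Next: 6),
    (State: 0; Ch: 's'; Next: 4),
    (State: 4; Ch: 'r'; Next: 5),
    (State: 5; Ch: 'c'; Next: 6),
    (State: 6; Ch: '='; Next: 7),
    (State: 7; Ch: '"'; Next: AcceptState)
  );

// Returns the next state for (State, Ch), or 0 when no row matches.
function NextState(State: Integer; Ch: Char): Integer;
var
  K: Integer;
begin
  Result := 0;
  for K := Low(Table) to High(Table) do
    if (Table[K].State = State) and (Table[K].Ch = Ch) then
    begin
      Result := Table[K].Next;
      Exit;
    end;
end;

// Prints "position: value" for every href="..." or src="..." found in S.
procedure ScanAttributes(const S: string);
var
  Lower: string;
  I, J, N, State: Integer;
begin
  Lower := LowerCase(S);
  State := 0;
  I := 1;
  while I <= Length(Lower) do
  begin
    N := NextState(State, Lower[I]);
    if (N = 0) and (State <> 0) then
      N := NextState(0, Lower[I]);  // revert to state 0, retry this character
    State := N;
    Inc(I);
    if State = AcceptState then
    begin
      J := I;                       // I is the first character of the value
      while (J <= Length(S)) and (S[J] <> '"') do
        Inc(J);
      if J <= Length(S) then
        Writeln(I, ': ', Copy(S, I, J - I));
      I := J + 1;                   // continue after the closing quote
      State := 0;
    end;
  end;
end;

begin
  ScanAttributes('<IMG src="images/logo.gif"><a href="purchase.cgi">Order</a>');
end.
```

Adding another construct means adding rows to the table rather than new code. Note that this sketch matches href=" and src=" anywhere in the text, not only inside tags.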

This is a very brief summary of a very frequently used programming technique.  Do some googling to find out more, but if you have any specific questions let us know...

Wim ten Brink (Self-employed developer) commented:
Actually, since you want to extract filenames from certain tags, all you have to do is just find the specific tags that might be pointing to other files. Thus these tags, for example: <a> <img> <form> <object>
And there are a few more tags that you need to check. Within the < and > of the tag you will find the filename that you are looking for, as some kind of attribute: href= for <a>, src= for some other tags such as <img>.

So, where to start? Well, first you open an HTML file and load the whole contents into a string or stringlist. Then you find the < character in the file. When you find it, continue to read the rest of the tag name. (In other words, continue reading until you encounter a space or a > character.) Then check if this tag can hold a filename and if it can, try to find the attribute between the < and > that holds the filename. If it's not there then there's no filename. Otherwise, you have found it. Then continue to search for the next <...
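The steps above might look something like this in Delphi 6. This is only a sketch under stated assumptions: FileAttrFor and ExtractLinks are hypothetical helpers, the tag/attribute list is deliberately short, no whitespace is expected around the = sign, and a > inside an attribute value would end the tag too early.

```delphi
program TagScanDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical lookup: which attribute of a (lower-case) tag usually holds
// a file reference. Extend the list as further tags turn up.
function FileAttrFor(const TagName: string): string;
begin
  if TagName = 'a' then Result := 'href'
  else if TagName = 'img' then Result := 'src'
  else if TagName = 'form' then Result := 'action'
  else if TagName = 'object' then Result := 'data'
  else Result := '';
end;

// Appends every file reference found in Html to Found. The 1-based position
// of the value is kept in Found.Objects as an integer cast to TObject
// (a common idiom on 32-bit Delphi such as Delphi 6).
procedure ExtractLinks(const Html: string; Found: TStrings);
var
  Lower, Attr: string;
  P, TagEnd, NameEnd, A, V, VEnd: Integer;
begin
  Lower := LowerCase(Html);
  P := 1;
  while P <= Length(Html) do
  begin
    if Html[P] <> '<' then
    begin
      Inc(P);
      Continue;
    end;
    // The tag runs up to the next '>' (or to the end of the text).
    TagEnd := P + 1;
    while (TagEnd <= Length(Html)) and (Html[TagEnd] <> '>') do
      Inc(TagEnd);
    // The tag name ends at the first whitespace, '/' or '>'.
    NameEnd := P + 1;
    while (NameEnd < TagEnd) and
      not (Lower[NameEnd] in [' ', #9, #10, #13, '/']) do
      Inc(NameEnd);
    Attr := FileAttrFor(Copy(Lower, P + 1, NameEnd - P - 1));
    A := 0;
    if Attr <> '' then
      A := Pos(Attr + '=', Copy(Lower, NameEnd, TagEnd - NameEnd));
    if A > 0 then
    begin
      V := NameEnd + A + Length(Attr);   // first character after '='
      VEnd := V;
      if (V < TagEnd) and (Html[V] in ['"', '''']) then
      begin
        Inc(V);                          // skip the opening quote
        VEnd := V;
        while (VEnd < TagEnd) and (Html[VEnd] <> Html[V - 1]) do
          Inc(VEnd);                     // stop at the matching quote
      end
      else
        while (VEnd < TagEnd) and (Html[VEnd] > ' ') do
          Inc(VEnd);                     // unquoted value ends at whitespace
      if VEnd > V then
        Found.AddObject(Copy(Html, V, VEnd - V), TObject(V));
    end;
    P := TagEnd + 1;                     // continue with the next '<'
  end;
end;

var
  Found: TStringList;
  I: Integer;
begin
  Found := TStringList.Create;
  try
    ExtractLinks('<A href="cgi-bin/shop/purchase.cgi">Order</A>' +
                 '<img src=logo.gif>', Found);
    for I := 0 to Found.Count - 1 do
      Writeln(Integer(Found.Objects[I]), #9, Found[I]);
  finally
    Found.Free;
  end;
end.
```

Filling the grid from the list box could then be done along these lines, assuming ListBox1 holds full file paths, Grid1 is a TStringGrid with three columns, and ScanButtonClick is a hypothetical event handler on the form:

```delphi
procedure TForm1.ScanButtonClick(Sender: TObject);
var
  F, L, Row: Integer;
  Src, Found: TStringList;
begin
  Src := TStringList.Create;
  Found := TStringList.Create;
  try
    Row := Grid1.FixedRows;
    for F := 0 to ListBox1.Items.Count - 1 do
    begin
      Src.LoadFromFile(ListBox1.Items[F]);
      Found.Clear;
      ExtractLinks(Src.Text, Found);
      for L := 0 to Found.Count - 1 do
      begin
        Grid1.RowCount := Row + 1;
        Grid1.Cells[0, Row] := ExtractFileName(ListBox1.Items[F]);
        Grid1.Cells[1, Row] := Found[L];
        Grid1.Cells[2, Row] := IntToStr(Integer(Found.Objects[L]));
        Inc(Row);
      end;
    end;
  finally
    Found.Free;
    Src.Free;
  end;
end;
```

Positions here are relative to Src.Text, which rejoins the lines with CR/LF, so they can differ from byte offsets in a file that uses bare LF line endings.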

But this approach might have some flaws. Some webpages are interactive and use javascript or vbscript to alter the values of certain tags. They could easily change the filename of an <img> tag to something else, and you won't be able to discover these filenames because you read the source, you don't execute it. Scripting is an easy technique to prevent people from walking through the files of a webpage.

If an HTML page is well-formed then it would be possible to use the XMLDocument component to read the attributes of certain tags in an easier way. Unfortunately, most webpages are NOT well-formed. And even the XMLDocument component won't be able to execute the scripts...
perhaps you'd get some help from a PAQ I contributed to:
rincewind666 (Author) commented:
Many thanks for your help.