Search through list of HTML files to find file names

Posted on 2004-11-03
Last Modified: 2010-05-18
I have a list of HTML files in listbox1. I need Delphi code to search through these files, find any file names mentioned in their HTML code, and add them to grid1 as follows:

First column: the HTML file that the file name is mentioned in.
Second column: the found file name.
Third column: position where the file name was found.

I am using Delphi 6. I am giving the maximum 500 points for this.  Many thanks for your help.
Question by:rincewind666

    Expert Comment

    > mentioned in the code ?

    Author Comment

    I mean the HTML code of the HTML files.

    Expert Comment

    by:Wim ten Brink
    What's the definition of a filename? Do you consider \\Server\Share\Folder\File.txt a filename? Is C:\Folder\File.txt a filename? Is http://Server/Folder/File.txt a filename? Is filename.txt a filename?

    Depending on what you're looking for, search for the characters '//', '\\' or just ':\' and use some smart algorithm to determine whether they are part of what you consider to be a filename.

    Of course, if you're just looking for links, search for the <a> tag or the <img> tag in the file, since they will refer to other files. But first make clear what exactly it is you're looking for since the text 'filename' itself could already be a valid filename while 'c:\folder.txt' could actually be a foldername instead of a filename...

    Expert Comment

    I have HTML files that do not contain their own file name anywhere in their content.
    How will the first column (the HTML file that the file name is mentioned in) be determined?

    Expert Comment

    by:Wim ten Brink
    Even worse: c:\<b>folder</b>\Filename.txt
    When the user sees the above text rendered, he sees a valid filename with the 'folder' part in a bold font. But would you consider it a valid filename or not? :-)

    Author Comment

    As an example, I need to go through the HTML code of each HTML file in the listbox and extract the links such as (for example):

    <A href="">Order</A>

    In the above example, I would want the grid to display:

    index.htm           100

    "index.htm" is the name of the file in the listbox. "" is the link in the searched HTML file. 100 is the character position where the link starts.

    IMPORTANT: it will not always be an absolute link as above. It could be just the filename (purchase.cgi), or have folder names before it (cgi-bin/shop/purchase.cgi) or (.../cgi-bin/purchase.cgi).  It could also be images, pl files, relative links, etc.  However, all filenames will end with an extension (.gif, .jpg, .cgi, .pl, .htm or whatever).

    Also, it will not always be an actual link ("A href" in the HTML code).  It could be an image (the HTML code is different), a POST to a cgi file, or whatever.

    Hope this clarifies things.  Thanks.

    Accepted Solution

    Workshop_Alex is of course correct in asking what a filename is.  Assuming you mean a location or a file name, try this technique:

    It is probably best to look at the underlying code rather than at what is visible.  For example, if you look at the source for this page, you will see Site News as a link; the file/location name is declared invisibly.  So you need to go through the file character by character, treating it as a text stream.

    One way to approach this problem of yours is to do various tests on the character stream and to see if any one of them produces a file name or location name.  For example:

    <a href=""

    Set state to 0
    To parse this particular construct you would search for '<' (state 1)
    If this were followed by 'a'  (state 2)
    and then by ' ' (state 3)
    then by 'h' (state 4)
    and so on. Once you hit the opening quotation mark, you know that everything up to the next quotation mark is a file/location name.

    So that is one construct you can test for.  You can do exactly the same with other constructs.  You will of course need to deal with unexpected input: someone's HTML may not follow the formatting rules, and you want to be able to recover from invalid input without crashing.

    The technique we are talking about here is State Table analysis.  You build a table with these columns:

    Current State.
    Character.
    Next State.

    For example

    Current State=2
    Character=' '
    Next State=3

    If a character is put through this table and no row matches it, then State reverts to its initial value of 0.

    This technique will work for situations where there is a choice part way through the parsing, and with many entry points.  You can build up your tests gradually as you begin to find out things which you need which haven't been caught by your existing parses.
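
    As a rough illustration of the state table idea, here is a minimal sketch. It is written in Python rather than Delphi purely to keep it short, and the loop should port to Delphi more or less line by line. The names PATTERN, TRANSITIONS, CAPTURE and find_links are made up for this example, and it only recognises the double-quoted <a href="..."> construct. Positions are 0-based, so add 1 to get a Delphi string index.

```python
# Transition table: (current_state, character) -> next_state.
# States 1..9 record how much of '<a href="' has been matched so far,
# e.g. (0, '<') -> 1, (1, 'a') -> 2, (2, ' ') -> 3, (3, 'h') -> 4.
PATTERN = '<a href="'
TRANSITIONS = {(i, ch): i + 1 for i, ch in enumerate(PATTERN)}
CAPTURE = len(PATTERN)  # state 9: we are inside the quoted value


def find_links(html):
    """Return (link, position) pairs for every <a href="..."> in html."""
    results = []
    state = 0
    start = 0
    for pos, ch in enumerate(html):
        if state == CAPTURE:
            if ch == '"':  # closing quote ends the value
                results.append((html[start:pos], start))
                state = 0
            continue
        c = ch.lower()  # HTML tag and attribute names are case-insensitive
        # No matching row means "revert to state 0". The character is then
        # retried from state 0, so a '<' that breaks a partial match can
        # still begin a new one.
        state = TRANSITIONS.get((state, c), TRANSITIONS.get((0, c), 0))
        if state == CAPTURE:
            start = pos + 1  # the value begins after the opening quote
    return results
```

    Calling find_links on the text of each file would give the second and third grid columns; the first column is simply the listbox entry being scanned. Unterminated input (no closing quote) produces no result rather than an error.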

    This is a very brief summary of a very frequently used programming technique.  Do some googling to find out more, but if you have any specific questions let us know...

    Assisted Solution

    by:Wim ten Brink
    Actually, since you want to extract filenames from certain tags, all you have to do is just find the specific tags that might be pointing to other files. Thus these tags, for example: <a> <img> <form> <object>
    And there are a few more tags that you need to check too. Within the < and > of the tag you will find the filename that you are looking for, as an attribute: href= for <a>, src= for <img>, action= for <form>, data= for <object>.

    So, where to start? Well, first you open an HTML file and load its whole contents into a string or stringlist. Then you find the < character in the file. When you find it, read the tag name. (In other words, continue reading until you encounter a space or a > character.) Then check whether this tag can hold a filename and, if it can, try to find the attribute between the < and > that holds the filename. If it's not there, then there's no filename. Otherwise, you're there. Then continue to search for the next <...
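
    The steps above could be sketched roughly like this. It is Python rather than Delphi, only to show the logic compactly. The tag-to-attribute mapping (href, src, action, data) and the names FILE_ATTRS, VALUE and scan_tags are assumptions made for this example, not a complete list. It also assumes no > character appears inside an attribute value. Positions are 0-based.

```python
import re

# attribute = "value" | 'value' | bare-value
VALUE = r"\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s\"'>]+))"

# Tag name -> compiled pattern for the attribute that may hold a file name.
# Hypothetical mapping for illustration; extend it as more cases turn up.
FILE_ATTRS = {
    tag: re.compile(r"\s" + attr + VALUE, re.IGNORECASE)
    for tag, attr in {"a": "href", "img": "src",
                      "form": "action", "object": "data"}.items()
}


def scan_tags(html):
    """Return (file_name, position) pairs found in known tag attributes."""
    found = []
    pos = html.find("<")
    while pos != -1:
        end = html.find(">", pos)
        if end == -1:  # unterminated tag: stop cleanly
            break
        # The tag name runs from just after '<' to the first whitespace.
        name = re.match(r"[^\s>]*", html[pos + 1:end]).group(0).lower()
        pattern = FILE_ATTRS.get(name)
        if pattern:
            # Search only between '<' and '>' of this tag.
            m = pattern.search(html, pos, end)
            if m:
                group = m.lastindex  # whichever alternative matched
                found.append((m.group(group), m.start(group)))
        pos = html.find("<", end)
    return found
```

    Each returned pair would become the second and third grid columns for the file being scanned. As noted, anything a script writes into the page at run time will not show up here.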

    But this approach might have some flaws. Some webpages are interactive and use javascript or vbscript to alter the values of certain tags. They could easily change the filename of an <img> tag to something else, and you won't be able to discover these filenames because you read the source, you don't execute it. Scripting is an easy technique to prevent people from walking through the files of a webpage.

    If an HTML page is well-formed, then it would be possible to use the XMLDocument component to read the attributes of certain tags in an easier way. Unfortunately, most webpages are NOT well-formed. And even the XMLDocument won't be able to execute the scripts...
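
    To illustrate the well-formed case, here is a small sketch that uses Python's standard xml.etree.ElementTree as a stand-in for XMLDocument (the Delphi component has its own API, which is not shown here). The names FILE_ATTRS and links_from_wellformed are invented for the example. It returns None when the page does not parse, so the caller can fall back to a plain text scan. Note that this parser does not report character positions.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag -> attribute mapping, for illustration only.
FILE_ATTRS = {"a": "href", "img": "src", "form": "action", "object": "data"}


def links_from_wellformed(xhtml):
    """Return (tag, file_name) pairs, or None if the page is not well-formed."""
    try:
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        return None  # not well-formed: fall back to a text scan instead
    pairs = []
    for elem in root.iter():
        # Drop any '{namespace}' prefix that XHTML documents may carry.
        tag = elem.tag.rsplit("}", 1)[-1].lower()
        attr = FILE_ATTRS.get(tag)
        if attr and attr in elem.attrib:
            pairs.append((tag, elem.attrib[attr]))
    return pairs
```

    A page with an unclosed <br> or <img> tag, which is very common in hand-written HTML, already makes this return None, which is exactly the limitation described above.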

    Expert Comment

    perhaps you'd get some help from a PAQ I contributed to:

    Author Comment

    Many thanks for your help.
