Search through list of HTML files to find file names

I have a list of HTML files in listbox1. I need Delphi code to search through these files, find any file names mentioned in the code, and add them to grid1 as follows:

First column: the HTML file that the file name is mentioned in.
Second column: the found file name.
Third column: position where the file name was found.

I am using Delphi 6. I am giving the maximum 500 points for this.  Many thanks for your help.
> mentioned in the code ?
rincewind666 (Author) commented:
I mean the HTML code of the HTML files.
Wim ten Brink (Self-employed developer) commented:
What's the definition of a filename? Do you consider \\Server\Share\Folder\File.txt a filename? Is C:\Folder\File.txt a filename? Is http://Server/Folder/File.txt a filename? Is filename.txt a filename?

Depending on what you're looking for, search for the characters '//', '\\' or just ':\' and use some smart algorithm to determine whether each match is part of what you consider to be a filename.

Of course, if you're just looking for links, search for the <a> tag or the <img> tag in the file, since they will refer to other files. But first make clear what exactly it is you're looking for since the text 'filename' itself could already be a valid filename while 'c:\folder.txt' could actually be a foldername instead of a filename...
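To make the ambiguity concrete, here is a minimal sketch that sorts a candidate string into the four shapes listed above. It is written in Python rather than Delphi purely for illustration; the function name `classify` and the category labels are invented for this example, and the checks are deliberately naive.

```python
def classify(candidate):
    """Guess which kind of file reference a string looks like (naive checks)."""
    if candidate.startswith("\\\\"):
        return "UNC path"                 # e.g. \\Server\Share\Folder\File.txt
    if len(candidate) >= 3 and candidate[0].isalpha() and candidate[1:3] == ":\\":
        return "drive path"               # e.g. C:\Folder\File.txt
    if "://" in candidate:
        return "URL"                      # e.g. http://Server/Folder/File.txt
    return "bare or relative name"        # e.g. filename.txt


for sample in (r"\\Server\Share\Folder\File.txt", r"C:\Folder\File.txt",
               "http://Server/Folder/File.txt", "filename.txt"):
    print(sample, "->", classify(sample))
```

None of these checks can tell a file from a folder, which is exactly the problem with 'c:\folder.txt'.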

My HTML files do not contain their own file name anywhere in their content.
So how would this be determined: > First column:  the HTML file that the file name is mentioned in. ?
Wim ten Brink (Self-employed developer) commented:
Even worse: c:\<b>folder</b>\Filename.txt
If the user saw the text above in a browser, he would see a valid filename with the 'folder' part in a bold font. But would you consider it a valid filename or not? :-)
rincewind666 (Author) commented:
As an example, I need to go through the HTML code of each HTML file in the listbox and extract the links such as (for example):

<A href="http://www.demo.com/purchase.cgi">Order</A>

In the above example, I would want the grid to display:

index.htm        http://www.demo.com/purchase.cgi           100

"index.htm" is the name of the file in the listbox. "http://www.demo.com/purchase.cgi" is the link in the searched HTML file. 100 is the character position where the link starts.

IMPORTANT: it will not always be absolute links as above. It could be just the filename (purchase.cgi), or it could have folder names before it (cgi-bin/shop/purchase.cgi) or (.../cgi-bin/purchase.cgi).  It could also be images, pl files, relative links, etc.  However, all filenames will end with an extension (.gif, .jpg, .cgi, .pl, .htm or whatever).

Also it will not always be an actual link ("A href" in the HTML code).  It could be an image (the HTML code is different), a POST to a cgi file, or whatever.

Hope this clarifies things.  Thanks.
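One literal reading of this requirement (any quoted value that ends in an extension, reported with its starting position) could be sketched as below. This is Python rather than Delphi, only to show the logic; the names `QUOTED_FILE` and `find_file_names` are invented here, the 1-to-5 character extension limit is an assumption, and positions are 1-based on the assumption that they should match Delphi string indexing.

```python
import re

# Assumption: a "file name" is a quoted value ending in a dot plus a
# 1-5 character alphanumeric extension, e.g. "cgi-bin/shop/purchase.cgi".
QUOTED_FILE = re.compile(r'''["']([^"'<>\s]+\.[A-Za-z0-9]{1,5})["']''')


def find_file_names(html_name, html_text):
    """Return (html file, found name, 1-based position) rows for one file."""
    rows = []
    for match in QUOTED_FILE.finditer(html_text):
        # match.start(1) is 0-based; add 1 to mimic Delphi string indexing.
        rows.append((html_name, match.group(1), match.start(1) + 1))
    return rows


sample = '<A href="http://www.demo.com/purchase.cgi">Order</A>'
for row in find_file_names("index.htm", sample):
    print(row)
```

A pattern like this misses unquoted attribute values and will also report a quoted bare host name such as "http://www.demo.com", because ".com" looks like an extension.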
Workshop_Alex is of course correct in asking what a filename is.  Assuming you mean a location or a file name, try this technique:-  

It is probably best to look at the underlying code rather than looking at what is visible.  For example, if you look at the Source for this page, you will see Site News as a link, the file/location name is declared invisibly.  So you need to go through character by character treating the file as a text stream.

One way to approach this problem of yours is to do various tests on the character stream and to see if any one of them produces a file name or location name.  For example:

<a href="http://experts-exchange.4jobs.com/"

Set state to 0
To parse this particular construct you would search for '<' (state 1)
If this were followed by 'a'  (state 2)
and then by ' ' (state 3)
then by 'h' (state 4)
and so on. Once you reach the opening quotation mark, you know that everything up to the next quotation mark is a file/location name.

So that is one construct you can test for.  You can do exactly the same with other constructs.  You will of course need to deal with unexpected input: someone's HTML may not follow the formatting rules, and you want to be able to recover from invalid input without crashing.
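As a rough illustration of the walk-through above, the sketch below recognises only the single construct `<a href="` one character at a time. It is Python rather than Delphi, the function name `find_hrefs` is invented, and using the count of matched prefix characters as the state number is a simplification chosen for this example. Positions are 1-based.

```python
def find_hrefs(text):
    """Scan character by character for <a href="..."> and collect targets."""
    prefix = '<a href="'      # the construct being recognised
    state = 0                 # how many prefix characters have matched so far
    results = []
    i = 0
    while i < len(text):
        if state < len(prefix):
            ch = text[i].lower()          # tolerate <A HREF="...">
            if ch == prefix[state]:
                state += 1                # advance to the next state
            elif ch == prefix[0]:
                state = 1                 # a fresh '<' restarts the match
            else:
                state = 0                 # unexpected input: back to start
            i += 1
        else:
            end = text.find('"', i)       # everything up to the closing quote
            if end == -1:
                break                     # unterminated value: stop quietly
            results.append((text[i:end], i + 1))
            i = end + 1
            state = 0
    return results


print(find_hrefs('<a href="http://experts-exchange.4jobs.com/">'))
```

This version expects exactly one space after `<a` and `href` as the first attribute; anything else simply resets the state, which is the "recover without crashing" behaviour described above.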

The technique we are talking about here is State Table analysis.  You build a table with these columns:

Current State.
Character.
Next State.

For example

Current State=2
Character=' '
Next State=3

Each character is looked up in this table; if no entry matches, the State reverts to its initial value of 0.

This technique will work for situations where there is a choice part way through the parsing, and with many entry points.  You can build up your tests gradually as you begin to find out things which you need which haven't been caught by your existing parses.
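A table-driven version might look like the sketch below, where the table is keyed on (current state, character) and several constructs share the same entry point. It is Python rather than Delphi; `build_table`, `scan` and the `CAPTURE` marker are names invented for this example, and only double-quoted values are handled.

```python
CAPTURE = -1  # hypothetical marker state: "now inside the quoted value"


def build_table(constructs):
    """Build {(state, char): next state} from prefixes such as '<a href="'."""
    table = {}
    next_free = 1
    for prefix in constructs:
        state = 0
        for ch in prefix[:-1]:
            key = (state, ch)
            if key not in table:          # shared beginnings reuse states
                table[key] = next_free
                next_free += 1
            state = table[key]
        table[(state, prefix[-1])] = CAPTURE
    return table


def scan(text, table):
    """Return (value, 1-based position) for every construct found."""
    results, state, i = [], 0, 0
    while i < len(text):
        if state == CAPTURE:
            end = text.find('"', i)
            if end == -1:
                break                     # unterminated value: stop quietly
            results.append((text[i:end], i + 1))
            i, state = end + 1, 0
            continue
        ch = text[i].lower()
        # No matching entry: revert to state 0, then retry from there.
        state = table.get((state, ch), table.get((0, ch), 0))
        i += 1
    return results


table = build_table(['<a href="', '<img src="'])
print(scan('<a href="a.htm"><IMG SRC="pic.gif">', table))
```

New constructs are added by extending the list passed to `build_table`, which matches the idea of building up the tests gradually.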

This is a very brief summary of a very frequently used programming technique.  Do some googling to find out more, but if you have any specific questions let us know...
Wim ten Brink (Self-employed developer) commented:
Actually, since you want to extract filenames from certain tags, all you have to do is just find the specific tags that might be pointing to other files. Thus these tags, for example: <a> <img> <form> <object>
And there are a few more tags too that you need to check. Within the < and > of the tag you will find the filename that you are looking for, as some kind of attribute: href= for <a>, src= for <img>, action= for <form>, data= for <object>.

So, where to start? Well, first you open an HTML file and load the whole contents into a string or stringlist. Then you find the < character in the file. When you find it, continue to read the tag name. (In other words, continue reading until you encounter a space or a > character.) Then check whether this tag can hold a filename and, if it can, try to find the attribute between the < and > that holds the filename. If it's not there then there's no filename. Otherwise, you're there. Then continue to search for the next <...
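The steps above could be sketched roughly as follows. This is Python rather than Delphi, shown only for the logic; `FILE_ATTRS` and `scan_tags` are names invented for this example, the tag list is partial, and positions are 1-based on the assumption that they should match Delphi string indexing.

```python
# Tag -> attribute that may hold a file name (a partial list for this sketch).
FILE_ATTRS = {"a": "href", "img": "src", "form": "action", "object": "data"}


def scan_tags(html_name, text):
    """Return (html file, attribute value, 1-based position) rows."""
    rows = []
    pos = text.find("<")
    while pos != -1:
        close = text.find(">", pos)
        if close == -1:
            break                              # unterminated tag: stop quietly
        tag = text[pos + 1:close]              # text between < and >
        parts = tag.split(None, 1)             # tag name, then the rest
        name = parts[0].lower() if parts else ""
        attr = FILE_ATTRS.get(name)
        at = tag.lower().find(attr + "=") if attr else -1
        if at != -1:
            start = at + len(attr) + 1         # first character after '='
            quote = tag[start:start + 1]
            if quote in ('"', "'"):
                start += 1
                end = tag.find(quote, start)
                if end == -1:
                    end = len(tag)
            else:
                end = start                    # unquoted: runs to whitespace
                while end < len(tag) and not tag[end].isspace():
                    end += 1
            # pos + 1 is where the tag text starts; + 1 again for 1-based.
            rows.append((html_name, tag[start:end], pos + start + 2))
        pos = text.find("<", close + 1)
    return rows


print(scan_tags("index.htm",
                '<A href="http://www.demo.com/purchase.cgi">Order</A>'))
```

A `>` inside a quoted attribute value, or an attribute whose name merely ends in `src` or `href`, will confuse this sketch.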

But this approach might have some flaws. Some webpages are interactive and use javascript or vbscript to alter the values of certain tags. They could easily change the filename of an <img> tag to something else, and you won't be able to discover these filenames because you read the source, you don't execute it. Scripting is an easy technique to prevent people from walking through the files of a webpage.

If an HTML page is well-formed then it would be possible to use the XMLDocument component to read the attributes of certain tags in an easier way. Unfortunately, most webpages are NOT well-formed. And even the XMLDocument won't be able to execute the scripts...
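To illustrate the well-formed case, the sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for Delphi's XMLDocument (an assumption made so the example is easy to run; it is not the same component). The names `FILE_ATTRS` and `links_from_xhtml` are invented here. Note that this parser does not report character positions, so the third grid column would need another approach.

```python
import xml.etree.ElementTree as ET

# Tag -> attribute that may hold a file name (a partial list for this sketch).
FILE_ATTRS = {"a": "href", "img": "src", "form": "action", "object": "data"}


def links_from_xhtml(markup):
    """Return attribute values, or None if the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None                       # typical for real-world HTML
    found = []
    for element in root.iter():
        # Drop any XML namespace prefix of the form {uri}name.
        name = element.tag.rsplit("}", 1)[-1].lower()
        attr = FILE_ATTRS.get(name)
        if attr and element.get(attr):
            found.append(element.get(attr))
    return found


good = '<html><body><a href="a.htm">x</a><img src="b.gif"/></body></html>'
bad = '<html><img src="b.gif"></html>'    # <img> never closed
print(links_from_xhtml(good))
print(links_from_xhtml(bad))
```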
Perhaps you'd get some help from a PAQ I contributed to: http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_20830833.html
rincewind666 (Author) commented:
Many thanks for your help.