Solved

Analyzing HTML documents

Posted on 2002-05-03
Last Modified: 2010-05-02
Hi,

I want to analyze some HTML files (stored locally) and be able to retrieve all the links, images or any other tags.
Currently I do this by loading the file into a WebBrowser control, where I can read
WebBrowser1.document.Images
WebBrowser1.document.Links
WebBrowser1.document.All....
etc.
The problem is that many files have images inserted with absolute http:// paths, try to connect to other websites (visit counters, etc.), or display alerts, prompts, confirmation boxes, script errors and so on.
I want to analyze the file without the user seeing anything. But if he's offline, anything that tries to access the web may have undesired results such as launching phone dialers or showing error messages. If he's online, the file loads slowly because it fetches online content.
Also, there is no way to prevent alerts and prompts from appearing. Setting the Offline and Silent properties to True has no effect.

So my question is: can I analyze an HTML file without loading it into a WebBrowser control (or how can I avoid the above problems if I do use one)?
Of course, I don't mean a substring-search solution, such as looking for "<A HREF", then for the closing ">", etc.

Thanks
Question by:hveld
 

Accepted Solution

by:
AzraSound earned 200 total points
ID: 6988492
Try this:

- Set a reference to Microsoft HTML Object Library
- Load your HTML file into a string buffer using native VB file commands
- Use code similar to the following to generate an HTMLDocument object you can analyze



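' Requires a project reference to the Microsoft HTML Object Library (MSHTML).
Public Function GetHTMLDocument(ByVal HTMLCode As String) As HTMLDocument
    Dim htmlDoc As New HTMLDocument

    ' Parse the markup into an in-memory document instead of a WebBrowser control.
    htmlDoc.body.innerHTML = HTMLCode
    Set GetHTMLDocument = htmlDoc
End Function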
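
The second step above (loading the file into a string buffer with native VB file commands) could look roughly like the sketch below. ReadFileToString is a hypothetical helper name, not part of the original answer, and the sketch assumes the file is plain ANSI text that fits in memory.

```vb
' Sketch only: read a whole file into a string with native VB file I/O.
' ReadFileToString is a hypothetical helper name; assumes an ANSI text file.
Public Function ReadFileToString(ByVal FilePath As String) As String
    Dim fileNum As Integer
    Dim buffer  As String

    fileNum = FreeFile
    Open FilePath For Binary Access Read As #fileNum

    ' Size the buffer to the file length, then fill it with a single Get.
    buffer = Space$(LOF(fileNum))
    Get #fileNum, , buffer
    Close #fileNum

    ReadFileToString = buffer
End Function
```

The result can then be passed straight to the function above, for example Set doc = GetHTMLDocument(ReadFileToString(path)), where path is the local path of the HTML file to analyze.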

Expert Comment

by:Richie_Simonetti
ID: 6988543
hearing....

Author Comment

by:hveld
ID: 6989259
that's OK, thanks!
Works fine!
But this way the URL of the document being analyzed is about:blank, because htmlDoc is created from a string.
So I can't get absolute local paths to images or linked files if they are included in the HTML document with relative paths.
Of course, I could build the absolute local paths myself, since I have the path of the file being analyzed, but that means changing a lot of already tested code.
I tried to set
htmlDoc.location = "local_path_to_the_file"
but this opens the file in IE.
If the first image in the html file is Image1.jpg,
htmlDoc.images(0).src
returns
about:blankImage1.jpg
but I need the full local path to the image.
Can this be done?
If not, I'll accept your solution as the answer and rewrite my code.

Expert Comment

by:AzraSound
ID: 6989920
Perhaps you can alter the function to strip the about:blank statements for you, since we know this is a limitation of loading an html document in this way.


Public Function GetHTMLDocument(ByVal HTMLCode As String) As HTMLDocument
    Dim htmlDoc     As New HTMLDocument
    Dim htmlEle     As Object   ' late-bound: anchors and images are different element types

    htmlDoc.body.innerHTML = HTMLCode

    ' Strip the about:blank prefix from every link target.
    For Each htmlEle In htmlDoc.All.tags("A")
        htmlEle.href = Replace$(htmlEle.href, "about:blank", "")
    Next

    ' Same for image sources (.src on both sides, as corrected in the follow-up comment).
    For Each htmlEle In htmlDoc.All.tags("IMG")
        htmlEle.src = Replace$(htmlEle.src, "about:blank", "")
    Next

    Set GetHTMLDocument = htmlDoc
End Function

Expert Comment

by:AzraSound
ID: 6989921
Sorry, second For...Next loop should say .src on both sides of the equation.

Author Comment

by:hveld
ID: 6990244
OK, thanks!
However, that is not what I meant.
When I load a file in a WebBrowser control, for
webbrowser1.Document.Images(0).src
I get something like
file:///C:\Test\Img.jpg
Using your way, I get
about:blankImg.jpg
If I remove about:blank, I still get only the relative path to the image, but I need C:\Test\Img.jpg
Replacing about:blank with the HTML file's path will not work in all cases (for example, for images in a folder above the HTML file's folder).
What I need is to create htmlDoc from a file, not from a string, in order to get the same results I get from a file loaded in a WebBrowser control.

Expert Comment

by:AzraSound
ID: 6990668
Try it out, and let me know what happens.  Since image sources can be relative paths, I imagine they will stay relative even if you load the document from a file and remove the about:blank portion.  For example, an HTML file on my machine may say:

<img src="images/mypic.jpg">

So even if you load as a string, the about:blank will appear before that, but you should be able to reconstruct the full path to that image, if you need it, since you have the path to the actual html file itself, and now, the relative path to the image.  So in our example case, if the full path to the html file to open was:

C:\HTMLDocuments\MyWebpage\myPage.html

Then you know the image above must be located in:

C:\HTMLDocuments\MyWebpage\images\
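
The reconstruction described above can be sketched as a small string helper. ResolveLocalPath is a hypothetical name introduced only for illustration; the sketch assumes a local, drive-based path to the HTML file and a purely relative reference (no http:// URL, no URL-encoded characters such as %20), and it also walks ".." segments so that references to folders above the HTML file are handled.

```vb
' Sketch only: turn a relative src/href into an absolute local path.
' ResolveLocalPath is a hypothetical helper. It assumes HtmlFilePath is a
' local drive-based path and RelativeRef is a plain relative reference.
Public Function ResolveLocalPath(ByVal HtmlFilePath As String, _
                                 ByVal RelativeRef As String) As String
    Dim parts()  As String
    Dim part     As Variant
    Dim result   As String

    ' Drop the about:blank prefix and normalize the separators.
    RelativeRef = Replace$(RelativeRef, "about:blank", "")
    RelativeRef = Replace$(RelativeRef, "/", "\")

    ' Start from the folder that contains the HTML file (no trailing backslash).
    result = Left$(HtmlFilePath, InStrRev(HtmlFilePath, "\") - 1)

    parts = Split(RelativeRef, "\")
    For Each part In parts
        Select Case part
            Case "", "."
                ' Empty and current-folder segments change nothing.
            Case ".."
                ' Step up one folder, but never above the drive root.
                If InStrRev(result, "\") > 0 Then
                    result = Left$(result, InStrRev(result, "\") - 1)
                End If
            Case Else
                result = result & "\" & part
        End Select
    Next

    ResolveLocalPath = result
End Function
```

With the example above, ResolveLocalPath("C:\HTMLDocuments\MyWebpage\myPage.html", "about:blankimages/mypic.jpg") yields C:\HTMLDocuments\MyWebpage\images\mypic.jpg, and a reference such as "../Img.jpg" resolves into the parent folder.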

Be sure to apply the Replace function to every element that can carry a src or href attribute.  Some that come to mind are:

Frames
Javascript files
Stylesheet files