aseem_dayal

HTML source access

Hi,

I am trying to programmatically access the HTML source code for a given URL (the equivalent of performing a 'view source' operation in any browser). I have tried using the 'Microsoft Internet Explorer Library':

browser.Navigate2("[some url]")
source=browser.document.documentElement.outerHTML

however without success, as the information obtained is not consistently correct (i.e. it does not work in the case of certain URLs). All help on how to consistently obtain the source for a given URL would be appreciated.

Aseem
priya_pbk

Add a reference to Microsoft Internet Controls by going to Project->Components and ticking Microsoft Internet Controls.

I tried it this way (2 command buttons on the form):
Private Sub Command1_Click()
WebBrowser1.Navigate "https://www.experts-exchange.com"
End Sub

Private Sub Command2_Click()
MsgBox WebBrowser1.Document.documentElement.outerHtml
End Sub

-priya
aseem_dayal

ASKER

Hi Priya,

Thanks for the input, but as I already mentioned in my question, 'browser.Document.documentElement.outerHTML' does not consistently give the correct URL source.
But weren't you referencing the 'Microsoft Internet Explorer Library'? Is that the same as the component reference to "Microsoft Internet Controls" (which I mentioned), where you have to place the WebBrowser control on your form manually?

and
>>does not consistently give the correct URL source.
Why? What happens? I have used it many times. What does it show you?

-priya
Richie_Simonetti
Shouldn't that be:
MsgBox WebBrowser1.Document.documentElement.innerHTML
instead?
I have used that without problems.
You have to call it in the DocumentComplete event of the WebBrowser control.
ASKER CERTIFIED SOLUTION
Anthony Perkins
I'm with Richie: ensure the page is FULLY loaded before attempting to get its source. If it hasn't loaded yet, there is a chance its source hasn't been completely downloaded.
Well, if you don't want to use the WebBrowser control, you could use the Inet control instead:

Function GrabHtml(url As String) As String
    ' Inet1 is an Inet control placed on the form
    Dim s As String
    s = Inet1.OpenURL(url, icString)
    GrabHtml = s
End Function
Here's how you can make sure the source is fully loaded.

WebBrowser1.Navigate "http://www.webpage.com"
Do While WebBrowser1.ReadyState < 4 '= READYSTATE_COMPLETE
   DoEvents
Loop
strText = WebBrowser1.Document.body.innertext
strHTML = WebBrowser1.Document.body.innerhtml

RichW
I have had the same problem when trying to get at a logged-in internet banking page that displays my account info.

I think this may be similar.
Sorry, I was trying it this way:

strHTML = WebBrowser1.Document.body.outerhtml
To state my comment more clearly:
' wb1 is a WebBrowser control

Private Sub Form_Load()
WB1.Navigate "www.somedomain.com/some/index.html"

End Sub

Private Sub WB1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
If (pDisp Is WB1.Object) Then
     debug.print wb1.document.documentelement.innerhtml
   
End If
End Sub
I just got it working like this:

After the page had opened, I needed to get at the frames that the document was made up of.


Set parentObj = WebBrowser1.Document.parentWindow

For a = 0 To jlobj.frames.length - 1
 Debug.Print jlobj.frames(a).Document.body.outerhtml
Next a

Watch the object names. It should have been:

Set parentObj = WebBrowser1.Document.parentWindow

For a = 0 To parentObj.frames.length - 1
    Debug.Print parentObj.frames(a).Document.body.outerhtml
Next a
But we weren't talking about frames, or did I miss something?
Frames are about the only reason that I can think of that would result in inconsistent operation.
Sorry, not to me.
If the page has frames, documentElement.innerHTML would show the HTML contents of the main document only (that bunch of "frameset" things).
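To make the frames point concrete, here is a rough VB6 sketch that prints the main document and then walks into each frame, using the same parentWindow.frames access shown earlier in the thread. DumpFrames is a hypothetical helper name, not something from the posts above, and the error handling assumes that frames served from a different domain refuse access to their Document.

```vb
' Hypothetical helper (not from the thread): prints the HTML of a document
' and, recursively, of every frame it contains.
Private Sub DumpFrames(ByVal doc As Object)
    Dim i As Long
    Dim frameDoc As Object

    ' For a frameset page this is only the <frameset> markup itself
    Debug.Print doc.documentElement.outerHTML

    For i = 0 To doc.parentWindow.frames.length - 1
        Set frameDoc = Nothing
        ' Assumption: a frame from another domain raises an error here
        On Error Resume Next
        Set frameDoc = doc.parentWindow.frames(i).Document
        On Error GoTo 0
        If Not frameDoc Is Nothing Then DumpFrames frameDoc
    Next i
End Sub
```

It would be called as DumpFrames WB1.Document from the DocumentComplete event, once pDisp Is WB1.Object confirms the top-level document has finished loading.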
Priya :

1. Yes, the 'Microsoft Internet Explorer Library' works the same as the WebBrowser control.

2. The inconsistency I encountered was when trying to obtain source HTML from pages generated by an Exchange OWA server. In certain instances (in case you have access to OWA), the page generated in response to a mail reply does not produce the correct HTML.

Richie Simonetti/AzraSound/RichW :

I have ensured that I access the HTML source only after the 'navigation completed' event occurs.

acperkins :

Will try your suggestion and get back.


Aseem

acperkins' solution works like a charm!

Not only does it provide the information faster than any of the other methods, it works consistently across all URLs.

To everyone involved in this discussion, I would recommend using 'MSXML.XMLHTTP' as the de facto standard for obtaining the source of URLs.

Thanks for the contributions.

Aseem
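
For readers who cannot see the accepted answer, the following is only a guess at what an 'MSXML.XMLHTTP' approach looks like in VB6, not acperkins' actual code. FetchSource is a hypothetical helper name, and the "MSXML2.XMLHTTP" ProgID assumes MSXML is installed on the machine.

```vb
' Minimal sketch, not the accepted solution: fetch a page's HTML with MSXML.
' Late binding via CreateObject means no project reference is required.
Public Function FetchSource(ByVal url As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")

    ' Third argument False = synchronous: send blocks until the response arrives
    http.Open "GET", url, False
    http.send

    If http.Status = 200 Then
        FetchSource = http.responseText
    Else
        ' Surface HTTP failures instead of returning an error page silently
        Err.Raise vbObjectError + 513, "FetchSource", _
                  "HTTP " & http.Status & " " & http.statusText
    End If
End Function
```

Usage would be something like Debug.Print FetchSource("https://www.experts-exchange.com"). Note that this returns the HTML as the server sent it, without running any scripts, so it can differ from what documentElement.outerHTML reports for the same page in the WebBrowser control.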