Solved

HTML source access

Posted on 2002-07-23
Last Modified: 2010-05-02
Hi,

I am trying to programmatically access the HTML source code for a given URL (the equivalent of performing a 'view source' operation in a browser). I have tried using the 'Microsoft Internet Explorer Library':

browser.Navigate2("[some url]")
source=browser.document.documentElement.outerHTML

However, this has not worked reliably: the HTML obtained is not consistently correct (i.e. it does not work for certain URLs). Any help on how to consistently obtain the source for a given URL would be appreciated.

Aseem
Question by:aseem_dayal
18 Comments
 
LVL 2

Expert Comment

by:priya_pbk
ID: 7171459
Add a reference to Microsoft Internet Controls by going to Project->Components and ticking Microsoft Internet Controls.

I tried it this way (two command buttons on the form):
Private Sub Command1_Click()
WebBrowser1.Navigate "http://www.experts-exchange.com"
End Sub

Private Sub Command2_Click()
MsgBox WebBrowser1.Document.documentElement.outerHtml
End Sub

-priya
 

Author Comment

by:aseem_dayal
ID: 7171498
Hi Priya,

Thanks for the input, but as I have already mentioned in my question, the 'browser.Document.documentElement.outerHTML' does not consistently give the correct URL source.
 
LVL 2

Expert Comment

by:priya_pbk
ID: 7171509
But weren't you referencing the 'Microsoft Internet Explorer Library'? Is that the same as the component reference to "Microsoft Internet Controls" (which I mentioned), where you have to place the WebBrowser control on your form manually?

And:
>>does not consistently give the correct URL source.
Why? What happens? I have used it many times. What does it show you?

-priya
 
LVL 16

Expert Comment

by:Richie_Simonetti
ID: 7171927
Couldn't that be:
MsgBox WebBrowser1.Document.documentElement.innerHtml
instead?
I have used that without problems.
You have to use it in the DocumentComplete event of the WebBrowser control.
 
LVL 75

Accepted Solution

by:
Anthony Perkins earned 100 total points
ID: 7171969
Try this:
Make a reference to MSXML (v2.6, v3 or v4)

Private Function GetHTML() As String
Dim httpObj As MSXML2.XMLHTTP

Set httpObj = New MSXML2.XMLHTTP
With httpObj
  .open "GET", "http://www.msn.com", False
  .send
  GetHTML = .responseText
End With
Set httpObj = Nothing

End Function

Note: If you are using a version prior to v2.6, then change the code as follows:
Dim httpObj As MSXML.XMLHTTPRequest

Set httpObj = New MSXML.XMLHTTPRequest

Anthony
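
A variation on the function above, offered only as a minimal sketch: it takes the URL as a parameter and checks the HTTP status before returning the body. The name GetHTMLFrom, the error number, and the choice to raise an error on any non-200 status are assumptions for illustration and are not part of the posted solution. It needs the same MSXML reference.

```vb
' Hypothetical helper: fetches the source of the given URL via MSXML2.XMLHTTP.
Private Function GetHTMLFrom(ByVal url As String) As String
    Dim httpObj As MSXML2.XMLHTTP

    Set httpObj = New MSXML2.XMLHTTP
    With httpObj
        ' Third argument False = synchronous: .send returns once the response has arrived
        .open "GET", url, False
        .send
        If .Status = 200 Then
            GetHTMLFrom = .responseText
        Else
            ' Assumption: treat anything other than 200 OK as a failure
            Err.Raise vbObjectError + 513, "GetHTMLFrom", _
                      "HTTP " & .Status & " " & .statusText
        End If
    End With
    Set httpObj = Nothing
End Function
```

It could be called as, for example, Debug.Print GetHTMLFrom("http://www.msn.com").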
 
LVL 28

Expert Comment

by:AzraSound
ID: 7172090
I'm with Richie: ensure the page is FULLY loaded before attempting to get its source. If it hasn't loaded yet, there is a chance its source hasn't been completely downloaded.
 
LVL 16

Expert Comment

by:Richie_Simonetti
ID: 7172094
Well, if you don't want to use the WebBrowser control, you could use the Inet control instead:

Function GrabHtml(url As String) As String
Dim s As String
s = Inet1.OpenURL(url, icString)
GrabHtml = s
End Function
 
LVL 4

Expert Comment

by:RichW
ID: 7172449
Here's how you can make sure the source is fully loaded.

WebBrowser1.Navigate "http://www.webpage.com"
Do While WebBrowser1.ReadyState < 4 '= READYSTATE_COMPLETE
   DoEvents
Loop
strText = WebBrowser1.Document.body.innertext
strHTML = WebBrowser1.Document.body.innerhtml

RichW
 
LVL 3

Expert Comment

by:Hornet241
ID: 7172472
I have had the same problem when trying to get at a logged-in internet banking page that displays my account info.

I think this may be similar.
 
LVL 3

Expert Comment

by:Hornet241
ID: 7172474
Sorry, I was trying it this way:

strHTML = WebBrowser1.Document.body.outerhtml
 
LVL 16

Expert Comment

by:Richie_Simonetti
ID: 7172527
To state my comment more clearly:
' WB1 is a WebBrowser control

Private Sub Form_Load()
    WB1.Navigate "www.somedomain.com/some/index.html"
End Sub

Private Sub WB1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    ' Only react when the top-level document (not a frame) has finished loading
    If (pDisp Is WB1.Object) Then
        Debug.Print WB1.Document.documentElement.innerHTML
    End If
End Sub
 
LVL 3

Expert Comment

by:Hornet241
ID: 7172599
I just got it working like this:

After the page has opened, I needed to get at the frames that the document was filled with.


Set parentObj = WebBrowser1.Document.parentWindow

For a = 0 To jlobj.frames.length - 1
 Debug.Print jlobj.frames(a).Document.body.outerhtml
Next a

 
LVL 3

Expert Comment

by:Hornet241
ID: 7172616
Watch the object names - should have been

Set parentObj = WebBrowser1.Document.parentWindow

For a = 0 To parentObj.frames.length - 1
    Debug.Print parentObj.frames(a).Document.body.outerhtml
Next a
 
LVL 16

Expert Comment

by:Richie_Simonetti
ID: 7172664
But we weren't talking about frames, or did I miss something?
 
LVL 3

Expert Comment

by:Hornet241
ID: 7172892
Frames are about the only reason that I can think of that would result in inconsistent operation.
 
LVL 16

Expert Comment

by:Richie_Simonetti
ID: 7172960
Sorry, I don't agree.
If the page has frames, documentElement.innerHTML would show only the HTML of the main document (the "frameset" markup).
 

Author Comment

by:aseem_dayal
ID: 7173476
Priya :

1. Yes, the 'Microsoft Internet Explorer Library' works the same as the WebBrowser control.

2. The inconsistency I encountered was when trying to obtain the source HTML of pages generated by an Exchange OWA server. In certain instances the HTML obtained is not correct; if you have access to OWA, the page generated in response to a mail reply is one example.

Richie Simonetti/AzraSound/RichW :

I have ensured that I access the HTML source only after the 'navigation completed' event occurs.

acperkins :

Will try your suggestion and get back.


Aseem

 

Author Comment

by:aseem_dayal
ID: 7173634
acperkins' solution works like a charm!

Not only does it provide the information faster than any of the other methods, it also works consistently across all URLs.

To everyone involved in this discussion, I would recommend using 'MSXML.XMLHTTP' as a de facto standard for obtaining the source of URLs.

Thanks for the contributions.

Aseem
