YohanShminge

Inet.OpenURL only returns part of the page!!!

This is probably a simple question, but it's late and I'm too tired to deal with it.  (Plus I've got points to burn.)

I'm trying to retrieve a page using the Inet control, however, it does not return the entire page, only a fraction of it.  For example:

page = Inet1.OpenURL("http://www.msn.com/")

This returns only:

<html><head><base href="http://g.msn.com/0US!s5.31472_315529/" /> ....other stuff.... <a href="73.a5539/2??cm=LeftNav8">Tec

And that's precisely where it ends.  I've tried using Winsock to do this, and I've had the most success with it, but for some reason it's tacking random strings of three or four characters onto the beginning of its data chunks.  But anyway, I'm rambling...

Thanks for your time, guys!

ASKER CERTIFIED SOLUTION
zzzzzooc

This solution is only available to members.
SOLUTION
learning_t0_pr0gram

sorry, change:

Private Sub Winsock1_DataArrival()

to

Private Sub Winsock1_DataArrival(ByVal bytesTotal As Long)
Hmm, actually, MSN doesn't seem to like connections from Winsock. If you're not trying to fetch MSN, it should work.
SOLUTION
msg = "GET http://www.msn.com HTTP/1.0" + vbCrLf   <--- this is your problem; just make it "GET / HTTP/1.0" & vbCrLf & vbCrLf
If memory serves me right, that part's correct. CrLf separates each field in the header, and an additional CrLf denotes the end of the header. So, with that said, the lines below should be the problem:

>>msg = msg + "Host: www.msn.com" + vbcrlf
>>msg = msg + vbcrlf + vbcrlf

You'll end up with 3 CrLfs (instead of 2), which the server probably won't accept. Also, after reviewing the differences between HTTP 1.0 and 1.1: if the Inet control implements 1.0 (the older version), it may have issues keeping a persistent connection across requests.

RFC for HTTP/1.1 if you decide to go the Winsock way:
ftp://ftp.isi.edu/in-notes/rfc2616.txt
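
For reference, the header framing described above can be sketched as follows. This is a minimal illustration in Python rather than VB6, and `build_get_request` is a hypothetical helper name, not something from this thread. It assumes the usual rule that every header line ends with one CRLF and that a single empty line (one more CRLF) ends the header block, so the request finishes with exactly two CRLFs.

```python
CRLF = "\r\n"

def build_get_request(host, path="/", version="1.0"):
    """Return a minimal HTTP GET request as bytes (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/{version}",
        f"Host: {host}",          # bare host name, without an "http://" scheme
        "Connection: close",
    ]
    # Join the header lines with CRLF, terminate the last line, then add
    # the empty line that marks the end of the header block.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")

request = build_get_request("www.msn.com")
```

The bytes in `request` are what would be passed to the socket's send call; whether a given server tolerates an extra trailing CRLF is not something this sketch settles.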
zzzzzooc, you need 3 at the end. I've made many, many programs with Winsock.
And Brian, doing GET http://...the site is the same as doing GET / HTTP/1.0
Oh, Brian, I was thinking of using proxies to connect. My mistake, I am sorry  :(
YohanShminge

ASKER

Thank you all for your time!  I have extensive experience with winsock, but in this particular instance something strange is happening.

What I am actually trying to do is connect to Experts Exchange and retrieve questions, much like QuickPost.  What is strange is that when the Winsock1_DataArrival event fires, sometimes the snippets of HTML come with some sort of little header, like "EC4" or "FD8" or "SC5". I really don't see any pattern other than that there are always three characters, and I don't think the header is always there.

So, instead of using Winsock, this time I decided to go with the Inet control because I thought it would be simpler.  However, that does not appear to be the case.  Unless someone can explain why I would have this 3 character header, I think I'll go with URLDownloadToFile and see how that performs.

FYI, when using Winsock, you usually have to provide the full request header in order for the server to return a reply, and you definitely need the two vbCrLfs after the header.  If you'd like to reproduce my scenario with EE, here is my code (requires a WebBrowser control and a Winsock control, with default names):

Dim page As String
Dim ret As String

Private Sub Form_Load()
Winsock1.Close
Winsock1.Connect "www.experts-exchange.com", 80
ret = Chr(13) + Chr(10)
End Sub

Private Sub Form_Resize()
WebBrowser1.Width = Me.Width - 125
WebBrowser1.Height = Me.Height - 525
End Sub

Private Sub Winsock1_Close()
On Error Resume Next
Winsock1.Close
Open "c:\tempurl.html" For Binary As 1
    Put 1, 1, page
Close #1
WebBrowser1.Navigate ("c:\tempurl.html")
End Sub

Private Sub Winsock1_Connect()
info = "GET /Security/Win_Security/Q_20942129.html HTTP/1.1" + ret + _
    "Accept: */*" + ret + _
    "Accept-Language: en-us" + ret + _
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" + ret + _
    "Host: www.experts-exchange.com" + ret + _
    "Connection: Close" + ret + _
    "Cache-Control: no-cache" + ret + ret
Winsock1.SendData info
End Sub

Private Sub Winsock1_DataArrival(ByVal bytesTotal As Long)
Dim info As String
Winsock1.GetData info
Debug.Print info
page = page + info
End Sub
zzzzzooc, your solution worked fine!  Thanks to everyone who participated!  I am still wondering why EE sends me those characters, though...
Yohan, what characters? I looked at your code and I don't get any such characters...
Just random things that you might not notice, always at the start of the data received, e.g. "EC4", "FD8", "SC5". I just re-ran the code above and the random characters pop up above the page editor box, above the list of TAs, and above the first post by CrazyOne.
I don't notice anything..



Option Explicit

Private sPage As String
Private Sub Form_Load()
    Winsock1.Close
    Winsock1.Connect "www.experts-exchange.com", 80
End Sub
Private Sub Winsock1_Close()
    Dim iPos As Integer
    iPos = InStr(1, sPage, vbCrLf & vbCrLf)
    If iPos > 0 Then
        Open "c:\temp.html" For Output As 1
            Print #1, Mid(sPage, iPos + Len(vbCrLf & vbCrLf))
        Close 1
        WebBrowser1.Navigate "file://c:\temp.html"
    End If
    Winsock1.Close
End Sub
Private Sub Winsock1_Connect()
    Dim sSend As String
    sSend = "GET /Security/Win_Security/Q_20942129.html HTTP/1.1" & vbCrLf
    sSend = sSend & "Host: www.experts-exchange.com" & vbCrLf
    sSend = sSend & "Connection: Close" & vbCrLf
    Winsock1.SendData sSend & vbCrLf
End Sub
Private Sub Winsock1_DataArrival(ByVal bytesTotal As Long)
    Dim sBuff As String
    Winsock1.GetData sBuff
    sPage = sPage & sBuff
End Sub
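
The Winsock1_Close handler above strips the response header by looking for the first blank line. The same step is sketched below in Python rather than VB6, with one extra check that may be useful when stray strings show up in the body: whether the server declared `Transfer-Encoding: chunked`. The names `split_response` and `is_chunked` and the sample response are hypothetical and only serve the illustration.

```python
def split_response(raw):
    """Split a raw HTTP response (bytes) into (header_lines, body)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line found; the response header is incomplete")
    header_lines = head.decode("iso-8859-1").split("\r\n")
    return header_lines, body

def is_chunked(header_lines):
    """True if a Transfer-Encoding header mentions 'chunked'."""
    for line in header_lines[1:]:          # skip the status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "transfer-encoding" and "chunked" in value.lower():
            return True
    return False

# Made-up sample response, not captured from any real server.
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Transfer-Encoding: chunked\r\n"
          b"\r\n"
          b"5\r\nhello\r\n0\r\n\r\n")
headers, body = split_response(sample)
```

If `is_chunked(headers)` is true, the body is not yet the page itself: it still contains the chunk framing and would need decoding before being written to disk.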
Using that exact code you just posted, this is the file that is generated for me:

http://zealgames.tripod.com/temp.html

I noticed at least two problems, at the very top there's "1C2F" and then, right before the search box, there's "9CC" ...
I don't get those results. URLDownloadToFile doesn't return the same characters?
Nope, URLDownloadToFile works perfectly.  Do you think something is wrong with my Winsock control?  It's never done this before.
If there was something interfering with winsock, it'd affect both URLDownloadToFile and the Winsock control.

Did you use my method of opening the file for Output instead of Binary? I recall some characters being converted incorrectly when writing with Put.
OK, this is very interesting.  I thought I might try the code with my firewall turned off (NPF 2004), since it can sometimes mess up pages with its ad removal and popup blocking features, and lo and behold, no more characters!  Don't ask me why winsock would be any different from URLDownloadToFile, but I guess it is!

I have one last question: do you think it would be faster to use Winsock or URLDownloadToFile?
Wow, I just tried the Inet control again without the firewall running and it retrieved the entire page... Strange.  I'll have to take a closer look at NPF's settings.  So now I have three options:

Inet, Winsock, or the API?

What do you think?
I'd go with URLDownloadToFile, as it automatically retrieves the file and saves it to disk without the hassle of doing it yourself. The Inet control hangs a lot in my experience (while attempting to Cancel, because of timeout durations, or for other reasons), and the Winsock control is a lot of overhead since you'll need multiple procedures to connect, get data, save to disk and check for errors.
I agree!  Thanks for everything!
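This comes down to a difference between HTTP 1.0 and 1.1.
A 1.1 server may send the body using chunked transfer encoding, where each chunk is preceded by its length in hexadecimal; 1.0 responses are not chunked. Those length lines (not checksums) are most likely the extra characters.

Send "GET / HTTP/1.0", not "GET / HTTP/1.1".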

Janis
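
If the stray strings really are chunk-size lines, an alternative to falling back to HTTP/1.0 is to decode the chunked body after the header has been stripped. Below is a minimal sketch in Python rather than VB6; `decode_chunked` is a hypothetical helper, it assumes the whole body has already been received, and it ignores any trailer headers after the last chunk. Read as hexadecimal, the "1C2F" and "9CC" reported earlier would be lengths of 7215 and 2508 bytes.

```python
def decode_chunked(body):
    """Decode an HTTP/1.1 chunked message body (bytes) into plain bytes."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal; anything after ';' is a chunk extension.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break                      # last chunk; trailers, if any, are ignored
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2         # skip the chunk data and its trailing CRLF
    return bytes(out)

# Made-up chunked body with two chunks of 4 and 5 bytes.
chunked = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
page = decode_chunked(chunked)
```

A truncated body makes `body.index` raise `ValueError`, which is a reasonable signal that the connection closed before the final zero-length chunk arrived.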