fischermx asked:

Not using axWebBrowser

Hi,

I have a small application that navigates through a given site using the axWebBrowser control. This works fine, and you can watch the software navigating through the whole site like a speedy slideshow.

                                axWebBrowser1.Navigate( (string)location );

But now, to increase performance, I am required to redo this without the visual part, i.e. removing the axWebBrowser component.
I recall reading somewhere that axWebBrowser is merely a wrapper that handles rendering for another class, which does all the HTML parsing work. But I can't find where I saw that.

So, basically, I need access to a class that returns the list of links for each given URL, so I can perform the navigation.

Thanks.
MaximKammerer (accepted solution):

[The text of this answer is only available to Experts Exchange members and is not reproduced here.]
fischermx (asker):

That HTMLDocumentClass is what I was looking for, thanks! I thought it had different functionality, but it is pretty much the same as axWebBrowser, just without the browsing.

Now, let me ask you: I already tried using WebRequest, but that puts all the parsing work on my side, right? I mean, I didn't find a way to get document elements from it... or am I missing something?

So the question is: is there a way to combine both WebRequest and the document class? Something like getting the WebResponse and handing it to the document class so it gets parsed without revisiting the page?

Yes, WebRequest only gets you the document. You can find an HTML parser written in C# at:

http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=2201&lngWId=10

Best regards,
Maxim
Maxim:

Thanks for your help.

The sample in the first link you sent me doesn't work at all. I followed up that thread on Google and it seems the guy never got it working.
I tested it, too. The event handler used there is never fired, and searching on Google again, the only reference to it I can find is that thread, so I would guess that's not the path to follow.


Now, this method also works, without a parser:

    // Download the raw HTML
    System.Net.WebRequest webRequest = WebRequest.Create(url);
    System.Net.WebResponse webResponse = webRequest.GetResponse();
    System.IO.StreamReader streamReader = new System.IO.StreamReader(
        webResponse.GetResponseStream(), System.Text.Encoding.Default);

    // Feed it to an in-memory HTML document instead of the browser control
    HTMLDocument hd = new HTMLDocument();
    IHTMLDocument2 ihd2 = (IHTMLDocument2)hd;
    ihd2.write(streamReader.ReadToEnd());
    ihd2.close();

But it has two problems:
1.- It does not resolve relative paths correctly, which may be solved by using the parser you pointed to.
2.- The response is a bit erratic with this method. Let me explain: if I point to http://www.google.com, I get the Google page for my country. This is the default behavior of the Google site the first time you visit; you then get a link "google in english", and once you click it you are never sent back to the local page. I have no idea how to control this or what causes it. Maybe some extra parameters?

Ad 1.) Perhaps the IHTMLDocument2 interface resolves relative links correctly if you also set the .URL property to the address you got the document from (otherwise it has no information about what the links are relative to).
Ad 2.) Google probably stores this kind of information in cookies. You could try something like this:

System.Net.CookieContainer jar = new System.Net.CookieContainer();

System.Net.WebRequest webRequest = WebRequest.Create(url);
webRequest.CookieContainer = jar;     // is null by default == cookie handling disabled

System.Net.WebResponse webResponse = webRequest.GetResponse();

// continue processing...

Best regards,
Maxim
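
On point 1, relative links can also be resolved by hand with System.Uri once the page URL is known, independently of the .URL property. This is only a minimal sketch: the class and method names (RelativeLinks, Resolve) are made up for illustration and are not part of the code in this thread.

```csharp
using System;

public static class RelativeLinks
{
    // Resolves an href against the URL the page was downloaded from.
    // An href that is already absolute is returned as-is (normalized).
    public static string Resolve(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);         // address the document came from
        Uri resolved = new Uri(baseUri, href);  // combine base and (relative) href
        return resolved.AbsoluteUri;
    }
}
```

For example, with a placeholder page address of "http://www.example.com/docs/index.html", Resolve turns "page2.html" into "http://www.example.com/docs/page2.html" and "/images/logo.gif" into "http://www.example.com/images/logo.gif".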
Thanks!

Now my code looks like this:

    CookieContainer jar = new CookieContainer();
    WebRequest webRequest = WebRequest.Create(url);
    // webRequest.CookieContainer = jar;     // webRequest does not contain CookieContainer
    WebResponse webResponse = webRequest.GetResponse();
    StreamReader streamReader = new StreamReader(webResponse.GetResponseStream(), System.Text.Encoding.Default);

    HTMLDocument hd = new HTMLDocument();
    IHTMLDocument2 ihd2 = (IHTMLDocument2)hd;

    // ((IHTMLDocument2)ihd2).write("<html></html>");
    ihd2.write(streamReader.ReadToEnd());
    ihd2.url = url;
    ihd2.close();

But I have two problems:
1.- Adding "ihd2.url = url" causes IE to open and go to the URL! :) Weird, isn't it?
I tried the write(html) call that you see commented out, because I saw it somewhere else, but the result is the same.
If I set ihd2.url = url first and then read from the stream, I get an object reference error.


2.- The cookie part had a little problem. It seems that WebRequest does not contain a CookieContainer member. I'm looking at the help file now, and that section does not say which class it belongs to.



Ad 2) - Sorry - my fault. The CookieContainer is a member of HttpWebRequest. So the code should look something like this:

WebRequest webRequest = WebRequest.Create(url);
((HttpWebRequest) webRequest).CookieContainer = jar;     // CookieContainer is a member of HttpWebRequest, hence the cast
WebResponse webResponse = webRequest.GetResponse();
 
Ad 1) - Yep, these classes are only wrappers for IE COM controls. Perhaps you really could use the html parser from the link above.

Best regards,
Matthias
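
If all that is needed is the list of links, the HTML returned by WebRequest can also be scanned directly, without the IE COM classes. The sketch below uses a regular expression as a crude stand-in for a real HTML parser; it is not the parser from the link above. It only handles quoted href attributes on <a> tags, and the names LinkExtractor and ExtractHrefs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches href="..." or href='...' inside an <a ...> tag.
    // Unquoted href values and other tags (e.g. <area>) are ignored.
    private static readonly Regex AnchorHref = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the raw href values in document order.
    // The values may be relative; they are not resolved here.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> hrefs = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            // Group 1 holds a double-quoted value, group 2 a single-quoted one.
            string value = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            hrefs.Add(value);
        }
        return hrefs;
    }
}
```

The string to pass in would be the result of streamReader.ReadToEnd() from the code earlier in the thread. A regular expression will miss links on pages with unusual markup, so a proper parser remains the safer choice for real sites.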