GaryRasmussen asked:

Simulating javascript and cookies enabled when using the WebRequest object

I am using WebRequest in a Windows form to return (scrape) the HTML of a web page.

The page I am connecting to has a check (onload, I presume) to see whether javascript is enabled. If it cannot verify that javascript is enabled, it redirects you to a different page, so what I get back is the HTML of the "Your javascript is not enabled" page. If I visit the page using my javascript-enabled browser, it looks fine.

How do I simulate that javascript is enabled using the WebRequest object?

QPR

Don't quote me, but I can't think of a way round this short of forging headers, and even then the javascript check may be some sort of challenge:
"Perform this javascript function and respond to me"... no response, so you can't have javascript enabled. Redirect.

GaryRasmussen (ASKER)

They use a meta tag within a <noscript> tag, like this:
<noscript><meta http-equiv="Refresh" content="0; URL=https://login.live.com/jsDisabled.srf?lc=1033"/></noscript>

So any scripting the page tries to do will automatically redirect you if any script cannot run. There are so many web page scraping programs available; surely somebody has figured out a way around this type of obstacle?
Oh, and not to be rude, but please don't respond just to say you don't know, because that makes other people looking at the posts think somebody has already answered the question, so they don't bother opening the thread. I know that sounds really rude, but I have seen this happen on this website a lot.

Please understand that I hardly ever get a solution to my problems on this site, probably because I don't know the correct zones to post in, but I am desperate on this one and am running out of time.

Thanks,

QPR

Well then you don't understand Experts Exchange very well.
Most of us that come here do so to help. We have unlimited points so they are of little importance.
If a question is marked as "awaiting answer" then most experts will read regardless of whether anybody else has posted a potential solution.

You may or may not know that putting a redirect within a <noscript> tag is a common way to deter screen scraping and auto site downloading.
If you or another do find a way to bypass this I'd be happy to read about it. Not that I'm saying there isn't a way.

GaryRasmussen (ASKER)

Thanks QPR,

Ok, well it just seemed like when I or others would post a question and one person posted an incorrect answer, the question would go unanswered. I figured the experts must have assumed the question was answered or being worked on by somebody else, because there was never any input after the initial reply.

Yes, I know that the noscript tag is designed to do exactly what they are using it for. Whether they are using it because they truly do require javascript or to deter scraping is anyone's guess. Why all the hubbub about scraping? I mean, I have a login and username, so I have been given access to the site so that I can access my information. I know I wouldn't care if someone scraped their information from one of my websites as opposed to viewing it in a browser.

Does anybody know how to simulate having javascript enabled, for web pages that require javascript, when using a WebRequest? There are 3rd party screen scrapers that work. Anybody know how they do it?

Thanks!

tasky

Gary,

I really don't see the issue. WebRequest doesn't actually PARSE any HTML, so you won't be redirected. However, you may have to manually recreate what actions a browser takes, like loading an external Javascript and parsing out values to redirect yourself to another location. But, when the WebRequest makes the request, it does not tell the server whether Javascript is installed. You will have to manually follow the logic for that site (I recommend either IEWatch or Wireshark to log packets).

If you have any questions, I'll be happy to respond. I have coded a LOT of scrapers in my time.

GaryRasmussen (ASKER)

Suppose I use WebRequest to connect to a web page that has some javascript embedded in it, such as location.href="page.htm".

The WebRequest doesn't do anything with javascript (and it shouldn't), so the page will run whatever it has between its <noscript> tags, which is to redirect the WebRequest to a different page that is not the page I want to scrape.

<noscript>
<meta http-equiv="Refresh" content="0; URL=https://login.live.com/jsDisabled.srf?lc=1033"/>
</noscript>

I don't know how to explain the issue any clearer. Can you ask me whatever you need to know so I can explain it better? I really want to figure this one out and could really use your help. If you have coded a lot of scrapers, this is probably a no-brainer for you?

Thanks!

tasky

But that's the thing: the WebRequest doesn't execute ANYTHING. It just grabs the source. What you're going to have to do is parse out the location.href="" and manually grab that page next. But, if you are using a WebRequest object, it's important to know that it does not TOUCH the HTML at all. It merely grabs it.
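
As a rough sketch of that manual step, the snippet below pulls a literal location.href target out of downloaded source and resolves it against the page's own URL. FindScriptRedirect, the sample HTML, and the example.com address are made up for illustration, and the pattern only handles simple quoted assignments, not URLs assembled in script.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LocationHrefSketch
    ' Hypothetical helper (not part of .NET): returns the absolute URL assigned
    ' to location.href in the page source, or Nothing if none is found.
    Function FindScriptRedirect(html As String, pageUrl As Uri) As Uri
        ' Matches location.href = "..." or '...' (simple literal assignments only).
        Dim m As Match = Regex.Match(html, "location\.href\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing
        ' Resolve relative targets such as "page.htm" against the page's URL.
        Return New Uri(pageUrl, m.Groups(1).Value)
    End Function

    Sub Main()
        ' Sample source standing in for what WebRequest would download.
        Dim html As String = "<script>location.href=""page.htm""</script>"
        Dim target As Uri = FindScriptRedirect(html, New Uri("https://example.com/start/index.htm"))
        Console.WriteLine(target) ' https://example.com/start/page.htm
    End Sub
End Module
```

The URL it returns is what you would pass to the next WebRequest.Create call.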

GaryRasmussen (ASKER)

In order to stream the page content of the page I want back to my application, I have to connect or "grab" the page I want first.  When I "grab" the page, the page notices that javascript cannot be run so it redirects the request to some other page.

So I parse out the javascript from the page content of the page I am getting redirected to and I see what the url is that they tried to send me to.  What does that do for me?  If I go there,  I am still going to be redirected back to the "Your browser does not support javascript" page.

tasky

No. The page does not "notice" that Javascript can't be run. WebRequest does not execute anything. Ever. All WebRequest does is grab the source code of the page. The server does not know whether the client has Javascript or not; that is up to the browser to decide, and the HTML is the same either way. If you are getting redirected, it's because of another issue (no cookies, etc.), with the server redirecting you via the "Location: x" header.
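
One way to check for that kind of server-side redirect, sketched here on the assumption that you are on the classic HttpWebRequest API, is to turn off automatic redirects and look at the status code and Location header yourself. The URL is only the one from this thread; substitute whichever page you are requesting.

```vb
Imports System
Imports System.Net

Module RedirectCheckSketch
    Sub Main()
        ' Example URL taken from this thread; replace with the page you request.
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create("https://login.live.com/"), HttpWebRequest)
        ' Stop WebRequest from silently following 3xx responses.
        req.AllowAutoRedirect = False

        Using resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            Dim code As Integer = CInt(resp.StatusCode)
            Console.WriteLine("Status: " & code)
            If code >= 300 AndAlso code < 400 Then
                ' A server-side redirect; the target is in the Location header.
                Console.WriteLine("Location: " & resp.Headers("Location"))
            End If
        End Using
    End Sub
End Module
```

If this prints a 3xx status, the redirect is coming from the server's headers rather than from anything inside the <noscript> block.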
Now, if you are trying to pass the HTML from the WebRequest into a browser on a form, you're going to have a lot of issues. First of all, Cookies will not be shared. So any site that uses state-based logic (i.e. to log in) will be broken with this approach.
If you are NOT using a browser on the form, and are trying to make subsequent requests to the same site, preserving cookies is rather easy. You need to create an instance of a CookieContainer object and assign it to the WebRequest's CookieContainer. From there, it will preserve and update cookies automatically. Take the following code example:

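Imports System.Net

' One container shared by every request so cookies persist between them.
Dim jar As New CookieContainer()
Dim req As HttpWebRequest = DirectCast(WebRequest.Create("https://login.live.com/"), HttpWebRequest)
req.CookieContainer = jar
Dim resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
resp.Close() ' release the connection before making the next request

' Reuse the same jar so cookies set by the first response are sent here.
req = DirectCast(WebRequest.Create("https://login.live.com/doSubmit"), HttpWebRequest)
req.CookieContainer = jar
req.Method = "POST"
' etc etc
resp = DirectCast(req.GetResponse(), HttpWebResponse)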


ASKER CERTIFIED SOLUTION
GaryRasmussen

This solution is only available to members.