Solved

Simulating javascript and cookies enabled when using the WebRequest object

Posted on 2008-10-01
13
3,322 Views
Last Modified: 2013-12-17
I am using WebRequest in a Windows Forms application to return (scrape) the HTML of a web page.

The page I am using WebRequest to connect to has a check (onload I presume) to see if javascript is enabled and if it cannot verify that you have javascript enabled, then it will redirect you to a different page thus returning the HTML of the "Your javascript is not enabled" page.  If I visit the page using my javascript enabled browser, then it looks fine.

How do I simulate that javascript is enabled using the WebRequest object?
Question by:GaryRasmussen
13 Comments
 
LVL 29

Expert Comment

by:QPR
ID: 22621688
Don't quote me, but I can't think of a way round this short of forging headers, and even then the javascript check may be some sort of challenge.
Perform this javascript function and respond to me.... no response, you can't have javascript enabled. Redirect.
0
 
LVL 1

Author Comment

by:GaryRasmussen
ID: 22621929
They use a meta tag within a <noscript> tag like ...
<meta http-equiv="Refresh" content="0; URL=https://login.live.com/jsDisabled.srf?lc=1033"/></noscript>

So if a script cannot run, the page automatically redirects you.  There are so many web page scraping programs available; surely somebody has figured out a way around this type of obstacle?
0
 
LVL 1

Author Comment

by:GaryRasmussen
ID: 22621948
Oh, and not to be rude, but please don't respond to say you don't know, because that just makes other people looking at the posts think somebody has already answered the question, so they don't bother opening the thread.  I know that sounds really rude, but I have seen this happen on this website a lot.

Please understand that I hardly ever get a solution to my problems on this site, probably because I don't know the correct zones to post them to, but I am desperate on this one and am running out of time.

Thanks,
0
 
LVL 29

Expert Comment

by:QPR
ID: 22622019
Well then you don't understand Experts Exchange very well.
Most of us that come here do so to help. We have unlimited points so they are of little importance.
If a question is marked as "awaiting answer" then most experts will read regardless of whether anybody else has posted a potential solution.

You may or may not know that putting a redirect within a <noscript> tag is a common way to deter screen scraping and auto site downloading.
If you or another do find a way to bypass this I'd be happy to read about it. Not that I'm saying there isn't a way.
0
 
LVL 1

Author Comment

by:GaryRasmussen
ID: 22625686
Thanks QPR,

Ok, well, it just seemed like when I or others would post a question and one person would post an incorrect answer, the question would just go unanswered. I figured that the experts must have assumed the question was answered or being worked on by somebody else, because they never had any input after the initial reply.

Yes, I know that the noscript tag is designed to do exactly what they are using it for.  Whether they are using it because they truly do require javascript or to deter scraping is anyone's guess.  Why all the hubbub about scraping?  I mean, I have a login and username, so I have been given access to the site so that I can access my information.  I know I wouldn't care if someone scraped their information from one of my websites as opposed to viewing it in a browser.

Does anybody know how to simulate having javascript for web pages that require javascript when using a WebRequest?  There are 3rd party screen scrapers that work.  Anybody know how they do it?

Thanks!
0
 
LVL 3

Expert Comment

by:tasky
ID: 22626594
Gary,

I really don't see the issue. WebRequest doesn't actually PARSE any HTML, so you won't be redirected. However, you may have to manually recreate what actions a browser takes, like loading an external Javascript and parsing out values to redirect yourself to another location. But, when the WebRequest makes the request, it does not tell the server whether Javascript is installed. You will have to manually follow the logic for that site (I recommend either IEWatch or Wireshark to log packets).

If you have any questions, I'll be happy to respond. I have coded a LOT of scrapers in my time.
0
 
LVL 1

Author Comment

by:GaryRasmussen
ID: 22626884
If you use WebRequest to create a connection to a web page that has some javascript embedded in it like maybe, location.href="page.htm"

The WebRequest doesn't do anything with javascript (and it shouldn't), so the page will run whatever it has between its <noscript> tags, which is to redirect the WebRequest to a different page that is not the page I want to scrape.

<noscript>
<meta http-equiv="Refresh" content="0; URL=https://login.live.com/jsDisabled.srf?lc=1033"/>
</noscript>

I don't know how to explain the issue any clearer.  Can you ask me whatever it is you need to help me explain the issue better?   I really want to figure this one out and could really use your help, especially if you have coded a lot of scrapers; this is probably a no-brainer for you?

Thanks!

0
 
LVL 3

Expert Comment

by:tasky
ID: 22626939
But that's the thing, the WebRequest doesn't execute ANYTHING. It just grabs the source. What you're going to have to do is parse out the location.href="" and manually grab that page next. But, if you are using a WebRequest object, it's important to know that it does not TOUCH the HTML at all. It merely grabs it.
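
To illustrate that the markup is only text to WebRequest, here is a minimal C# sketch that finds the meta refresh target in a downloaded HTML string. The class name MetaRefresh and the method FindTarget are made up for this example, and the regular expression assumes the attribute order shown earlier in this thread (http-equiv first, then content); it is not a general HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaRefresh
{
    // Hypothetical helper: matches <meta http-equiv="Refresh" content="N; URL=...">
    // and captures the URL part. Assumes http-equiv comes before content.
    static readonly Regex Pattern = new Regex(
        @"<meta\s+http-equiv\s*=\s*[""']refresh[""']\s+content\s*=\s*[""']\s*\d+\s*;\s*url\s*=\s*([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the refresh target, or null when the markup contains no such tag.
    public static string FindTarget(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    static void Main()
    {
        // The tag quoted in this thread, as it would appear in the downloaded source.
        string html = "<noscript>\n<meta http-equiv=\"Refresh\" content=\"0; URL=https://login.live.com/jsDisabled.srf?lc=1033\"/>\n</noscript>";
        Console.WriteLine(FindTarget(html));
    }
}
```

Nothing here follows the URL. The string returned by WebRequest still contains the original page, and the refresh tag is just one more piece of text inside it that the scraper can skip or act on.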
0
 
LVL 1

Author Comment

by:GaryRasmussen
ID: 22627059
In order to stream the page content of the page I want back to my application, I have to connect or "grab" the page I want first.  When I "grab" the page, the page notices that javascript cannot be run so it redirects the request to some other page.

So I parse out the javascript from the page content of the page I am getting redirected to and I see what the url is that they tried to send me to.  What does that do for me?  If I go there,  I am still going to be redirected back to the "Your browser does not support javascript" page.

0
 
LVL 3

Expert Comment

by:tasky
ID: 22627100
No. The page does not "notice" that Javascript can't be run.  WebRequest does not execute anything. Ever. All WebRequest does is grab the source code to the page. The server does not know if the client has Javascript or not. That is up to the browser to decide. However, the HTML is the same nonetheless. If you are getting redirected, it's because of another issue (no cookies, etc) from the server using the "Location: x" header.
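
One way to check whether the redirect really comes from the server is to turn off automatic redirect handling and look at the response yourself. The following is a rough C# sketch under that assumption; RedirectProbe, Build and GetServerRedirect are names invented for this example, and the URL is passed on the command line.

```csharp
using System;
using System.Net;

static class RedirectProbe
{
    // Builds a request that hands 3xx responses back to the caller
    // instead of silently following them.
    public static HttpWebRequest Build(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.AllowAutoRedirect = false;
        return req;
    }

    // Sends the request and returns the Location header,
    // or null when the server did not answer with a 3xx status.
    public static string GetServerRedirect(string url)
    {
        HttpWebRequest req = Build(url);
        using (var resp = (HttpWebResponse)req.GetResponse())
        {
            int status = (int)resp.StatusCode;
            return status >= 300 && status < 400 ? resp.Headers["Location"] : null;
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: RedirectProbe <url>");
            return;
        }
        Console.WriteLine(GetServerRedirect(args[0]) ?? "(no server-side redirect)");
    }
}
```

If this prints a URL, the server is redirecting with a Location header (for example because a cookie is missing), and the <noscript> tag is not the cause.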
0
 
LVL 3

Expert Comment

by:tasky
ID: 22627126
Now, if you are trying to pass the HTML from the WebRequest into a browser on a form, you're going to have a lot of issues. First of all, Cookies will not be shared. So any site that uses state-based logic (i.e. to log in) will be broken with this approach.
0
 
LVL 3

Expert Comment

by:tasky
ID: 22627181
If you are NOT using a browser on the form, and are trying to make subsequent requests to the same site, preserving cookies is rather easy. You need to create an instance of a CookieContainer object and assign it to the WebRequest's CookieContainer. From there, it will preserve and update cookies automatically. Take the following code example:

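' Requires: Imports System.Net

' One CookieContainer shared by every request keeps the session cookies.
Dim jar As New CookieContainer()

Dim req As HttpWebRequest = CType(WebRequest.Create("https://login.live.com/"), HttpWebRequest)
req.CookieContainer = jar
Dim resp As HttpWebResponse = CType(req.GetResponse(), HttpWebResponse)
resp.Close() ' release the connection before the next request

' Second request: reuse the same jar so the cookies set above are sent back.
req = CType(WebRequest.Create("https://login.live.com/doSubmit"), HttpWebRequest)
req.CookieContainer = jar
req.Method = "POST"
' etc etc
resp = CType(req.GetResponse(), HttpWebResponse)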

0
 
LVL 1

Accepted Solution

by:
GaryRasmussen earned 0 total points
ID: 22627263
hmm, well, maybe I am going about this the wrong way?

I am not trying to pass the html into a browser.  I just wanted to save the page content to a string so I could parse the string to find a particular value with the following:

// Requires: using System.IO; using System.Net;
WebRequest myReq = WebRequest.Create(url);
using (WebResponse webResponse = myReq.GetResponse())
using (StreamReader webResponseStream = new StreamReader(webResponse.GetResponseStream()))
{
    // Read the whole response body into a string for parsing.
    return webResponseStream.ReadToEnd();
}

But when I do, the page detects that javascript is not enabled with this ...

<noscript>
<meta http-equiv="Refresh" content="0; URL=https://login.live.com/jsDisabled.srf?lc=1033"/>
</noscript>

And my string contains the html from https://login.live.com/jsDisabled.srf?lc=1033

This code works exactly as expected on web pages that don't redirect with a noscript condition.
0
