Solved

Unsupported Browser when trying to screen scrape C#

Posted on 2009-04-27
Last Modified: 2013-12-17
Hello experts!

I'm trying to scrape this page:
string orginalPull = ScreenScrape("http://www.postescanada.ca//cpotools/apps/track/personal/findByTrackNumber?trackingNumber=7146410000045107&trackingType=on&LOCALE=en");


Here is my method:

private static string ScreenScrape(string url) {
       WebRequest req = WebRequest.Create(url);
       StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());

       System.Text.StringBuilder sb = new System.Text.StringBuilder();
       string strLine;
       while ((strLine = stream.ReadLine()) != null) {
          if (strLine.Length > 0)
             sb.Append(strLine);
       }
       stream.Close();
       return sb.ToString();
}

When I run this, the site returns an error page instead of the content I'm after:

Unsupported Browser
It appears that you are viewing this page with an unsupported web browser. This website works best with one of the following supported browsers:

Any idea how I can get around this?

Ghost
Question by:copyPasteGhost
19 Comments
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 24243957
webRequestObject.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
 
LVL 9

Expert Comment

by:tillgeffken
ID: 24244007
Use HttpWebRequest and set the UserAgent property to something your target site likes.
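
For anyone reading along, here is a minimal sketch of what setting the property looks like. The class and method names (UserAgentExample, CreateBrowserLikeRequest) are made up for illustration and are not part of any library; only WebRequest.Create and HttpWebRequest.UserAgent are real framework members. Creating the request does not touch the network, so this is safe to try in isolation.

```csharp
using System;
using System.Net;

static class UserAgentExample
{
    // Hypothetical helper: builds a request whose User-Agent is set through
    // the HttpWebRequest.UserAgent property rather than via Headers.Add.
    public static HttpWebRequest CreateBrowserLikeRequest(string url, string userAgent)
    {
        // WebRequest.Create returns an HttpWebRequest for http/https URLs.
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        return request;
    }
}
```

Which user-agent string the target site accepts is something you have to find out by trying; the one passed in is entirely up to the caller.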
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24244016
when doing this:

WebRequest req = WebRequest.Create(url);
       req.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
       StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());

I get this error:

System.ArgumentException was unhandled by user code
  Message="This header must be modified using the appropriate property.\r\nParameter name: name"
  Source="System"
  ParamName="name"
  StackTrace:
       at System.Net.WebHeaderCollection.ThrowOnRestrictedHeader(String headerName)
       at System.Net.WebHeaderCollection.Add(String name, String value)
       at WiHCP.ScreenScrape(String url) in d:\Inetpub\wwwroot\WiBot\WiHCP.aspx.cs:line 65
       at WiHCP.btnScrap_Click(Object sender, EventArgs e) in d:\Inetpub\wwwroot\WiBot\WiHCP.aspx.cs:line 30
       at System.Web.UI.WebControls.Button.OnClick(EventArgs e)
       at System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument)
       at System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument)
       at System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument)
       at System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData)
       at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
  InnerException:
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24244059
tried this:

 HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
       req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
       //req.Headers.Add ("user-agent", "");
       StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());

       System.Text.StringBuilder sb = new System.Text.StringBuilder();
       string strLine;
       while ((strLine = stream.ReadLine()) != null) {
          if (strLine.Length > 0)
             sb.Append(strLine);
       }
       stream.Close();
       return sb.ToString();


I didn't get an error...but I didn't get the page either....
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 24244101
Is it a page that redirects? Is it a page we can hit from here to test?
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24244116
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 24244223
If you look, it's redirecting (through a cookie-check page). That's your problem.

Have you looked at what's being returned?

Cheers,

Greg
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24244247
this is what I get:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" ><html xmlns="http://www.w3.org/1999/xhtml">            <META HTTP-EQUIV="Refresh" CONTENT="0;URL=/cpotools/mc/content/error/cookieDisabled.jsf" /></html>

So I guess I need to feed it a cookie :)

Can I do that with the request object? Also, how do I know what to put in the cookie that I give to the page?
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 24244320
Yes, you can. The request object holds its cookies in a container (the CookieContainer property on HttpWebRequest).

You can also tell the request object to follow redirects (the AllowAutoRedirect property), which can be useful.

Cheers,

Greg
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24244395
OK, cool, but the question is still: how can I find out what to actually put in the cookie?
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 24244576
The page they redirect to probably sets a cookie, which will show up in the response's Cookies collection. You then pass it along in the CookieContainer of your next request.
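
A rough sketch of the idea, runnable without any network access: one CookieContainer acts as the cookie jar, and any request that is handed the same jar gets the stored cookies sent for a matching URL. The cookie name and value ("JSESSIONID", "abc123"), the example.com URLs, and the class name are assumptions made up for illustration; the real site decides what its cookie is called. The manual Add call only stands in for what the framework does itself when a response carries a Set-Cookie header.

```csharp
using System;
using System.Net;

static class SharedCookieJarExample
{
    // Returns the Cookie header a follow-up request would send.
    public static string Demo()
    {
        var jar = new CookieContainer();
        var site = new Uri("http://www.example.com/");

        // Stand-in for a Set-Cookie header received on a first response.
        // (Hypothetical cookie name and value.)
        jar.Add(site, new Cookie("JSESSIONID", "abc123", "/"));

        // A later request to the same site shares the jar...
        var next = (HttpWebRequest)WebRequest.Create(new Uri(site, "/track"));
        next.CookieContainer = jar;

        // ...so this is the header the jar supplies for that URL.
        return jar.GetCookieHeader(next.RequestUri);
    }
}
```

In other words, you normally do not need to know what goes into the cookie. You only need to keep one container alive across requests so that whatever the server sets is sent back.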
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24244647
ok cool I'm here:

private string ScreenScrape(string url) {
       HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
       req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
       //req.Headers.Add ("user-agent", "");
       req.AllowAutoRedirect = true;
       foreach (Cookie oCookie in Response.Cookies) {
          req.CookieContainer.Add(oCookie);
       }
       StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());

       System.Text.StringBuilder sb = new System.Text.StringBuilder();
       string strLine;
       while ((strLine = stream.ReadLine()) != null) {
          if (strLine.Length > 0)
             sb.Append(strLine);
       }
       stream.Close();
       return sb.ToString();
    }

It's still not working...what did I do wrong?
 
LVL 37

Assisted Solution

by:gregoryyoung
gregoryyoung earned 600 total points
ID: 24244911

    using System.IO;
    using System.Net;
    using System.Text;

    class Program
    {
        // One container shared by every request, so cookies set by an earlier
        // response are sent automatically with later requests.
        private static readonly CookieContainer Cookies = new CookieContainer();

        private static string ScreenScrape(string url)
        {
            var req = (HttpWebRequest)WebRequest.Create(url);
            req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
            req.AllowAutoRedirect = true;
            req.CookieContainer = Cookies;

            var sb = new StringBuilder();
            // Dispose the response and the reader even if reading throws.
            using (var response = req.GetResponse())
            using (var stream = new StreamReader(response.GetResponseStream()))
            {
                string strLine;
                while ((strLine = stream.ReadLine()) != null)
                {
                    if (strLine.Length > 0)
                        sb.Append(strLine);
                }
            }
            return sb.ToString();
        }


        static void Main(string[] args)
        {
            // First call: returns the meta-refresh page, but stores the cookie.
            string s =
                ScreenScrape(
                    "http://www.postescanada.ca//cpotools/apps/track/personal/findByTrackNumber?trackingNumber=7146410000045107&trackingType=on&LOCALE=en");
            // Second call: the stored cookie is sent along with the request.
            string s2 =
                ScreenScrape(
                    "http://www.postescanada.ca//cpotools/apps/track/personal/findByTrackNumber?trackingNumber=7146410000045107&trackingType=on&LOCALE=en");
        }
    }


The first call sets up the cookie: the response is only the meta-refresh page, but the cookie that comes with it is stored in the Cookies container. The second call then works properly, because the cookie is already in the container when the request is made.

Cheers,

Greg
 
LVL 9

Accepted Solution

by:
tillgeffken earned 1400 total points
ID: 24245120
I found this challenging and hacked something together that actually parses the meta refresh tag.

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
 
namespace ConsoleApplication2
{
	class Program
	{
		static void Main(string[] args)
		{
			string urlBase = "http://www.postescanada.ca";
			string urlPath = "/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=7146410000045107&trackingType=on&LOCALE=en";
			string result = ScreenScrape(urlBase, urlPath);
			Console.WriteLine(result);
			Console.ReadLine();
		}
 
		public static string ScreenScrape(string urlBase, string urlPath)
		{
			CookieContainer cookieContainer = new CookieContainer();
			HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(urlBase + urlPath);
			httpWebRequest.CookieContainer = cookieContainer;
			httpWebRequest.UserAgent = "Mozilla/6.0 (Windows; U; Windows NT 7.0; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.9 (.NET CLR 3.5.30729)";
			WebResponse webResponse = httpWebRequest.GetResponse();
			string result = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
			webResponse.Close();
 
			if (result.Contains("META HTTP-EQUIV=\"Refresh\""))
			{
				Regex metaregex = new Regex(@".?<meta http-equiv=""refresh"" content=""0;url=(?<url>[^""'<> ]+)""", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
				foreach (Match match in metaregex.Matches(result))
				{
					HttpWebRequest redirectHttpWebRequest = (HttpWebRequest)WebRequest.Create(urlBase + match.Groups["url"]);
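					redirectHttpWebRequest.CookieContainer = cookieContainer;
					// Send the same user agent as the first request, so the site's browser check is not triggered again.
					redirectHttpWebRequest.UserAgent = httpWebRequest.UserAgent;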
					webResponse = redirectHttpWebRequest.GetResponse();
					string redirectResponse = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
					webResponse.Close();
					return redirectResponse;
				}
 
			}
 
			return result;
 
		}
	}
}
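
As a side note, the meta refresh parsing can be pulled out into a small function that is easy to test on its own against the HTML posted earlier in the thread. This is only a sketch, not the code from the answer above: the names MetaRefresh and TryGetTarget are invented here, and the pattern is a slightly looser variant that also tolerates single quotes, extra whitespace and non-zero delays. It returns null when no refresh tag is found.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaRefresh
{
    // Matches e.g. <META HTTP-EQUIV="Refresh" CONTENT="0;URL=/some/path" />
    // and captures the target into the named group "url".
    private static readonly Regex Pattern = new Regex(
        @"<meta\s+http-equiv\s*=\s*[""']refresh[""']\s+content\s*=\s*[""']\s*\d+\s*;\s*url\s*=\s*(?<url>[^""'<>\s]+)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the absolute refresh target,
    // or null if the page contains no meta refresh tag.
    public static Uri TryGetTarget(Uri pageUri, string html)
    {
        Match match = Pattern.Match(html);
        if (!match.Success)
            return null;

        // Resolve relative targets such as "/cpotools/..." against the page URL.
        return new Uri(pageUri, match.Groups["url"].Value);
    }
}
```

Resolving through the Uri constructor rather than concatenating strings means absolute and relative refresh targets are both handled.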

 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24245141
Cool, I'll test it out and get back to you guys tomorrow.

Thanks,
Ghost
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 24245152
Both should work. I would go with the second code, though, since it bothers to handle any refresh; I was just showing you the concept.
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24253727
oops my
 
LVL 13

Author Closing Comment

by:copyPasteGhost
ID: 31575055
Thanks for your help with this matter.
 
LVL 13

Author Comment

by:copyPasteGhost
ID: 24467879
Hey guys....it turns out this is still not working...

I posted another question...here:

http://www.experts-exchange.com/Programming/Languages/.NET/Visual_CSharp/Q_24436235.html

Any help would be awesome!

Thanks!