Screen scraping in C# returns a different page than the browser

I'm trying to screen scrape a page...

http://www.postescanada.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=7146410000045107&trackingType=on&LOCALE=en

this is the code I'm using:

private string ScreenScrape(string urlBase, string urlPath) {
       string result = "";
       try {
          CookieContainer cookieContainer = new CookieContainer();
          HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(urlBase + urlPath);
          httpWebRequest.CookieContainer = cookieContainer;
          httpWebRequest.UserAgent = "Mozilla/6.0 (Windows; U; Windows NT 7.0; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.9 (.NET CLR 3.5.30729)";
          WebResponse webResponse = httpWebRequest.GetResponse();
          result = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
          webResponse.Close();

          if (result.Contains("META HTTP-EQUIV=\"Refresh\"")) {
             Regex metaregex = new Regex(@".?<meta http-equiv=""refresh"" content=""0;url=(?<url>[^""'<> ]+)""", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
             foreach (Match match in metaregex.Matches(result)) {
                HttpWebRequest redirectHttpWebRequest = (HttpWebRequest)WebRequest.Create(urlBase + match.Groups["url"]);
                redirectHttpWebRequest.CookieContainer = cookieContainer;
                webResponse = redirectHttpWebRequest.GetResponse();
                string redirectResponse = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
                webResponse.Close();
                return redirectResponse;
             }
          }
       }
       catch { }

       return result;
}

Unfortunately, the page that is returned is not the same page that is returned when viewing it through a web browser. I'm trying to get the tracking information.

Any ideas?
copyPasteGhostAsked:

GiftsonDJohnCommented:
I just checked the page URL. First of all, it hits another page to check the cookie and then redirects to the target page, so the final URL is different from the one requested.

Instead of using HttpWebRequest and HttpWebResponse, try using WebClient.
copyPasteGhostAuthor Commented:
You wouldn't happen to have a sample of what would need to be changed, would you?
Bob LearnedCommented:
1) The WebClient is a simple wrapper for an HttpWebRequest (over-simplified in some cases).

2) It does not provide all that the HttpWebRequest does.

3) It sounds like you need to use an HTTP debugger, like Fiddler, to determine how the web browser communicates with the web page.
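
To make points 1 and 2 concrete: WebClient has no CookieContainer property, which matters on a site that checks a cookie before redirecting. A commonly used workaround is to subclass WebClient and attach a container in GetWebRequest. The sketch below is an illustration only. The class name CookieAwareWebClient is hypothetical and not from this thread, and whether it is enough for this particular site is untested.

```csharp
using System;
using System.Net;

// Hypothetical helper (not from the thread): a WebClient that keeps cookies
// across requests made through the same instance.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    // Exposed so callers can inspect which cookies the server set.
    public CookieContainer Cookies
    {
        get { return cookies; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry cookies; leave other schemes untouched.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Share one container so cookies set by an earlier response
            // are sent again on later requests.
            httpRequest.CookieContainer = cookies;
        }

        return request;
    }
}
```

Usage would be along the lines of `new CookieAwareWebClient().DownloadString(url)`, reusing the same instance so the cookies carry over between requests.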

copyPasteGhostAuthor Commented:
:) I appreciate you not giving me the answer directly and making me go look for it :) That is the best way to learn, I agree.

That being said :) I've never used Fiddler and I don't really know what to look for. You wouldn't happen to know of a walkthrough or tutorial on using Fiddler, or what I should look for?

It is quite urgent that I get this working as soon as possible... (which I'm sure you hear all the time :)

Thanks a lot, guys.
Bob LearnedCommented:
I attached a Fiddler tutorial.  There is a lot that you can do with Fiddler, and I have not discovered every subtle nuance.  Using an HTTP debugger is a good way to get an understanding of the kind of magic that Internet Explorer performs for you under the covers.


Fiddler-Tutorial.pdf
copyPasteGhostAuthor Commented:
Thanks bob.

I'll read it and get back to you with any questions I might have.
copyPasteGhostAuthor Commented:
I gotta be honest... I'm more confused after reading the documentation than before I started :)

There seem to be two pages being requested before all the GIFs and JS files are downloaded... I'll bet the answer is to be found in those two pages...

The problem is that I don't know what I'm looking for. Any hints? I'll attach the two screenshots.
fiddler1.JPG
fiddler2.JPG
Bob LearnedCommented:
In all that information, are there two URLs accessed for one web request?  If so, do they show in the attached screen shots?
copyPasteGhostAuthor Commented:
Bob LearnedCommented:
I don't know what you are trying to extract, but I added a reference to Microsoft.mshtml.dll to a test project, and created this:

        // Requires a reference to Microsoft.mshtml.dll, plus using System.IO and System.Net.
        public static mshtml.IHTMLElementCollection FindMetaEntries(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 30000; // milliseconds

            // Dispose the response and reader so the connection is released.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string pageText = reader.ReadToEnd();

                // Load the downloaded markup into an in-memory MSHTML document.
                mshtml.HTMLDocument document = new mshtml.HTMLDocument();
                ((mshtml.IHTMLDocument2)document).write(pageText);

                // Collect every <meta> element found on the page.
                mshtml.IHTMLElementCollection metaTagList = document.getElementsByTagName("meta");

                return metaTagList;
            }
        }

It looks right to me, except that I don't see a meta-refresh tag.
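
Whether a meta-refresh tag is present can also be checked without mshtml. Note that the `result.Contains("META HTTP-EQUIV=\"Refresh\"")` test in the original question is case-sensitive, so a lowercase tag would be skipped even though the regex that follows it ignores case. A minimal sketch of a single case-insensitive check follows. The names MetaRefresh and FindUrl are hypothetical, and the pattern assumes http-equiv appears before content inside the tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread).
public static class MetaRefresh
{
    // Matches <meta http-equiv="refresh" content="N;url=..."> in any letter case.
    // Assumption: the http-equiv attribute comes before the content attribute.
    private static readonly Regex Pattern = new Regex(
        @"<meta\s+http-equiv\s*=\s*[""']refresh[""']\s+content\s*=\s*[""']\s*\d+\s*;\s*url\s*=\s*(?<url>[^""'<> ]+)",
        RegexOptions.IgnoreCase);

    // Returns the redirect target, or null when the page has no meta-refresh tag.
    public static string FindUrl(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return null;
        }

        Match match = Pattern.Match(html);
        return match.Success ? match.Groups["url"].Value : null;
    }
}
```

A null result from `MetaRefresh.FindUrl(pageText)` would suggest the redirect happens some other way, for example through an HTTP 302 response, which is the kind of thing Fiddler shows.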
copyPasteGhostAuthor Commented:
I'm trying to get a section of the page content.

If you view the source of the previous link I posted, you will see this section:

<!-- RESULT DETAILS -->
            <div class="sectionBorder">
                  <div class="floatLeft">
                        <p><strong>Tracking Number</strong></p>
                        <p>7146410000045107</p>                                    
                        <p>Please note that this is the most up-to-date information available in our system. Our telephone agents have access to the same information presented here.      </p>
                        <h3>Track Status      </h3>
                        <!-- COD REMITTANCE AND ORIGINAL ITEM -->                        
                        <p>                        
                                    <strong>Product Type: </strong>Expedited Parcels<br />                              
                                    <strong>Service Standard Delivery Date       :      </strong>2009/04/01<br />
                                    <strong>Reference Number 1: </strong>25891<br />
                        </p>
                  </div><table id="tapListResultForm:j_id157" cellpadding="6" cellspacing="0" width="100%" class="floatLeft"><thead><tr><th class="noBorder">
                                    <b>Date      </b></th><th class="noBorder">
                                    <b>Time      </b></th><th class="noBorder">
                                    <b>Location</b></th><th class="noBorder">
                                    <b>Description</b></th><th class="noBorder">
                                    <b>Retail Location</b></th><th class="noBorder">
                                    <b>Signatory Name</b></th></tr></thead><tbody id="tapListResultForm:j_id157:tbody_element"><tr class="odd"><td>2009/04/01</td><td>06:15</td><td>ALMA</td><td>Item out for delivery</td><td>
                              <a target="_blank" href="/cpotools/apps/fpo/personal/viewPostOfficeDetails?postOfficeOutletNumber="></a></td><td><div id="tapListResultForm:j_id157:0:j_id177">
                                                <a onclick="trackDownload('signature')" href="/cpotools/servlet/ImageServlet?sigName=&amp;trackNum=7146410000045107&amp;sigDate=2009/04/01" target="_blank"></a></div></td></tr></tbody></table>
            </div>
            <div class="sectionBorder">
                  <h3>Track History      </h3><table id="tapListResultForm:j_id186" cellpadding="6" cellspacing="0" width="100%" class="floatLeft"><thead><tr><th class="noBorder">
                                                <b>Date      </b></th><th class="noBorder">
                                                <b>Time      </b></th><th class="noBorder">
                                                <b>Location</b></th><th class="noBorder">
                                                <b>Description</b></th><th class="noBorder">
                                                <b>Retail Location</b></th><th class="noBorder">
                                                <b>Signatory Name</b></th></tr></thead><tbody id="tapListResultForm:j_id186:tbody_element"><tr class="odd"><td>2009/04/01</td><td>06:15</td><td>ALMA</td><td>Item out for delivery</td><td>
                                          <a target="_blank" href="/cpotools/apps/fpo/personal/viewPostOfficeDetails?postOfficeOutletNumber="></a></td><td><div id="tapListResultForm:j_id186:0:j_id206">
                                                <a onclick="trackDownload('signature')" href="/cpotools/servlet/ImageServlet?sigName=&amp;trackNum=7146410000045107&amp;sigDate=2009/04/01" target="_blank"></a></div></td></tr><tr class="even"><td>2009/03/30</td><td>22:26</td><td>MONTREAL</td><td>Item processed at postal facility</td><td>
                                          <a target="_blank" href="/cpotools/apps/fpo/personal/viewPostOfficeDetails?postOfficeOutletNumber="></a></td><td><div id="tapListResultForm:j_id186:1:j_id206">
                                                <a onclick="trackDownload('signature')" href="/cpotools/servlet/ImageServlet?sigName=&amp;trackNum=7146410000045107&amp;sigDate=2009/03/30" target="_blank"></a></div></td></tr><tr class="odd"><td></td><td>15:01</td><td>STE THERESE DE BLAINVILLE</td><td>An order has been electronically submitted</td><td>
                                          <a target="_blank" href="/cpotools/apps/fpo/personal/viewPostOfficeDetails?postOfficeOutletNumber="></a></td><td><div id="tapListResultForm:j_id186:2:j_id206">
                                                <a onclick="trackDownload('signature')" href="/cpotools/servlet/ImageServlet?sigName=&amp;trackNum=7146410000045107&amp;sigDate=2009/03/30" target="_blank"></a></div></td></tr></tbody></table>
                        <p><div id="tapListResultForm:j_id216">
                                          <strong>Shipping Options and Features for this Item      </strong>
                                          <br /></div><div id="tapListResultForm:j_id218:0:j_id219">Do Not Safe Drop<br /></div>
                        </p>
            </div>
                                    
                                    <!-- OMNITURE FOR TRACKING NUMBER -->


It's really the data between the "<!-- RESULT DETAILS -->" and the "<!-- OMNITURE FOR TRACKING NUMBER -->" comments that I care about; the rest of the page is useless to me.

Thanks Bob.
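
Once the right page text has been downloaded, cutting out the part between those two comments is plain string handling. A minimal sketch, assuming both markers appear exactly as quoted above; the names PageSection and Between are hypothetical:

```csharp
using System;

// Hypothetical helper (not from the thread).
public static class PageSection
{
    // Returns the text between startMarker and the next endMarker after it,
    // or null when either marker is missing.
    public static string Between(string html, string startMarker, string endMarker)
    {
        if (html == null)
        {
            return null;
        }

        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += startMarker.Length; // skip past the opening marker itself

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        return html.Substring(start, end - start);
    }
}
```

For this page the call would be `PageSection.Between(pageText, "<!-- RESULT DETAILS -->", "<!-- OMNITURE FOR TRACKING NUMBER -->")`.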
copyPasteGhostAuthor Commented:
any ideas?
Bob LearnedCommented:
Oooh, plenty!!  Here is something that you can copy, paste, and extend:

            CanadaPostParser.TrackInfo trackingInfo = CanadaPostParser.GetTrackingInfo("7146410000045107");

I had plenty of time on my hands today, because the chiefs put the brakes on my current project.
using System;
using System.Collections.Generic;
using System.Net;
using System.IO;
using System.Text;
using System.Runtime.InteropServices;
using System.Text.RegularExpressions;
 
public class CanadaPostParser
{
 
    private const string POST_URL = "http://www.postescanada.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber={0}&trackingType=on&LOCALE=en";
 
    public static TrackInfo GetTrackingInfo(string trackingNumber)
    {
        return ParseWebText(GetWebPageText(trackingNumber));
    }
 
    private static string GetWebPageText(string trackingNumber)
    {
        string url = string.Format(POST_URL, trackingNumber);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000;
        //request.UserAgent = "Mozilla/6.0 (Windows; U; Windows NT 7.0; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.9 (.NET CLR 3.5.30729)";
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)";
        request.CookieContainer = HtmlCookies.GetCookieContainer(url);
 
        // Dispose the response and reader so the connection is released.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
 
    private static TrackInfo ParseWebText(string pageText)
    {
        mshtml.HTMLDocument document = new mshtml.HTMLDocument();
        ((mshtml.IHTMLDocument2)document).write(pageText);
 
        mshtml.HTMLDivElement resultsDiv = document.getElementById("results") as mshtml.HTMLDivElement;
 
        if (resultsDiv != null)
        {
            return FindTrackInfo(resultsDiv);
        }
 
        return null;
    }
 
    private static TrackInfo FindTrackInfo(mshtml.HTMLDivElement resultsDiv)
    {
        List<string> headerList = new List<string>();
 
        mshtml.IHTMLElementCollection phraseList = (mshtml.IHTMLElementCollection)resultsDiv.getElementsByTagName("STRONG");
        foreach (mshtml.HTMLPhraseElement phrase in phraseList)
        {
            headerList.Add(phrase.innerText);
        }
 
        List<string> list = new List<string>();
 
        mshtml.IHTMLElementCollection paraList = (mshtml.IHTMLElementCollection)resultsDiv.getElementsByTagName("p");
        foreach (mshtml.HTMLParaElement para in paraList)
        {
            if (!string.IsNullOrEmpty(para.innerText) && !headerList.Contains(para.innerText))
            {
                list.Add(para.innerText);
            }
        }
 
        TrackInfo info = new TrackInfo();
 
        info.TrackingNumber = list[0];
 
        StringBuilder sb = new StringBuilder(list[2]);
 
        foreach (string header in headerList)
        {
            sb.Replace(header, "");
        }
 
        string[] dimensionList = sb.ToString().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
 
        info.ProductType = dimensionList[0];
        info.DeliveryDate = dimensionList[1];
        info.ReferenceNumber = dimensionList[2];
 
        return info;
    }
 
    // Source:
    // http://www.rendelmann.info/blog/CommentView.aspx?guid=bd99bcd5-7088-4d46-801e-c0fe622dc2e5
 
    public class HtmlCookies
    {
 
        [DllImport("wininet.dll", CharSet = CharSet.Auto, SetLastError = true)]
        private static extern bool InternetGetCookie(string lpszUrlName,
            string lpszCookieName, StringBuilder lpszCookieData,
            [MarshalAs(UnmanagedType.U4)] ref int lpdwSize);
 
        private static string RetrieveIECookiesForUrl(string url)
        {
            StringBuilder cookieHeader = new StringBuilder(new string(' ', 256), 256);
 
            int datasize = cookieHeader.Length;
 
            if (!(InternetGetCookie(url, null, cookieHeader, ref datasize)))
            {
                if (datasize < 0)
                    return string.Empty;
 
                cookieHeader = new StringBuilder(datasize);
 
                InternetGetCookie(url, null, cookieHeader, ref datasize);
 
            }
 
            return cookieHeader.ToString();
        }
 
        public static CookieContainer GetCookieContainer(string url)
        {
 
            CookieContainer container = new CookieContainer();
            Uri uri = new Uri(url);
 
            // CookieContainer.SetCookies expects a string with cookie key+value pairs separated
            // by commas, and not semi-colons.
            string cookieHeaders = RetrieveIECookiesForUrl(uri.AbsoluteUri).Replace(";", ",");
            if (cookieHeaders.Length > 0)
                container.SetCookies(uri, cookieHeaders);
 
            return container;
 
        }
 
    }
 
    public class TrackInfo
    {
        public string TrackingNumber;
        public string ProductType;
        public string DeliveryDate;
        public string ReferenceNumber;
        public List<StatusInfo> HistoryList = new List<StatusInfo>();
    }
 
    public class StatusInfo
    {
        public string DateTime;
        public string Location;
        public string Description;
        public string RetailLocation;
        public string SignatoryName;
    }
 
}
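
The HistoryList above is declared but never filled. One possible way to extend it is sketched here with a regex instead of mshtml, assuming the rows look like the "Track History" markup quoted earlier in the thread (`<tr class="odd">` or `<tr class="even">` followed by plain `<td>` cells). The names HistoryRow and TrackHistoryParser are hypothetical, and only the first four columns are read.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical record type (not from the thread); mirrors part of StatusInfo.
public class HistoryRow
{
    public string Date;
    public string Time;
    public string Location;
    public string Description;
}

// Hypothetical helper (not from the thread).
public static class TrackHistoryParser
{
    // Captures the first four <td> cells of each odd/even table row.
    private static readonly Regex RowPattern = new Regex(
        @"<tr class=""(?:odd|even)"">\s*<td>(?<date>[^<]*)</td>\s*<td>(?<time>[^<]*)</td>\s*" +
        @"<td>(?<location>[^<]*)</td>\s*<td>(?<description>[^<]*)</td>",
        RegexOptions.IgnoreCase);

    public static List<HistoryRow> Parse(string html)
    {
        List<HistoryRow> rows = new List<HistoryRow>();
        if (string.IsNullOrEmpty(html))
        {
            return rows;
        }

        // Start at the "Track History" heading so the "Track Status" table,
        // which repeats the latest row, is not counted twice.
        int start = html.IndexOf("Track History", StringComparison.OrdinalIgnoreCase);
        if (start < 0)
        {
            return rows;
        }

        foreach (Match match in RowPattern.Matches(html, start))
        {
            HistoryRow row = new HistoryRow();
            row.Date = match.Groups["date"].Value.Trim();
            row.Time = match.Groups["time"].Value.Trim();
            row.Location = match.Groups["location"].Value.Trim();
            row.Description = match.Groups["description"].Value.Trim();
            rows.Add(row);
        }

        return rows;
    }
}
```

Each HistoryRow could then be copied into a StatusInfo and added to TrackInfo.HistoryList. A regex like this is brittle against markup changes, so the mshtml approach above may hold up better if the page layout shifts.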

copyPasteGhostAuthor Commented:
You rock!