Solved

Extracting hyperlinks using HtmlAgilityPack

Posted on 2012-12-28
Last Modified: 2012-12-29
How can I use HtmlAgilityPack to get all the title hyperlinks inside the paging table on this web page (http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525)? For example, the first title hyperlink is "http://www.scie-socialcareonline.org.uk/profile.asp?guid=81199db9-4835-4df3-be46-603e44fc20b9".
Question by:mmalik15

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 38727851
I only lightly tested this, but it should work. At the very least, it should get you started.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ConsoleApplication41
{
    class Program
    {
        /// <summary>
        /// Need this to share the extracted page number from the Javascript to the PreRequest handler below
        /// </summary>
        private static string pageNumber;

        static void Main(string[] args)
        {
            string baseUrl = @"http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(baseUrl);   // Load the base page
            HtmlNodeCollection pages = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript:gotoPage')]");  // Find the Javascript function calls to change page
            List<int> visitedPages = new List<int>(pages.Count / 2);    // Since there are two places where a user can select pages, we make sure we don't get the same page twice by populating this
            Regex regPage = new Regex(@"\d+", RegexOptions.Compiled);   // For locating the page number in the javascript function call

            web.PreRequest = new HtmlWeb.PreRequestHandler(PreRequest); // Attach the handler for subsequent POST requests

            foreach (HtmlNode page in pages)
            {
                HtmlAttribute href = page.Attributes["href"];
                string javascript = href.Value;
                Match matchPage = regPage.Match(javascript);

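                if (matchPage.Success)  // Did we find a page number in the javascript function call?
                {
                    HtmlNodeCollection titleAnchors;
                    int pageIndex = int.Parse(matchPage.Value);

                    if (visitedPages.Contains(pageIndex))   // The pager appears twice on the page, so skip pages we already fetched
                    {
                        continue;
                    }

                    visitedPages.Add(pageIndex);
                    pageNumber = matchPage.Value;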
                    doc = web.Load(baseUrl, "POST");
                    titleAnchors = doc.DocumentNode.SelectNodes("//table//p[@class='list' and strong='Title: ']/a");

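                    if (titleAnchors == null)   // SelectNodes returns null (not an empty collection) when nothing matches
                    {
                        continue;
                    }

                    foreach (HtmlNode anchor in titleAnchors)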
                    {
                        href = anchor.Attributes["href"];
                        Console.WriteLine(href.Value);
                    }
                }
            }
        }

        /// <summary>
        /// Need this in order to set the body of POST request
        /// </summary>
        static bool PreRequest(HttpWebRequest request)
        {
            byte[] data = Encoding.ASCII.GetBytes("start=" + pageNumber);

            request.ContentLength = data.Length;
            request.ContentType = "application/x-www-form-urlencoded";

            using (Stream requestStream = request.GetRequestStream())   // Dispose the stream even if the write throws
            {
                requestStream.Write(data, 0, data.Length);
            }

            return true;
        }
    }
}


Also, I only quickly glanced over the Terms of Service for the site. I don't see anything prohibiting programmatic (i.e. non-browser) access to the site. Make sure you are not in violation of the site's TOS before you use the above.

Author Comment

by:mmalik15
ID: 38729009
Many thanks for the comment, kaufmed.

The only issue I'm facing now is getting the total number of pages. I'm trying this regex:
(?s)(?i)Page 1.*([0-9]{3}).*

It returns 103, which is correct.

How can we tweak this regex to also pick up numbers with three or more digits?

Expert Comment

by:käµfm³d 👽
ID: 38729284
What you have is close. I would ditch the dot-stars:

Page\s+1\s+of\s+([0-9]{3})
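
To illustrate why the dot-stars are risky, here is a minimal sketch. The sample text and the `DotStarDemo` and `Capture` names are made up for illustration and are not taken from the site. Because `.*` is greedy, the original pattern runs to the end of the input and then backtracks, so it captures the last run of three digits it can find rather than the number that follows "Page 1 of".

```csharp
using System;
using System.Text.RegularExpressions;

static class DotStarDemo
{
    // Hypothetical sample text; the real page markup will differ.
    public const string Sample = "Page 1 of 103 | Last updated 2012";

    // Returns the first capture group of the pattern, or null when there is no match.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Greedy dot-star: captures "012", the tail of "2012".
        Console.WriteLine(Capture(@"(?s)(?i)Page 1.*([0-9]{3}).*", Sample));

        // Anchored to the "Page 1 of" text: captures "103".
        Console.WriteLine(Capture(@"Page\s+1\s+of\s+([0-9]{3})", Sample));
    }
}
```

The original pattern happened to return the right value on the live page, but that would depend on no other three-digit run appearing later in the text.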


Author Comment

by:mmalik15
ID: 38729396
Thanks for the comment again, but what I'm asking is this: if in future we have values like

Page 1 of 1037 or

Page 1 of 19

then what should our regex be? Page\s+1\s+of\s+([0-9]{3}) will always return three digits.

Expert Comment

by:käµfm³d 👽
ID: 38729451
Ah, sorry. Use:

Page\s+1\s+of\s+([0-9]+)
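
As a rough sketch of how this pattern might be used from C# to read the total page count: the `PageCount` class, the `GetTotalPages` helper, and the sample strings are assumptions for illustration. In practice the input would be the text of the page you already loaded.

```csharp
using System;
using System.Text.RegularExpressions;

static class PageCount
{
    // [0-9]+ matches one or more digits, so 19, 103 and 1037 are all captured whole.
    private static readonly Regex TotalPages =
        new Regex(@"Page\s+1\s+of\s+([0-9]+)", RegexOptions.IgnoreCase);

    // Returns the total page count, or -1 when the "Page 1 of N" text is not found
    // or the number does not fit in an int.
    public static int GetTotalPages(string text)
    {
        Match m = TotalPages.Match(text);
        int total;

        if (m.Success && int.TryParse(m.Groups[1].Value, out total))
        {
            return total;
        }

        return -1;
    }

    static void Main()
    {
        Console.WriteLine(GetTotalPages("Page 1 of 19"));
        Console.WriteLine(GetTotalPages("Page 1 of 103"));
        Console.WriteLine(GetTotalPages("Page 1 of 1037"));
    }
}
```

`RegexOptions.IgnoreCase` stands in for the `(?i)` flag from the original attempt; drop it if the casing on the page is known to be fixed.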


Author Closing Comment

by:mmalik15
ID: 38729704
thanks