Solved

Extracting hyperlinks using HtmlAgilityPack

Posted on 2012-12-28
Last Modified: 2012-12-29
How can I get all the title hyperlinks on this web page (http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525) that appear inside the paging table, using HtmlAgilityPack? For example, the first title hyperlink is "http://www.scie-socialcareonline.org.uk/profile.asp?guid=81199db9-4835-4df3-be46-603e44fc20b9".
Question by:mmalik15
Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 38727851
I only mildly tested this, but it should work. At the least, it should get you started.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ConsoleApplication41
{
    class Program
    {
        /// <summary>
        /// Need this to share the extracted page number from the Javascript to the PreRequest handler below
        /// </summary>
        private static string pageNumber;

        static void Main(string[] args)
        {
            string baseUrl = @"http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(baseUrl);   // Load the base page
            HtmlNodeCollection pages = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript:gotoPage')]");  // Find the Javascript function calls to change page
            if (pages == null)  // SelectNodes returns null when nothing matches
            {
                return;
            }

            List<int> visitedPages = new List<int>(pages.Count / 2);    // Since there are two places where a user can select pages, we make sure we don't get the same page twice by populating this
            Regex regPage = new Regex(@"\d+", RegexOptions.Compiled);   // For locating the page number in the javascript function call

            web.PreRequest = new HtmlWeb.PreRequestHandler(PreRequest); // Attach the handler for subsequent POST requests

            foreach (HtmlNode page in pages)
            {
                HtmlAttribute href = page.Attributes["href"];
                string javascript = href.Value;
                Match matchPage = regPage.Match(javascript);

                if (matchPage.Success)  // Did we find a page number in the javascript function call?
                {
                    int number = int.Parse(matchPage.Value);

                    if (visitedPages.Contains(number))  // Already fetched via the other set of page links
                    {
                        continue;
                    }

                    visitedPages.Add(number);
                    pageNumber = matchPage.Value;
                    doc = web.Load(baseUrl, "POST");
                    HtmlNodeCollection titleAnchors = doc.DocumentNode.SelectNodes("//table//p[@class='list' and strong='Title: ']/a");

                    if (titleAnchors == null)   // No title links found on this page
                    {
                        continue;
                    }

                    foreach (HtmlNode anchor in titleAnchors)
                    {
                        href = anchor.Attributes["href"];
                        Console.WriteLine(href.Value);
                    }
                }
            }
        }

        /// <summary>
        /// Need this in order to set the body of POST request
        /// </summary>
        static bool PreRequest(HttpWebRequest request)
        {
            byte[] data = Encoding.ASCII.GetBytes("start=" + pageNumber);

            request.ContentLength = data.Length;
            request.ContentType = "application/x-www-form-urlencoded";

            // Write the form data as the request body; the using block closes the stream
            using (Stream requestStream = request.GetRequestStream())
            {
                requestStream.Write(data, 0, data.Length);
            }

            return true;
        }
    }
}

Also, I only quickly glanced over the Terms of Service for the site. I don't see anything prohibiting programmatic (i.e. non-browser) access to the site. Make sure you are not in violation of the site's TOS before you use the above.
Author Comment

by:mmalik15
ID: 38729009
Many thanks for the comment, kaufmed.

The only issue I'm facing now is getting the total number of pages. I'm trying this regex:
(?s)(?i)Page 1.*([0-9]{3}).*

It returns 103, which is correct.

How can we tweak this regex so that it also picks up numbers with three or more digits?
Expert Comment

by:käµfm³d 👽
ID: 38729284
What you have is close. I would ditch the dot-stars:

Page\s+1\s+of\s+([0-9]{3})
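
To illustrate why dropping the dot-stars matters, here is a minimal sketch. The sample text is made up for illustration (it is not taken from the site); it simply places another number after the page count, which is enough to throw off the greedy pattern.

```csharp
using System;
using System.Text.RegularExpressions;

class DotStarDemo
{
    static void Main()
    {
        // Hypothetical page text: the trailing year is an assumption for illustration
        string text = "Page 1 of 103 - last updated 2012";

        // Greedy .* runs to the end of the text and then backtracks,
        // so the group captures the LAST three digits it can find
        Match greedy = Regex.Match(text, @"(?s)(?i)Page 1.*([0-9]{3}).*");

        // Requiring the digits directly after "of" pins the match to the page count
        Match anchored = Regex.Match(text, @"Page\s+1\s+of\s+([0-9]{3})");

        Console.WriteLine(greedy.Groups[1].Value);    // 012
        Console.WriteLine(anchored.Groups[1].Value);  // 103
    }
}
```

The original pattern returned 103 only because nothing later in the matched text ended in three digits; the anchored version does not depend on that.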

Author Comment

by:mmalik15
ID: 38729396
Thanks for the comment again, but what I'm asking is this: what if in future we have values like the ones below?

Page 1 of 1037 or

Page 1 of 19

What should our regex be then, since Page\s+1\s+of\s+([0-9]{3}) will always return three digits?
Expert Comment

by:käµfm³d 👽
ID: 38729451
Ah, sorry. Use:

Page\s+1\s+of\s+([0-9]+)
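
As a rough sketch of how this pattern could be used to read the total as a number: GetTotalPages is a hypothetical helper name, and returning -1 when nothing matches is an assumption, not something from the code above. The sample strings are the ones mentioned in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

class TotalPagesDemo
{
    // Hypothetical helper: returns the total page count,
    // or -1 when the text contains no "Page 1 of N"
    static int GetTotalPages(string text)
    {
        // [0-9]+ accepts one or more digits, so 19, 103 and 1037 all match
        Match match = Regex.Match(text, @"Page\s+1\s+of\s+([0-9]+)");
        return match.Success ? int.Parse(match.Groups[1].Value) : -1;
    }

    static void Main()
    {
        Console.WriteLine(GetTotalPages("Page 1 of 1037"));  // 1037
        Console.WriteLine(GetTotalPages("Page 1 of 19"));    // 19
        Console.WriteLine(GetTotalPages("Page 1 of 103"));   // 103
    }
}
```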

Author Closing Comment

by:mmalik15
ID: 38729704
thanks