Solved

extracting hyperlinks using htmlagilitypack

Posted on 2012-12-28
6
683 Views
Last Modified: 2012-12-29
how to get all the title hyperlinks on this web page (http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525) present inside the paging table e.g. The first hyperlink for the title is "http://www.scie-socialcareonline.org.uk/profile.asp?guid=81199db9-4835-4df3-be46-603e44fc20b9"

using htmlagilitypack
0
Comment
Question by:mmalik15
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 3
6 Comments
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 38727851
I only mildly tested this, but it should work. At the least, it should get you started.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ConsoleApplication41
{
    class Program
    {
        /// <summary>
        /// Need this to share the extracted page number from the Javascript to the PreRequest handler below
        /// </summary>
        private static string pageNumber;

        static void Main(string[] args)
        {
            string baseUrl = @"http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(baseUrl);   // Load the base page
            HtmlNodeCollection pages = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript:gotoPage')]");  // Find the Javascript function calls to change page
            List<int> visitedPages = new List<int>(pages.Count / 2);    // Since there are two places where a user can select pages, we make sure we don't get the same page twice by populating this
            Regex regPage = new Regex(@"\d+", RegexOptions.Compiled);   // For locating the page number in the javascript function call

            web.PreRequest = new HtmlWeb.PreRequestHandler(PreRequest); // Attach the handler for subsequent POST requests

            foreach (HtmlNode page in pages)
            {
                HtmlAttribute href = page.Attributes["href"];
                string javascript = href.Value;
                Match matchPage = regPage.Match(javascript);

                if (matchPage.Success)  // Did we find a page number in the javascript function call?
                {
                    HtmlNodeCollection titleAnchors;

                    pageNumber = matchPage.Value;
                    doc = web.Load(baseUrl, "POST");
                    titleAnchors = doc.DocumentNode.SelectNodes("//table//p[@class='list' and strong='Title: ']/a");

                    foreach (HtmlNode anchor in titleAnchors)
                    {
                        href = anchor.Attributes["href"];
                        Console.WriteLine(href.Value);
                    }
                }
            }
        }

        /// <summary>
        /// Need this in order to set the body of POST request
        /// </summary>
        static bool PreRequest(HttpWebRequest request)
        {
            Stream strReqeust;
            byte[] data = Encoding.ASCII.GetBytes("start=" + pageNumber);

            request.ContentLength = data.Length;
            request.ContentType = "application/x-www-form-urlencoded";
            strReqeust = request.GetRequestStream();
            strReqeust.Write(data, 0, data.Length);
            strReqeust.Close();

            return true;
        }
    }
}

Open in new window


Also, I only quickly glanced over the Terms of Service for the site. I don't see anything prohibiting programmatic (i.e. non-browser) access to the site. Make sure you are not in violation of the site's TOS before you use the above.
0
 

Author Comment

by:mmalik15
ID: 38729009
many thanks for the comment kaufmed.

The only issue i m facing now is to get the total number of pages. I m trying to use this regex
(?s)(?i)Page 1.*([0-9]{3}).*

Open in new window

 and its returning me 103 which is correct.

Can I ask how can we tweak this regex to pick 3 or more digit numbers as well?
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 38729284
Wht you have is close. I would ditch the dot-stars:

Page\s+1\s+of\s+([0-9]{3})

Open in new window

0
Learn by Doing. Anytime. Anywhere.

Do you like to learn by doing?
Our labs and exercises give you the chance to do just that: Learn by performing actions on real environments.

Hands-on, scenario-based labs give you experience on real environments provided by us so you don't have to worry about breaking anything.

 

Author Comment

by:mmalik15
ID: 38729396
thanks for the comment again but what I m asking is if in future we have values like below

Page 1 of 1037 or

Page 1 of 19

then what shall be our regex as this Page\s+1\s+of\s+([0-9]{3}) will always return three digits
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 38729451
Ah, sorry. Use:

Page\s+1\s+of\s+([0-9]+)

Open in new window

0
 

Author Closing Comment

by:mmalik15
ID: 38729704
thanks
0

Featured Post

Tutorials alone can't teach real engineering

So we built better training tools.

-Hands-on Labs
-Instructor Mentoring
-Scenario-Based Tests
-Dedicated Cloud Servers

All at your fingertips. What are you waiting for?

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Exception Handling is in the core of any application that is able to dignify its name. In this article, I'll guide you through the process of writing a DRY (Don't Repeat Yourself) Exception Handling mechanism, using Aspect Oriented Programming.
Calculating holidays and working days is a function that is often needed yet it is not one found within the Framework. This article presents one approach to building a working-day calculator for use in .NET.
Michael from AdRem Software explains how to view the most utilized and worst performing nodes in your network, by accessing the Top Charts view in NetCrunch network monitor (https://www.adremsoft.com/). Top Charts is a view in which you can set seve…
Monitoring a network: how to monitor network services and why? Michael Kulchisky, MCSE, MCSA, MCP, VTSP, VSP, CCSP outlines the philosophy behind service monitoring and why a handshake validation is critical in network monitoring. Software utilized …

688 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question