Solved

Extracting hyperlinks using HtmlAgilityPack

Posted on 2012-12-28
Last Modified: 2012-12-29
How can I get all the title hyperlinks that appear inside the paging table on this web page (http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525) using HtmlAgilityPack? For example, the first title hyperlink is "http://www.scie-socialcareonline.org.uk/profile.asp?guid=81199db9-4835-4df3-be46-603e44fc20b9".
Question by:mmalik15
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 38727851
I have only lightly tested this, but it should work. At the very least, it should get you started.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ConsoleApplication41
{
    class Program
    {
        /// <summary>
        /// Need this to share the extracted page number from the Javascript to the PreRequest handler below
        /// </summary>
        private static string pageNumber;

        static void Main(string[] args)
        {
            string baseUrl = @"http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(baseUrl);   // Load the base page
            HtmlNodeCollection pages = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript:gotoPage')]");  // Find the Javascript function calls to change page
            HashSet<string> visitedPages = new HashSet<string>();    // Since there are two places where a user can select pages, we make sure we don't get the same page twice by populating this
            Regex regPage = new Regex(@"\d+", RegexOptions.Compiled);   // For locating the page number in the javascript function call

            if (pages == null)  // SelectNodes returns null when nothing matches
            {
                return;
            }

            web.PreRequest = new HtmlWeb.PreRequestHandler(PreRequest); // Attach the handler for subsequent POST requests

            foreach (HtmlNode page in pages)
            {
                HtmlAttribute href = page.Attributes["href"];
                string javascript = href.Value;
                Match matchPage = regPage.Match(javascript);

                if (matchPage.Success && visitedPages.Add(matchPage.Value))  // Did we find a page number that we have not fetched yet?
                {
                    HtmlNodeCollection titleAnchors;

                    pageNumber = matchPage.Value;
                    doc = web.Load(baseUrl, "POST");
                    titleAnchors = doc.DocumentNode.SelectNodes("//table//p[@class='list' and strong='Title: ']/a");

                    if (titleAnchors == null)   // No title links found on this page
                    {
                        continue;
                    }

                    foreach (HtmlNode anchor in titleAnchors)
                    {
                        href = anchor.Attributes["href"];
                        Console.WriteLine(href.Value);
                    }
                }
            }
        }

        /// <summary>
        /// Need this in order to set the body of POST request
        /// </summary>
        static bool PreRequest(HttpWebRequest request)
        {
            byte[] data = Encoding.ASCII.GetBytes("start=" + pageNumber);

            request.ContentLength = data.Length;
            request.ContentType = "application/x-www-form-urlencoded";

            using (Stream requestStream = request.GetRequestStream())   // Dispose the stream even if Write throws
            {
                requestStream.Write(data, 0, data.Length);
            }

            return true;
        }
    }
}



Also, I only glanced over the site's Terms of Service. I did not see anything prohibiting programmatic (i.e. non-browser) access to the site, but make sure you are not in violation of the TOS before you use the above.
 

Author Comment

by:mmalik15
ID: 38729009
Many thanks for the comment, kaufmed.

The only issue I'm facing now is getting the total number of pages. I'm trying to use this regex:
(?s)(?i)Page 1.*([0-9]{3}).*


It returns 103, which is correct.

How can we tweak this regex so that it also picks up numbers with more than three digits?
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 38729284
What you have is close. I would ditch the dot-stars:

Page\s+1\s+of\s+([0-9]{3})
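To illustrate why the dot-stars are risky, here is a minimal sketch. The sample text is made up for the example (it is not taken from the site), and `DotStarSketch` and `FirstGroup` are hypothetical names. Because `.*` is greedy, the first pattern captures the last run of three digits after "Page 1", which only happens to be the page count when nothing numeric follows it.

```csharp
using System;
using System.Text.RegularExpressions;

static class DotStarSketch
{
    // Hypothetical page text: the pager is followed by another number.
    public const string Sample = "Page 1 of 103 | Showing records 1 to 10 of 1025";

    // Returns the first capture group of the match, or null when there is no match.
    public static string FirstGroup(string pattern, string text)
    {
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // The greedy .* runs to the end of the text and backtracks only as far
        // as needed, so the group holds the LAST three digits: 025
        Console.WriteLine(FirstGroup(@"(?s)(?i)Page 1.*([0-9]{3}).*", Sample));

        // Anchoring the digits to "Page 1 of" captures the pager value: 103
        Console.WriteLine(FirstGroup(@"Page\s+1\s+of\s+([0-9]{3})", Sample));
    }
}
```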


Author Comment

by:mmalik15
ID: 38729396
Thanks for the comment again, but what I'm asking is this: if in future we have values like the ones below,

Page 1 of 1037 or

Page 1 of 19

then what should our regex be? Page\s+1\s+of\s+([0-9]{3}) will always return exactly three digits.
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 38729451
Ah, sorry. Use:

Page\s+1\s+of\s+([0-9]+)
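As a rough sketch of how this pattern could be used to read the total page count, assuming the page text contains a "Page 1 of N" marker as discussed above. `PageCountSketch` and `Parse` are hypothetical names, and `RegexOptions.IgnoreCase` is carried over from the `(?i)` in the earlier attempt rather than something the pattern requires.

```csharp
using System;
using System.Text.RegularExpressions;

static class PageCountSketch
{
    // [0-9]+ accepts one or more digits, so 19, 103 and 1037 all match.
    private static readonly Regex TotalPages =
        new Regex(@"Page\s+1\s+of\s+([0-9]+)", RegexOptions.IgnoreCase);

    // Returns the total page count, or -1 when the text has no "Page 1 of N" marker.
    public static int Parse(string text)
    {
        Match m = TotalPages.Match(text);
        return m.Success ? int.Parse(m.Groups[1].Value) : -1;
    }

    static void Main()
    {
        Console.WriteLine(Parse("Page 1 of 19"));    // 19
        Console.WriteLine(Parse("Page 1 of 103"));   // 103
        Console.WriteLine(Parse("Page 1 of 1037"));  // 1037
        Console.WriteLine(Parse("no pager here"));   // -1
    }
}
```

In the program from the accepted solution, the input would be the text of the loaded document, for example `doc.DocumentNode.InnerText`.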

 

Author Closing Comment

by:mmalik15
ID: 38729704
thanks