extracting hyperlinks using htmlagilitypack

Posted on 2012-12-28
Last Modified: 2012-12-29
How can I use HtmlAgilityPack to get all the title hyperlinks inside the paging table on this web page (http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525)? For example, the first title hyperlink is "http://www.scie-socialcareonline.org.uk/profile.asp?guid=81199db9-4835-4df3-be46-603e44fc20b9".
Question by:mmalik15

Accepted Solution

by:
käµfm³d   👽 earned 2000 total points
ID: 38727851
I have only lightly tested this, but it should work. At the very least, it should get you started.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ConsoleApplication41
{
    class Program
    {
        /// <summary>
        /// Need this to share the extracted page number from the Javascript to the PreRequest handler below
        /// </summary>
        private static string pageNumber;

        static void Main(string[] args)
        {
            string baseUrl = @"http://www.scie-socialcareonline.org.uk/topic.asp?guid=3aca5bbd-bc85-11d4-ba18-009027f63525";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(baseUrl);   // Load the base page
            HtmlNodeCollection pages = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript:gotoPage')]");  // Find the Javascript function calls to change page
            if (pages == null)  // SelectNodes returns null when no pager links are found
            {
                return;
            }

            List<int> visitedPages = new List<int>(pages.Count / 2);    // Since there are two places where a user can select pages, we make sure we don't get the same page twice by populating this
            Regex regPage = new Regex(@"\d+", RegexOptions.Compiled);   // For locating the page number in the javascript function call

            web.PreRequest = new HtmlWeb.PreRequestHandler(PreRequest); // Attach the handler for subsequent POST requests

            foreach (HtmlNode page in pages)
            {
                HtmlAttribute href = page.Attributes["href"];
                string javascript = href.Value;
                Match matchPage = regPage.Match(javascript);

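                if (matchPage.Success)  // Did we find a page number in the javascript function call?
                {
                    HtmlNodeCollection titleAnchors;
                    int number = int.Parse(matchPage.Value);

                    if (visitedPages.Contains(number))  // Skip pages already fetched via the duplicate pager
                    {
                        continue;
                    }

                    visitedPages.Add(number);
                    pageNumber = matchPage.Value;
                    doc = web.Load(baseUrl, "POST");
                    titleAnchors = doc.DocumentNode.SelectNodes("//table//p[@class='list' and strong='Title: ']/a");

                    if (titleAnchors == null)   // SelectNodes returns null when nothing matches
                    {
                        continue;
                    }

                    foreach (HtmlNode anchor in titleAnchors)
                    {
                        href = anchor.Attributes["href"];
                        Console.WriteLine(href.Value);
                    }
                }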
            }
        }

        /// <summary>
        /// Need this in order to set the body of POST request
        /// </summary>
        static bool PreRequest(HttpWebRequest request)
        {
            byte[] data = Encoding.ASCII.GetBytes("start=" + pageNumber);

            request.ContentLength = data.Length;
            request.ContentType = "application/x-www-form-urlencoded";

            using (Stream requestStream = request.GetRequestStream())   // Dispose the stream even if Write throws
            {
                requestStream.Write(data, 0, data.Length);
            }

            return true;
        }
    }
}


Also, I only quickly glanced over the Terms of Service for the site. I don't see anything prohibiting programmatic (i.e. non-browser) access to the site. Make sure you are not in violation of the site's TOS before you use the above.

Author Comment

by:mmalik15
ID: 38729009
Many thanks for the comment, kaufmed.

The only issue I'm facing now is getting the total number of pages. I'm trying to use this regex:
(?s)(?i)Page 1.*([0-9]{3}).*

It returns 103, which is correct.

How can we tweak this regex so that it also picks up numbers with three or more digits?

Expert Comment

by:käµfm³d 👽
ID: 38729284
What you have is close. I would ditch the dot-stars:

Page\s+1\s+of\s+([0-9]{3})
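
To illustrate why dropping the dot-stars helps, here is a minimal sketch. The sample text and the names GreedyDemo and FirstGroup are made up for illustration; the real page may behave differently. With a greedy `.*` in front of the group, the engine runs to the end of the input and backtracks, so the group ends up holding the last run of three digits rather than the one after "of".

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns capture group 1 of the first match, or null when nothing matches.
    public static string FirstGroup(string pattern, string input)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical page text: the pager is followed by another number.
        string text = "Page 1 of 103 | Next | Copyright 2012";

        // The greedy dot-star consumes everything, then backtracks to the LAST three digits.
        Console.WriteLine(FirstGroup(@"(?s)(?i)Page 1.*([0-9]{3}).*", text));   // prints "012"

        // Anchoring on the literal words captures the digits right after "of".
        Console.WriteLine(FirstGroup(@"Page\s+1\s+of\s+([0-9]{3})", text));     // prints "103"
    }
}
```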

Author Comment

by:mmalik15
ID: 38729396
Thanks for the comment again, but what I'm asking is this: if in future we have values like the ones below

Page 1 of 1037 or

Page 1 of 19

then what should our regex be? Page\s+1\s+of\s+([0-9]{3}) will always return three digits.

Expert Comment

by:käµfm³d 👽
ID: 38729451
Ah, sorry. Use:

Page\s+1\s+of\s+([0-9]+)
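
As a rough sketch of how that pattern could be turned into a page count, assuming the pager text has already been extracted into a string: the names PagerParser and GetTotalPages are hypothetical, and returning -1 when nothing is found is just one possible convention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PagerParser
{
    // [0-9]+ accepts one or more digits, so 19, 103 and 1037 all match.
    private static readonly Regex TotalPages = new Regex(@"Page\s+1\s+of\s+([0-9]+)");

    // Returns the total page count, or -1 when the pager text is missing or unparsable.
    public static int GetTotalPages(string pageText)
    {
        Match match = TotalPages.Match(pageText);
        int total;

        if (match.Success && int.TryParse(match.Groups[1].Value, out total))
        {
            return total;
        }

        return -1;
    }

    public static void Main()
    {
        foreach (string sample in new[] { "Page 1 of 19", "Page 1 of 103", "Page 1 of 1037" })
        {
            Console.WriteLine(GetTotalPages(sample));
        }
    }
}
```

The returned count could then drive a simple loop over page numbers instead of collecting them from the pager links.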


Author Closing Comment

by:mmalik15
ID: 38729704
thanks