C#: Extracting links from HTML using HtmlAgilityPack

trevor1940
Hi,

I'm extracting links from HTML using HtmlAgilityPack, but I'm getting a lot of unwanted rubbish.

I have a list of wanted hosts. How might I filter the links against this list?

using System;
using System.Xml;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var html =
        @"<body>
    <div class='content'>
      <div id='post_message_30829575'>
        <a href='https://example.com/foo/123adj'><img src='https://example.com/foo/123adj/thumb.jpg'></a>
        <a href='https://example.com/foo/bar'>Link to foo bar</a>
        <a href='https://wanted.com/foo/bar'>Link to foo bar</a>
        <a href='https://rubish.com/foo/bar'>Link to foo bar</a>
      </div>
    </div>
  </body>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        string[] hosts = { "example", "wanted" };

        var HTMLBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
        foreach (var Node in HTMLBody.DescendantsAndSelf())
        {
            var linkNodes = Node.Descendants("a");
            foreach (HtmlNode linkNode in linkNodes)
            {
                HtmlAttribute link = linkNode.Attributes["href"];
                HtmlNode imageNode = linkNode.SelectSingleNode(".//img");
                // Don't need images
                if (imageNode == null)
                {
                    Console.WriteLine("Link value: {0}", link.Value);
                }
            }
        }
    }
}

hosts.Where(myHost => linkNodes.Any(myLinkNode => myLinkNode.Attributes["href"].SelectSingleNode(".//img").Equals(null) && myLinkNode.Attributes["href"].Value.StartsWith($"https://"{myHost})));

Author

Commented:
Erm...

That produces loads of errors (see the attached screenshot, erros.JPG).
If I can test for hosts, the image check isn't needed: in reality the wanted links don't contain images.

I made a fiddle
var filtered = linkNodes.Where(n => hosts.Any(h => n.Attributes["href"].Value.StartsWith("https://" + h)));

link to fiddle

Thanks for reminding me about fiddle.
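
One caveat with the `StartsWith` filter above: it compares string prefixes, so a host entry such as "wanted" would also accept a URL like `https://wanted.com.evil.net/x`. A minimal sketch of a stricter check follows, assuming the host list holds full host names ("example.com", "wanted.com") rather than the bare prefixes used in the question. `HostFilter` and `IsWantedHost` are hypothetical names, not part of the thread or of HtmlAgilityPack; only `System.Uri` and LINQ are used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostFilter
{
    // Hypothetical helper: true when the URL is an absolute http(s) URL whose
    // host equals a wanted host or is a subdomain of one.
    public static bool IsWantedHost(string url, IEnumerable<string> wantedHosts)
    {
        // Reject anything that is not an absolute http or https URL
        // (relative hrefs, mailto:, javascript:, and so on).
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            return false;
        }

        return wantedHosts.Any(h =>
            uri.Host.Equals(h, StringComparison.OrdinalIgnoreCase) ||
            uri.Host.EndsWith("." + h, StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        // Assumption: full host names instead of the prefixes from the question.
        string[] hosts = { "example.com", "wanted.com" };
        string[] urls =
        {
            "https://example.com/foo/bar",
            "https://wanted.com/foo/bar",
            "https://rubish.com/foo/bar",
            "https://wanted.com.evil.net/x"
        };

        // Prints only the example.com and wanted.com links.
        foreach (var url in urls.Where(u => IsWantedHost(u, hosts)))
        {
            Console.WriteLine(url);
        }
    }
}
```

With HtmlAgilityPack the same predicate could be applied to each node's `href` value in place of the `StartsWith` test.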

Author

Commented:
Your fiddle isn't working.
Did you save it?
My original fiddle with your post gives:

Compilation error (line 33, col 30): 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' does not contain a definition for 'Where' and no extension method 'Where' accepting a first argument of type 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' could be found (are you missing a using directive or an assembly reference?)
Compilation error (line 33, col 47): 'System.Array' does not contain a definition for 'Any' and no extension method 'Any' accepting a first argument of type 'System.Array' could be found (are you missing a using directive or an assembly reference?)
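
Both errors most likely share one cause: `Where` and `Any` are LINQ extension methods that live in the `System.Linq` namespace, and the program in the question has no `using System.Linq;` directive. A minimal sketch of the fix, assuming `hosts` is declared as `string[]` as in the question. Plain strings stand in for the `HtmlNode` objects so the sample runs without HtmlAgilityPack, and `LinqDemo` and `Filter` are hypothetical names.

```csharp
using System;
using System.Linq; // brings Where, Any, ToArray, ... into scope

public static class LinqDemo
{
    // Hypothetical helper: keeps the links that start with "https://" + host
    // for at least one of the wanted hosts.
    public static string[] Filter(string[] links, string[] hosts)
    {
        return links
            .Where(l => hosts.Any(h => l.StartsWith("https://" + h, StringComparison.Ordinal)))
            .ToArray();
    }

    public static void Main()
    {
        string[] hosts = { "example", "wanted" };
        string[] links =
        {
            "https://example.com/foo/bar",
            "https://wanted.com/foo/bar",
            "https://rubish.com/foo/bar"
        };

        foreach (var link in Filter(links, hosts))
        {
            Console.WriteLine("Link value: {0}", link);
        }
    }
}
```

Removing the `using System.Linq;` line from this sample should reproduce errors of the same kind as the two quoted above.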


Sorry. Here's the fiddle https://dotnetfiddle.net/qY9nPp

Author

Commented:
Thank you

FYI

I had to alter the link container to get it working on real HTML documents:

var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
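
For reference, a rough sketch of how the pieces of this thread might fit together, assuming the HtmlAgilityPack NuGet package is referenced. `LinkExtractor` and `ExtractLinks` are hypothetical names, and the `Distinct()` call and the null check are additions not discussed above. The null check is there because `SelectNodes` returns null, not an empty collection, when the XPath matches nothing. The image test from the question is left out, as it turned out not to be needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be referenced

public static class LinkExtractor
{
    // Hypothetical helper: returns the distinct href values that start with
    // "https://" + host for at least one of the wanted hosts.
    public static List<string> ExtractLinks(string html, string[] hosts)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Only anchors that actually carry an href attribute.
        var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (linkNodes == null)
        {
            return new List<string>(); // no matching anchors in the document
        }

        return linkNodes
            .Select(n => n.GetAttributeValue("href", string.Empty))
            .Where(href => hosts.Any(h => href.StartsWith("https://" + h, StringComparison.Ordinal)))
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        var html = @"<body>
            <a href='https://example.com/foo/bar'>Link to foo bar</a>
            <a href='https://wanted.com/foo/bar'>Link to foo bar</a>
            <a href='https://rubish.com/foo/bar'>Link to foo bar</a>
        </body>";
        string[] hosts = { "example", "wanted" };

        foreach (var link in ExtractLinks(html, hosts))
        {
            Console.WriteLine("Link value: {0}", link);
        }
    }
}
```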
