trevor1940

C#: extracting links from html using HtmlAgilityPack

Hi

I'm extracting links from HTML using HtmlAgilityPack, but I'm getting a lot of unwanted rubbish.

I have a list of wanted hosts. How might I filter the links against this list?

using System;
using System.Xml;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var html =
        @"<body>
    <div class='content'>
      <div id='post_message_30829575'>
        <a href='https://example.com/foo/123adj'><img src='https://example.com/foo/123adj/thumb.jpg'></a>
        <a href='https://example.com/foo/bar'>Link to foo bar</a>
        <a href='https://wanted.com/foo/bar'>Link to foo bar</a>
        <a href='https://rubish.com/foo/bar'>Link to foo bar</a>
      </div>
    </div>  
  </body>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Wanted hosts; the links below are not yet filtered against this list.
        string[] hosts = { "example", "wanted" };

        var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");

        // Descendants("a") already walks the whole subtree, so one pass is enough;
        // looping over every descendant as well would print each link several times.
        var linkNodes = htmlBody.Descendants("a");
        foreach (HtmlNode linkNode in linkNodes)
        {
            HtmlAttribute link = linkNode.Attributes["href"];
            HtmlNode imageNode = linkNode.SelectSingleNode(".//img");

            // Skip anchors without an href and anchors that wrap an image.
            if (link != null && imageNode == null)
            {
                Console.WriteLine("Link value: {0}", link.Value);
            }
        }
    }
}

Mark Bullock

hosts.Where(myHost => linkNodes.Any(myLinkNode => myLinkNode.SelectSingleNode(".//img") == null && myLinkNode.Attributes["href"].Value.StartsWith($"https://{myHost}")));
trevor1940

ASKER

Erm

That produces loads of errors.

If I can test against the hosts, the image check isn't needed: in reality those links don't have images.

I made a fiddle
var filtered = linkNodes.Where(n => hosts.Any(h => n.Attributes["href"].Value.StartsWith("https://" + h)));

link to fiddle
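
A prefix test such as `StartsWith("https://" + h)` only matches `https` links and would also accept a host like `example-other.com` for the prefix `example`. As a minimal sketch of an alternative, the host can be parsed with `System.Uri` and compared directly. The names `HostFilter`, `IsWanted` and `KeepWanted` are hypothetical, and the sketch assumes the wanted list holds full host names such as `example.com` rather than the short form used in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HostFilter
{
    // True when the URL is http(s) and its host equals a wanted host
    // or is a subdomain of one. Assumes full host names ("example.com").
    public static bool IsWanted(string url, IEnumerable<string> wantedHosts)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return false; // not an absolute URL, so there is no host to compare
        }

        // Ignore mailto:, file: and other non-web schemes.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            return false;
        }

        return wantedHosts.Any(h =>
            uri.Host.Equals(h, StringComparison.OrdinalIgnoreCase) ||
            uri.Host.EndsWith("." + h, StringComparison.OrdinalIgnoreCase));
    }

    // Keeps only the URLs whose host is in the wanted list.
    public static List<string> KeepWanted(IEnumerable<string> urls, IEnumerable<string> wantedHosts)
    {
        return urls.Where(u => IsWanted(u, wantedHosts)).ToList();
    }
}
```

With HtmlAgilityPack, the `href` values of the link nodes could be passed to `KeepWanted`, or `IsWanted` could be used inside a `Where` over the nodes themselves.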

Thanks for reminding me about fiddle.
Your fiddle isn't working.
Did you save it?
My original fiddle with your code added gives:

Compilation error (line 33, col 30): 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' does not contain a definition for 'Where' and no extension method 'Where' accepting a first argument of type 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' could be found (are you missing a using directive or an assembly reference?)
Compilation error (line 33, col 47): 'System.Array' does not contain a definition for 'Any' and no extension method 'Any' accepting a first argument of type 'System.Array' could be found (are you missing a using directive or an assembly reference?)
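
Errors of this shape usually mean the LINQ extension methods are not in scope: `Where` and `Any` are defined in the `System.Linq` namespace, and the original listing has no `using System.Linq;` directive. The following is a small sketch of the same filter compiling once that directive is present. `LinqDemo` and `FilterByPrefix` are hypothetical names, and plain strings stand in for the `href` values.

```csharp
using System;
using System.Linq; // brings Where, Any and ToArray into scope as extension methods

// Hypothetical demo class; plain strings stand in for href values.
public static class LinqDemo
{
    // Keeps the links that start with "https://" followed by one of the host prefixes.
    public static string[] FilterByPrefix(string[] links, string[] hosts)
    {
        return links
            .Where(l => hosts.Any(h => l.StartsWith("https://" + h, StringComparison.Ordinal)))
            .ToArray();
    }
}
```

If a variable is declared as the non-generic `System.Array` rather than `string[]`, the directive alone is not enough; calling `.Cast<string>()` first gives LINQ a typed sequence to work with.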


ASKER CERTIFIED SOLUTION
Mark Bullock
This solution is only available to members.
Thank you

FYI

I had to change how the links are collected to get it working on real HTML documents:

var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
