C#: Extracting links from HTML using HtmlAgilityPack

Hi,

I'm extracting links from HTML using HtmlAgilityPack, but I'm getting a lot of unwanted rubbish.

I have a list of wanted hosts. How might I filter the links against this list?

using System;
using System.Xml;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var html =
        @"<body>
    <div class='content'>
      <div id='post_message_30829575'>
        <a href='https://example.com/foo/123adj'><img src='https://example.com/foo/123adj/thumb.jpg'></a>
        <a href='https://example.com/foo/bar'>Link to foo bar</a>
        <a href='https://wanted.com/foo/bar'>Link to foo bar</a>
        <a href='https://rubish.com/foo/bar'>Link to foo bar</a>
      </div>
    </div>
  </body>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        // Hosts whose links should be kept (not yet used for filtering)
        string[] hosts = { "example", "wanted" };

        var HTMLBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
        foreach (var Node in HTMLBody.DescendantsAndSelf())
        {
            var linkNodes = Node.Descendants("a");
            foreach (HtmlNode linkNode in linkNodes)
            {
                HtmlAttribute link = linkNode.Attributes["href"];
                HtmlNode imageNode = linkNode.SelectSingleNode(".//img");
                // Don't need images
                if (imageNode == null)
                {
                    Console.WriteLine("Link value: {0}", link.Value);
                }
            }
        }
    }
}

Asked by: trevor1940

Mark Bullock (QA Engineer) commented:
hosts.Where(myHost => linkNodes.Any(myLinkNode => myLinkNode.Attributes["href"].SelectSingleNode(".//img").Equals(null) && myLinkNode.Attributes["href"].Value.StartsWith($"https://"{myHost})));
trevor1940 (Author) commented:
Erm, that produces loads of errors (screenshot: erros.JPG).

If I can filter by host, the image check isn't needed; in reality the links from the wanted hosts don't contain images.

I made a fiddle.
Mark Bullock (QA Engineer) commented:
var filtered = linkNodes.Where(n => hosts.Any(h => n.Attributes["href"].Value.StartsWith("https://" + h)));

link to fiddle

Thanks for reminding me about fiddle.
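
One thing to watch with the prefix test above: StartsWith("https://" + h) would also accept a host such as wanted.evil.net, and it skips http:// links. A minimal sketch of a stricter check follows. It assumes the wanted list holds full host names such as "example.com" (the list in the question uses bare prefixes), and HostFilter and IsWantedHost are hypothetical names, not part of HtmlAgilityPack. It parses each href with System.Uri and compares the Host part:

```csharp
using System;
using System.Linq;

public static class HostFilter
{
    // True when the URL's host equals a wanted host or is a subdomain of it.
    // Assumption: wantedHosts holds full host names such as "example.com".
    public static bool IsWantedHost(string url, string[] wantedHosts)
    {
        // Relative or malformed hrefs have no usable host, so reject them.
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri))
            return false;

        // Only web links are of interest here.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;

        return wantedHosts.Any(h =>
            uri.Host.Equals(h, StringComparison.OrdinalIgnoreCase) ||
            uri.Host.EndsWith("." + h, StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        string[] wantedHosts = { "example.com", "wanted.com" };
        string[] urls =
        {
            "https://example.com/foo/bar",
            "https://wanted.com/foo/bar",
            "https://rubish.com/foo/bar",
            "https://wanted.evil.net/foo/bar"
        };

        // Prints only the example.com and wanted.com URLs.
        foreach (string url in urls.Where(u => IsWantedHost(u, wantedHosts)))
            Console.WriteLine(url);
    }
}
```

Comparing hosts is a little more code than a prefix test, but it does not depend on the scheme or on what follows the host name.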

trevor1940 (Author) commented:
Your fiddle isn't working. Did you save it?
My original fiddle with your line added gives:

Compilation error (line 33, col 30): 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' does not contain a definition for 'Where' and no extension method 'Where' accepting a first argument of type 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' could be found (are you missing a using directive or an assembly reference?)
Compilation error (line 33, col 47): 'System.Array' does not contain a definition for 'Any' and no extension method 'Any' accepting a first argument of type 'System.Array' could be found (are you missing a using directive or an assembly reference?)
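
Where and Any are LINQ extension methods, so the first error is what the compiler reports when using System.Linq; is missing from the top of the file. The second message mentions System.Array, which hints that hosts may have been declared as a non-generic Array in that fiddle. That is a guess, since the fiddle itself is not shown. A minimal sketch, using plain strings in place of HtmlNode objects and the hypothetical names LinqUsingDemo and FilterLinks, covers both cases:

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // Where, Any and Cast are extension methods defined here

public static class LinqUsingDemo
{
    // Keeps links that start with "https://" followed by one of the hosts.
    public static List<string> FilterLinks(IEnumerable<string> links, string[] hosts)
    {
        return links
            .Where(l => hosts.Any(h => l.StartsWith("https://" + h, StringComparison.Ordinal)))
            .ToList();
    }

    public static void Main()
    {
        string[] hosts = { "example", "wanted" };
        var links = new List<string>
        {
            "https://example.com/foo/bar",
            "https://wanted.com/foo/bar",
            "https://rubish.com/foo/bar"
        };

        foreach (string link in FilterLinks(links, hosts))
            Console.WriteLine("Link value: {0}", link);

        // A non-generic System.Array carries no element type for LINQ,
        // so it needs Cast<string>() before Any can take a typed lambda.
        Array untyped = hosts;
        Console.WriteLine(untyped.Cast<string>().Any(h => h == "wanted")); // True
    }
}
```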

Mark Bullock (QA Engineer) commented:
Sorry. Here's the fiddle: https://dotnetfiddle.net/qY9nPp

trevor1940 (Author) commented:
Thank you.

FYI, I had to change how the link nodes are selected to get it working on real HTML documents:

var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
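
For anyone reusing that line: SelectNodes returns null rather than an empty collection when nothing matches, so a page without links needs a guard. A minimal sketch that puts the selector and the host filter from this thread together might look like the following. LinkExtractor and ExtractLinks are hypothetical names, and it assumes the HtmlAgilityPack NuGet package is referenced:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the href values that start with "https://" followed by one of the hosts.
    public static List<string> ExtractLinks(string html, string[] hosts)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Select every anchor that has an href attribute, anywhere in the document.
        var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");

        // SelectNodes returns null when no node matches the XPath expression.
        if (linkNodes == null)
            return new List<string>();

        return linkNodes
            .Select(n => n.GetAttributeValue("href", string.Empty))
            .Where(href => hosts.Any(h => href.StartsWith("https://" + h, StringComparison.Ordinal)))
            .ToList();
    }

    public static void Main()
    {
        string html = @"<body>
            <a href='https://example.com/foo/bar'>Link to foo bar</a>
            <a href='https://wanted.com/foo/bar'>Link to foo bar</a>
            <a href='https://rubish.com/foo/bar'>Link to foo bar</a>
          </body>";
        string[] hosts = { "example", "wanted" };

        foreach (string link in ExtractLinks(html, hosts))
            Console.WriteLine("Link value: {0}", link);
    }
}
```

Selecting once from DocumentNode also avoids walking every descendant and collecting the same anchors repeatedly, as the nested loops in the question do.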
