trevor1940
asked on
C#: extracting links from html using HtmlAgilityPack
Hi
I'm extracting links from HTML using HtmlAgilityPack, but I'm getting a lot of unwanted rubbish.
I have a list of wanted hosts. How might I filter the links against this list?
using System;
using System.Xml;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var html =
            @"<body>
            <div class='content'>
            <div id='post_message_30829575'>
            <a href='https://example.com/foo/123adj'><img src='https://example.com/foo/123adj/thumb.jpg'></a>
            <a href='https://example.com/foo/bar'>Link to foo bar</a>
            <a href='https://wanted.com/foo/bar'>Link to foo bar</a>
            <a href='https://rubish.com/foo/bar'>Link to foo bar</a>
            </div>
            </div>
            </body>";
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        // Wanted hosts (not yet used for filtering)
        string[] hosts = {"example", "wanted"};
        var HTMLBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
        foreach (var Node in HTMLBody.DescendantsAndSelf())
        {
            var linkNodes = Node.Descendants("a");
            foreach (HtmlNode linkNode in linkNodes)
            {
                HtmlAttribute link = linkNode.Attributes["href"];
                HtmlNode imageNode = linkNode.SelectSingleNode(".//img");
                // Don't need images
                if (imageNode == null)
                {
                    Console.WriteLine("Link value: {0}", link.Value);
                }
            }
        }
    }
}
hosts.Where(myHost => linkNodes.Any(myLinkNode => myLinkNode.Attributes["href"].SelectSingleNode(".//img").Equals(null) && myLinkNode.Attributes["href"].Value.StartsWith($"https://"{myHost})));
ASKER
Erm
That produces loads of errors.
If I can test for the hosts, I don't need the image check: in reality those links don't have images.
I made a fiddle.
ASKER
Your fiddle isn't working.
Did you save it?
My original fiddle with your post gives:
Compilation error (line 33, col 30): 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' does not contain a definition for 'Where' and no extension method 'Where' accepting a first argument of type 'System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>' could be found (are you missing a using directive or an assembly reference?)
Compilation error (line 33, col 47): 'System.Array' does not contain a definition for 'Any' and no extension method 'Any' accepting a first argument of type 'System.Array' could be found (are you missing a using directive or an assembly reference?)
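Both errors above are what the compiler reports when `Where` and `Any` are called without `using System.Linq;`, since they are LINQ extension methods. Below is a minimal sketch of a host filter with that directive in place, assuming the href values have already been collected as strings. The names `HostFilter`, `IsWanted`, and `KeepWanted`, and the choice to match against `Uri.Host`, are illustrative assumptions and not code from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // brings Where, Any and ToList into scope

public static class HostFilter
{
    // True when href parses as an absolute URL whose host contains one of the wanted names.
    public static bool IsWanted(string href, string[] hosts)
    {
        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
        {
            return false;
        }
        return hosts.Any(h => uri.Host.Contains(h));
    }

    // Keeps only the hrefs that pass IsWanted, preserving their order.
    public static List<string> KeepWanted(IEnumerable<string> hrefs, string[] hosts)
    {
        return hrefs.Where(href => IsWanted(href, hosts)).ToList();
    }
}

public class Demo
{
    public static void Main()
    {
        string[] hosts = { "example", "wanted" };
        var hrefs = new[]
        {
            "https://example.com/foo/bar",
            "https://wanted.com/foo/bar",
            "https://rubish.com/foo/bar"
        };
        foreach (var href in HostFilter.KeepWanted(hrefs, hosts))
        {
            Console.WriteLine("Link value: {0}", href);
        }
    }
}
```

With the sample hrefs, this keeps the example.com and wanted.com links and drops the rubish.com one. Matching on the parsed host rather than on the start of the string means a wanted name appearing only in the path or query does not count as a match.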
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
Thank you
FYI, I had to alter the link container to get it working on real HTML documents:
var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
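For readers without access to the accepted solution, here is one way the pieces of this thread might fit together. It is a sketch only, not the members-only answer: it uses the `SelectNodes("//a[@href]")` call from the line above, guards against the null that `SelectNodes` returns when nothing matches, keeps the image test from the question, and filters on the host. The names `LinkExtractor`, `WantedLinks`, and `IsWantedHost`, and the match on `Uri.Host`, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the href of every non-image link whose host contains a wanted name.
    public static List<string> WantedLinks(string html, string[] hosts)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when no node matches.
        var linkNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (linkNodes == null)
        {
            return new List<string>();
        }

        return linkNodes
            .Where(a => a.SelectSingleNode(".//img") == null) // skip links that wrap an image
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => IsWantedHost(href, hosts))
            .Distinct()
            .ToList();
    }

    // True when href is an absolute URL whose host contains one of the wanted names.
    private static bool IsWantedHost(string href, string[] hosts)
    {
        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
        {
            return false;
        }
        return hosts.Any(h => uri.Host.Contains(h));
    }
}
```

Calling `LinkExtractor.WantedLinks(html, hosts)` with the sample HTML from the question and the hosts `example` and `wanted` should leave the two text links on example.com and wanted.com. Selecting once from the document root also avoids the repeated output that comes from calling `Descendants("a")` on every node in the tree.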