trevor1940
asked on
C#: HtmlAgilityPack getting elements using Xpath
Hi
I'm trying to use HtmlAgilityPack to traverse some HTML (Test.html).
Under each b-post there will be a single picture, a group of pictures in a slideshow, or a video. I'm struggling to extract elements such as the title and the img src for each post.
using System;
using System.IO;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello World!");
        string Root = @"H:\TopTotty\CarrieLaChance\onlyfans\Vids\";
        string CarrieOF = Root + "Test.html"; // "CarrieOF.html";
        var htmlDoc = new HtmlDocument();
        htmlDoc.Load(CarrieOF);
        var bPostNodes = htmlDoc.DocumentNode.SelectNodes(".//div[contains(@class,'b-post')]");
        string Title = "";
        string VidSrc = "";
        string Poster = "";
        foreach (var divNodes in bPostNodes)
        {
            // /html/body/div/main/div/div/div[2]/div/div[2]/div[2]/div[1]/div[1]/div/a/span
            //Title = divNodes.SelectSingleNode(".//span").Attributes["title"].Value;
            try
            {
                // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[1]/div[1]/div/a/span
                if (divNodes.SelectSingleNode(".//span").Attributes["title"] != null)
                {
                    Title = divNodes.SelectSingleNode(".//span").Attributes["title"].Value;
                }
                // /html/body/main/div/div/div/div/div/div/div/div/div[1]/div[3]/div[1]/figure/div/div[2]/video/source
                if (divNodes.SelectSingleNode(".//video/source").Attributes["src"].Value != null)
                {
                    VidSrc = divNodes.SelectSingleNode(".//video/source").Attributes["src"].Value;
                    Uri uri = new Uri(VidSrc);
                    string LocalFile = Root + "\\" + System.IO.Path.GetFileName(uri.LocalPath);
                    if (File.Exists(LocalFile))
                    {
                        Console.WriteLine("Title {0} , {1} ", Title, LocalFile);
                    }
                    if (divNodes.SelectSingleNode(".//video").Attributes["poster"].Value != null)
                    {
                        Poster = divNodes.SelectSingleNode(".//video").Attributes["poster"].Value;
                        Console.WriteLine("poster {0}", Poster);
                    }
                }
                // Single image
                Console.WriteLine("Title {0} Before Single image", Title);
                // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[3]/div[1]
                if (divNodes.SelectSingleNode(".//div[starts-with(@class,'post_img_block')]") != null)
                {
                    // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[3]/div[1]/div/img
                    var imgNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'post_img_block')]");
                    string ImgSrc = imgNode.SelectSingleNode(".//img").Attributes["src"].Value;
                    Console.WriteLine("Title {0} , Image src: {1} ", Title, ImgSrc);
                }
                // Slideshow
                Console.WriteLine("Title {0} Before Slideshow", Title);
                if (divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper')]") != null)
                {
                    var SlideNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]");
                    var Slides = SlideNode.SelectNodes(".//img");
                    foreach (var Slide in Slides)
                    {
                        string ImgSrc = Slide.Attributes["src"].Value;
                        Console.WriteLine("Title in Slideshow {0} , Image src: {1} ", Title, ImgSrc);
                    }
                }
            }
            catch (Exception)
            {
                continue;
            }
        }
        Console.WriteLine("I'm Done");
    }
}
ASKER CERTIFIED SOLUTION
Hi,
I do not see anything that would really need it (but maybe I am wrong). Can you use it? Yes, you can, but I am still not sure you really need it.
From what I see, divNodes is an element from a foreach iteration, and I doubt it would be null, since your XPath requested objects with a specific property. SlideNode won't be null either, since the if excludes all nulls.
If Slides is null then you do get an exception, which may not cause you any real issues since this block is near the end. Still, I would check that Slides is not null before continuing, because I do not like relying on the catch.
Another issue could be a Slide without an 'src' attribute when you try to get its value. That could cause a problem as well, and here you could really use Slide.Attributes["src"]?.Value.
In general, ?. is a nice way to avoid constantly checking whether an object is null before accessing one of its members. Other than that, you should probably check for nulls one way or the other, just to be on the safe side.
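As a rough illustration of the null checks discussed above, here is a minimal sketch. The helper name AttrOrNull and the inline sample HTML are made up for the example and are not from the original code. It relies on HtmlAgilityPack returning null from SelectNodes and SelectSingleNode when nothing matches, and null from the attribute indexer when the attribute is absent.

```csharp
using System;
using HtmlAgilityPack;

class NullSafeDemo
{
    // Hypothetical helper: returns the attribute value, or null when
    // either the node or the attribute is missing.
    public static string AttrOrNull(HtmlNode node, string name) =>
        node?.Attributes[name]?.Value;

    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample markup: one img with a src, one without.
        doc.LoadHtml("<div><img src='a.jpg'><img></div>");

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so check before iterating instead of relying on a catch block.
        var imgs = doc.DocumentNode.SelectNodes("//img");
        if (imgs == null) return;

        foreach (var img in imgs)
        {
            string src = AttrOrNull(img, "src");
            Console.WriteLine(src ?? "(no src)");
        }

        // There is no video element in the sample, so SelectSingleNode returns null here.
        var video = doc.DocumentNode.SelectSingleNode("//video");
        Console.WriteLine(AttrOrNull(video, "poster") ?? "(no video)");
    }
}
```

HtmlNode.GetAttributeValue(name, defaultValue) is another option when a default string is more convenient than null.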
I think you have it a bit wrong. You think you iterate only the b-post classes, but instead you iterate all divs whose class contains b-post. That means b-post__text is also a valid result for your XPath. You may find how to work it out more properly here. Even that way, though, there is still a div with class b-post nested in the first div with class b-post, and that makes the whole thing run twice.
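One possible way to sketch both points (matching b-post as a whole class token, and skipping a b-post nested inside another b-post) is shown below. The helper HasClass and the sample markup are assumptions for illustration only; the real Test.html may need a different filter.

```csharp
using System;
using HtmlAgilityPack;

class BPostDemo
{
    // XPath 1.0 idiom for "the class list contains this exact token":
    // pad the class attribute with spaces and look for " token ".
    public static string HasClass(string cls) =>
        $"contains(concat(' ', normalize-space(@class), ' '), ' {cls} ')";

    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up markup mimicking the structure described in the thread.
        doc.LoadHtml(
            "<div class='b-post'>" +
              "<div class='b-post__text'>text</div>" +
              "<div class='b-post'>nested</div>" +
            "</div>" +
            "<div class='b-post other'>second</div>");

        string bPost = HasClass("b-post");
        // Keep only b-post divs that have no b-post ancestor (outermost posts).
        string xpath = $"//div[{bPost}][not(ancestor::div[{bPost}])]";

        var posts = doc.DocumentNode.SelectNodes(xpath);
        Console.WriteLine(posts?.Count ?? 0);
    }
}
```

With a plain contains(@class,'b-post') test, the b-post__text div and the nested b-post div in this sample would be matched as well.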
I am not sure if I have helped or puzzled you more...
Giannis
ASKER
I get what you're saying, and the last paragraph about b-post explains why it's traversing twice.
However, I still don't understand why
var Slides = SlideNode.SelectNodes(".//img");
only has 1?
I thought SelectNodes meant select all nodes that are img below the current div or SlideNode.
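One likely explanation, judging only from the posted code: SlideNode comes from SelectSingleNode, which returns just the first matching swiper-slide div, so SelectNodes(".//img") only searches inside that one slide. A minimal sketch of selecting the images across all slides follows. It assumes every slide is a div carrying the swiper-slide class token; the helper SlideSources and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class SlideDemo
{
    // Hypothetical helper: collects the src of every img found under
    // every swiper-slide div inside the given post node.
    public static List<string> SlideSources(HtmlNode post)
    {
        var result = new List<string>();
        var imgs = post.SelectNodes(
            ".//div[contains(concat(' ', normalize-space(@class), ' '), ' swiper-slide ')]//img");
        if (imgs == null) return result; // SelectNodes returns null when nothing matches

        foreach (var img in imgs)
        {
            string src = img.Attributes["src"]?.Value;
            if (!string.IsNullOrEmpty(src)) result.Add(src);
        }
        return result;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up markup: one post with two slides.
        doc.LoadHtml(
            "<div class='b-post'><div class='swiper-wrapper'>" +
            "<div class='swiper-slide'><img src='1.jpg'></div>" +
            "<div class='swiper-slide'><img src='2.jpg'></div>" +
            "</div></div>");

        var post = doc.DocumentNode.SelectSingleNode("//div[@class='b-post']");
        foreach (var src in SlideSources(post))
            Console.WriteLine(src);
    }
}
```

The design choice here is to let one XPath walk all slides, rather than picking a single slide node first and searching beneath it.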
ASKER
Thanks for your help
ASKER
Adding the "?" worked. Is it possible to do something similar here?
Also, in that block I'm only getting the first slide: Slides.Count is only ever 1 when it should be 5 (admittedly with duplicates).