C#: HtmlAgilityPack getting elements using XPath

trevor1940
Hi,
I'm trying to use HtmlAgilityPack to traverse some HTML (Test.html).

Under each b-post there will be a single picture, a group of pictures in a slideshow, or a video. I'm struggling to extract elements such as the title of each post and the img src.

    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            string Root = @"H:\TopTotty\CarrieLaChance\onlyfans\Vids\";
            string CarrieOF = Root + "Test.html"; // "CarrieOF.html";
            var htmlDoc = new HtmlDocument();
            htmlDoc.Load(CarrieOF);


            var bPostNodes = htmlDoc.DocumentNode.SelectNodes(".//div[contains(@class,'b-post')]");

            string Title = "";
            string VidSrc = "";
            string Poster = "";
            foreach (var divNodes in bPostNodes)
            {
                ///html/body/div/main/div/div/div[2]/div/div[2]/div[2]/div[1]/div[1]/div/a/span
                //Title = divNodes.SelectSingleNode(".//span").Attributes["title"].Value;
                try
                {
                    ///html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[1]/div[1]/div/a/span
                    if (divNodes.SelectSingleNode(".//span").Attributes["title"] != null)
                    {
                        Title = divNodes.SelectSingleNode(".//span").Attributes["title"].Value;
                    }

                    // /html/body/main/div/div/div/div/div/div/div/div/div[1]/div[3]/div[1]/figure/div/div[2]/video/source
                    if (divNodes.SelectSingleNode(".//video/source").Attributes["src"].Value != null)
                    {

                        VidSrc = divNodes.SelectSingleNode(".//video/source").Attributes["src"].Value;
                        Uri uri = new Uri(VidSrc);

                        string LocalFile = Root + "\\" + System.IO.Path.GetFileName(uri.LocalPath);
                        if (File.Exists(LocalFile))
                        {
                            Console.WriteLine("Title  {0} , {1} ", Title, LocalFile);
                        }
                        if (divNodes.SelectSingleNode(".//video").Attributes["poster"].Value != null)
                        {
                            Poster = divNodes.SelectSingleNode(".//video").Attributes["poster"].Value;
                            Console.WriteLine("poster {0}", Poster);
                        }
                    }


                    //  Single image
                    Console.WriteLine("Title  {0} Before  Single image", Title);
                    // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[3]/div[1]
                    if (divNodes.SelectSingleNode(".//div[starts-with(@class,'post_img_block'])") != null)
                    {

                        // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[3]/div[1]/div/img
                        var imgNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'post_img_block')]");
                        string ImgSrc = imgNode.SelectSingleNode(".//img").Attributes["src"].Value;
                        Console.WriteLine("Title   {0} , Image src: {1} ", Title, ImgSrc);

                    }

                    //Slideshow

                    Console.WriteLine("Title  {0} Before  Slideshow", Title);
                    if (divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper'])") != null)
                    {
                        var SlideNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]");
                        var Slides = SlideNode.SelectNodes(".//img");
                        foreach (var Slide in Slides)
                        {
                            string ImgSrc = Slide.Attributes["src"].Value;
                            Console.WriteLine("Title in Slideshow {0} , Image src: {1} ", Title, ImgSrc);
                        }


                    }

                }
                catch (Exception)
                {

                    continue;
                }
            }
            Console.WriteLine("I'm Done");


        }
    }

Hi,

I see at least two issues. Both would have been easy to find if, instead of using continue in your catch block, you had written the exception to the console.
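
As a rough illustration of that point (not code from the thread), the sketch below reports each exception before moving on to the next item. The ProcessAll helper, its parameters, and the sample inputs are hypothetical stand-ins for the per-post work in the question.

```csharp
using System;
using System.Collections.Generic;

static class CatchLoggingSketch
{
    // Runs 'process' on every item. Exceptions are written to the console
    // and their type names collected, instead of being silently skipped.
    public static List<string> ProcessAll(IEnumerable<string> items, Action<string> process)
    {
        var errors = new List<string>();
        foreach (var item in items)
        {
            try
            {
                process(item);
            }
            catch (Exception ex)
            {
                // Say what went wrong, then carry on with the next item.
                Console.WriteLine("Failed on '{0}': {1}: {2}", item, ex.GetType().Name, ex.Message);
                errors.Add(ex.GetType().Name);
            }
        }
        return errors;
    }

    static void Main()
    {
        // "abc" is not a number, so int.Parse throws a FormatException for it.
        var errors = ProcessAll(new[] { "1", "abc", "3" }, s => Console.WriteLine(int.Parse(s) * 2));
        Console.WriteLine("{0} item(s) failed", errors.Count);
    }
}
```

With the exception type and message on screen, a NullReferenceException or an XPath syntax error points straight at the failing line.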

The first issue is that you access members of objects that may be null. For instance, your if statement here reads Attributes from the result of SelectSingleNode:

if (divNodes.SelectSingleNode(".//span").Attributes["title"] != null)

When divNodes.SelectSingleNode(".//span") returns null, accessing Attributes on it throws instead of carrying on to check the rest of the condition. You should check that the element itself is not null before accessing its members, or better, use ?. to access members of potentially null objects, like so:

if (divNodes.SelectSingleNode(".//span")?.Attributes["title"] != null)

?. returns null if the object is null, or the member's value if the object is not null.
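
A minimal sketch of that pattern, assuming the HtmlAgilityPack NuGet package is referenced. The FirstSpanTitle helper and the sample markup are hypothetical; chaining ?. twice covers both a missing span and a missing title attribute.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class NullConditionalSketch
{
    // Returns the title attribute of the first <span> under 'node',
    // or null when there is no span or the span has no title attribute.
    public static string FirstSpanTitle(HtmlNode node)
    {
        // SelectSingleNode returns null when nothing matches, and the
        // attribute indexer returns null for a missing attribute;
        // ?. short-circuits to null in both cases instead of throwing.
        return node.SelectSingleNode(".//span")?.Attributes["title"]?.Value;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical markup: the first div has a titled span, the second has no span.
        doc.LoadHtml("<div><span title='First'>x</span></div><div><p>no span</p></div>");

        foreach (var div in doc.DocumentNode.SelectNodes("//div"))
        {
            Console.WriteLine(FirstSpanTitle(div) ?? "(no title)");
        }
    }
}
```

Looking the node up once and reusing the result also avoids running the same XPath twice, as the if/assign pairs in the question do.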

The other issue, which would also show up if you logged the exceptions to the console, is a pair of transposed characters in your XPath in two places:

(".//div[starts-with(@class,'post_img_block'])")

should be:

(".//div[starts-with(@class,'post_img_block')]")

and

(".//div[starts-with(@class,'swiper'])")

should be

(".//div[starts-with(@class,'swiper')]")

on lines 52 and 65, respectively, of your sample code. The error is that you close the square bracket before the parenthesis.

Have a great 2020...

Giannis

Author

Commented:
Giannis, thanx for your help.

Adding the "?" worked. Is it possible to do something similar here?

    if (divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]") != null)
    {
        var SlideNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]");
        var Slides = SlideNode.SelectNodes(".//img");
        foreach (var Slide in Slides)
        {
            string ImgSrc = Slide.Attributes["src"].Value;
            Console.WriteLine("Title in Slideshow {0} , Image src: {1} ", Title, ImgSrc);
        }
    }

Also, in that block I'm only getting the first slide: Slides.Count is only ever 1 when it should be 5 (admittedly with duplicates).
Hi,

I do not see anything there that really needs it (but maybe I am wrong). Can you use it? Yes, you can, but I am still not sure you really need it.

From what I see, divNodes is an element from a foreach iteration, and I doubt it would ever be null, since SelectNodes only returns nodes that matched your XPath. SlideNode won't be null either, since the if excludes that case.

If Slides is null you do get an exception, which may not cause you any real trouble since this block is near the end. Still, I would check that Slides is not null before continuing, because I do not like relying on the catch.

Another issue could be a Slide without a 'src' attribute when you try to get its value. That would throw as well, and here you really could use Slide.Attributes["src"]?.Value.

In general, ?. is a nice way to avoid constantly checking whether an object is null before accessing one of its members. Other than that, you should probably check for nulls one way or another just to be on the safe side.



I think you have it a bit wrong. You think you iterate only over the elements with class b-post, but in fact you iterate over every div whose class attribute contains the text b-post. That means b-post__text is also a valid match for your XPath. You may find how to work it out more properly here. Even that way, though, there is still a div with class b-post nested inside the first div with class b-post, and that makes the whole thing run twice.
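
To make the difference concrete, here is a sketch using a common XPath 1.0 idiom for matching a whole class token (not necessarily the approach behind the link above). It assumes the HtmlAgilityPack NuGet package; the helper names and the sample markup are hypothetical, modelled on the class names mentioned in this thread.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class ClassTokenSketch
{
    // True when the class attribute contains "b-post" as a complete,
    // space-separated token (so "b-post__text" does not qualify).
    const string IsBPost = "contains(concat(' ', normalize-space(@class), ' '), ' b-post ')";

    // Substring match, as in the question: also hits b-post__text.
    public static int CountLoose(HtmlDocument doc) =>
        doc.DocumentNode.SelectNodes("//div[contains(@class,'b-post')]")?.Count ?? 0;

    // Whole-token match: only divs that really carry the b-post class.
    public static int CountExact(HtmlDocument doc) =>
        doc.DocumentNode.SelectNodes("//div[" + IsBPost + "]")?.Count ?? 0;

    // Whole-token match, skipping b-post divs nested inside another b-post div.
    public static int CountOutermost(HtmlDocument doc) =>
        doc.DocumentNode.SelectNodes(
            "//div[" + IsBPost + " and not(ancestor::div[" + IsBPost + "])]")?.Count ?? 0;

    public static HtmlDocument Sample()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='b-post outer'>" +
            "<div class='b-post__text'>text</div>" +
            "<div class='b-post'>nested</div>" +
            "</div>");
        return doc;
    }

    static void Main()
    {
        var doc = Sample();
        Console.WriteLine("loose:     {0}", CountLoose(doc));     // 3
        Console.WriteLine("exact:     {0}", CountExact(doc));     // 2
        Console.WriteLine("outermost: {0}", CountOutermost(doc)); // 1
    }
}
```

Whether skipping nested b-post divs is the right choice depends on the real page; if the inner div is the one that holds the media, the filter would need to go the other way.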

I am not sure if I have helped or puzzled you more...

Giannis

Author

Commented:
I get what you're saying, and the last paragraph about b-post explains why it's traversing twice.

However I still don't understand why
 var Slides = SlideNode.SelectNodes(".//img");

only has 1?

I thought SelectNodes meant "select all nodes" that are img below the current div (SlideNode).
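
A possible explanation, assuming each swiper-slide div in Test.html wraps a single img (the markup is not shown in this thread): SelectSingleNode returns only the first matching div, so SlideNode.SelectNodes(".//img") searches that one slide and not the others. The sketch below shows one alternative, selecting the img elements across every slide in a single XPath. The SlideSources helper and the sample markup are hypothetical, and the HtmlAgilityPack NuGet package is assumed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class SlideshowSketch
{
    // Collects the src of every <img> found under any swiper-slide div in 'post'.
    public static List<string> SlideSources(HtmlNode post)
    {
        var sources = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var imgs = post.SelectNodes(".//div[starts-with(@class,'swiper-slide')]//img");
        if (imgs == null) return sources;

        foreach (var img in imgs)
        {
            // Skip images without a src instead of throwing.
            var src = img.Attributes["src"]?.Value;
            if (src != null) sources.Add(src);
        }
        return sources;
    }

    public static HtmlNode SamplePost()
    {
        var doc = new HtmlDocument();
        // Hypothetical markup: three slides, one image each.
        doc.LoadHtml(
            "<div class='b-post'><div class='swiper-wrapper'>" +
            "<div class='swiper-slide'><img src='1.jpg'></div>" +
            "<div class='swiper-slide'><img src='2.jpg'></div>" +
            "<div class='swiper-slide'><img src='3.jpg'></div>" +
            "</div></div>");
        return doc.DocumentNode.SelectSingleNode("//div[@class='b-post']");
    }

    static void Main()
    {
        var post = SamplePost();

        // The approach from the question: first slide only.
        var firstSlide = post.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]");
        Console.WriteLine("first slide only: {0}", firstSlide.SelectNodes(".//img").Count);

        // All slides at once.
        Console.WriteLine("all slides:       {0}", SlideSources(post).Count);
    }
}
```

If the real page does nest all five images under one slide div, this would not explain the count of 1, and the markup itself would need a closer look.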

Author

Commented:
Thanx for your help