trevor1940

C#: HtmlAgilityPack getting elements using Xpath

Hi,
I'm trying to use HtmlAgilityPack to traverse some HTML (Test.html).

Under each b-post there will be a single picture, a group of pictures in a slideshow, or a video. I'm struggling to extract elements such as the title and the img src for each post.

  class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            string Root = @"H:\TopTotty\CarrieLaChance\onlyfans\Vids\";
            string CarrieOF = Root + "Test.html"; // "CarrieOF.html";
            var htmlDoc = new HtmlDocument();
            htmlDoc.Load(CarrieOF);


                var bPostNodes = htmlDoc.DocumentNode.SelectNodes(".//div[contains(@class,'b-post')]");

            string Title = "";
            string VidSrc = "";
            string Poster = "";
            foreach (var divNodes in bPostNodes)
                {
                ///html/body/div/main/div/div/div[2]/div/div[2]/div[2]/div[1]/div[1]/div/a/span
                //Title = divNodes.SelectSingleNode(".//span").Attributes["title"].Value;
                try
                {
                    ///html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[1]/div[1]/div/a/span
                    if (divNodes.SelectSingleNode(".//span").Attributes["title"] != null)
                    {
                        Title = divNodes.SelectSingleNode(".//span").Attributes["title"].Value;
                    }

                    // /html/body/main/div/div/div/div/div/div/div/div/div[1]/div[3]/div[1]/figure/div/div[2]/video/source
                    if (divNodes.SelectSingleNode(".//video/source").Attributes["src"].Value != null)
                    {

                        VidSrc = divNodes.SelectSingleNode(".//video/source").Attributes["src"].Value;
                        Uri uri = new Uri(VidSrc);

                        string LocalFile = Root + "\\" + System.IO.Path.GetFileName(uri.LocalPath);
                        if (File.Exists(LocalFile))
                        {
                            Console.WriteLine("Title  {0} , {1} ", Title, LocalFile);
                        }
                        if (divNodes.SelectSingleNode(".//video").Attributes["poster"].Value != null)
                        {
                            Poster = divNodes.SelectSingleNode(".//video").Attributes["poster"].Value;
                            Console.WriteLine("poster {0}", Poster);
                        } 
                    }


                    //  Single image
                    Console.WriteLine("Title  {0} Before  Single image", Title);
                    // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[3]/div[1]
                    if (divNodes.SelectSingleNode(".//div[starts-with(@class,'post_img_block')]") != null)
                    {

                        // /html/body/div/main/div/div/div[2]/div/div[2]/div[5]/div[3]/div[1]/div/img
                        var imgNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'post_img_block')]");
                        string ImgSrc = imgNode.SelectSingleNode(".//img").Attributes["src"].Value;
                        Console.WriteLine("Title   {0} , Image src: {1} ", Title, ImgSrc);

                    }

                    //Slideshow

                    Console.WriteLine("Title  {0} Before  Slideshow", Title);
                    if (divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper')]") != null)
                    {
                        var SlideNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]");
                        var Slides = SlideNode.SelectNodes(".//img");
                        foreach (var Slide in Slides)
                        {
                        string ImgSrc = Slide.Attributes["src"].Value;
                        Console.WriteLine("Title in Slideshow {0} , Image src: {1} ", Title, ImgSrc);
                        }
                        

                    }

                }
                catch (Exception)
                {

                    continue;
                }
            }
                Console.WriteLine("I'm Done");

            
        }
    }


ASKER CERTIFIED SOLUTION
Ioannis Paraskevopoulos
trevor1940

ASKER

Giannis, thanks for your help.

Adding the "?" worked. Is it possible to do something similar here?

                     if (divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]") != null)
                    {
                        var SlideNode = divNodes.SelectSingleNode(".//div[starts-with(@class,'swiper-slide')]");
                        var Slides = SlideNode.SelectNodes(".//img");
                        foreach (var Slide in Slides)
                        {
                        string ImgSrc = Slide.Attributes["src"].Value;
                        Console.WriteLine("Title in Slideshow {0} , Image src: {1} ", Title, ImgSrc);
                        }
                        

                    }

Also, in that block I'm only getting the first slide: Slides.Count is only ever 1 when it should be 5 (admittedly with duplicates).

Hi,

I do not see anything that really needs it (but maybe I am wrong). Can you use it? Yes, you can, but I am still not sure you really need it.

From what I see, divNodes is an element from a foreach iteration, and I doubt it would be null, since your XPath requests objects with a specific property. SlideNode won't be null either, since the if excludes all nulls.

If Slides is null, you do get an exception, which may not cause you any real issues since this block is near the end. Still, I would check that Slides is not null before continuing, because I do not like relying on the catch.

Another issue could be a Slide without a 'src' attribute when you try to get its value. That could cause a problem as well, and here you really could use Slide.Attributes["src"]?.Value.

In general, ?. is a nice way to avoid constantly checking whether an object is null before accessing one of its members. Other than that, you should probably check for nulls one way or another, just to be on the safe side.
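To illustrate the null checks described above, here is a minimal sketch. The class name NullSafeSlides, the method GetSlideSources and the sample markup in the usage note are made up for illustration; the HtmlAgilityPack calls are the same ones used in the question. It relies on SelectNodes returning null (not an empty collection) when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class NullSafeSlides
{
    // Hypothetical helper: collects the src of every img under the first
    // swiper-slide div, without relying on a catch block for missing nodes.
    public static List<string> GetSlideSources(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var slideNode = doc.DocumentNode.SelectSingleNode(
            ".//div[starts-with(@class,'swiper-slide')]");

        // ?. short-circuits to null if slideNode is null;
        // SelectNodes itself returns null when no img is found.
        var slides = slideNode?.SelectNodes(".//img");
        if (slides == null)
        {
            return result;
        }

        foreach (var slide in slides)
        {
            // Attributes["src"] is null when the attribute is absent.
            string imgSrc = slide.Attributes["src"]?.Value;
            if (imgSrc != null)
            {
                result.Add(imgSrc);
            }
        }
        return result;
    }
}
```

Calling GetSlideSources("<div class='swiper-slide'><img src='a.jpg'><img></div>") skips the img that has no src instead of throwing.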



I think you have it a bit wrong. You think you iterate only over the b-post classes, but you actually iterate over every div whose class contains b-post. That means b-post__text is also a valid result for your XPath. You may find how to work it out more properly here. Even then, there is still a div with class b-post nested inside the first div with class b-post, and that makes the whole thing run twice.

I am not sure whether I have helped or puzzled you more...
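One common way to match a class as a whole token in XPath 1.0 is the concat/normalize-space idiom. The sketch below is an assumption about what would suit this page, not necessarily what the linked article suggests; PostSelector and SelectOuterPosts are hypothetical names. It also skips any b-post div that sits inside another b-post div, so each post is visited once.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class PostSelector
{
    // True only when "b-post" is a complete class token,
    // so "b-post__text" no longer matches.
    const string IsPost =
        "contains(concat(' ', normalize-space(@class), ' '), ' b-post ')";

    // Hypothetical helper: returns the outermost b-post divs only.
    public static List<HtmlNode> SelectOuterPosts(HtmlDocument doc)
    {
        string xpath =
            "//div[" + IsPost + " and not(ancestor::div[" + IsPost + "])]";

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? new List<HtmlNode>()
            : new List<HtmlNode>(nodes);
    }
}
```

If the nested b-post divs are separate posts in their own right, drop the not(ancestor::...) part and keep only the token match.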

Giannis
I get what you're saying, and the last paragraph about b-post explains why it's traversing twice.

However, I still don't understand why
 var Slides = SlideNode.SelectNodes(".//img");

only has 1.

I thought SelectNodes meant select all nodes that are img below the current div (SlideNode).
Thanks for your help
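A possible explanation, assuming each swiper-slide div wraps a single img (the actual Test.html is not shown in the thread): SelectSingleNode returns only the first matching swiper-slide div, so SelectNodes(".//img") then searches inside that one slide. The sketch below contrasts that with a single XPath that covers every slide; SlideImages and its two methods are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class SlideImages
{
    // Mirrors the code in the question: first slide div only.
    public static int CountFromFirstSlide(HtmlNode post)
    {
        var first = post.SelectSingleNode(
            ".//div[starts-with(@class,'swiper-slide')]");
        var imgs = first?.SelectNodes(".//img");
        return imgs == null ? 0 : imgs.Count;
    }

    // Selects the img elements under every swiper-slide div at once.
    public static List<string> FromAllSlides(HtmlNode post)
    {
        var result = new List<string>();
        var imgs = post.SelectNodes(
            ".//div[starts-with(@class,'swiper-slide')]//img");
        if (imgs == null)
        {
            return result;
        }
        foreach (var img in imgs)
        {
            string src = img.Attributes["src"]?.Value;
            if (src != null)
            {
                result.Add(src);
            }
        }
        return result;
    }
}
```

With one img per slide, CountFromFirstSlide gives 1 while FromAllSlides returns one entry per slide, duplicates included.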