Solved

HtmlAgilityPack question!

Posted on 2014-02-17
Last Modified: 2014-03-09
Hi!

I'm parsing some web pages and found HtmlAgilityPack. The documentation is not the best I've seen, so I'll ask here!

I'm trying to get the data from HTML that looks like this:

...
<h2> Some header 1 </h2>
<p> some text...1.....  </p>
...
<h2> Some header 2 </h2>
<p> some text...2.....  </p>

The page has lots of these.
I need to get the text from every <p> that comes after an <h2> tag.

I'm trying these snippets, but I guess I'm missing something.
I was hoping NextSibling would fix it, but it doesn't.

For Each laiskuri In testdoc.DocumentNode.Descendants("h2")
    Dim test1 = laiskuri.InnerHtml
    Dim test3 = laiskuri.NextSibling.InnerHtml
    Dim test4 = laiskuri.NextSibling.OuterHtml
    Dim test5 = laiskuri.ParentNode
    Dim test6 = laiskuri.PreviousSibling
    Dim test7 = laiskuri.PreviousSibling.InnerHtml
    Dim test8 = laiskuri.HasChildNodes
    Dim test9 = laiskuri.FirstChild.InnerHtml
    Dim test10 = laiskuri.FirstChild.InnerText
    Dim test11 = laiskuri.FirstChild.OuterHtml
Next
Question by:jamppi
7 Comments
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39865537
In your case, the <p> is not a descendant of the <h2>.

A descendant (or child) is nested inside its parent and would look like this:

<h2>
      <p> some text</p>
</h2>

You need following-sibling.

Think of it as a sort of outline: your <h2> and your <p> are at the same outline level, so they are considered "siblings". You want the <p> sibling that follows the <h2>.

You would define that like this: following-sibling::p

Here is an example:

http://stackoverflow.com/questions/14929921/htmlagilitypack-xml-capturing-following-sibling
where "p" is the sibling that follows h2.

Note: the example is XPath, but XPath can be used in SelectNodes, which is part of the HTML Agility Pack.

So something like this in your example:

testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
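
A likely reason NextSibling did not return the <p> is that the node directly after each <h2> is the whitespace text between the two tags, not the <p> element itself. The sketch below is only an illustration under that assumption: the sample markup and the module name SiblingDemo are made up, and the XPath is evaluated relative to each <h2> node rather than the document root.

```vbnet
Imports System
Imports HtmlAgilityPack

Module SiblingDemo
    Sub Main()
        ' Hypothetical sample markup mirroring the structure in the question.
        Dim html As String =
            "<div>" & vbLf &
            "<h2> Some header 1 </h2>" & vbLf &
            "<p> some text...1.....  </p>" & vbLf &
            "<h2> Some header 2 </h2>" & vbLf &
            "<p> some text...2.....  </p>" & vbLf &
            "</div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        For Each h2 As HtmlNode In doc.DocumentNode.Descendants("h2")
            ' NextSibling is whatever node comes next, including text nodes
            ' that hold only the line break between the tags.
            Dim nextNode As HtmlNode = h2.NextSibling
            If nextNode IsNot Nothing Then
                Console.WriteLine("NextSibling node type: " & nextNode.NodeType.ToString())
            End If

            ' Relative XPath from this h2: the first <p> sibling that follows it.
            Dim p As HtmlNode = h2.SelectSingleNode("following-sibling::p[1]")
            If p IsNot Nothing Then
                Console.WriteLine(h2.InnerText.Trim() & " -> " & p.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Looping over the <h2> nodes this way also keeps each header paired with its own paragraph.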
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39866394
Can you provide an example of the exact HTML structure you are working with? I tried your code in a new project, and it seems to work fine for me.
 

Author Comment

by:jamppi
ID: 39867261
Ken:

I tried testdoc.DocumentNode.selectNodes("./h2/following-sibling::p") but I get
{"Object reference not set to an instance of an object."}

testdoc is not Nothing, so that is not the reason for the error.
 

Author Comment

by:jamppi
ID: 39867283
kaufmed:  

<div class="cl"></div>
 
<h2> Some header 1 </h2>
<p> some text...1.....  </p>

<a name="xxxx"></a>

<div class="cl"></div>

<h2> Some header 2 </h2>
<p> some text...2.....  </p>

<a name="xxxx"></a>
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39867455
Couple of things...

First, I'm assuming that this is just a snippet of your HTML doc, right? So at some point you have everything enclosed in a single tag, something like this:
<html>
  <div class="cl"></div>
  <h2> Some header 1 </h2>
  <p> some text...1.....  </p>
  <a name="xxxx"></a>
  <div class="cl"></div>
  <h2> Some header 2 </h2>
  <p> some text...2.....  </p>
  <a name="xxxx"></a>
</html>



Second, I'm assuming that you have defined "testdoc" and done a "load" on it of your HTML page? (If you stop in the debugger, does testdoc contain the HTML of the page?)

Try making this slight change:

Instead of : testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
Try using : testdoc.DocumentNode.selectNodes("//h2/following-sibling::p")

(changing the leading dot-slash to slash-slash)
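
To illustrate the difference: "./h2" only matches <h2> elements that are direct children of the node you call it on, while "//h2" matches them at any depth. SelectNodes returns Nothing when the expression matches no nodes, which would explain the "Object reference not set" error once the result is used. The following is a minimal sketch with made-up sample markup and a hypothetical module name, not the exact page from the question.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathRootDemo
    Sub Main()
        ' Hypothetical sample markup: the h2 elements sit inside <html><body>,
        ' so they are not direct children of the document node.
        Dim html As String =
            "<html><body>" &
            "<h2> Some header 1 </h2><p> some text...1.....  </p>" &
            "<h2> Some header 2 </h2><p> some text...2.....  </p>" &
            "</body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' "./h2" looks only one level below DocumentNode.
        Dim direct As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("./h2/following-sibling::p")
        Console.WriteLine("./h2 found nodes: " & (direct IsNot Nothing).ToString())

        ' "//h2" searches the whole document.
        Dim anywhere As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//h2/following-sibling::p")

        ' Always guard against Nothing before iterating the result.
        If anywhere IsNot Nothing Then
            For Each p As HtmlNode In anywhere
                Console.WriteLine(p.InnerText.Trim())
            Next
        End If
    End Sub
End Module
```

Checking the result for Nothing before the loop avoids the exception even when a page happens to contain no matching elements.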
 

Author Comment

by:jamppi
ID: 39870017
That worked just fine! Can I get the <h2> InnerHtml at the same time?
 
LVL 19

Accepted Solution

by:
Ken Butters earned 1200 total points
ID: 39870320
When you say "at the same time" do you mean you want them returned in the same array?

You can do that, but it seems to me that would be cumbersome, because as you loop through the result you have to check whether you are working with an H2 item or a P item.

However, if that is what you are looking for, you'd use the "|" operator.

testdoc.DocumentNode.selectNodes("//h2|//h2/following-sibling::p")
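
As a rough sketch of the loop described above, the code below walks the combined result and branches on each node's Name (HtmlAgilityPack reports element names in lower case). The sample markup and the module name UnionDemo are made up for illustration. I would expect the nodes to come back in document order, but that is an assumption worth verifying on your own page.

```vbnet
Imports System
Imports HtmlAgilityPack

Module UnionDemo
    Sub Main()
        ' Hypothetical sample markup mirroring the structure in the question.
        Dim html As String =
            "<html><body>" &
            "<h2> Some header 1 </h2><p> some text...1.....  </p>" &
            "<h2> Some header 2 </h2><p> some text...2.....  </p>" &
            "</body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Union of every h2 and every p that follows an h2 at the same level.
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//h2|//h2/following-sibling::p")

        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                ' Decide per item whether this is a header or a paragraph.
                If node.Name = "h2" Then
                    Console.WriteLine("Header: " & node.InnerHtml.Trim())
                Else
                    Console.WriteLine("  Text: " & node.InnerText.Trim())
                End If
            Next
        End If
    End Sub
End Module
```

If the header and paragraph are needed as a pair, looping over "//h2" and calling SelectSingleNode("following-sibling::p[1]") on each header may be simpler than sorting out a mixed list.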