Solved

HtmlAgilityPack question!

Posted on 2014-02-17
Last Modified: 2014-03-09
Hi!

I'm parsing some webpages and found HtmlAgilityPack. The documentation is not the best I've seen, so I'll ask here!

I'm trying to get the data from HTML that looks like this:

...
<h2> Some header 1 </h2>
<p> some text...1.....  </p>
...
<h2> Some header 2 </h2>
<p> some text...2.....  </p>

The page has lots of these.
I need to get the text from every <p> that comes after an <h2> tag.

I'm trying these snippets, but I guess I'm missing something.
I was hoping NextSibling would do it, but it doesn't.

For Each laiskuri In testdoc.DocumentNode.Descendants("h2")
    Dim test1 = laiskuri.InnerHtml
    Dim tets3 = laiskuri.NextSibling.InnerHtml
    Dim test4 = laiskuri.NextSibling.OuterHtml
    Dim test5 = laiskuri.ParentNode
    Dim test6 = laiskuri.PreviousSibling
    Dim test7 = laiskuri.PreviousSibling.InnerHtml
    Dim test8 = laiskuri.HasChildNodes
    Dim test9 = laiskuri.FirstChild.InnerHtml
    Dim test10 = laiskuri.FirstChild.InnerText
    Dim test11 = laiskuri.FirstChild.OuterHtml
Next
Question by:jamppi
7 Comments
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39865537
In your case, <p> is not a descendant of <h2>.

A descendant (or child) is nested inside its parent, and would look like this:

<h2>
      <p> some text</p>
</h2>

You need following-sibling.

If you think of the page as a sort of outline, your <h2> and your <p> are at the same outline level, so they are considered "siblings". You want the <p> sibling that follows the <h2>.

You would write that like this: following-sibling::p

here is an example:

http://stackoverflow.com/questions/14929921/htmlagilitypack-xml-capturing-following-sibling
where "p" is the sibling that follows "h2".

Note: the example is XPath, but XPath can be used in SelectNodes, which is part of the HTML Agility Pack.

So, something like this in your example:

testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
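
To illustrate the sibling relationship, here is a minimal VB.NET sketch, assuming the HtmlAgilityPack package is referenced. The module name SiblingSketch, the helper NextElement, and the inline sample HTML are made up for illustration. It also shows one likely reason NextSibling alone may not help: whitespace between </h2> and <p> is parsed as a text node, so the helper skips forward to the next element node.

```vbnet
Imports System
Imports HtmlAgilityPack

Module SiblingSketch
    ' Hypothetical helper: returns the first element sibling after the
    ' given node, skipping text and comment nodes, or Nothing if none.
    Function NextElement(node As HtmlNode) As HtmlNode
        Dim current As HtmlNode = node.NextSibling
        While current IsNot Nothing AndAlso current.NodeType <> HtmlNodeType.Element
            current = current.NextSibling
        End While
        Return current
    End Function

    Sub Main()
        ' Made-up sample: <h2> and <p> are siblings separated by a line break.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div><h2> Some header 1 </h2>" & vbLf & "<p> some text...1..... </p></div>")

        For Each heading As HtmlNode In doc.DocumentNode.Descendants("h2")
            Dim following As HtmlNode = NextElement(heading)
            If following IsNot Nothing AndAlso following.Name = "p" Then
                Console.WriteLine(following.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

The XPath following-sibling axis does the same skipping for you, which is why it is usually the shorter route.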
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39866394
Can you provide an example of the exact HTML structure you are working with? I tried your code in a new project, and it seems to work fine for me.
 

Author Comment

by:jamppi
ID: 39867261
Ken:

I tried testdoc.DocumentNode.selectNodes("./h2/following-sibling::p") but I get
{"Object reference not set to an instance of an object."}

testdoc is not Nothing, so that is not the cause of the error.

 

Author Comment

by:jamppi
ID: 39867283
kaufmed:  

<div class="cl"></div>
 
<h2> Some header 1 </h2>
<p> some text...1.....  </p>

<a name="xxxx"></a>

<div class="cl"></div>

<h2> Some header 2 </h2>
<p> some text...2.....  </p>

<a name="xxxx"></a>
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39867455
A couple of things...

First, I'm assuming that this is just a snippet of your HTML doc, right? So at some point you have everything enclosed in a single tag, something like this:
<html>
  <div class="cl"></div>
  <h2> Some header 1 </h2>
  <p> some text...1.....  </p>
  <a name="xxxx"></a>
  <div class="cl"></div>
  <h2> Some header 2 </h2>
  <p> some text...2.....  </p>
  <a name="xxxx"></a>
</html>



Second, I'm assuming that you have defined "testdoc" and done a "load" on it with your HTML page? (If you stop in the debugger, does testdoc contain the HTML of the page?)

Try making this slight change:

Instead of: testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
Try using: testdoc.DocumentNode.selectNodes("//h2/following-sibling::p")

(changing the leading dot-slash to slash-slash, so that <h2> is matched anywhere in the document rather than only directly under the document node)
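
A small sketch of the difference, assuming HtmlAgilityPack is referenced; the module name XPathScopeSketch and the inline page are made up for illustration. "./h2" only looks at direct children of the node it is called on, while "//h2" searches the whole document. SelectNodes returns Nothing when the XPath matches no nodes, which would explain the "Object reference not set" error as soon as the result is used, so a check for Nothing is worth having.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathScopeSketch
    Sub Main()
        ' Made-up sample page; the real page is loaded elsewhere in the thread.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<html><body><h2> Some header 1 </h2><p> some text...1..... </p></body></html>")

        ' "./h2" is relative to DocumentNode, whose only child here is <html>,
        ' so no <h2> is a direct child and nothing matches.
        Dim direct As HtmlNodeCollection = doc.DocumentNode.SelectNodes("./h2/following-sibling::p")
        Console.WriteLine(direct Is Nothing)

        ' "//h2" matches <h2> elements at any depth in the document.
        Dim anywhere As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h2/following-sibling::p")
        If anywhere IsNot Nothing Then
            For Each paragraph As HtmlNode In anywhere
                Console.WriteLine(paragraph.InnerText.Trim())
            Next
        End If
    End Sub
End Module
```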
 

Author Comment

by:jamppi
ID: 39870017
That worked just fine! Can I get the <h2> InnerHtml at the same time?
 
LVL 19

Accepted Solution

by:
Ken Butters earned 300 total points
ID: 39870320
When you say "at the same time", do you mean you want them returned in the same collection?

You can do that, but it seems to me that it would be cumbersome, because as you loop through the result you have to check whether you are working with an H2 item or a P item.

However, if that is what you are looking for, you'd use the "|" (union) operator.

testdoc.DocumentNode.selectNodes("//h2|//h2/following-sibling::p")
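
An alternative sketch that keeps each heading paired with its paragraph, so there is no need to check the node type while looping: select the <h2> nodes, then ask each one for its first following <p> sibling. This assumes HtmlAgilityPack is referenced; the module name HeadingPairsSketch and the inline page are made up for illustration. The [1] predicate matters because following-sibling::p on its own selects every later <p> sibling, not only the nearest one.

```vbnet
Imports System
Imports HtmlAgilityPack

Module HeadingPairsSketch
    Sub Main()
        ' Made-up sample page with two heading/paragraph pairs.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<html><body>" &
                     "<h2> Some header 1 </h2><p> some text...1..... </p>" &
                     "<h2> Some header 2 </h2><p> some text...2..... </p>" &
                     "</body></html>")

        ' SelectNodes returns Nothing when there are no matches.
        Dim headings As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h2")
        If headings Is Nothing Then Return

        For Each heading As HtmlNode In headings
            ' XPath evaluated relative to this <h2>: nearest following <p> sibling.
            Dim paragraph As HtmlNode = heading.SelectSingleNode("following-sibling::p[1]")
            If paragraph IsNot Nothing Then
                Console.WriteLine(heading.InnerText.Trim() & ": " & paragraph.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Use heading.InnerHtml instead of InnerText if the markup inside the <h2> is needed.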
