Solved

HtmlAgilityPack question!

Posted on 2014-02-17
Last Modified: 2014-03-09
Hi!

I'm parsing some web pages and found HtmlAgilityPack. The documentation is not the best I've seen, so I'll ask here!

I'm trying to get the data from HTML that looks like this:

...
<h2> Some header 1 </h2>
<p> some text...1.....  </p>
...
<h2> Some header 2 </h2>
<p> some text...2.....  </p>

The page has lots of these.
I need to get the text from every <p> that comes after an <h2> tag.

I'm trying these snippets, but I guess I'm missing something.
I was hoping NextSibling would fix it, but it doesn't.

For Each laiskuri In testdoc.DocumentNode.Descendants("h2")
    Dim test1 = laiskuri.InnerHtml
    Dim test3 = laiskuri.NextSibling.InnerHtml
    Dim test4 = laiskuri.NextSibling.OuterHtml
    Dim test5 = laiskuri.ParentNode
    Dim test6 = laiskuri.PreviousSibling
    Dim test7 = laiskuri.PreviousSibling.InnerHtml
    Dim test8 = laiskuri.HasChildNodes
    Dim test9 = laiskuri.FirstChild.InnerHtml
    Dim test10 = laiskuri.FirstChild.InnerText
    Dim test11 = laiskuri.FirstChild.OuterHtml
Next
Question by:jamppi
LVL 19

Expert Comment

by:Ken Butters
ID: 39865537
In your case, <p> is not a descendant of <h2>.

A descendant (or child) is nested inside its parent and would look like this:

<h2>
      <p> some text</p>
</h2>

You need following-sibling.

If you think of the page as a sort of outline, your <h2> and your <p> are at the same outline level, so they are considered "siblings". You want the <p> sibling that follows the <h2>,

which you would write like this: following-sibling::p

Here is an example:

http://stackoverflow.com/questions/14929921/htmlagilitypack-xml-capturing-following-sibling
where "p" is the sibling that follows h2.

Note: the example is XPath, but XPath can be used in SelectNodes, which is part of the HTML Agility Pack.

So, something like this in your example:

testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
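
To make the sibling idea concrete, here is a minimal sketch rather than a definitive implementation. It assumes the HtmlAgilityPack package is referenced and uses made-up sample markup shaped like the snippet in the question; the module and variable names are hypothetical. NextSibling returns the next node of any type, so with a line break between </h2> and <p> it is most likely a whitespace text node, whereas the following-sibling axis with a name test only matches <p> elements.

```vbnet
Imports System
Imports HtmlAgilityPack

Module SiblingSketch
    Sub Main()
        ' Sample markup shaped like the question's snippet (an assumption).
        Dim html As String =
            "<html><body>" & vbLf &
            "<h2> Some header 1 </h2>" & vbLf &
            "<p> some text...1.....  </p>" & vbLf &
            "<h2> Some header 2 </h2>" & vbLf &
            "<p> some text...2.....  </p>" & vbLf &
            "</body></html>"

        Dim testdoc As New HtmlDocument()
        testdoc.LoadHtml(html)

        For Each header As HtmlNode In testdoc.DocumentNode.Descendants("h2")
            ' NextSibling is whatever node comes next, which can be the
            ' whitespace text between </h2> and <p> rather than the <p>.
            Dim nextNode As HtmlNode = header.NextSibling
            If nextNode IsNot Nothing Then
                Console.WriteLine("NextSibling node type: " & nextNode.NodeType.ToString())
            End If

            ' following-sibling::p only considers <p> elements after this
            ' <h2>; [1] keeps the nearest one. The result can be Nothing.
            Dim para As HtmlNode = header.SelectSingleNode("following-sibling::p[1]")
            If para IsNot Nothing Then
                Console.WriteLine(header.InnerText.Trim() & " -> " & para.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Running the XPath relative to each <h2> keeps the header and its paragraph together, which is handy if you need both.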
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39866394
Can you provide an example of the exact HTML structure you are working with? I tried your code in a new project, and it seems to work fine for me.
 

Author Comment

by:jamppi
ID: 39867261
Ken:

I tried testdoc.DocumentNode.selectNodes("./h2/following-sibling::p") but I get
{"Object reference not set to an instance of an object."}

testdoc is not Nothing, so that is not the cause of the error.

Author Comment

by:jamppi
ID: 39867283
kaufmed:  

<div class="cl"></div>
 
<h2> Some header 1 </h2>
<p> some text...1.....  </p>

<a name="xxxx"></a>

<div class="cl"></div>

<h2> Some header 2 </h2>
<p> some text...2.....  </p>

<a name="xxxx"></a>
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39867455
A couple of things.

First, I'm assuming that this is just a snippet of your HTML document, right? So at some point everything is enclosed in a single tag, something like this:
<html>
  <div class="cl"></div>
  <h2> Some header 1 </h2>
  <p> some text...1.....  </p>
  <a name="xxxx"></a>
  <div class="cl"></div>
  <h2> Some header 2 </h2>
  <p> some text...2.....  </p>
  <a name="xxxx"></a>
</html>


Second, I'm assuming that you have defined "testdoc" and loaded your HTML page into it? (If you stop in the debugger, does testdoc contain the HTML of the page?)

Try making this slight change:

Instead of : testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
Try using : testdoc.DocumentNode.selectNodes("//h2/following-sibling::p")

(changing the leading dot-slash to slash-slash)
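
A short sketch of why the leading characters matter, under the assumption that the <h2> elements sit somewhere below <html>/<body> rather than directly under the document node. "./h2" only looks at direct children of the node you call it on, while "//h2" searches at any depth. By default SelectNodes returns Nothing when there is no match, so using the result without a check would produce an "Object reference not set" error. The sample markup and names below are made up for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathContextSketch
    Sub Main()
        Dim testdoc As New HtmlDocument()
        testdoc.LoadHtml("<html><body><h2> Some header 1 </h2><p> some text </p></body></html>")

        ' "./h2": only <h2> elements that are direct children of DocumentNode.
        ' Here the <h2> is nested under html/body, so no match is expected.
        Dim direct As HtmlNodeCollection =
            testdoc.DocumentNode.SelectNodes("./h2/following-sibling::p")

        ' "//h2": <h2> elements at any depth in the document.
        Dim anywhere As HtmlNodeCollection =
            testdoc.DocumentNode.SelectNodes("//h2/following-sibling::p")

        Console.WriteLine("./h2 found something: " & (direct IsNot Nothing).ToString())
        Console.WriteLine("//h2 found something: " & (anywhere IsNot Nothing).ToString())

        ' Always check for Nothing before looping over the result.
        If anywhere IsNot Nothing Then
            For Each para As HtmlNode In anywhere
                Console.WriteLine(para.InnerText.Trim())
            Next
        End If
    End Sub
End Module
```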
 

Author Comment

by:jamppi
ID: 39870017
That worked just fine! Can I get the <h2> InnerHtml at the same time?
 
LVL 19

Accepted Solution

by:
Ken Butters earned 300 total points
ID: 39870320
When you say "at the same time", do you mean you want them returned in the same array?

You can do that, but it seems to me that it would be cumbersome, because as you loop through the results you have to check whether you are working with an H2 item or a P item.

However, if that is what you are looking for, you'd use the "|" operator.

testdoc.DocumentNode.selectNodes("//h2|//h2/following-sibling::p")
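
A rough sketch of looping over that combined result, assuming HtmlAgilityPack is referenced and using made-up sample markup; the names are hypothetical. It checks each node's Name (which HtmlAgilityPack reports in lowercase) to tell headers from paragraphs. Keep in mind that following-sibling::p matches every later <p> sibling of an <h2>, not only the nearest one, so on a real page you may get more paragraphs than headers.

```vbnet
Imports System
Imports HtmlAgilityPack

Module UnionSketch
    Sub Main()
        Dim testdoc As New HtmlDocument()
        testdoc.LoadHtml(
            "<html><body>" &
            "<h2> Some header 1 </h2><p> some text...1.....  </p>" &
            "<h2> Some header 2 </h2><p> some text...2.....  </p>" &
            "</body></html>")

        ' One query returning both the <h2> elements and the <p> siblings after them.
        Dim nodes As HtmlNodeCollection =
            testdoc.DocumentNode.SelectNodes("//h2|//h2/following-sibling::p")

        ' SelectNodes returns Nothing by default when there is no match.
        If nodes Is Nothing Then
            Console.WriteLine("No matching nodes.")
            Return
        End If

        For Each node As HtmlNode In nodes
            ' Decide per node whether it is a header or a paragraph.
            If node.Name = "h2" Then
                Console.WriteLine("Header: " & node.InnerHtml.Trim())
            Else
                Console.WriteLine("  Text: " & node.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

If you would rather keep each header tied to its own paragraph, an alternative is to select only "//h2" and then query "following-sibling::p[1]" from each header node.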
Question has a verified solution.
