Solved

HtmlAgilityPack question!

Posted on 2014-02-17
7
433 Views
Last Modified: 2014-03-09
Hi!

I'm parsing some webpages and found HtmlAgilityPack. the Documentation is not the best i've seen so i'll ask here!

i'm trying to get the data from  html looking like this

...
<h2> Some header 1 </h2>
<p> some text...1.....  </p>
...
<h2> Some header 2 </h2>
<p> some text...2.....  </p>

the page have lots of these.
i need to get the text from every <p> that comes after the <h2> tags

trying these snippets but i guess i'm missing something
i was hoping the Nextsibling would have fixed it but it doesnt.

For Each laiskuri In testdoc.DocumentNode.Descendants("h2")
            Dim test1 = laiskuri.InnerHtml
            Dim tets3 = laiskuri.NextSibling.InnerHtml
            Dim test4 = laiskuri.NextSibling.OuterHtml
            Dim test5 = laiskuri.ParentNode
            Dim test6 = laiskuri.PreviousSibling
            Dim test7 = laiskuri.PreviousSibling.InnerHtml
            Dim test8 = laiskuri.HasChildNodes
            Dim test9 = laiskuri.FirstChild.InnerHtml
            Dim test10 = laiskuri.FirstChild.InnerText
            Dim test11 = laiskuri.FirstChild.OuterHtml
        Next
0
Comment
Question by:jamppi
  • 3
  • 3
7 Comments
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39865537
In your case P would not be a descendant of h2.

a Descendant (or child) is embedded ... and would look like this:

<h2>
      <p> some text</p>
</h2>

you need following-sibling...

if you think of it as an sort of outline... your <h2> and your <p> are at the same outline level... so they would be considered "siblings"... you want the sibling of <p> that follows <h2>

Which you would define like this: following-sibling::p

here is an example:

http://stackoverflow.com/questions/14929921/htmlagilitypack-xml-capturing-following-sibling
where "p" is the "sibling that follows h2.

Note: the example is xpath... but xpath can be used in selectNodes which is part of the HTML Agility pack.

so something like this in your example:

testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39866394
Can you provide an example of the exact HTML structure you are working with? I tried your code in a new project, and it seems to work fine for me.
0
 

Author Comment

by:jamppi
ID: 39867261
Ken:

tried  testdoc.DocumentNode.selectNodes("./h2/following-sibling::p") but i get
{"Object reference not set to an instance of an object."}

testdoc is not 'nothing'. so that is not the reason to the error.
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 

Author Comment

by:jamppi
ID: 39867283
kaufmed:  

<div class="cl"></div>
 
<h2> Some header 1 </h2>
<p> some text...1.....  </p>

<a name="xxxx"></a>

<div class="cl"></div>

<h2> Some header 2 </h2>
<p> some text...2.....  </p>

<a name="xxxx"></a>
0
 
LVL 19

Expert Comment

by:Ken Butters
ID: 39867455
Couple of things...

First I'm assuming that this is just a snippet of your html doc right?   so at some point you have everything enclosed in a single tag... something like this:
<html>
  <div class="cl"></div>
  <h2> Some header 1 </h2>
  <p> some text...1.....  </p>
  <a name="xxxx"></a>
  <div class="cl"></div>
  <h2> Some header 2 </h2>
  <p> some text...2.....  </p>
  <a name="xxxx"></a>
</html>

Open in new window


second...I'm assuming that you have defined "testdoc" and done a "load" on it of your html page?  (if you stop in trace does testDoc contain the html of the page?)

Try making this slight change:

Instead of : testdoc.DocumentNode.selectNodes("./h2/following-sibling::p")
Try using : testdoc.DocumentNode.selectNodes("//h2/following-sibling::p")

(changing the leading dot-slash to slash-slash)
0
 

Author Comment

by:jamppi
ID: 39870017
That workt just fine!  Can i get the <h2> innerhtml at the same time ?
0
 
LVL 19

Accepted Solution

by:
Ken Butters earned 300 total points
ID: 39870320
When you say "at the same time" do you mean you want them returned in the same array?

You can do that.... but seems to be like that would be cumbersome, because as you loop through the result, you have to check to see if you are working with an H2 item or a P item.

However... if that is what you are looking for...  you'd use the "|" operator.

testdoc.DocumentNode.selectNodes("//h2|//h2/following-sibling::p")
0

Featured Post

What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

Join & Write a Comment

This document covers how to connect to SQL Server and browse its contents.  It is meant for those new to Visual Studio and/or working with Microsoft SQL Server.  It is not a guide to building SQL Server database connections in your code.  This is mo…
Calculating holidays and working days is a function that is often needed yet it is not one found within the Framework. This article presents one approach to building a working-day calculator for use in .NET.
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now