Solved

Parse inner HTML tables to a DataTable issue from a web page

Posted on 2014-09-03
Last Modified: 2014-09-05
Hi All,

I need to parse nested HTML tables from a web page into a DataTable, and I am using the HTML Agility Pack for it. My HTML table and C# code are below.

Here is the issue. I have 3 inner HTML tables:

1st inner table - 3rd level
2nd inner table - 3rd level
3rd inner table - 2nd level

When looping through the main HTML table, the code should stop at the third level and populate the DataTable, because that is where the first inner table is. Instead, it goes down to the next level and populates the table there.

The same thing happens with the 2nd table.

For the 3rd table, it should come out of the third loop and populate the DataTable.

Can someone tell me where exactly I am making a mistake in my code, and what the error is?






<table>                  
 <tr>
       <td>
       <table>
            <tr>
                  <td>
                        <table>
                              <tr>
                              <td><b>Daily Backup Failed Client Report - Corporate</b></font></td>
                              </tr>
                        </table>
                  </td>
                  <td>
                        <table>
                              <tr>
                              <td>Last Day: 8/28/14 09:06 - 8/29/14 09:06</td>
                              </tr>
                              <tr>
                              <td>NetBackup Master and Media Servers</td>
                              </tr>
                        </table>
                  </td>
            </tr>
      </table>
      </td>
</tr>            
<tr>
      <td>
      <!-------------------- the actual table ------------------------>
      <table>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>

      </table>
      </td>
</tr>
      
<tr>
      <td><br><hr>Generated by Data Protection Advisor v6.1.0 (Build 85670)<br>Date: 8/29/14 09:08</td>
</tr>
</table>


                    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                    {
                        ///This is the table.
                        foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                        {
                            ///This is the row.
                           foreach (HtmlNode cell in row.SelectNodes("td"))
                             ///can also use "th|td", but right now we ONLY need td
                             {
                                 //This is the cell.
                                if (cell.InnerHtml.Contains("table"))
                                 {
                                     foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                     {
                                         foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                         {
                                             foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                             {
                                                 if (subcell.InnerHtml.Contains("table"))
                                                 {
                                                     foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                     {
                                                         foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                         {
                                                             foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                             {
                                                                 if (subsubcell.InnerHtml.Contains("table"))
                                                                 {
                                                                     foreach (HtmlNode subsubsubtable in subsubcell.SelectNodes("//table"))
                                                                     {
                                                                         foreach (HtmlNode subsubsubrow in subsubsubtable.SelectNodes("tr").Skip(1))
                                                                         {
                                                                             foreach (HtmlNode subsubsubcell in subsubsubrow.SelectNodes("td"))
                                                                             {                                                                              
                                                                                 dccc.Columns.Add(subsubcell.InnerText);
                                                                                 dataGridView1.DataSource = dccc;
                                                                             }
                                                                           
                                                                         }
                                                                     }
                                                                 }
                                                                 else
                                                                 {
                                                                     if (dm1.Rows.Count == 0)
                                                                     {
                                                                         dm1.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm1;
                                                                     }
                                                                     else
                                                                     {
                                                                         dm2.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm2;                                                                    
                                                                     }                                      
                                                                 }
                                                             }                                                          
                                                         }
                                                         dc.Columns.Add(subcell.InnerText);
                                                         dataGridView2.DataSource = dc;
                                                     }
                                                 }
                                                 else
                                                 {
                                                     dm1.Columns.Add(subcell.InnerText);
                                                     dataGridView1.DataSource = dm1;                                                    
                                                 }
                                             }
                                         }
                                     }
                                 }
                                 else
                                 {                                    
                                     dmm.Columns.Add(cell.InnerText);
                                 }
                             }
                         }
                     }
Question by:pothireddysunil
4 Comments
 
LVL 30

Expert Comment

by:MlandaT
ID: 40303814
If all you are really interested in is that last table, the one you have marked "the actual table" (I must admit I do not fully understand your description), then I would suggest relying on XPath to do the work for you. After all, that is what it is meant for.

"The actual table" meets two criteria (assuming the file structure shown here is typical):
1 - it is a leaf table, i.e. it has no nested table
2 - of such tables, it is the last one

Based on that, you just need this XPath: (//table[not(.//table)])[last()]

Once you have found the table, processing the TR and TD elements becomes trivial.
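' Needs the HtmlAgilityPack, System.IO and System.Diagnostics namespaces imported.
Dim htdoc As New HtmlDocument
htdoc.LoadHtml(File.ReadAllText("c:\rdp\tbl.html"))

' A leaf table has no nested table; [last()] keeps only the final one.
' Take out the [last()] to see all leaf tables.
For Each n In htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])[last()]")

	' To get a single node without looping, use SelectSingleNode with the same XPath.
	Debug.WriteLine(n.OuterHtml)

	' Each match here is a row (tr); InnerText is the text of all its cells.
	For Each tr In n.SelectNodes(".//tr")
		Debug.WriteLine(tr.InnerText)
	Next

Next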
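
Since the question's code is C#, here is a rough C# rendering of the same idea. It is only a sketch: the inline HTML is a cut-down stand-in for the posted page (the thread reads c:\rdp\tbl.html instead), and the class name is made up for illustration. It selects the last leaf table with the XPath above and prints each row's cells.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LastLeafTable
{
    static void Main()
    {
        // Assumption: a small stand-in document, not the real report page.
        const string html = @"
<table>
  <tr><td><table><tr><td>Header</td></tr></table></td></tr>
  <tr><td><table>
    <tr><td>1</td><td>2</td><td>3</td></tr>
    <tr><td>4</td><td>5</td><td>6</td></tr>
  </table></td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Leaf tables contain no nested table; [last()] keeps only the final one.
        HtmlNode table = doc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]");
        if (table == null)
        {
            Console.WriteLine("No leaf table found.");
            return;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = table.SelectNodes("./tr");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null) continue;

            var texts = new List<string>();
            foreach (HtmlNode cell in cells) texts.Add(cell.InnerText.Trim());
            Console.WriteLine(string.Join(" | ", texts));
        }
    }
}
```

The null checks matter because the Agility Pack returns null from SelectNodes when an XPath matches nothing, so a foreach over the raw result can throw.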

 
LVL 1

Author Comment

by:pothireddysunil
ID: 40304144
I need to read all 3 tables and get the information from each. The problem is that the tables on the web page have no IDs assigned. If they did, it would be very easy for me to read them.

The logic I am using is to loop through the tags with the HTML Agility Pack and, whenever it sees a td, check whether that td contains a table tag. If not, write its contents to the DataTable.

The issue is that it should write to the DataTable at the third level, but instead it goes to the fourth level and writes there.

Is there an issue with my foreach loops? I would like the experts to confirm that.

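
The thread never pins down why the loops overshoot. One likely cause, offered here as an assumption rather than something the expert stated, is that an XPath beginning with `//` is evaluated from the document root even when it is called on a cell node, so `cell.SelectNodes("//table")` returns every table in the page instead of the tables inside that cell. A relative path such as `./table` (direct children) or `.//table` (any depth below the cell) stays inside the cell. The sketch below uses a made-up two-row document to show the difference.

```csharp
using System;
using HtmlAgilityPack;

class XPathScopeDemo
{
    static void Main()
    {
        // Assumption: a tiny illustrative document; the ids exist only for this demo.
        const string html = @"
<table id='outer'>
  <tr><td id='a'><table><tr><td>inner A</td></tr></table></td></tr>
  <tr><td id='b'>plain cell</td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td[@id='a']");

        // "//table" starts at the document root, so the outer table is included too.
        int fromRoot = cell.SelectNodes("//table").Count;        // expected: 2
        // "./table" only looks at direct children of this cell.
        int directChildren = cell.SelectNodes("./table").Count;  // expected: 1

        Console.WriteLine("//table -> " + fromRoot);
        Console.WriteLine("./table -> " + directChildren);

        // A node test is more reliable than InnerHtml.Contains("table"),
        // which also matches the word "table" in plain text or comments.
        HtmlNode plain = doc.DocumentNode.SelectSingleNode("//td[@id='b']");
        bool hasNested = plain.SelectSingleNode("./table") != null;
        Console.WriteLine("plain cell has nested table: " + hasNested);
    }
}
```

Two other things in the posted loops are worth checking. `.Skip(1)` drops the first row at every level, and in the sample HTML the first row of the main table is the one that holds the two header tables. Also, `SelectNodes` returns null when nothing matches, so each loop needs a null check before iterating.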
 
LVL 30

Accepted Solution

by:
MlandaT earned 500 total points
ID: 40304708
If you take out the "[last()]" from the code I gave you above, it will extract the 3 nodes, one for each of the tables you are interested in. You can then process them as you wish. Here is an example of my output; you can see the tables it extracts at the bottom of the screenshot below.
[Screenshot: snapshot of output in LINQPad using HtmlAgilityPack]
IMHO, your loops are not very easy to follow. The XPath approach captures your logic in a clean and concise manner.
for each n in htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])")
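
To connect this back to the original goal of filling DataTables, the following C# sketch turns each leaf table into its own DataTable. The class and method names (`LeafTables`, `Extract`) and the generated column names are hypothetical, and the sample HTML in `Main` is a shortened stand-in for the posted page; treat it as one possible shape, not the approach used in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using HtmlAgilityPack;

static class LeafTables
{
    // Hypothetical helper: one DataTable per leaf <table>, one DataRow per <tr>.
    public static List<DataTable> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<DataTable>();

        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table[not(.//table)]");
        if (tables == null) return result;   // no match yields null, not an empty list

        foreach (HtmlNode table in tables)
        {
            var dt = new DataTable();
            HtmlNodeCollection rows = table.SelectNodes("./tr");
            if (rows != null)
            {
                foreach (HtmlNode row in rows)
                {
                    HtmlNodeCollection cells = row.SelectNodes("./td|./th");
                    if (cells == null) continue;

                    // Grow the column set to fit the widest row seen so far.
                    while (dt.Columns.Count < cells.Count)
                        dt.Columns.Add("Column" + (dt.Columns.Count + 1));

                    DataRow dr = dt.NewRow();
                    for (int i = 0; i < cells.Count; i++)
                        dr[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
                    dt.Rows.Add(dr);
                }
            }
            result.Add(dt);
        }
        return result;
    }

    static void Main()
    {
        // Assumption: shortened sample with two header tables and one data table.
        const string html = @"
<table>
  <tr><td><table><tr>
    <td><table><tr><td>Report title</td></tr></table></td>
    <td><table><tr><td>Last Day</td></tr><tr><td>Servers</td></tr></table></td>
  </tr></table></td></tr>
  <tr><td><table>
    <tr><td>1</td><td>2</td><td>3</td></tr>
    <tr><td>1</td><td>2</td><td>3</td></tr>
  </table></td></tr>
</table>";

        foreach (DataTable dt in Extract(html))
            Console.WriteLine(dt.Rows.Count + " row(s), " + dt.Columns.Count + " column(s)");
    }
}
```

In a WinForms app, each returned DataTable could then be assigned to a grid, for example `dataGridView1.DataSource = tables[2];` for the data table, instead of adding cell text as column names inside the loops.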

 
LVL 1

Author Closing Comment

by:pothireddysunil
ID: 40306761
Thanks It works