Solved

Parse inner HTML tables to a DataTable issue from a web page

Posted on 2014-09-03
Last Modified: 2014-09-05
Hi all,

I need to parse the inner HTML tables of a web page into a DataTable, and I am using the HTML Agility Pack to do it. My HTML and C# code are below.

The page has 3 inner HTML tables:

1st inner table - 3rd level
2nd inner table - 3rd level
3rd inner table - 2nd level

When looping through the main HTML table, the code should stop at the third level and populate the DataTable, because that is where the first inner table is. Instead, it goes down to the next level and populates the table there.

The same thing happens with the 2nd table.

For the 3rd table, it should come out of the third loop and populate the DataTable.

Can someone tell me where exactly I am making a mistake in my code?






<table>                  
 <tr>
       <td>
       <table>
            <tr>
                  <td>
                        <table>
                              <tr>
                              <td><b>Daily Backup Failed Client Report - Corporate</b></font></td>
                              </tr>
                        </table>
                  </td>
                  <td>
                        <table>
                              <tr>
                              <td>Last Day: 8/28/14 09:06 - 8/29/14 09:06</td>
                              </tr>
                              <tr>
                              <td>NetBackup Master and Media Servers</td>
                              </tr>
                        </table>
                  </td>
            </tr>
      </table>
      </td>
</tr>            
<tr>
      <td>
      <!-------------------- the actual table ------------------------>
      <table>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>

      </table>
      </td>
</tr>
      
<tr>
      <td><br><hr>Generated by Data Protection Advisor v6.1.0 (Build 85670)<br>Date: 8/29/14 09:08</td>
</tr>
</table>


                    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                    {
                        ///This is the table.
                        foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                        {
                            ///This is the row.
                           foreach (HtmlNode cell in row.SelectNodes("td"))
                             ///can also use "th|td", but right now we ONLY need td
                             {
                                 //This is the cell.
                                if (cell.InnerHtml.Contains("table"))
                                 {
                                     foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                     {
                                         foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                         {
                                             foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                             {
                                                 if (subcell.InnerHtml.Contains("table"))
                                                 {
                                                     foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                     {
                                                         foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                         {
                                                             foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                             {
                                                                 if (subsubcell.InnerHtml.Contains("table"))
                                                                 {
                                                                     foreach (HtmlNode subsubsubtable in subsubcell.SelectNodes("//table"))
                                                                     {
                                                                         foreach (HtmlNode subsubsubrow in subsubsubtable.SelectNodes("tr").Skip(1))
                                                                         {
                                                                             foreach (HtmlNode subsubsubcell in subsubsubrow.SelectNodes("td"))
                                                                             {                                                                              
                                                                                 dccc.Columns.Add(subsubcell.InnerText);
                                                                                 dataGridView1.DataSource = dccc;
                                                                             }
                                                                           
                                                                         }
                                                                     }
                                                                 }
                                                                 else
                                                                 {
                                                                     if (dm1.Rows.Count == 0)
                                                                     {
                                                                         dm1.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm1;
                                                                     }
                                                                     else
                                                                     {
                                                                         dm2.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm2;                                                                    
                                                                     }                                      
                                                                 }
                                                             }                                                          
                                                         }
                                                         dc.Columns.Add(subcell.InnerText);
                                                         dataGridView2.DataSource = dc;
                                                     }
                                                 }
                                                 else
                                                 {
                                                     dm1.Columns.Add(subcell.InnerText);
                                                     dataGridView1.DataSource = dm1;                                                    
                                                 }
                                             }
                                         }
                                     }
                                 }
                                 else
                                 {                                    
                                     dmm.Columns.Add(cell.InnerText);
                                 }
                             }
                         }
                     }
Question by:pothireddysunil
LVL 31

Expert Comment

by:MlandaT
ID: 40303814
If all you are really interested in is that last table, the one you have marked "the actual table" (I must admit I do not fully understand your description), then I would suggest relying on XPath to do the work for you. After all, that is what it is meant for.

"The actual table" meets two criteria (assuming the file structure here is fairly standard):
1 - it is a leaf table, i.e. it has no nested table
2 - of such tables, it is the last one

Based on that, you just need this XPath: (//table[not(.//table)])[last()]

Once you have found the table, processing the TR and TD elements becomes trivial.
' Requires: Imports System.IO and Imports HtmlAgilityPack
Dim htdoc As New HtmlDocument
htdoc.LoadHtml(File.ReadAllText("c:\rdp\tbl.html"))

' Select the last leaf table; take out the [last()] to see all leaf tables
For Each n In htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])[last()]")

	' For a single table you could just say:
	' Dim n = htdoc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]")
	Debug.WriteLine(n.OuterHtml)

	' Print the text of each row in the table
	For Each tr In n.SelectNodes(".//tr")
		Debug.WriteLine(tr.InnerText)
	Next

Next
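
Since the question is written in C#, here is a minimal sketch of the same leaf-table XPath in that language. The class and method names (LeafTables, Find) are hypothetical and only for illustration; the HtmlAgilityPack calls are the same ones used in the VB.NET snippet above.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LeafTables
{
    // Returns every <table> that contains no nested <table>, in document order.
    public static List<HtmlNode> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//table[not(.//table)]");

        return nodes == null ? new List<HtmlNode>() : nodes.ToList();
    }
}
```

Calling LeafTables.Find(html).LastOrDefault() would give the equivalent of the [last()] variant, with null returned when the page has no tables.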

 
LVL 1

Author Comment

by:pothireddysunil
ID: 40304144
I need to read all 3 tables and get the information from each. The problem is that the tables on the web page have no ids assigned. If they had, it would be very easy for me to read them.

The logic I am using is to loop through the tags with the HTML Agility Pack and, whenever it sees a td, check whether that td contains a table tag. If it does not, the cell is written to the DataTable.

The issue is that it should write to the DataTable at the third level, but instead it goes to the fourth level and writes there.

Is there any issue with my foreach looping? I would like the experts to confirm.

  foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                     {
                         ///This is the table.
                         foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                         {
                             ///This is the row.
                            foreach (HtmlNode cell in row.SelectNodes("td"))
                              ///can also use "th|td", but right now we ONLY need td
                              {
                                  //This is the cell.
                                 if (cell.InnerHtml.Contains("table"))
                                  {
                                      foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                      {
                                          foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                          {
                                              foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                              {
                                                  if (subcell.InnerHtml.Contains("table"))
                                                  {
                                                      foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                      {
                                                          foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                          {
                                                              foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                              {
                                                                  if (subsubcell.InnerHtml.Contains("table"))
                                                                  {
 
LVL 31

Accepted Solution

by:
MlandaT earned 2000 total points
ID: 40304708
If you take out the "[last()]" from the code I gave you above, then it will extract the 3 nodes, one for each of the tables that you are interested in. You can then process them as you wish. Here is an example of my output - you can see the tables that it extracts at the bottom of the screenshot below.
[Screenshot: snapshot of output in LinqPad with HtmlAgilityPack]
IMHO: your loops are not very easy to follow. The XPath approach captures your logic in a clean and concise manner.
for each n in htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])")
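
To connect this back to the DataTable goal in the question, the following is a rough C# sketch that builds one DataTable per leaf table. LeafTableReader, Read and the "ColumnN" column names are assumptions made for illustration, not something from the original code. One point worth noting: in XPath, an expression that starts with "//" is evaluated from the document root even when it is called on a child node, so the inner lookups here use relative paths such as ".//tr".

```csharp
using System.Collections.Generic;
using System.Data;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LeafTableReader
{
    // Builds one DataTable for each <table> that has no nested <table>.
    public static List<DataTable> Read(string html)
    {
        var result = new List<DataTable>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table[not(.//table)]");
        if (tables == null) return result;

        foreach (HtmlNode table in tables)
        {
            var dt = new DataTable();

            // ".//tr" is relative to this table; a leading "//" would
            // search the whole document again.
            var rows = table.SelectNodes(".//tr");
            if (rows != null)
            {
                foreach (HtmlNode row in rows)
                {
                    var cells = row.SelectNodes("th|td");
                    if (cells == null) continue;

                    // Grow the column set so the widest row fits.
                    while (dt.Columns.Count < cells.Count)
                        dt.Columns.Add("Column" + (dt.Columns.Count + 1));

                    DataRow dr = dt.NewRow();
                    for (int i = 0; i < cells.Count; i++)
                        dr[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
                    dt.Rows.Add(dr);
                }
            }
            result.Add(dt);
        }
        return result;
    }
}
```

Each returned DataTable can then be bound to a grid in the usual way, for example dataGridView1.DataSource = tables[2] for the third leaf table. The sketch treats every row as data; if the first row of a table is a header, that row would need separate handling.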

 
LVL 1

Author Closing Comment

by:pothireddysunil
ID: 40306761
Thanks, it works.