Solved

Parse inner HTML tables to a DataTable issue from a web page

Posted on 2014-09-03
Last Modified: 2014-09-05
Hi All,

I need to parse inner HTML tables from a web page, and I am using HTML Agility Pack for it. My HTML table and C# code are below.
Here is the issue: I have 3 inner HTML tables.

1st inner table - 3rd level
2nd inner table - 3rd level
3rd inner table - 2nd level

When looping through the main HTML table, the code has to stop at the third level and populate the DataTable, since that is where the first inner table is. Instead, it goes to the next level and populates the table there.

The same thing happens with the 2nd table.

For the 3rd HTML table, it has to come out of the 3rd loop and populate the DataTable.

Can someone tell me where exactly I am making a mistake in my code and what the error is?

<table>                  
 <tr>
       <td>
       <table>
            <tr>
                  <td>
                        <table>
                              <tr>
                              <td><b>Daily Backup Failed Client Report - Corporate</b></font></td>
                              </tr>
                        </table>
                  </td>
                  <td>
                        <table>
                              <tr>
                              <td>Last Day: 8/28/14 09:06 - 8/29/14 09:06</td>
                              </tr>
                              <tr>
                              <td>NetBackup Master and Media Servers</td>
                              </tr>
                        </table>
                  </td>
            </tr>
      </table>
      </td>
</tr>            
<tr>
      <td>
      <!-------------------- the actual table ------------------------>
      <table>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>

      </table>
      </td>
</tr>
      
<tr>
      <td><br><hr>Generated by Data Protection Advisor v6.1.0 (Build 85670)<br>Date: 8/29/14 09:08</td>
</tr>
</table>


                    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                    {
                        ///This is the table.
                        foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                        {
                            ///This is the row.
                           foreach (HtmlNode cell in row.SelectNodes("td"))
                             ///can also use "th|td", but right now we ONLY need td
                             {
                                 //This is the cell.
                                if (cell.InnerHtml.Contains("table"))
                                 {
                                     foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                     {
                                         foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                         {
                                             foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                             {
                                                 if (subcell.InnerHtml.Contains("table"))
                                                 {
                                                     foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                     {
                                                         foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                         {
                                                             foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                             {
                                                                 if (subsubcell.InnerHtml.Contains("table"))
                                                                 {
                                                                     foreach (HtmlNode subsubsubtable in subsubcell.SelectNodes("//table"))
                                                                     {
                                                                         foreach (HtmlNode subsubsubrow in subsubsubtable.SelectNodes("tr").Skip(1))
                                                                         {
                                                                             foreach (HtmlNode subsubsubcell in subsubsubrow.SelectNodes("td"))
                                                                             {                                                                              
                                                                                 dccc.Columns.Add(subsubcell.InnerText);
                                                                                 dataGridView1.DataSource = dccc;
                                                                             }
                                                                           
                                                                         }
                                                                     }
                                                                 }
                                                                 else
                                                                 {
                                                                     if (dm1.Rows.Count == 0)
                                                                     {
                                                                         dm1.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm1;
                                                                     }
                                                                     else
                                                                     {
                                                                         dm2.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm2;                                                                    
                                                                     }                                      
                                                                 }
                                                             }                                                          
                                                         }
                                                         dc.Columns.Add(subcell.InnerText);
                                                         dataGridView2.DataSource = dc;
                                                     }
                                                 }
                                                 else
                                                 {
                                                     dm1.Columns.Add(subcell.InnerText);
                                                     dataGridView1.DataSource = dm1;                                                    
                                                 }
                                             }
                                         }
                                     }
                                 }
                                 else
                                 {                                    
                                     dmm.Columns.Add(cell.InnerText);
                                 }
                             }
                         }
                     }
Question by:pothireddysunil
4 Comments
 
LVL 30

Expert Comment

by:MlandaT
ID: 40303814
If all you are interested in is really just that last table, which you have marked "the actual table" (I must admit I do not fully understand your narration), then I would suggest you rely on XPath to do the work for you. After all, that is what it is meant for.

"The actual table" meets two criteria (assuming, of course, that the file structure here is pretty standard):
1 - it is a leaf table, i.e. it has no nested table
2 - of such tables, it is the last one

Based on that, you just need to use this XPath: (//table[not(.//table)])[last()]

Processing the TR and TD elements once you've found the table then becomes trivial.
' Needs: Imports System.IO, Imports System.Diagnostics, Imports HtmlAgilityPack
Dim htdoc As New HtmlDocument()
htdoc.LoadHtml(File.ReadAllText("c:\rdp\tbl.html"))

' Take out the [last()] to see all leaf tables.
' For a single node you could just say:
'   Dim n = htdoc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]")
For Each n As HtmlNode In htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])[last()]")

	Debug.WriteLine(n.OuterHtml)

	' Each row's InnerText is the concatenated text of its cells.
	For Each tr As HtmlNode In n.SelectNodes(".//tr")
		Debug.WriteLine(tr.InnerText)
	Next

Next
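
Since the question is written in C#, here is a rough C# sketch of the same idea that also copies the selected table into a DataTable. It is only an illustration under assumptions: `ToDataTable`, the generated column names ("Column1", "Column2", ...) and the inline sample HTML are made up for the example and are not part of the original post.

```csharp
using System;
using System.Data;
using HtmlAgilityPack;

class LeafTableDemo
{
    // Hypothetical helper: copies the rows of one <table> node into a DataTable.
    static DataTable ToDataTable(HtmlNode table)
    {
        var result = new DataTable();

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection rows = table.SelectNodes("./tr|./tbody/tr");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td|./th");
            if (cells == null) continue;

            // Grow the column set on demand so ragged rows still fit.
            while (result.Columns.Count < cells.Count)
                result.Columns.Add("Column" + (result.Columns.Count + 1));

            DataRow dataRow = result.NewRow();
            for (int i = 0; i < cells.Count; i++)
                dataRow[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            result.Rows.Add(dataRow);
        }
        return result;
    }

    static void Main()
    {
        // Assumed sample markup: an outer table wrapping two leaf tables.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr><td><table><tr><td>Header</td></tr></table></td></tr>" +
            "<tr><td><table><tr><td>1</td><td>2</td><td>3</td></tr>" +
            "<tr><td>4</td><td>5</td><td>6</td></tr></table></td></tr></table>");

        // Same XPath as above: the last table that contains no nested table.
        HtmlNode last = doc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]");
        if (last == null) return;

        DataTable data = ToDataTable(last);
        Console.WriteLine(data.Rows.Count + " rows, " + data.Columns.Count + " columns");
    }
}
```

The resulting DataTable can then be assigned to a grid's DataSource, as the question does with dataGridView1.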

 
LVL 1

Author Comment

by:pothireddysunil
ID: 40304144
I need to read all 3 tables and get the information from them. The thing is, the tables on the web page have no IDs assigned. If they had, it would be very easy for me to read them.

The logic I am using here is to loop through the tags using HTML Agility Pack and, whenever it sees a td, check whether that td contains a table tag. If not, it writes the cell to the DataTable.

The issue is that at the third level it should write to the DataTable, but instead it goes to the fourth level and writes there.

Is there an issue with my foreach looping? I would like the experts to confirm that.

  foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                     {
                         ///This is the table.
                         foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                         {
                             ///This is the row.
                            foreach (HtmlNode cell in row.SelectNodes("td"))
                              ///can also use "th|td", but right now we ONLY need td
                              {
                                  //This is the cell.
                                 if (cell.InnerHtml.Contains("table"))
                                  {
                                      foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                      {
                                          foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                          {
                                              foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                              {
                                                  if (subcell.InnerHtml.Contains("table"))
                                                  {
                                                      foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                      {
                                                          foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                          {
                                                              foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                              {
                                                                  if (subsubcell.InnerHtml.Contains("table"))
                                                                  {
 
LVL 30

Accepted Solution

by:
MlandaT earned 500 total points
ID: 40304708
If you take out the "[last()]" from the code I gave you above, then it will extract the 3 nodes, one for each of the tables that you are interested in. You can then process them as you wish. Here is an example of my output; you can see the tables that it extracts at the bottom of the screenshot.
[Screenshot: snapshot of output in LinqPad]
IMHO: your loops are not very easy to follow. The XPath approach captures your logic in a clean and concise manner.
for each n in htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])")
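
One likely reason the nested loops in the question go deeper than intended is the XPath scope: an expression that starts with "//" is evaluated from the document root, even when SelectNodes is called on a single cell, whereas ".//" stays inside the current node. Two related things to watch for: SelectNodes returns null when nothing matches, and Skip(1) drops the only row of a one-row table. The sketch below illustrates the scope difference in C#; the sample HTML and the `Count` helper are assumptions made up for the example.

```csharp
using System;
using HtmlAgilityPack;

class XPathScopeDemo
{
    // Hypothetical helper: SelectNodes returns null (not an empty list) on no match.
    static int Count(HtmlNodeCollection nodes)
    {
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Assumed sample markup: one outer table whose two cells each hold a table.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr><td><table><tr><td>inner A</td></tr></table></td></tr>" +
            "<tr><td><table><tr><td>inner B</td></tr></table></td></tr></table>");

        // First cell in document order: the outer cell wrapping "inner A".
        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");

        // "//table" ignores the cell and searches the whole document.
        Console.WriteLine("//table  from cell: " + Count(cell.SelectNodes("//table")));

        // ".//table" only looks at descendants of this cell.
        Console.WriteLine(".//table from cell: " + Count(cell.SelectNodes(".//table")));

        // All leaf tables, i.e. the XPath above without [last()].
        HtmlNodeCollection leaves = doc.DocumentNode.SelectNodes("//table[not(.//table)]");
        if (leaves != null)
        {
            foreach (HtmlNode leaf in leaves)
                Console.WriteLine("leaf table: " + leaf.InnerText.Trim());
        }
    }
}
```

On this sample, the first count should come out larger than the second, because the rooted query also sees the outer table and the sibling cell's table.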

 
LVL 1

Author Closing Comment

by:pothireddysunil
ID: 40306761
Thanks, it works.