Solved

Issue parsing inner HTML tables from a web page into a DataTable

Posted on 2014-09-03
Last Modified: 2014-09-05
Hi All,

I need to parse inner HTML tables from a web page into a DataTable, and I am using the HTML Agility Pack for it. My HTML and C# code are below.

Here is the issue. I have 3 inner HTML tables:

1st inner table - 3rd level
2nd inner table - 3rd level
3rd inner table - 2nd level

When looping through the main HTML table, the code should stop at the third level and populate the DataTable, since that is where the first inner table is. Instead, it goes down to the next level and populates the table there.

The same thing happens with the 2nd table.

For the 3rd table, it should come out of the 3rd loop and populate the DataTable.

Can someone tell me where exactly I am making a mistake in my code, and what the error is?






<table>                  
 <tr>
       <td>
       <table>
            <tr>
                  <td>
                        <table>
                              <tr>
                              <td><b>Daily Backup Failed Client Report - Corporate</b></font></td>
                              </tr>
                        </table>
                  </td>
                  <td>
                        <table>
                              <tr>
                              <td>Last Day: 8/28/14 09:06 - 8/29/14 09:06</td>
                              </tr>
                              <tr>
                              <td>NetBackup Master and Media Servers</td>
                              </tr>
                        </table>
                  </td>
            </tr>
      </table>
      </td>
</tr>            
<tr>
      <td>
      <!-------------------- the actual table ------------------------>
      <table>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>

      </table>
      </td>
</tr>
      
<tr>
      <td><br><hr>Generated by Data Protection Advisor v6.1.0 (Build 85670)<br>Date: 8/29/14 09:08</td>
</tr>
</table>


                    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                    {
                        ///This is the table.
                        foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                        {
                            ///This is the row.
                           foreach (HtmlNode cell in row.SelectNodes("td"))
                             ///can also use "th|td", but right now we ONLY need td
                             {
                                 //This is the cell.
                                if (cell.InnerHtml.Contains("table"))
                                 {
                                     foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                     {
                                         foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                         {
                                             foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                             {
                                                 if (subcell.InnerHtml.Contains("table"))
                                                 {
                                                     foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                     {
                                                         foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                         {
                                                             foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                             {
                                                                 if (subsubcell.InnerHtml.Contains("table"))
                                                                 {
                                                                     foreach (HtmlNode subsubsubtable in subsubcell.SelectNodes("//table"))
                                                                     {
                                                                         foreach (HtmlNode subsubsubrow in subsubsubtable.SelectNodes("tr").Skip(1))
                                                                         {
                                                                             foreach (HtmlNode subsubsubcell in subsubsubrow.SelectNodes("td"))
                                                                             {                                                                              
                                                                                 dccc.Columns.Add(subsubcell.InnerText);
                                                                                 dataGridView1.DataSource = dccc;
                                                                             }
                                                                           
                                                                         }
                                                                     }
                                                                 }
                                                                 else
                                                                 {
                                                                     if (dm1.Rows.Count == 0)
                                                                     {
                                                                         dm1.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm1;
                                                                     }
                                                                     else
                                                                     {
                                                                         dm2.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm2;                                                                    
                                                                     }                                      
                                                                 }
                                                             }                                                          
                                                         }
                                                         dc.Columns.Add(subcell.InnerText);
                                                         dataGridView2.DataSource = dc;
                                                     }
                                                 }
                                                 else
                                                 {
                                                     dm1.Columns.Add(subcell.InnerText);
                                                     dataGridView1.DataSource = dm1;                                                    
                                                 }
                                             }
                                         }
                                     }
                                 }
                                 else
                                 {                                    
                                     dmm.Columns.Add(cell.InnerText);
                                 }
                             }
                         }
                     }
Question by:pothireddysunil
4 Comments
 
Expert Comment

by:MlandaT
ID: 40303814
If all you are really interested in is that last table, which you have marked "the actual table" (I must admit I did not fully understand your description), then I would suggest relying on XPath to do the work for you. After all, that is what it is meant for.

"The actual table" meets two criteria (assuming, of course, that the file structure shown here is typical):
1 - it is a leaf table, i.e. it has no nested tables
2 - of such tables, it is the last one

Based on that, you just need to use this XPath: (//table[not(.//table)])[last()]

Processing the TR and TD elements once you've found the table becomes trivial.
' Needs the HtmlAgilityPack, System.IO and System.Diagnostics namespaces imported.
Dim htdoc As New HtmlDocument
htdoc.LoadHtml(File.ReadAllText("c:\rdp\tbl.html"))

' Take out the [last()] to see all leaf tables.
For Each n In htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])[last()]")

	' For a single table you could also just say:
	' Dim n = htdoc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]")
	Debug.WriteLine(n.OuterHtml)

	' Print the text of each row in the table.
	For Each tr In n.SelectNodes(".//tr")
		Debug.WriteLine(tr.InnerText)
	Next

Next
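
Since the code in the question is C#, here is a rough C# sketch of the same leaf-table idea that fills one DataTable per leaf table. The class name LeafTableReader, the method name Read, and the generated column names ("Column1", "Column2", ...) are made up for illustration; only the XPath comes from the answer above. It assumes every row of a leaf table should become a data row.

```csharp
using System.Collections.Generic;
using System.Data;
using HtmlAgilityPack;

public static class LeafTableReader
{
    // Hypothetical helper: returns one DataTable per leaf <table>,
    // i.e. per table that contains no nested table.
    public static List<DataTable> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<DataTable>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table[not(.//table)]");
        if (tables == null) return result;

        foreach (HtmlNode table in tables)
        {
            var dt = new DataTable();
            HtmlNodeCollection rows = table.SelectNodes(".//tr");
            if (rows != null)
            {
                foreach (HtmlNode row in rows)
                {
                    HtmlNodeCollection cells = row.SelectNodes("th|td");
                    if (cells == null) continue;

                    // Grow the column list to fit the widest row seen so far.
                    while (dt.Columns.Count < cells.Count)
                        dt.Columns.Add("Column" + (dt.Columns.Count + 1));

                    DataRow dr = dt.NewRow();
                    for (int i = 0; i < cells.Count; i++)
                        dr[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
                    dt.Rows.Add(dr);
                }
            }
            result.Add(dt);
        }
        return result;
    }
}
```

Each returned DataTable can then be bound to a grid in the usual way, for example dataGridView1.DataSource = tables[0].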

Author Comment

by:pothireddysunil
ID: 40304144
I need to read all 3 tables and get the information from each of them. The problem is that the tables on the web page don't have IDs assigned. If they did, it would be very easy for me to read them.

The logic I am using is to loop through the tags with the HTML Agility Pack and, whenever it sees a td, check whether that td contains a table tag. If it does not, the cell is written to the DataTable.

The issue is that it should write to the DataTable at the third level, but instead it goes on to the fourth level and writes there.

Is there any issue with my foreach looping? I would like the experts to confirm that.

  foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                     {
                         ///This is the table.
                         foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                         {
                             ///This is the row.
                            foreach (HtmlNode cell in row.SelectNodes("td"))
                              ///can also use "th|td", but right now we ONLY need td
                              {
                                  //This is the cell.
                                 if (cell.InnerHtml.Contains("table"))
                                  {
                                      foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                      {
                                          foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                          {
                                              foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                              {
                                                  if (subcell.InnerHtml.Contains("table"))
                                                  {
                                                      foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                      {
                                                          foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                          {
                                                              foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                              {
                                                                  if (subsubcell.InnerHtml.Contains("table"))
                                                                  {
Accepted Solution

by:
MlandaT earned 500 total points
ID: 40304708
If you take out the "[last()]" from the code I gave you above, it will extract the 3 nodes, one for each of the tables that you are interested in. You can then process them as you wish. Here is an example of my output; the extracted tables appeared at the bottom of the attached screenshot.

[Screenshot: snapshot of output in LinqPad]

IMHO, your loops are not very easy to follow. The XPath approach captures your logic in a clean and concise manner.
for each n in htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])")
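
On the question of what goes wrong in the nested loops: an XPath expression that starts with "//" is evaluated from the document root, even when SelectNodes is called on a child node. So cell.SelectNodes("//table") returns every table on the page rather than only the tables inside that cell, and the inner loops walk far more than one level. The relative form is ".//table". A small C# sketch of the difference follows; the class and method names are made up for illustration.

```csharp
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical helper: starting from the first <td> in the document,
    // counts the tables matched by an absolute and by a relative XPath.
    // Returns { absoluteCount, relativeCount }.
    public static int[] CountTablesFromFirstCell(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");
        if (cell == null) return new[] { 0, 0 };

        // "//table" ignores the context node and searches the whole document.
        HtmlNodeCollection absolute = cell.SelectNodes("//table");

        // ".//table" searches only the descendants of this cell.
        HtmlNodeCollection relative = cell.SelectNodes(".//table");

        // SelectNodes returns null when there is no match.
        return new[] { absolute?.Count ?? 0, relative?.Count ?? 0 };
    }
}
```

Two smaller things to watch in the original loops: Skip(1) drops the first row of every table, so a single-row table such as the report title yields nothing, and SelectNodes("tr") returns null when there are no matching rows, so calling Skip on it would throw.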

Author Closing Comment

by:pothireddysunil
ID: 40306761
Thanks, it works.