Issue parsing inner HTML tables from a web page into a DataTable

Hi All,

I need to parse inner HTML tables from a web page, and I am using the HTML Agility Pack for it. My HTML table and C# code are below.
Here is the issue: I have 3 inner HTML tables.

1st inner table - 3rd level
2nd inner table - 3rd level
3rd inner table - 2nd level

When looping through the main HTML table, the code should stop at the third level and populate the DataTable, since that is where the first inner table is. Instead, it goes one level deeper and populates the table from there.

The same thing happens with the 2nd table.

For the 3rd HTML table, it should come out of the 3rd loop and populate the DataTable.

Can someone tell me where exactly I am making a mistake in my code and what the error is?






<table>                  
 <tr>
       <td>
       <table>
            <tr>
                  <td>
                        <table>
                              <tr>
                              <td><b>Daily Backup Failed Client Report - Corporate</b></font></td>
                              </tr>
                        </table>
                  </td>
                  <td>
                        <table>
                              <tr>
                              <td>Last Day: 8/28/14 09:06 - 8/29/14 09:06</td>
                              </tr>
                              <tr>
                              <td>NetBackup Master and Media Servers</td>
                              </tr>
                        </table>
                  </td>
            </tr>
      </table>
      </td>
</tr>            
<tr>
      <td>
      <!-------------------- the actual table ------------------------>
      <table>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>
            <tr>
                        <td>1</td>
                        <td>2</td>
                        <td>3</td>
            </tr>

      </table>
      </td>
</tr>
      
<tr>
      <td><br><hr>Generated by Data Protection Advisor v6.1.0 (Build 85670)<br>Date: 8/29/14 09:08</td>
</tr>
</table>


                    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                    {
                        ///This is the table.
                        foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                        {
                            ///This is the row.
                           foreach (HtmlNode cell in row.SelectNodes("td"))
                             ///can also use "th|td", but right now we ONLY need td
                             {
                                 //This is the cell.
                                if (cell.InnerHtml.Contains("table"))
                                 {
                                     foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                     {
                                         foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                         {
                                             foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                             {
                                                 if (subcell.InnerHtml.Contains("table"))
                                                 {
                                                     foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                     {
                                                         foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                         {
                                                             foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                             {
                                                                 if (subsubcell.InnerHtml.Contains("table"))
                                                                 {
                                                                     foreach (HtmlNode subsubsubtable in subsubcell.SelectNodes("//table"))
                                                                     {
                                                                         foreach (HtmlNode subsubsubrow in subsubsubtable.SelectNodes("tr").Skip(1))
                                                                         {
                                                                             foreach (HtmlNode subsubsubcell in subsubsubrow.SelectNodes("td"))
                                                                             {                                                                              
                                                                                 dccc.Columns.Add(subsubcell.InnerText);
                                                                                 dataGridView1.DataSource = dccc;
                                                                             }
                                                                           
                                                                         }
                                                                     }
                                                                 }
                                                                 else
                                                                 {
                                                                     if (dm1.Rows.Count == 0)
                                                                     {
                                                                         dm1.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm1;
                                                                     }
                                                                     else
                                                                     {
                                                                         dm2.Columns.Add(subsubcell.InnerText);
                                                                         dataGridView1.DataSource = dm2;                                                                    
                                                                     }                                      
                                                                 }
                                                             }                                                          
                                                         }
                                                         dc.Columns.Add(subcell.InnerText);
                                                         dataGridView2.DataSource = dc;
                                                     }
                                                 }
                                                 else
                                                 {
                                                     dm1.Columns.Add(subcell.InnerText);
                                                     dataGridView1.DataSource = dm1;                                                    
                                                 }
                                             }
                                         }
                                     }
                                 }
                                 else
                                 {                                    
                                     dmm.Columns.Add(cell.InnerText);
                                 }
                             }
                         }
                     }
pothireddysunil Asked:

MlandaT Commented:
If all you are really interested in is that last table, which you have marked "the actual table" (I must admit I don't fully understand your description), then I would suggest you rely on XPath to do the work for you. After all, that is what it is meant for.

"The actual table" meets two criteria (I assume, of course, that the file structure here is fairly standard):
1 - it is a leaf table, i.e. it has no nested table
2 - of such tables, it is the last one

Based on that, you just need this XPath: (//table[not(.//table)])[last()]

Once you've found the table, processing its TR and TD elements becomes trivial.
' Requires: Imports System.IO, Imports System.Diagnostics, Imports HtmlAgilityPack
Dim htdoc As New HtmlDocument
htdoc.LoadHtml(File.ReadAllText("c:\rdp\tbl.html"))

' Take out the [last()] to see all leaf tables.
For Each n In htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])[last()]")

	' For a single table you could just say:
	' Dim n = htdoc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]")
	Debug.WriteLine(n.OuterHtml)

	' ".//tr" is relative to n, so each tr is one row of this leaf table.
	For Each tr In n.SelectNodes(".//tr")
		Debug.WriteLine(tr.InnerText)
	Next

Next
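
Since the question's code is C# and the goal is a DataTable, here is a rough C# sketch of the same idea. It is only an illustration under stated assumptions: the class and method names (LeafTableReader, ToDataTable, ReadLastLeafTable) and the placeholder column names (Column1, Column2, ...) are made up for this example, and the sample HTML is assumed to have no header row.

```csharp
using System;
using System.Data;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LeafTableReader
{
    // Converts one <table> node into a DataTable, one DataRow per <tr>.
    public static DataTable ToDataTable(HtmlNode table)
    {
        var result = new DataTable();

        // ".//tr" is relative to this table; SelectNodes returns null when nothing matches.
        HtmlNodeCollection rows = table.SelectNodes(".//tr");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("th|td");
            if (cells == null) continue;

            // Grow the column set if this row is wider than any row seen so far.
            while (result.Columns.Count < cells.Count)
                result.Columns.Add("Column" + (result.Columns.Count + 1));

            DataRow dataRow = result.NewRow();
            for (int i = 0; i < cells.Count; i++)
                dataRow[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            result.Rows.Add(dataRow);
        }
        return result;
    }

    // Loads the HTML and converts the last leaf table (a table with no nested table).
    public static DataTable ReadLastLeafTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode table = doc.DocumentNode.SelectSingleNode("(//table[not(.//table)])[last()]");
        return table == null ? new DataTable() : ToDataTable(table);
    }
}
```

Usage would be along the lines of dataGridView1.DataSource = LeafTableReader.ReadLastLeafTable(File.ReadAllText(path)); where dataGridView1 is the grid from the question and path is whatever file or downloaded page you are reading.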

pothireddysunil (Author) Commented:
I need to read all 3 tables and get the information from each. The problem is that the tables on the web page have no IDs assigned. If they did, it would be very easy for me to read them.

The logic I am using is to loop through the tags with the HTML Agility Pack and, whenever it sees a td, check whether that td contains a table tag; if not, write its content to the DataTable.

The issue is that it should write to the DataTable at the third level, but instead it goes to the fourth level and writes there.

Is there any issue with my foreach looping? I would like the experts to confirm.

  foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
                     {
                         ///This is the table.
                         foreach (HtmlNode row in table.SelectNodes("tr").Skip(1))
                         {
                             ///This is the row.
                            foreach (HtmlNode cell in row.SelectNodes("td"))
                              ///can also use "th|td", but right now we ONLY need td
                              {
                                  //This is the cell.
                                 if (cell.InnerHtml.Contains("table"))
                                  {
                                      foreach (HtmlNode subtable in cell.SelectNodes("//table"))
                                      {
                                          foreach (HtmlNode subrow in subtable.SelectNodes("tr").Skip(1))
                                          {
                                              foreach (HtmlNode subcell in subrow.SelectNodes("td"))
                                              {
                                                  if (subcell.InnerHtml.Contains("table"))
                                                  {
                                                      foreach (HtmlNode subsubtable in subcell.SelectNodes("//table"))
                                                      {
                                                          foreach (HtmlNode subsubrow in subsubtable.SelectNodes("tr").Skip(1))
                                                          {
                                                              foreach (HtmlNode subsubcell in subsubrow.SelectNodes("td"))
                                                              {
                                                                  if (subsubcell.InnerHtml.Contains("table"))
                                                                  {
MlandaT Commented:
If you take out the "[last()]" from the code I gave you above, it will extract the 3 nodes, one for each of the tables that you are interested in. You can then process them as you wish. Here is an example of my output; you can see the tables that it extracts at the bottom of the screenshot below.
[Screenshot: output in LINQPad (HtmlAgilityPack)]
IMHO, your loops are not very easy to follow. The XPath approach captures your logic in a clean and concise manner.
for each n in htdoc.DocumentNode.SelectNodes("(//table[not(.//table)])")
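
One likely reason the nested loops in the question go "a level too deep" is XPath scope: an expression that starts with "//" is evaluated from the document root even when SelectNodes is called on a cell, whereas ".//" searches only below the current node. The thread itself does not diagnose this, so treat the following C# sketch as an observation rather than a confirmed cause. The class and method names (XPathScopeDemo, CountFromFirstCell) are made up for this illustration.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class, not part of HtmlAgilityPack.
public static class XPathScopeDemo
{
    // Evaluates "//table" and ".//table" from the first <td> in the document
    // and returns how many tables each expression finds.
    public static (int Absolute, int Relative) CountFromFirstCell(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");
        if (cell == null) return (0, 0);

        // "//table": every table in the whole document, regardless of the context node.
        HtmlNodeCollection absolute = cell.SelectNodes("//table");

        // ".//table": only tables nested inside this cell.
        HtmlNodeCollection relative = cell.SelectNodes(".//table");

        // SelectNodes returns null when nothing matches, hence the null checks.
        return (absolute?.Count ?? 0, relative?.Count ?? 0);
    }
}
```

If this is what is happening, changing cell.SelectNodes("//table") to cell.SelectNodes(".//table") in the loops would keep each level scoped to its own cell, although the leaf-table XPath above is still the shorter route.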

pothireddysunil (Author) Commented:
Thanks, it works.
Question has a verified solution.