Extracting data from HTML using C#

Hi,
I have an Active Directory report in HTML that I need to extract data from. The HTML file contains many tables, but I'm only interested in tables that have the following header row:
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>


If a table contains the header above, I want to add its data to a master list. The end result should be a single master list containing the data from every table that has that header. Below is an example of the HTML I need to process:
<table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Enforce password history</td><td>4 passwords remembered</td><td>Default Domain Policy</td></tr>
<tr><td>Maximum password age</td><td>0 days</td><td>Default Domain Policy</td></tr>
<tr><td>Minimum password age</td><td>1 days</td><td>Default Domain Policy</td></tr>
<tr><td>Minimum password length</td><td>8 characters</td><td>Default Domain Policy</td></tr>
<tr><td>Password must meet complexity requirements</td><td>Enabled</td><td>Default Domain Policy</td></tr>
<tr><td>Store passwords using reversible encryption</td><td>Disabled</td><td>Default Domain Policy</td></tr>
</table>
</div></div><div class="he3"><span class="sectionTitle" tabindex="0">Account Policies/Account Lockout Policy</span><a class="expando" href="#"></a></div>
<div class="container"><div class="he4i"><table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Account lockout duration</td><td>30 minutes</td><td>Default Domain Policy</td></tr>
<tr><td>Account lockout threshold</td><td>6 invalid logon attempts</td><td>Default Domain Policy</td></tr>
<tr><td>Reset account lockout counter after</td><td>30 minutes</td><td>Default Domain Policy</td></tr>
</table>
</div></div><div class="he3"><span class="sectionTitle" tabindex="0">Local Policies/Audit Policy</span><a class="expando" href="#"></a></div>
<div class="container"><div class="he4i"><table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Audit process tracking</td><td>Success, Failure</td><td>Workstations Audit Policies</td></tr>
</table>
</div></div><div class="he3"><span class="sectionTitle" tabindex="0">Local Policies/Security Options</span><a class="expando" href="#"></a></div>
<div class="container"><div class="he4h"><span class="sectionTitle" tabindex="0">Interactive Logon</span><a class="expando" href="#"></a></div>
<div class="container"><div class="he4i"><table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Interactive logon: Do not display last user name</td><td>Enabled</td><td>Default Domain Policy</td></tr>
<tr><td>Interactive logon: Message text for users attempting to log on</td><td>This is a private enterprise computer system limited to business use.  Access to and use of this system requires explicit and current authorization.  All users expressly consent to monitoring by system personnel to detect improper access or use.  If such monitoring reveals possible criminal activity or improper access or use,system personnel may provide evidence of such conduct to law enforcement officials and/or company management.</td><td>Workstations</td></tr>
<tr><td>Interactive logon: Message title for users attempting to log on</td><td>Important Notice:</td><td>Workstations</td></tr>
<tr><td>Interactive logon: Number of previous logons to cache (in case domain controller is not available)</td><td>10 logons</td><td>Workstations</td></tr>
</table>
</div></div><div class="he4h"><span class="sectionTitle" tabindex="0">Network Security</span><a class="expando" href="#"></a></div>
<div class="container"><div class="he4i"><table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Network security: Force logoff when logon hours expire</td><td>Enabled</td><td>Default Domain Policy</td></tr>
</table>
</div></div><div class="he4h"><span class="sectionTitle" tabindex="0">Other</span><a class="expando" href="#"></a></div>
<div class="container"><div class="he4i"><table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Network security: Allow Local System to use computer identity for NTLM</td><td>Enabled</td><td>Workstations Tablet Windows 81 - GPO WF Deny</td></tr>
<tr><td>Network security: Allow LocalSystem NULL session fallback</td><td>Disabled</td><td>Workstations Tablet Windows 81 - GPO WF Deny</td></tr>
</table>


Asked by: matthew phung
 
apeter commented:
All you need is an XPath expression to select the tables. Whichever external library you pick, you will end up using XPath anyway, and you can do the same with the .NET libraries themselves. Hope this link helps: https://msdn.microsoft.com/en-us/library/bb341675(v=vs.110).aspx
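As a rough illustration of the XPath idea, here is a minimal sketch using System.Xml.Linq and System.Xml.XPath from the .NET base libraries. It assumes the report can be parsed as well-formed XML with no default namespace, which may not hold for a real HTML report. The class name PolicyTableXPathSketch, the method Extract, and the inline sample are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class PolicyTableXPathSketch
{
    // Hypothetical helper: returns one { policy, setting, winning GPO } array per data row.
    // Assumption: the markup is well-formed XML without a default namespace.
    public static List<string[]> Extract(string wellFormedMarkup)
    {
        XDocument doc = XDocument.Parse(wellFormedMarkup);

        // Tables whose first row consists of exactly the three wanted header cells.
        const string tablesXPath =
            "//table[tr[1][count(th)=3 and th[1]='Policy' " +
            "and th[2]='Setting' and th[3]='Winning GPO']]";

        var rows = new List<string[]>();
        foreach (XElement table in doc.XPathSelectElements(tablesXPath))
        {
            // Only direct child rows are visited, so rows of nested tables are not reached.
            foreach (XElement tr in table.Elements("tr"))
            {
                List<XElement> cells = tr.Elements("td").ToList();

                // Skips the header row (no <td>) and the single-cell row that wraps a nested table.
                if (cells.Count != 3) continue;

                rows.Add(cells.Select(c => c.Value.Trim()).ToArray());
            }
        }
        return rows;
    }

    public static void Main()
    {
        // Small made-up sample shaped like the report fragments in the question.
        string sample =
            "<div>" +
            "<table class='info3'>" +
            "<tr><th scope='col'>Policy</th><th scope='col'>Setting</th><th scope='col'>Winning GPO</th></tr>" +
            "<tr><td>Minimum password length</td><td>8 characters</td><td>Default Domain Policy</td></tr>" +
            "<tr><td colspan='3'><table class='subtable3'>" +
            "<tr><th scope='col'>Option</th><th scope='col'>Setting</th></tr>" +
            "<tr><td>Some option</td><td>Disabled</td></tr>" +
            "</table></td></tr>" +
            "</table>" +
            "</div>";

        foreach (string[] row in Extract(sample))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

If the report is not well-formed XML, XDocument.Parse will throw, and an HTML-tolerant parser is the safer choice.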
 
matthew phung (author) commented:
Some of the tables contain nested tables as well. I would like to ignore the nested tables. An example is below:
<table class="info3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Policy</th><th scope="col">Setting</th><th scope="col">Winning GPO</th></tr>
<tr><td>Automatic certificate management</td><td>Enabled</td><td>[Default setting]</td></tr>
<tr><td colspan="3"><table class="subtable3" cellpadding="0" cellspacing="0">
<tr><th scope="col">Option</th><th scope="col">Setting</th></tr>
<tr><td scope="row">Enroll new certificates, renew expired certificates, process pending certificate requests and remove revoked certificates</td><td>Disabled</td></tr>
<tr><td scope="row">Update and manage certificates that use certificate templates from Active Directory</td><td>Disabled</td></tr>
</table></td></tr></table>

 
Ioannis Paraskevopoulos commented:
Hi,

For parsing HTML there is a powerful library called HtmlAgilityPack, available on NuGet.

Check it out.

Giannis
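A minimal sketch of how HtmlAgilityPack might be used for this, assuming the HtmlAgilityPack NuGet package is referenced and that rows are direct children of each table, as in the samples above (no tbody wrapper). The PolicyRow and PolicyTableExtractor names are hypothetical, and the inline sample in Main is made up for illustration. Nested tables are ignored in two ways: their own header does not match, and the row that wraps them has a single cell rather than three.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: installed from NuGet

// Hypothetical container for one row of the master list.
public sealed class PolicyRow
{
    public string Policy { get; set; }
    public string Setting { get; set; }
    public string WinningGpo { get; set; }
}

public static class PolicyTableExtractor
{
    private static readonly string[] ExpectedHeaders = { "Policy", "Setting", "Winning GPO" };

    // Decodes entities such as &amp; and trims surrounding whitespace.
    private static string CellText(HtmlNode cell)
    {
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static List<PolicyRow> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var master = new List<PolicyRow>();
        foreach (HtmlNode table in doc.DocumentNode.Descendants("table"))
        {
            // Direct child rows only; rows of nested tables belong to the nested table node.
            List<HtmlNode> rows = table.Elements("tr").ToList();
            if (rows.Count == 0) continue;

            // Keep the table only if its first row is exactly the wanted header.
            string[] headers = rows[0].Elements("th").Select(CellText).ToArray();
            if (!headers.SequenceEqual(ExpectedHeaders)) continue;

            foreach (HtmlNode row in rows.Skip(1))
            {
                List<HtmlNode> cells = row.Elements("td").ToList();

                // A row that only wraps a nested table has one cell (colspan="3"), so skip it.
                if (cells.Count != 3) continue;

                master.Add(new PolicyRow
                {
                    Policy = CellText(cells[0]),
                    Setting = CellText(cells[1]),
                    WinningGpo = CellText(cells[2])
                });
            }
        }
        return master;
    }

    public static void Main()
    {
        string sample =
            "<table class=\"info3\">" +
            "<tr><th scope=\"col\">Policy</th><th scope=\"col\">Setting</th><th scope=\"col\">Winning GPO</th></tr>" +
            "<tr><td>Account lockout duration</td><td>30 minutes</td><td>Default Domain Policy</td></tr>" +
            "</table>";

        foreach (PolicyRow row in Extract(sample))
        {
            Console.WriteLine(row.Policy + " | " + row.Setting + " | " + row.WinningGpo);
        }
    }
}
```

For a report on disk, the file contents could be read with File.ReadAllText and passed to Extract; the header check is done in C# rather than in an XPath predicate to keep the comparison easy to adjust.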
Question has a verified solution.
