Solved

Parse HTML tags from text file.

Posted on 2014-03-27
Last Modified: 2014-03-28
Hi Experts,

If I wanted to retrieve the <TR> tags from a web page I would do it like this:

Dim tagCollection As HtmlElementCollection
tagCollection = WebBrowser1.Document.Body.Document.GetElementsByTagName("tr")

How do I retrieve the same tags when the HTML is the text of a file instead of a loaded web page? Something like this:

tagCollection = htmlFileText.GetElementsByTagName("tr")
Question by:DColin
LVL 23

Accepted Solution

by:
Jens Fiederer earned 500 total points
ID: 39959839
Arbitrary files are not structured as HTML. It is not hard to find all instances of "<TR>" with a simple text search, or "<TR" if you don't want to miss TR elements with attributes. But to work with the actual element structure, the text has to be parsed.
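A minimal sketch of the simple text search described above, in C#. The names `TrTagFinder` and `FindTrTagOffsets` are hypothetical, made up for illustration. It only reports the character offsets where an opening TR tag starts; it knows nothing about nesting, comments, or script blocks, which is the limitation that makes a real parser necessary.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrTagFinder
{
    // Matches "<tr" only when followed by whitespace, '>' or '/',
    // so "<TR>" and "<tr class=...>" match but "<track>" does not.
    private static readonly Regex OpeningTr =
        new Regex(@"<tr(?=[\s>/])", RegexOptions.IgnoreCase);

    // Returns the character offset of every opening TR tag in the text.
    public static List<int> FindTrTagOffsets(string htmlText)
    {
        var offsets = new List<int>();
        foreach (Match m in OpeningTr.Matches(htmlText))
        {
            offsets.Add(m.Index);
        }
        return offsets;
    }
}
```

It could be called as `TrTagFinder.FindTrTagOffsets(System.IO.File.ReadAllText("file.htm"))`, where "file.htm" is a placeholder path.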

If you are fortunate enough to be using XHTML, you can use .NET XML parsing functions.
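For the XHTML case, a rough sketch using LINQ to XML (`System.Xml.Linq`), which ships with .NET. The names `XhtmlRows` and `GetRows` are hypothetical. It assumes the text is well-formed XML; ordinary HTML with unclosed tags will make `XDocument.Parse` throw, and documents that rely on a DOCTYPE or named entities such as `&nbsp;` may need extra reader configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlRows
{
    // Returns every <tr> element found in well-formed XHTML text.
    public static List<XElement> GetRows(string xhtmlText)
    {
        // Throws System.Xml.XmlException if the text is not well-formed XML.
        XDocument doc = XDocument.Parse(xhtmlText);

        // Compare local names so the lookup works whether or not the
        // document declares the XHTML namespace. XML is case-sensitive,
        // so this finds "tr" but not "TR".
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "tr")
                  .ToList();
    }
}
```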

Otherwise you'll probably need a third-party library like HTML Agility Pack.

See http://htmlagilitypack.codeplex.com/

Author Comment

by:DColin
ID: 39959909
jensfiederer,

The text file will be an HTML file.
LVL 23

Expert Comment

by:Jens Fiederer
ID: 39959990
Yes, but from the point of view of .NET, it is just a sequence of characters (that just HAPPEN to satisfy HTML syntax).  

ASP.NET needs to GENERATE HTML, but it doesn't usually need to read it; that's the browser's job. That's why libraries like HTML Agility Pack exist.

Author Comment

by:DColin
ID: 39960024
jensfiederer,

I was thinking that the WebBrowser control uses the HtmlDocument class to hold the html text. So how do I load the html text into an HtmlDocument object without having to use a WebBrowser control?
LVL 23

Expert Comment

by:Jens Fiederer
ID: 39960057
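Like I said, you can use HTML Agility Pack. It's free, it's available at the URI I provided, and it supports code like:

// Requires the HtmlAgilityPack library. FixLink is a placeholder for your own method.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so check before looping.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        att.Value = FixLink(att);
    }
}
doc.Save("file.htm");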

Note: I'm not involved in any way with the HTML Agility project, except that I needed to parse HTML files at one point a year or two ago and that is what I ended up using.
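Applied to the original question (getting the TR elements out of HTML text without a WebBrowser control), a sketch with HTML Agility Pack might look like the following. The names `RowExtractor` and `GetRows` are hypothetical; `LoadHtml`, `DocumentNode`, and `Descendants` are Html Agility Pack members. Note that this `HtmlDocument` is `HtmlAgilityPack.HtmlDocument`, not the `System.Windows.Forms.HtmlDocument` the WebBrowser control exposes, so qualify the name if both namespaces are imported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party package, not part of .NET itself

public static class RowExtractor
{
    // Parses HTML held in a string and returns all of its <tr> nodes.
    public static List<HtmlNode> GetRows(string htmlFileText)
    {
        var doc = new HtmlDocument();
        // LoadHtml parses from a string; doc.Load(path) would read a file instead.
        doc.LoadHtml(htmlFileText);

        // Walk the whole tree and keep the elements named "tr".
        // An empty list comes back when there are none.
        return doc.DocumentNode.Descendants("tr").ToList();
    }
}
```

A caller could then loop over the result, for example `foreach (HtmlNode row in RowExtractor.GetRows(System.IO.File.ReadAllText("file.htm")))`, and read `row.InnerText` or `row.InnerHtml` for each row. The path "file.htm" is a placeholder.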