Read text from HTML

Posted on 2012-08-16
Medium Priority
Last Modified: 2012-09-04
I have a .NET web application with which I am trying to extract some data from a webpage. I used StreamReader to obtain the HTML source of the page. Somewhere in the middle is a table that holds the data I need. The HTML as a whole looks too complicated to parse with Regex or similar.
Is it possible to extract only the text from the HTML, or is there a better way to parse an HTML webpage?
Question by:Angel02
LVL 35

Accepted Solution

Terry Woods earned 1200 total points
ID: 38302609
Apparently the HTML Agility Pack is good for this. It's given as the solution in at least 2 other similar EE questions.
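For illustration, here is a minimal sketch of how the HTML Agility Pack might be used to pull the cell text out of one table. It assumes the HtmlAgilityPack assembly is referenced and that the table carries an id attribute; the module name TableExtractor and the function name GetTableRows are hypothetical, not part of the library.

```vb
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module TableExtractor
    ' Hypothetical helper: returns the text of every cell in the table with
    ' the given id, one inner list per row. Assumes the id contains no quotes.
    Function GetTableRows(ByVal html As String, ByVal tableId As String) As List(Of List(Of String))
        Dim result As New List(Of List(Of String))
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectSingleNode returns Nothing when the XPath matches no node.
        Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@id='" & tableId & "']")
        If table Is Nothing Then Return result

        ' SelectNodes also returns Nothing (not an empty collection) on no match.
        Dim rows As HtmlNodeCollection = table.SelectNodes(".//tr")
        If rows Is Nothing Then Return result

        For Each row As HtmlNode In rows
            Dim cells As HtmlNodeCollection = row.SelectNodes("th|td")
            If cells Is Nothing Then Continue For
            Dim values As New List(Of String)
            For Each cell As HtmlNode In cells
                ' InnerText drops the markup; DeEntitize decodes entities such as &amp;.
                values.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
            Next
            result.Add(values)
        Next
        Return result
    End Function
End Module
```

Because the library builds a DOM instead of matching text, this kind of lookup should cope with nested tags and attribute ordering that tend to trip up a regular expression.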
LVL 28

Assisted Solution

Ark earned 800 total points
ID: 38309152
Retrieve the outerHTML (bInner=False) or innerHTML (bInner=True) of a specific tag, optionally matched by id.
You can call it like this:
MsgBox(GetTagContent(yourHtmlSource, "TABLE", yourTableID, True))
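        ' Requires: Imports System.Text.RegularExpressions
        Public Shared Function GetTagContent(ByVal strHTML As String, _
                                             ByVal tagName As String, _
                                    Optional ByVal id As String = "", _
                                    Optional ByVal bInner As Boolean = False) As String
            If String.IsNullOrEmpty(strHTML) Then Return ""
            ' Escape the inputs so they are matched literally inside the pattern.
            Dim tag As String = Regex.Escape(tagName)
            Dim pattern As String
            If id <> "" Then
                ' Match the opening tag whose id attribute equals the given id.
                pattern = String.Format("<{0}\b[^>]*\bid\s*=\s*[""']{1}[""'][^>]*>(.*?)</{0}>", tag, Regex.Escape(id))
            Else
                ' No id given: match the first occurrence of the tag.
                pattern = String.Format("<{0}\b[^>]*>(.*?)</{0}>", tag)
            End If
            ' Singleline lets "." span line breaks; the lazy (.*?) stops at the first
            ' closing tag, so nested tags of the same name are cut short.
            Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
            Dim m As Match = rgx.Match(strHTML)
            If Not m.Success Then Return ""
            ' Group 0 is the whole element (outerHTML); group 1 is its content (innerHTML).
            If Not bInner Then Return m.Groups(0).Value
            If m.Groups.Count < 2 Then Return ""
            Return m.Groups(1).Value
        End Function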
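
GetTagContent returns markup, not plain text. If only the text is wanted, the returned fragment could be reduced with something like the sketch below. The module name HtmlTextHelper and the function name StripTags are hypothetical, and the approach is deliberately crude: it assumes a small fragment without script, style, or comment blocks.

```vb
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlTextHelper
    ' Hypothetical helper: crude tag stripping for a small HTML fragment.
    Function StripTags(ByVal fragment As String) As String
        If String.IsNullOrEmpty(fragment) Then Return ""
        ' Replace each tag with a space so adjacent cell values do not run together.
        Dim noTags As String = Regex.Replace(fragment, "<[^>]+>", " ")
        ' Decode entities such as &amp; into their characters.
        Dim decoded As String = WebUtility.HtmlDecode(noTags)
        ' Collapse runs of whitespace and trim the ends.
        Return Regex.Replace(decoded, "\s+", " ").Trim()
    End Function
End Module
```

It could be combined with the function above, for example StripTags(GetTagContent(yourHtmlSource, "TABLE", yourTableID, True)).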


Question has a verified solution.
