Read text from HTML

I have a .NET web application with which I am trying to pull some data out of a webpage. I used StreamReader to obtain the page's HTML source. Somewhere in the middle is a table that holds the data I need. The full HTML looks too complicated to read with Regex or similar.
Is it possible to extract only the text from the HTML, or are there better ways to parse an HTML webpage?

Web Developer, specialising in WordPress
Most Valuable Expert 2011
Apparently the HTML Agility Pack is good for this. It's given as the solution in at least two other similar EE questions.
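For what it's worth, here is a rough sketch of how reading a table with the HTML Agility Pack might look. It assumes the HtmlAgilityPack package is referenced in the project and that the table carries an id attribute. The module name TableReader and the function ReadTableRows are made up for illustration and are not part of the library.

```vb
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module TableReader
    ' Hypothetical helper: returns one list of cell strings per <tr>
    ' of the table whose id attribute equals tableId.
    ' Assumption: tableId contains no single quote (it is pasted into the XPath).
    Public Function ReadTableRows(ByVal html As String, ByVal tableId As String) As List(Of List(Of String))
        Dim rows As New List(Of List(Of String))()

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectSingleNode returns Nothing when no node matches.
        Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@id='" & tableId & "']")
        If table Is Nothing Then Return rows

        ' SelectNodes also returns Nothing (not an empty collection) on no match.
        ' Note: ".//tr" also picks up rows of any table nested inside this one.
        Dim trNodes As HtmlNodeCollection = table.SelectNodes(".//tr")
        If trNodes Is Nothing Then Return rows

        For Each tr As HtmlNode In trNodes
            Dim cellNodes As HtmlNodeCollection = tr.SelectNodes("th|td")
            If cellNodes Is Nothing Then Continue For

            Dim cells As New List(Of String)()
            For Each cell As HtmlNode In cellNodes
                ' InnerText drops the markup; DeEntitize turns &amp; etc. into characters.
                cells.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
            Next
            rows.Add(cells)
        Next

        Return rows
    End Function
End Module
```

You would pass it the string you already read with StreamReader, e.g. ReadTableRows(yourHtmlSource, yourTableID), and then loop over the rows.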
Retrieve the outerHTML (bInner=False) or innerHTML (bInner=True) of a specific tag and id.
You can call it like
MsgBox(GetTagContent(yourHtmlSource, "TABLE", yourTableID, True))
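        ' Needs "Imports System.Text.RegularExpressions" at the top of the file.
        Public Shared Function GetTagContent(ByVal strHTML As String, _
                                             ByVal tagName As String, _
                                    Optional ByVal id As String = "", _
                                    Optional ByVal bInner As Boolean = False) As String
            If String.IsNullOrEmpty(strHTML) Then Return ""
            ' Escape the caller's values so regex metacharacters in them are matched literally.
            Dim tag As String = Regex.Escape(tagName)
            Dim pattern As String
            If id <> "" Then
                ' Opening tag that carries the requested id, then a lazy capture up to the first closing tag.
                pattern = String.Format("<{0}\b[^>]*\sid\s*=\s*[""']{1}[""'][^>]*>(.*?)</{0}\s*>", tag, Regex.Escape(id))
            Else
                ' No id given: match the first occurrence of the tag.
                pattern = String.Format("<{0}\b[^>]*>(.*?)</{0}\s*>", tag)
            End If
            ' IgnoreCase covers <TABLE> vs <table>; Singleline lets "." span line breaks.
            ' Note: the lazy capture stops at the first closing tag, so nested elements of the same tag are cut short.
            Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
            Dim m As Match = rgx.Match(strHTML)
            If Not m.Success Then Return ""
            ' Group 0 is the whole element (outerHTML); group 1 is its content (innerHTML).
            If Not bInner Then Return m.Groups(0).Value
            If m.Groups.Count < 2 Then Return ""
            Return m.Groups(1).Value
        End Function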
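
GetTagContent still returns markup (the rows and cells of the table), while the question asks for text only. As a possible follow-up step, here is a minimal sketch that reduces such a fragment to plain text using only the .NET Framework. The module name HtmlText and the function StripTags are invented for this example. It is a regex-based approximation and will trip over things like ">" inside attribute values or comments.

```vb
Imports System.Net
Imports System.Text.RegularExpressions

Public Module HtmlText
    ' Hypothetical helper: turns an HTML fragment into a single line of plain text.
    Public Function StripTags(ByVal fragment As String) As String
        If String.IsNullOrEmpty(fragment) Then Return ""

        ' Drop <script> and <style> blocks together with their content.
        Dim text As String = Regex.Replace(fragment, "<(script|style)\b[^>]*>.*?</\1\s*>", " ", _
                                           RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Replace every remaining tag with a space so neighbouring cells stay separated.
        text = Regex.Replace(text, "<[^>]+>", " ")

        ' Decode entities such as &amp; and &nbsp; (WebUtility is available from .NET 4.0).
        text = WebUtility.HtmlDecode(text)

        ' Collapse runs of whitespace, including line breaks, into single spaces.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function
End Module
```

For example, StripTags(GetTagContent(yourHtmlSource, "TABLE", yourTableID, True)) would give the table's text with the cell values separated by spaces. If you need the values per row and column, a real parser is the safer route.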
