Read text from HTML

Posted on 2012-08-16
Last Modified: 2012-09-04
I have a .NET web application with which I am trying to extract some data from a webpage. I used StreamReader to fetch the page's HTML source. Somewhere in the middle is a table that contains the data I need. The HTML as a whole looks too complicated to parse with Regex and the like.
Is it possible to extract only the text from the HTML, or is there a better way to parse an HTML webpage?
Question by:Angel02
    LVL 34

    Accepted Solution

    Apparently the HTML Agility Pack is good for this. It's given as the solution in at least two other similar EE questions.
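
    For illustration, here is a minimal sketch of how reading a table with the HTML Agility Pack might look. It assumes the package is referenced in the project (it is a third-party library, not part of the .NET Framework) and that the table has an id attribute. The names TableTextExample, ReadTable and tableId are made up for this example.

```vb
Imports System.Collections.Generic
Imports HtmlAgilityPack   ' third-party library; add it as a project reference

Public Module TableTextExample
    ' Returns the text of every cell in the table with the given id,
    ' one list of strings per row. "tableId" is a placeholder for the
    ' real id, and is assumed not to contain an apostrophe.
    Public Function ReadTable(ByVal html As String, ByVal tableId As String) As List(Of List(Of String))
        Dim result As New List(Of List(Of String))
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim rows As HtmlNodeCollection = _
            doc.DocumentNode.SelectNodes("//table[@id='" & tableId & "']//tr")
        If rows Is Nothing Then Return result   ' SelectNodes gives Nothing when nothing matches

        For Each row As HtmlNode In rows
            Dim cells As HtmlNodeCollection = row.SelectNodes("th|td")
            If cells Is Nothing Then Continue For
            Dim values As New List(Of String)
            For Each cell As HtmlNode In cells
                ' InnerText drops the markup; DeEntitize turns &amp; etc. back into characters.
                values.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
            Next
            result.Add(values)
        Next
        Return result
    End Function
End Module
```

    You would pass it the string you already read with StreamReader, e.g. ReadTable(yourHtmlSource, yourTableID). Because the library builds a document tree first, this tends to cope with nested tags and messy markup better than pattern matching does.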
    LVL 27

    Assisted Solution

    This function retrieves the outerHTML (bInner=False) or the innerHTML (bInner=True) of a specific tag, optionally matched by id.
    You can call it like this:
    MsgBox(GetTagContent(yourHtmlSource, "TABLE", yourTableID, True))
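            ' Requires: Imports System.Text.RegularExpressions
            Public Shared Function GetTagContent(ByVal strHTML As String, _
                                                 ByVal tagName As String, _
                                        Optional ByVal id As String = "", _
                                        Optional ByVal bInner As Boolean = False) As String
                If String.IsNullOrEmpty(strHTML) Then Return ""
                Dim pattern As String
                ' Escape the inputs so regex metacharacters in them are treated literally.
                Dim tag As String = Regex.Escape(tagName)
                If id <> "" Then
                    ' Match the tag only when its id attribute equals the requested id.
                    pattern = String.Format("<{0}\b[^>]*\bid\s*=\s*[""']{1}[""'][^>]*>(.*?)</{0}>", tag, Regex.Escape(id))
                Else
                    ' No id given: match the first occurrence of the tag.
                    pattern = String.Format("<{0}\b[^>]*>(.*?)</{0}>", tag)
                End If
                ' Singleline lets "." span line breaks. The lazy (.*?) stops at the first
                ' closing tag, so nested tags of the same name are not handled.
                Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
                Dim m As Match = rgx.Match(strHTML)
                If Not m.Success Then Return ""
                ' Group 0 is the whole element (outer HTML); group 1 is its content (inner HTML).
                If Not bInner Then Return m.Groups(0).Value
                Return m.Groups(1).Value
            End Function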
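
    The string returned by a function like this is still HTML. As a rough sketch of getting plain text out of it, the helper below replaces tags with spaces and decodes entities. StripTags and HtmlTextSketch are made-up names, and the code assumes .NET 4.0 or later for System.Net.WebUtility (on older versions, System.Web.HttpUtility.HtmlDecode plays the same role).

```vb
Imports System.Net
Imports System.Text.RegularExpressions

Public Module HtmlTextSketch
    ' Reduces an HTML fragment to plain text: tags become spaces,
    ' entities are decoded, and runs of whitespace are collapsed.
    Public Function StripTags(ByVal fragment As String) As String
        If String.IsNullOrEmpty(fragment) Then Return ""
        Dim noTags As String = Regex.Replace(fragment, "<[^>]+>", " ")
        Dim decoded As String = WebUtility.HtmlDecode(noTags)
        Return Regex.Replace(decoded, "\s+", " ").Trim()
    End Function
End Module
```

    For example, StripTags("<td>Price</td><td>12 &amp; up</td>") returns "Price 12 & up". This loses the row and column boundaries, so it only suits cases where the flat text is enough.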
