jedistar (Singapore) asked:

Parse HTML file index.html

Hi,

For example: <a href="http://www/x.html">Visit Me</a>

Currently my code below only parses the HTML links and does not read "Visit Me"; it grabs just the http link.
How do I grab both the link and the "Visit Me" title? My code is as follows:


  Public Function ParseLinks(ByVal HTML As String) As String
            ' Remember to add the following at top of class:
            ' - Imports System.Text.RegularExpressions
            Dim objRegEx As System.Text.RegularExpressions.Regex
            Dim objMatch As System.Text.RegularExpressions.Match
            Dim strResult As String
            ' Create regular expression
            objRegEx = New System.Text.RegularExpressions.Regex( _
                  "a.*href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
                  System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
                  System.Text.RegularExpressions.RegexOptions.Compiled)
            ' Match expression to HTML
            objMatch = objRegEx.Match(HTML)

            While objMatch.Success
                  Dim strMatch As String
                  strMatch = objMatch.Groups(1).ToString
                  strResult &= strMatch & vbCrLf
                  objMatch = objMatch.NextMatch()
            End While
            ' Pass back results
            Return strResult
  End Function
ASKER CERTIFIED SOLUTION
Fernando Soto

This solution is only available to members.

jedistar (ASKER)

Thanks Fernando for replying ;)

Hmm, that works, but there is a shortcoming: the entire path was removed, leaving only the filename.

Initially I had

/temp/blah/blah/file.doc

now it's just

file.doc

How do I keep the path in strMatch?

Thanks again, Fernando!!
Question raised to 400 points.
Hi jedistar;

The solution I posted above, when tested with this string

    "For example: <a href="http://www/x.html">Visit Me</a>"

returns the following string:

    "http://www/x.html : Visit Me"

Now inside the function ParseLinks the variable strMatch has the link, http://www/x.html, and the variable strMatch2 has the value, Visit Me. Your original code returned http://www/x.html.

So how is the code I posted not giving you what you want?
Can you post an actual input line / small html file where it fails, and how it fails?

Thanks;

Fernando
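
Since the certified solution is not shown in this thread, here is a minimal sketch of one way to get output of the shape described above. It is not Fernando's code. The function name ParseLinksWithText, the module name, and the group names url and text are made up for illustration; only strMatch and strMatch2 come from the discussion. It assumes simple, non-nested anchor tags, and a regular expression like this is not a full HTML parser.

```vbnet
Imports System.Text
Imports System.Text.RegularExpressions

Module LinkParserSketch
    ' Hypothetical helper (not the certified solution): returns one
    ' "link : text" line per <a href=...>...</a> found in the input.
    Public Function ParseLinksWithText(ByVal html As String) As String
        ' url  = the href value (double-quoted, single-quoted, or bare)
        ' text = everything between the opening tag and </a>
        Dim pattern As String = _
            "<a\b[^>]*?\bhref\s*=\s*" & _
            "(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))" & _
            "[^>]*>(?<text>.*?)</a>"
        Dim linkRegex As New Regex(pattern, _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        Dim result As New StringBuilder()
        For Each m As Match In linkRegex.Matches(html)
            ' The href is taken verbatim, so any path in it is kept.
            Dim strMatch As String = m.Groups("url").Value
            Dim strMatch2 As String = m.Groups("text").Value
            result.Append(strMatch).Append(" : ").Append(strMatch2).Append(vbCrLf)
        Next
        Return result.ToString()
    End Function

    Sub Main()
        ' Prints: http://www/x.html : Visit Me
        Console.Write(ParseLinksWithText( _
            "For example: <a href=""http://www/x.html"">Visit Me</a>"))
    End Sub
End Module
```

Because the url group captures the whole attribute value, an href such as /temp/blah/blah/file.doc would come back with its path intact in this sketch. If the link text itself contains markup (for example an img tag), strMatch2 will contain that markup unchanged.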


I fixed it now, thanks!
Not a problem, glad I was able to help. ;=)