jedistar
asked on
Parse HTML file index.html
Hi,
For example: <a href="http://www/x.html">Visit Me</a>
Currently my code below only parses out the HTML links and does not read "Visit Me". It only grabs the http link.
How do I grab both the link and the "Visit Me" title? My code is as follows:
Public Function ParseLinks(ByVal HTML As String) As String
    ' Remember to add the following at top of class:
    ' - Imports System.Text.RegularExpressions
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim strResult As String
    ' Create regular expression
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "a.*href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)
    ' Match expression to HTML
    objMatch = objRegEx.Match(HTML)
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString()
        strResult &= strMatch & vbCrLf
        objMatch = objMatch.NextMatch()
    End While
    ' Pass back results
    Return strResult
End Function
ASKER CERTIFIED SOLUTION
This solution is only available to members of Experts Exchange.
ASKER
Question points raised to 400.
Hi jedistar;
When I tested the solution I posted above with this string:
"For example: <a href="http://www/x.html">Visit Me</a>"
it returned the following string:
"http://www/x.html : Visit Me"
Inside the function ParseLinks, the variable strMatch holds the link, http://www/x.html, and the variable strMatch2 holds the text, Visit Me. Your original code returned only http://www/x.html.
So how is the code I posted not giving you what you want?
Can you post an actual input line or a small HTML file where it fails, and describe how it fails?
Thanks;
Fernando
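The certified solution itself is not reproduced in this thread, so what follows is only a minimal sketch of one way to get the result described above, not the code that was actually posted. It assumes a pattern with two named groups, url and text (both names chosen here for illustration), and reuses the variable names strMatch and strMatch2 from the description. The module name LinkParserSketch is hypothetical.

```vb
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module LinkParserSketch

    ' Hypothetical variant of ParseLinks that returns one "url : text" line per link.
    Public Function ParseLinks(ByVal HTML As String) As String
        ' Matches <a ... href=VALUE ...>TEXT</a>, where VALUE may be
        ' double-quoted, single-quoted, or unquoted.
        Dim objRegEx As New Regex( _
            "<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))[^>]*>(?<text>.*?)</a>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        Dim sbResult As New StringBuilder()
        Dim objMatch As Match = objRegEx.Match(HTML)
        While objMatch.Success
            Dim strMatch As String = objMatch.Groups("url").Value    ' the link
            Dim strMatch2 As String = objMatch.Groups("text").Value  ' the link text
            sbResult.AppendLine(strMatch & " : " & strMatch2)
            objMatch = objMatch.NextMatch()
        End While
        Return sbResult.ToString()
    End Function

    Sub Main()
        ' Sample input taken from the question.
        Console.Write(ParseLinks("For example: <a href=""http://www/x.html"">Visit Me</a>"))
    End Sub

End Module
```

Anchoring the pattern on "<a" and ending it at "</a>" keeps each match inside a single tag, which the original "a.*href" pattern does not guarantee. A regular expression is still a rough tool for HTML; nested tags inside the link text will be returned as-is.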
ASKER
I fixed it now, thanks!
Not a problem, glad I was able to help. ;=)
ASKER
Hmm, that works, but there is a shortcoming: the entire path was removed, leaving only the filename.
Initially I had
/temp/blah/blah/file.doc
now it's just
file.doc
How do I keep the path in strMatch?
Thanks again, Fernando!
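Because the solution code is not shown in this thread, the cause of the lost path can only be guessed at. The following is a minimal sketch, assuming the goal is simply to keep the href value exactly as written: take the captured group's value unchanged into strMatch, and derive the filename separately only where it is needed. The module name KeepPathSketch, the function FirstHref, and the sample link are hypothetical; Path.GetFileName is the standard System.IO call.

```vb
Imports System
Imports System.IO
Imports System.Text.RegularExpressions

Module KeepPathSketch

    ' Hypothetical helper: returns the href value of the first link, untouched.
    Public Function FirstHref(ByVal HTML As String) As String
        Dim objMatch As Match = Regex.Match(HTML, _
            "<a\b[^>]*?\bhref\s*=\s*""(?<url>[^""]*)""", RegexOptions.IgnoreCase)
        If objMatch.Success Then Return objMatch.Groups("url").Value
        Return String.Empty
    End Function

    Sub Main()
        ' Sample link built from the path quoted in the question.
        Dim strMatch As String = FirstHref("<a href=""/temp/blah/blah/file.doc"">file</a>")
        Console.WriteLine(strMatch)                    ' href value, path included
        Console.WriteLine(Path.GetFileName(strMatch))  ' filename only, as a separate value
    End Sub

End Module
```

If the path is being dropped, it is worth checking whether the capture group starts too late in the href value or whether a filename-only call is being applied to strMatch after the match.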