Solved

RegEx issue: need to parse all variations of a URL from an HTML body

Posted on 2006-04-29
Last Modified: 2008-03-06
I am trying to pull every HREF="link.aspx" off a page. All I want is the link inside the double or single quotes. I have tried many different regexes that come close but fail in most scenarios. Here is a simple one I am using now. It finds all the links on a page, but my limited experience with RegEx is keeping me from extracting just the value inside the " or ' quotes.

(?<thishref>)\bhref[-A-Z0-9+&@#/%?=~_|!:,.;""""']*[-A-Z0-9+&@#/%=~_|""""']
Question by:ComputerCu
8 Comments
 
LVL 64

Expert Comment

by:Fernando Soto
ID: 16570302
Hi ComputerCu;

The following Regex statement will remove all href="somelink" and replace it with "somelink".

Imports System.Text.RegularExpressions

        ' Capture the quoted link and keep only that capture ($1), dropping the href= prefix.
        Dim output As String = Regex.Replace(input, "href\s*=\s*(""[^""]+"")", _
            "$1", RegexOptions.IgnoreCase)

Fernando
 

Author Comment

by:ComputerCu
ID: 16570389
Fernando - I am trying to do the complete regex in one shot so that I don't have to do any sort of replace. What I am missing is this: I want to search for and find the href, but not capture it, then find the next " or ' and capture everything from the character immediately after it up to the closing quote. That is the only value I want to return. For example, from <a href="myLink.aspx">Click here</a> I want to pull back only myLink.aspx. Make sense?
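
"Find the href but do not capture it" is what a lookbehind assertion does in .NET regular expressions. The following is only a minimal sketch under that reading, not the approach anyone in this thread posted. The module name and the sample HTML are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookbehindSketch
    Sub Main()
        ' Hypothetical sample input, not taken from the thread.
        Dim html As String = "<a href=""myLink.aspx"">Click here</a> <a HREF='other.aspx'>Other</a>"

        ' (?<=...) asserts that href=" or href=' sits immediately before the match
        ' without making it part of the match, so Match.Value is only the link text.
        Dim pattern As String = "(?<=\bhref\s*=\s*[""'])[^""']+"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

With the sample input above this should print myLink.aspx and other.aspx on separate lines. Like the other patterns in this thread, it stops at the first quote of either kind, so a double-quoted link containing an apostrophe would be cut short.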
 
LVL 64

Accepted Solution

by:
Fernando Soto earned 500 total points
ID: 16570605
Hi ComputerCu;

Sorry I misunderstood the question. This code will pull all the links and write them out to a file.

Imports System.Text.RegularExpressions
Imports System.IO

        Dim sr As New StreamReader("C:\Temp\href.txt")
        Dim input As String = sr.ReadToEnd()
        sr.Close()
        Dim sw As New StreamWriter("C:\Temp\hrefMod.txt")
        Dim output As String
        Dim mc As MatchCollection

        mc = Regex.Matches(input, "href\s*=\s*[""'](?<TheLink>[^""']+)", _
            RegexOptions.IgnoreCase)
        For Each m As Match In mc
            output = m.Groups("TheLink").Value
            sw.WriteLine(output)
        Next
        sw.Close()


Fernando
 
LVL 96

Expert Comment

by:Bob Learned
ID: 16570628
There are easier ways than complex regular expressions to get URLs from HTML text.

Bob
 
LVL 14

Expert Comment

by:Shiju Sasidharan
ID: 16573279
Try this

        Dim oReg As New Regex("href[ \t]*=[ \t]*([""'])(.*?)\1", RegexOptions.IgnoreCase)
        Dim oCol As MatchCollection
        oCol = oReg.Matches("your_html_source")
        For Each oMatch As Match In oCol
            MsgBox(oReg.Replace(oMatch.Value, "$2"))
        Next
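
The loop above calls Replace on each match just to get at group 2. The same value can probably be read straight from the match's Groups collection. A minimal sketch, assuming a console program and a made-up HTML string:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupAccessSketch
    Sub Main()
        ' Hypothetical sample input.
        Dim html As String = "<a href=""one.aspx"">1</a><a href='two.aspx'>2</a>"

        ' Group 1 is the opening quote, \1 requires the same character as the closing quote,
        ' and group 2 is the link text between them.
        Dim oReg As New Regex("href[ \t]*=[ \t]*([""'])(.*?)\1", RegexOptions.IgnoreCase)

        For Each oMatch As Match In oReg.Matches(html)
            ' Read the second capture group directly instead of running Replace.
            Console.WriteLine(oMatch.Groups(2).Value)
        Next
    End Sub
End Module
```

This should print one.aspx and two.aspx, and it skips a second regex pass per match.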
 
LVL 64

Expert Comment

by:Fernando Soto
ID: 16573536
Hi shijusn;

The Regular Expression pattern you gave, "href[ \t]*=[ \t]*([""'])(.*?)\1", has a problem. If there is a CrLf between href and the = sign, or between the = sign and the quote character, the pattern will not match. The reason is the [ \t]* part of the pattern, which only looks for a space or tab character zero or more times. If you change [ \t]* to \s*, the pattern will accept any white space in those positions.

Fernando
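
The difference between [ \t]* and \s* can be checked with a small test. This is a sketch only; the input string with a line break after href is invented for the comparison.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WhitespaceSketch
    Sub Main()
        ' Hypothetical input with a CrLf between href and the = sign.
        Dim html As String = "<a href" & vbCrLf & "=""page.aspx"">x</a>"

        ' [ \t]* accepts only spaces and tabs; \s* also accepts CR and LF.
        Dim spacesAndTabs As New Regex("href[ \t]*=[ \t]*([""'])(.*?)\1", RegexOptions.IgnoreCase)
        Dim anyWhitespace As New Regex("href\s*=\s*([""'])(.*?)\1", RegexOptions.IgnoreCase)

        Console.WriteLine(spacesAndTabs.IsMatch(html))               ' False
        Console.WriteLine(anyWhitespace.IsMatch(html))               ' True
        Console.WriteLine(anyWhitespace.Match(html).Groups(2).Value) ' page.aspx
    End Sub
End Module
```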
 
LVL 96

Expert Comment

by:Bob Learned
ID: 16576899
Well, I tried :)

Bob
 

Author Comment

by:ComputerCu
ID: 16578887
Thanks for the help, everyone. I already know how to do this using InStr, SubString, etc., but from my understanding a compiled RegEx is much faster, which is why I went this route.
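
For reference, a compiled regex in .NET is requested with RegexOptions.Compiled. The sketch below reuses the pattern from the accepted solution; the module name, field name and sample input are assumptions. Whether it actually beats InStr and SubString depends on the input and on how often the regex is reused, so it is worth measuring rather than assuming.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CompiledRegexSketch
    ' Built once and reused. RegexOptions.Compiled costs more at construction time
    ' in exchange for faster matching, so it pays off mainly on repeated use.
    Private ReadOnly HrefRegex As New Regex( _
        "href\s*=\s*[""'](?<TheLink>[^""']+)", _
        RegexOptions.IgnoreCase Or RegexOptions.Compiled)

    Sub Main()
        ' Hypothetical sample input.
        Dim html As String = "<a href=""myLink.aspx"">Click here</a>"

        For Each m As Match In HrefRegex.Matches(html)
            Console.WriteLine(m.Groups("TheLink").Value)
        Next
    End Sub
End Module
```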
