Solved

regex to match url

Posted on 2011-02-16
Last Modified: 2012-05-11
I need a regex to match the href part of a URL.
The links are already HTML links, meaning they already look like <a href="xxx">xxx</a>

I'm currently trying to use href=[\'"]?([^\'" >]+)
but it returns the href=" along with the match.

So the question is: how do I return the href part without the href="?
Question by:dgrafx
40 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 34908866
Which language are you using, and how are you getting the return?
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34909024
ColdFusion, and the return is from reMatch.
If you're not familiar with it, you just specify a regex.
Some regex syntax is the same as in other languages and some is different, as is always the case I imagine.

thanks ...
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34909158
this should do it

<a href="[^>]+">
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34909189
QuinnDester,
keep in mind that I DON'T want the whole link!
I want the href part of the link.

thanks
0
 
LVL 84

Expert Comment

by:ozo
ID: 34909611
try Matcher.Group( 2 )
or (?<=href=[\'"]?)([^\'" >]+)
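
The lookbehind form can be sketched in Java's java.util.regex, which accepts (?<=...) as long as the lookbehind has a bounded length. This is only an illustration under that assumption: the class and method names are hypothetical, and it says nothing about what reMatch itself accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefLookbehind {
    // (?<=href=['"]?) asserts that href= and an optional quote sit just before
    // the match, without making them part of the match.
    private static final Pattern HREF =
            Pattern.compile("(?<=href=['\"]?)[^'\" >]+");

    // Returns every href value found in the given HTML text.
    static List<String> hrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://xxx.tester.org?c=ecfr\">some text</a>";
        System.out.println(hrefs(html));
    }
}
```

Because the lookbehind is not part of the match, the link text after the > is never picked up, even when it repeats the URL.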
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34909771
That exact pattern won't run in reMatch:
"unrecognized sequence", referring to ?<
Do you have a variation?
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34909865
<a href="([^>]+)">
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34910022
Quinn, that returns <a href="http://xxx.tester.org?c=ecfr">
for example.
I want http://xxx.tester.org?c=ecfr returned.

thanks
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34911104
(<a href=")([^>]+)(">) and match on group 2, as ozo suggested above
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34911542
Another way would be to grab what you can and, once you have it in a variable, do a replace to get rid of the parts you don't want.
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34915801
That last post doesn't work either. What do you mean by "match on group 2"?
What are you referring to?

thanks
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34915978
You are matching the regex, and the () break the regex into groups, so you can match on the whole thing or on individual groups (a group is the part of the match that falls between one pair of parentheses). You need to retrieve what falls between the second pair of parentheses.

Are you sure this doesn't work?

 Replace(thisURL, '<a href="', "", "ALL")
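
To make "group 2" concrete, here is a small sketch in Java rather than ColdFusion, using the three-group pattern from earlier in the thread (written without a space before the last parenthesis). The class and method names are made up for illustration; whether reMatch exposes groups at all is a separate question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefGroups {
    // Three pairs of parentheses give groups 1..3; group 0 is the whole match.
    private static final Pattern LINK =
            Pattern.compile("(<a href=\")([^>]+)(\">)");

    // Returns groups 0..3 of the first match, or null if nothing matches.
    static String[] groups(String html) {
        Matcher m = LINK.matcher(html);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("<a href=\"http://xxx.tester.org?c=ecfr\">link</a>");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
    }
}
```

Group 0 is the full opening tag, while group 2 holds only the URL between the quotes.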
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34916003
That's what I've been doing.
Now I want to figure out how to return just the URL, and nothing else ...
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34916129
Try this. It checks that <a href=" exists at the start of the match and > at the end, but only picks up what is in the middle:

(?<=<a href=")<a[^>]+>.+?</a>(?=>)
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34916150
sorry its picking up too much


(?<=<a href=")<a[^>]+(?=>)
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34916351
Good job I looked again, that wouldn't have worked at all... This should do it:

(?<=<a href=")[^>]+(?=>)
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34916748
I get a "sequence not recognized" error ...
It's referring to ?<
Can you think of something that does the same thing, as it would in other languages?
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34916854
That's the ColdFusion method; the problem with ColdFusion is that it has a very basic regex engine.

You could try using a JavaScript method.

In C# you would choose the group you wanted to keep from the match.

Can I see the part of the code that's using the regex?
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34916951
I can't use JS. I need to grab URLs from a file, then proceed with "processing" on each URL.
There may be 0 or 50 or ??? links that I need to grab the URL from.

Yes, reMatch is a basic regex matcher.
I could use the Java version that it's based on, but I don't know the syntax. Do you?

What do you mean by "the part of the code that's using the regex"?
Do you mean the reMatch function?
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34916968
The function you have written where you need the regex.
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34917025
It's just a line of code using reMatch:

rematch(regex, input)

That's it!
regex is your regex statement and input is the variable from reading the file.
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34917521
try this

rematch("http?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?",  input)
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34917603
sorry, missed the s out

rematch("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?",  input)
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34917744
That, I believe, is the EXACT regex I was using until the current problem arose!

And it works great MOST of the time.

The problem: we have these links that need to be grabbed. Here is an example link: http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;sid=22367671395357d0a5bfe1c9fe1004ee;rgn=div5;view=text;node=45%3A4.1.2.4.14;idno=45;cc=ecfr##45:4.1.2.4.14.3.1.1

That regex you just posted brings back only part of it: http://xxx.xxx.xxx/cgi/t/text/text

So then I started searching for a new regex.
I found href=[\'"]?([^\'" >]+), which works great EXCEPT that it leaves the href=" on the front of the link!

ideas?
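
The truncation is consistent with how that pattern is written: the path part, [\w/_\.]*, does not allow a hyphen, so the match stops after text, and the query part, (\?\S+)?, never applies because the next character is - rather than ?. A small Java sketch of the difference, with a shortened example URL and hypothetical class and method names (Java's engine is assumed here; reMatch may differ in other details):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlTruncation {
    // The stricter pattern from the thread: its path class [\w/_\.] has no hyphen.
    static final Pattern STRICT = Pattern.compile(
            "https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?");

    // A looser pattern: everything up to a quote, a space, or >.
    static final Pattern LOOSE = Pattern.compile("https?://[^'\" >]+");

    // Returns the first match of the pattern in the text, or null.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String url = "http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;rgn=div5";
        System.out.println(firstMatch(STRICT, url)); // stops before "-idx"
        System.out.println(firstMatch(LOOSE, url));  // the whole URL
    }
}
```

The looser pattern avoids the hyphen problem, but on its own it also matches a URL that appears as link text, which is why the href= anchor matters.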
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34917929
So, combining the two, you get this. Try it and see how it works:


"https?://([^\'" >]+)
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34918106
I don't know why I didn't post this accurately ...
What we have are files with links.
The link's href is what I need to grab (as I've said earlier).
The text part of the link is usually the same as the href part,
BUT I don't want it at all, because it is usually distorted, for example with spaces and carriage returns from formatting by non-programmers.
And even if not distorted, it would be a duplicate of the href.

so - here is an example:
<a href="http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;sid=22367671395357d0a5bfe1c9fe1004ee;rgn=div5;view=text;node=45%3A4.1.2.4.14;idno=45;cc=ecfr##45:4.1.2.4.14.3.1.1">http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;sid=22367671395357d0a5bfe1c9fe1004ee;rgn= div5;view=text;node=45%3A4.1.2.4.14;idno=45;cc= ecfr##45:4.1.2.4.14.3.1.1</a>

The latest regex returns both links.
I need just the href because the links are being tested for validity; if one is not valid, the document is flagged.
Long story, but I need just href="***"

thanks
0
 
LVL 3

Accepted Solution

by:
QuinnDester earned 500 total points
ID: 34918425
this seems to be the best solution to the problem

Replace((rematch,"href=[\'"]?([^\'" >]+)",input),"href="",""[,ALL])
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34918516
!!!
That's exactly (results-wise) where I was when I came here!!!
lol

FYI though, reMatch returns an array, so the last example you posted would be incorrect. I know what you mean though ...

any other ideas?
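
Since the match call returns a list rather than a string, the strip step has to run on each element. A sketch of that two-step idea in Java, where the class and method names are hypothetical and the list stands in for the array that reMatch returns:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefStripPrefix {
    private static final Pattern HREF_ATTR =
            Pattern.compile("href=['\"]?[^'\" >]+");

    // Step 1: collect every href=... match, prefix included.
    static List<String> matchAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = HREF_ATTR.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Step 2: strip the leading href= and optional quote from each element.
    static List<String> stripPrefix(List<String> matches) {
        List<String> urls = new ArrayList<>();
        for (String match : matches) {
            urls.add(match.replaceFirst("^href=['\"]?", ""));
        }
        return urls;
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://a.example/one\">http://a.example/one</a>";
        System.out.println(stripPrefix(matchAll(input)));
    }
}
```

Anchoring the strip pattern with ^ keeps it from touching anything other than the prefix of each match.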
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34918648
hey thanks

thermoduric - how does one "find" Zones here on EE?
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34919069
Going back to this one: (?<=<a href=")[^>]+(?=>)

 (?<= is giving you an error because lookbehind isn't natively supported in ColdFusion.

Below is a link that has the script needed to make it work; it also has some very good information on extracting URLs:

 http://stackoverflow.com/questions/3250455/parse-url-from-string-in-coldfusion
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34919955
The problem with the solution on that page is that it requires one to install 'jre-utils', and I do NOT have permission to install anything on the server in question ...

If I could do something similar with a native Java lib ...
that would be awesome.
0
 
LVL 3

Expert Comment

by:QuinnDester
ID: 34920259
No, you don't have to install it; you just place the file in the folder.
0
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34920279
I think what you want is a rematch with pattern:
<a href="([^>]+)">
and then you just need to get the 2nd value from the resulting array. Do you know how to deal with arrays?

Or this might work, but it looks more complicated than it should be...
http://www.bennadel.com/blog/1040-REMatchGroups-ColdFusion-User-Defined-Function.htm

0
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34920359
Ok, maybe a simple subpattern won't capture the url like you need it to... I found this comment somewhere:
#5 Posted By: Adam Cameron Posted On: 9/18/09 4:38 PM
I see the usefulness of reMatch() as being fairly limited, given it doesn't support the return of matched subexpressions like reFind() does. It's pretty rare that I don't also want to match subexpressions when using regexes, and in not doing this, it's rendered useless for all except fairly basic situations.

It seems to me like it's a half-finished solution.

Still: something that's half-finished does have the scope to be finished one day, I guess.
--
Adam


Looks like reFind might do the trick too - this article is by the same guy as the previous link:
http://www.bennadel.com/blog/1090-REFind-Sub-Expressions-Thanks-Adam-Cameron-.htm

All I can say really is that it's really easy in PHP...  
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34921004
Perhaps a REReplace would suit you better?
<CFSet variables.extracted=REReplace("<a href=""xxx"">yyy</a>", ".*?href=[\'""]?([^\'"" >]+).*", "\1") />
<CFDump var=#variables.extracted# label="href" />
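
For comparison, the same replace-with-backreference idea can be sketched in Java, where the replacement refers to the captured group as $1 instead of \1. The class and method names are made up for illustration, and the approach is best suited to a string holding a single link, since the whole string is replaced by one captured value.

```java
class HrefReplace {
    // The pattern consumes the entire string and keeps only capture group 1,
    // the href value. A string without href= does not match and is returned as-is.
    static String extractHref(String html) {
        return html.replaceAll(".*?href=['\"]?([^'\" >]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extractHref("<a href=\"xxx\">yyy</a>"));
    }
}
```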

0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34921068
>>  how does one "find" Zones here on EE?

When you post your question, there is a search feature for finding zones at the bottom of the page.
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34925429
NO - I didn't mean to close!!!
I meant to accept a solution!!!

Did I accidentally accept my own post!

Sorry guys ...
I'll fix it ...
0
 
LVL 25

Author Comment

by:dgrafx
ID: 34925452
Moderator - please remove the request to delete this question or whatever ...

I clicked my own post by mistake

I want to award points to a poster

sorry for the error
0
 
LVL 25

Author Closing Comment

by:dgrafx
ID: 34925462
Thanks for all the help
0
