dgrafx asked:
regex to match url

i need a regex to match the href part of a url
the links are already html links meaning they are already <a href="xxx">xxx</a>

i'm currently trying to use href=[\'"]?([^\'" >]+)
but it returns the href=" along with the match.

that is the question - How do i return the href part without the href="
ozo:

Which language are you using, and how are you getting the return?
dgrafx:


ColdFusion, and the return value comes from rematch.
If you're not familiar with it, you just pass it a regex.
Some of the regex syntax is the same as in other languages and some is different, as is always the case I imagine.

thanks ...
this should do it

<a href="[^>]+">
dgrafx:


keep in mind that I DONT want the link!
i want the href part of the link

try Matcher.Group( 2 )
or (?<=href=[\'"]?)([^\'" >]+)
dgrafx:


that exact function wont run in rematch
"unrecognized sequence" referring to ?<
do you have a variation?
<a href="([^>]+)">
dgrafx:


quinn - that returns <a href="">
for example
i want returned

(<a href=")([^>]+)("> ) and match on group2 as ozo suggested above
Another way would be to match what you can and, once you have it in a variable, do a replace to get rid of the parts you don't want.
dgrafx:


that last post doesn't work either - what do you mean by match on group 2?
what are you referring to?

You are matching the regex. The () break the regex into groups, so you can match on the whole thing or on individual groups (a group is the part of the match that falls between a pair of brackets). You need to retrieve what falls between the second set of brackets.
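To make the "groups" idea concrete, here is a minimal sketch in Java (the engine family the thread says ColdFusion is built on). It uses the pattern from the original question, where the parentheses form group 1. The class name, method name, and example.com URL are made up for illustration, and how rematch itself exposes groups is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefGroupDemo {
    // Pattern from the question; the parentheses make the href value group 1.
    private static final Pattern HREF =
            Pattern.compile("href=['\"]?([^'\" >]+)");

    // Returns the first href value in the input, or null when none is found.
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/page\">link text</a>";
        Matcher m = HREF.matcher(html);
        if (m.find()) {
            System.out.println(m.group(0)); // whole match, still includes href="
            System.out.println(m.group(1)); // only what the parentheses captured
        }
    }
}
```

The whole match (group 0) is what a plain "return everything that matched" function hands back, which is why the href=" prefix shows up; asking for the group drops it.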

you sure this doesn't work?

 Replace(thisURL, '<a href="', "", "ALL")
dgrafx:


thats what i've been doing
now - i want to figure out how to return just the url - and nothing else ...
try this, it checks that the <a href=" exists at the start of the match and > at the end, but only picks up what is in the middle

(?<=<a href=")<a[^>]+>.+?</a>(?=>)
sorry its picking up too much

(?<=<a href=")<a[^>]+(?=>)
good job i looked again, that wouldn't have worked at all... this should do it

(?<=<a href=")[^>]+(?=>)
dgrafx:


i get a sequence not recognized error ...
it's referring to ?<
what can you think of that does the same thing in other languages?
That's the ColdFusion method. The problem with ColdFusion is that it has a very basic regex engine.

you could try using a javascript method.

in c# you would choose the group you wanted to keep from the match

can i see the part of the code thats using the regex
dgrafx:


i can't use js - i need to grab urls from a file - then proceed with "processing" on each url.
there may be 0 or 50 or ??? links that I need to grab the url from.

yes - rematch is a basic regex matcher
I could use the java version that its based on but I don't know the syntax - do you?

what do you mean by "the part of the code thats using the regex"?
do you mean the rematch function?
the function you have written where you need the regex
dgrafx:


its just a line of code using rematch

rematch(regex, input)

thats it!
regex is your regex statement and input is the var from reading the file
try this

rematch("http?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?",  input)
sorry, missed the s out

rematch("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?",  input)
dgrafx:


that, i believe, is the EXACT regex i was using until the current problem arose!

and it works great MOST of the time.

The problem: we have these links that need to be grabbed - here is an example link:;sid=22367671395357d0a5bfe1c9fe1004ee;rgn=div5;view=text;node=45%3A4.;idno=45;cc=ecfr##45:

That regex you just posted brings back only a partial : 

so then i started searching for a new regex
i found : href=[\'"]?([^\'" >]+) which works great EXCEPT it leaves the href=" on the front of the link!
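A possible explanation for the partial match, sketched in Java: the path part of that URL pattern only allows word characters, "/", "_" and ".", so the match stops at the first ";" or "=". The URL below is a made-up stand-in (the real link is not reproduced in this thread), and Java's engine is assumed to behave like rematch on this pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartialUrlMatch {
    // The URL pattern posted above, written as a Java string literal.
    private static final Pattern URL = Pattern.compile(
            "https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?");

    // Returns the first URL-like match in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = URL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical URL; the path class [\w/_\.] contains no ';' or '='.
        String url = "http://example.com/text;sid=1;rgn=div5";
        System.out.println(firstMatch(url)); // prints http://example.com/text
    }
}
```

That is why a pattern keyed on href="..." and "everything up to the closing quote" copes better with these links than one that tries to describe what a URL may contain.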

so combining the 2 you get this, try it see how it works

"https?://([^\'" >]+)
dgrafx:


i don't know why i didn't post this accurately ...
what we have are files with links
the links href is what i need to grab (as i've said earlier)
but the text part of the link is usually the same as the href part
BUT i don't want it at all because it is usually distorted like with spaces and carriage returns from formatting by non-programmers.
and even if not distorted - would be a duplicate of href

so - here is an example:
<a href=";sid=22367671395357d0a5bfe1c9fe1004ee;rgn=div5;view=text;node=45%3A4.;idno=45;cc=ecfr##45:">;sid=22367671395357d0a5bfe1c9fe1004ee;rgn= div5;view=text;node=45%3A4.;idno=45;cc= ecfr##45:</a>

the latest regex gives a return of both links
I need just the href because the links are being tested for being valid - then if not valid the document is flagged
long story - but need just href="***"
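One way to get only the href values and never the link text, sketched in Java: anchor the pattern on <a href=" and capture up to the closing quote, then loop over every match. This assumes double-quoted hrefs with href as the first attribute; the class name and the example.com URL are placeholders, not anything from the actual files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefCollector {
    // Group 1 is everything between the quotes of href="...".
    private static final Pattern HREF =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collects every href value; returns an empty list when there are no links.
    static List<String> collect(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Link text repeats the URL with stray spaces, as in the example above.
        String html = "<a href=\"http://example.com/a;rgn=div5;cc=x\">"
                + "http://example.com/a;rgn= div5;cc= x</a>";
        System.out.println(collect(html));
    }
}
```

Because the link text is not preceded by <a href=", it is never matched, so the distorted duplicate is ignored whether a file has 0 links or 50.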

QuinnDester:

(This solution is only available to Experts Exchange members.)
dgrafx:


Thats exactly (results wise) where I was at when I came here!!!

fyi though - rematch returns an array so the last example you posted would be incorrect - I know what you mean though ...

any other ideas?
dgrafx:


hey thanks

thermoduric - how does one "find" Zones here on EE?
going back to this one, (?<=<a href=")[^>]+(?=>)

 (?<= is giving you an error because lookbehind isn't natively supported in ColdFusion

below is a link that has the script needed to make it work; it also has some very good information on extracting URLs
dgrafx:


the problem with the solution on that page is that it requires one to install 'jre-utils' - I do NOT have permissions to install anything on the server in question ...

if I could do something similar with a native java lib ....
that would be awesome
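For what it's worth, the lookbehind idea does work in Java's built-in java.util.regex, with nothing extra to install. The sketch below is plain Java, not CFML, and how to call it from a ColdFusion page is left out. It stops at the closing quote rather than at ">", because [^>]+(?=>) would keep the trailing quote in the match. Class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefLookbehind {
    // (?<=<a href=") requires the prefix but leaves it out of the match.
    private static final Pattern HREF =
            Pattern.compile("(?<=<a href=\")[^\"]+");

    // Returns every href value; here the whole match is already the URL.
    static List<String> collect(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group());
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">a</a> "
                + "<a href=\"http://example.com/b\">b</a>";
        System.out.println(collect(html));
    }
}
```

Note this variant needs the literal text <a href=" directly before the URL, so links with extra spaces or other attributes before href would be skipped.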
no you dont have to install it, you just place the file in the folder
I think what you want is a rematch with pattern:
<a href="([^>]+)">
and then you just need to get the 2nd value from the resulting array. Do you know how to deal with arrays?

Or this might work, but it looks more complicated than it should be...

Ok, maybe a simple subpattern won't capture the url like you need it to... I found this comment somewhere:
#5 Posted By: Adam Cameron Posted On: 9/18/09 4:38 PM
I see the usefulness of reMatch() as being fairly limited, given it doesn't support the return of matched subexpressions like reFind() does. It's pretty rare that I don't also want to match subexpressions when using regexes, and in not doing this, it's rendered useless for all except fairly basic situations.

It seems to me like it's a half-finished solution.

Still: something that's half-finished does have the scope to be finished one day, I guess.

Looks like reFind might do the trick too - this article is by the same guy as the previous link:

All I can say really is that it's really easy in PHP...  
Perhaps a REReplace would suit you better?
<CFSet variables.extracted=REReplace("<a href=""xxx"">yyy</a>", ".*?href=[\'""]?([^\'"" >]+).*", "\1") />
<CFDump var=#variables.extracted# label="href" />
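The same replace-with-backreference trick can be sketched in Java for comparison, where the replacement string spells the backreference $1 instead of \1. This is an illustration of the technique with a made-up class name, not a claim about how REReplace is implemented.

```java
class HrefReplaceDemo {
    // Replaces the whole string with group 1, i.e. just the href value.
    static String extract(String html) {
        // Same pattern as the REReplace call above, as a Java string literal.
        return html.replaceAll(".*?href=['\"]?([^'\" >]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("<a href=\"xxx\">yyy</a>")); // prints xxx
    }
}
```

Two caveats with this approach: input with no href comes back unchanged, and it only yields one URL per line of input, so it suits a single link better than a whole file.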


>>  how does one "find" Zones here on EE?

When you post your question, at the bottom of the page there is a search feature for finding zones:
dgrafx:


NO - I didn't mean to close!!!
I meant to accept a solution!!!

Did I accidentally accept my own post!

Sorry guys ...
I'll fix it ...
dgrafx:


Moderator - please remove the request to delete this question or whatever ...

I clicked my own post by mistake

I want to award points to a poster

sorry for the error
dgrafx:


Thanks for all the help