regex to match url

I need a regex to match the href part of a link.
The links are already HTML links, meaning they are already in the form <a href="xxx">xxx</a>

I'm currently trying to use href=[\'"]?([^\'" >]+)
but it returns the href=" along with the match.

So the question is: how do I return the href value without the leading href="
dgrafx Asked:
 
QuinnDester Commented:
this seems to be the best solution to the problem

Replace(rematch("href=[\'""]?([^\'"" >]+)", input), "href=""", "" [, ALL])
 
ozo Commented:
Which language are you using, and how are you getting the return?
 
dgrafx (Author) Commented:
ColdFusion, and the return value comes from rematch.
If you're not familiar with it: you just pass it a regex and the input string.
Some regex syntax is the same as in other languages and some is different, as is always the case, I imagine.

thanks ...
 
QuinnDester Commented:
this should do it

<a href="[^>]+">
 
dgrafx (Author) Commented:
QuinnDester,
keep in mind that I DON'T want the whole link!
I want the href part of the link.

thanks
 
ozo Commented:
try Matcher.Group( 2 )
or (?<=href=[\'"]?)([^\'" >]+)
 
dgrafx (Author) Commented:
That exact pattern won't run in rematch:
"unrecognized sequence", referring to ?<
Do you have a variation?
 
QuinnDester Commented:
<a href="([^>]+)">
 
dgrafx (Author) Commented:
Quinn, that returns <a href="http://xxx.tester.org?c=ecfr">
for example.
I want http://xxx.tester.org?c=ecfr returned.

thanks
 
QuinnDester Commented:
(<a href=")([^>]+)(">) and match on group 2, as ozo suggested above
 
QuinnDester Commented:
Another way would be to grab what you can and then, once you have it in a variable, do a replace to get rid of the parts you don't want.
 
dgrafx (Author) Commented:
That last post doesn't work either. What do you mean by "match on group 2"?
What are you referring to?

thanks
 
QuinnDester Commented:
When the regex matches, each pair of parentheses () breaks it into a group, so you can take the whole match or just an individual group (the part of the match that falls between one pair of parentheses). You need to retrieve what falls between the second pair of parentheses.

Are you sure this doesn't work?

 Replace(thisURL, "<a href=""", "" [, ALL ])
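
To illustrate what "match on a group" means, here is a minimal sketch in Java, the platform ColdFusion runs on. The class and method names are made up for this example and are not from the thread. It uses a single pair of parentheses, so the URL is group 1; with the three-group pattern posted earlier it would be group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns the text captured by the first pair of parentheses,
    // or null when the input contains no matching link.
    static String firstHref(String html) {
        Pattern p = Pattern.compile("<a href=\"([^\">]+)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://xxx.tester.org?c=ecfr\">link text</a>";
        // group(0) would be the whole match, including <a href="
        System.out.println(firstHref(html)); // prints http://xxx.tester.org?c=ecfr
    }
}
```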
 
dgrafx (Author) Commented:
That's what I've been doing.
Now I want to figure out how to return just the URL, and nothing else ...
 
QuinnDester Commented:
try this, it checks that the <a href=" exists at the start of the match and > at the end, but only picks up what is in the middle

(?<=<a href=")<a[^>]+>.+?</a>(?=>)
 
QuinnDester Commented:
Sorry, it's picking up too much.


(?<=<a href=")<a[^>]+(?=>)
 
QuinnDester Commented:
Good job I looked again; that wouldn't have worked at all. This should do it:

(?<=<a href=")[^>]+(?=>)
 
dgrafx (Author) Commented:
I get a "sequence not recognized" error ...
It's referring to ?<
What can you think of that does the same thing in other languages?
 
QuinnDester Commented:
That's down to ColdFusion: the problem is that it has a very basic regex engine.

You could try using a JavaScript method.

In C# you would choose the group you wanted to keep from the match.

Can I see the part of the code that's using the regex?
 
dgrafx (Author) Commented:
I can't use JS. I need to grab URLs from a file, then proceed with "processing" on each URL.
There may be 0 or 50 or ??? links that I need to grab the URL from.

Yes, rematch is a basic regex matcher.
I could use the Java version that it's based on, but I don't know the syntax. Do you?

What do you mean by "the part of the code that's using the regex"?
Do you mean the rematch function?
 
QuinnDester Commented:
The function you have written where you need the regex.
 
dgrafx (Author) Commented:
It's just a line of code using rematch:

rematch(regex, input)

That's it!
regex is your regex pattern and input is the variable holding the contents read from the file.
 
QuinnDester Commented:
try this

rematch("http?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?",  input)
 
QuinnDester Commented:
Sorry, I missed the s out:

rematch("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?",  input)
 
dgrafx (Author) Commented:
That, I believe, is the EXACT regex I was using until the current problem arose!

And it works great MOST of the time.

The problem: we have these links that need to be grabbed - here is an example link: http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;sid=22367671395357d0a5bfe1c9fe1004ee;rgn=div5;view=text;node=45%3A4.1.2.4.14;idno=45;cc=ecfr##45:4.1.2.4.14.3.1.1

The regex you just posted brings back only part of it: http://xxx.xxx.xxx/cgi/t/text/text

So then I started searching for a new regex.
I found href=[\'"]?([^\'" >]+), which works great EXCEPT that it leaves the href=" on the front of the link!

Ideas?
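
The truncation can be reproduced outside ColdFusion. Below is a small Java sketch; the class name, the shortened sample URL and the wider pattern are assumptions for illustration. The path part of the posted pattern, [\w/_\.]*, does not allow a hyphen, and the query part must begin with ?, so the match stops just before -idx. A pattern that simply takes everything up to whitespace, a quote, or an angle bracket does not have that limit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TruncationDemo {
    // The pattern posted above: the path class [\w/_\.] contains no hyphen.
    static final String NARROW =
        "https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?";
    // Assumed alternative: everything up to whitespace, a quote, or < >.
    static final String WIDE = "https?://[^\\s\"'<>]+";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the example link from the post above.
        String url = "http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;idno=45;cc=ecfr##45:4.1.2.4.14.3.1.1";
        System.out.println(firstMatch(NARROW, url)); // http://xxx.xxx.xxx/cgi/t/text/text
        System.out.println(firstMatch(WIDE, url));   // the full URL
    }
}
```

Note that the wider pattern would also match a URL that appears as link text, so on its own it does not separate the href from the visible text.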
 
QuinnDester Commented:
so combining the 2 you get this, try it see how it works


"https?://([^\'" >]+)
 
dgrafx (Author) Commented:
I don't know why I didn't describe this accurately ...
What we have are files with links.
The link's href is what I need to grab (as I've said earlier).
The text part of the link is usually the same as the href part,
BUT I don't want it at all, because it is usually distorted with spaces and carriage returns from formatting by non-programmers.
And even if it's not distorted, it would be a duplicate of the href.

So, here is an example:
<a href="http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;sid=22367671395357d0a5bfe1c9fe1004ee;rgn=div5;view=text;node=45%3A4.1.2.4.14;idno=45;cc=ecfr##45:4.1.2.4.14.3.1.1">http://xxx.xxx.xxx/cgi/t/text/text-idx?c=ecfr;sid=22367671395357d0a5bfe1c9fe1004ee;rgn= div5;view=text;node=45%3A4.1.2.4.14;idno=45;cc= ecfr##45:4.1.2.4.14.3.1.1</a>

The latest regex returns both links.
I need just the href because the links are being tested for validity; if a link is not valid, the document is flagged.
Long story, but I need just the value in href="***"

thanks
 
dgrafx (Author) Commented:
!!!
That's exactly (results-wise) where I was when I came here!!!
lol

FYI though, rematch returns an array, so the last example you posted would be incorrect. I know what you mean though ...

Any other ideas?
 
dgrafx (Author) Commented:
hey thanks

thermoduric - how does one "find" Zones here on EE?
 
QuinnDester Commented:
Going back to this one: (?<=<a href=")[^>]+(?=>)

 (?<= is giving you an error because lookbehind isn't natively supported in ColdFusion.

Below is a link to a script that makes it work; the page also has some very good information on extracting URLs.

 http://stackoverflow.com/questions/3250455/parse-url-from-string-in-coldfusion
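
For comparison, java.util.regex does support lookbehind and lookahead. The sketch below (class and method names are assumptions for illustration) runs the pattern from this post and a slightly adjusted one. One thing it shows: as posted, [^>]+ runs all the way up to the >, so the match keeps the closing quote; stopping at the quote avoids that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://xxx.tester.org?c=ecfr\">link text</a>";

        // Pattern as posted: the lookbehind skips <a href=" but [^>]+
        // also swallows the closing quote before the >.
        System.out.println(firstMatch("(?<=<a href=\")[^>]+(?=>)", html));
        // prints http://xxx.tester.org?c=ecfr"

        // Assumed tweak: stop at the closing quote instead.
        System.out.println(firstMatch("(?<=<a href=\")[^\">]+", html));
        // prints http://xxx.tester.org?c=ecfr
    }
}
```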
 
dgrafx (Author) Commented:
The problem with the solution on that page is that it requires installing 'jre-utils', and I do NOT have permission to install anything on the server in question ...

If I could do something similar with a native Java lib ...
that would be awesome.
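
As a rough idea of what the native Java route could look like, here is a sketch that uses only java.util.regex from the JDK, with the href pattern from the original question. The class name, method name and sample input are hypothetical. ColdFusion can generally create JDK objects with createObject("java", ...), but how this would be wired into the page is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Pattern from the question; group 1 is the value after href=
    private static final Pattern HREF =
        Pattern.compile("href=['\"]?([^'\" >]+)");

    // Collects group 1 of every match, so the href=" prefix and the
    // link text are never part of the result.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up input: the first link's text is a distorted copy of its href.
        String html = "<a href=\"http://xxx.tester.org/a?c=ecfr;rgn=div5\">"
            + "http://xxx.tester.org/a?c=ecfr;rgn= div5</a> "
            + "<a href='http://xxx.tester.org/b'>b</a>";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

The list is empty when the input has no links, which covers the "0 or 50 or ???" case.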
 
QuinnDester Commented:
No, you don't have to install it; you just place the file in the folder.
 
Terry Woods (IT Guru) Commented:
I think what you want is a rematch with this pattern:
<a href="([^>]+)">
and then you just need to get the 2nd value from the resulting array. Do you know how to deal with arrays?

Or this might work, but it looks more complicated than it should be...
http://www.bennadel.com/blog/1040-REMatchGroups-ColdFusion-User-Defined-Function.htm

 
Terry Woods (IT Guru) Commented:
Ok, maybe a simple subpattern won't capture the url like you need it to... I found this comment somewhere:
#5 Posted By: Adam Cameron Posted On: 9/18/09 4:38 PM
I see the usefulness of reMatch() as being fairly limited, given it doesn't support the return of matched subexpressions like reFind() does. It's pretty rare that I don't also want to match subexpressions when using regexes, and in not doing this, it's rendered useless for all except fairly basic situations.

It seems to me like it's a half-finished solution.

Still: something that's half-finished does have the scope to be finished one day, I guess.
--
Adam


Looks like reFind might do the trick too - this article is by the same guy as the previous link:
http://www.bennadel.com/blog/1090-REFind-Sub-Expressions-Thanks-Adam-Cameron-.htm

All I can say really is that it's really easy in PHP...  
 
käµfm³d 👽 Commented:
Perhaps a REReplace would suit you better?
<CFSet variables.extracted=REReplace("<a href=""xxx"">yyy</a>", ".*?href=[\'""]?([^\'"" >]+).*", "\1") />
<CFDump var=#variables.extracted# label="href" />
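
The same replace-with-backreference idea can be sketched in Java, where the replacement refers to the group as $1 rather than \1. The class and method names are assumptions, and the (?s) flag is an addition not present in the ColdFusion version; it lets . match line breaks. Because the pattern consumes the whole input, this approach yields only the first href in a string.

```java
class ReplaceDemo {
    // Replaces the entire input with the first href value it contains.
    // If the input has no href, it is returned unchanged.
    static String hrefOnly(String html) {
        return html.replaceAll("(?s).*?href=['\"]?([^'\" >]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(hrefOnly("<a href=\"xxx\">yyy</a>")); // prints xxx
        // With two links, only the first href survives.
        System.out.println(hrefOnly("<a href=\"a\">1</a><a href=\"b\">2</a>")); // prints a
    }
}
```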

 
käµfm³d 👽 Commented:
>>  how does one "find" Zones here on EE?

When you post your question, there is a search feature at the bottom of the page for finding zones.
 
dgrafx (Author) Commented:
NO - I didn't mean to close!!!
I meant to accept a solution!!!

Did I accidentally accept my own post!

Sorry guys ...
I'll fix it ...
 
dgrafx (Author) Commented:
Moderator - please remove the request to delete this question or whatever ...

I clicked my own post by mistake

I want to award points to a poster

sorry for the error
 
dgrafx (Author) Commented:
Thanks for all the help