
regex for finding URLS in a text

Nura111 asked:
Hi, I'm trying to check whether a text contains a URL. Is it as simple as looking for http:// or www., or am I missing something here?

Also, if I want to add www. to the same regex, how can I express "regex1 or regex2" in one pattern (in this case http:// or www.)?

Thanks..
$match = preg_match('/http:\/\/(.*)/s', $text);



Hi,

You can find some on
http://regexlib.com/Search.aspx?k=URL&AspxAutoDetectCookieSupport=1

My favorite is
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

Author

Commented:
Do you mind explaining your favorite one?
Why is it not enough to check just the beginning, http|ftp|https?

Thanks
Most Valuable Expert 2011
Top Expert 2016

Commented:
Please post the test data you want us to use.  We can demonstrate some good examples of "finding URLs" if we know what we are working with.

This article shows the general thought processes that are involved in answering a question like this one.  As you can see, the test data is an integral part of the answer.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

Author

Commented:
It's about the last question you were involved in, the one about spam emails. I'm trying to see whether there is a URL in the message, since that's one indication that it's spam.
Most Valuable Expert 2011
Top Expert 2016

Commented:
Please post the test data you want us to use.

If you have some collected emails that are the sorts of things you want to look at, please post those email messages.  It doesn't make sense for us to guess at what the input should look like.  We can help you find things in the input, but only if we have the input where we can see it.  Thanks, ~Ray
Most Valuable Expert 2011
Top Expert 2015
Commented:
You might try the following:

(?:https?://)?(?:[a-zA-Z0-9\-._~]|%[a-fA-F0-9]{2}|[!$&'()*+,;=])*(?:/(?:[a-zA-Z0-9\-._~]|%[a-fA-F0-9]{2}|[!$&'()*+,;=]|:|@)*)*(?:\?(?:[a-zA-Z0-9\-._~]|%[a-fA-F0-9]{2}|[!$&'()*+,;=]|:|@|/|\?)*)?(?:#(?:[a-zA-Z0-9\-._~]|%[a-fA-F0-9]{2}|[!$&'()*+,;=]|:|@|/|\?)*)?



I wrote this the other day, working from the grammar in RFC 3986.
Most Valuable Expert 2011
Top Expert 2015

Commented:
"Why is it not enough to check just the beginning, http|ftp|https?"
Because you might have sentences like the following:

The http protocol is used to transmit documents over the web. The http was invented around 1990.

By your logic, there are two URLs in the above sentences.
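
To make the point above concrete, here is a minimal sketch that counts matches in those two sentences, first with the scheme name alone and then with the scheme followed by "://". The variable names are only illustrative, and the second pattern is an assumption about what a stricter check might look like, not something posted in this thread.

```php
<?php
// Sample text taken from the comment above.
$text = 'The http protocol is used to transmit documents over the web. '
      . 'The http was invented around 1990.';

// Checking only for the scheme name reports two "URLs" in plain prose.
$schemeOnly = preg_match_all('/http|ftp|https/', $text);

// Requiring "://" right after the scheme reports none for this text.
$schemeAndSlashes = preg_match_all('#(?:https?|ftp)://#', $text);

echo "scheme only: $schemeOnly, scheme plus ://: $schemeAndSlashes\n";
// prints "scheme only: 2, scheme plus ://: 0"
```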
Most Valuable Expert 2011
Top Expert 2016

Commented:
And that is exactly why we want you to make it easier for us to give you the best answer.  Please post the test data you want us to use to demonstrate the efficacy of our solutions, thanks.

Author

Commented:
Ray: the test data is every message with a link to a website.

kaufmed: But I'm looking for http://. Who would write that in the middle of a sentence if it's not a link?
And isn't your example looking for just https?

Top Expert 2011
Commented:
"But I'm looking for http://. Who would write that in the middle of a sentence if it's not a link?"

Anyone that is talking about web browsers, URLs, etc.

For example,
A few versions ago, the Firefox browser started hiding the http:// and https:// from the address bar.  If you want to undo this change, ...

If you are not worried about such things, and don't mind that something like http:// all by itself is turned into a hyperlink, then that is, of course, fine and dandy; it's your choice. Perhaps your content is not technology-centric and you know that your content writers would never do such things.
Top Expert 2011

Commented:
Oh yeah - and you just wrote http:// in the middle of your question - so the answer to your question  "who would do this?" is "you".

Author

Commented:
Yes, that's the case, so do you mind telling me the regex for finding http://, https://, or ftp://?

By the way, can ftp be a link to a website?
Most Valuable Expert 2011
Top Expert 2016

Commented:
Yes, "ftp" is a protocol, as are DNS, SMTP, POP, etc.

#(https?|ftps?)# might work.  Test it out and see.
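
As a rough sketch of how that pattern could be extended to cover the original "http:// or www." question, the alternation operator "|" can join the two cases inside a group. The function name below is hypothetical, and requiring "://" after the scheme is an assumption based on the earlier discussion rather than part of the suggestion above.

```php
<?php
// Hypothetical helper, not from the thread: true when $text contains
// something that starts like a link.
function containsLinkPrefix(string $text): bool
{
    // "|" means "either side": a scheme followed by "://", or a "www." prefix.
    // \b keeps the match from starting in the middle of a longer word.
    $pattern = '#\b(?:(?:https?|ftps?)://|www\.)#i';
    return preg_match($pattern, $text) === 1;
}

var_dump(containsLinkPrefix('Visit www.example.com today'));       // bool(true)
var_dump(containsLinkPrefix('Download from ftp://example.com/f')); // bool(true)
var_dump(containsLinkPrefix('The http protocol is old'));          // bool(false)
```

Note that a check like this still reports a bare "http://" with nothing after it, which is the case raised earlier in the thread.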
Top Expert 2011

Commented:
Yes, ftp can be a valid link to a website.  It's for transferring files.  Some such sites will be nothing more than a directory listing where you can grab files (images, videos, etc.).

Unless you have a reason not to check that such things are actual links rather than random text, you should use elimesika's suggestion, purely because it does a good job of grabbing the entire URL without including any accidental punctuation at the end of the URL (have you ever clicked a link and had it fail because the link included a "." at the end?):
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
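
Here is a small sketch of that pattern used with preg_match_all. The sample text is made up for illustration, and "!" is chosen as the delimiter only because the pattern itself contains "/", "#" and "~". The comments break the pattern into its three parts.

```php
<?php
// Made-up sample text: one URL ends a sentence, one is followed by a comma.
$text = 'See http://www.example.com/page.html. '
      . 'Also ftp://files.example.org/pub/, thanks.';

$pattern = '!(http|ftp|https):\/\/'              // scheme followed by "://"
         . '[\w\-_]+(\.[\w\-_]+)+'               // host: labels joined by dots
         . '([\w\-\.,@?^=%&:/~\+#]*'             // optional rest of the URL...
         . '[\w\-\@?^=%&/~\+#])?!';              // ...whose last char is not "." or ","

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matches, without the trailing "." and ",".
print_r($matches[0]);
// http://www.example.com/page.html
// ftp://files.example.org/pub/
```

The final character class is what drops the trailing punctuation: "." and "," are allowed inside the URL but not as its last character.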


Author

Commented:
Thank you everybody
Most Valuable Expert 2011
Top Expert 2015

Commented:
"And isn't your example looking for just https?"
That would be a resounding, "No." The question mark after the "s" makes the "s" optional. Either "http" or "https" will be matched.
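
A quick way to check this, as a minimal sketch with made-up test strings:

```php
<?php
// "s?" means "zero or one s", so one pattern covers both schemes.
$pattern = '#^https?://#';

var_dump(preg_match($pattern, 'http://example.com'));   // int(1)
var_dump(preg_match($pattern, 'https://example.com'));  // int(1)
var_dump(preg_match($pattern, 'httpss://example.com')); // int(0)
```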

In hindsight, I don't think my regex would be suitable for this task anyway, as it returns far more than just web addresses. This is because of the nature of the ABNF in the RFC. I plan to try and refine the regex, but I can't do it presently.