Nathan Riley
Search for URL in string php
I'm trying to clean some data when users input it, and I need to be able to look for URLs and grab them.
So for example:
This is an awesome post! Check it out: http://google.com and or https://google.com.
OK, so I have that post in, say, a PHP variable. How do I look for http:// or https:// and then grab the full URL in PHP?
This is an interesting question and has been with us for many years, if not decades. Enormous volumes have been written about this question. I even used it as an example in an E-E article to illustrate the process of test-driven development, back in the day before automated testing "grew up."
The quality of the results in an application like this is highly dependent on the detailed problem definition, and the quality of your test data. String parsing with regular expressions can be dicey! The sort of questions we need to consider include "Must the protocol always be HTTP or HTTPS?" Or "Can we include FTP, too?" Or "What if it says 'www' but has no leading protocol?" Or "What TLDs, besides '.com', must I locate?" In practice you will probably come up with more questions than answers! Eventually you will get to a regular expression that is "good enough" but that may not cover 100% of the edge and corner cases.
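One way to make those questions concrete is to write them down as test data before settling on a regular expression. The sketch below is only an illustration: the `$cases` array, the `$candidate` pattern, and the decisions it encodes (FTP out of scope, a scheme required) are assumptions made for the example, not requirements taken from the question.

```php
<?php
// Hypothetical test cases: each input string is paired with the matches we decide we want.
// These expectations encode ONE possible problem definition, not the only one.
$cases = [
    'Visit http://example.com today'  => ['http://example.com'],
    'Secure: https://example.com.'    => ['https://example.com'],
    'Files at ftp://example.org'      => [],  // assumption: FTP is out of scope
    'Bare www.example.com, no scheme' => [],  // assumption: a scheme is required
];

// Candidate pattern under test (deliberately simple; http/https hosts only).
$candidate = '#\bhttps?://[A-Z0-9.-]+\.[A-Z]{2,}\b#i';

$failures = 0;
foreach ($cases as $input => $expected) {
    preg_match_all($candidate, $input, $found);

    // $found[0] holds the full-pattern matches, in order of appearance.
    $ok = ($found[0] === $expected);
    if (!$ok) {
        $failures++;
    }
    echo ($ok ? 'PASS' : 'FAIL'), ': ', $input, PHP_EOL;
}
echo $failures, ' failure(s)', PHP_EOL;
```

Changing an answer to one of the questions (for example, deciding that FTP links should be found after all) means changing the expected value first, then adjusting the pattern until every case passes again.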
Here's an article that describes the thought process and how we approach writing the code:
https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
Here's an example that uses your test data. It contains comments to explain how the regular expression works:
https://iconoun.com/demo/temp_nathan_riley.php
<?php // demo/temp_nathan_riley.php
/**
* https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
*
* https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
*/
error_reporting(E_ALL);
// TEST DATA FROM THE POST AT E-E
$str = 'This is an awesome post! Check it out: http://google.com and or https://google.com.';
// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$rgx
= '#' // REGEX DELIMITER
. '\b' // ON WORD BOUNDARY
. '(' // START GROUP
. 'https?' // HTTP OR HTTPS
. '|' // OR
. 'ftps?' // FTP OR FTPS
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '://' // COLON, SLASH, SLASH
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // A SUBDOMAIN
. '+?' // INDETERMINATE LENGTH
. '\.' // A DOT (ESCAPED)
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // CHARACTER CLASS ALPHANUMERIC
. '+?' // INDETERMINATE LENGTH
. ')' // END GROUP
. '(' // START GROUP
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // LENGTH IS EXACTLY ONE
. ')' // END GROUP
. '(' // START GROUP
. '[A-Z]' // CHARACTER CLASS ALPHA
. '{2,7}' // LENGTH IS TWO TO SEVEN
. ')' // END GROUP
. '\b' // ON WORD BOUNDARY
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
// LOCATE THE URLS
preg_match_all($rgx, $str, $mat);
// SHOW THE WORK PRODUCT
print_r($mat[0]);
// ACTIVATE THIS TO SEE ALL OF THE URL PIECES
// print_r($mat);
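The pattern above stops at the TLD, so a path or query string after the domain is not part of the match. If only http:// and https:// links matter, as in the original question, a narrower pattern can grab the rest of the URL as well. This is a minimal sketch under that assumption; `extract_http_urls()` is a hypothetical helper name, and the rule for trimming trailing punctuation is a guess that will also strip a closing parenthesis that legitimately ends a URL.

```php
<?php
// Hypothetical helper (not part of the demo script): extract http/https URLs,
// including any path, query string, or fragment that follows the host.
function extract_http_urls(string $text): array
{
    // Scheme, "://", then a run of characters that are not whitespace,
    // angle brackets, or double quotes.
    $pattern = '#\bhttps?://[^\s<>"]+#i';

    preg_match_all($pattern, $text, $matches);

    // Sentence punctuation often trails a URL in prose; trim it off.
    // Assumption: a URL never legitimately ends in one of these characters.
    return array_map(
        function ($url) {
            return rtrim($url, '.,;:!?)');
        },
        $matches[0]
    );
}

// Test data from the post
$str = 'This is an awesome post! Check it out: http://google.com and or https://google.com.';
print_r(extract_http_urls($str));
```

Because the pattern insists on a scheme, it will not find bare "www" hosts or FTP links; that is the trade-off for not having to enumerate TLDs.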
http://www.regexr.com/3bqqh
http://php.net/manual/en/function.preg-match.php