• Status: Solved

Search for URL in string php

I'm trying to clean some data when users input it, and I need to be able to look for URLs and grab them.

So for example:

This is an awesome post!  Check it out: http://google.com and or https://google.com.


OK, so I have that post, say in a PHP variable.  How do I look for http:// or https:// and then grab the full URL in PHP?
Asked by: Nathan Riley
2 Solutions
 
James Bilous (Software Engineer) commented:
You'll want to use REGEX with preg_match to extract the desired substring from a string in a variable. See:

http://www.regexr.com/3bqqh
http://php.net/manual/en/function.preg-match.php
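A minimal sketch of that suggestion, assuming every URL starts with http:// or https:// and runs until the next whitespace. The pattern and the trailing-punctuation trim are illustrative assumptions, not the pattern behind the regexr link above. It uses preg_match_all() rather than preg_match() because the sample text contains more than one URL.

```php
<?php
// Sample text from the question.
$text = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

// Assumed pattern: "http://" or "https://" followed by a run of non-whitespace characters.
preg_match_all('#https?://\S+#i', $text, $matches);

// The \S+ run can swallow sentence punctuation at the end of a URL, so trim it off.
$urls = array_map(
    function ($url) {
        return rtrim($url, '.,;:!?)');
    },
    $matches[0]
);

print_r($urls);
```

With the sample text, $urls should hold http://google.com and https://google.com. A URL that legitimately ends in one of the trimmed characters would be shortened, which is one of the trade-offs of keeping the pattern this simple.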
 
Ray Paseur commented:
This is an interesting question and has been with us for many years, if not decades.  Enormous volumes have been written about this question.  I even used it as an example in an E-E article to illustrate the process of test-driven development, back in the day before automated testing "grew up."

The quality of the results in an application like this is highly dependent on the detailed problem definition, and the quality of your test data.  String parsing with regular expressions can be dicey!  The sort of questions we need to consider include "Must the protocol always be HTTP or HTTPS?"  Or "Can we include FTP, too?"  Or "What if it says 'www' but has no leading protocol?"  Or "What TLDs, besides '.com', must I locate?"  In practice you will probably come up with more questions than answers!  Eventually you will get to a regular expression that is "good enough" but that may not cover 100% of the edge and corner cases.

Here's an article that describes the thought process and the way we write the programming:
https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
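One way to make the problem definition concrete is to write the test data down as pairs of input and expected output before settling on a pattern. The sketch below is only an illustration of that idea: extractUrls() is a hypothetical helper, its regular expression is an assumption chosen for brevity, and the cases are examples rather than a complete specification.

```php
<?php
// Hypothetical helper: finds http:// and https:// URLs only (an assumed, deliberately narrow rule).
function extractUrls(string $text): array
{
    preg_match_all('#\bhttps?://[^\s<>"]+#i', $text, $m);

    // Trim sentence punctuation that may trail a URL.
    return array_map(
        function ($url) {
            return rtrim($url, '.,;:!?)');
        },
        $m[0]
    );
}

// Each case pairs an input string with the URLs we expect to get back.
$cases = [
    ['Visit http://example.com today',        ['http://example.com']],
    ['Secure: https://example.org/path?x=1.', ['https://example.org/path?x=1']],
    ['No links here',                         []],
    // Documents a known gap: no protocol means no match under this rule.
    ['www.example.com without protocol',      []],
];

$failures = 0;
foreach ($cases as [$input, $expected]) {
    if (extractUrls($input) !== $expected) {
        $failures++;
        echo "FAIL: $input\n";
    }
}

echo $failures === 0 ? "All cases pass\n" : "$failures case(s) failed\n";
```

Each new question about the requirements (FTP, bare "www", unusual TLDs) becomes one more row in $cases, and the pattern is adjusted until the rows pass again.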

Here's an example that uses your test data.  It contains comments to explain how the regular expression works:
https://iconoun.com/demo/temp_nathan_riley.php
<?php // demo/temp_nathan_riley.php
/**
 * https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
 *
 * https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
 */
error_reporting(E_ALL);


// TEST DATA FROM THE POST AT E-E
$str = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$rgx
= '#'         // REGEX DELIMITER

. '\b'        // ON WORD BOUNDARY

. '('         // START GROUP
. 'https?'    // HTTP OR HTTPS
. '|'         // OR
. 'ftps?'     // FTP OR FTPS
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '://'       // COLON, SLASH, SLASH
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // A SUBDOMAIN
. '+?'        // INDETERMINATE LENGTH
. '\.'        // A DOT (ESCAPED)
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. ')'         // END GROUP

. '('         // START GROUP
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. ')'         // END GROUP

. '('         // START GROUP
. '[A-Z]'     // CHARACTER CLASS ALPHA
. '{2,7}'     // LENGTH IS TWO TO SEVEN
. ')'         // END GROUP

. '\b'        // ON WORD BOUNDARY

. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// LOCATE THE URLS
preg_match_all($rgx, $str, $mat);

// SHOW THE WORK PRODUCT
print_r($mat[0]);

// ACTIVATE THIS TO SEE ALL OF THE URL PIECES
// print_r($mat);

 
Dave Baldwin (Fixer of Problems) commented:
I've been having to 'scrape' a lot of internal data.  I never use regular expressions.  My favorite method these days is to use 'strpos()' to locate the constant at the beginning of the string, which in this case will be 'http'.  Then I use 'substr()' to grab a string, starting with that constant, that is long enough to contain the data I want.  Then I use 'explode()' to split the sub-string on a constant.  In this case it would be a space ' ' because a space isn't valid in a URL and usually follows one in text.  You may have to account for a period or other white space at the end.
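The steps described here can be sketched roughly as follows. This is an interpretation of the description, not Dave's actual code: the loop, the variable names, and the set of trailing punctuation to trim are assumptions, and only a plain space is treated as the end of a URL.

```php
<?php
// Sample text from the question.
$text = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

$urls   = [];
$offset = 0;

// Find each occurrence of the constant 'http', searching from $offset onward.
while (($start = strpos($text, 'http', $offset)) !== false) {
    // Grab everything from 'http' to the end of the text...
    $tail = substr($text, $start);

    // ...and split on a space; the first piece is the candidate URL.
    $pieces    = explode(' ', $tail);
    $candidate = $pieces[0];

    // Resume the search just past this candidate.
    $offset = $start + strlen($candidate);

    // Account for a period or similar punctuation at the end.
    $urls[] = rtrim($candidate, '.,;:!?');
}

print_r($urls);
```

With the sample text, $urls should hold http://google.com and https://google.com. Any word that merely contains 'http' would also be picked up, so real input may need a stricter starting constant such as 'http://'.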
 
Ray Paseur commented:
What Dave's describing is often called a "state engine."  There are many variants of state engines; for string parsing, they are usually self-aware and walk through a document one character at a time.  When they are "in-state" they process the contents of the document as commands; when they are "out-of-state" they simply return the contents of the document as text.  HTML parsers are usually state engines.  The tags are in-state; the content is out-of-state.  We can simplify a state engine to find bounded substrings that will enable us to extract the URLs.  Here's an example.
https://iconoun.com/demo/state_engine_substrings.php
<?php // demo/state_engine_substrings.php
/**
 * https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
 *
 * https://en.wikipedia.org/wiki/Finite-state_machine
 */
error_reporting(E_ALL);

// TEST DATA FROM THE POST AT E-E
$test = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

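class StateEngine
{
    // DECLARED EXPLICITLY; DYNAMIC (UNDECLARED) PROPERTIES ARE DEPRECATED AS OF PHP 8.2
    public $string;
    public $alpha;
    public $omega;
    public $results = [];

    public function __construct($s)
    {
        $this->string = $s;
    }
    public function setAlpha($a)
    {
        $this->alpha = $a;
    }
    public function setOmega($z)
    {
        $this->omega = $z;
    }
    public function getResults()
    {
        // START FRESH SO REPEATED CALLS DO NOT DUPLICATE RESULTS
        $this->results = [];

        // SPLIT ON THE STARTING DELIMITER; THE FIRST PIECE PRECEDES ANY MATCH
        $arr = explode($this->alpha, $this->string);
        unset($arr[0]);
        foreach ($arr as $substring)
        {
            // KEEP ONLY THE TEXT BEFORE THE ENDING DELIMITER
            $sub = explode($this->omega, $substring);

            // SKIP PIECES THAT NEVER REACH THE ENDING DELIMITER
            if (count($sub) < 2) continue;

            $this->results[] = $this->alpha . $sub[0] . $this->omega;
        }
        return $this->results;
    }
}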

$se = new StateEngine($test);
$se->setAlpha('http');
$se->setOmega('.com');
$res = $se->getResults();
print_r($res);

Outputs:
Array ( [0] => http://google.com [1] => https://google.com )
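For comparison, the character-at-a-time walk described earlier might look like the sketch below. The function name scanUrls(), the rule that whitespace ends a URL, and the trailing-punctuation trim are all assumptions made for illustration; the only state tracked is whether the scanner is currently inside a URL.

```php
<?php
// Hypothetical function: walks the text one character at a time.
function scanUrls(string $text): array
{
    $urls    = [];
    $current = '';
    $inUrl   = false;            // the single piece of state
    $length  = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];

        if (!$inUrl) {
            // Out of state: watch for a protocol prefix starting at this position.
            if (substr($text, $i, 7) === 'http://' || substr($text, $i, 8) === 'https://') {
                $inUrl   = true;
                $current = $char;
            }
        } elseif (strpos(" \t\r\n", $char) !== false) {
            // Whitespace ends the URL: store it and leave the state.
            $urls[]  = rtrim($current, '.,;:!?');
            $inUrl   = false;
            $current = '';
        } else {
            // In state: keep collecting characters.
            $current .= $char;
        }
    }

    // A URL that runs to the end of the text is still open at this point.
    if ($inUrl) {
        $urls[] = rtrim($current, '.,;:!?');
    }

    return $urls;
}

print_r(scanUrls('This is an awesome post!  Check it out: http://google.com and or https://google.com.'));
```

Unlike the bounded-substring version, this walk does not depend on the URL ending in '.com', at the cost of a longer listing.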
