Search for URL in string php

Posted on 2016-07-25
Last Modified: 2016-08-08
I'm trying to clean up data that users submit, and I need to be able to find any URLs in the text and grab them.

So for example:

This is an awesome post!  Check it out: http://google.com and or https://google.com.


OK, so say I have that post in a PHP variable.  How do I look for http:// or https:// and then grab the full URL?
Question by:Nathan Riley
LVL 9

Expert Comment

by:James Bilous
ID: 41728807
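You'll want to use a regular expression with preg_match() to extract the desired substring from the string in your variable (or preg_match_all() if the string can contain more than one URL, since preg_match() stops at the first match). See: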

http://www.regexr.com/3bqqh
http://php.net/manual/en/function.preg-match.php
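
To make that concrete, here is a minimal sketch using the sample text from the question. The pattern is my own assumption (it is not the one behind the regexr link above): "http://" or "https://" followed by any run of non-whitespace. It is deliberately loose, so treat it as a starting point rather than a complete URL matcher.

```php
<?php
// Sample text from the question.
$text = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

// ASSUMED pattern: http:// or https:// followed by one or more non-whitespace characters.
$pattern = '#https?://\S+#i';

// preg_match_all() collects every match; $matches[0] holds the full matched strings.
preg_match_all($pattern, $text, $matches);

// \S+ also swallows sentence punctuation that touches the URL, so trim it off the end.
$urls = array_map(function ($url) {
    return rtrim($url, '.,;:!?)');
}, $matches[0]);

print_r($urls);
```

The rtrim() step is a design choice for prose input: it assumes a URL never legitimately ends in one of those punctuation characters, which is usually but not always true.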
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 41728829
This is an interesting question and has been with us for many years, if not decades.  Enormous volumes have been written about this question.  I even used it as an example in an E-E article to illustrate the process of test-driven development, back in the day before automated testing "grew up."

The quality of the results in an application like this is highly dependent on the detailed problem definition, and the quality of your test data.  String parsing with regular expressions can be dicey!  The sort of questions we need to consider include "Must the protocol always be HTTP or HTTPS?"  Or "Can we include FTP, too?"  Or "What if it says 'www' but has no leading protocol?"  Or "What TLDs, besides '.com', must I locate?"  In practice you will probably come up with more questions than answers!  Eventually you will get to a regular expression that is "good enough" but that may not cover 100% of the edge and corner cases.
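
One way to pin those questions down is to write each decision as a test case before touching the regular expression. The sketch below is only an illustration: find_urls() is a hypothetical helper, its pattern is an assumption, and the expected values encode one possible set of answers (protocol required, HTTP/HTTPS only). Change the expectations if your requirements differ.

```php
<?php
// HYPOTHETICAL helper: returns every http(s) URL found in $text.
function find_urls(string $text): array
{
    // ASSUMED pattern: protocol is required; the URL runs until whitespace, <, > or a double quote.
    preg_match_all('#\bhttps?://[^\s<>"]+#i', $text, $m);

    // Trim sentence punctuation that touches the end of a URL.
    return array_map(function ($u) {
        return rtrim($u, '.,;:!?)');
    }, $m[0]);
}

// Each key is an input string; each value is the result we have DECIDED we want.
$cases = [
    'See http://google.com and https://google.com.' => ['http://google.com', 'https://google.com'],
    'No links here'                                 => [],
    // Decision: a bare "www" without a protocol is ignored.
    'Visit www.example.com today'                   => [],
    // Decision: FTP is out of scope.
    'Get it from ftp://example.org/file.txt'        => [],
    // Paths, query strings and multi-part TLDs should survive.
    'Path: https://example.co.uk/a/b?x=1, ok'       => ['https://example.co.uk/a/b?x=1'],
];

$failures = 0;
foreach ($cases as $input => $expected) {
    $ok = (find_urls($input) === $expected);
    if (!$ok) {
        $failures++;
    }
    echo ($ok ? 'PASS' : 'FAIL'), ': ', $input, PHP_EOL;
}
echo $failures, ' failure(s)', PHP_EOL;
```

When a new edge case turns up in real data, add it to $cases first, watch it fail, and then adjust the pattern.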

Here's an article that describes the thought process and the way we write the programming:
https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html

Here's an example that uses your test data.  It contains comments to explain how the regular expression works:
https://iconoun.com/demo/temp_nathan_riley.php
<?php // demo/temp_nathan_riley.php
/**
 * https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
 *
 * https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
 */
error_reporting(E_ALL);


// TEST DATA FROM THE POST AT E-E
$str = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$rgx
= '#'         // REGEX DELIMITER

. '\b'        // ON WORD BOUNDARY

. '('         // START GROUP
. 'https?'    // HTTP OR HTTPS
. '|'         // OR
. 'ftps?'     // FTP OR FTPS
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '://'       // COLON, SLASH, SLASH
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // A SUBDOMAIN
. '+?'        // INDETERMINATE LENGTH
. '\.'        // A DOT (ESCAPED)
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. ')'         // END GROUP

. '('         // START GROUP
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. ')'         // END GROUP

. '('         // START GROUP
. '[A-Z]'     // CHARACTER CLASS ALPHA
. '{2,7}'     // LENGTH IS TWO TO SEVEN
. ')'         // END GROUP

. '\b'        // ON WORD BOUNDARY

. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// LOCATE THE URLS
preg_match_all($rgx, $str, $mat);

// SHOW THE WORK PRODUCT
print_r($mat[0]);

// ACTIVATE THIS TO SEE ALL OF THE URL PIECES
// print_r($mat);

 
LVL 84

Accepted Solution

by:
Dave Baldwin earned 1000 total points
ID: 41728926
I've been having to 'scrape' a lot of internal data.  I never use regular expressions.  My favorite method these days is to locate the constant at the beginning of the string, which in this case will be 'http', using 'strpos()'.  Then I use 'substr()' to grab a string, starting at that constant, that is long enough to contain the data I want.  Then I use 'explode()' to split the sub-string on a delimiter.  In this case it would be a space ' ' because a space isn't valid in a URL and normally follows one in running text.  You may have to account for a period, other punctuation, or other white space at the end.
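
A rough sketch of those steps, using the sample text from the question. The 2048-character chunk size and the list of trailing characters to strip are assumptions for illustration, not part of the method itself.

```php
<?php
// Sample text from the question.
$text = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

$urls   = [];
$offset = 0;

// strpos() returns false once there is no further 'http' at or after $offset.
while (($start = strpos($text, 'http', $offset)) !== false) {
    // Grab a chunk assumed to be long enough to hold the whole URL (2048 is arbitrary).
    $chunk = substr($text, $start, 2048);

    // A space is not valid inside a URL, so the first piece is the candidate.
    $pieces    = explode(' ', $chunk);
    $candidate = $pieces[0];

    // Account for a period or other punctuation stuck to the end.
    $url = rtrim($candidate, ".,;:!?)\r\n\t");

    // Keep it only if it really begins with a protocol, not just the letters 'http'.
    if (strpos($url, 'http://') === 0 || strpos($url, 'https://') === 0) {
        $urls[] = $url;
    }

    // Continue searching after this candidate.
    $offset = $start + strlen($candidate);
}

print_r($urls);
```

Because the split is on a space only, text in which a URL is followed directly by a tab or newline would need that white space converted to spaces first.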
 
LVL 111

Assisted Solution

by:Ray Paseur
Ray Paseur earned 1000 total points
ID: 41729697
What Dave's describing is often called a "state engine."  There are many variants of state engines; for string parsing, they are usually self-aware and walk through a document a character at a time.  When they are "in-state" they process the contents of the document as commands; when they are "out-of-state" they simply return the contents of the document as text.  HTML parsers are usually state engines.  The tags are in-state; the content is out-of-state.  We can simplify a state engine to find bounded substrings that will enable us to extract the URLs.  Here's an example.
https://iconoun.com/demo/state_engine_substrings.php
<?php // demo/state_engine_substrings.php
/**
 * https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
 *
 * https://en.wikipedia.org/wiki/Finite-state_machine
 */
error_reporting(E_ALL);

// TEST DATA FROM THE POST AT E-E
$test = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

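class StateEngine
{
    public $results = [];

    // DECLARE THE PROPERTIES (DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2)
    public $string;
    public $alpha;
    public $omega;
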
    public function __construct($s)
    {
        $this->string = $s;
    }
    public function setAlpha($a)
    {
        $this->alpha = $a;
    }
    public function setOmega($z)
    {
        $this->omega = $z;
    }
    public function getResults()
    {
        $arr = explode($this->alpha, $this->string);
        unset($arr[0]);
        foreach ($arr as $substring)
        {
            $sub = explode($this->omega, $substring);
            $this->results[] = $this->alpha . $sub[0] . $this->omega;
        }
        return $this->results;
    }
}

$se = new StateEngine($test);
$se->setAlpha('http');
$se->setOmega('.com');
$res = $se->getResults();
print_r($res);


Outputs:
Array ( [0] => http://google.com [1] => https://google.com )

