Solved

Search for URL in string php

Posted on 2016-07-25
Last Modified: 2016-08-08
I'm trying to clean some data when users input it, and I need to be able to look for URLs and grab them.

So for example:

This is an awesome post!  Check it out: http://google.com and or https://google.com.


OK, so I have that post, say in a PHP variable.  How do I look for http:// or https:// and then grab the full URL in PHP?
Question by:N R
 
LVL 7

Expert Comment

by:James Bilous
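You'll want to use a regular expression with preg_match to extract the desired substring from a string in a variable. Note that preg_match stops at the first match; if the text can contain more than one URL, use preg_match_all. See: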

http://www.regexr.com/3bqqh
http://php.net/manual/en/function.preg-match.php
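
To make the suggestion above concrete, here is a minimal sketch using preg_match_all on the test data from the question. The function name extractUrls and the pattern itself are assumptions for illustration, not something taken from the linked pages; the pattern simply treats everything up to the next whitespace as part of the URL and then trims trailing sentence punctuation.

```php
<?php
// Hypothetical helper; the name and the pattern are illustrative assumptions.
function extractUrls(string $text): array
{
    // Match http:// or https:// followed by a run of non-whitespace characters.
    preg_match_all('#https?://\S+#i', $text, $matches);

    $urls = [];
    foreach ($matches[0] as $url) {
        // Strip trailing punctuation that usually belongs to the sentence, not the URL.
        $urls[] = rtrim($url, '.,;:!?)');
    }
    return $urls;
}

$str = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';
print_r(extractUrls($str));
```

With the sample post this returns http://google.com and https://google.com. The trailing-punctuation trim is a judgment call: it would also strip a closing parenthesis that legitimately ends a URL, so adjust it to your data.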
 
LVL 108

Expert Comment

by:Ray Paseur
This is an interesting question and has been with us for many years, if not decades.  Enormous volumes have been written about this question.  I even used it as an example in an E-E article to illustrate the process of test-driven development, back in the day before automated testing "grew up."

The quality of the results in an application like this is highly dependent on the detailed problem definition, and the quality of your test data.  String parsing with regular expressions can be dicey!  The sort of questions we need to consider include "Must the protocol always be HTTP or HTTPS?"  Or "Can we include FTP, too?"  Or "What if it says 'www' but has no leading protocol?"  Or "What TLDs, besides '.com', must I locate?"  In practice you will probably come up with more questions than answers!  Eventually you will get to a regular expression that is "good enough" but that may not cover 100% of the edge and corner cases.
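
One way to work through those questions is to write the answers down as test cases before settling on a pattern. The sketch below is only an illustration: the candidate pattern and the expected results are assumptions that encode one possible set of answers (HTTP, HTTPS, FTP and FTPS are accepted; a bare "www" with no protocol is not).

```php
<?php
// Candidate pattern under test. This is an assumption, not the regex from this thread.
$pattern = '#\b(?:https?|ftps?)://[A-Z0-9.-]+\.[A-Z]{2,}\b#i';

// Each input maps to the list of URLs we have decided it should yield.
$cases = [
    'See http://google.com now'   => ['http://google.com'],
    'Secure: https://google.com.' => ['https://google.com'],
    'Files at ftp://example.org'  => ['ftp://example.org'],
    'Just www.example.com here'   => [], // no protocol: change this if "www" should count
    'No links in this sentence'   => [],
];

$failures = 0;
foreach ($cases as $input => $expected) {
    preg_match_all($pattern, $input, $m);
    if ($m[0] !== $expected) {
        $failures++;
        echo "FAIL: $input\n";
    }
}
echo $failures === 0 ? "All cases pass\n" : "$failures case(s) failed\n";
```

When a new edge case turns up, add it to $cases first, watch it fail, and then adjust the pattern until every case passes again.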

Here's an article that describes the thought process and how we approach the programming:
https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html

Here's an example that uses your test data.  It contains comments to explain how the regular expression works:
https://iconoun.com/demo/temp_nathan_riley.php
<?php // demo/temp_nathan_riley.php
/**
 * https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
 *
 * https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
 */
error_reporting(E_ALL);


// TEST DATA FROM THE POST AT E-E
$str = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$rgx
= '#'         // REGEX DELIMITER

. '\b'        // ON WORD BOUNDARY

. '('         // START GROUP
. 'https?'    // HTTP OR HTTPS
. '|'         // OR
. 'ftps?'     // FTP OR FTPS
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '://'       // COLON, SLASH, SLASH
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // A SUBDOMAIN
. '+?'        // INDETERMINATE LENGTH
. '\.'        // A DOT (ESCAPED)
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. ')'         // END GROUP

. '('         // START GROUP
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. ')'         // END GROUP

. '('         // START GROUP
. '[A-Z]'     // CHARACTER CLASS ALPHA
. '{2,7}'     // LENGTH IS TWO TO SEVEN
. ')'         // END GROUP

. '\b'        // ON WORD BOUNDARY

. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// LOCATE THE URLS
preg_match_all($rgx, $str, $mat);

// SHOW THE WORK PRODUCT
print_r($mat[0]);

// ACTIVATE THIS TO SEE ALL OF THE URL PIECES
// print_r($mat);

 
LVL 82

Accepted Solution

by:Dave Baldwin
Dave Baldwin earned 250 total points
I've been having to 'scrape' a lot of internal data.  I never use regular expressions.  My favorite method these days is to locate the constant at the beginning of the string, which in this case will be 'http', using 'strpos()'.  Then I use 'substr()' to grab a string, starting at that constant, that is long enough to contain the data I want.  Then I use 'explode()' to split the sub-string on a delimiter.  In this case it would be a space ' ', because a space isn't valid in a URL and usually follows it in text.  You may have to account for a period at the end of the URL, or for other white space (a tab or a line break) following it instead of a space.
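
The steps described above might look roughly like the following. This is a sketch rather than Dave's actual code: the function name findUrlsByPosition is hypothetical, it splits only on a space, and it trims only a trailing period or comma.

```php
<?php
// Hypothetical helper illustrating the strpos() / substr() / explode() approach.
function findUrlsByPosition(string $text): array
{
    $urls   = [];
    $offset = 0;

    // Locate each occurrence of the constant 'http'.
    while (($start = strpos($text, 'http', $offset)) !== false) {
        // Take the text from the constant onward and cut it at the first space.
        $tail  = substr($text, $start);
        $parts = explode(' ', $tail, 2);

        // Drop a trailing period or comma left over from the sentence.
        $url = rtrim($parts[0], '.,');

        // Keep only candidates that really begin with http:// or https://.
        if (strpos($url, 'http://') === 0 || strpos($url, 'https://') === 0) {
            $urls[] = $url;
        }

        // Resume the search just past this candidate.
        $offset = $start + strlen($parts[0]);
    }

    return $urls;
}

$str = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';
print_r(findUrlsByPosition($str));
```

With the sample post this returns http://google.com and https://google.com. Because it splits on a space only, a URL followed directly by a tab or line break would need extra handling.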
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
What Dave's describing is often called a "state engine" (a finite-state machine).  There are many variants of state engines; for string parsing, they usually keep track of their own state and walk through a document a character at a time.  When they are "in-state" they process the contents of the document as commands; when they are "out-of-state" they simply return the contents of the document as text.  HTML parsers are usually state engines.  The tags are in-state; the content is out-of-state.  We can simplify a state engine to find bounded substrings that will enable us to extract the URLs.  Here's an example.
https://iconoun.com/demo/state_engine_substrings.php
<?php // demo/state_engine_substrings.php
/**
 * https://www.experts-exchange.com/questions/28959518/Search-for-URL-in-string-php.html
 *
 * https://en.wikipedia.org/wiki/Finite-state_machine
 */
error_reporting(E_ALL);

// TEST DATA FROM THE POST AT E-E
$test = 'This is an awesome post!  Check it out: http://google.com and or https://google.com.';

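class StateEngine
{
    // DECLARED EXPLICITLY; DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2
    public $results = [];
    public $string;
    public $alpha;
    public $omega;

    public function __construct($s)
    {
        $this->string = $s;
    }
    public function setAlpha($a)
    {
        $this->alpha = $a;
    }
    public function setOmega($z)
    {
        $this->omega = $z;
    }
    public function getResults()
    {
        // START FRESH SO REPEATED CALLS DO NOT ACCUMULATE DUPLICATES
        $this->results = [];

        // SPLIT ON THE OPENING DELIMITER; THE FIRST PIECE PRECEDES ANY MATCH
        $arr = explode($this->alpha, $this->string);
        unset($arr[0]);
        foreach ($arr as $substring)
        {
            // SPLIT ON THE CLOSING DELIMITER; SKIP PIECES THAT NEVER REACH IT
            $sub = explode($this->omega, $substring);
            if (count($sub) < 2) continue;
            $this->results[] = $this->alpha . $sub[0] . $this->omega;
        }
        return $this->results;
    }
}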

$se = new StateEngine($test);
$se->setAlpha('http');
$se->setOmega('.com');
$res = $se->getResults();
print_r($res);

Outputs:
Array ( [0] => http://google.com [1] => https://google.com )
