Solved

How do I make a specific regular expression for PHP that searches for http first and, if it's not there, searches for www.? I need to grab domain names from a single line.

Posted on 2011-09-13
Last Modified: 2012-05-12
Hi,

How do I make a specific regular expression for PHP that searches for "http" first and, if it's not there, searches for "www.", and if that's also not there, searches for TLDs?

I need to grab domain names from a single line of text, bearing in mind that the input comes only from dyslexic people. The dots are sometimes replaced with commas, or a dot in the domain name is skipped altogether.

So my idea to solve this is:

1: search for http (if not found, go to approach 2)
2: search for www (if not found, go to approach 3)
3: search for all available TLD in this list http://data.iana.org/TLD/tlds-alpha-by-domain.txt
4: FAIL, I did my best.

I only need to return subdomain.domain.tld and, if there's a path, the path.

Thank you
Question by:Octalys
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 250 total points
ID: 36531104
If I understand your explanation correctly, then I believe this is what you are after:

preg_match('#((?:http|www\.|\S+\.(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW)\S+)(?=\s|$)#', $target, $match)

You should have the value in $match[1]. If it doesn't work, please post a "before" string and an "after" result.

P.S.

I truncated the multiple occurrences of "XN". I'm not entirely sure why it's listed more than once on the IANA page.
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36531121
I forgot a paren:
preg_match('#((?:http|www\.|\S+\.(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW))\S+)(?=\s|$)#', $target, $match)

 

Author Comment

by:Octalys
ID: 36531519
Hi,

Thank you for the answer. The regex looks good.

It finds most domains, but not this example;

"asdasdas subdomain,google.com asfds"

And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?

Thanks
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36531628
It finds most domains, but not this example;
Ah. The TLDs are all in caps. You would probably want case insensitivity on. Put a lower-case "i" after the last hash ( # ).

e.g.

preg_match('#...#i', $target, $match)

And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?
For that, I think it would be simpler to do a subsequent replace call on the captured value. For example:

preg_match...
$result = str_replace("http://", "", $match[1]);
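
Putting the two suggestions together, a minimal sketch might look like the following. The function name extract_host and the four-entry TLD list are assumptions made for illustration (the full IANA alternation would go in its place). The trailing \S+ is also relaxed to \S* here, which is a deliberate change so that a bare domain followed by whitespace can still match.

```php
<?php
// Hypothetical helper, not part of the original answer.
function extract_host($target)
{
    // ASSUMPTION: shortened TLD list standing in for the full alternation
    $tlds = 'COM|NET|NL|ORG';

    // Trailing "i" makes the match case-insensitive
    $pattern = '#((?:http|www\.|\S+\.(?:' . $tlds . '))\S*)(?=\s|$)#i';

    if (!preg_match($pattern, $target, $match)) {
        return null; // nothing that looks like a domain was found
    }

    // Strip a leading http:// or https:// from the captured value only
    return preg_replace('#^https?://#i', '', $match[1]);
}
```

Anchoring the strip to the start of the captured value avoids touching an "http://" that might appear later in a path or query string.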

 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36536414
Just a thought... It might be easier to do this in a few steps rather than try to do it all in a single REGEX.  Maybe you would want your script to read the page at http://data.iana.org/TLD/tlds-alpha-by-domain.txt and use the up-to-the-minute list.

If you want to post some test data and show us the expected results we might be able to give you a good function to return the desired information.
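
As a rough illustration of the "use the up-to-the-minute list" idea, the sketch below turns the text of the TLD list into a regex alternation. The function name build_tld_alternation and the sample text are assumptions; it also assumes that comment lines in the list start with "#". In a real script the text might come from file_get_contents() on the IANA URL, preferably cached rather than fetched on every request.

```php
<?php
// Hypothetical helper: convert the list text into "COM|AC|CO|NL"-style alternation.
function build_tld_alternation($listText)
{
    $tlds = array();
    foreach (preg_split('/\r\n|\r|\n/', $listText) as $line) {
        $line = trim($line);
        // Skip blank lines and comment lines (assumed to start with "#")
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // Escape anything special, including the "#" delimiter
        $tlds[] = preg_quote($line, '#');
    }
    // Longest entries first, so COM is tried before CO
    usort($tlds, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    return implode('|', $tlds);
}

// ASSUMPTION: a tiny literal sample keeps the sketch runnable offline
$sample = "# sample comment line\nAC\nCO\nCOM\nNL\n";
$alternation = build_tld_alternation($sample);
$pattern = '#(\S+\.(?:' . $alternation . '))(?=\s|$)#i';
```

Building the pattern once and reusing it keeps the list out of the source code, at the cost of depending on the list's format staying stable.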
 

Author Comment

by:Octalys
ID: 36553234
Yeah, I might split up the whole checking process later for a better overview of the code and to make parts of it easier to customise. But the regex is working really nicely.

The only problem is that it now also grabs email addresses, ignoring the @. Is it possible to fix this in the regex, or do I have to do an extra check?

The test data is very dynamic, because it's all user input. But because we work with dyslexic people, we have to be as creative as possible about the possible faulty inputs. Quite a challenge!
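
One possible form of the "extra check" is to drop email-like tokens before the domain regex ever sees them. This is only a sketch: the function name non_email_tokens is hypothetical, and it assumes that any whitespace-separated token containing "@" should be ignored.

```php
<?php
// Hypothetical pre-filter: keep only tokens that do not look like email addresses.
function non_email_tokens($target)
{
    // Split the line on runs of whitespace, discarding empty pieces
    $tokens = preg_split('/\s+/', trim($target), -1, PREG_SPLIT_NO_EMPTY);

    $kept = array();
    foreach ($tokens as $token) {
        if (strpos($token, '@') !== false) {
            continue; // contains an at-sign, so treat it as an email address
        }
        $kept[] = $token;
    }
    return $kept;
}
```

The surviving tokens can then be joined with spaces and passed to the domain pattern as before.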
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36554161
If you want to post some test data and show us the expected results we might be able to give you...

Trying to write code without test data, especially in the context of potentially dyslexic input strings, is a fool's errand.

Here is a code snippet that I have used to validate an email address.  It has a regular expression that seems to work fairly well.
<?php // RAY_email_validation.php
error_reporting(E_ALL);



// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE



// SEE MAN PAGE: http://php.net/manual/en/intro.filter.php
function check_valid_email($email)
{
    // IF PHP 5.2 OR ABOVE, WE CAN USE THE FILTER
    if (strnatcmp(phpversion(),'5.2') >= 0)
    {
        if(filter_var($email, FILTER_VALIDATE_EMAIL) === FALSE) return FALSE;
    }

    // IF LOWER-LEVEL PHP, WE CAN CONSTRUCT A REGULAR EXPRESSION
    else
    {
        $regex
        = '/'                        // START REGEX DELIMITER
        . '^'                        // START STRING
        . '[A-Z0-9_-]'               // AN EMAIL - SOME CHARACTER(S)
        . '[A-Z0-9._-]*'             // AN EMAIL - SOME CHARACTER(S) PERMITS DOT
        . '@'                        // A SINGLE AT-SIGN
        . '([A-Z0-9][A-Z0-9-]*\.)+'  // A DOMAIN NAME PERMITS DOT, ENDS DOT
        . '[A-Z\.]'                  // A TOP-LEVEL DOMAIN PERMITS DOT
        . '{2,6}'                    // TLD LENGTH >= 2 AND <= 6
        . '$'                        // ENDOF STRING
        . '/'                        // ENDOF REGEX DELIMITER
        . 'i'                        // CASE INSENSITIVE
        ;
        // TEST THE STRING FORMAT
        if (!preg_match($regex, $email)) return FALSE;
    }

    // FILTER_VAR OR PREG_MATCH DOES NOT TEST IF THE DOMAIN IS ROUTABLE
    $domain = explode('@', $email);

    // MAN PAGE: http://php.net/manual/en/function.checkdnsrr.php
    if ( checkdnsrr($domain[1], "MX") || checkdnsrr($domain[1], "A") ) return TRUE;

    // EMAIL IS NOT ROUTABLE
    return FALSE;
}



// DEMONSTRATE THE FUNCTION IN ACTION
$e = NULL;
$safe = '';
if (!empty($_GET["e"]))
{
    $e = $_GET["e"];

    // ESCAPE THE USER INPUT BEFORE ECHOING IT BACK INTO THE PAGE
    $safe = htmlspecialchars($e, ENT_QUOTES);
    if (check_valid_email($e))
    {
        echo "<br/>VALID: $safe \n";
    }
    else
    {
        echo "<br/>BOGUS: $safe \n";
    }
}


// END OF PROCESSING - CREATE THE FORM USING HEREDOC NOTATION
$form = <<<ENDFORM
<form>
TEST A STRING FOR A VALID EMAIL ADDRESS:
<input name="e" value="$safe" />
<input type="submit" />
</form>
ENDFORM;

echo $form;

 

Author Comment

by:Octalys
ID: 36555557
I have no test data besides some stuff I can come up with.

But kaufmed, I tried this today and the results were:

$target = "test data1 data2 domain.com sdc";
the regex you gave me finds domain.com

$target = "test data1 data2 domain.nl sdc";
domain.nl can't be found
 
LVL 35

Expert Comment

by:Terry Woods
ID: 36555986
Adjusting kaufmed's pattern, I think this might work a little better for your approach #3 (I suggest keeping the first 2 as separate patterns, to reduce complexity):
preg_match('#(\S+[,.]){1,}(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW)(?=\s|$)#i', $target, $match);
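
To illustrate keeping the three approaches as separate patterns, here is a hedged sketch of the cascade. The function name find_domain and the short TLD list are assumptions, the third pattern is a variation on the one above rather than a copy of it, and replacing every comma with a dot is a simplification that would also alter commas inside a path.

```php
<?php
// Hypothetical cascade: try each approach in turn and stop at the first hit.
function find_domain($target)
{
    // ASSUMPTION: shortened TLD list standing in for the full alternation
    $tlds = 'COM|NET|NL|ORG';

    $patterns = array(
        // Approach 1: anything after http:// or https://
        '#https?://(\S+)#i',
        // Approach 2: a token starting with www followed by a dot or comma
        '#\b(www[.,]\S+)#i',
        // Approach 3: token-initial labels joined by dot or comma, ending in a known TLD
        '#(?<!\S)((?:[^\s,.@]+[,.])+(?:' . $tlds . ')(?:/\S*)?)(?=\s|$)#i',
    );

    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $target, $match)) {
            // Treat a comma typed in place of a dot as a dot
            return str_replace(',', '.', $match[1]);
        }
    }

    return null; // Approach 4: nothing found
}
```

Because the third pattern must start at the beginning of a token and its labels may not contain "@", an address such as me@example.com is not picked up by it.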

 
LVL 110

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 36556359
I have no test data besides some stuff I can come up with.

Great, let's start with that!  But before you waste too much time guessing about how the code and data might interact, learn about this software development technique.  There is a reason that professional developers get excellent results very quickly, whereas amateurs tend to work more slowly and often write brittle and ineffective code.
http://www.extremeprogramming.org/rules/testfirst.html
http://en.wikipedia.org/wiki/Test-driven_development

Now let's look at the test data posted at ID:36555557:
$target = "test data1 data2 domain.com sdc";
$target = "test data1 data2 domain.nl sdc";


The probable distinguishing characteristics of these strings are (1) the substrings are separated by whitespace, (2) one and only one of the substrings contains a dot, (3) all strings begin with "test" and end with the substring "sdc".  Are all of these distinguishing characteristics relevant and representative of your expected input?  If not, you might try to make some variations as you go about Job#1 which is the creation of the test data set.  Then as you make these variations to the test data, you want to feed the new test data to the algorithm and see if it fails.  If you are building the test data correctly, it should fail with each new iteration, and you should add new features to the software (probably to the regex pattern) to remediate the failures.  At some point you will be satisfied that you have enough test cases and you can stop adding variations to the test data.  Put the test data and algorithms on the "shelf" in case you need them again later.

A good starting place might be something like this code snippet.  A good next step might be to modify the $targets array to include a new element with a subdomain like maybe www.example.org.  

Then maybe you would want to add specific tests for .com, .nl, .museum, etc.

At some point the test data and the regex will become longer and more complex, but you do not want to start with long and complex stuff that throws you into a complicated debugging activity at the beginning.  The point is that you build up the test data and the program together, incrementally.
<?php // RAY_temp_octalys.php
error_reporting(E_ALL);
echo "<pre>";

// THE ONLY TEST DATA THE AUTHOR HAS GIVEN US
$targets
= array
( "test data1 data2 domain.com sdc"
, "test data1 data2 domain.nl sdc"
)
;

// A REGEX THAT FINDS THE DOMAIN SUBSTRINGS
$regex
= '#'         // REGEX DELIMITER
. '('         // START GROUP
. '[A-Z0-9]'  // ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // EXACTLY ONE
. '[A-Z]'     // ALPHA ONLY
. '{2,6}'     // LENGTH 2 TO 6
. ')'         // ENDOF GROUP
. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// TEST THE DATA STRINGS
foreach ($targets as $target)
{
    preg_match_all($regex, $target, $match);

    // THERE IS ONLY ONE GROUP THAT WE NEED TO FIND
    echo PHP_EOL;
    echo isset($match[1][0]) ? $match[1][0] : 'NO MATCH';
}
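
A possible next iteration, sketched under assumptions: each test string is paired with the result it is expected to produce, so a failing case is reported instead of being spotted by eye. The run_cases function, the extra test strings and the adjusted regex (which allows more than one dot-separated label) are all hypothetical additions.

```php
<?php
// Hypothetical test table: input string => expected domain (null when none is expected)
$cases = array(
    'test data1 data2 domain.com sdc' => 'domain.com',
    'test data1 data2 domain.nl sdc'  => 'domain.nl',
    'see www.example.org for more'    => 'www.example.org',
    'no domain in this line'          => null,
);

// Candidate regex: one or more labels each followed by a dot, then a 2-6 letter TLD
$regex = '#((?:[A-Z0-9-]+\.)+[A-Z]{2,6})#i';

// Return the inputs whose actual result differs from the expected one
function run_cases($regex, array $cases)
{
    $failures = array();
    foreach ($cases as $input => $expected) {
        $actual = preg_match($regex, $input, $m) ? $m[1] : null;
        if ($actual !== $expected) {
            $failures[] = $input;
        }
    }
    return $failures;
}

$failures = run_cases($regex, $cases);
if (count($failures) === 0) {
    echo "ALL PASS" . PHP_EOL;
} else {
    echo "FAILED: " . implode(' | ', $failures) . PHP_EOL;
}
```

Adding a new row to $cases before touching the regex keeps the test data and the pattern growing together, one small step at a time.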

 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36581152
This was a great question, and it got me thinking about the programming process.  So I wrote an article showing exactly how I would go about writing a program to grab the domain names.  It's not intended to be a solution, just an illustration of the thought process used in iterative development.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
 

Author Closing Comment

by:Octalys
ID: 36934101
Hi,

Thank you for the answers and help. My original question got answered. I used a part of the regex given by kaufmed.

But after reading Ray_Paseur's post, I completely rewrote my whole approach. I'm still using the regex, but not all of it.

Good article Ray!