Solved

How do I make a specific regular expression for PHP that searches for http first and, if it's not there, searches for www? I need to grab domain names from a single line.

Posted on 2011-09-13
Last Modified: 2012-05-12
Hi,

How do I make a specific regular expression for PHP that searches for "http" first, then, if it's not there, searches for "www.", and if that is also not there, searches for the TLDs?

I need to grab domain names from a single line of text, bearing in mind that all of our users are dyslexic. The dots are sometimes replaced with commas, or a dot in the domain name is skipped altogether.

So my idea to solve this is;

1: search for http (if not found, go to approach 2)
2: search for www (if not found, go to approach 3)
3: search for all available TLD in this list http://data.iana.org/TLD/tlds-alpha-by-domain.txt
4: FAIL, I did my best.

I only need to return subdomain.domain.tld and, if there is a path, the path.

Thank you
Question by:Octalys

Accepted Solution

by: käµfm³d 👽 (earned 250 total points)
If I understand your explanation correctly, then I believe this is what you are after:

preg_match('#((?:http|www\.|\S+\.(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW)\S+)(?=\s|$)#', $target, $match)


You should have the value in $match[1]. If it doesn't work, please post a "before" string and an "after" result.

P.S.

I truncated the multiple occurrences of "XN". I'm not entirely sure why it's listed more than once on the IANA page.

Expert Comment

by:käµfm³d 👽
I forgot a paren:
preg_match('#((?:http|www\.|\S+\.(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW))\S+)(?=\s|$)#', $target, $match)


Author Comment

by:Octalys
Hi,

Thank you for the answer. The regex looks good.

It finds most domains, but not this example;

"asdasdas subdomain,google.com asfds"

And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?

Thanks

Expert Comment

by:käµfm³d 👽
"It finds most domains, but not this example;"
Ah. The TLDs are all in caps. You would probably want case insensitivity on. Put a lower-case i after the last hash ( # ).

e.g.

preg_match('#...#i', $target, $match)


"And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?"
For that, I think it would be simpler to do a subsequent replace call. For example:

preg_match('#...#i', $target, $match);
$result = str_replace("http://", "", $match[1]);
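
A slightly fuller sketch of the two suggestions above (the "i" modifier, then stripping the scheme), for anyone who wants something runnable. The helper name grab_domain and the four-entry TLD list are assumptions made for illustration; the full TLD list would go in place of the short one. It also assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper, not part of the original answer: match with the
// case-insensitive "i" modifier, then strip a leading scheme from the capture.
function grab_domain(string $target): ?string
{
    // Shortened TLD list for readability; the full list would replace it.
    $pattern = '#((?:https?://|www\.)\S+|\S+\.(?:com|net|org|nl)(?:/\S*)?)(?=\s|$)#i';

    if (!preg_match($pattern, $target, $match)) {
        return null; // nothing that looks like a domain
    }

    // preg_match() only returns 1 or 0; the captured text is in $match[1].
    return preg_replace('#^https?://#i', '', $match[1]);
}

echo grab_domain('go to HTTP://Sub.Example.COM/path now'), PHP_EOL; // Sub.Example.COM/path
```

Using preg_replace with an anchored, case-insensitive pattern instead of str_replace means "HTTP://" and "https://" are stripped as well, and only at the start of the captured text.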


Expert Comment

by:Ray Paseur
Just a thought... It might be easier to do this in a few steps rather than try to do it all in a single REGEX.  Maybe you would want your script to read the page at http://data.iana.org/TLD/tlds-alpha-by-domain.txt and use the up-to-the-minute list.
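
As a rough sketch of that idea: the function below turns the text of the TLD list into a regex alternation. The name build_tld_alternation and the sample text are made up for illustration. In a real script the text could come from file_get_contents() on the URL above, assuming allow_url_fopen is enabled, and the result would ideally be cached rather than fetched on every request.

```php
<?php
// Hypothetical helper: build a regex alternation from the TLD list text.
// Lines starting with "#" are treated as comments, blank lines are skipped.
function build_tld_alternation(string $listText): string
{
    $seen = [];
    foreach (preg_split('/\R/', $listText) as $line) {
        $line = strtolower(trim($line));
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        $seen[$line] = true; // array keys remove duplicates
    }

    $tlds = array_map('strval', array_keys($seen));

    // Longest first, so that e.g. "com" is tried before "co".
    usort($tlds, function ($a, $b) {
        $byLength = strlen($b) - strlen($a);
        return $byLength !== 0 ? $byLength : strcmp($a, $b);
    });

    return implode('|', array_map(function ($tld) {
        return preg_quote($tld, '#');
    }, $tlds));
}

// Made-up sample in the same one-TLD-per-line layout as the real file.
$sample  = "# sample header line (comment)\nCO\nCOM\nNL\nMUSEUM\nCOM\n";
$pattern = '#\S+\.(?:' . build_tld_alternation($sample) . ')(?=\s|$)#i';
```

The pattern built at the end is only one possible way to use the alternation; any of the patterns in this thread could take the generated list instead of the hand-typed one.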

If you want to post some test data and show us the expected results we might be able to give you a good function to return the desired information.

Author Comment

by:Octalys
Yeah, I might split up the whole checking process later, for a better overview of the code and to make parts of it easier to customise. But the regex is working really nicely.

The only problem is that it now also grabs email addresses, ignoring the @. Is it possible to fix this in the regex, or do I have to do an extra check?

The test data is very dynamic, because it's all user input. But because we work with dyslexic people, we have to be as creative as possible in anticipating faulty input. Quite a challenge!
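
One way the extra check could look, sketched under the assumption that an email address always arrives as a single whitespace-separated token containing "@". The function name strip_email_tokens is hypothetical. It removes such tokens before the domain regex ever sees the line.

```php
<?php
// Hypothetical pre-filter: drop whitespace-separated tokens that contain "@"
// so that email addresses are never offered to the domain regex.
function strip_email_tokens(string $text): string
{
    $tokens = preg_split('/\s+/', trim($text));

    $kept = array_filter($tokens, function ($token) {
        return strpos($token, '@') === false;
    });

    return implode(' ', $kept);
}

echo strip_email_tokens('mail joe@example.com or see example.nl'), PHP_EOL;
// mail or see example.nl
```

Doing this as a separate step keeps the regex itself unchanged. The trade-off is that a token such as "user@host.com/page" is dropped entirely, which may or may not be what is wanted for this kind of input.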

Expert Comment

by:Ray Paseur
"If you want to post some test data and show us the expected results we might be able to give you..."

Trying to write code without test data, especially in the context of potentially dyslexic input strings, is a fool's errand.

Here is a code snippet that I have used to validate an email address.  It has a regular expression that seems to work fairly well.
<?php // RAY_email_validation.php
error_reporting(E_ALL);



// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE



// SEE MAN PAGE: http://php.net/manual/en/intro.filter.php
function check_valid_email($email)
{
    // IF PHP 5.2 OR ABOVE, WE CAN USE THE FILTER
    if (strnatcmp(phpversion(),'5.2') >= 0)
    {
        if(filter_var($email, FILTER_VALIDATE_EMAIL) === FALSE) return FALSE;
    }

    // IF LOWER-LEVEL PHP, WE CAN CONSTRUCT A REGULAR EXPRESSION
    else
    {
        $regex
        = '/'                        // START REGEX DELIMITER
        . '^'                        // START STRING
        . '[A-Z0-9_-]'               // AN EMAIL - SOME CHARACTER(S)
        . '[A-Z0-9._-]*'             // AN EMAIL - SOME CHARACTER(S) PERMITS DOT
        . '@'                        // A SINGLE AT-SIGN
        . '([A-Z0-9][A-Z0-9-]*\.)+'  // A DOMAIN NAME PERMITS DOT, ENDS DOT
        . '[A-Z\.]'                  // A TOP-LEVEL DOMAIN PERMITS DOT
        . '{2,6}'                    // TLD LENGTH >= 2 AND =< 6
        . '$'                        // ENDOF STRING
        . '/'                        // ENDOF REGEX DELIMITER
        . 'i'                        // CASE INSENSITIVE
        ;
        // TEST THE STRING FORMAT
        if (!preg_match($regex, $email)) return FALSE;
    }

    // FILTER_VAR OR PREG_MATCH DOES NOT TEST IF THE DOMAIN IS ROUTABLE
    $domain = explode('@', $email);

    // MAN PAGE: http://php.net/manual/en/function.checkdnsrr.php
    if ( checkdnsrr($domain[1], "MX") || checkdnsrr($domain[1], "A") ) return TRUE;

    // EMAIL IS NOT ROUTABLE
    return FALSE;
}



// DEMONSTRATE THE FUNCTION IN ACTION
$e = NULL;
$safe = '';
if (!empty($_GET["e"]))
{
    $e = $_GET["e"];

    // ESCAPE THE USER INPUT BEFORE ECHOING IT BACK INTO THE PAGE
    $safe = htmlspecialchars($e, ENT_QUOTES);
    if (check_valid_email($e))
    {
        echo "<br/>VALID: $safe \n";
    }
    else
    {
        echo "<br/>BOGUS: $safe \n";
    }
}


// END OF PROCESSING - CREATE THE FORM USING HEREDOC NOTATION
$form = <<<ENDFORM
<form>
TEST A STRING FOR A VALID EMAIL ADDRESS:
<input name="e" value="$safe" />
<input type="submit" />
</form>
ENDFORM;

echo $form;


Author Comment

by:Octalys
I have no test data besides some stuff I can come up with.

But kaufmed, I tried this today and the results were:

$target = "test data1 data2 domain.com sdc";
The regex you gave me finds domain.com.

$target = "test data1 data2 domain.nl sdc";
domain.nl can't be found.

Expert Comment

by:Terry Woods
Adjusting kaufmed's pattern, I think this might work a little better for your approach #3 (I suggest keeping the first 2 as separate patterns, to reduce complexity):
preg_match('#(\S+[,.]){1,}(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW)(?=\s|$)#i', $target, $match);
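
A sketch of how the three approaches could be kept as separate patterns and tried in order. The function name find_domain_candidate, the short TLD list, and the comma-to-dot normalisation are assumptions for illustration, and the code assumes PHP 7.1 or later. Note that the trailing \S+ in the earlier combined pattern requires at least one more character after the TLD, which appears to be why domain.nl was missed while domain.com still matched (as "CO" followed by "m"). The TLD pattern here has no such trailing \S+.

```php
<?php
// Hypothetical helper: try approach 1 (http), then 2 (www), then 3 (TLD list)
// and return the first candidate found, or null if all three fail.
function find_domain_candidate(string $text, array $tlds = ['com', 'net', 'org', 'nl']): ?string
{
    $tldGroup = implode('|', array_map('preg_quote', $tlds));

    $patterns = [
        // Approach 1: anything after http:// or https://
        '#https?://(\S+)#i',
        // Approach 2: a token starting with "www" plus a dot or comma
        '#(?<!\S)(www[.,]\S+)#i',
        // Approach 3: labels joined by dots or commas, ending in a known TLD,
        // with an optional path. "@" is excluded so emails are not matched.
        '#(?<!\S)((?:[^\s.,@/]+[.,])+(?:' . $tldGroup . ')(?:/\S*)?)(?=\s|$)#i',
    ];

    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $text, $m)) {
            // Assumption: every comma in the match was meant to be a dot.
            return str_replace(',', '.', $m[1]);
        }
    }

    return null; // approach 4: give up
}

echo find_domain_candidate('asdasdas subdomain,google.com asfds'), PHP_EOL;
// subdomain.google.com
```

Because each approach is its own pattern, a failing case can be traced to one short expression instead of one very long one.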


Assisted Solution

by: Ray Paseur (earned 250 total points)
"I have no test data besides some stuff I can come up with."

Great, let's start with that!  But before you waste too much time guessing about how the code and data might interact, learn about this software development technique.  There is a reason that professional developers get excellent results very quickly, whereas amateurs tend to work more slowly and often write brittle and ineffective code.
http://www.extremeprogramming.org/rules/testfirst.html
http://en.wikipedia.org/wiki/Test-driven_development

Now let's look at the test data posted at ID:36555557:
$target = "test data1 data2 domain.com sdc";
$target = "test data1 data2 domain.nl sdc";


The probable distinguishing characteristics of these strings are (1) the substrings are separated by whitespace, (2) one and only one of the substrings contains a dot, (3) all strings begin with "test" and end with the substring "sdc".  Are all of these distinguishing characteristics relevant and representative of your expected input?  If not, you might try to make some variations as you go about Job#1 which is the creation of the test data set.  Then as you make these variations to the test data, you want to feed the new test data to the algorithm and see if it fails.  If you are building the test data correctly, it should fail with each new iteration, and you should add new features to the software (probably to the regex pattern) to remediate the failures.  At some point you will be satisfied that you have enough test cases and you can stop adding variations to the test data.  Put the test data and algorithms on the "shelf" in case you need them again later.

A good starting place might be something like this code snippet.  A good next step might be to modify the $targets array to include a new element with a subdomain like maybe www.example.org.  

Then maybe you would want to add specific tests for .com, .nl, .museum, etc.

At some point the test data and the regex will become longer and more complex, but you do not want to start with long and complex stuff that throws you into a complicated debugging activity at the beginning.  The point is that you build up the test data and the program together, incrementally.
<?php // RAY_temp_octalys.php
error_reporting(E_ALL);
echo "<pre>";

// THE ONLY TEST DATA THE AUTHOR HAS GIVEN US
$targets
= array
( "test data1 data2 domain.com sdc"
, "test data1 data2 domain.nl sdc"
)
;

// A REGEX THAT FINDS THE DOMAIN SUBSTRINGS
$regex
= '#'         // REGEX DELIMITER
. '('         // START GROUP
. '[A-Z0-9]'  // ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // EXACTLY ONE
. '[A-Z]'     // ALPHA ONLY
. '{2,6}'     // LENGTH 2 TO 6
. ')'         // ENDOF GROUP
. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// TEST THE DATA STRINGS
foreach ($targets as $target)
{
    preg_match_all($regex, $target, $match);

    // THERE IS ONLY ONE GROUP THAT WE NEED TO FIND
    echo PHP_EOL;
    echo $match[1][0];
}
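
To make the "add a variation and watch it fail" step concrete, here is a small sketch that pairs each input with the result we expect. The function name run_cases and the third test string are made up for illustration; the regex is the same pattern as in the snippet above, written on one line.

```php
<?php
// The pattern from the snippet above, on one line.
$regex = '#([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i';

// Each input string is paired with the domain we expect to get back.
$cases = [
    'test data1 data2 domain.com sdc' => 'domain.com',
    'test data1 data2 domain.nl sdc'  => 'domain.nl',
    'visit www.example.org now'       => 'www.example.org', // new variation
];

// Hypothetical harness: report PASS or FAIL per case and return the outcomes.
function run_cases(string $regex, array $cases): array
{
    $results = [];
    foreach ($cases as $input => $expected) {
        $found = preg_match($regex, $input, $match) ? $match[1] : null;
        $results[$input] = ($found === $expected);
        echo ($results[$input] ? 'PASS' : 'FAIL'), ": $input => ", var_export($found, true), PHP_EOL;
    }
    return $results;
}

$results = run_cases($regex, $cases);
```

The first two cases pass and the new one fails, because the pattern stops at "www.exampl". That failure is the cue to extend the regex, after which all three cases are run again.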


Expert Comment

by:Ray Paseur
This was a great question, and it got me thinking about the programming process.  So I wrote an article showing exactly how I would go about writing a program to grab the domain names.  It's not intended to be a solution, just an illustration of the thought process used in iterative development.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

Author Closing Comment

by:Octalys
Hi,

Thank you for the answers and help. My original question got answered. I used part of the regex given by kaufmed.

But after reading Ray_Paseur's post, I completely rewrote my whole approach. I am still using the regex, but not all of it.

Good article, Ray!