Octalys

How do I make a specific regular expression for PHP that searches for http first and, if it's not there, searches for www? I need to grab domain names from a single line.

Hi,

How do I make a specific regular expression for PHP that searches for "http" first, then for "www." if that is not there, and finally for the TLDs if that is also not there?

I need to grab domain names from a single line of text, keeping in mind that we work only with dyslexic people. The dots are sometimes replaced with commas, or a dot in the domain name is skipped altogether.

So my idea to solve this is:

1: search for http (if not found, go to approach 2)
2: search for www (if not found, go to approach 3)
3: search for all available TLDs in this list: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
4: FAIL, I did my best.

I only need to return subdomain.domain.tld and, if there is a path, the path.

Thank you
ASKER CERTIFIED SOLUTION
kaufmed
I forgot a paren:
preg_match('#((?:http|www\.|\S+\.(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW))\S+)(?=\s|$)#', $target, $match)

Octalys

ASKER

Hi,

Thank you for the answer. The regex looks good.

It finds most domains, but not this example;

"asdasdas subdomain,google.com asfds"

And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?

Thanks
It finds most domains, but not this example;
Ah. The TLDs are all in caps. You would probably want case insensitivity on. Put a lower-case i after the last hash ( # ).

e.g.

preg_match('#...#i', $target, $match)



And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?
For that, I think it would be simpler to do a subsequent replace call. For example:

preg_match('#...#i', $target, $match);
$result = str_replace("http://", "", $match[1]);

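A slightly more general variant, offered only as a sketch: preg_match() itself returns just 0, 1, or false, so the captured text has to come from $match. A case-insensitive preg_replace() anchored at the start of the string would also cover https:// and HTTP://. The helper name strip_scheme is made up for this example.

```php
<?php
// Hypothetical helper (not from the thread): removes a leading http:// or https://.
function strip_scheme(string $url): string
{
    // ^ anchors at the start, so an "http://" later in a path is left alone;
    // the i flag accepts HTTP://, Http://, and so on.
    return preg_replace('#^https?://#i', '', $url);
}

echo strip_scheme('http://subdomain.domain.tld/path'), "\n";
echo strip_scheme('HTTPS://www.example.com'), "\n";
```

In the snippet above this would be called as strip_scheme($match[1]) after a successful preg_match().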

Just a thought... It might be easier to do this in a few steps rather than try to do it all in a single REGEX.  Maybe you would want your script to read the page at http://data.iana.org/TLD/tlds-alpha-by-domain.txt and use the up-to-the-minute list.

If you want to post some test data and show us the expected results we might be able to give you a good function to return the desired information.
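To make the "few steps" idea concrete, here is a rough sketch rather than a finished solution. It assumes the TLD list has already been read into a string (one TLD per line, comment lines starting with #, which is how the IANA file is laid out); fetching it, for example with file_get_contents(), is left out. The function names tld_alternation and grab_domain and the three patterns are my own assumptions, not something taken from the answers above.

```php
<?php
// Builds a regex alternation such as "COM|NL|CO" from the text of a TLD list.
function tld_alternation(string $listText): string
{
    $tlds = array();
    foreach (preg_split('/\R/', $listText) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and the comment header
        }
        $tlds[] = preg_quote($line, '#');
    }
    // Longest first, so COM is tried before CO.
    usort($tlds, function ($a, $b) { return strlen($b) - strlen($a); });
    return implode('|', $tlds);
}

// Tries the three approaches in order; returns the first hit or null.
function grab_domain(string $line, string $tldAlternation): ?string
{
    $patterns = array(
        '#https?://(\S+)#i',   // 1: explicit scheme, captured without it
        '#(www[.,]\S+)#i',     // 2: leading www, dot or comma
        // 3: known TLD; must start and end at whitespace, so the part
        //    after the @ of an email address is not picked up
        '#(?<!\S)((?:[^\s,.@]+[,.])+(?:' . $tldAlternation . ')(?:/\S*)?)(?!\S)#i',
    );
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $line, $m)) {
            // Turn commas into dots in the host part only, not in the path.
            $parts = explode('/', $m[1], 2);
            $parts[0] = str_replace(',', '.', $parts[0]);
            return implode('/', $parts);
        }
    }
    return null;
}

// Shortened sample list, not the real file.
$alternation = tld_alternation("# sample list\nCOM\nNL\nCO\n");
echo grab_domain('asdasdas subdomain,google.com asfds', $alternation), "\n";
```

Keeping the three patterns separate makes each one easy to test and adjust on its own. A skipped dot (as opposed to a comma) is not handled here and would need test data to get right.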
Octalys

ASKER

Yeah, I might split up the whole checking process later, to get a better overview in the code and to make parts of it easier to customise. But the regex is working really nicely.

The only problem is that it now also grabs email addresses, ignoring the @. Is it possible to fix this in the regex, or do I have to do an extra check?

The test data is very dynamic, because it's all user input. But because we work with dyslexic people, we have to be as creative as possible about the possible faulty inputs. Quite a challenge!
If you want to post some test data and show us the expected results we might be able to give you...

Trying to write code without test data, especially in the context of potentially dyslexic input strings, is a fool's errand.

Here is a code snippet that I have used to validate an email address.  It has a regular expression that seems to work fairly well.
<?php // RAY_email_validation.php
error_reporting(E_ALL);



// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE
// SEE MAN PAGE: http://php.net/manual/en/intro.filter.php
function check_valid_email($email)
{
    // IF PHP 5.2 OR ABOVE, WE CAN USE THE FILTER
    if (version_compare(PHP_VERSION, '5.2.0', '>='))
    {
        if(filter_var($email, FILTER_VALIDATE_EMAIL) === FALSE) return FALSE;
    }

    // IF LOWER-LEVEL PHP, WE CAN CONSTRUCT A REGULAR EXPRESSION
    else
    {
        $regex
        = '/'                        // START REGEX DELIMITER
        . '^'                        // START STRING
        . '[A-Z0-9_-]'               // AN EMAIL - SOME CHARACTER(S)
        . '[A-Z0-9._-]*'             // AN EMAIL - SOME CHARACTER(S) PERMITS DOT
        . '@'                        // A SINGLE AT-SIGN
        . '([A-Z0-9][A-Z0-9-]*\.)+'  // A DOMAIN NAME PERMITS DOT, ENDS DOT
        . '[A-Z\.]'                  // A TOP-LEVEL DOMAIN PERMITS DOT
        . '{2,6}'                    // TLD LENGTH >= 2 AND <= 6
        . '$'                        // ENDOF STRING
        . '/'                        // ENDOF REGEX DELIMITER
        . 'i'                        // CASE INSENSITIVE
        ;
        // TEST THE STRING FORMAT
        if (!preg_match($regex, $email)) return FALSE;
    }

    // FILTER_VAR OR PREG_MATCH DOES NOT TEST IF THE DOMAIN IS ROUTABLE
    $domain = explode('@', $email);

    // MAN PAGE: http://php.net/manual/en/function.checkdnsrr.php
    if ( checkdnsrr($domain[1], "MX") || checkdnsrr($domain[1], "A") ) return TRUE;

    // EMAIL IS NOT ROUTABLE
    return FALSE;
}



// DEMONSTRATE THE FUNCTION IN ACTION
$e = NULL;
$safe = '';
if (!empty($_GET["e"]))
{
    $e = $_GET["e"];
    // ESCAPE USER INPUT BEFORE ECHOING IT BACK INTO HTML
    $safe = htmlspecialchars($e, ENT_QUOTES, 'UTF-8');
    if (check_valid_email($e))
    {
        echo "<br/>VALID: $safe \n";
    }
    else
    {
        echo "<br/>BOGUS: $safe \n";
    }
}


// END OF PROCESSING - CREATE THE FORM USING HEREDOC NOTATION
$form = <<<ENDFORM
<form>
TEST A STRING FOR A VALID EMAIL ADDRESS:
<input name="e" value="$safe" />
<input type="submit" />
</form>
ENDFORM;

echo $form;

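As for keeping email addresses out of the domain results, one simple extra check (a sketch under assumptions, not a tested solution) is to split the line on whitespace and skip any token that contains an @ before applying the domain pattern. The function name and the four-entry TLD list below are placeholders for illustration.

```php
<?php
// Hypothetical helper: returns domain-like tokens from a line, ignoring email addresses.
function grab_domains_skip_email(string $line): array
{
    // Shortened stand-in for the full TLD list; dot or comma accepted as separator.
    $pattern = '#^(?:[^\s,.]+[,.])+(?:COM|NET|ORG|NL)$#i';
    $found = array();
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    foreach (preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (strpos($token, '@') !== false) {
            continue; // looks like an email address, not a bare domain
        }
        if (preg_match($pattern, $token)) {
            $found[] = $token;
        }
    }
    return $found;
}

print_r(grab_domains_skip_email('mail me at joe@example.com or visit example.nl today'));
```

Checking for the @ per token keeps the domain pattern itself simple; whether a token with an @ should ever count as a domain is something only real test data can settle.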

Octalys

ASKER

I have no test data besides some stuff I can come up with.

But kaufmed, I tried this today and the results were:

$target = "test data1 data2 domain.com sdc";
The regex you gave me finds domain.com.

$target = "test data1 data2 domain.nl sdc";
domain.nl can't be found.
Adjusting kaufmed's pattern, I think this might work a little better for your approach #3 (I suggest keeping the first 2 as separate patterns, to reduce complexity):
preg_match('#(\S+[,.]){1,}(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW)(?=\s|$)#i', $target, $match);

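For what it's worth, a probable reason the earlier pattern finds domain.com but not domain.nl is its trailing \S+, which demands at least one more non-space character after the TLD. With domain.com the alternation can match CO and leave the m for \S+; with domain.nl nothing is left after NL. The sketch below reproduces this with a shortened TLD list (CO|COM|NL stands in for the full list, purely to keep it readable); find_domain is a made-up name.

```php
<?php
// Same shape as the earlier pattern; $tail is the part after the TLD group.
function find_domain(string $target, string $tail): ?string
{
    $tlds = 'CO|COM|NL'; // shortened stand-in for the full TLD list
    $pattern = '#((?:http|www\.|\S+\.(?:' . $tlds . '))' . $tail . ')(?=\s|$)#i';
    return preg_match($pattern, $target, $m) ? $m[1] : null;
}

$targets = array(
    'test data1 data2 domain.com sdc',
    'test data1 data2 domain.nl sdc',
);
foreach ($targets as $target) {
    // '\S+' is the original tail; '\S*' lets the domain end right at the TLD.
    var_dump(find_domain($target, '\S+'), find_domain($target, '\S*'));
}
```

Switching the tail to \S* keeps the optional path while no longer requiring one.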

This was a great question, and it got me thinking about the programming process.  So I wrote an article showing exactly how I would go about writing a program to grab the domain names.  It's not intended to be a solution, just an illustration of the thought process used in iterative development.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
Octalys

ASKER

Hi,

Thank you for the answers and help. My original question got answered. I used a part of the regex given by kaufmed.

But after reading Ray_Paseur's post, I completely rewrote my whole approach. I'm still using the regex, but not all of it.

Good article Ray!