Octalys
asked on
How do I make a specific regular expression for PHP that searches for http first and, if it's not there, searches for www? I need to grab domain names from a single line.
Hi,
How do I make a specific regular expression for PHP that searches for "http" first, then for "www." if that is not there, and finally for the TLDs if that is not there either?
I need to grab domain names from a single line of text, keeping in mind that the input comes only from dyslexic people. The dots are sometimes replaced with commas, or a dot in the domain name is skipped altogether.
So my idea to solve this is:
1: search for http (if not found, go to approach 2)
2: search for www (if not found, go to approach 3)
3: search for all available TLD in this list http://data.iana.org/TLD/tlds-alpha-by-domain.txt
4: FAIL, I did my best.
I only need to return subdomain.domain.tld and, if there is a path, the path.
Thank you
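The four steps above can be sketched as a cascade of separate patterns that are tried in order. This is only a rough illustration, not the accepted answer: the function name `find_domain` is hypothetical, and the short TLD list is a placeholder for the full IANA list.

```php
<?php
// Hypothetical helper: try each approach in order and return the first hit.
function find_domain(string $line): ?string
{
    $tlds = 'com|net|org|nl'; // assumption: abbreviated TLD list for illustration
    $patterns = [
        '#https?://\S+#i',                              // 1: explicit scheme
        '#\bwww[.,]\S+#i',                              // 2: leading www, dot or comma
        '#\S+[.,](?:' . $tlds . ')(?:/\S*)?(?=\s|$)#i', // 3: known TLD, optional path
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $line, $match)) {
            return $match[0]; // the full matched text
        }
    }
    return null; // 4: nothing found
}

var_dump(find_domain("test data1 data2 domain.nl sdc")); // string(9) "domain.nl"
```

Keeping the approaches as separate patterns makes it easy to adjust one step without touching the others.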
ASKER CERTIFIED SOLUTION
ASKER
Hi,
Thank you for the answer. The regex looks good.
It finds most domains, but not this example:
"asdasdas subdomain,google.com asfds"
And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?
Thanks
"It finds most domains, but not this example;"
Ah. The TLDs in the pattern are all in caps, so you would probably want case insensitivity on. Put a lower-case "i" after the last hash ( # ).
e.g.
preg_match('#...#i', $target, $match)
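To show the effect of the `i` modifier, here is a minimal sketch. The two-entry TLD alternation is a shortened stand-in for the real pattern, which is not reproduced here.

```php
<?php
// Assumption: shortened stand-in for the upper-case TLD alternation.
$pattern_cs = '#\S+\.(?:COM|NL)(?=\s|$)#';  // case-sensitive
$pattern_ci = '#\S+\.(?:COM|NL)(?=\s|$)#i'; // same pattern with the i modifier

$target = "test data1 data2 domain.nl sdc";

var_dump(preg_match($pattern_cs, $target)); // int(0): "nl" does not match "NL"
var_dump(preg_match($pattern_ci, $target)); // int(1)
```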
"And is it possible to cut http:// away from the result, just returning subdomain.domain.tld?"
For that, I think it would be simpler to do a subsequent replace call. For example:
$result = preg_match...
$result = str_replace("http://", "", $result);
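A slightly fuller sketch of that idea, under a few assumptions: `preg_match` returns a match count, so the matched text is read from `$match[0]`, and the helper name `strip_scheme` is made up for this example. It also handles https and upper-case schemes.

```php
<?php
// Hypothetical helper: remove a leading http:// or https:// from a matched URL.
function strip_scheme(string $url): string
{
    return preg_replace('#^https?://#i', '', $url);
}

$target = "see http://subdomain.domain.tld/path for details";
if (preg_match('#https?://\S+#i', $target, $match)) {
    $result = strip_scheme($match[0]);
    echo $result, "\n"; // subdomain.domain.tld/path
}
```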
Just a thought... It might be easier to do this in a few steps rather than try to do it all in a single REGEX. Maybe you would want your script to read the page at http://data.iana.org/TLD/tlds-alpha-by-domain.txt and use the up-to-the-minute list.
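As a rough sketch of reading that list: the IANA file has one TLD per line plus a comment line starting with "#". The helper name `build_tld_alternation` is hypothetical, and the sample lines below stand in for the downloaded file so the example runs without network access.

```php
<?php
// Hypothetical helper: turn the lines of the TLD file into a regex alternation.
// Blank lines and lines starting with "#" are skipped.
function build_tld_alternation(array $lines): string
{
    $tlds = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // Escape regex metacharacters, including the "#" delimiter.
        $tlds[] = preg_quote(strtolower($line), '#');
    }
    return implode('|', $tlds);
}

// Fetching the live list needs allow_url_fopen; error handling is omitted here.
// $lines = file('http://data.iana.org/TLD/tlds-alpha-by-domain.txt', FILE_IGNORE_NEW_LINES);
$lines = ['# sample comment line', 'CO', 'COM', 'NL']; // assumption: sample data

$alternation = build_tld_alternation($lines);
$pattern = '#\S+[.,](?:' . $alternation . ')(?=\s|$)#i';

if (preg_match($pattern, "test data1 data2 domain.nl sdc", $match)) {
    echo $match[0], "\n"; // domain.nl
}
```

Caching the downloaded list locally would avoid one HTTP request per check.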
If you want to post some test data and show us the expected results we might be able to give you a good function to return the desired information.
ASKER
Yeah, I might split up the whole checking process later, for a better overview of the code and to make parts of it easier to customise. But the regex is working really nicely.
The only problem is that it now also grabs email addresses, ignoring the @. Is it possible to fix this in the regex, or do I have to do an extra check?
The test data is very dynamic, because it's all user input. But because we work with dyslexic people, we have to be as creative as possible about the possible faulty inputs. Quite a challenge!
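One way to do the extra check, sketched under assumptions: collect every candidate with `preg_match_all` and skip those containing an "@". The helper name `first_non_email_domain` and the shortened TLD list are made up for this example.

```php
<?php
// Hypothetical helper: return the first domain-like token that is not an email address.
function first_non_email_domain(string $line, string $pattern): ?string
{
    if (!preg_match_all($pattern, $line, $matches)) {
        return null; // no candidates at all
    }
    foreach ($matches[0] as $candidate) {
        if (strpos($candidate, '@') === false) {
            return $candidate;
        }
    }
    return null; // every candidate looked like an email address
}

$pattern = '#\S+[.,](?:com|nl)(?=\s|$)#i'; // assumption: shortened TLD list
echo first_non_email_domain("mail me at jan@example.com or see domain.nl today", $pattern), "\n";
// prints: domain.nl
```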
If you want to post some test data and show us the expected results we might be able to give you...
Trying to write code without test data, especially in the context of potentially dyslexic input strings, is a fool's errand.
Here is a code snippet that I have used to validate an email address. It has a regular expression that seems to work fairly well.
<?php // RAY_email_validation.php
error_reporting(E_ALL);

// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE
// SEE MAN PAGE: http://php.net/manual/en/intro.filter.php
function check_valid_email($email)
{
    // IF PHP 5.2 OR ABOVE, WE CAN USE THE FILTER
    if (strnatcmp(phpversion(),'5.2') >= 0)
    {
        if(filter_var($email, FILTER_VALIDATE_EMAIL) === FALSE) return FALSE;
    }
    // IF LOWER-LEVEL PHP, WE CAN CONSTRUCT A REGULAR EXPRESSION
    else
    {
        $regex
        = '/'                        // START REGEX DELIMITER
        . '^'                        // START STRING
        . '[A-Z0-9_-]'               // AN EMAIL - SOME CHARACTER(S)
        . '[A-Z0-9._-]*'             // AN EMAIL - SOME CHARACTER(S) PERMITS DOT
        . '@'                        // A SINGLE AT-SIGN
        . '([A-Z0-9][A-Z0-9-]*\.)+'  // A DOMAIN NAME PERMITS DOT, ENDS DOT
        . '[A-Z\.]'                  // A TOP-LEVEL DOMAIN PERMITS DOT
        . '{2,6}'                    // TLD LENGTH >= 2 AND =< 6
        . '$'                        // ENDOF STRING
        . '/'                        // ENDOF REGEX DELIMITER
        . 'i'                        // CASE INSENSITIVE
        ;
        // TEST THE STRING FORMAT
        if (!preg_match($regex, $email)) return FALSE;
    }

    // FILTER_VAR OR PREG_MATCH DOES NOT TEST IF THE DOMAIN IS ROUTABLE
    $domain = explode('@', $email);

    // MAN PAGE: http://php.net/manual/en/function.checkdnsrr.php
    if ( checkdnsrr($domain[1], "MX") || checkdnsrr($domain[1], "A") ) return TRUE;

    // EMAIL IS NOT ROUTABLE
    return FALSE;
}

// DEMONSTRATE THE FUNCTION IN ACTION
$safe = '';
if (!empty($_GET["e"]))
{
    $e = $_GET["e"];
    // ESCAPE THE USER INPUT BEFORE ECHOING IT BACK INTO THE HTML
    $safe = htmlspecialchars($e, ENT_QUOTES);
    if (check_valid_email($e))
    {
        echo "<br/>VALID: $safe \n";
    }
    else
    {
        echo "<br/>BOGUS: $safe \n";
    }
}

// END OF PROCESSING - CREATE THE FORM USING HEREDOC NOTATION
$form = <<<ENDFORM
<form>
TEST A STRING FOR A VALID EMAIL ADDRESS:
<input name="e" value="$safe" />
<input type="submit" />
</form>
ENDFORM;
echo $form;
ASKER
I have no test data besides some stuff I can come up with.
But kaufmed, I tried this today and the results were:
$target = "test data1 data2 domain.com sdc";
The regex you gave me finds domain.com.
$target = "test data1 data2 domain.nl sdc";
domain.nl can't be found.
Adjusting kaufmed's pattern, I think this might work a little better for your approach #3 (I suggest keeping the first 2 as separate patterns, to reduce complexity):
preg_match('#(\S+[,.]){1,}(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN|XXX|YE|YT|ZA|ZM|ZW)(?=\s|$)#i', $target, $match);
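A short usage sketch of this pattern against the asker's two examples. To keep it readable, a two-entry TLD list stands in for the full alternation (an assumption), and a `str_replace` step, which is not part of the pattern above, turns mistyped commas back into dots.

```php
<?php
// Assumption: a shortened TLD list stands in for the full alternation.
$pattern = '#(\S+[,.]){1,}(?:COM|NL)(?=\s|$)#i';

$targets = [
    "asdasdas subdomain,google.com asfds",
    "test data1 data2 domain.nl sdc",
];

$found = [];
foreach ($targets as $target) {
    if (preg_match($pattern, $target, $match)) {
        // $match[0] is the whole matched text; normalise commas to dots.
        $found[] = str_replace(',', '.', $match[0]);
    }
}

print_r($found); // subdomain.google.com and domain.nl
```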
SOLUTION
This was a great question, and it got me thinking about the programming process. So I wrote an article showing exactly how I would go about writing a program to grab the domain names. It's not intended to be a solution, just an illustration of the thought process used in iterative development.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
ASKER
Hi,
Thank you for the answers and help. My original question got answered. I used a part of the regex given by kaufmed.
But after reading Ray_Paseur's post, I completely rewrote my whole approach. I am still using the regex, but not all of it.
Good article Ray!