$targets = array
( "test chatter"
, "random noise"
)
;
We probably think we know what a "domain name" means. It is a string of characters like domain.com that points to a resource on a network, like the internet. Domain names have very specific rules. Maybe it would be a good idea to look up the rules, right? A quick search leads us to this article: http://en.wikipedia.org/wi
$targets = array
( "test domain.com chatter"
, "random example.org noise"
)
;
But What Do We Really Want To Achieve?
$targets = array
( "domain.com" => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;
Armed with this tiny data set, we can begin constructing our regular expression. At the start of the process, it will look something like this.
$regex
= '#' // REGEX DELIMITER
. '(' // START OF A GROUP
. '[A-Z]' // ALPHABETIC CHARACTERS
. '+?' // INDETERMINATE LENGTH
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // LENGTH IS EXACTLY ONE
. '[A-Z]' // CHARACTER CLASS ALPHA
. ')' // END GROUP
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
We put that all together into a script, and put the script on our server. And we run it. And we shake the parse errors out. And we run it again. And we tinker with it a little bit until it seems to be doing something close to what we want. Once it is working (or nearly working), it looks something like example 1, below. What do we mean by "working" at this point? We don't mean programming perfection at all. Instead we mean that the script runs and creates informative and useful output. The useful output contains four key elements. It shows us the input string, the output string, the expected string and the regular expression, all neatly consolidated into an easy-to-read collection. That is what we need to see as we begin to improve and debug our regular expression.
<?php // RAY_EE_tdd_example_1.php
error_reporting(E_ALL);
echo "<pre>";
// TEST DATA
$targets
= array
( "domain.com" => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;
// A REGEX THAT FINDS THE DOMAIN SUBSTRINGS
$regex
= '#' // REGEX DELIMITER
. '(' // START GROUP
. '[A-Z0-9]' // CHARACTER CLASS ALPHANUMERIC
. '+?' // INDETERMINATE LENGTH
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // EXACTLY ONE
. '[A-Z]' // CHARACTER CLASS ALPHA
. ')' // END GROUP
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
// TEST THE DATA STRINGS
foreach ($targets as $expected => $target)
{
    preg_match_all($regex, $target, $match);
    // SHOW WHAT HAPPENED
    echo PHP_EOL;
    echo "<b>EXPECT:</b> $expected";
    echo PHP_EOL;
    echo "<b>INPUTS:</b> $target";
    echo PHP_EOL;
    echo "<b>REGEXP:</b> $regex";
    echo PHP_EOL;
    echo "<b>OUTPUT:</b> " . $match[1][0];
    echo PHP_EOL;
}
Well, it works. However, it does not give us the output we want. Instead of grabbing the entire substrings domain.com and example.org, it produces this.
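The truncation is easy to reproduce. A minimal sketch that re-runs the example 1 pattern on the two test strings; the variable names $inputs and $captured are illustrative and not part of the original script.

```php
<?php
// THE EXAMPLE 1 PATTERN, WRITTEN AS ONE STRING
$regex = '#([A-Z0-9]+?[.]{1}[A-Z])#i';

// ILLUSTRATIVE INPUTS (THE TWO TEST STRINGS FROM THE DATA SET)
$inputs = array
( "test domain.com chatter"
, "random example.org noise"
)
;

$captured = array();
foreach ($inputs as $input)
{
    preg_match_all($regex, $input, $match);
    $captured[] = $match[1][0];
    echo $match[1][0] . PHP_EOL;
}
// THE FINAL [A-Z] HAS NO QUANTIFIER, SO ONLY ONE LETTER OF THE TLD IS CAPTURED:
// domain.c
// example.o
```

The last character class matches exactly one letter, which is why the capture stops one character into the TLD.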
// A REGEX THAT FINDS THE DOMAIN SUBSTRINGS
$regex
= '#' // REGEX DELIMITER
. '(' // START GROUP
. '[A-Z0-9]' // CHARACTER CLASS ALPHANUMERIC
. '+?' // INDETERMINATE LENGTH
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // LENGTH IS EXACTLY ONE
. '[A-Z]' // CHARACTER CLASS ALPHA
. '{2,6}' // LENGTH IS TWO TO SIX
. ')' // END GROUP
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
That works well. The output is what we expect.
// TEST DATA
$targets
= array
( "domain.com" => "test domain.com chatter"
, "example.org" => "random example.org noise"
, "NOTHING" => "test chatter random noise"
)
;
Whoa! Something is broken.
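The likely culprit is the output line. When nothing matches, preg_match_all() returns 0 and leaves the sub-arrays of $match empty, so $match[1][0] does not exist and PHP raises an undefined-offset notice (a warning as of PHP 8). One way to guard against that is sketched below, assuming the revised pattern; the helper name first_capture() is hypothetical.

```php
<?php
$regex = '#([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i';

// HYPOTHETICAL HELPER: RETURN THE FIRST CAPTURED DOMAIN, OR NULL IF NOTHING MATCHED
function first_capture($regex, $target)
{
    $count = preg_match_all($regex, $target, $match);
    if (!$count) return NULL;   // 0 MEANS NO MATCH, FALSE MEANS A REGEX ERROR
    return $match[1][0];
}

var_dump(first_capture($regex, "test domain.com chatter"));   // string(10) "domain.com"
var_dump(first_capture($regex, "test chatter random noise")); // NULL
```

Dumping the whole $match array, as the revised loop does, is the other way out: an empty array is visible instead of being an error.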
// TEST THE DATA STRINGS
foreach ($targets as $expected => $target)
{
    preg_match_all($regex, $target, $match);
    // SHOW WHAT HAPPENED
    echo PHP_EOL;
    echo "<b>EXPECT:</b> $expected";
    echo PHP_EOL;
    echo "<b>INPUTS:</b> $target";
    echo PHP_EOL;
    echo "<b>REGEXP:</b> $regex";
    echo PHP_EOL;
    echo "<b>OUTPUT:</b> ";
    var_dump($match);
    echo PHP_EOL;
}
And the output from our tests looks something like this.
// TEST DATA
$targets
= array
( "NOTHING" => "test chatter random noise"
, "domain.com" => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;
Let's take a step forward. Now we will try to grab two domain names from a single string. Here is our new test data set.
// TEST DATA
$targets
= array
( "NOTHING" => "test chatter random noise"
, "domain.com example.org" => "test domain.com chatter example.org noise"
, "domain.com" => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;
And the new output contains everything we had before, plus this, so we now have evidence that we can grab more than one domain name. The domain names appear in the sub-array of the $match array at both key positions zero and one.
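A short sketch of where the two names land, assuming the revised pattern with the {2,6} TLD length. By default preg_match_all() orders results by pattern, so $match[0] lists the full matches and $match[1] lists the first capture group.

```php
<?php
$regex = '#([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i';
preg_match_all($regex, "test domain.com chatter example.org noise", $match);

// THE GROUP WRAPS THE WHOLE PATTERN, SO $match[0] AND $match[1] ARE IDENTICAL HERE
// KEY 0 HOLDS domain.com AND KEY 1 HOLDS example.org
print_r($match[1]);
```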
// TEST DATA
$targets
= array
( "NOTHING" => "test chatter random noise"
, "www.example.org" => "random www.example.org noise"
, "domain.com example.org" => "test domain.com chatter example.org noise"
, "domain.com" => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;
And the var_dump() output immediately shows us that the regular expression we are developing cannot handle this new input. Back to the drawing board!
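To make the failure concrete, here is a sketch that applies the same pattern (assumed to be the revised one with the {2,6} TLD length) to the new input and prints the matches on one line.

```php
<?php
$regex = '#([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i';
preg_match_all($regex, "random www.example.org noise", $match);

// "www" IS TAKEN AS THE NAME AND "example" IS CUT AT SIX LETTERS AS IF IT WERE THE TLD
echo implode(' | ', $match[0]) . PHP_EOL;   // www.exampl | e.org
```

The pattern knows about only one dot, so a three-part name is split into two wrong answers.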
<?php // RAY_EE_tdd_example_6.php
error_reporting(E_ALL);
echo "<pre>";
// TEST DATA IS NOW AN ARRAY OF INDIVIDUAL TESTS
$targets
= array
( array( "" => "test chatter random noise"
), array( "" => "the dot-com bubble"
), array( "" => "foo.bar may give false positive"
), array( "http://example.org" => "random noise http://example.org"
), array( "http://example.org" => "http://example.org? random noise"
), array( "http://example.org" => "random http://example.org noise"
), array( "https://www.example.org" => "random https://www.example.org noise"
), array( "http://test.example.org" => "random http://test.example.org noise"
), array( "www.example.org" => "random www.example.org noise"
), array( "domain.com example.org" => "test domain.com chatter example.org noise"
), array( "domain.com" => "test domain.com chatter"
), array( "" => "http://nonsense."
)
)
;
// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$regex
= '#' // REGEX DELIMITER
. '\b' // ON WORD BOUNDARY
. '(' // START GROUP
. 'https?' // HTTP OR HTTPS
. '|' // OR
. 'ftps?' // FTP OR FTPS
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '://' // COLON, SLASH, SLASH
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // A SUBDOMAIN
. '+?' // INDETERMINATE LENGTH
. '\.' // A DOT (ESCAPED)
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // CHARACTER CLASS ALPHANUMERIC
. '+?' // INDETERMINATE LENGTH
. ')' // END GROUP
. '(' // START GROUP
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // LENGTH IS EXACTLY ONE
. ')' // END GROUP
. '(' // START GROUP
. '[A-Z]' // CHARACTER CLASS ALPHA
. '{2,6}' // LENGTH IS TWO TO SIX
. ')' // END GROUP
. '\b' // ON WORD BOUNDARY
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
// TEST THE DATA STRINGS IN THE SUB-ARRAYS
foreach ($targets as $arr)
{
    foreach ($arr as $expected => $target)
    {
        preg_match_all($regex, $target, $match);
        // SHOW WHAT HAPPENED
        echo PHP_EOL;
        echo "<b>EXPECT:</b> $expected";
        echo PHP_EOL;
        echo "<b>INPUTS:</b> $target";
        echo PHP_EOL;
        echo "<b>REGEXP:</b> $regex";
        echo PHP_EOL;
        echo "<b>OUTPUT:</b> ";
        var_dump($match);
        echo PHP_EOL;
    }
}
Toward TDD Perfection
<?php // RAY_EE_tdd_example_7.php
error_reporting(E_ALL);
echo "<pre>";
// TEST DATA IS NOW AN ARRAY OF INDIVIDUAL TESTS
$targets
= array
( array( "" => "test chatter random noise"
), array( "" => "the dot-com bubble"
), array( "" => "foo.bar may give false positive"
), array( "" => "http://nonsense.nothing"
), array( "http://example.org" => "random noise http://example.org"
), array( "http://example.org" => "http://example.org? random noise"
), array( "http://example.org" => "random http://example.org noise"
), array( "https://www.example.org" => "random https://www.example.org noise"
), array( "http://test.example.org" => "random http://test.example.org noise"
), array( "www.example.org" => "random www.example.org noise"
), array( "domain.com example.org" => "test domain.com chatter example.org noise"
), array( "domain.com" => "test domain.com chatter"
)
)
;
// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$regex
= '#' // REGEX DELIMITER
. '\b' // ON WORD BOUNDARY
. '(' // START GROUP
. 'https?' // HTTP OR HTTPS
. '|' // OR
. 'ftps?' // FTP OR FTPS
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '://' // COLON, SLASH, SLASH
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // A SUBDOMAIN
. '+?' // INDETERMINATE LENGTH
. '\.' // A DOT (ESCAPED)
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // CHARACTER CLASS ALPHANUMERIC
. '+?' // INDETERMINATE LENGTH
. ')' // END GROUP
. '(' // START GROUP
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // LENGTH IS EXACTLY ONE
. ')' // END GROUP
. '(' // START GROUP
. '[A-Z]' // CHARACTER CLASS ALPHA
. '{2,6}' // LENGTH IS TWO TO SIX
. ')' // END GROUP
. '\b' // ON WORD BOUNDARY
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
// TEST THE DATA STRINGS IN THE SUB-ARRAYS
foreach ($targets as $arr)
{
    foreach ($arr as $expected => $target)
    {
        preg_match_all($regex, $target, $match);
        // SHOW WHAT HAPPENED
        foreach ($match[0] as $matched)
        {
            // NO OUTPUT IF THE TEST WORKED AS EXPECTED
            if ($matched == $expected) continue;
            // EXPOSITION IF THE TEST DID NOT WORK AS EXPECTED
            echo PHP_EOL;
            echo "<b>EXPECT:</b> $expected";
            echo PHP_EOL;
            echo "<b>INPUTS:</b> $target";
            echo PHP_EOL;
            echo "<b>REGEXP:</b> $regex";
            echo PHP_EOL;
            echo "<b>OUTPUT:</b> ";
            print_r($match[0]);
            echo PHP_EOL;
        }
    }
}
Now the volume of output is manageable! Here is what it looks like. A quick visual inspection shows us that the two-URL example is really OK. But foo.bar is not really something we want.
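One way to reject foo.bar is to accept only known TLDs. The sketch below shows the idea offline with a tiny hand-picked list, which is an assumption made for illustration; a real list would be far longer. The pattern is also simplified to a name, a dot, and a TLD.

```php
<?php
// ASSUMPTION: A TINY HAND-PICKED TLD LIST, FOR ILLUSTRATION ONLY
$tlds = array('com', 'org', 'net');

// QUOTE EACH ENTRY AND JOIN WITH "|" TO BUILD AN ALTERNATION GROUP
$quoted = array_map(function ($t) { return preg_quote($t, '#'); }, $tlds);
$tldg   = '(' . implode('|', $quoted) . ')';

// SIMPLIFIED PATTERN: NAME, DOT, ONE OF THE LISTED TLDS, ON WORD BOUNDARIES
$regex = '#\b([A-Z0-9]+?[.]' . $tldg . ')\b#i';

foreach (array("foo.bar may give false positive", "test domain.com chatter") as $target)
{
    preg_match_all($regex, $target, $match);
    echo $target . ' => ' . implode(' ', $match[1]) . PHP_EOL;
}
// foo.bar may give false positive =>
// test domain.com chatter => domain.com
```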
<?php // RAY_EE_tdd_example_8.php
error_reporting(E_ALL);
echo "<pre>";
// TEST DATA IS AN ARRAY OF INDIVIDUAL TEST ARRAYS
$targets
= array
( array( "" => "test chatter random noise"
), array( "" => "the dot-com bubble"
), array( "" => "foo.bar may give false positive"
), array( "" => "http://nonsense.nothing"
), array( "http://example.org" => "random noise http://example.org"
), array( "http://example.org" => "http://example.org? random noise"
), array( "http://example.org" => "random http://example.org noise"
), array( "https://www.example.org" => "random https://www.example.org noise"
), array( "http://test.example.org" => "random http://test.example.org noise"
), array( "www.example.org" => "random www.example.org noise"
), array( "domain.com example.org" => "test domain.com chatter example.org noise"
), array( "domain.com" => "test domain.com chatter"
)
)
;
// READ THE IANA TLD LIST
$tlds = file('http://data.iana.org/TLD/tlds-alpha-by-domain.txt', FILE_IGNORE_NEW_LINES);
// ROUGH-CUT SANITIZE THE IANA TLD LIST REMOVING COMMENTS AND JUNK
foreach ($tlds as $key => $tld)
{
if (strpos($tld, '#') !== FALSE) unset($tlds[$key]);
if (strpos($tld, '--') !== FALSE) unset($tlds[$key]);
}
// COLLAPSE THE TLD ARRAY INTO A GROUP STRING FOR USE IN THE REGEX
$tldg = '(' . implode('|', $tlds) . ')';
// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$regex
= '#' // REGEX DELIMITER
. '\b' // ON WORD BOUNDARY
. '(' // START GROUP
. 'https?' // HTTP OR HTTPS
. '|' // OR
. 'ftps?' // FTP OR FTPS
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '://' // COLON, SLASH, SLASH
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // A SUBDOMAIN
. '+?' // INDETERMINATE LENGTH
. '\.' // A DOT (ESCAPED)
. ')' // END GROUP
. '??' // ZERO OR ONE OF THIS GROUP, UNGREEDY
. '(' // START GROUP
. '[A-Z0-9]' // CHARACTER CLASS ALPHANUMERIC
. '+?' // INDETERMINATE LENGTH
. ')' // END GROUP
. '(' // START GROUP
. '[.]' // THE DOT (BEFORE THE TLD)
. '{1}' // LENGTH IS EXACTLY ONE
. ')' // END GROUP
. $tldg // THE GROUP OF IANA-ENDORSED TLD STRINGS
. '\b' // ON WORD BOUNDARY
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE
;
// TEST THE DATA STRINGS IN THE SUB-ARRAYS
foreach ($targets as $arr)
{
    foreach ($arr as $expected => $target)
    {
        preg_match_all($regex, $target, $match);
        // SHOW WHAT HAPPENED
        foreach ($match[0] as $matched)
        {
            // NO OUTPUT IF THE TEST WORKED AS EXPECTED
            if ($matched == $expected) continue;
            // EXPOSITION IF THE TEST DID NOT WORK AS EXPECTED
            echo PHP_EOL;
            echo "<b>EXPECT:</b> $expected";
            echo PHP_EOL;
            echo "<b>INPUTS:</b> $target";
            echo PHP_EOL;
            echo "<b>REGEXP:</b> $regex";
            echo PHP_EOL;
            echo "<b>OUTPUT:</b> ";
            print_r($match[0]);
            echo PHP_EOL;
        }
    }
}
This process continues until we are satisfied with the regular expression. We can add test cases at will; however, any change we make to the regex string requires a complete re-test. The structure of the program and its test data enables us to run these tests instantly.
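The re-test step can be wrapped in a function so every regex change is checked against the whole data set in one call. This is a sketch under stated assumptions, not the scripts above: the helper name run_regex_tests() is hypothetical, tests are stored as (expected, input) pairs rather than keyed arrays, and all matches are joined with a space so a multi-domain string compares as one value. The pattern is a shortened stand-in.

```php
<?php
// HYPOTHETICAL HELPER: RUN EVERY TEST AND RETURN THE LIST OF FAILURES
function run_regex_tests($regex, array $tests)
{
    $failures = array();
    foreach ($tests as $test)
    {
        list($expected, $target) = $test;
        preg_match_all($regex, $target, $match);
        // JOIN ALL MATCHES SO "domain.com example.org" COMPARES AS ONE STRING
        $actual = implode(' ', $match[0]);
        if ($actual !== $expected) $failures[] = array($expected, $target, $actual);
    }
    return $failures;
}

// A SHORTENED STAND-IN PATTERN, FOR ILLUSTRATION
$regex = '#\b[A-Z0-9]+?[.][A-Z]{2,6}\b#i';

// EACH TEST IS AN (EXPECTED, INPUT) PAIR
$tests = array
( array("", "test chatter random noise")
, array("domain.com", "test domain.com chatter")
, array("domain.com example.org", "test domain.com chatter example.org noise")
, array("", "foo.bar may give false positive")
)
;

$failures = run_regex_tests($regex, $tests);
echo count($failures) . " of " . count($tests) . " tests failed" . PHP_EOL;   // 1 of 4 tests failed
print_r($failures);
```

The one failure is foo.bar, which this stand-in pattern still accepts. Storing tests as pairs also allows several tests to share the same expected value without nesting each one in its own array.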
Comments (18)
Commented:
I don't disagree with testing. I disagree with a concept of using a QA developed DB.
I work in a mostly ETL (Extract, Transform, Load) type situation.
There are standardized codes for reporting someone's education level to the government. We have customers who have changed the database defaults by adding and modifying levels. When an end user has added "1 year college" and "2 years college" over 84 times, and the ETL process reports "UTD" (Unable To Determine) for education on 100% of records, is that going to go over well?
Commented:
I'm also led to wonder why on earth we allowed things like .com without a geo reference first up, and why the sequence wasn't "top down". That is, why is the top-level domain last? :) Heaven knows what will happen with TLDs in the future too (e.g. generic names), so be prepared for regular revisits.
Such is life.