Community Pick: Many members of our community have endorsed this article.
Editor's Choice: This article has been selected by our editors as an exceptional contribution.

A Quick Tour of Test-Driven Development

Author Note: Since this E-E article was originally written, years ago, formal testing has come into common use in the world of PHP.  PHPUnit and similar technologies have enjoyed wide adoption, making it possible to write and share unit tests among teams of programmers.  Behat and similar technologies (Laravel, for example) have introduced fluency into the programming process.  Taken together, these advances have brought open-source programming into a state that is somewhat similar to musket making, just after Eli Whitney.  If you're here looking for guidance about modernizing your development process, please take some time to learn about these "new technologies" because you will soon be unemployable if you're not conversant in the details.  These are the greatest advances in software development in our generation.

Now back to our Test-Driven Development article...

On Monday, October 17, 2011, the A.Word.A.Day "Thought for Today" was "A problem well stated is a problem half solved." -- attributed to Charles F. Kettering, inventor and engineer (1876-1958).

The Greatest Handicap a Programmer Can Have
The author of a question here at EE wrote, "I have no test data besides some stuff I can come up with."
http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_27305841.html#a36556359

That statement got me thinking about the difference between professional programmers and amateurs who try to "pick up" programming on their own, perhaps by reading computer programs and copying code written by others.  Certainly that is one way to learn some things about programming, but it overlooks the most important part of programming, which is the process by which the code is created.  Examining a finished computer program is something like examining a freshly baked apple pie.  You can appreciate the finished product, but where would you start if someone gave you a basket of apples?  What other ingredients would you need?  What tools would you use?  What process would you follow?  How much time should it take?  These things are not apparent as you look at that pie.  They are not apparent as you look at computer code, either.

This article takes a programming task and shows in a step-by-step manner how a professional would go about isolating the issues and writing the code.  As such, it is not really a "how to" example, so much as a narrative of the thought process and resulting code structure that you would find along the way from question to solution.

The greatest asset a programmer can have?  Good test data.  The greatest handicap?  No test data.

Professionals Use Different Processes
There are a number of reasons that professional software developers get excellent results very quickly, whereas amateurs tend to work more slowly and often write brittle and ineffective code.  Certainly a depth of experience helps, as does familiarity with commonly used design patterns.  But professional programmers address new problems all the time, problems that they have never attempted to solve before.  And they get rapid results with accurate solutions.  When measured by common yardsticks like the time required to develop software or the number of bugs, the differences between high and low performers are huge, sometimes an order of magnitude.  Are the professional programmers somehow ten times smarter than amateurs?  That's not likely!  Instead, professionals follow stable, predictable processes when they craft computer software.  One of those processes is test-driven development, or TDD for short.  It is an iterative process whose goal is to reduce the large task of programming to a set of smaller, simpler subtasks.  A related concept is the SSCCE, the Short, Self-Contained, Correct Example.

This article applies the TDD practice to a question about something that is extraordinarily complex: regular expressions.  Our question can be summed up this way: "I need to grab domain names from a single line."  Regular expressions can do that.  But this article is not really about regular expressions -- it is about how to use TDD to your advantage, to make your programs more dependable, to make testing easier and better organized, and to make your software development process into something that is worthy of professional rates.

Eating the Elephant One Byte at a Time
At some point the test data and the regular expression will become long and complex, but you do not want to start with long and complex stuff that throws you into a complicated debugging activity at the beginning.  The point of using TDD is that you can build up the test data and the program together, incrementally.

The TDD process would begin, not with programming, but with the creation of test data, as we sought to understand and illustrate the meaning of the terms "a single line" and "domain names."  One good way to create a test data set is to use an array.  We can put all our test cases into the array and use an iterator like foreach() to run our test cases through the code we are building.  By doing it this way we give ourselves a huge advantage over the amateur programmer.  We gain the ability to make a multitude of tests very quickly.

We probably know what a "single line" means.  It means a string of characters.  So our first task is an easy one.  We will make an array containing a string or two.  We will start the strings with just some random stuff that we do not want (what we want to grab, the domain names, will come into play later). Perhaps we would begin with something like this:
 
$targets = array
( "test chatter"
, "random noise"
)
;


We probably think we know what a "domain name" means.  It is a string of characters like domain.com that points to a resource on a network, like the internet.  Domain names have very specific rules.  Maybe it would be a good idea to look up the rules, right?  A quick search leads us to this article: http://en.wikipedia.org/wiki/Domain_Name_System and this article: http://en.wikipedia.org/wiki/Domain_name and we start reading.  Holy Cow!  These articles are thousands of words long.  There are a lot of "moving parts" to domain names!  We better simplify things.

Let's start by choosing simple examples, maybe a single top-level domain ("TLD") like .com or .org, preceded by a common word like example or a less common word like ExpertsExchange.  What are the characteristics of these strings?  For one thing they are independent of the surrounding data -- example.com is not the same as myexample.commonality.  So we can surmise that they may have blanks before and after.  And they have a dot somewhere in them.  And they can be upper case, lower case or mixed case.  That is enough for us to get started building our test data set.
 
$targets = array
( "test domain.com chatter"
, "random example.org noise"
)
;


But What Do We Really Want To Achieve?
An array of test data samples is somewhat useful, but (as we shall see) it can grow to an array of many test data samples, so we need some way to coordinate the test data and the desired results.  In our instant case we can use the associative array to make this coordination.  We will let the array key contain information about the desired outputs, and the array value will hold the test data string.  Then as we iterate over each test case, we can see how well our regular expression is working.
 
$targets = array
( "domain.com"  => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;
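
Before any regular expression exists, the key => value pairing can already be exercised.  The following is only a sketch, not part of the original script: it walks the array with foreach() and confirms that each expected string really does occur inside its test string, which is a cheap sanity check of the test data itself.  The $problems variable is a name invented for this illustration.

```php
<?php
// SKETCH ONLY: CHECKS THE TEST DATA ITSELF, BEFORE ANY REGEX EXISTS
$targets = array
( "domain.com"  => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;

// HYPOTHETICAL COLLECTOR FOR TEST CASES WHOSE DATA IS INCONSISTENT
$problems = array();

foreach ($targets as $expected => $target)
{
    // THE EXPECTED STRING SHOULD APPEAR SOMEWHERE IN THE TEST STRING
    if (strpos($target, $expected) === false)
    {
        $problems[] = $expected;
    }

    echo "EXPECT: $expected" . PHP_EOL;
    echo "INPUTS: $target"   . PHP_EOL;
}

echo count($problems) . " test case(s) with inconsistent data" . PHP_EOL;
```

If a typo creeps into a key or a value, a check like this would flag it before we start blaming the regular expression.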


Armed with this tiny data set, we can begin constructing our regular expression.  At the start of the process, it will look something like this.
 
$regex
= '#'         // REGEX DELIMITER
. '('         // START OF A GROUP
. '[A-Z]'     // ALPHABETIC CHARACTERS
. '+?'        // INDETERMINATE LENGTH
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. '[A-Z]'     // CHARACTER CLASS ALPHA
. ')'         // END GROUP
. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;
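
Because the pattern is assembled from many small pieces, it can help to look at the assembled string and confirm that it compiles before running any tests.  This is a minimal sketch that assumes only the standard behavior of preg_match(), which returns false when a pattern cannot be compiled; the $compiles variable is a name made up for the illustration.

```php
<?php
// THE SAME PIECES AS ABOVE, JOINED ON ONE LINE
$regex = '#' . '(' . '[A-Z]' . '+?' . '[.]' . '{1}' . '[A-Z]' . ')' . '#' . 'i';

// SHOW THE ASSEMBLED PATTERN
echo $regex . PHP_EOL;   // #([A-Z]+?[.]{1}[A-Z])#i

// preg_match() RETURNS false WHEN THE PATTERN CANNOT BE COMPILED
// THE @ SUPPRESSES THE WARNING SO WE CAN REPORT THE PROBLEM OURSELVES
$compiles = (@preg_match($regex, '') !== false);

echo $compiles ? "PATTERN COMPILES" : "PATTERN IS BROKEN";
echo PHP_EOL;
```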


We put that all together into a script, and put the script on our server.  And we run it.  And we shake the parse errors out.  And we run it again.  And we tinker with it a little bit until it seems to be doing something close to what we want.  Once it is working (or nearly working), it looks something like example 1, below.  What do we mean by "working" at this point?  We don't mean programming perfection at all.  Instead we mean that the script runs and creates informative and useful output.  The useful output contains four key elements.  It shows us the input string, the output string, the expected string and the regular expression, all neatly consolidated into an easy-to-read collection.  That is what we need to see as we begin to improve and debug our regular expression.
 
<?php // RAY_EE_tdd_example_1.php
error_reporting(E_ALL);
echo "<pre>";

// TEST DATA
$targets
= array
( "domain.com"  => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;

// A REGEX THAT FINDS THE DOMAIN SUBSTRINGS
$regex
= '#'         // REGEX DELIMITER
. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // EXACTLY ONE
. '[A-Z]'     // CHARACTER CLASS ALPHA
. ')'         // END GROUP
. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// TEST THE DATA STRINGS
foreach ($targets as $expected => $target)
{
    preg_match_all($regex, $target, $match);

    // SHOW WHAT HAPPENED
    echo PHP_EOL;
    echo "<b>EXPECT:</b> $expected";
    echo PHP_EOL;
    echo "<b>INPUTS:</b> $target";
    echo PHP_EOL;
    echo "<b>REGEXP:</b> $regex";
    echo PHP_EOL;
    echo "<b>OUTPUT:</b> " . $match[1][0];
    echo PHP_EOL;
}


Well, it works.  However it does not give us the output we want.  Instead of grabbing the entire substrings domain.com and example.org it produces this.

EXPECT: domain.com
INPUTS: test domain.com chatter
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z])#i
OUTPUT: domain.c

EXPECT: example.org
INPUTS: random example.org noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z])#i
OUTPUT: example.o
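
The truncation comes from the last character class, which has no quantifier and therefore matches exactly one letter.  As a small sketch, the two patterns below differ only in that final piece.  The second pattern, with an unbounded "+", is a hypothetical variation shown purely for comparison; it is not the pattern this article settles on.

```php
<?php
$target = "test domain.com chatter";

// ONE LETTER AFTER THE DOT: THE CLASS HAS NO QUANTIFIER
preg_match('#([A-Z0-9]+?[.]{1}[A-Z])#i', $target, $one);

// ONE OR MORE LETTERS AFTER THE DOT (HYPOTHETICAL VARIATION)
preg_match('#([A-Z0-9]+?[.]{1}[A-Z]+)#i', $target, $many);

echo $one[1]  . PHP_EOL;   // domain.c
echo $many[1] . PHP_EOL;   // domain.com
```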

Our regex cut off part of the TLD, so the regex must be amended.  If we look around the internet at the patterns for TLD strings we find that they range from little two-character strings like uk up to longer strings like museum.  For now, we do not care exactly what the TLD contains - we just want to get all of it.  So we might add a length to the regex, and our new expression would look like this.
 
// A REGEX THAT FINDS THE DOMAIN SUBSTRINGS
$regex
= '#'         // REGEX DELIMITER
. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. '[A-Z]'     // CHARACTER CLASS ALPHA
. '{2,6}'     // LENGTH IS TWO TO SIX
. ')'         // END GROUP
. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;


That works well.  The output is what we expect.

EXPECT: domain.com
INPUTS: test domain.com chatter
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: domain.com

EXPECT: example.org
INPUTS: random example.org noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: example.org

Now it is time to add an additional test case to our test data set.  To do that we just add another element of the array.  Let's add a string that has no domain name.  Our expected output would be the two domains we have already found.  The additional "noise" should not change the output.
 
// TEST DATA
$targets
= array
( "domain.com"  => "test domain.com chatter"
, "example.org" => "random example.org noise"
, "NOTHING"     => "test chatter random noise"
)
;


Whoa!  Something is broken.

EXPECT: NOTHING
INPUTS: test chatter random noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i

Notice:  Undefined offset: 0 in /RAY_EE_tdd_example_3.php on line 43

OUTPUT:

We need to adjust our test script.  We can replace the echo statement with var_dump($match).  That will give us more detailed information about how the regular expression is working.  And the output from var_dump() is far more interesting.  It shows us what the regular expression matched, and it shows us the valuable effect of regex grouping.  Now our test data visualization code looks something like this.
 
// TEST THE DATA STRINGS
foreach ($targets as $expected => $target)
{
    preg_match_all($regex, $target, $match);

    // SHOW WHAT HAPPENED
    echo PHP_EOL;
    echo "<b>EXPECT:</b> $expected";
    echo PHP_EOL;
    echo "<b>INPUTS:</b> $target";
    echo PHP_EOL;
    echo "<b>REGEXP:</b> $regex";
    echo PHP_EOL;
    echo "<b>OUTPUT:</b> ";
    var_dump($match);
    echo PHP_EOL;
}


And the output from our tests looks something like this.

EXPECT: domain.com
INPUTS: test domain.com chatter
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(10) "domain.com"
  }
  [1]=>
  array(1) {
    [0]=>
    string(10) "domain.com"
  }
}

EXPECT: example.org
INPUTS: random example.org noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(11) "example.org"
  }
  [1]=>
  array(1) {
    [0]=>
    string(11) "example.org"
  }
}

EXPECT: NOTHING
INPUTS: test chatter random noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: array(2) {
  [0]=>
  array(0) {
  }
  [1]=>
  array(0) {
  }
}
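
The empty sub-arrays in the NOTHING case explain the earlier "Undefined offset" notice: there was no element zero to echo.  One way to guard against that, sketched here as an alternative to var_dump(), is to use the value that preg_match_all() returns, which is the number of full-pattern matches.  The $outputs array is a name invented for this illustration.

```php
<?php
$regex   = '#([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i';
$targets = array
( "domain.com" => "test domain.com chatter"
, "NOTHING"    => "test chatter random noise"
)
;

// HYPOTHETICAL COLLECTOR FOR WHAT EACH TEST PRODUCED
$outputs = array();

foreach ($targets as $expected => $target)
{
    // preg_match_all() RETURNS THE NUMBER OF FULL-PATTERN MATCHES
    $count = preg_match_all($regex, $target, $match);

    // ONLY READ $match[1][0] WHEN THERE IS SOMETHING TO READ
    $outputs[$expected] = ($count > 0) ? $match[1][0] : "NOTHING";

    echo "OUTPUT: " . $outputs[$expected] . PHP_EOL;
}
```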

Building Up the Test Cases
Our little test script is working well, so far.  It enables us to add tests quickly and easily, and it enables us to see the results instantaneously.  We will want to add several more test cases to the script because there are many ways that someone might write a domain name.  Since each test case is atomic and complete, we can add more key => value pairs to our $targets array.  We will add these near the top of the array, so that each new test case will be printed at the top of our browser output.  We can rearrange our test data array into something that looks like this.  Each new test will be added immediately after the NOTHING test at the top of this array.  Notice how the array keys and values are neatly lined up in the code?  That gives us a strong visual cue in what might otherwise be a confusing jumble of letters and punctuation.  Everything about the process is designed to add clarity and to remove uncertainty at every step along the way.
 
// TEST DATA
$targets
= array
( "NOTHING"     => "test chatter random noise"
, "domain.com"  => "test domain.com chatter"
, "example.org" => "random example.org noise"
)
;


Let's take a step forward.  Now we will try to grab two domain names from a single string.  Here is our new test data set.
 
// TEST DATA
$targets
= array
( "NOTHING"                => "test chatter random noise"
, "domain.com example.org" => "test domain.com chatter example.org noise"
, "domain.com"             => "test domain.com chatter"
, "example.org"            => "random example.org noise"
)
;


And the new output contains everything we had before, plus this, so we now have evidence that we can grab more than one domain name.  The domain names appear in the sub-array of the $match array at both key positions zero and one.

EXPECT: domain.com example.org
INPUTS: test domain.com chatter example.org noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(10) "domain.com"
    [1]=>
    string(11) "example.org"
  }
  [1]=>
  array(2) {
    [0]=>
    string(10) "domain.com"
    [1]=>
    string(11) "example.org"
  }
}
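
The two sub-arrays are identical here only because our single group wraps the entire pattern: position zero holds the full-pattern matches, and position one holds whatever the first parenthesized group captured.  As a sketch, moving the parentheses so that they wrap only the TLD (a hypothetical variation, not the pattern we are developing) makes the difference visible.

```php
<?php
$target = "test domain.com chatter example.org noise";

// GROUP WRAPS ONLY THE TLD (HYPOTHETICAL VARIATION OF THE PATTERN)
preg_match_all('#[A-Z0-9]+?[.]{1}([A-Z]{2,6})#i', $target, $match);

print_r($match[0]);   // FULL MATCHES: domain.com, example.org
print_r($match[1]);   // GROUP 1 ONLY: com, org
```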

Sometimes you see a domain name written with a subdomain in front of it: example.org is written www.example.org or test.example.org.  Adding that to the test is very simple.
 
// TEST DATA
$targets
= array
( "NOTHING"                => "test chatter random noise"
, "www.example.org"        => "random www.example.org noise"
, "domain.com example.org" => "test domain.com chatter example.org noise"
, "domain.com"             => "test domain.com chatter"
, "example.org"            => "random example.org noise"
)
;


And the var_dump() output immediately shows us that the regular expression we are developing cannot handle this new input. Back to the drawing board!

EXPECT: www.example.org
INPUTS: random www.example.org noise
REGEXP: #([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i
OUTPUT: array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(10) "www.exampl"
    [1]=>
    string(5) "e.org"
  }
  [1]=>
  array(2) {
    [0]=>
    string(10) "www.exampl"
    [1]=>
    string(5) "e.org"
  }
}
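
Why the odd split?  The TLD part takes up to six letters, so it swallows "exampl" and leaves "e.org" for a second match.  As an intermediate experiment (a sketch, not the final answer), anchoring the pattern on word boundaries with \b stops the split.  It also shows that a boundary alone is not enough, because the subdomain is lost.

```php
<?php
$target = "random www.example.org noise";

// THE PATTERN FROM THE TEST ABOVE
preg_match_all('#([A-Z0-9]+?[.]{1}[A-Z]{2,6})#i', $target, $before);

// THE SAME PATTERN ANCHORED ON WORD BOUNDARIES
preg_match_all('#\b([A-Z0-9]+?[.]{1}[A-Z]{2,6})\b#i', $target, $after);

print_r($before[0]);   // www.exampl, e.org
print_r($after[0]);    // example.org
```

So the regular expression also needs some way to accept an optional subdomain in front of the domain name.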

Iterative Development with TDD
This is the critical part of TDD.  We will have to make a change, indeed many changes, to the regex string under development, and once we have made each change we will rerun ALL of our test cases.  And it is amazingly easy to do that -- our test script is organized in such a way that it runs all of the test cases every time it is run!  All we need to do is look at the output and see if it looks right.  

But there is a little problem with the script the way it stands now.  Notice that the expected answer is in the array key?  What if we wanted to test for two identical answers in different strings?  We can't do that the way the script works now, because the expected string is the array key.  The test data sets would overwrite one another if we had duplicate array keys.  To overcome that issue we can make a small modification to the array of test data.  Instead of using one long array, we can use an array of "sub-arrays," with each sub-array containing one individual test case.  This is an easy change and it is reflected in the script below.  We can still keep our program code neatly lined up.
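
The overwriting is easy to demonstrate.  In this small sketch the second test string and the names $flat and $nested are made up for the illustration.

```php
<?php
// SAME EXPECTED STRING USED AS THE KEY FOR TWO TESTS
// THE SECOND ELEMENT SILENTLY OVERWRITES THE FIRST
$flat = array
( "example.org" => "random example.org noise"
, "example.org" => "example.org at the start"
)
;

// ONE SUB-ARRAY PER TEST CASE: BOTH TESTS SURVIVE
$nested = array
(  array( "example.org" => "random example.org noise"
), array( "example.org" => "example.org at the start"
)
)
;

echo count($flat)   . PHP_EOL;   // 1
echo count($nested) . PHP_EOL;   // 2
```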

As we add new test cases, we will put them into the sub-arrays of the $targets array.  We will use the same kind of key => value pair notation, and we will keep the keys and values lined up so the code is easy to read.  Note, too, that the regular expression statements are well-commented with appropriate line spacing, and the regex string is built up from a series of concatenated substrings.  By breaking things out into separate lines we make the code easier to read and modify.  Not to mention easier to test.

Here is example 6 of the script now that we have added some additional capabilities and test cases.  You will see that we added line spacing around the regex groups to make the regular expression easier to read and understand.
 
<?php // RAY_EE_tdd_example_6.php
error_reporting(E_ALL);
echo "<pre>";

// TEST DATA IS NOW AN ARRAY OF INDIVIDUAL TESTS
$targets
= array
(  array( ""                         => "test chatter random noise"
), array( ""                         => "the dot-com bubble"
), array( ""                         => "foo.bar may give false positive"
), array( "http://example.org"       => "random noise http://example.org"
), array( "http://example.org"       => "http://example.org? random noise"
), array( "http://example.org"       => "random http://example.org noise"
), array( "https://www.example.org"  => "random https://www.example.org noise"
), array( "http://test.example.org"  => "random http://test.example.org noise"
), array( "www.example.org"          => "random www.example.org noise"
), array( "domain.com example.org"   => "test domain.com chatter example.org noise"
), array( "domain.com"               => "test domain.com chatter"
), array( ""                         => "http://nonsense."
)
)
;

// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$regex
= '#'         // REGEX DELIMITER

. '\b'        // ON WORD BOUNDARY

. '('         // START GROUP
. 'https?'    // HTTP OR HTTPS
. '|'         // OR
. 'ftps?'     // FTP OR FTPS
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '://'       // COLON, SLASH, SLASH
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // A SUBDOMAIN
. '+?'        // INDETERMINATE LENGTH
. '\.'        // A DOT (ESCAPED)
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. ')'         // END GROUP

. '('         // START GROUP
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. ')'         // END GROUP

. '('         // START GROUP
. '[A-Z]'     // CHARACTER CLASS ALPHA
. '{2,6}'     // LENGTH IS TWO TO SIX
. ')'         // END GROUP

. '\b'        // ON WORD BOUNDARY

. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// TEST THE DATA STRINGS IN THE SUB-ARRAYS
foreach ($targets as $arr)
{
    foreach ($arr as $expected => $target)
    {
        preg_match_all($regex, $target, $match);

        // SHOW WHAT HAPPENED
        echo PHP_EOL;
        echo "<b>EXPECT:</b> $expected";
        echo PHP_EOL;
        echo "<b>INPUTS:</b> $target";
        echo PHP_EOL;
        echo "<b>REGEXP:</b> $regex";
        echo PHP_EOL;
        echo "<b>OUTPUT:</b> ";
        var_dump($match);
        echo PHP_EOL;
    }
}


Toward TDD Perfection
You can copy that script and install it on your own server to see how it works.  If you're like me, you will probably think it produces a lot of output!  In a perfect TDD world, you would not have to read all of the outputs.  Your test script would not just run the tests and display the output, it would go a step further and make the comparison of the expected output and the actual output.  Maybe it would only show you the outputs when the comparison found that the expected output did not match the actual output.  This further layer of automation means even faster and more simplified testing.  Let's look at how we can add the automated checking to the existing code.
 
<?php // RAY_EE_tdd_example_7.php
error_reporting(E_ALL);
echo "<pre>";

// TEST DATA IS NOW AN ARRAY OF INDIVIDUAL TESTS
$targets
= array
(  array( ""                         => "test chatter random noise"
), array( ""                         => "the dot-com bubble"
), array( ""                         => "foo.bar may give false positive"
), array( ""                         => "http://nonsense.nothing"
), array( "http://example.org"       => "random noise http://example.org"
), array( "http://example.org"       => "http://example.org? random noise"
), array( "http://example.org"       => "random http://example.org noise"
), array( "https://www.example.org"  => "random https://www.example.org noise"
), array( "http://test.example.org"  => "random http://test.example.org noise"
), array( "www.example.org"          => "random www.example.org noise"
), array( "domain.com example.org"   => "test domain.com chatter example.org noise"
), array( "domain.com"               => "test domain.com chatter"
)
)
;

// A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
$regex
= '#'         // REGEX DELIMITER

. '\b'        // ON WORD BOUNDARY

. '('         // START GROUP
. 'https?'    // HTTP OR HTTPS
. '|'         // OR
. 'ftps?'     // FTP OR FTPS
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '://'       // COLON, SLASH, SLASH
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // A SUBDOMAIN
. '+?'        // INDETERMINATE LENGTH
. '\.'        // A DOT (ESCAPED)
. ')'         // END GROUP
. '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY

. '('         // START GROUP
. '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
. '+?'        // INDETERMINATE LENGTH
. ')'         // END GROUP

. '('         // START GROUP
. '[.]'       // THE DOT (BEFORE THE TLD)
. '{1}'       // LENGTH IS EXACTLY ONE
. ')'         // END GROUP

. '('         // START GROUP
. '[A-Z]'     // CHARACTER CLASS ALPHA
. '{2,6}'     // LENGTH IS TWO TO SIX
. ')'         // END GROUP

. '\b'        // ON WORD BOUNDARY

. '#'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// TEST THE DATA STRINGS IN THE SUB-ARRAYS
foreach ($targets as $arr)
{
    foreach ($arr as $expected => $target)
    {
        preg_match_all($regex, $target, $match);

        // SHOW WHAT HAPPENED
        foreach ($match[0] as $matched)
        {
            // NO OUTPUT IF THE TEST WORKED AS EXPECTED
            if ($matched == $expected) continue;

            // EXPOSITION IF THE TEST DID NOT WORK AS EXPECTED
            echo PHP_EOL;
            echo "<b>EXPECT:</b> $expected";
            echo PHP_EOL;
            echo "<b>INPUTS:</b> $target";
            echo PHP_EOL;
            echo "<b>REGEXP:</b> $regex";
            echo PHP_EOL;
            echo "<b>OUTPUT:</b> ";
            print_r($match[0]);
            echo PHP_EOL;
        }
    }
}


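Now the volume of output is manageable!  Here is what it looks like.  A quick visual inspection shows us that the two-URL example is really OK: it is reported (once for each match) only because the test compares each individual match with the whole expected string.  But foo.bar is not really something we want.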

EXPECT:
INPUTS: foo.bar may give false positive
REGEXP: #\b(https?|ftps?)??(://)??([A-Z0-9]+?\.)??([A-Z0-9]+?)([.]{1})([A-Z]{2,6})\b#i
OUTPUT: Array
(
    [0] => foo.bar
)

EXPECT: domain.com example.org
INPUTS: test domain.com chatter example.org noise
REGEXP: #\b(https?|ftps?)??(://)??([A-Z0-9]+?\.)??([A-Z0-9]+?)([.]{1})([A-Z]{2,6})\b#i
OUTPUT: Array
(
    [0] => domain.com
    [1] => example.org
)

EXPECT: domain.com example.org
INPUTS: test domain.com chatter example.org noise
REGEXP: #\b(https?|ftps?)??(://)??([A-Z0-9]+?\.)??([A-Z0-9]+?)([.]{1})([A-Z]{2,6})\b#i
OUTPUT: Array
(
    [0] => domain.com
    [1] => example.org
)
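
The two-URL report above is a side effect of comparing each match with the whole expected string.  One possible refinement, sketched here rather than taken from the original examples, is to join all of the matches for a target and compare the joined string once.  The function name check_case() is hypothetical, and joining with a single space is an assumption based on how the two-URL expectation is written in the test data.  A side benefit is that a target producing no match at all is also reported when a match was expected.

```php
<?php
// HYPOTHETICAL HELPER (NOT PART OF THE ORIGINAL EXAMPLES)
// COMPARE ALL MATCHES FOR ONE TARGET, JOINED BY SINGLE SPACES, WITH THE EXPECTED STRING
function check_case($regex, $expected, $target)
{
    preg_match_all($regex, $target, $match);
    $actual = implode(' ', $match[0]);
    return $actual === $expected;
}

// THE REGEX FROM THE OUTPUT ABOVE, WRITTEN ON ONE LINE
$regex = '#\b(https?|ftps?)??(://)??([A-Z0-9]+?\.)??([A-Z0-9]+?)([.]{1})([A-Z]{2,6})\b#i';

// A FEW OF THE TEST CASES FROM THE ARTICLE
$targets
= array
(  array( ""                       => "the dot-com bubble"
), array( ""                       => "foo.bar may give false positive"
), array( "domain.com example.org" => "test domain.com chatter example.org noise"
)
)
;

// TEST THE DATA STRINGS IN THE SUB-ARRAYS
foreach ($targets as $arr)
{
    foreach ($arr as $expected => $target)
    {
        // NO OUTPUT IF THE TEST WORKED AS EXPECTED
        if (check_case($regex, $expected, $target)) continue;

        echo "FAILED: expected [$expected] in [$target]" . PHP_EOL;
    }
}
```

With this comparison, only the foo.bar case should be reported, which is the one we actually care about.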

Can we live with the results that we are getting here?  That is not a programming question - it is a business requirements question.  The programming appears to be correct, as far as it goes.  But if the business requirements indicate that we need greater accuracy in finding URLs and domain names, we might want to turn to the authority on domain names, the IANA.  They publish the canonical list of top-level domain names at this URL.
http://data.iana.org/TLD/tlds-alpha-by-domain.txt

With just a few lines of code, we can read the information from that URL, remove the comments and the apparent noise, and use the IANA list of TLD strings in our regular expression.  Since our test cases already exist and we know that ".bar" is not one of the IANA-endorsed TLDs (at the time of writing), our next test should not show a false positive.  And indeed it does not.  Here is the next iteration of the code (example 8) with the IANA TLD information added into the regular expression.
 
<?php // RAY_EE_tdd_example_8.php
                      error_reporting(E_ALL);
                      echo "<pre>";
                      
                      // TEST DATA IS AN ARRAY OF INDIVIDUAL TEST ARRAYS
                      $targets
                      = array
                      (  array( ""                         => "test chatter random noise"
                      ), array( ""                         => "the dot-com bubble"
                      ), array( ""                         => "foo.bar may give false positive"
                      ), array( ""                         => "http://nonsense.nothing"
                      ), array( "http://example.org"       => "random noise http://example.org"
                      ), array( "http://example.org"       => "http://example.org? random noise"
                      ), array( "http://example.org"       => "random http://example.org noise"
                      ), array( "https://www.example.org"  => "random https://www.example.org noise"
                      ), array( "http://test.example.org"  => "random http://test.example.org noise"
                      ), array( "www.example.org"          => "random www.example.org noise"
                      ), array( "domain.com example.org"   => "test domain.com chatter example.org noise"
                      ), array( "domain.com"               => "test domain.com chatter"
                      )
                      )
                      ;
                      
                      // READ THE IANA TLD LIST
                      $tlds = file('http://data.iana.org/TLD/tlds-alpha-by-domain.txt', FILE_IGNORE_NEW_LINES);
                      
                      // ROUGH-CUT SANITIZE THE IANA TLD LIST REMOVING COMMENTS AND JUNK
                      foreach ($tlds as $key => $tld)
                      {
                          if (strpos($tld, '#')  !== FALSE) unset($tlds[$key]);
                          if (strpos($tld, '--') !== FALSE) unset($tlds[$key]);
                      }
                      
                      // COLLAPSE THE TLD ARRAY INTO A GROUP STRING FOR USE IN THE REGEX
                      $tldg = '(' . implode('|', $tlds) . ')';
                      
                      // A REGEX THAT FINDS URLS AND DOMAIN SUBSTRINGS
                      $regex
                      = '#'         // REGEX DELIMITER
                      
                      . '\b'        // ON WORD BOUNDARY
                      
                      . '('         // START GROUP
                      . 'https?'    // HTTP OR HTTPS
                      . '|'         // OR
                      . 'ftps?'     // FTP OR FTPS
                      . ')'         // END GROUP
                      . '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY
                      
                      . '('         // START GROUP
                      . '://'       // COLON, SLASH, SLASH
                      . ')'         // END GROUP
                      . '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY
                      
                      . '('         // START GROUP
                      . '[A-Z0-9]'  // A SUBDOMAIN
                      . '+?'        // INDETERMINATE LENGTH
                      . '\.'        // A DOT (ESCAPED)
                      . ')'         // END GROUP
                      . '??'        // ZERO OR ONE OF THIS GROUP, UNGREEDY
                      
                      . '('         // START GROUP
                      . '[A-Z0-9]'  // CHARACTER CLASS ALPHANUMERIC
                      . '+?'        // INDETERMINATE LENGTH
                      . ')'         // END GROUP
                      
                      . '('         // START GROUP
                      . '[.]'       // THE DOT (BEFORE THE TLD)
                      . '{1}'       // LENGTH IS EXACTLY ONE
                      . ')'         // END GROUP
                      
                      . $tldg       // THE GROUP OF IANA-ENDORSED TLD STRINGS
                      
                      . '\b'        // ON WORD BOUNDARY
                      
                      . '#'         // REGEX DELIMITER
                      . 'i'         // CASE-INSENSITIVE
                      ;
                      
                      // TEST THE DATA STRINGS IN THE SUB-ARRAYS
                      foreach ($targets as $arr)
                      {
                          foreach ($arr as $expected => $target)
                          {
                              preg_match_all($regex, $target, $match);
                      
                              // SHOW WHAT HAPPENED
                              foreach ($match[0] as $matched)
                              {
                                  // NO OUTPUT IF THE TEST WORKED AS EXPECTED
                                  if ($matched == $expected) continue;
                      
                                  // EXPOSITION IF THE TEST DID NOT WORK AS EXPECTED
                                  echo PHP_EOL;
                                  echo "<b>EXPECT:</b> $expected";
                                  echo PHP_EOL;
                                  echo "<b>INPUTS:</b> $target";
                                  echo PHP_EOL;
                                  echo "<b>REGEXP:</b> $regex";
                                  echo PHP_EOL;
                                  echo "<b>OUTPUT:</b> ";
                                  print_r($match[0]);
                                  echo PHP_EOL;
                              }
                          }
                      }


This process continues until we are satisfied with the regular expression.  We can add test cases at will; however, any change we make to the regex string requires a complete re-test.  The structure of the program and its test data enables us to run these tests instantly.
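
Because the IANA list is fetched over the network and changes over time, it may help to isolate the list-cleaning step so that it can be re-tested against a fixed sample.  The following is a minimal sketch under stated assumptions: the function name tld_group() is hypothetical, the sample lines are invented for the test, and the use of preg_quote() is an added precaution that the original example does not take.

```php
<?php
// HYPOTHETICAL HELPER (AN ASSUMPTION, NOT THE ARTICLE'S CODE)
// TURN RAW LINES IN THE IANA FILE LAYOUT INTO A REGEX ALTERNATION GROUP
function tld_group(array $lines)
{
    $tlds = array();
    foreach ($lines as $line)
    {
        $line = trim($line);
        if ($line === '') continue;                   // SKIP BLANK LINES
        if (strpos($line, '#')  !== FALSE) continue;  // SKIP COMMENTS
        if (strpos($line, '--') !== FALSE) continue;  // SKIP ENTRIES CONTAINING '--' (THE XN-- NAMES)
        $tlds[] = preg_quote($line, '#');             // ESCAPE ANY REGEX METACHARACTERS
    }
    return '(' . implode('|', $tlds) . ')';
}

// A FIXED SAMPLE IN THE SAME LAYOUT AS THE IANA FILE (INVENTED FOR THIS TEST)
$sample
= array
( '# sample header line'
, 'COM'
, 'ORG'
, 'XN--P1AI'
, ''
)
;

echo tld_group($sample) . PHP_EOL;
```

In the real script the lines would come from file() as in example 8.  Since file() returns FALSE on failure, it would be prudent to check the return value before passing it to the helper.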

Summary
Code and data interact in complex ways, and it is the programmer's task to bring order and predictability to this interaction.  Rather than guessing about how the code and data might interact, learn more about how to apply data visualization and TDD, two incredibly powerful software development techniques.
http://www.extremeprogramming.org/rules/testfirst.html
http://en.wikipedia.org/wiki/Test-driven_development


Comments (18)

Dave Baldwin, Fixer of Problems
CERTIFIED EXPERT
Most Valuable Expert 2014

Commented:
I'll add another comment to the above.  You can't test with just good data.  You have to use bad data as part of the test to see how your code handles it.  Does it pass it on or does it flag it as an error and provide a way to fix it?
Most Valuable Expert 2011
Author of the Year 2014

Author

Commented:
"Using test data is usually useless, because the data is created in 2010 by some department, ..."
Eh?! That misses the whole point of testing, which is a specialty.  In my experience, even modestly sophisticated organizations have a sub-entity dedicated to testing, and the people in this group would probably object to the "useless" characterization.  They certainly would not be first against the wall when budget cuts strike.
Most Valuable Expert 2014

Commented:
"Eh?! That misses the whole point of testing, which is a specialty."

I don't disagree with testing. I disagree with a concept of using a QA developed DB.

I work in a mostly ETL (Extract, Transform, Load) type situation.

There are standardized codes for reporting someone's education level to the government. We have customers that have modified the database default levels by adding to and modifying the existing levels. Because the end-user added "1 year college" and "2 years college" over 84 times, is an ETL process that says "UTD" (Unable To Determine) for education 100% of the time going to go over well?
Dave Baldwin, Fixer of Problems
CERTIFIED EXPERT
Most Valuable Expert 2014

Commented:
That exactly corresponds to the date problem.  For whatever reason, the end-user is allowed to enter an unacceptable format.  If it was on my web page, I'd make it a drop-down or radio boxes that only allow the proper formats.  Like I said above, let the salesman test it, they always get it wrong.
PortletPaul, EE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
Ray, you write so well and with passion, another impressive article. I have an admission to make however, my "eyes glaze over" when I see regex patterns :)

I'm also led to wonder why on earth we allowed things like .com without a geo reference first up, and why the sequence wasn't "top down". That is, why is the top-level domain last? :) Heaven knows what will happen with TLDs in the future too (e.g. generic names) - be prepared for regular revisits.

Such is life.

Programming without test data is the practice of clairvoyancy.
