Removing lines containing numbers

Fernanditos
Hi

I have the attached code, which removes every line that does not contain a .com or .net domain name. It also strips everything from the first "," onward on each line of the domains.txt file.

Example of domains.txt content:

amaze.com,10/20/2010 12:00:00 AM,AUC
ample.asia,10/20/2010 12:00:00 AM,AUC
am12ements.net,10/20/2010 12:00:00 AM,AUC
ant-arctic.com,10/20/2010 12:00:00 AM,AUC
antibiotic.net,10/20/2010 12:00:00 AM,AUC
antitrust.com,10/20/2010 12:00:00 AM,AUC
anyone.de,10/20/2010 12:00:00 AM,AUC
anyoneanyoneanyoneanyone.com,10/20/2010 12:00:00 AM,AUC

The attached code returns a cleaned list (only .com and .net):

amaze.com
am12ements.net
ant-arctic.com
antibiotic.net
antitrust.com
anyoneanyoneanyoneanyone.com

I need to modify the code so that it also removes domains meeting any of these three criteria:

containing numbers
containing "-" character
domain name longer than 10 characters.

How can I add this to my existing code?

Thank you!



<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST DATA FROM THE POST AT EE
$str = file_get_contents('domains.txt');

// THE NEEDLES TO SEARCH FOR
$needles = array
( '.com,'
, '.net,'
)
;

// MAKE AN ARRAY FROM THE TEST DATA STRING
$arr = explode(PHP_EOL, $str);

// ITERATE OVER EACH LINE
foreach ($arr as $key => $val)
{
    // MAN PAGE http://us.php.net/manual/en/function.strpos.php
    if ( (strpos($val, $needles[0]) === FALSE) && (strpos($val, $needles[1]) === FALSE) )
    {
        unset($arr[$key]);
    }
    else
    {
        // FIND THE COMMA AT THE END OF THE TLD
        $poz = strpos($val, ',');
        $arr[$key] = substr($val, 0, $poz);
    }
}
$new = implode(PHP_EOL, $arr);
echo $new;

Here's a quick and dirty solution, without any error checking.

Change the line:

        $arr[$key] = substr($val, 0, $poz);

To:

      if (preg_match("/[0-9]]/", $tmpval)) { unset($arr[$key]); continue; }
      if (strpos($tmpval, "-")) { unset($arr[$key]); continue; }
      $dompieces = explode(".", $tmpval);
      if (strlen($dompieces[0]) > 10) { unset($arr[$key]); continue; }

        $arr[$key] = $tmpval;
Oops, I left out one line, so let me try this again.

Change the line:

        $arr[$key] = substr($val, 0, $poz);

To:

      $tmpval = substr($val, 0, $poz);

      if (preg_match("/[0-9]]/", $tmpval)) { unset($arr[$key]); continue; }
      if (strpos($tmpval, "-")) { unset($arr[$key]); continue; }
      $dompieces = explode(".", $tmpval);
      if (strlen($dompieces[0]) > 10) { unset($arr[$key]); continue; }

        $arr[$key] = $tmpval;
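
A side note on the unset-and-continue pattern used above, as a minimal sketch rather than part of the original answer: a by-value foreach in PHP walks a copy of the array, so removing entries from $arr inside the loop is safe, but the surviving keys keep their old numbers. The sample data below is hypothetical, and the strict `!== false` comparison is my own choice, since strpos() returns an offset (possibly 0) or FALSE.

```php
<?php
// Hypothetical sample data, not taken from the original post.
$arr = array('amaze.com', 'ant-arctic.com', 'antitrust.com');

foreach ($arr as $key => $val) {
    // strpos() returns an integer offset or FALSE, so compare strictly.
    if (strpos($val, '-') !== false) {
        unset($arr[$key]); // safe: a by-value foreach iterates over a copy
        continue;          // skip any remaining checks for this line
    }
}

// Keys 0 and 2 survive; array_values() renumbers them to 0 and 1.
$reindexed = array_values($arr);

// implode() ignores the keys, so the gap does not affect the output.
echo implode(PHP_EOL, $arr), PHP_EOL;
```

implode() does not care about the gaps, so reindexing only matters if later code relies on consecutive numeric keys.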

Author

Commented:
Thank you! I am still getting domains with numbers:

4bidpay.com
470algerdr.com
4770794.com
435436.com
466466.com

Please check.

Author

Commented:
Oh, I fixed it by adding a "+": preg_match("/[0-9]+/",...

It works like a charm.

Can you please tell me how to also remove lines NOT CONTAINING "blog"?

Thank you.
Oh, good catch.  I see I fumble-fingered the double ']]' in my code :-)

To remove lines that do not contain "blog" just add:

  if (! preg_match("/blog/", $tmpval)) {unset($arr[$key]); continue; }
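
For readers following along, here is a rough sketch of why the typo'd pattern let numbers through and how the checks discussed so far might be combined. The pattern "/[0-9]]/" only matches a digit immediately followed by a literal "]", which never occurs in these domains; "/[0-9]/" alone is enough, and the "+" is harmless but not required. The helper name keepDomain() and the sample list are hypothetical, and using strpos() with a strict comparison for the hyphen and "blog" tests is my substitution, not the thread's code.

```php
<?php
// keepDomain() is a hypothetical helper name, not from the thread.
// It returns true only for domains that pass every filter.
function keepDomain($domain)
{
    if (preg_match('/[0-9]/', $domain)) return false;    // contains a digit
    if (strpos($domain, '-') !== false) return false;    // contains a hyphen
    $pieces = explode('.', $domain);
    if (strlen($pieces[0]) > 10) return false;           // name part over 10 chars
    if (strpos($domain, 'blog') === false) return false; // must contain "blog"
    return true;
}

// The typo'd pattern needs a digit followed by "]" before it matches.
$typoMatch  = preg_match('/[0-9]]/', '4bidpay.com');
$fixedMatch = preg_match('/[0-9]/', '4bidpay.com');

// Hypothetical sample list for illustration only.
$sample = array('4bidpay.com', 'my-blog.com', 'techblog.net', 'antitrust.com');
$kept = array_values(array_filter($sample, 'keepDomain'));
print_r($kept);
```

Keeping all the rules in one function makes it easier to add or drop a criterion without touching the loop that reads domains.txt.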
Commented:
When you say "domain name longer than 10 characters" I am assuming you mean the domain name including the TLD, right?
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST DATA FROM THE POST AT EE
$str = <<<EOS
amaze.com,10/20/2010 12:00:00 AM,AUC
ample.asia,10/20/2010 12:00:00 AM,AUC
am12ements.net,10/20/2010 12:00:00 AM,AUC
ant-arctic.com,10/20/2010 12:00:00 AM,AUC
antibiotic.net,10/20/2010 12:00:00 AM,AUC
antitrust.com,10/20/2010 12:00:00 AM,AUC
anyone.de,10/20/2010 12:00:00 AM,AUC
anyoneanyoneanyoneanyone.com,10/20/2010 12:00:00 AM,AUC
EOS;

// THE NEEDLES TO SEARCH FOR
$needles = array
( '.com,'
, '.net,'
)
;

// MAKE AN ARRAY FROM THE TEST DATA STRING
$arr = explode(PHP_EOL, $str);

// ITERATE OVER EACH LINE
foreach ($arr as $key => $val)
{
    // MAN PAGE http://us.php.net/manual/en/function.strpos.php
    if ( (strpos($val, $needles[0]) === FALSE) && (strpos($val, $needles[1]) === FALSE) )
    {
        unset($arr[$key]);
    }
    else
    {
        // FIND THE COMMA AT THE END OF THE TLD
        $poz = strpos($val, ',');
        $arr[$key] = substr($val, 0, $poz);
    }
}
$new = implode(PHP_EOL, $arr);

// APPLY THE NEW FILTER CRITERIA TO THE DATA
/*
containing numbers
containing "-" character
domain name longer than 10 characters.
*/
$arr = explode(PHP_EOL, $new);
foreach ($arr as $key => $val)
{
    if (preg_match('/[0-9]/', $val)) unset($arr[$key]);
    if (strpos($val, '-') !== FALSE) unset($arr[$key]);
    if (strlen($val) > 10)           unset($arr[$key]);
}
$new = implode(PHP_EOL, $arr);
var_dump($new);
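
The two readings of "longer than 10 characters" give different results on the sample data, so a small sketch comparing them may help. The function names fullLength() and labelLength() are hypothetical, introduced only for this illustration.

```php
<?php
// Hypothetical helper names; the thread does not define them.

// Length of the whole domain, counting the dot and the TLD.
function fullLength($domain)
{
    return strlen($domain);
}

// Length of the text before the first dot only.
function labelLength($domain)
{
    $pieces = explode('.', $domain);
    return strlen($pieces[0]);
}

// Domains taken from the sample data in the question.
foreach (array('amaze.com', 'antitrust.com', 'antibiotic.net') as $d) {
    printf("%-15s full=%2d label=%2d%s", $d, fullLength($d), labelLength($d), PHP_EOL);
}
```

Under the whole-domain reading only amaze.com stays within 10 characters, while under the name-only reading all three pass, so it is worth confirming which one is intended before choosing between strlen($val) and the explode() approach.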
