Fernanditos

asked on

Removing lines containing numbers

Hi

I have the attached code, which removes every line that is not a .com or .net domain name. It also strips everything from the first "," onward on each line of the domains.txt file:

Example of domains.txt content:

amaze.com,10/20/2010 12:00:00 AM,AUC
ample.asia,10/20/2010 12:00:00 AM,AUC
am12ements.net,10/20/2010 12:00:00 AM,AUC
ant-arctic.com,10/20/2010 12:00:00 AM,AUC
antibiotic.net,10/20/2010 12:00:00 AM,AUC
antitrust.com,10/20/2010 12:00:00 AM,AUC
anyone.de,10/20/2010 12:00:00 AM,AUC
anyoneanyoneanyoneanyone.com,10/20/2010 12:00:00 AM,AUC

The attached code returns a cleaned list: (only .com and .net)

amaze.com
am12ements.net
ant-arctic.com
antibiotic.net
antitrust.com
anyoneanyoneanyoneanyone.com

I need to modify the code so that it also removes domains meeting any of these 3 criteria:

containing numbers
containing "-" character
domain name longer than 10 characters.

How can I add this to my existing code?

Thank you!



<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST DATA FROM THE POST AT EE
$str = file_get_contents('domains.txt');

// THE NEEDLES TO SEARCH FOR
$needles = array
( '.com,'
, '.net,'
)
;

// MAKE AN ARRAY FROM THE TEST DATA STRING
$arr = explode(PHP_EOL, $str);

// ITERATE OVER EACH LINE
foreach ($arr as $key => $val)
{
    // MAN PAGE http://us.php.net/manual/en/function.strpos.php
    if ( (strpos($val, $needles[0]) === FALSE) && (strpos($val, $needles[1]) === FALSE) )
    {
        unset($arr[$key]);
    }
    else
    {
        // FIND THE COMMA AT THE END OF THE TLD
        $poz = strpos($val, ',');
        $arr[$key] = substr($val, 0, $poz);
    }
}
$new = implode(PHP_EOL, $arr);
echo $new;


TRW-Consulting

Here's a quick and dirty solution, without any error checking.

Change the line:

        $arr[$key] = substr($val, 0, $poz);

To:

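        // KEEP ONLY THE TEXT BEFORE THE FIRST COMMA
        $tmpval = substr($val, 0, $poz);
        // DROP DOMAINS CONTAINING NUMBERS
        if (preg_match("/[0-9]]/", $tmpval)) { unset($arr[$key]); continue; }
        // DROP DOMAINS CONTAINING "-" (COMPARE AGAINST FALSE: A MATCH AT POSITION 0 RETURNS 0)
        if (strpos($tmpval, "-") !== FALSE) { unset($arr[$key]); continue; }
        // DROP DOMAINS WHOSE NAME (THE PART BEFORE THE FIRST ".") IS LONGER THAN 10 CHARACTERS
        $dompieces = explode(".", $tmpval);
        if (strlen($dompieces[0]) > 10) { unset($arr[$key]); continue; }

        $arr[$key] = $tmpval;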
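
For reference, the same three rules can be gathered into one function. This is only a minimal sketch, not code from the thread: the helper name clean_domain_line() is hypothetical, the sample lines are taken from the question, and it assumes the TLD check may be done against the end of the domain rather than by searching for '.com,' anywhere in the line.

```php
<?php
// Hypothetical helper (not from the thread): returns the cleaned domain,
// or null when the line should be dropped.
function clean_domain_line(string $line): ?string
{
    // Keep only the text before the first comma.
    $domain = trim(explode(',', $line, 2)[0]);

    // Rule 0: only .com and .net domains survive.
    if (preg_match('/\.(com|net)$/i', $domain) !== 1) {
        return null;
    }
    // Rule 1: drop domains containing any digit.
    if (preg_match('/[0-9]/', $domain) === 1) {
        return null;
    }
    // Rule 2: drop domains containing "-". Compare against false,
    // because strpos() returns 0 for a match at the first character.
    if (strpos($domain, '-') !== false) {
        return null;
    }
    // Rule 3: drop domains whose name (before the first ".") exceeds 10 characters.
    $name = explode('.', $domain)[0];
    if (strlen($name) > 10) {
        return null;
    }
    return $domain;
}

// Sample lines from the question.
$lines = [
    'amaze.com,10/20/2010 12:00:00 AM,AUC',
    'ample.asia,10/20/2010 12:00:00 AM,AUC',
    'am12ements.net,10/20/2010 12:00:00 AM,AUC',
    'ant-arctic.com,10/20/2010 12:00:00 AM,AUC',
    'antibiotic.net,10/20/2010 12:00:00 AM,AUC',
    'antitrust.com,10/20/2010 12:00:00 AM,AUC',
    'anyone.de,10/20/2010 12:00:00 AM,AUC',
    'anyoneanyoneanyoneanyone.com,10/20/2010 12:00:00 AM,AUC',
];

// array_filter() without a callback removes the null entries.
$kept = array_values(array_filter(array_map('clean_domain_line', $lines)));
echo implode(PHP_EOL, $kept), PHP_EOL;
```

With the sample data this should leave amaze.com, antibiotic.net and antitrust.com. Note that "antibiotic" is exactly 10 characters, so it passes the "longer than 10" test.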
Fernanditos

ASKER

Thank you! I am still getting domains with numbers:

4bidpay.com
470algerdr.com
4770794.com
435436.com
466466.com

Please check.
Oh, I fixed it by adding a "+": preg_match("/[0-9]+/",...

It works like a charm.

Can you please tell me how to also remove lines NOT CONTAINING "blog"?

Thank you.
Oh, good catch.  I see I fumble-fingered the double ']]' in my code :-)

To remove lines that do not contain "blog" just add:

  if (! preg_match("/blog/", $tmpval)) {unset($arr[$key]); continue; }
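
Since "blog" is a fixed string, the same test can be done without a regular expression. The sketch below is an illustration only: contains_word() is a hypothetical helper and the sample domains are made up. It uses stripos(), which is case-insensitive, so unlike the preg_match("/blog/") line above it would also keep "Blog" or "BLOG".

```php
<?php
// Hypothetical helper (not from the thread): true when $domain contains $word.
function contains_word(string $domain, string $word): bool
{
    // stripos() ignores case; compare against false explicitly,
    // because a match at position 0 returns 0.
    return stripos($domain, $word) !== false;
}

// Made-up sample data for illustration.
$domains = ['blogworld.com', 'myblog.net', 'antitrust.com'];

// Keep only the domains that contain "blog".
$kept = array_values(array_filter($domains, function ($d) {
    return contains_word($d, 'blog');
}));
echo implode(PHP_EOL, $kept), PHP_EOL;
```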