Solved

Need to fix and update e-mail extractor

Posted on 2009-04-05
Last Modified: 2013-12-17
Hi, I have this PHP email extractor script and I want it to move on to the next page by simply changing http://link.com/search=@&_pgn=1 to http://link.com/search=@&_pgn=2, and so on, up to http://link.com/search=@&_pgn=500.

There is one more problem: in the script's current state, when it finishes page 1 it starts page 1 again.


Can anyone help me with this script?
<?php

/*********************************************

* Email Grabber 		                     *

**********************************************

* Creation Date: 04-15-2006                  *

* Version: 3.0                               *

* Last Update: 07-20-2006                    *

*********************************************/

?>

<html>

<head>

<title>Email Grabber v3.0</title>

<style type="text/css">

td { font-family:Tahoma ;font-size:11px; }

body { font-family:Tahoma ;font-size:11px; }

input#btn{

        border: 1px solid #A9B4BE;

        height: 16px;

        height: 18px !important;

        font: bold 10px verdana;

        color: #000000;

        background-color: #F2F2F5;

        font-weight:bold

}

input#field {

        border: 1px solid #A9B4BE;

        height: 16px !important;

        font: bold 10px verdana;

        color: #000000;

        background-color: #F4F4F4;

}       /* For inputs */
 

input#focus {

        border: 1px solid #A9B4BE;

        height: 16px !important;

        font: bold 10px verdana;

        color: #FF0000;

        background-color: #F4F4F4;

}

</style>

</head>

<body>

<form name="form" method="post" action="">

<table border="1" width=300 cellpadding="5" bordercolor="#D6E3F8" bgcolor="#F3F3F3">

<tr>
 

<tr>

 <td><b>Pages</b></td>

 <td><input type="text" size="4" name="pages" id="field" value="200" onFocus="id='focus'" onBlur="id='field'"></td>

</tr>

<tr>

 <td><b>Start Page</b></td>

 <td><input type="text" size="4" name="start" id="field" value="0" onFocus="id='focus'" onBlur="id='field'"></td>

</tr>

<tr>

 <td><b>Show Emails</b></td>

 <td><input type="radio" name="show" value="yes"><b> YES &nbsp;&nbsp;</b><input type="radio" name="show" value="no" checked="checked"><b> NO</b></td>

</tr>

<tr>

 <td><b>Remove Duplicates</b></td>

 <td><input type="radio" name="remdup" value="yes" checked="checked"><b> YES &nbsp;&nbsp;</b><input type="radio" name="remdup" value="no"><b> NO</b><br><i>Slows down the program</i></td>

</tr>

<tr>

 <td><b>Show Duplicates</b></td>

 <td><input type="radio" name="showdup" value="yes"><b> YES &nbsp;&nbsp;</b><input type="radio" name="showdup" value="no" checked="checked"><b> NO</b></td>

</tr>

<tr height=20><td></td></tr>

<tr>

 <td colspan=2 align="center">

  <input type="submit" id="btn" style="font-weight: bold;" name="fetch" value="Get Emails &raquo;" onClick="this.disabled = true; this.value='Searching...';this.form.submit();">

 </td>

</tr>

</table>

</form><br><br><br><br>

<?php
 

$pages = $_POST['pages']; 			# The number of pages to crawl

if ($_POST['start'] == 1) $nb = 1;

else $nb = 2 * $_POST['start'];	# Start from

$show = $_POST['show'];				# Show emails grabbed

$remdp = $_POST['remdup'];			# Remove Duplicates

$showdp = $_POST['showdup'];        # Show Duplicates

$addyz = 0;							# Total Mails
 
 
 

if ($_POST) {

set_time_limit(0);

$emailList = array("");

$vDomains = array("mail.com", "hotmail.com", "cox.net","aol.com","verizon.com");

$fname = date("Ymd")."-".$domain.".txt";

$sfname = date("Ymd")."-".$domain."-stats.txt";
 

echo 'Emails will be written in: <b>'.$fname.'</b><br>';

echo 'Stats will be written in: <b>'.$sfname.'</b><br>';

echo 'Show emails: <span style="color:red;font-weight:bold">'.strtoupper($show).'</span><br>';

echo 'Remove duplicates: <span style="color:red;font-weight:bold">'.strtoupper($remdp).'</span><br>';

echo 'Show duplicates: <span style="color:red;font-weight:bold">'.strtoupper($showdp).'</span><br>';
 

	for ($z = 1; $z <= $pages; $z++) {

		$link = "http://link.com/search=@&_pgn=1";

		if ($z >= 2) {

			echo '<br>Time Elapsed: <span style="color:red;font-weight:bold">'. (int)($x_micro_time/60) .' minutes</span>';

			echo '<br>Emails: <span style="color:red;font-weight:bold">'.$paddyz.'</span>';

			if ($remdp == "yes") echo '<br>Duplicates Found: <span style="color:red;font-weight:bold">'.$duplicates.'</span>';

			echo '<br>Total Emails: <span style="color:red;font-weight:bold">'.$addyz.'</span>';

            echo "<BR>".str_pad("=", 40, "=", STR_PAD_BOTH);
 

			if ($remdp == "yes") $content = "================== [".date("H:i:s")."] ==================\r\n".$page."\r\nTime Elapsed: ". (int)($x_micro_time/60) ." minutes\r\nEmails: ".$paddyz."\r\nDuplicates Found: ".$duplicates."\r\nTotal Emails: ".$addyz."\r\n";

			else $content = "================== [".date("H:i:s")."] ==================\r\n".$page."\r\nTime Elapsed: ". (int)($x_micro_time/60) ." minutes\r\nEmails: ".$paddyz."\r\nTotal Emails: ".$addyz."\r\n";
 

			//$handler = fopen($sfname, "ab+");

			//fwrite($handler, $content);

			//fclose($handler);
 

		}

	$duplicates = 0;

	$paddyz = 0;

	page($link);

	$nb += 25;

	flush();

	ob_flush();

	sleep(1);

	}
 

}
 
 
 

function microtime_float() {

   list($usec, $sec) = explode(" ", microtime());

   return ((float)$usec + (float)$sec);

}
 

function addy($link) {

	GLOBAL $domain,$oldaddy, $addyz, $paddyz, $vDomains, $fname, $show, $showdp, $emailList, $remdp, $duplicates;				$restrings = array("'", "=", "-", "(", ")", "{", "}", "<", ">", "?", "!", ";", ":", ".", "," , " ", "\"", "`", "\r", "\n", "&");
 

				$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
 

    			$ch = curl_init();

       			curl_setopt($ch, CURLOPT_URL, $link);

        		curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  2);

        		curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

        		curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);

        		curl_setopt($ch, CURLOPT_TIMEOUT, 30);

        		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

   				curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 

        		$page = curl_exec ($ch);

        	    $ndpage = trim($page);

        	    $mdpage = strip_tags($ndpage);
 

                # echo $page;

        		curl_close($ch);

                echo "<BR>".str_pad("=", 150, "=", STR_PAD_BOTH)."<BR>";

				echo "<b>Link:</b> ".$link;

    			echo "<BR>".str_pad("=", 150, "=", STR_PAD_BOTH)."<BR>";
 

				$page2 = preg_split("/[\s,]+/", $mdpage);

				for ($x = 0; $x < count($page2); $x++) {

					//echo "<b>$x</b>: ".$page2[$x]."<BR>";

					if (strstr($page2[$x],"@") && strpos($page2[$x],$oldaddy) === FALSE && strpos($page2[$x], "@media") === FALSE) {

									echo "<b>Old:</b> ".$page2[$x]." | ";
 

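									// split() was removed in PHP 7; explode() with a limit of 2 keeps everything after the first "@" together
									list($addy, $dommy) = explode("@", $page2[$x], 2);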
 

				 					for ($i=0;$i<count($restrings);$i++){

				   						if (strpos($addy, $restrings[$i])) $addy = substr($addy, strrpos($addy, $restrings[$i])+1);

				         			}
 

				         			if (strpos($dommy,"&nbsp;")) $dommy = substr($dommy,0,strpos($dommy,"&nbsp;"));

				         			if (!preg_match('/[a-zA-Z0-9]$/',$dommy)) $dommy = substr($dommy,0,strlen($dommy)-1);
 

				            		$email = trim($addy."@".$dommy);

				            		$email = strtolower($email);
 

									if (IsEmail($email) == TRUE) setemail($email);

					}

				}

                unset($email);

				$pattern = '/(<(?:[^<>]+(?:"[^"]*"|\'[^\']*\')?)+>)/';
 

				$html_array = preg_split ($pattern, trim ($page), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
 

				for ($x = 0; $x < count($html_array); $x++) {

					//echo "<B>".$x."</B>: ".$html_array[$x]."<BR>";

					if (preg_match('/href="mailto:/', $html_array[$x])) {

						$webmail = substr($html_array[$x],strpos($html_array[$x],":")+1);

						$webmail = substr($webmail,0,strpos($webmail,"\""));

					}

				}
 

				if ($webmail && IsEmail($webmail) == TRUE) {

					echo "<b><i>Webmail: </b></i>";

					setemail($webmail);

				}
 

				flush();

				ob_flush();

				echo "<BR>".str_pad("=", 150, "=", STR_PAD_BOTH)."<BR><BR><BR>";

}
 

function page($url) {

			GLOBAL $x_micro_time, $x_micro_stop, $x_micro_start, $items, $page, $sfname, $dPage; 	 		$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
 

    		$ch = curl_init();

   	    	curl_setopt($ch, CURLOPT_URL,$url);

   	     	curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  2);

        	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

        	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);

        	curl_setopt($ch, CURLOPT_TIMEOUT, 30);

        	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

   			curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 

        	$result = curl_exec ($ch);
 

        	curl_close ($ch);
 

			if (strpos($result,"Page Not Responding")) {

				echo "Page Error. Skipping Page";

 				sleep(10);

 				}

 			else {
 

				$result2 = explode(" ",$result);

        		$nr = 0;
 

             		if (!$items) {

						$items = substr($result, strpos($result,'<b class="sectiontitle">')+24);

						$items = substr($items,0,strpos($items,"</b> items found for"));

						echo '<br><b>Total emails found: <font color=red>'.$items.'</font><br><br><br></b>';

			 		}
 

            	$page = substr($result, strpos($result,'<tr><td><b> Page'));

            	$page = substr($page,0, strpos($page,'</td><td class="goto"'));

            	$page = strip_tags($page);
 

				echo '<br>==============<b> '.$page.' </b>==============<br>';
 

            	$x_micro_start = microtime_float();
 

				for ($i = 0; $i < count($result2); $i++) {

        		# echo "<b>$i</b>: ".$result2[$i]."<BR>";
 

        			$link = substr($result2[$i], strpos($result2[$i], '<a href="http://zzz/')+6);

					$link = substr($link, 0, strpos($link, 'cmdZViewItem'));

					if ($link && $link != $oldLink) {

						// echo $link."<BR>";

						$oldLink = $link;

      			   	 addy($link);

      					$nr++;
 

      					flush();

						ob_flush();

					}

        		}
 

        		$x_micro_stop = microtime_float();

 				$x_micro_time = $x_micro_stop - $x_micro_start;

 			}
 

}
 

function IsEMail($e)

{

   if(preg_match('/^[a-zA-Z0-9]+[_a-zA-Z0-9-]*(\.[_a-z0-9-]+)*@[a-z?G0-9]+(-[a-z?G0-9]+)*(\.[a-z?G0-9-]+)*(\.[a-z]{2,4})$/', $e)) return TRUE;

   else return FALSE;

}
 

function setemail($email) {

	GLOBAL $remdp, $emailList, $show, $showdp, $duplicates, $oldaddy, $addyz, $paddyz;
 

	if ($remdp == "yes") {

		if (sizeof($emailList) == 0 || !in_array($email, $emailList)) {

			array_push($emailList, $email);

			if ($show == "yes") echo "<B>".$email."</b><br>\n";

			$oldaddy = $email;	$addyz++;	$paddyz++;

			//$handle = fopen($fname, "ab");

			//fwrite($handle, $email."\r\n");

			//fclose($handle);

		}

		elseif (in_array($email, $emailList)) {

			if ($showdp == "yes") echo "Duplicate found! ( <b>$email</b> )<BR>";

			$duplicates++;

		}

	}

	elseif ($remdp == "no") {

		if ($show == "yes") echo "<B>".$email."</b><br>\n";

		$oldaddy = $email;	$addyz++;	$paddyz++;

		//$handle = fopen($fname, "ab");

		//fwrite($handle, $email."\r\n");

		//fclose($handle);

	}

}
 

?>

</body>

</html>

Question by:luckian121
 

Author Comment

by:luckian121
ID: 24072074
Wow, too many comments :))
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 24072418
See line 101, where the page number is hardcoded in the $link assignment inside the for loop, and change it to use the loop variable $z as shown below.

Best regards, ~Ray
$link = "http://link.com/search=@&_pgn=$z";
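To illustrate why this one change is enough, here is a minimal standalone sketch of the same idea. It only builds the URLs and does not fetch anything. The helper name buildPageUrl() and the variables $firstPage and $lastPage are made up for the example and are not part of the original script; the address is the placeholder from the question.

```php
<?php
// Hypothetical helper (not in the original script): builds the URL for one
// results page, using the placeholder address from the question.
function buildPageUrl(int $pageNumber): string
{
    return "http://link.com/search=@&_pgn=" . $pageNumber;
}

$firstPage = 1;   // assumed starting page
$lastPage  = 500; // last page named in the question

$links = [];
for ($z = $firstPage; $z <= $lastPage; $z++) {
    // Because $z changes on every pass, each iteration gets a different page
    // instead of requesting page 1 again.
    $links[] = buildPageUrl($z);
}

echo $links[0], "\n";                 // first generated URL
echo $links[count($links) - 1], "\n"; // last generated URL
```

In the posted script the loop runs from 1 to the value of the "Pages" form field, so that field would need to be set to 500 to reach the last page.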

 

Author Comment

by:luckian121
ID: 24073401
Wow, thanks... Can you help me with one more thing?
I want it to echo only the emails,
like this:

ebhjk@fgfh.com
sdfg@fdg.com
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 24108232
Strangely that is a bit more complicated... Look near line #270 at the setemail() function.  It looks like that builds an array of email addresses.  This instruction is a strong clue:

array_push($emailList, $email);

So to print out the email addresses, you would want to iterate over that array called $emailList.  I am not sure where in the programming you would want to do this -- presumably after the array is completely filled.  This is, perhaps, an object lesson in why it is good to have a lot of comments interspersed into the code!
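As a rough sketch of that iteration, the snippet below prints one address per line. The sample data stands in for the array that setemail() fills, and formatEmailList() is a hypothetical helper, not something from the posted script. It assumes the list starts with an empty string, since the script initialises it with array("").

```php
<?php
// Sample data standing in for the $emailList array that setemail() fills.
// The script initialises it as array(""), so an empty first entry is expected.
$emailList = array("", "ebhjk@fgfh.com", "sdfg@fdg.com", "ebhjk@fgfh.com");

// Hypothetical helper: returns the unique, non-empty addresses, one per line.
function formatEmailList(array $emails): string
{
    // Drop empty entries, then drop repeated addresses.
    $nonEmpty = array_filter($emails, function ($e) {
        return $e !== "";
    });
    return implode("\n", array_unique($nonEmpty));
}

echo formatEmailList($emailList), "\n";
```

In the script itself this would most likely go after the page loop has finished. Two things to keep in mind: the array is only filled when "Remove Duplicates" is set to YES, because array_push() sits in that branch of setemail(), and the other echo statements in page() and addy() would have to be commented out if nothing but the addresses should appear. In a browser, wrapping the output in a pre element or joining with a br tag instead of "\n" keeps one address per line.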

HTH, and Thanks for the points, ~Ray