Solved

Need to fix and update e-mail extractor

Posted on 2009-04-05
Last Modified: 2013-12-17
Hi, I have this PHP email extractor script and I want it to advance to the next page by changing http://link.com/search=@&_pgn=1 to http://link.com/search=@&_pgn=2, and so on up to http://link.com/search=@&_pgn=500.

There is one more problem: in its current state, when the script finishes page 1 it starts page 1 again.

Can anyone help me with this script?
<?
/*********************************************
* Email Grabber 		                     *
**********************************************
* Creation Date: 04-15-2006                  *
* Version: 3.0                               *
* Last Update: 07-20-2006                    *
*********************************************/
?>
<html>
<head>
<title>Email Grabber v3.0</title>
<style type="text/css">
td { font-family:Tahoma ;font-size:11px; }
body { font-family:Tahoma ;font-size:11px; }
input#btn{
        border: 1px solid #A9B4BE;
        height: 16px;
        height: 18px !important;
        font: bold 10px verdana;
        color: #000000;
        background-color: #F2F2F5;
        font-weight:bold
}
input#field {
        border: 1px solid #A9B4BE;
        height: 16px !important;
        font: bold 10px verdana;
        color: #000000;
        background-color: #F4F4F4;
}       /* Pentru input-uri */
 
input#focus {
        border: 1px solid #A9B4BE;
        height: 16px !important;
        font: bold 10px verdana;
        color: #FF0000;
        background-color: #F4F4F4;
}
</style>
</head>
<body>
<form name="form" method="post" action="">
<table border="1" width=300 cellpadding="5" bordercolor="#D6E3F8" bgcolor="#F3F3F3">
<tr>
 
<tr>
 <td><b>Pages</b></td>
 <td><input type="text" size="4" name="pages" id="field" value="200" onFocus="id='focus'" onBlur="id='field'"></td>
</tr>
<tr>
 <td><b>Start Page</b></td>
 <td><input type="text" size="4" name="start" id="field" value="0" onFocus="id='focus'" onBlur="id='field'"></td>
</tr>
<tr>
 <td><b>Show Emails</b></td>
 <td><input type="radio" name="show" value="yes"><b> YES &nbsp;&nbsp;</b><input type="radio" name="show" value="no" checked="checked"><b> NO</b></td>
</tr>
<tr>
 <td><b>Remove Duplicates</b></td>
 <td><input type="radio" name="remdup" value="yes" checked="checked"><b> YES &nbsp;&nbsp;</b><input type="radio" name="remdup" value="no"><b> NO</b><br><i>Slows down the program</i></td>
</tr>
<tr>
 <td><b>Show Duplicates</b></td>
 <td><input type="radio" name="showdup" value="yes"><b> YES &nbsp;&nbsp;</b><input type="radio" name="showdup" value="no" checked="checked"><b> NO</b></td>
</tr>
<tr height=20><td></td></tr>
<tr>
 <td colspan=2 align="center">
  <input type="submit" id="btn" style="font-weight: bold;" name="fetch" value="Get Emails &raquo;" onClick="this.disabled = true; this.value='Searching...';this.form.submit();">
 </td>
</tr>
</table>
</form><br><br><br><br>
<?php
 
$pages = $_POST['pages']; 			# The number of pages to crawl
if ($_POST['start'] == 1) $nb = 1;
else $nb = 2 * $_POST['start'];	# Start from
$show = $_POST['show'];				# Show emails grabbed
$remdp = $_POST['remdup'];			# Remove Duplicates
$showdp = $_POST['showdup'];        # Show Duplicates
$addyz = 0;							# Total Mails
 
 
 
if ($_POST) {
set_time_limit(0);
$emailList = array("");
$vDomains = array("mail.com", "hotmail.com", "cox.net","aol.com","verizon.com");
$fname = date("Ymd")."-".$domain.".txt";
$sfname = date("Ymd")."-".$domain."-stats.txt";
 
echo 'Emails will be written in: <b>'.$fname.'</b><br>';
echo 'Stats will be written in: <b>'.$sfname.'</b><br>';
echo 'Show emails: <span style="color:red;font-weight:bold">'.strtoupper($show).'</span><br>';
echo 'Remove duplicates: <span style="color:red;font-weight:bold">'.strtoupper($remdp).'</span><br>';
echo 'Show duplicates: <span style="color:red;font-weight:bold">'.strtoupper($showdp).'</span><br>';
 
	for ($z = 1; $z <= $pages; $z++) {
		$link = "http://link.com/search=@&_pgn=1";
		if ($z >= 2) {
			echo '<br>Time Elapsed: <span style="color:red;font-weight:bold">'. (int)($x_micro_time/60) .' minutes</span>';
			echo '<br>Emails: <span style="color:red;font-weight:bold">'.$paddyz.'</span>';
			if ($remdp == "yes") echo '<br>Duplicates Found: <span style="color:red;font-weight:bold">'.$duplicates.'</span>';
			echo '<br>Total Emails: <span style="color:red;font-weight:bold">'.$addyz.'</span>';
            echo "<BR>".str_pad("=", 40, "=", STR_PAD_BOTH);
 
			if ($remdp == "yes") $content = "================== [".date("H:i:s")."] ==================\r\n".$page."\r\nTime Elapsed: ". (int)($x_micro_time/60) ." minutes\r\nEmails: ".$paddyz."\r\nDuplicates Found: ".$duplicates."\r\nTotal Emails: ".$addyz."\r\n";
			else $content = "================== [".date("H:i:s")."] ==================\r\n".$page."\r\nTime Elapsed: ". (int)($x_micro_time/60) ." minutes\r\nEmails: ".$paddyz."\r\nTotal Emails: ".$addyz."\r\n";
 
			//$handler = fopen($sfname, "ab+");
			//fwrite($handler, $content);
			//fclose($handler);
 
		}
	$duplicates = 0;
	$paddyz = 0;
	page($link);
	$nb += 25;
	flush();
	ob_flush();
	sleep(1);
	}
 
}
 
 
 
function microtime_float() {
   list($usec, $sec) = explode(" ", microtime());
   return ((float)$usec + (float)$sec);
}
 
function addy($link) {
	GLOBAL $domain,$oldaddy, $addyz, $paddyz, $vDomains, $fname, $show, $showdp, $emailList, $remdp, $duplicates;				$restrings = array("'", "=", "-", "(", ")", "{", "}", "<", ">", "?", "!", ";", ":", ".", "," , " ", "\"", "`", "\r", "\n", "&");
 
				$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
 
    			$ch = curl_init();
       			curl_setopt($ch, CURLOPT_URL, $link);
        		curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  2);
        		curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        		curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
        		curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
   				curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 
        		$page = curl_exec ($ch);
        	    $ndpage = trim($page);
        	    $mdpage = strip_tags($ndpage);
 
                # echo $page;
        		curl_close($ch);
                echo "<BR>".str_pad("=", 150, "=", STR_PAD_BOTH)."<BR>";
				echo "<b>Link:</b> ".$link;
    			echo "<BR>".str_pad("=", 150, "=", STR_PAD_BOTH)."<BR>";
 
				$page2 = preg_split("/[\s,]+/", $mdpage);
				for ($x = 0; $x < count($page2); $x++) {
					//echo "<b>$x</b>: ".$page2[$x]."<BR>";
					if (strstr($page2[$x],"@") && strpos($page2[$x],$oldaddy) === FALSE && strpos($page2[$x], "@media") === FALSE) {
									echo "<b>Old:</b> ".$page2[$x]." | ";
 
									list($addy, $dommy) = explode("@", $page2[$x]); // split() was removed in PHP 7
 
				 					for ($i=0;$i<count($restrings);$i++){
				   						if (strpos($addy, $restrings[$i])) $addy = substr($addy, strrpos($addy, $restrings[$i])+1);
				         			}
 
				         			if (strpos($dommy,"&nbsp;")) $dommy = substr($dommy,0,strpos($dommy,"&nbsp;"));
				         			if (!preg_match("/[a-zA-Z0-9]$/",$dommy)) $dommy = substr($dommy,0,strlen($dommy)-1);
 
				            		$email = trim($addy."@".$dommy);
				            		$email = strtolower($email);
 
									if (IsEmail($email) == TRUE) setemail($email);
					}
				}
                unset($email);
				$pattern = '/(<(?:[^<>]+(?:"[^"]*"|\'[^\']*\')?)+>)/';
 
				$html_array = preg_split ($pattern, trim ($page), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
 
				for ($x = 0; $x < count($html_array); $x++) {
					//echo "<B>".$x."</B>: ".$html_array[$x]."<BR>";
					if (preg_match('/href="mailto:/', $html_array[$x])) {
						$webmail = substr($html_array[$x],strpos($html_array[$x],":")+1);
						$webmail = substr($webmail,0,strpos($webmail,"\""));
					}
				}
 
				if ($webmail && IsEmail($webmail) == TRUE) {
					echo "<b><i>Webmail: </b></i>";
					setemail($webmail);
				}
 
				flush();
				ob_flush();
				echo "<BR>".str_pad("=", 150, "=", STR_PAD_BOTH)."<BR><BR><BR>";
}
 
function page($url) {
			GLOBAL $x_micro_time, $x_micro_stop, $x_micro_start, $items, $page, $sfname, $dPage; 	 		$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
 
    		$ch = curl_init();
   	    	curl_setopt($ch, CURLOPT_URL,$url);
   	     	curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  2);
        	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
        	curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
   			curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 
        	$result = curl_exec ($ch);
 
        	curl_close ($ch);
 
			if (strpos($result,"Page Not Responding")) {
				echo "Page Error. Skipping Page";
 				sleep(10);
 				}
 			else {
 
				$result2 = explode(" ",$result);
        		$nr = 0;
 
             		if (!$items) {
						$items = substr($result, strpos($result,'<b class="sectiontitle">')+24);
						$items = substr($items,0,strpos($items,"</b> items found for"));
						echo '<br><b>Total emails found: <font color=red>'.$items.'</font><br><br><br></b>';
			 		}
 
            	$page = substr($result, strpos($result,'<tr><td><b> Page'));
            	$page = substr($page,0, strpos($page,'</td><td class="goto"'));
            	$page = strip_tags($page);
 
				echo '<br>==============<b> '.$page.' </b>==============<br>';
 
            	$x_micro_start = microtime_float();
 
				for ($i = 0; $i < count($result2); $i++) {
        		# echo "<b>$i</b>: ".$result2[$i]."<BR>";
 
        			$link = substr($result2[$i], strpos($result2[$i], '<a href="http://zzz/')+6);
					$link = substr($link, 0, strpos($link, 'cmdZViewItem'));
					if ($link && $link != $oldLink) {
						// echo $link."<BR>";
						$oldLink = $link;
      			   	 addy($link);
      					$nr++;
 
      					flush();
						ob_flush();
					}
        		}
 
        		$x_micro_stop = microtime_float();
 				$x_micro_time = $x_micro_stop - $x_micro_start;
 			}
 
}
 
function IsEMail($e)
{
   if(preg_match("/^[a-zA-Z0-9]+[_a-zA-Z0-9-]*(\.[_a-z0-9-]+)*@[a-z?G0-9]+(-[a-z?G0-9]+)*(\.[a-z?G0-9-]+)*(\.[a-z]{2,4})$/", $e)) return TRUE;
   else return FALSE;
}
 
function setemail($email) {
	GLOBAL $remdp, $emailList, $show, $showdp, $duplicates, $oldaddy, $addyz, $paddyz;
 
	if ($remdp == "yes") {
		if (sizeof($emailList) == 0 || !in_array($email, $emailList)) {
			array_push($emailList, $email);
			if ($show == "yes") echo "<B>".$email."</b><br>\n";
			$oldaddy = $email;	$addyz++;	$paddyz++;
			//$handle = fopen($fname, "ab");
			//fwrite($handle, $email."\r\n");
			//fclose($handle);
		}
		elseif (in_array($email, $emailList)) {
			if ($showdp == "yes") echo "Duplicate found! ( <b>$email</b> )<BR>";
			$duplicates++;
		}
	}
	elseif ($remdp == "no") {
		if ($show == "yes") echo "<B>".$email."</b><br>\n";
		$oldaddy = $email;	$addyz++;	$paddyz++;
		//$handle = fopen($fname, "ab");
		//fwrite($handle, $email."\r\n");
		//fclose($handle);
	}
}
 
?>
</body>
</html>

Question by:luckian121
4 Comments
 

Author Comment

by:luckian121
ID: 24072074
Wow, too many comments :))

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 24072418
See line 101, where the page number is hardcoded, and change it to a variable as shown below, so that each pass through the loop requests page $z instead of page 1 again.

Best regards, ~Ray
$link = "http://link.com/search=@&_pgn=$z";
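
The same idea can be tried in isolation before editing the full script. This is only a minimal sketch: the helper name build_page_url() and the values of $startPage and $pages are assumptions made for illustration (in the original script the page count comes from $_POST['pages']), and only the URL pattern is taken from the question.

```php
<?php
// Hypothetical helper (not part of the original script): builds the search
// URL for one page number, using the URL pattern from the question.
function build_page_url(int $pageNumber): string
{
    return "http://link.com/search=@&_pgn=" . $pageNumber;
}

// Assumed values for illustration only.
$startPage = 1;
$pages = 3;

$links = [];
for ($z = $startPage; $z < $startPage + $pages; $z++) {
    // $z changes on every iteration, so each URL points at a different page.
    $links[] = build_page_url($z);
}

foreach ($links as $link) {
    echo $link, "\n";
}
```

In the script itself, each generated URL would be handed to page($link) in place of the hardcoded one.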

 

Author Comment

by:luckian121
ID: 24073401
Wow, thanks. Can you help me with one more thing?
I want it to echo only the emails,
like this:

ebhjk@fgfh.com
sdfg@fdg.com

Expert Comment

by:Ray Paseur
ID: 24108232
Strangely, that is a bit more complicated. Look near line 270, at the setemail() function. It looks like that function builds an array of email addresses. This instruction is a strong clue:

array_push($emailList, $email);

So to print out the email addresses, you would want to iterate over the array called $emailList. I am not sure where in the program you would want to do this; presumably after the array is completely filled. This is, perhaps, an object lesson in why it is good to have a lot of comments interspersed in the code!
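
A rough sketch of that iteration is below. It assumes $emailList has already been filled by setemail(). The helper name format_email_list() is hypothetical, and the sample addresses are the ones from the comment above. The script initialises the array with one empty string, so empty entries are skipped.

```php
<?php
// Stand-in for the array that setemail() fills. The original script starts it
// with one empty string, which is reproduced here on purpose.
$emailList = array("", "ebhjk@fgfh.com", "sdfg@fdg.com");

// Hypothetical helper: returns the addresses one per line for HTML output,
// skipping empty entries and escaping each address.
function format_email_list(array $emails): string
{
    $lines = array();
    foreach ($emails as $email) {
        if ($email !== "") {
            $lines[] = htmlspecialchars($email);
        }
    }
    return implode("<br>\n", $lines);
}

echo format_email_list($emailList), "\n";
```

In the full script the call would presumably go after the closing brace of the for ($z ...) loop. The other echo statements (the "Link:", "Old:" and statistics lines) would also have to be commented out if nothing but the addresses should appear.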

HTH, and Thanks for the points, ~Ray