PHP cURL Screen Scraping Problem

Hi guys, I am developing some tools in PHP for work, and one of them needs to pull some text off a site. I am pretty sure the site prints its initial content and then uses JavaScript or some other client-side code to print more content after the page loads. As a result, my cURL request only gets part of the HTML back, and of course it is not the part I need. Can someone tell me what I am doing wrong, or what I can do to get all of the HTML?

I am trying to get the text before: ".............. This is YOUR Council District!"

Thanks!
-Anthony
//getdistrict.php
<?php
$url = "http://eservices.sccgov.org/district/search.do";
// Build the POST fields first: $options below references $fields.
$fields = array("action" => "searchForDistrict",
				"houseno" => isset($_POST['houseno']) ? $_POST['houseno'] : "",
				"streetname" => isset($_POST['streetname']) ? $_POST['streetname'] : "",
				"streettype" => isset($_POST['streettype']) ? $_POST['streettype'] : "",
				"zipcode" => isset($_POST['zipcode']) ? $_POST['zipcode'] : "",
				"Submit" => isset($_POST['Submit']) ? $_POST['Submit'] : "");

$options = array(
	CURLOPT_RETURNTRANSFER => true,     // return web page
	CURLOPT_HEADER         => false,    // don't return headers
	CURLOPT_FOLLOWLOCATION => true,     // follow redirects
	CURLOPT_ENCODING       => "",       // handle all encodings
	CURLOPT_USERAGENT      => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)", // who am i
	CURLOPT_AUTOREFERER    => true,     // set referer on redirect
	CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
	CURLOPT_TIMEOUT        => 120,      // timeout on response
	CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
	CURLOPT_POST		   => true,     // send a POST request
	CURLOPT_POSTFIELDS	   => http_build_query($fields)); // URL-encoded body, as a browser form would send

if($fields["Submit"])
{
	if($fields["houseno"])
	{
		if($fields["streetname"])
		{
			if(!empty($fields["streettype"]))
			{
				if(!empty($fields["zipcode"]))
				{
					$ch      = curl_init( $url );
					curl_setopt_array( $ch, $options );
					$content = curl_exec( $ch );
					$err     = curl_errno( $ch );
					$errmsg  = curl_error( $ch );
					$header  = curl_getinfo( $ch );
					curl_close( $ch );
					
					$header['errno']   = $err;
					$header['errmsg']  = $errmsg;
					$header['content'] = $content;
					echo $header['content'];
					echo $header['errno'];
					echo $header['errmsg'];

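					// preg_match() returns 1 or 0; the captured text goes into $matches.
					// (*.?) is not a valid pattern; (.*?) is the lazy match that was intended.
					$district = '';
					if (preg_match('/<a href="(.*?)"><strong>(.*?)<\/strong><\/a>\s*\.+ This is YOUR Council District!/', $header['content'], $matches))
					{
						$district = $matches[2]; // link text (assumed to be the district name)
					}
					elseif (preg_match('/<strong> (.*?)--COUNCIL DIST (.*?)<\/strong>/', $header['content'], $matches))
					{
						$district = $matches[2]; // text after "COUNCIL DIST"
					}

					echo $district;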
				}
				else
				{
					$error = "You forgot to enter your zip code.";
					echo "<script language=javascript>alert('".$error."');history.back();</script>";
				}
			}
			else
			{
				$error = "You forgot to enter your street type.";
				echo "<script language=javascript>alert('".$error."');history.back();</script>";
			}
		}
		else
		{
			$error = "You forgot to enter your street name.";
			echo "<script language=javascript>alert('".$error."');history.back();</script>";
		}
	}
	else
	{
		$error = "You forgot to enter your house number.";
		echo "<script language=javascript>alert('".$error."');history.back();</script>";
	}
}
else
{
	$error = "You forgot to submit the form? You should never see this error, if you do please contact an administrator.";
	echo "<script language=javascript>alert('".$error."');history.back();</script>";
}

?>

//searchdistrict.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Council District Search Box</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css">
<!--
.style1 {
	font-size: 24px;
	font-weight: bold;
}
-->
<!-- http://eservices.sccgov.org/district/search.do -->
</style>
</head>

<body bgcolor="#59729B">

<table width="450px" border="0" align="center" cellpadding="10" cellspacing="0" bgcolor="#FFFFFF" style="border:1px solid #000;">
<tr>
<td valign="top"><img src="images/SOSSJ_Logo.jpg"/></td>
<td valign="top"><table width="450px" border="0" align="center" cellpadding="0" cellspacing="0">
	<tbody>
		<tr>
			<td colspan="3" valign="top"><form name="districtForm" method="post" action="getdistrict.php">
				<input type="hidden" name="action" value="searchForDistrict"> 				
				<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0" bgcolor="#FFF">
					<tbody>
						<tr class="BkgYel">
							<td colspan="4">                            
                            <br>
<div align="center"><span class="Error">*&nbsp;</span>All Fields  Required *</div>
							  <table id="errorTable">
								<tr>
									<td class="Error" align="left"></td>
								</tr>
						    </table></td>
						</tr>

						<tr class="BkgYel">
							<td width="4%">&nbsp;</td>
							<td width="32%" valign="top" nowrap="">
							<div align="right"><label for="houseno">House
							Number ONLY: </label> </div>
							</td>
							<td width="3%" valign="top"><span class="Error">&nbsp;</span></td>
							<td width="61%" valign="top" bgcolor="#FFF"><div align="left" nowrap=""><input type="text" name="houseno" maxlength="6" size="8" value=""><br />
						  </div></td>
						</tr>
 
						<tr class="BkgYel">
							<td colspan="4" height="15" />
						</tr>
 
						<tr class="BkgYel">
							<td width="4%" height="2"></td>
							<td valign="top" nowrap="">
							<div align="right"><label for="streetname"> Street
							Name ONLY: </label> </div>
							</td>
							<td valign="top"><span class="Error">&nbsp;</span></td>
							<td valign="top"><div align="left" nowrap=""><input type="text" name="streetname" id="streetname" maxlength="35" size="30" value="" style="padding-right:5px;"><br />
							</div></td>
						</tr>
 
						<tr class="BkgYel">
							<td colspan="4" height="15" />
						</tr>
 
						<tr class="BkgYel">
							<td width="4%" height="21" valign="top">&nbsp;</td>
							<td valign="top" nowrap="">
							<div align="right"><label for="streettype">Street
							Type: </label></div>
							</td>
							<td valign="top">&nbsp;</td>
							<td valign="top"><div align="left" nowrap=""><select name="streettype" id="streettype"><option value="">-- Select --</option>
<option value="Avenue">AVE</option>
<option value="Boulevard">BLVD</option>
<option value="Circle">CIR</option>
<option value="Court">CT</option>
<option value="Drive">DR</option>
<option value="Lane">LN</option>
<option value="Place">PL</option>
<option value="Road">RD</option>
<option value="Street">ST</option>
<option value="Alley">ALY</option>
<option value="Annex">ANX</option>
<option value="Arcade">ARC</option>
<option value="Avenue">AVE</option>
<option value="Bayou">BYU</option>
<option value="Beach">BCH</option>
<option value="Bend">BND</option>
<option value="Bluff">BLF</option>
<option value="Bottom">BTM</option>
<option value="Boulevard">BLVD</option>
<option value="Branch">BR</option>
<option value="Bridge">BRG</option>
<option value="Brook">BRK</option>
<option value="Burg">BG</option>
<option value="Bypass">BYP</option>
<option value="Camp">CP</option>
<option value="Canyon">CYN</option>
<option value="Cape">CPE</option>
<option value="Causeway">CSWY</option>
<option value="Center">CTR</option>
<option value="Cliffs">CLFS</option>
<option value="Club">CLB</option>
<option value="College">COL</option>
<option value="Corner">COR</option>
<option value="Corners">CORS</option>
<option value="Course">CRSE</option>
<option value="Court">CT</option>
<option value="Courts">CTS</option>
<option value="Cove">CV</option>
<option value="Creek">CRK</option>
<option value="Crescent">CRES</option>
<option value="Crossing">XING</option>
<option value="Dale">DL</option>
<option value="Dam">DM</option>
<option value="Divide">DV</option>
<option value="Drive">DR</option>
<option value="Estate">EST</option>
<option value="Expressway">EXPY</option>
<option value="Extension">EXT</option>
<option value="Fall">FALL</option>
<option value="Falls">FLS</option>
<option value="Ferry">FRY</option>
<option value="Field">FLD</option>
<option value="Fields">FLDS</option>
<option value="Flat">FLT</option>
<option value="Ford">FRD</option>
<option value="Forest">FRST</option>
<option value="Forge">FRG</option>
<option value="Fork">FRK</option>
<option value="Forks">FRKS</option>
<option value="Fort">FT</option>
<option value="Freeway">FWY</option>
<option value="Gardens">GDNS</option>
<option value="Gateway">GTWY</option>
<option value="Glen">GLEN</option>
<option value="Green">GRN</option>
<option value="Grove">GRV</option>
<option value="Harbor">HBR</option>
<option value="Haven">HVN</option>
<option value="Heights">HTS</option>
<option value="Highway">HWY</option>
<option value="Hill">HL</option>
<option value="Hollow">HOLW</option>
<option value="Hills">HLS</option>
<option value="Inlet">INLT</option>
<option value="Island">IS</option>
<option value="Islands">ISS</option>
<option value="Isle">ISLE</option>
<option value="Junction">JCT</option>
<option value="Key">KY</option>
<option value="Knolls">KNLS</option>
<option value="Lake">LK</option>
<option value="Lakes">LKS</option>
<option value="Landing">LNDG</option>
<option value="Lane">LN</option>
<option value="Light">LGT</option>
<option value="Loaf">LF</option>
<option value="Locks">LCKS</option>
<option value="Lodge">LDG</option>
<option value="Loop">LOOP</option>
<option value="Mall">MALL</option>
<option value="Manor">MNR</option>
<option value="Meadows">MDWS</option>
<option value="Mill">ML</option>
<option value="Mills">MLS</option>
<option value="Mission">MSN</option>
<option value="Mount">MT</option>
<option value="Mountain">MTN</option>
<option value="Neck">NCK</option>
<option value="Orchard">ORCH</option>
<option value="Oval">OVAL</option>
<option value="Park">PARK</option>
<option value="Parkway">PKY</option>
<option value="Pass">PASS</option>
<option value="Path">PATH</option>
<option value="Pike">PIKE</option>
<option value="Pines">PNES</option>
<option value="Plain">PLN</option>
<option value="Plains">PLNS</option>
<option value="Plaza">PLZ</option>
<option value="Point">PT</option>
<option value="Port">PRT</option>
<option value="Prairie">PR</option>
<option value="Radial">RADL</option>
<option value="Ranch">RNCH</option>
<option value="Rapids">RPDS</option>
<option value="Rest">RST</option>
<option value="Ridge">RDG</option>
<option value="River">RIV</option>
<option value="Road">RD</option>
<option value="Row">ROW</option>
<option value="Run">RUN</option>
<option value="Shoal">SHL</option>
<option value="Shoals">SHLS</option>
<option value="Shore">SHR</option>
<option value="Shores">SHRS</option>
<option value="Spring">SPG</option>
<option value="Springs">SPGS</option>
<option value="Spur">SPUR</option>
<option value="Square">SQ</option>
<option value="Stravenue">STRA</option>
<option value="Stream">STRM</option>
<option value="Street">ST</option>
<option value="Station">STA</option>
<option value="Steps">STP</option>
<option value="Summit">SMT</option>
<option value="Terrace">TER</option>
<option value="Trace">TRCE</option>
<option value="Track">TRAK</option>
<option value="Trafficway">TRFY</option>
<option value="Trail">TRL</option>
<option value="Trailer">TRLR</option>
<option value="Turnpike">TPKE</option>
<option value="Tunnel">TUNL</option>
<option value="Union">UN</option>
<option value="Valley">VLY</option>
<option value="Viaduct">VIA</option>
<option value="View">VW</option>
<option value="Village">VLG</option>
<option value="Ville">VL</option>
<option value="Vista">VIS</option>
<option value="Walk">WALK</option>
<option value="Way">WAY</option>
<option value="Wells">WLS</option></select><br />
							</div></td>
 
						</tr>
 
						<tr class="BkgYel">
							<td colspan="4" height="15" />
						</tr>
 
						<tr class="BkgYel">
							<td valign="top" width="4%">&nbsp;</td>
							<td valign="top" nowrap="">
							<div align="right"><label for="zipcode"> Zip
							Code: </label>
							</div>
							</td>
							<td valign="top"><span class="Error">&nbsp;</span></td>
							<td valign="top">						      <select name="zipcode" id="zipcode">
							      <option value="">-- Select --</option>
                                  <option value="94089">94089</option>
                                  <option value="95002">95002</option>
                                  <option value="95008">95008</option>
                                  <option value="95014">95014</option>
                                  <option value="95030">95030</option>
                                  <option value="95032">95032</option>
                                  <option value="95035">95035</option>
                                  <option value="95037">95037</option>
                                  <option value="95050">95050</option>
                                  <option value="95051">95051</option>
                                  <option value="95054">95054</option>
                                  <option value="95070">95070</option>
                                  <option value="95110">95110</option>
                                  <option value="95111">95111</option>
                                  <option value="95112">95112</option>
                                  <option value="95113">95113</option>
                                  <option value="95116">95116</option>
                                  <option value="95117">95117</option>
                                  <option value="95118">95118</option>
                                  <option value="95119">95119</option>
                                  <option value="95120">95120</option>
                                  <option value="95121">95121</option>
                                  <option value="95122">95122</option>
                                  <option value="95123">95123</option>
                                  <option value="95124">95124</option>
                                  <option value="95125">95125</option>
                                  <option value="95126">95126</option>
                                  <option value="95127">95127</option>
                                  <option value="95128">95128</option>
                                  <option value="95129">95129</option>
                                  <option value="95130">95130</option>
                                  <option value="95131">95131</option>
                                  <option value="95132">95132</option>
                                  <option value="95133">95133</option>
                                  <option value="95134">95134</option>
                                  <option value="95135">95135</option>
                                  <option value="95136">95136</option>
                                  <option value="95138">95138</option>
                                  <option value="95139">95139</option>
                                  <option value="95141">95141</option>
                                  <option value="95148">95148</option>
 						        </select>
						    </td></tr>
						<tr class="BkgYel">
							<td colspan="4" height="15" />
						</tr>
 
						<tr>
							<td colspan="4" height="4" />
						</tr>
 
						<tr class="BkgYel">
							<td>&nbsp;</td>
						  <td colspan="3" height="30" align="center"><input type="submit" name="Submit" value="Submit">
							&nbsp;&nbsp; <a href="#">  
							<input name="Reset" type="reset" id="Reset" value="Reset">
							</a></td>
						</tr>
						<tr>
							<td colspan="4" height="10">&nbsp;</td>
						</tr>
 
					</tbody>
			  </table>
			</form></td>
		</tr>
	</tbody>
</table> </td>
</tr>
</table>
</body>
</html>


Michel Plungjan (IT Expert) commented:
From what you posted, it is impossible to tell without actually running the PHP.

Can you look at the string you get back and check whether it includes the text you are looking for at all?
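
A quick way to do that check is sketched below. The helper name `describeResponse` is made up for illustration; the marker is the text quoted in the question.

```php
<?php
// Report how large the response is and whether the marker text appears in it.
function describeResponse($html, $marker)
{
    $pos = strpos($html, $marker);
    return array(
        'length' => strlen($html),                   // bytes received
        'found'  => $pos !== false,                  // true if the marker is present
        'offset' => $pos === false ? null : $pos,    // where it starts, if present
    );
}

// Usage inside getdistrict.php, after curl_exec():
// var_dump(describeResponse($content, 'This is YOUR Council District!'));
```

If `found` comes back false, the text is not in what curl received, and no regex will pull it out.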

Steve Bink commented:
As mplungjan stated, what you want to do is very likely not possible with curl. If the information you want is written to the page by JavaScript (including AJAX functionality) after the original request is complete, the curl response will never contain it, because curl does not execute JavaScript. curl works much like a text-only browser, such as lynx.
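
One rough way to see how much the page leans on scripts is to list the external scripts referenced in the HTML that curl returned. This is only a sketch: `listScriptSources` is a hypothetical helper, and the regex is deliberately simple rather than a full HTML parser.

```php
<?php
// Return the src URLs of all external <script> tags found in the raw HTML.
// curl hands these back as plain text; it never downloads or runs them.
function listScriptSources($html)
{
    $sources = array();
    if (preg_match_all('/<script\b[^>]*\bsrc\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
        $sources = $m[1];
    }
    return $sources;
}

// Usage: print_r(listScriptSources($content));
```

Any file listed there is a place to look for the code that fills in the missing content.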

My best recommendation to you is to contact the site in question and see if they have other resources (such as an API, a developer gateway, RSS feed, etc) you can use to capture the data you need.
Michel Plungjan (IT Expert) commented:
If they load the info via AJAX, it might still be possible to fetch it with curl by requesting the same URL their script calls...
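
A minimal sketch of what that could look like, assuming you have already found the request in your browser's network tools. The endpoint and field names you pass in are placeholders, not something confirmed for this site, and the `X-Requested-With` header is just a convention some backends check.

```php
<?php
// Encode fields as a URL-encoded body, the way a form post or typical XHR would.
function buildAjaxPostBody(array $fields)
{
    return http_build_query($fields, '', '&');
}

// POST the fields to the (assumed) AJAX endpoint and return the raw response.
function fetchAjax($endpoint, array $fields)
{
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildAjaxPostBody($fields),
        CURLOPT_HTTPHEADER     => array('X-Requested-With: XMLHttpRequest'),
    ));
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // false on failure
}

// Usage (hypothetical URL):
// echo fetchAjax('http://example.com/ajax/lookup', array('zipcode' => '95110'));
```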
Steve Bink commented:
>>> If they ajax the info, it might still be possible to curl that using the same url they do...

Theoretically, yes.  I tried the same thing before, and found it immensely painful.  Most AJAX calls are going to use a querystring or POST data created on the fly, so you need some way of reconstructing that.  Even on a known application, that is quite a hurdle.  On an unknown application, that would be some serious voodoo.  
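
As one illustration of reconstructing that data: if the on-the-fly values happen to come from hidden form fields in the first response, they could be pulled out and merged into the next request. `extractHiddenFields` is a hypothetical helper, regex parsing like this is fragile, and values generated purely in JavaScript would not be found this way.

```php
<?php
// Collect name => value pairs from <input type="hidden"> tags in raw HTML.
function extractHiddenFields($html)
{
    $fields = array();
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        // Skip anything that is not a hidden input.
        if (!preg_match('/\btype\s*=\s*["\']?hidden\b/i', $tag)) {
            continue;
        }
        // A hidden input without a name is never submitted, so ignore it.
        if (!preg_match('/\bname\s*=\s*["\']([^"\']*)["\']/i', $tag, $name)) {
            continue;
        }
        $value = preg_match('/\bvalue\s*=\s*["\']([^"\']*)["\']/i', $tag, $v) ? $v[1] : '';
        $fields[$name[1]] = html_entity_decode($value, ENT_QUOTES);
    }
    return $fields;
}

// Usage: $post = array_merge(extractHiddenFields($firstPageHtml), $userFields);
```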
Anthony408 (author) commented:
That is what I was afraid of =(

Thank you both very much for taking the time to help.
Ray Paseur commented:
mplungjan is right that you can curl the backend AJAX scripts if you know what the URLs and expected inputs might be, but that is a complicated task with a lot of moving parts.

I agree with routinet about asking for a formal interface for data exchange.  Two big reasons.  

First, you can write all the screen-scraper code, spend time debugging it, etc., and a tiny change on their side can make your script break without warning.  APIs are usually versioned, so you can depend on them for as long as the version is supported.  

Second, if they are the publishers, they own the data.  You may have copyright or TOS issues if you take their data and repurpose it without formal permission.  If the data you're scraping comes from public records (of Santa Clara County, for example) there should be no problem getting it via a Freedom of Information request or similar inquiry.

best regards, ~Ray