?
Solved

Script to extract data from a web page and put it into a db

Posted on 2004-11-21
15
Medium Priority
?
203 Views
Last Modified: 2013-12-12
Hello,

I need a script that extracts data from this site:

http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=137-01-411-034

and puts it into a MySql table.

I would need it to run though a range of records from the site, they are stored with parcel numbers eg. 137-01-411-034.  I only need to cycle though the last 3 like from 034 to 125.

Thanks!
0
Comment
Question by:damtimmah
14 Comments
 
LVL 48

Expert Comment

by:hernst42
ID: 12642252
The function you can use for this are:
http://de.php.net/file_get_contents
http://de.php.net/pcre

<?php
// get the complete HTML-page into php
$page = file_get_contents('http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=137-01-411-034');

// use regular expressions to take the data you need from that page.
0
 
LVL 40

Expert Comment

by:Richard Quadling
ID: 12645451
What specific information do you want to extract?

When you use commands like file_get_contents (or fopen/fgets), etc, the information you get is the HTML output.

If you take a look at the HTML output of the URL you supplied, you will see that there is a LOT of code!

Ideally, if you intend to data mine this site, it would be a good idea to get hold of the database that is used.

When you say the last three, do you mean the last three parts of the number?

137-xx-xxx-xxx

or the last three digits?

137-01-411-xxx

?

Parcel numbers must be in the form 999-99-999-999, so leading zeros are required.

<?php
$aParcels = array();
$sParcelURLPrefix = 'http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=';
$aParcelParts = array(0,0,0,0);

for($aParcelParts[0] = 0; $aParcelParts[0] <= 999 ; $aParcelParts[0]++)
      {
      for ($aParcelParts[1] = 0 ; $aParcelParts[1] <= 99 ; $aParcelParts[1]++)
            {
            for ($aParcelParts[2] = 0 ; $aParcelParts[2] <= 999 ; $aParcelParts[2]++)
                  {
                  for ($aParcelParts[3] = 0 ; $aParcelParts[3] <= 999 ; $aParcelParts[3]++)
                        {
                        $sParcelNumber = str_pad($aParcelParts[0], 3, '0', STR_PAD_LEFT) . '-' . str_pad($aParcelParts[1], 2, '0', STR_PAD_LEFT) . '-' . str_pad($aParcelParts[2], 3, '0', STR_PAD_LEFT) . '-' . str_pad($aParcelParts[3], 3, '0', STR_PAD_LEFT);
                        $sParcelURL = $sParcelURLPrefix . $sParcelNumber;
                        $aParcels[$sParcelNumber] = file_get_contents($sParcelURL);
                        echo 'Retrieved parcel ' . $sParcelURL . ' (' . strlen($aParcels[$sParcelNumber]) . ')<br />';
                        }
                  }
            }
      }

?>

The above script will look for ALL parcel numbers. This will take forever.

If you let us know what you want to do with the data once you've got it (and what data you want to extract), then more help can follow.

Richard.
0
 

Author Comment

by:damtimmah
ID: 12645912
Thanks for the help...

By last 3 I ment 137-01-411-xxx and to be able to select a range in the last 3 digits, like 150-300.  
For the first 3 parts of the number xxx-xx-xxx-[999] I need to be able to set those as a variable.

The data that I need would be everything that is displaied except the html, so the name address, property information and so on.  Pretty much everything in the tables.  I would want to split the name and address into seperate fields in the db as well as all the other information.

This data will be exported into another application that will use this data to create users.  But there are also other applications for this data so I want to capture as much as possible for future use.

I was going to use strip_tags(); to remove the html and parse the results but if it's easier leaving it in there then I'm up for that.

I would love to get my hands on the db that the county has but I don't think they are going to hand it over, ;).  The data constantly changes as well so it would be outdated within weeks.

0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
LVL 10

Expert Comment

by:frugle
ID: 12648689
Any server admin worth his/her salt would spot rapid multiple accesses from the same IP and ban it.

What you're attempting to do sounds pretty dodgy, If this were in Great Britain I'm sure it would be covered by the Data Protection Act...

Mike
0
 

Author Comment

by:damtimmah
ID: 12649357
Well... I don't plan to mine all the data... i would only need blocks of data of perhpas 50-100 records at a time.  I would have no problem creating a delay between requests.

We need this because this information is currently entered by hand for each homewoner and takes a long time and lots of resources.

I also want to implement this in perhaps an access db frontend to capture all the information with one click.
0
 
LVL 40

Expert Comment

by:Richard Quadling
ID: 12652286
I agree with frugle.

Invariably the information you are looking at is copyrighted to the website owners. Unless you have persmission to copy it, I think I will have to decline answering this question.

If you DO get permission, then ask them for access to the database. You would only need read access to extract the relevant details, but this would save you a LOT of time. If the work is legitimate, then please explain further.

I would think that EE would NOT like us experts to help you break the law.

Richard.
0
 

Author Comment

by:damtimmah
ID: 12652702
What else is there to say?  This information is from the local government and is public information.  The data will be used for property managers that need all the Assesor information in their databases and need current information.  

I also don't think that the County will just let me have access to their db as well... even if they do it would probably be a single occurance.
0
 

Author Comment

by:damtimmah
ID: 12652742
I'm not sure about the exact law on public records but would the County "own" this data?  Anyone from the US have any insight into this?
0
 
LVL 40

Expert Comment

by:Richard Quadling
ID: 12652756
It is very much work asking.

If you intend to mine the information (and you are allowed to), then you have a good case to say that you will be keeping the pressure off there servers by having direct DB access. This is not very difficult. They will allow your server access to the database using a specific username/password and only allow read access.

I do this a LOT for sites I set up where the front end is on one server and the db is on another. The webserver does not have the overhead of handling a DB server too.

Also, and this is the real reason, if a script is supplied which uses a lot of regular expressions (as this is by far the easiest way to get the data), if the layout changes, you will need to start from scratch.

0
 
LVL 40

Expert Comment

by:Richard Quadling
ID: 12652758
It is very much WORTH asking!!!

Gee. One day, spelling and grammar errors will be checked by browsers. Until then, we have to all read my bad spelling and grammar.

Sorry.
0
 

Author Comment

by:damtimmah
ID: 12652807
Well, I can ask, not sure how far I can get in the American bureaucracy.

If not, I'm willing to take the time to update the script if the site changes...  It's just that I'm not at a level where I can produce an efficient script for this purpose otherwise I would crank this out.  If I did it it would probably take me months and be pretty clunky...

I'll see what they say tomorrow morning.

Thanks
0
 
LVL 40

Accepted Solution

by:
Richard Quadling earned 2000 total points
ID: 12653168
This small script is a work in progress.

<?php
function ExtractData($sPage)
      {
// Define the regular expressions to extract the required data.
      $aRegExps = array(
            'GI_Parcel No.' => '<TR>\s*<TD .*?>\s*<A href="http://www\.co\.clark\.nv\.us/assessor/glosary2\.htm#PARCEL">Parcel No\.</A>\s*</TD>\s*<TD .*?>(\d{3}-\d{2}-\d{3}-\d{3})\s*</TD>\s*</TR>',
            'GI_Owner and Mailing Address' => '\s*<TR>\s*<TD .*?>\s*<p align="left">\s*<A href="http://www.co.clark.nv.us/assessor/glosary2.htm#OWNER">Owner</A> and\s*<A href="http://www.co.clark.nv.us/assessor/glosary2.htm#MAILING">Mailing Address</A></TD>\s*<TD .*?>\s*(.*?)\s*</TD>\s*</TR>',
            'GI_Location Address' => '',
            'GI_Assessor Description' => '',
            'GI_Recorded Document Number' => '',
            'GI_Recorded Date' => '',
            'GI_Vesting' => '',
            'AISV_Tax Disctrict' => '',
            'AISV_Appraisal Year' => '',
            'AISV_Fiscal Year' => '',
            'AISV_Supplemental Improvement Value' => '',
            'AISV_Supplemental Improvement' => '',
            'AISV_Account Number' => '',
            'RPAV_Fiscal Year 1' => '',
            'RPAV_Fiscal Year 2' => '',
            'RPAV_Land 1' => '',
            'RPAV_Land 2' => '',
            'RPAV_Improvements 1' => '',
            'RPAV_Improvements 2' => '',
            'RPAV_Personal Property 1' => '',
            'RPAV_Personal Property 2' => '',
            'RPAV_Exempt 1' => '',
            'RPAV_Exempt 2' => '',
            'RPAV_Gross Assessed 1' => '',
            'RPAV_Gross Assessed 2' => '',
            'RPAV_Taxable Value Land+Imp 1' => '',
            'RPAV_Taxable Value Land+Imp 2' => '',
            'PRS_Total Living Sq. Ft.' => '',
            'PRS_Carport Sq. Ft.' => '',
            'PRS_Addn/Conv' => '',
            'PRS_1st Floor Sq. Ft.' => '',
            'PRS_Stories' => '',
            'PRS_Pool' => '',
            'PRS_2nd Floor Sq. Ft.' => '',
            'PRS_Bedrooms' => '',
            'PRS_Spa' => '',
            'PRS_Basement' => '',
            'PRS_Bathrooms' => '',
            'PRS_Type of Construction' => '',
            'PRS_Garage Sq. Ft.' => '',
            'PRS_Fireplace' => '',
            'PRS_Roof Type' => '',
            );

      $aResult = array();
      foreach($aRegExps as $sLabel => $sRegexp)
            {
            if (strlen($sRegexp) > 0)
                  {
                  if (preg_match_all("`$sRegexp`sim", $sPage, $aMatches) !== False)
                        {
                        $aResult[$sLabel] = $aMatches[1][0];
                        }
                  }
            }
      return $aResult;
      }

// Define local storage.
$aParcels = array();
$sParcelURLPrefix = 'http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=';
$aExtractedData = array();

// See what has been POSTED.
if (isset($_POST['ParcelPrefix']))
      {
// The parcel number prefix has been supplied, so get some data.
      for ($iParcel = 0 ; $iParcel <= 999 ; $iParcel++)
            {
            $sParcelNumber = $_POST['ParcelPrefix'] . '-' . str_pad($iParcel, 3, '0', STR_PAD_LEFT);
            $sParcelURL = $sParcelURLPrefix . $sParcelNumber;
            $aParcels[$sParcelNumber] = ExtractData(file_get_contents($sParcelURL));
            }
// DEBUG: Show the data extracted.
      print_r($aParcels);
      }

?>

The work is building the regular expressions.

An example output (debugging only a single parcel) ...

Array
(
    [137-01-411-034] => Array
        (
            [GI_Parcel No.] => 137-01-411-034
            [GI_Owner and Mailing Address] => STONE DAVID M<BR>
            10769 TIN MOUNTAIN CT<BR>
            LAS VEGAS NV  89129-3271<BR>
            &nbsp;<BR>
            &nbsp;<BR>
            &nbsp;
        )

)

Things like the name and address will need further handling.

explode('<BR>', $var);

would probably do.

Once you've got the data, you then loop through the $aParcels array and populate/update your database as required.

Richard.
0
 

Author Comment

by:damtimmah
ID: 12712638
I'll try this out, thanks a lot for the help.  I'm sorry I didn't respond quicker but I was out of town for a while.

-Timmah
0
 
LVL 40

Expert Comment

by:Richard Quadling
ID: 12713666
Be aware that this is not a complete script yet.

If it is the sort of thing you are looking for, then it is just a case of creating the correct regexps.

Hopefully the few I've supplied should help you work out what is required.
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

It’s a season to be thankful, and we’re thankful for users like you who engage on site, solve technology problems, and network with others in the industry. What tech are we most thankful for? Keep reading.
The title says it all. Writing any type of PHP Application or API code that provides high throughput, while under a heavy load, seems to be an arcane art form (Black Magic). This article aims to provide some general guidelines for producing this typ…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…
Suggested Courses
Course of the Month14 days, 11 hours left to enroll

839 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question