Script to extract data from a web page and put it into a db

Hello,

I need a script that extracts data from this site:

http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=137-01-411-034

and puts it into a MySql table.

I would need it to run though a range of records from the site, they are stored with parcel numbers eg. 137-01-411-034.  I only need to cycle though the last 3 like from 034 to 125.

Thanks!
damtimmahAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

hernst42Commented:
The function you can use for this are:
http://de.php.net/file_get_contents
http://de.php.net/pcre

<?php
// get the complete HTML-page into php
$page = file_get_contents('http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=137-01-411-034');

// use regular expressions to take the data you need from that page.
0
Richard QuadlingSenior Software DeveloperCommented:
What specific information do you want to extract?

When you use commands like file_get_contents (or fopen/fgets), etc, the information you get is the HTML output.

If you take a look at the HTML output of the URL you supplied, you will see that there is a LOT of code!

Ideally, if you intend to data mine this site, it would be a good idea to get hold of the database that is used.

When you say the last three, do you mean the last three parts of the number?

137-xx-xxx-xxx

or the last three digits?

137-01-411-xxx

?

Parcel numbers must be in the form 999-99-999-999, so leading zeros are required.

<?php
$aParcels = array();
$sParcelURLPrefix = 'http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=';
$aParcelParts = array(0,0,0,0);

for($aParcelParts[0] = 0; $aParcelParts[0] <= 999 ; $aParcelParts[0]++)
      {
      for ($aParcelParts[1] = 0 ; $aParcelParts[1] <= 99 ; $aParcelParts[1]++)
            {
            for ($aParcelParts[2] = 0 ; $aParcelParts[2] <= 999 ; $aParcelParts[2]++)
                  {
                  for ($aParcelParts[3] = 0 ; $aParcelParts[3] <= 999 ; $aParcelParts[3]++)
                        {
                        $sParcelNumber = str_pad($aParcelParts[0], 3, '0', STR_PAD_LEFT) . '-' . str_pad($aParcelParts[1], 2, '0', STR_PAD_LEFT) . '-' . str_pad($aParcelParts[2], 3, '0', STR_PAD_LEFT) . '-' . str_pad($aParcelParts[3], 3, '0', STR_PAD_LEFT);
                        $sParcelURL = $sParcelURLPrefix . $sParcelNumber;
                        $aParcels[$sParcelNumber] = file_get_contents($sParcelURL);
                        echo 'Retrieved parcel ' . $sParcelURL . ' (' . strlen($aParcels[$sParcelNumber]) . ')<br />';
                        }
                  }
            }
      }

?>

The above script will look for ALL parcel numbers. This will take forever.

If you let us know what you want to do with the data once you've got it (and what data you want to extract), then more help can follow.

Richard.
0
damtimmahAuthor Commented:
Thanks for the help...

By last 3 I ment 137-01-411-xxx and to be able to select a range in the last 3 digits, like 150-300.  
For the first 3 parts of the number xxx-xx-xxx-[999] I need to be able to set those as a variable.

The data that I need would be everything that is displaied except the html, so the name address, property information and so on.  Pretty much everything in the tables.  I would want to split the name and address into seperate fields in the db as well as all the other information.

This data will be exported into another application that will use this data to create users.  But there are also other applications for this data so I want to capture as much as possible for future use.

I was going to use strip_tags(); to remove the html and parse the results but if it's easier leaving it in there then I'm up for that.

I would love to get my hands on the db that the county has but I don't think they are going to hand it over, ;).  The data constantly changes as well so it would be outdated within weeks.

0
Cloud Class® Course: MCSA MCSE Windows Server 2012

This course teaches how to install and configure Windows Server 2012 R2.  It is the first step on your path to becoming a Microsoft Certified Solutions Expert (MCSE).

frugleCommented:
Any server admin worth his/her salt would spot rapid multiple accesses from the same IP and ban it.

What you're attempting to do sounds pretty dodgy, If this were in Great Britain I'm sure it would be covered by the Data Protection Act...

Mike
0
damtimmahAuthor Commented:
Well... I don't plan to mine all the data... i would only need blocks of data of perhpas 50-100 records at a time.  I would have no problem creating a delay between requests.

We need this because this information is currently entered by hand for each homewoner and takes a long time and lots of resources.

I also want to implement this in perhaps an access db frontend to capture all the information with one click.
0
Richard QuadlingSenior Software DeveloperCommented:
I agree with frugle.

Invariably the information you are looking at is copyrighted to the website owners. Unless you have persmission to copy it, I think I will have to decline answering this question.

If you DO get permission, then ask them for access to the database. You would only need read access to extract the relevant details, but this would save you a LOT of time. If the work is legitimate, then please explain further.

I would think that EE would NOT like us experts to help you break the law.

Richard.
0
damtimmahAuthor Commented:
What else is there to say?  This information is from the local government and is public information.  The data will be used for property managers that need all the Assesor information in their databases and need current information.  

I also don't think that the County will just let me have access to their db as well... even if they do it would probably be a single occurance.
0
damtimmahAuthor Commented:
I'm not sure about the exact law on public records but would the County "own" this data?  Anyone from the US have any insight into this?
0
Richard QuadlingSenior Software DeveloperCommented:
It is very much work asking.

If you intend to mine the information (and you are allowed to), then you have a good case to say that you will be keeping the pressure off there servers by having direct DB access. This is not very difficult. They will allow your server access to the database using a specific username/password and only allow read access.

I do this a LOT for sites I set up where the front end is on one server and the db is on another. The webserver does not have the overhead of handling a DB server too.

Also, and this is the real reason, if a script is supplied which uses a lot of regular expressions (as this is by far the easiest way to get the data), if the layout changes, you will need to start from scratch.

0
Richard QuadlingSenior Software DeveloperCommented:
It is very much WORTH asking!!!

Gee. One day, spelling and grammar errors will be checked by browsers. Until then, we have to all read my bad spelling and grammar.

Sorry.
0
damtimmahAuthor Commented:
Well, I can ask, not sure how far I can get in the American bureaucracy.

If not, I'm willing to take the time to update the script if the site changes...  It's just that I'm not at a level where I can produce an efficient script for this purpose otherwise I would crank this out.  If I did it it would probably take me months and be pretty clunky...

I'll see what they say tomorrow morning.

Thanks
0
Richard QuadlingSenior Software DeveloperCommented:
This small script is a work in progress.

<?php
function ExtractData($sPage)
      {
// Define the regular expressions to extract the required data.
      $aRegExps = array(
            'GI_Parcel No.' => '<TR>\s*<TD .*?>\s*<A href="http://www\.co\.clark\.nv\.us/assessor/glosary2\.htm#PARCEL">Parcel No\.</A>\s*</TD>\s*<TD .*?>(\d{3}-\d{2}-\d{3}-\d{3})\s*</TD>\s*</TR>',
            'GI_Owner and Mailing Address' => '\s*<TR>\s*<TD .*?>\s*<p align="left">\s*<A href="http://www.co.clark.nv.us/assessor/glosary2.htm#OWNER">Owner</A> and\s*<A href="http://www.co.clark.nv.us/assessor/glosary2.htm#MAILING">Mailing Address</A></TD>\s*<TD .*?>\s*(.*?)\s*</TD>\s*</TR>',
            'GI_Location Address' => '',
            'GI_Assessor Description' => '',
            'GI_Recorded Document Number' => '',
            'GI_Recorded Date' => '',
            'GI_Vesting' => '',
            'AISV_Tax Disctrict' => '',
            'AISV_Appraisal Year' => '',
            'AISV_Fiscal Year' => '',
            'AISV_Supplemental Improvement Value' => '',
            'AISV_Supplemental Improvement' => '',
            'AISV_Account Number' => '',
            'RPAV_Fiscal Year 1' => '',
            'RPAV_Fiscal Year 2' => '',
            'RPAV_Land 1' => '',
            'RPAV_Land 2' => '',
            'RPAV_Improvements 1' => '',
            'RPAV_Improvements 2' => '',
            'RPAV_Personal Property 1' => '',
            'RPAV_Personal Property 2' => '',
            'RPAV_Exempt 1' => '',
            'RPAV_Exempt 2' => '',
            'RPAV_Gross Assessed 1' => '',
            'RPAV_Gross Assessed 2' => '',
            'RPAV_Taxable Value Land+Imp 1' => '',
            'RPAV_Taxable Value Land+Imp 2' => '',
            'PRS_Total Living Sq. Ft.' => '',
            'PRS_Carport Sq. Ft.' => '',
            'PRS_Addn/Conv' => '',
            'PRS_1st Floor Sq. Ft.' => '',
            'PRS_Stories' => '',
            'PRS_Pool' => '',
            'PRS_2nd Floor Sq. Ft.' => '',
            'PRS_Bedrooms' => '',
            'PRS_Spa' => '',
            'PRS_Basement' => '',
            'PRS_Bathrooms' => '',
            'PRS_Type of Construction' => '',
            'PRS_Garage Sq. Ft.' => '',
            'PRS_Fireplace' => '',
            'PRS_Roof Type' => '',
            );

      $aResult = array();
      foreach($aRegExps as $sLabel => $sRegexp)
            {
            if (strlen($sRegexp) > 0)
                  {
                  if (preg_match_all("`$sRegexp`sim", $sPage, $aMatches) !== False)
                        {
                        $aResult[$sLabel] = $aMatches[1][0];
                        }
                  }
            }
      return $aResult;
      }

// Define local storage.
$aParcels = array();
$sParcelURLPrefix = 'http://sandgate.co.clark.nv.us/servlet/Assessor?instance=pcl7&parcel=';
$aExtractedData = array();

// See what has been POSTED.
if (isset($_POST['ParcelPrefix']))
      {
// The parcel number prefix has been supplied, so get some data.
      for ($iParcel = 0 ; $iParcel <= 999 ; $iParcel++)
            {
            $sParcelNumber = $_POST['ParcelPrefix'] . '-' . str_pad($iParcel, 3, '0', STR_PAD_LEFT);
            $sParcelURL = $sParcelURLPrefix . $sParcelNumber;
            $aParcels[$sParcelNumber] = ExtractData(file_get_contents($sParcelURL));
            }
// DEBUG: Show the data extracted.
      print_r($aParcels);
      }

?>

The work is building the regular expressions.

An example output (debugging only a single parcel) ...

Array
(
    [137-01-411-034] => Array
        (
            [GI_Parcel No.] => 137-01-411-034
            [GI_Owner and Mailing Address] => STONE DAVID M<BR>
            10769 TIN MOUNTAIN CT<BR>
            LAS VEGAS NV  89129-3271<BR>
            &nbsp;<BR>
            &nbsp;<BR>
            &nbsp;
        )

)

Things like the name and address will need further handling.

explode('<BR>', $var);

would probably do.

Once you've got the data, you then loop through the $aParcels array and populate/update your database as required.

Richard.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
damtimmahAuthor Commented:
I'll try this out, thanks a lot for the help.  I'm sorry I didn't respond quicker but I was out of town for a while.

-Timmah
0
Richard QuadlingSenior Software DeveloperCommented:
Be aware that this is not a complete script yet.

If it is the sort of thing you are looking for, then it is just a case of creating the correct regexps.

Hopefully the few I've supplied should help you work out what is required.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PHP

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.