Use PHP to 'crawl' website and collect info

movieprodw
Hello,

I have a large website (100+ pages) that is currently static HTML, and I am going to rebuild it in WordPress.

Is there any way I can enter the links into a PHP script and have it pull the content, keywords and title from each page and write them to a CSV file or a database?

Thanks,
Matt
Most Valuable Expert 2011
Top Expert 2016

Commented:
The answer is "yes," but it may be quite a project.  You can read each HTML document with file_get_contents().  Parsing options include SimpleHTMLDom, or explode() and regular expressions to tease the data apart.  You can also write PHP scripts that follow links and fetch the pages automatically.  Another option might be to attach a search engine to the site and look at what it finds; the search engine output may give you a shorter path to all of the data.
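
As a rough illustration of the link-following idea, here is a minimal sketch that pulls the href values out of one page so they can be queued for fetching. It uses PHP's built-in DOMDocument rather than SimpleHTMLDom (a swapped-in choice, only to avoid an extra download), and extractLinks() is a made-up helper name, not part of any library. It does not resolve relative URLs against a base address; that would still need to be added.

```php
<?php
// Sketch only: extractLinks() is a hypothetical helper, not a library function.

/**
 * Return the unique href values found in an HTML string,
 * skipping in-page anchors and mailto:/javascript:/tel: links.
 */
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // no target, or a jump within the same page
        }
        if (preg_match('/^(mailto|javascript|tel):/i', $href)) {
            continue; // not a page that can be fetched
        }
        $links[] = $href;
    }

    // Drop duplicates and re-index the array.
    return array_values(array_unique($links));
}
```

A call such as extractLinks(file_get_contents($url)) would return the list for one page. A crawler would repeat that for every link it has not visited yet, ideally restricted to the same host.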

Author

Commented:
Okay, those are good ideas, thank you.

I think it may not be too hard because the content is small and well organized.

The content I would want to grab is between:
<div class="bodyLeft"></div>

And I would also want
<title>SAVE THIS CONTENT</title>
<meta name="description" content="SAVE THIS CONTENT" />
<meta name="keywords" content="SAVE THIS CONTENT" />
kaufmed
Most Valuable Expert 2011
Top Expert 2015

Commented:
The content I would want to grab is between:
<div class="bodyLeft"></div>
Can any <div> appear inside of the <div class="bodyLeft">?
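
The reason for asking: if other <div> tags can be nested inside, a pattern that stops at the first </div> will cut the content short. A DOM parser avoids that problem because it tracks the nesting for you. Below is a minimal sketch using the built-in DOMDocument and DOMXPath classes. The helper name extractDivByClass() is hypothetical, and it assumes the class name is a plain word (no quotes) and that the first matching div is the one you want.

```php
<?php
// Sketch only: extractDivByClass() is a hypothetical helper name.

/**
 * Return the inner HTML of the first <div> carrying the given class,
 * including any nested <div> elements, or null when none is found.
 */
function extractDivByClass(string $html, string $class): ?string
{
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Suppress warnings about invalid markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word so "bodyLeft" does not match "bodyLeftWide".
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize each child so nested tags are kept but the wrapper div is not.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return trim($inner);
}
```

Calling extractDivByClass($html, 'bodyLeft') on a fetched page should then give the whole block, nested divs and all. Note that loadHTML() may guess the character encoding wrongly for pages that do not declare one.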

Author

Commented:
Yes, a div can be in there
Most Valuable Expert 2011
Top Expert 2016

Commented:
Maybe post a link to the URL?  Or copy and paste the HTML here?  The quality of any answer you can get will be directly related to the quality of the test data you provide.

Author

Commented:
Hello,

You can see the link below.

http://www.nationalpeo.com/background_screening.php
Most Valuable Expert 2011
Top Expert 2016

Commented:

Author

Commented:
Hello Ray,

That site is going to be trashed so I don't want to put a ton of time into it. I will check out the errors though. Thank you
Most Valuable Expert 2011
Top Expert 2016
Commented:
Here's a start.

<?php // RAY_temp_movidprodw.php
error_reporting(E_ALL);

// READ THE DOCUMENT INTO A STRING VARIABLE
$url = 'http://www.nationalpeo.com/background_screening.php';
$all = file_get_contents($url);
if ($all === false) die('UNABLE TO READ ' . $url);
$htm = $all;

// DECLOP THE LEFTMOST PART OF THE STRING
$sig = '<div class="bodyLeft" >';
$arr = explode($sig, $htm);
if (!isset($arr[1])) die('SIGNATURE NOT FOUND: ' . htmlentities($sig));
$htm = $sig . $arr[1];

// DECLOP THE RIGHTMOST PART OF THE STRING
$sig = '<!-- END BODY -->';
$arr = explode($sig, $htm);
$htm = $arr[0];

// SAVE THE WORK PRODUCT
$bodyleft = $htm;

// GET HEADER INFORMATION IN PSEUDO-XML
$htm = $all;
$sig = '<script';
$arr = explode($sig, $htm);
$htm = $arr[0];
$sig = '<head';
$arr = explode($sig, $htm);
$htm = $sig . $arr[1] . '</head>';

// ACTIVATE THIS TO SEE THE HEADER INFO
// echo htmlentities($htm);

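// USE OBJECT ORIENTED NOTATION TO EXTRACT DATA
// INITIALIZE THE FIELDS SO A MISSING META TAG DOES NOT RAISE A NOTICE
$title = $description = $keywords = '';
$obj = simplexml_load_string($htm);
if ($obj === false) die('UNABLE TO PARSE THE HEADER AS XML');
$title = (string) $obj->title;
foreach ($obj->meta as $meta)
{
    if ((string) $meta['name'] === 'keywords')    $keywords    = (string) $meta['content'];
    if ((string) $meta['name'] === 'description') $description = (string) $meta['content'];
}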

// SHOW THE WORK PRODUCT
echo '<pre>';
echo PHP_EOL . htmlentities($title);
echo PHP_EOL . htmlentities($description);
echo PHP_EOL . htmlentities($keywords);
echo PHP_EOL . htmlentities($bodyleft);
echo PHP_EOL . '</pre>';
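
To get from this single-page test to the CSV file asked for in the question, something like the following might work as a last step. It is only a sketch: writeRowsToCsv() and the column names (url, title, description, keywords, body) are hypothetical, and it assumes you have already collected one associative array per page.

```php
<?php
// Sketch only: writeRowsToCsv() and the column names are assumptions, not from the script above.

/**
 * Write one CSV line per page and return the number of data rows written.
 * Each $row is expected to be an array with url/title/description/keywords/body keys.
 */
function writeRowsToCsv(array $rows, string $path): int
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    $columns = ['url', 'title', 'description', 'keywords', 'body'];

    // fputcsv() takes care of quoting commas, double quotes and line breaks.
    // The separator, enclosure and escape arguments are passed explicitly
    // because newer PHP versions complain when the escape default is relied on.
    fputcsv($handle, $columns, ',', '"', '\\');

    $written = 0;
    foreach ($rows as $row) {
        $fields = [];
        foreach ($columns as $column) {
            $fields[] = $row[$column] ?? ''; // a missing value becomes an empty cell
        }
        fputcsv($handle, $fields, ',', '"', '\\');
        $written++;
    }

    fclose($handle);
    return $written;
}
```

The idea would be to wrap the extraction code in a loop over your list of URLs, append an array like ['url' => $url, 'title' => $title, ...] for each page, and call writeRowsToCsv($rows, 'pages.csv') once at the end.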


Author

Commented:
Awesome!
Most Valuable Expert 2011
Top Expert 2016

Commented:
Thanks for the points! ~Ray
