Use PHP to 'crawl' website and collect info

Hello,

I have a large website (100+ pages) that is currently plain HTML, and I am going to rebuild it in WordPress.

Is there any way I can enter the links into a PHP script and have it pull the content, keywords and title from each page and write them to a CSV file or a database?

Thanks,
Matt
Asked by: movieprodw

Ray Paseur Commented:
The answer is "yes" but it may be quite a project.  You can read each HTML document with file_get_contents().  Parsing options include SimpleHTMLDom, or using explode() and regular expressions to tease the data apart.  You can also write PHP scripts that will follow links, fetching the pages automatically.  Another option might be to attach a search engine to the site and look at what it finds.  With search engine output you may be able to shorten the path to all of the data acquisition.
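
As a rough illustration of the "follow links" idea, here is a minimal sketch that collects same-site links from one page so they can be fetched in turn. It is not from the thread: the helper name collect_links(), the example.com URLs and the use of PHP's built-in DOMDocument (instead of SimpleHTMLDom) are all assumptions, and relative links are resolved naively against the site root.

```php
<?php
// Hypothetical helper (not from the thread): collect same-site links from an HTML string.
function collect_links(string $html, string $baseUrl): array
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $host  = parse_url($baseUrl, PHP_URL_HOST);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue;                   // skip empty links and in-page anchors
        }
        // ASSUMPTION: a flat site, so relative links resolve against the site root
        if (parse_url($href, PHP_URL_SCHEME) === null) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        // Keep only links on the same host; array keys de-duplicate
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[$href] = true;
        }
    }
    return array_keys($links);
}

// Demonstration with an inline document instead of a live fetch
$sample = '<html><body><a href="about.php">About</a> <a href="/contact.php">Contact</a>'
        . '<a href="http://other.example/x">Other</a><a href="about.php">Again</a></body></html>';
$found = collect_links($sample, 'http://www.example.com');
print_r($found);
```

In a real crawl the HTML would come from file_get_contents() and each returned URL would be queued for the same treatment, with a list of already-visited URLs to avoid loops.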

movieprodw (Author) Commented:
Okay, those are good ideas, thank you.

I think it may not be too hard because the content is small and well organized.

The content I would want to grab is between:
<div class="bodyLeft"></div>

And I would also want
<title>SAVE THIS CONTENT</title>
<meta name="description" content="SAVE THIS CONTENT" />
<meta name="keywords" content="SAVE THIS CONTENT" />

käµfm³d 👽 Commented:
> The content I would want to grab is between:
> <div class="bodyLeft"></div>
Can any <div> appear inside of the <div class="bodyLeft">?

movieprodw (Author) Commented:
Yes, a div can be in there
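
Nested <div> elements matter because a regular expression or a cut at the first </div> would stop too early. A DOM parser tracks nesting, so one possible approach is sketched below. It is only an illustration: the function name extract_page(), the sample markup and the choice of DOMDocument/DOMXPath are assumptions, not something proposed in the thread.

```php
<?php
// Hypothetical helper (not from the thread): pull the requested fields out of one page.
function extract_page(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Read the content attribute of <meta name="...">, or '' if it is absent
    $meta = function (string $name) use ($xpath): string {
        $node = $xpath->query("//meta[@name='$name']/@content")->item(0);
        return $node ? trim($node->nodeValue) : '';
    };

    // Inner HTML of the first <div class="bodyLeft">, nested divs included
    $body = '';
    $div  = $xpath->query("//div[@class='bodyLeft']")->item(0);
    if ($div) {
        foreach ($div->childNodes as $child) {
            $body .= $doc->saveHTML($child);
        }
    }

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    return [
        'title'       => $titleNode ? trim($titleNode->textContent) : '',
        'description' => $meta('description'),
        'keywords'    => $meta('keywords'),
        'body'        => trim($body),
    ];
}

// Demonstration with made-up markup that has a div nested inside bodyLeft
$sample = '<html><head><title>Background Screening</title>'
        . '<meta name="description" content="Screening services" />'
        . '<meta name="keywords" content="screening, background" /></head>'
        . '<body><div class="bodyLeft"><div class="inner"><p>Hello</p></div><p>World</p></div>'
        . '<div class="bodyRight">Sidebar</div></body></html>';
$page = extract_page($sample);
print_r($page);
```

The XPath test @class='bodyLeft' matches only when that is the element's sole class, which is an assumption about the real pages.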

Ray Paseur Commented:
Maybe post a link to the URL?  Or copy and paste the HTML here?  The quality of any answer you can get will be directly related to the quality of the test data you provide.

movieprodw (Author) Commented:
Hello,

You can see the link below.

http://www.nationalpeo.com/background_screening.php

Ray Paseur Commented:

movieprodw (Author) Commented:
Hello Ray,

That site is going to be trashed, so I don't want to put a ton of time into it. I will check out the errors though. Thank you.

Ray Paseur Commented:
Here's a start.

<?php // RAY_temp_movidprodw.php
error_reporting(E_ALL);

// READ THE DOCUMENT INTO A STRING VARIABLE
$url = 'http://www.nationalpeo.com/background_screening.php';
$all = file_get_contents($url);
if ($all === false) die("UNABLE TO READ $url");
$htm = $all;

// REMOVE EVERYTHING TO THE LEFT OF THE OPENING DIV
$sig = '<div class="bodyLeft" >';
$arr = explode($sig, $htm);
if (!isset($arr[1])) die('SIGNATURE NOT FOUND: ' . htmlentities($sig));
$htm = $sig . $arr[1];

// REMOVE EVERYTHING FROM THE END-OF-BODY COMMENT ONWARD
$sig = '<!-- END BODY -->';
$arr = explode($sig, $htm);
$htm = $arr[0];

// SAVE THE WORK PRODUCT
$bodyleft = $htm;

// GET HEADER INFORMATION IN PSEUDO-XML
// (KEEP FROM <head UP TO THE FIRST <script, THEN CLOSE THE HEAD TAG)
$htm = $all;
$sig = '<script';
$arr = explode($sig, $htm);
$htm = $arr[0];
$sig = '<head';
$arr = explode($sig, $htm);
if (!isset($arr[1])) die('SIGNATURE NOT FOUND: ' . htmlentities($sig));
$htm = $sig . $arr[1] . '</head>';

// ACTIVATE THIS TO SEE THE HEADER INFO
// echo htmlentities($htm);

// USE OBJECT ORIENTED NOTATION TO EXTRACT DATA
// (THIS WORKS ONLY IF THE HEADER IS WELL-FORMED XML)
$obj = simplexml_load_string($htm);
if ($obj === false) die('HEADER IS NOT WELL-FORMED XML');
$title = (string)$obj->title;

// DEFAULTS IN CASE A META TAG IS MISSING
$keywords    = '';
$description = '';
foreach ($obj->meta as $meta)
{
    if ($meta->attributes()->name == 'keywords')    $keywords    = (string)$meta->attributes()->content;
    if ($meta->attributes()->name == 'description') $description = (string)$meta->attributes()->content;
}

// SHOW THE WORK PRODUCT
echo '<pre>';
echo PHP_EOL . htmlentities($title);
echo PHP_EOL . htmlentities($description);
echo PHP_EOL . htmlentities($keywords);
echo PHP_EOL . htmlentities($bodyleft);
echo '</pre>';
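
The script above only displays the extracted values, while the original question asked for a CSV file or a database. A minimal sketch of the CSV step with PHP's fputcsv() might look like the following. The helper name write_pages_csv(), the column names and the sample row are assumptions made for illustration; in practice each row would be filled from $title, $description, $keywords and $bodyleft for each URL.

```php
<?php
// Hypothetical helper (not from the thread): write one CSV row per crawled page.
function write_pages_csv($handle, array $pages): void
{
    // Header row; the column names are an assumption
    $columns = ['url', 'title', 'description', 'keywords', 'body'];
    // Separator, enclosure and escape are passed explicitly (these are the usual defaults)
    fputcsv($handle, $columns, ',', '"', '\\');
    foreach ($pages as $page) {
        $row = [];
        foreach ($columns as $column) {
            $row[] = $page[$column] ?? '';   // a missing field becomes an empty cell
        }
        fputcsv($handle, $row, ',', '"', '\\');
    }
}

// Demonstration with made-up data, written to memory instead of a file
$pages = [
    [
        'url'         => 'http://www.example.com/a.php',
        'title'       => 'Page A',
        'description' => 'First page',
        'keywords'    => 'a, first',
        'body'        => '<p>Hello, "world"</p>',
    ],
];
$out = fopen('php://memory', 'w+');
write_pages_csv($out, $pages);
rewind($out);
$csv = stream_get_contents($out);
fclose($out);
echo $csv;
```

To write a real file, open it with fopen('pages.csv', 'w') (the file name is arbitrary) and pass that handle instead of the in-memory stream.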

movieprodw (Author) Commented:
Awesome!

Ray Paseur Commented:
Thanks for the points! ~Ray