Parse every page of a site recursively with PHP

Hello,

I need to parse a given web site recursively and record in a table how many times a specific keyword is found on each page.

It should work like this:

-parse the main page (e.g. www.whatever.com) and count how many times the "test keyword" is on the page
-put all links of the main page in an array (but only the links which belong to the same domain)

-parse the first link of the main page and count how many times the "test keyword" is on the page
-put all links of the first page in an array

-parse the first link of the first page and count how many times the "test keyword" is on the page
-put all links of that page in an array

.......
parse the last link of the main page etc. ...

There are two things I don't know:
-how to parse every link recursively (so that every link is processed)
-how to count the occurrences of a keyword

Thank you very much
starhu Asked:
Ray Paseur Commented:
I can help you with some experience and with part of the question. First, the experience: unless this is a fairly small web site, you can plan on this taking a very long time.

How to find all of the links on a web page:
1. Read the page with file_get_contents()
2. Use strip_tags() preserving <a>
3. Use a regular expression to remove everything that is not inside the strings <a and </a>
4. Use explode() to create an array with each anchor tag in one position.
5. Use foreach() to access each element of the array
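6. With each element of the array, extract the href, skip links to other domains and pages you have already visited (otherwise the recursion can loop forever), and repeat this process recursively.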
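
The steps above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a definitive crawler: the function names extract_same_host_links() and crawl() are hypothetical, it uses preg_match_all() on quoted href attributes instead of the strip_tags()/explode() route, it resolves only absolute and root-relative links, and it takes the page fetcher as a callable so the logic can be tried without touching the network.

```php
<?php
// HYPOTHETICAL HELPER: RETURN THE LINKS IN $html THAT POINT TO THE SAME HOST AS $base
function extract_same_host_links($html, $base)
{
    $host   = parse_url($base, PHP_URL_HOST);
    $scheme = parse_url($base, PHP_URL_SCHEME);
    $links  = array();

    // ASSUMPTION: HREF VALUES ARE QUOTED; A REAL HTML PARSER WOULD BE MORE ROBUST
    preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $m);
    foreach ($m[2] as $href)
    {
        // DROP FRAGMENTS SO page.html#top AND page.html COUNT AS ONE PAGE
        $href = preg_replace('/#.*$/', '', trim($href));
        if ($href === '') continue;

        // ASSUMPTION: ONLY ROOT-RELATIVE LINKS (/path) ARE RESOLVED; OTHER RELATIVE FORMS ARE SKIPPED
        if ($href[0] === '/' && substr($href, 0, 2) !== '//')
        {
            $href = $scheme . '://' . $host . $href;
        }

        // KEEP ONLY LINKS ON THE SAME HOST, USING ARRAY KEYS TO REMOVE DUPLICATES
        if (parse_url($href, PHP_URL_HOST) === $host) $links[$href] = TRUE;
    }
    return array_keys($links);
}

// HYPOTHETICAL CRAWLER: $fetch IS ANY CALLABLE THAT RETURNS THE HTML FOR A URL, OR FALSE
function crawl($url, $fetch, array &$visited, $limit = 50)
{
    // STOP IF THIS PAGE WAS ALREADY PROCESSED OR THE PAGE LIMIT IS REACHED
    if (isset($visited[$url]) || count($visited) >= $limit) return;

    $html = call_user_func($fetch, $url);
    $visited[$url] = ($html !== FALSE);   // TRUE IF THE PAGE COULD BE READ
    if ($html === FALSE) return;

    // REPEAT THE PROCESS FOR EVERY SAME-HOST LINK FOUND ON THIS PAGE
    foreach (extract_same_host_links($html, $url) as $link)
    {
        crawl($link, $fetch, $visited, $limit);
    }
}
```

A call might look like $visited = array(); crawl('http://www.whatever.com/', 'file_get_contents', $visited); after which the keys of $visited are the pages that were processed. The $limit argument is there because, as noted, a large site can take a very long time.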

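How to count the occurrences of a keyword (this script counts every word on the page; the count for any single word is then available as $unq[$word]):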
<?php // RAY_count_words.php
error_reporting(E_ALL);
echo "<pre>";

// DEMONSTRATE HOW TO COUNT ALL THE WORDS USED ON A WEB PAGE

// USEFUL MAN PAGES:
// http://php.net/manual/en/function.file-get-contents.php
// http://php.net/manual/en/function.preg-replace.php
// http://php.net/manual/en/function.explode.php
// http://php.net/manual/en/array.sorting.php

// ACQUIRE THE DATA
$url = 'http://www.apache.org/';
$htm = file_get_contents($url);
if ($htm === FALSE) die("UNABLE TO READ $url");

// MUNG THE DATA INTO LOWER-CASE
$htm = strtolower($htm);

// REMOVE CSS AND JAVASCRIPT (s LETS . MATCH NEWLINES; .*? STOPS AT THE FIRST CLOSING TAG)
$htm = preg_replace('#<style\b.*?</style>#s', ' ', $htm);
$htm = preg_replace('#<script\b.*?</script>#s', ' ', $htm);

// REMOVE THE HTML TAGS (A BLANK BEFORE EACH TAG KEEPS ADJACENT WORDS FROM RUNNING TOGETHER)
$htm = strip_tags(str_replace('<', ' <', $htm));

// REMOVE EVERYTHING ELSE BUT LETTERS AND BLANKS
$htm = preg_replace('/[^a-z ]/', ' ', $htm);

// CONVERT ANY EXCESS WHITESPACE TO SINGLE BLANKS
$htm = trim(preg_replace('/\s\s+/', ' ', $htm));

// ACTIVATE THIS TO SEE THE "CLEAN" STRING
// echo PHP_EOL . htmlentities($htm);

// MAKE AN ARRAY OF WORDS
$arr = explode(' ', $htm);

// COUNT HOW MANY TIMES EACH UNIQUE WORD OCCURS (KEYS ARE THE WORDS, VALUES ARE THE COUNTS)
$unq = array_count_values($arr);

// SHOW THE WORK PRODUCTS
echo PHP_EOL . "THERE ARE " . count($unq) . " UNIQUE WORDS AMONG ". count($arr) . " TOTAL WORDS";

echo PHP_EOL . "IN ALPHABETICAL ORDER: ";
ksort($unq);
print_r($unq);

echo PHP_EOL . "IN FREQUENCY ORDER: ";
arsort($unq);
print_r($unq);
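
Since the question asks about one specific keyword rather than every word, a shorter sketch built on substr_count() may be enough. The function name count_keyword() is hypothetical, and the sketch assumes a plain case-insensitive substring match on the visible text, so "test" would also be counted inside "testing", and HTML entities are not decoded.

```php
<?php
// HYPOTHETICAL HELPER: COUNT CASE-INSENSITIVE OCCURRENCES OF $keyword IN THE TEXT OF $html
function count_keyword($html, $keyword)
{
    // REMOVE CSS AND JAVASCRIPT, THEN PUT A BLANK BEFORE EACH TAG SO WORDS DO NOT RUN TOGETHER
    $text = preg_replace('#<(style|script)\b.*?</\1>#is', ' ', $html);
    $text = strip_tags(str_replace('<', ' <', $text));

    // COLLAPSE WHITESPACE SO A MULTI-WORD KEYWORD MATCHES ACROSS LINE BREAKS
    $text    = strtolower(trim(preg_replace('/\s+/', ' ', $text)));
    $keyword = strtolower(trim(preg_replace('/\s+/', ' ', $keyword)));

    // substr_count() REJECTS AN EMPTY NEEDLE, SO HANDLE THAT CASE FIRST
    if ($keyword === '') return 0;
    return substr_count($text, $keyword);
}
```

When visiting many pages, the result for each URL could be stored as $counts[$url] = count_keyword($htm, 'test keyword'); which gives the table of pages and counts described in the question.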

starhu (Author) Commented:
Hello,

Can this parse PDF too?

Thank you
Ray Paseur Commented:
No, PDF is a different thing. HTML is semantic; PDF is page layout. It sounds like what you want has already been invented. It is called Google Site Search. There are others that you might find helpful, including Atomz, PicoSearch, FreeFind, Wrensoft Zoom Search, etc.