Solved

Parse every page of a site recursively with PHP

Posted on 2012-03-12
Last Modified: 2012-03-16
Hello,

I need to parse a given web site recursively and record in a table how many times a specific keyword appears on each page.

It should work like this:

- Parse the main page (e.g. www.whatever.com) and count how many times the "test keyword" appears on the page
- Put all links of the main page in an array (but only the links that belong to the same domain)

- Parse the first link of the main page and count how many times the "test keyword" appears on the page
- Put all links of that first page in an array

- Parse the first link of the first page and count how many times the "test keyword" appears on the page
- Put all links of that page in an array

.......
- Parse the last link of the main page, etc. ...

There are two things I don't know:
- How to parse every link recursively (so that every link is processed)
- How to count the occurrences of a keyword

Thank you very much
Question by:starhu

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 37712053
I can share some experience and help with part of the question. First, the experience: unless this is a fairly small web site, you can plan on the crawl taking a very long time.

How to find all of the links on a web page:
1. Read the page with file_get_contents().
2. Use strip_tags(), passing '<a>' as the allowed tag so that only the anchor tags are preserved.
3. Use a regular expression to remove everything that is not between <a and </a>.
4. Use explode() to create an array with each anchor tag in one position.
5. Use foreach() to access each element of the array.
6. With each element of the array, repeat this process recursively. Keep a list of the URLs you have already visited and skip them, otherwise pages that link to each other will make the recursion run forever.
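
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses DOMDocument to pull out the href values instead of the strip_tags()/regular-expression route described above, it resolves only absolute and root-relative links, and it treats "same domain" as "exactly the same host name". The function names extract_same_host_links() and crawl(), the $fetch callable, and the $maxPages limit are hypothetical names made up for this example.

```php
<?php
// Hypothetical helper: return the absolute, same-host links found in $html.
// Only absolute URLs and root-relative paths ("/about") are handled;
// other relative forms are skipped to keep the sketch short.
function extract_same_host_links(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return [];                       // loadHTML() rejects empty input
    }
    $host   = parse_url($baseUrl, PHP_URL_HOST);
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME) ?: 'http';

    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue;                    // no target, or same-page anchor
        }
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = "$scheme://$host$href";   // root-relative path
        }
        if (parse_url($href, PHP_URL_HOST) === $host) {
            // Drop the fragment so "page#top" and "page" count as one URL.
            $links[strtok($href, '#')] = true;
        }
    }
    return array_keys($links);
}

// Hypothetical crawler: visits each reachable same-host page once.
// $fetch is any callable returning the HTML for a URL, or false on failure.
// $pages (URL => HTML) is passed by reference; it doubles as the list of
// visited URLs, which is what stops the recursion from looping.
function crawl(string $url, callable $fetch, array &$pages, int $maxPages = 200): void
{
    if (isset($pages[$url]) || count($pages) >= $maxPages) {
        return;
    }
    $html = $fetch($url);
    $pages[$url] = ($html === false) ? '' : $html;

    foreach (extract_same_host_links($pages[$url], $url) as $link) {
        crawl($link, $fetch, $pages, $maxPages);
    }
}
```

To try it against a live site you could start with an empty array, for example $pages = array(); crawl('http://www.whatever.com/', 'file_get_contents', $pages); and then loop over $pages to count the keyword on each page. Holding every page in memory is fine for a small site only; for anything larger you would count as you go and store just the numbers.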

How to count the occurrences of a keyword:
<?php // RAY_count_words.php
error_reporting(E_ALL);
echo "<pre>";

// DEMONSTRATE HOW TO COUNT ALL THE WORDS USED ON A WEB PAGE

// USEFUL MAN PAGES:
// http://php.net/manual/en/function.file-get-contents.php
// http://php.net/manual/en/function.preg-replace.php
// http://php.net/manual/en/function.explode.php
// http://php.net/manual/en/array.sorting.php

// ACQUIRE THE DATA
$url = 'http://www.apache.org/';
$htm = file_get_contents($url);
if ($htm === FALSE) die("UNABLE TO READ $url");

// MUNG THE DATA INTO LOWER-CASE
$htm = strtolower($htm);

// REMOVE CSS AND JAVASCRIPT
// THE "s" MODIFIER LETS "." MATCH NEWLINES; ".*?" STOPS AT THE FIRST CLOSING TAG
$htm = preg_replace('#<style\b.*?</style>#s', ' ', $htm);
$htm = preg_replace('#<script\b.*?</script>#s', ' ', $htm);

// REMOVE THE HTML TAGS
$htm = strip_tags($htm);

// REMOVE EVERYTHING ELSE BUT LETTERS AND BLANKS
$htm = preg_replace('/[^a-z ]/', ' ', $htm);

// CONVERT ANY EXCESS WHITESPACE TO SINGLE BLANKS
$htm = trim(preg_replace('/\s\s+/', ' ', $htm));

// ACTIVATE THIS TO SEE THE "CLEAN" STRING
// echo PHP_EOL . htmlentities($htm);

// MAKE AN ARRAY OF WORDS, SKIPPING ANY EMPTY STRINGS
$arr = preg_split('/ +/', $htm, -1, PREG_SPLIT_NO_EMPTY);

// COUNT HOW MANY TIMES EACH UNIQUE WORD OCCURS
$unq = array_count_values($arr);

// SHOW THE WORK PRODUCTS
echo PHP_EOL . "THERE ARE " . count($unq) . " UNIQUE WORDS AMONG ". count($arr) . " TOTAL WORDS";

echo PHP_EOL . "IN ALPHABETICAL ORDER: ";
ksort($unq);
print_r($unq);

echo PHP_EOL . "IN FREQUENCY ORDER: ";
arsort($unq);
print_r($unq);
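
The script above counts every word on the page. If only one specific keyword or phrase matters, something along these lines may be enough. It is a minimal sketch, not a tested production routine: count_keyword() is a hypothetical name, the match is a plain case-insensitive substring match (so "test" also counts inside "testing"), and strtolower() is assumed to be adequate, which holds for ASCII keywords.

```php
<?php
// Hypothetical helper: count case-insensitive occurrences of $keyword
// in the visible text of an HTML page.
function count_keyword(string $html, string $keyword): int
{
    // Remove <script> and <style> blocks together with their contents.
    // preg_replace() returns null on a regex error, so fall back to the input.
    $text = preg_replace('#<(script|style)\b.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Strip the remaining tags and decode entities such as &amp;
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

    // Collapse whitespace so a multi-word keyword matches across line breaks.
    $text    = strtolower(preg_replace('/\s+/', ' ', $text) ?? $text);
    $keyword = strtolower(trim(preg_replace('/\s+/', ' ', $keyword) ?? $keyword));

    if ($keyword === '') {
        return 0;                        // substr_count() rejects an empty needle
    }
    // substr_count() counts non-overlapping occurrences.
    return substr_count($text, $keyword);
}
```

Calling count_keyword($htm, 'test keyword') on the raw HTML of each page gives the one number per page that the question asks for; those numbers can then be written out as table rows.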


Author Comment

by:starhu
ID: 37713581
Hello,

Can this parse PDF files too?

Thank you

Expert Comment

by:Ray Paseur
ID: 37714035
No, PDF is a different thing. HTML is semantic markup; PDF is page layout. It sounds like what you want has already been invented: it is called Google Site Search. There are others that you might find helpful, including Atomz, PicoSearch, FreeFind, and Wrensoft Zoom Search.