Solved

Parse every page of a site recursively with PHP

Posted on 2012-03-12
Last Modified: 2012-03-16
Hello,

I need to parse a given web site recursively and record in a table how many times a specific keyword was found on each page.

It should work like this:

-parse the main page (e.g. www.whatever.com) and count how many times the "test keyword" is on the page
-put all links of the main page in an array (but only the links which belong to the same domain)

-parse the first link of the main page and count how many times the "test keyword" is on the page
-put all links of the first page in an array

-parse the first link of the first page and count how many times the "test keyword" is on the page
-put all links of that page in an array

.......
parse the last link of the main page etc. ...

There are two things I don't know:
-how to parse every link recursively (so that every link is processed)
-how to count the occurrences of a keyword

Thank you very much
Question by:starhu

Accepted Solution

by: Ray Paseur
I can help you with some experience and with part of the question. First, the experience: unless this is a fairly small web site, you can plan on this taking a very long time.

How to find all of the links on a web page:
1. Read the page with file_get_contents().
2. Use strip_tags() with <a> as the allowed tag, so that only the anchor tags keep their markup.
3. Use a regular expression to remove everything that is not between <a and </a>.
4. Use explode() to create an array with one anchor tag per element.
5. Use foreach() to access each element of the array.
6. With each element of the array, repeat this process recursively, skipping URLs you have already visited so that pages linking to each other do not loop forever.
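The steps above can be sketched roughly as follows. This is only an illustration under some assumptions, not the code from the answer: it swaps the strip_tags()/regular-expression steps for PHP's DOMDocument parser (the dom extension must be enabled), and it uses a queue with a visited list instead of literal recursion so that deep sites do not exhaust the call stack. The names extractSameHostLinks() and crawl(), the $fetch and $handlePage callbacks, and the $maxPages limit are all hypothetical. Link resolution is deliberately simplified (no "../" handling, ports are dropped).

```php
<?php
// Hypothetical helper (not from the original answer): return the absolute,
// same-host links found in $html, resolved against the page's own URL.
function extractSameHostLinks(string $html, string $pageUrl): array
{
    $host   = parse_url($pageUrl, PHP_URL_HOST);
    $scheme = parse_url($pageUrl, PHP_URL_SCHEME) ?: 'http';
    if (!$host || trim($html) === '') {
        return [];
    }
    // Directory part of the current path, used for simple relative links.
    $path = parse_url($pageUrl, PHP_URL_PATH) ?: '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // Drop the fragment so "/page" and "/page#top" count as one URL.
        $href = preg_replace('/#.*$/', '', trim($a->getAttribute('href')));
        if ($href === '') {
            continue;
        }
        if (strpos($href, '//') === 0) {
            $href = $scheme . ':' . $href;          // protocol-relative
        } elseif ($href[0] === '/') {
            $href = "$scheme://$host$href";         // root-relative
        } elseif (!preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
            $href = "$scheme://$host$dir$href";     // simple relative path
        }
        // Keep only links on the same host (mailto:, javascript: have none).
        $linkHost = parse_url($href, PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $links[$href] = true;                   // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Hypothetical crawler: visit each same-host page once, up to $maxPages.
// $fetch($url) returns the HTML or false; $handlePage($url, $html) returns
// whatever should be stored for that page (for example a keyword count).
function crawl(string $startUrl, callable $fetch, callable $handlePage, int $maxPages = 50): array
{
    $results = [];              // url => result; also serves as the visited list
    $queue   = [$startUrl];
    while ($queue && count($results) < $maxPages) {
        $url = array_shift($queue);
        if (array_key_exists($url, $results)) {
            continue;           // already visited
        }
        $html = $fetch($url);
        if ($html === false) {
            $results[$url] = null;   // unreadable page; remember it anyway
            continue;
        }
        $results[$url] = $handlePage($url, $html);
        foreach (extractSameHostLinks($html, $url) as $link) {
            if (!array_key_exists($link, $results)) {
                $queue[] = $link;
            }
        }
    }
    return $results;
}
```

To run it against a live site you could pass 'file_get_contents' as $fetch and a function that counts your keyword as $handlePage. Keeping $maxPages small while testing avoids hammering the server.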

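How to count the occurrences of a keyword. This script counts every word on the page, so a single-word keyword can be looked up in the resulting array: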
<?php // RAY_count_words.php
error_reporting(E_ALL);
echo "<pre>";

// DEMONSTRATE HOW TO COUNT ALL THE WORDS USED ON A WEB PAGE

// USEFUL MAN PAGES:
// http://php.net/manual/en/function.file-get-contents.php
// http://php.net/manual/en/function.preg-replace.php
// http://php.net/manual/en/function.explode.php
// http://php.net/manual/en/array.sorting.php

// ACQUIRE THE DATA (file_get_contents() RETURNS FALSE ON FAILURE)
$url = 'http://www.apache.org/';
$htm = file_get_contents($url);
if ($htm === FALSE) die("UNABLE TO READ $url");

// MUNG THE DATA INTO LOWER-CASE
$htm = strtolower($htm);

// REMOVE CSS AND JAVASCRIPT
// THE "s" MODIFIER LETS "." MATCH LINE BREAKS; ".*?" STOPS AT THE FIRST CLOSING TAG
$htm = preg_replace('/<style.*?<\/style>/s', ' ', $htm);
$htm = preg_replace('/<script.*?<\/script>/s', ' ', $htm);

// REMOVE THE HTML TAGS
$htm = strip_tags($htm);

// REMOVE EVERYTHING ELSE BUT LETTERS AND BLANKS
$htm = preg_replace('/[^a-z ]/', ' ', $htm);

// CONVERT ANY EXCESS WHITESPACE TO SINGLE BLANKS
$htm = trim(preg_replace('/\s\s+/', ' ', $htm));

// ACTIVATE THIS TO SEE THE "CLEAN" STRING
// echo PHP_EOL . htmlentities($htm);

// MAKE AN ARRAY OF WORDS
$arr = explode(' ', $htm);

// COUNT HOW MANY TIMES EACH UNIQUE WORD OCCURS (WORD => COUNT)
$unq = array_count_values($arr);

// SHOW THE WORK PRODUCTS
echo PHP_EOL . "THERE ARE " . count($unq) . " UNIQUE WORDS AMONG ". count($arr) . " TOTAL WORDS";

echo PHP_EOL . "IN ALPHABETICAL ORDER: ";
ksort($unq);
print_r($unq);

echo PHP_EOL . "IN FREQUENCY ORDER: ";
arsort($unq);
print_r($unq);
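
If only one keyword or phrase matters, such as the "test keyword" in the question, counting it directly may be simpler than building the full word table. The following is a minimal sketch under that assumption. The function name countKeyword() and its $wholeWord flag are hypothetical, and the match is case-insensitive on the visible text only (script and style blocks are dropped first).

```php
<?php
// Hypothetical helper: count occurrences of $keyword in the visible text of $html.
function countKeyword(string $html, string $keyword, bool $wholeWord = false): int
{
    if ($keyword === '') {
        return 0;   // substr_count() rejects an empty needle
    }
    // Drop script and style blocks, then the remaining tags.
    $text = preg_replace('#<(script|style)\b.*?</\1>#is', ' ', $html) ?? $html;
    $text = strip_tags($text);
    // Collapse whitespace so a phrase still matches across line breaks.
    $text = preg_replace('/\s+/', ' ', $text) ?? $text;

    if ($wholeWord) {
        // \b keeps "test" from matching inside "testing".
        $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
        return (int) preg_match_all($pattern, $text);
    }
    // Plain substring count, case-insensitive via lower-casing both sides.
    return substr_count(strtolower($text), strtolower($keyword));
}
```

The substring mode treats "testing" as a hit for "test"; the whole-word mode does not. Which one is right depends on what the table of results is meant to show.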


Author Comment

by:starhu
Hello,

Can this parse PDF too?

Thank you

Expert Comment

by:Ray Paseur
No, PDF is a different thing. HTML is semantic; PDF is page layout. It sounds like what you want has already been invented. It is called Google Site Search. There are others that you might find helpful, including Atomz, PicoSearch, FreeFind, Wrensoft Zoom Search, etc.