Parse every page of a site recursively with PHP

Posted on 2012-03-12
Last Modified: 2012-03-16
Hello,

I need to parse a given website recursively and record in a table how many times a specific keyword was found on each page.

It should work like this:

- Parse the main page (e.g. www.whatever.com) and count how many times the "test keyword" appears on the page
- Put all links of the main page in an array (but only the links that belong to the same domain)

- Parse the first link of the main page and count how many times the "test keyword" appears on that page
- Put all links of that page in an array

- Parse the first link of that page and count how many times the "test keyword" appears on it
- Put all links of that page in an array

...
- Continue in the same way until the last link of the main page has been parsed, and so on

There are two things I don't know:
- how to parse every link recursively (so that every link is processed)
- how to count the occurrences of a keyword

Thank you very much
Question by:starhu
Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 37712053
I can help you with some experience and with part of the question. First the experience: unless this is a fairly small web site, you can plan on the crawl taking a very long time.

How to find all of the links on a web page:
1. Read the page with file_get_contents().
2. Use strip_tags() with '<a>' as the allowed tag, so that only the anchor tags survive.
3. Use a regular expression to remove everything that is not between <a and </a>.
4. Use explode() to create an array with one anchor tag in each position.
5. Use foreach() to access each element of the array.
6. For each element, take the URL and repeat this process recursively. Keep a list of the URLs already visited so that pages which link to each other do not cause an endless loop.
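The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the function names extract_same_host_links() and crawl(), the $maxPages limit, and the example.test URLs are all made up for the example, and it pulls the href values out with a single regular expression instead of the strip_tags()/explode() sequence. It only understands absolute URLs and paths that start with a slash; other relative links are skipped. The page fetcher is passed in as a callable so the sketch can run against a small in-memory "site" without touching the network.

```php
<?php
// Hypothetical helper: return the links in $html that point to the same host as $baseUrl.
function extract_same_host_links(string $html, string $baseUrl): array
{
    $host   = parse_url($baseUrl, PHP_URL_HOST);
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME) ?: 'http';

    // Capture the quoted href value of every <a ...> tag.
    preg_match_all('/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m);

    $links = [];
    foreach ($m[2] as $href) {
        // Drop any #fragment so the same page is not queued twice.
        $href = trim(preg_replace('/#.*$/s', '', $href));
        if ($href === '') {
            continue;
        }
        if (strpos($href, '//') === 0) {
            $href = "$scheme:$href";          // protocol-relative URL
        } elseif ($href[0] === '/') {
            $href = "$scheme://$host$href";   // root-relative path
        }
        // Keep only links on the same host; other relative forms have no host and are skipped.
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[$href] = true;             // array keys remove duplicates
        }
    }
    return array_keys($links);
}

// Hypothetical depth-first crawl. $fetch is any callable returning the page body or false.
function crawl(string $url, callable $fetch, array &$visited, int $maxPages = 50): void
{
    if (isset($visited[$url]) || count($visited) >= $maxPages) {
        return;
    }
    $visited[$url] = true;                    // mark before recursing so link cycles terminate
    $html = $fetch($url);
    if ($html === false) {
        return;
    }
    foreach (extract_same_host_links($html, $url) as $link) {
        crawl($link, $fetch, $visited, $maxPages);
    }
}

// A tiny in-memory "site" standing in for real HTTP requests.
$site = [
    'http://example.test/'  => '<a href="/a">A</a> <a href="http://other.test/">elsewhere</a>',
    'http://example.test/a' => '<a href="/">home</a> <a href="/b#top">B</a>',
    'http://example.test/b' => 'no links here',
];
$fetch = function (string $url) use ($site) {
    return $site[$url] ?? false;
};

$visited = [];
crawl('http://example.test/', $fetch, $visited);
print_r(array_keys($visited));
```

For a real site, 'file_get_contents' could be passed as the $fetch argument, since it takes a URL and returns false on failure. The page limit is there because of the point made earlier: on anything but a small site a full crawl can run for a very long time.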

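How to count the occurrences of a keyword. This script counts every word on the page; a single-word keyword can then be looked up in the resulting array: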
<?php // RAY_count_words.php
error_reporting(E_ALL);
echo "<pre>";

// DEMONSTRATE HOW TO COUNT ALL THE WORDS USED ON A WEB PAGE

// USEFUL MAN PAGES:
// http://php.net/manual/en/function.file-get-contents.php
// http://php.net/manual/en/function.preg-replace.php
// http://php.net/manual/en/function.explode.php
// http://php.net/manual/en/array.sorting.php

// ACQUIRE THE DATA
$url = 'http://www.apache.org/';
$htm = file_get_contents($url);

// MUNG THE DATA INTO LOWER-CASE
$htm = strtolower($htm);

// REMOVE CSS AND JAVASCRIPT (NON-GREEDY; THE "s" FLAG LETS "." MATCH NEWLINES)
$htm = preg_replace('#<style\b.*?</style>#s', ' ', $htm);
$htm = preg_replace('#<script\b.*?</script>#s', ' ', $htm);

// REMOVE THE HTML TAGS
$htm = strip_tags($htm);

// REMOVE EVERYTHING ELSE BUT LETTERS AND BLANKS
$htm = preg_replace('/[^a-z ]/', ' ', $htm);

// CONVERT ANY EXCESS WHITESPACE TO SINGLE BLANKS
$htm = trim(preg_replace('/\s\s+/', ' ', $htm));

// ACTIVATE THIS TO SEE THE "CLEAN" STRING
// echo PHP_EOL . htmlentities($htm);

// MAKE AN ARRAY OF WORDS
$arr = explode(' ', $htm);

// COUNT HOW OFTEN EACH WORD OCCURS (WORD => COUNT)
$unq = array_count_values($arr);

// SHOW THE WORK PRODUCTS
echo PHP_EOL . "THERE ARE " . count($unq) . " UNIQUE WORDS AMONG ". count($arr) . " TOTAL WORDS";

echo PHP_EOL . "IN ALPHABETICAL ORDER: ";
ksort($unq);
print_r($unq);

echo PHP_EOL . "IN FREQUENCY ORDER: ";
arsort($unq);
print_r($unq);
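
The script above counts every word on the page. For one specific keyword or phrase such as "test keyword", a shorter route is possible. The following is a minimal sketch, not part of the script above: the function name count_keyword() and the sample HTML are made up for illustration. It uses substr_count(), which counts substrings, so "test" would also be counted inside "testing".

```php
<?php
// Hypothetical helper: count case-insensitive occurrences of $keyword in the visible text of $html.
function count_keyword(string $html, string $keyword): int
{
    if ($keyword === '') {
        return 0;   // substr_count() rejects an empty needle
    }
    // Remove <script> and <style> blocks, then the remaining tags.
    $text = preg_replace('#<(script|style)\b.*?</\1>#is', ' ', $html);
    $text = strip_tags($text);
    // Collapse whitespace so a phrase still matches across line breaks.
    $text = preg_replace('/\s+/', ' ', $text);

    return substr_count(strtolower($text), strtolower($keyword));
}

$html = "<p>Test keyword here.</p>"
      . "<script>var s = 'test keyword';</script>"
      . "<p>Another test\nkeyword</p>";

echo count_keyword($html, 'test keyword') . PHP_EOL;
```

Calling this once per crawled URL and storing the result under the URL as the array key would give the per-page table asked for in the question.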

Author Comment

by:starhu
ID: 37713581
Hello,

Can this parse PDF files too?

Thank you
Expert Comment

by:Ray Paseur
ID: 37714035
No, PDF is a different thing. HTML is semantic markup; PDF describes page layout. It sounds like what you want has already been invented: it is called Google Site Search. There are others that you might find helpful, including Atomz, PicoSearch, FreeFind, and Wrensoft Zoom Search.