Solved

Parse every page of a site recursively with PHP

Posted on 2012-03-12
Last Modified: 2012-03-16
Hello,

I need to parse a given web site recursively, writing in a table how many times the specific keyword was found on each page.

It should work like this:

-parse the main page (e.g. www.whatever.com) and count how many times the "test keyword" is on the page
-put all links of the main page in an array (but only the links which belong to the same domain)

-parse the first link of the main page and count how many times the "test keyword" is on the page
-put all links of the first page in an array

-parse the first link of the first page and count how many times the "test keyword" is on the page
-put all links of that page in an array

.......
parse the last link of the main page etc. ...

There are two things I don't know:
-how to parse every link recursively (so that every link is processed)
-how to count the occurrences of a keyword

Thank you very much
Question by:starhu

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 37712053
I can help you with some experience and with part of the question.  First the experience... Unless this is a fairly small web site, you can plan on this taking a very long time.

How to find all of the links on a web page:
1. Read the page with file_get_contents()
2. Use strip_tags($htm, '<a>') to remove every tag except <a>
3. Use a regular expression to remove everything that is not between <a and </a>
4. Use explode() to create an array with each anchor tag in one position.
5. Use foreach() to access each element of the array
6. With each link that belongs to the same domain, repeat this process recursively. Keep a list of the URLs you have already visited; otherwise pages that link to each other will make the recursion run forever.
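
The steps above can be sketched roughly as follows. This is only an illustration, not the script from the answer: it swaps the strip_tags()/regex approach for PHP's DOMDocument parser, and the function names same_host_links() and crawl(), the $fetch callback and the $depth limit are all hypothetical names made up for this sketch. It also assumes links are either absolute URLs or start with "/"; other relative links are skipped.

```php
<?php
// Hypothetical helper: return the absolute links in $html that point to the
// same host as $baseUrl. Relative links not starting with "/" are ignored.
function same_host_links(string $html, string $baseUrl): array
{
    if ($html === '') {
        return array();
    }
    $host   = parse_url($baseUrl, PHP_URL_HOST);
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME) ?: 'http';

    // Real-world HTML is rarely valid, so collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty link or in-page anchor
        }
        // Turn a root-relative path into an absolute URL on the same host
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = "$scheme://$host$href";
        }
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $href = preg_replace('/#.*$/', '', $href); // drop the fragment
            $links[$href] = true;                      // keys de-duplicate
        }
    }
    return array_keys($links);
}

// Hypothetical recursive crawler. $fetch is any callable that takes a URL and
// returns the page HTML (or FALSE on failure). $seen collects visited URLs so
// that no page is processed twice; $depth limits how far the recursion goes.
function crawl(string $url, callable $fetch, array &$seen, int $depth = 2): void
{
    if ($depth < 0 || isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;

    $html = $fetch($url);
    if ($html === false || $html === '') {
        return;
    }
    foreach (same_host_links($html, $url) as $link) {
        crawl($link, $fetch, $seen, $depth - 1);
    }
}

// Possible usage against a live site (not executed here):
// $seen = array();
// crawl('http://www.apache.org/', 'file_get_contents', $seen, 1);
// print_r(array_keys($seen));
```

The $seen array is what keeps the recursion from looping forever when two pages link to each other, and the depth limit is one way to keep the run time under control on a large site.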

How to count the occurrences of a keyword: this script counts every word on the page, so you can look up your keyword in the resulting array.
<?php // RAY_count_words.php
error_reporting(E_ALL);
echo "<pre>";

// DEMONSTRATE HOW TO COUNT ALL THE WORDS USED ON A WEB PAGE

// USEFUL MAN PAGES:
// http://php.net/manual/en/function.file-get-contents.php
// http://php.net/manual/en/function.preg-replace.php
// http://php.net/manual/en/function.explode.php
// http://php.net/manual/en/array.sorting.php

// ACQUIRE THE DATA
$url = 'http://www.apache.org/';
$htm = file_get_contents($url);
if ($htm === FALSE) die("UNABLE TO READ $url");

// MUNG THE DATA INTO LOWER-CASE
$htm = strtolower($htm);

// REMOVE CSS AND JAVASCRIPT
// THE "s" MODIFIER LETS "." MATCH NEWLINES; ".*?" STOPS AT THE FIRST CLOSING TAG
$htm = preg_replace('/<style\b.*?<\/style>/s', '', $htm);
$htm = preg_replace('/<script\b.*?<\/script>/s', '', $htm);

// REMOVE THE HTML TAGS
$htm = strip_tags($htm);

// REMOVE EVERYTHING ELSE BUT LETTERS AND BLANKS
$htm = preg_replace('/[^a-z ]/', ' ', $htm);

// CONVERT ANY EXCESS WHITESPACE TO SINGLE BLANKS
$htm = trim(preg_replace('/\s\s+/', ' ', $htm));

// ACTIVATE THIS TO SEE THE "CLEAN" STRING
// echo PHP_EOL . htmlentities($htm);

// MAKE AN ARRAY OF WORDS
$arr = explode(' ', $htm);

// COUNT HOW MANY TIMES EACH UNIQUE WORD APPEARS
$unq = array_count_values($arr);

// SHOW THE WORK PRODUCTS
echo PHP_EOL . "THERE ARE " . count($unq) . " UNIQUE WORDS AMONG ". count($arr) . " TOTAL WORDS";

echo PHP_EOL . "IN ALPHABETICAL ORDER: ";
ksort($unq);
print_r($unq);

echo PHP_EOL . "IN FREQUENCY ORDER: ";
arsort($unq);
print_r($unq);
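
If only one keyword or phrase matters (such as the "test keyword" in the question), counting every word is more work than needed. A shorter sketch, assuming a case-insensitive substring match is acceptable, could look like this. The function name count_keyword() is hypothetical and is not part of the script above.

```php
<?php
// Hypothetical helper: count case-insensitive occurrences of $keyword in the
// visible text of $html. Multi-word phrases are allowed.
function count_keyword(string $html, string $keyword): int
{
    if ($keyword === '') {
        return 0; // substr_count() rejects an empty needle
    }
    // Drop <script> and <style> blocks so their contents are not counted
    $text = (string) preg_replace('#<(script|style)\b.*?</\1>#is', ' ', $html);

    // Remove the remaining tags and decode entities such as &amp;
    $text = html_entity_decode(strip_tags($text));

    // Collapse whitespace so a phrase still matches across line breaks
    $text = (string) preg_replace('/\s+/', ' ', $text);

    // substr_count() counts non-overlapping substring matches
    return substr_count(strtolower($text), strtolower($keyword));
}

// Possible usage (not executed here):
// echo count_keyword(file_get_contents('http://www.apache.org/'), 'test keyword');
```

Because this is a substring match, the keyword "test" is also counted inside "testing". If whole words are required, the word-array approach above may be the better fit. For text that is not plain ASCII, mb_strtolower() may be preferable to strtolower().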

 

Author Comment

by:starhu
ID: 37713581
Hello,

Can this parse PDF too?

Thank you

Expert Comment

by:Ray Paseur
ID: 37714035
No, PDF is a different thing.   HTML is semantic; PDF is page layout.  It sounds like what you want has already been invented.  It is called Google Site Search.  There are others that you might find helpful including Atomz, PicoSearch, FreeFind, Wrensoft Zoom Search, etc.