Solved

Grab webpage, strip tags, count and unique values

Posted on 2011-02-24
Last Modified: 2012-05-11
I'm trying to grab a webpage, strip the HTML down to just the text content, lowercase everything, remove excess spaces, count how many times each string appears, and then display each unique string with its count.
The code below grabs the page, strips the tags, but doesn't have the count or trimming yet.
1 Grab page
2 strip tags
3 trim??
4 count how many times each string appears
5 print the unique strings with the previous count next to them separated by a comma.
I don't know PHP all that well, so I'd need it written for me.
Thanks!
-rich
<?php
$homepage = file_get_contents('http://www.example.com/');
$strip = strtolower(strip_tags($homepage));
var_dump(array_unique(explode(' ',$strip)));
?>

Question by:Rich Rumble
11 Comments
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34970685
This should do most of what you want

<?php

$homepage = file_get_contents('http://www.example.com/');

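// Keep only what sits between <body> and </body>. The match is
// case-insensitive; the page is left unchanged if no body tag is found.
//
$homepage = preg_replace('#^.*?<body[^>]*>(.*)</body>.*$#is', '$1', $homepage );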


$strip = strtolower(strip_tags($homepage));

// Collapse every run of whitespace (spaces, tabs, newlines) into a
// single space, then trim the ends so explode() yields no empty strings
//
$strip = trim(preg_replace('#\s+#', ' ', $strip ));


var_dump(array_unique(explode(' ',$strip)));
?>
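
For the counting and printing steps that the sample above leaves out, a minimal sketch might look like the following. It assumes the text has already been through strip_tags(); the function name word_counts() is hypothetical and not something from this thread. array_count_values() does the tallying, and each result is printed as string,count.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): turn text that
// has already been stripped of tags into an array of word => count pairs.
function word_counts($text)
{
    // Lowercase, collapse whitespace runs to one space, trim the ends.
    $text = strtolower(trim(preg_replace('#\s+#', ' ', $text)));
    if ($text === '') {
        return array();
    }

    // Tally how many times each word appears.
    $counts = array_count_values(explode(' ', $text));

    // Most frequent words first; keys (the words) are preserved.
    arsort($counts);
    return $counts;
}

// Print each unique string with its count, separated by a comma.
foreach (word_counts("The cat  and the\n hat") as $word => $count) {
    print $word . ',' . $count . "\n";
}
```

Swapping the sample string for the $strip variable from the code above would give the per-page counts.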

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971069
It's very close. Someone sent me the code below, which is closer still: it works on simple pages, but not on a page like yahoo.com. This will be an intranet tool for our organization, but we have complex pages like Yahoo's all over as well, and those also seem to break the script below.
I used var_dump because I didn't know how to print the array properly (still questionable)...
-rich
<?php
header("Content-Type: text/plain");
$file = file('http://example.com');
$words = array();
foreach($file as $line) {
  $line = trim(strip_tags(trim($line)));
  $line = explode(" ", trim($line));
  foreach ($line as $word) {
    if (!in_array($word, array_keys($words))) {
	  $words[$word] = 1;
	} else {
	  $words[$word]++;
	};
  };
};
asort($words);
foreach ($words as $key => $value) {
  print "String: " . $key . "\tCount: " . $value . "\r\n";
};
?>

 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971122
"....it works on simple pages, but a page like yahoo.com for example it doesn't...."

Why? How does it break?

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971183
Notice: Undefined index: 1024) in C:\wamp\www\file.php on line 12
Notice: Undefined index: -1){ in C:\wamp\www\file.php on line 12
Undefined index: background-position: in C:\wamp\www\file.php on line 12
etc... Perhaps the page isn't fully loaded when it begins to parse? He used file() as opposed to file_get_contents()... not sure if we should use cURL instead or what.
-rich
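
A likely cause of these notices, assuming the script ran on a PHP version before 8: in_array() compares loosely by default, and array_keys() returns integer keys for numeric words such as 1024. Under that loose comparison a token like "1024)" counts as equal to the integer 1024, so in_array() reports the word as already seen, and the ++ on line 12 then touches a key that does not exist yet. Checking the key itself with isset() avoids the mismatch. A sketch of that idea follows; count_words() is a hypothetical name, not part of the script above.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): count tokens
// using isset(), which tests for the exact array key instead of loosely
// comparing the token against every existing key.
function count_words(array $tokens)
{
    $counts = array();
    foreach ($tokens as $token) {
        if ($token === '') {
            continue; // skip empty strings left behind by repeated spaces
        }
        if (isset($counts[$token])) {
            $counts[$token]++;
        } else {
            $counts[$token] = 1;
        }
    }
    return $counts;
}

// Tokens modelled on the ones in the notices above.
$counts = count_words(array('1024', '1024)', '0', 'background-position:', '1024'));
foreach ($counts as $word => $count) {
    print $word . ',' . $count . "\n";
}
```

The same effect could be had by passing true as the third (strict) argument to in_array(), but isset() also avoids rebuilding the key list on every word.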
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971333
Yours works "better" on such pages, so perhaps it's strip_tags() that isn't working as well as it should...
By better I mean it doesn't throw errors, but it's a var_dump as opposed to a count and string printed with \r\n
:)
-rich
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971383
Ahh, it's inline CSS that's doing it. Weird; a limitation of strip_tags(), I suppose.
-rich
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971433
I could probably come up with a regex that would strip inline CSS. As for the var_dump I just used what was in your sample...
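
A regex along those lines might look like the sketch below. strip_tags() removes the <style> and <script> tags themselves but keeps the text between them, which is how CSS and JavaScript fragments end up in the word list. The helper name visible_text() is hypothetical, and a regex like this is only an approximation that can be fooled by unusual markup.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): remove <script>
// and <style> elements together with their contents, then let strip_tags()
// remove the remaining markup.
function visible_text($html)
{
    // \1 refers back to whichever tag name opened the element; the "s" flag
    // lets "." span newlines and "i" makes the tag names case-insensitive.
    $cleaned = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    if ($cleaned === null) {
        $cleaned = $html;
    }
    return strip_tags($cleaned);
}

$sample = '<p>Hi</p><style>p { background-position: 0 }</style>'
        . '<SCRIPT>if (x > -1){ }</SCRIPT><b>there</b>';
print visible_text($sample) . "\n";
```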
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971468
I know, I'm a noob :) Thanks for sticking with me through it.
-rich
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 500 total points
ID: 34971555
I've just tried my original sample on Yahoo and it seems fine on CSS. It is probably that the sample you posted processes the whole HTML document whereas mine only processes the HTML between the <BODY></BODY> tags as that is the visible content.

As per your comments on cURL - I would always use cURL for pulling web content because file_get_contents can be tripped up by dozens of things.

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971845
More times than not we will be parsing through html files on disk, but some will be through http as well.
So it'd be nice to have the option to use both, by whatever method(s) are best for them.
-rich
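
One way to support both sources is a small wrapper that picks the method from the shape of the argument. This is only a sketch: fetch_source() is a hypothetical name, the 30-second timeout is an arbitrary illustrative value, and the HTTP branch assumes the cURL extension is installed.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): return the HTML
// from either an http(s) URL or a file on disk, or false on failure.
function fetch_source($source)
{
    if (preg_match('#^https?://#i', $source)) {
        // Remote page: fetch it with cURL.
        $ch = curl_init($source);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // illustrative timeout, in seconds
        return curl_exec($ch);                          // false if the request fails
    }

    // Local file: read it directly, but only if it exists and is readable.
    return is_readable($source) ? file_get_contents($source) : false;
}
```

The result can then go through the same strip, lowercase and count steps regardless of where the page came from.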
 
LVL 38

Author Closing Comment

by:Rich Rumble
ID: 35894246
This was good enough, not perfect but I have enough data.