Solved

Grab webpage, strip tags, count and unique values

Posted on 2011-02-24
Last Modified: 2012-05-11
I'm trying to grab a webpage, strip the HTML down to just the text content, lowercase everything, remove excess spaces, and then display each unique string along with a count of how many times it appeared.
The code below grabs the page and strips the tags, but doesn't do the counting or trimming yet.
1 Grab the page
2 Strip the tags
3 Trim excess whitespace (?)
4 Count how many times each string appears
5 Print the unique strings, each followed by its count, separated by a comma
I don't know PHP all that well, so I'd appreciate it if someone could write it for me.
Thanks!
-rich
<?php
$homepage = file_get_contents('http://www.example.com/');
$strip = strtolower(strip_tags($homepage));
var_dump(array_unique(explode(' ',$strip)));
?>

Question by:Rich Rumble
11 Comments
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34970685
This should do most of what you want

<?php

$homepage = file_get_contents('http://www.example.com/');

// Keep only the contents of <body>. The ^ and $ anchors make the pattern
// consume everything outside the body, and the i flag matches <BODY> too.
//
$homepage = preg_replace('#^.*?<body[^>]*>(.*)</body>.*$#is', '$1', $homepage );


$strip = strtolower(strip_tags($homepage));

// Collapse every run of whitespace (spaces, tabs, newlines) into a single
// space, then trim the ends so explode() yields no empty strings
//
$strip = trim(preg_replace('#\s+#', ' ', $strip ));


var_dump(array_unique(explode(' ',$strip)));
?>
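
The var_dump above lists the unique strings but not how often each one occurred. Here is a minimal sketch of steps 4 and 5 from the question, assuming the text has already been lowercased and stripped as above. The function name count_words and the sample string are illustrative and not from the thread; array_count_values() is a PHP built-in.

```php
<?php
// Hypothetical helper: count how often each whitespace-separated string occurs.
function count_words(string $text): array
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings that stray spaces would produce.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts);   // most frequent first
    return $counts;
}

// Sample input standing in for the stripped page text.
$counts = count_words('the cat and the hat and the bat');

// One "string,count" pair per line, as asked for in step 5.
foreach ($counts as $word => $count) {
    echo $word . ',' . $count . "\r\n";
}
```

Swapping arsort() for ksort() would list the strings alphabetically instead of by frequency.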

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971069
It's very close. Someone sent me the code below, which is closer still: it works on simple pages, but not on a page like yahoo.com. This will be an intranet tool for our organization, and we have complex pages like Yahoo's all over, and those seem to break the script below too.
I used var_dump because I didn't know how to print the array properly (still questionable)...
-rich
<?php
header("Content-Type: text/plain");
$file = file('http://example.com');
$words = array();
foreach($file as $line) {
  $line = trim(strip_tags(trim($line)));
  $line = explode(" ", trim($line));
  foreach ($line as $word) {
    if (!in_array($word, array_keys($words))) {
	  $words[$word] = 1;
	} else {
	  $words[$word]++;
	};
  };
};
asort($words);
foreach ($words as $key => $value) {
  print "String: " . $key . "\tCount: " . $value . "\r\n";
};
?>

 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971122
"....it works on simple pages, but a page like yahoo.com for example it doesn't...."

Why? How does it break?
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971183
Notice: Undefined index: 1024) in C:\wamp\www\file.php on line 12
Notice: Undefined index: -1){ in C:\wamp\www\file.php on line 12
Undefined index: background-position: in C:\wamp\www\file.php on line 12
etc... perhaps the page isn't fully loaded when it begins to parse? He used file() as opposed to file_get_contents()... not sure if we should use cURL instead or what..
-rich
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971333
Yours works "better" on such pages, so perhaps it's strip_tags() that isn't working as well as it should...
By better I mean it doesn't error out, but it's a var_dump as opposed to a count and string printed with \r\n
:)
-rich
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971383
Ahh, it's inline CSS that's doing it. Weird, a limitation of strip_tags() I suppose.
-rich
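
One possible explanation for the notices quoted earlier, offered as an assumption rather than a confirmed diagnosis: in_array() compares loosely by default, and PHP stores an array key such as "1024" as the integer 1024. On PHP versions before 8, a token like "1024)" then compares equal to 1024, so the loop takes the $words[$word]++ branch for a key that was never set. strip_tags() removes the style and script tags but keeps the text between them, which supplies plenty of such tokens. A sketch of the counting loop using isset() on the key instead (the function name tally and the sample tokens are made up for illustration):

```php
<?php
// Hypothetical replacement for the in_array()/array_keys() test in the loop.
function tally(array $tokens): array
{
    $words = array();
    foreach ($tokens as $word) {
        if ($word === '') {
            continue;                // skip empty strings left by repeated spaces
        }
        if (isset($words[$word])) {  // exact key lookup, no loose comparison
            $words[$word]++;
        } else {
            $words[$word] = 1;
        }
    }
    return $words;
}

// Tokens modelled on the ones in the notices above.
$words = tally(array('1024', '1024)', '0', 'background-position:', '1024)'));
print_r($words);
```

isset() is also a constant-time lookup, whereas in_array(array_keys(...)) rescans every key for each word.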
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971433
I could probably come up with a regex that would strip inline CSS. As for the var_dump I just used what was in your sample...
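
A minimal sketch of what such a regex might look like, assuming the unwanted CSS sits in <style> blocks (and, similarly, JavaScript in <script> blocks). The function name strip_code_blocks and the sample markup are illustrative only and were not posted in the thread:

```php
<?php
// Hypothetical helper: drop <style> and <script> elements, contents included,
// so that strip_tags() does not leave CSS or JavaScript text behind.
function strip_code_blocks(string $html): string
{
    // \1 must match the tag name that opened the block.
    // i = case-insensitive, s = dot also matches newlines.
    $clean = preg_replace('#<(style|script)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
    return $clean ?? $html;   // fall back to the input if the regex engine fails
}

$html = "<body><style>p { background-position: 0 -1px; }</style>"
      . "<p>Hello <b>World</b></p><script>if (x > 1024) { y(); }</script></body>";

$text = strtolower(trim(strip_tags(strip_code_blocks($html))));
echo $text;
```

Inline style="..." attributes need no extra handling, since strip_tags() discards them together with the tag they belong to.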
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971468
I know, I'm a noob :) Thanks for sticking with me through it.
-rich
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 1500 total points
ID: 34971555
I've just tried my original sample on Yahoo and it seems fine on CSS. It is probably that the sample you posted processes the whole HTML document whereas mine only processes the HTML between the <BODY></BODY> tags as that is the visible content.

As per your comments on cURL - I would always use cURL for pulling web content because file_get_contents can be tripped up by dozens of things.
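
As a rough sketch of the cURL route, assuming the PHP cURL extension is installed: the function name fetch_url and the 10-second timeout are arbitrary choices for illustration, not something posted in the thread.

```php
<?php
// Hypothetical helper: fetch a URL with cURL, returning the body or null on failure.
function fetch_url(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // follow redirects
        CURLOPT_CONNECTTIMEOUT => $timeout,  // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout,  // seconds allowed for the whole transfer
    ));
    $body  = curl_exec($ch);
    $error = curl_error($ch);

    if ($body === false) {
        error_log('fetch_url failed: ' . $error);
        return null;
    }
    return $body;
}

$homepage = fetch_url('http://www.example.com/');
if ($homepage !== null) {
    echo strlen($homepage) . " bytes fetched\n";
}
```

Unlike a bare file_get_contents() call, this makes timeouts and redirects explicit and reports why a request failed.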

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971845
More often than not we will be parsing HTML files on disk, but some will come over HTTP as well.
So it'd be nice to have the option to use both, by whatever method(s) are best for each.
-rich
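
One way to support both sources might look like the sketch below. The function name load_html and its $fetch parameter are hypothetical; $fetch defaults to file_get_contents and could be swapped for a cURL-based fetcher.

```php
<?php
// Hypothetical helper: load HTML from a URL or from a file on disk.
// $fetch is the function used for URLs; it defaults to file_get_contents.
function load_html(string $source, ?callable $fetch = null): ?string
{
    if (preg_match('#^https?://#i', $source)) {
        $fetch = $fetch ?? 'file_get_contents';
        $html = $fetch($source);
    } elseif (is_file($source) && is_readable($source)) {
        $html = file_get_contents($source);
    } else {
        return null;               // neither a URL nor a readable file
    }
    return is_string($html) ? $html : null;
}

// Usage with a temporary file standing in for an HTML file on disk.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, '<p>Hello</p>');
$fromDisk = load_html($path);
unlink($path);
echo $fromDisk;
```

Keeping the source check in one place means the stripping and counting code never needs to know where the HTML came from.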
 
LVL 38

Author Closing Comment

by:Rich Rumble
ID: 35894246
This was good enough, not perfect but I have enough data.