Solved

Grab webpage, strip tags, count and unique values

Posted on 2011-02-24
Last Modified: 2012-05-11
I'm trying to grab a webpage, strip the HTML down to just the contents, lowercase everything, remove excess spaces, count how many times each string appears, and display the unique strings with their counts.
The code below grabs the page and strips the tags, but doesn't do the counting or trimming yet.
1 Grab page
2 strip tags
3 trim??
4 count how many times each string appears
5 print the unique strings with the previous count next to them separated by a comma.
I don't know PHP all that well, so I'll need it to be written for me.
Thanks!
-rich
<?php
$homepage = file_get_contents('http://www.example.com/');
$strip = strtolower(strip_tags($homepage));
var_dump(array_unique(explode(' ',$strip)));
?>

Question by:Rich Rumble
11 Comments
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34970685
This should do most of what you want

<?php

$homepage = file_get_contents('http://www.example.com/');

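// Pick out only the body. The pattern is anchored and case-insensitive so that everything
// before <body> and after </body> is dropped, even when nothing follows </body>.
$homepage = preg_replace('#^.*?<body[^>]*>(.*)</body>.*$#si', '$1', $homepage );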


$strip = strtolower(strip_tags($homepage));

// Replace white or multiple spaces with a single space
//
$strip = preg_replace('#(\s{2,})#s', ' ', $strip );

// Get rid of newlines or tabs
//
$strip = str_replace( array("\n", "\t"), ' ', $strip );


var_dump(array_unique(explode(' ',$strip)));
?>
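
The snippet above still ends in var_dump, so steps 4 and 5 of the question (count each string, then print string and count separated by a comma) are not covered yet. A minimal sketch of those two steps follows, assuming the text has already been stripped of tags. The helper name count_words() and the sample string are made up for illustration; array_count_values() and arsort() are standard PHP functions.

```php
<?php
// Hypothetical helper (not from the thread): count each word in a block of text.
function count_words($text)
{
    // Lowercase, then split on any run of whitespace.
    // PREG_SPLIT_NO_EMPTY drops the empty strings that explode(' ') would leave behind.
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // array_count_values() returns an array of word => number of occurrences.
    $counts = array_count_values($words);

    // Sort by count, highest first, keeping each word attached to its count.
    arsort($counts);
    return $counts;
}

// Sample input standing in for the stripped page text.
$strip = "The cat  and the\n dog and THE bird";

foreach (count_words($strip) as $word => $count) {
    echo $word, ',', $count, "\r\n";
}
```

Splitting on \s+ also makes the separate "collapse spaces" and "replace newlines and tabs" passes unnecessary for counting purposes, since any mix of whitespace is treated as one separator.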

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971069
It's very close. Someone sent me the code below, which is closer still: it works on simple pages, but not on a page like yahoo.com. This will be an intranet tool for our organization, but we have complex pages like Yahoo's all over as well, and those also seem to break the script below.
I used var_dump because I didn't know how to print the array properly (still questionable)...
-rich
<?php
header("Content-Type: text/plain");
$file = file('http://example.com');
$words = array();
foreach($file as $line) {
  $line = trim(strip_tags(trim($line)));
  $line = explode(" ", trim($line));
  foreach ($line as $word) {
    if (!in_array($word, array_keys($words))) {
      $words[$word] = 1;
    } else {
      $words[$word]++;
    };
  };
};
asort($words);
foreach ($words as $key => $value) {
  print "String: " . $key . "\tCount: " . $value . "\r\n";
};
?>

 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971122
"....it works on simple pages, but a page like yahoo.com for example it doesn't...."

Why? How does it break?
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971183
Notice: Undefined index: 1024) in C:\wamp\www\file.php on line 12
Notice: Undefined index: -1){ in C:\wamp\www\file.php on line 12
Undefined index: background-position: in C:\wamp\www\file.php on line 12
etc... Perhaps the page isn't fully loaded when it begins to parse? He used file() as opposed to file_get_contents()... not sure if we should use cURL instead or what.
-rich
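
One possible explanation for these notices, offered as an assumption since the thread never pins it down: PHP stores array keys such as "0" or "1024" as integers, and in_array() compares loosely unless its third argument is true. On PHP versions before 8, a loose comparison converts a string like "1024)" or "background-position:" to a number first, so it can compare equal to an existing integer key even though that exact word was never stored. The script then takes the else branch and increments an index that does not exist. The sketch below checks the key itself with isset(), which avoids the comparison entirely. The function name tally_words() and the sample lines are hypothetical.

```php
<?php
// Hypothetical helper: tally words by testing the array key directly with isset(),
// instead of searching array_keys() with a loose in_array().
function tally_words(array $lines)
{
    $words = array();
    foreach ($lines as $line) {
        // Strip tags, then split on whitespace and discard empty tokens.
        $tokens = preg_split('/\s+/', trim(strip_tags($line)), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $word) {
            if (isset($words[$word])) {
                $words[$word]++;
            } else {
                $words[$word] = 1;
            }
        }
    }
    asort($words); // lowest count first, as in the script above
    return $words;
}

// Sample lines containing the kind of tokens named in the notices.
$sample = array("<p>0 1024 1024) -1){</p>", "<b>background-position: 0</b>");

foreach (tally_words($sample) as $key => $value) {
    print "String: " . $key . "\tCount: " . $value . "\r\n";
}
```

If that explanation is right, inline CSS and JavaScript are not the root cause; they simply supply many tokens that begin with digits or sit alongside a bare "0".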
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971333
Yours works "better" on such pages, so perhaps it's strip_tags() that isn't working as well as it should...
By "better" I mean it doesn't error out, but it's a var_dump rather than a count and string printed with \r\n
:)
-rich
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971383
Ahh, it's inline CSS that's doing it. Weird; a limitation of strip_tags(), I suppose.
-rich
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971433
I could probably come up with a regex that would strip inline CSS. As for the var_dump I just used what was in your sample...
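
A sketch of what such a regex might look like, assuming "inline CSS" here means style (and script) elements embedded in the page: strip_tags() removes the tags themselves but leaves the text between them, so the CSS rules and JavaScript end up in the word list. The helper name strip_code_blocks() and the sample markup are made up for illustration. style="..." attributes need no special handling because they disappear along with their tag.

```php
<?php
// Hypothetical helper: drop <script> and <style> elements, contents included,
// before strip_tags() removes the remaining markup.
function strip_code_blocks($html)
{
    // \1 requires the closing tag to match the opening tag name.
    // "i" ignores case, "s" lets "." span newlines, ".*?" stops at the first closing tag.
    return preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
}

// Sample markup with one style block and one script block.
$html = '<body><style type="text/css">p { background-position: 0 0; }</style>'
      . '<p>Hello</p><script>if (x > -1){ y(1024); }</script><p>World</p></body>';

// Remove code blocks, strip the remaining tags, then collapse whitespace.
$text = trim(preg_replace('/\s+/', ' ', strip_tags(strip_code_blocks($html))));
echo $text, "\n"; // prints "Hello World"
```

A regex like this is only an approximation of HTML parsing; for pages with unusual markup, PHP's DOM extension would be the more robust route.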
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971468
I know, I'm a noob :) Thanks for sticking with me through it.
-rich
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 500 total points
ID: 34971555
I've just tried my original sample on Yahoo and it seems fine on CSS. It is probably that the sample you posted processes the whole HTML document whereas mine only processes the HTML between the <BODY></BODY> tags as that is the visible content.

As for your comment on cURL: I would always use cURL for pulling web content, because file_get_contents can be tripped up by dozens of things.

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971845
More often than not we will be parsing HTML files on disk, but some will come over HTTP as well.
So it'd be nice to have the option to use both, by whatever method(s) are best for each.
-rich
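
One way to support both sources, sketched under the assumption that anything not starting with http:// or https:// is a path on disk and that the cURL extension is installed: local files go through file_get_contents(), remote pages through cURL as suggested above. The function name load_html() and the 30-second timeout are illustrative choices, not something from the thread.

```php
<?php
// Hypothetical helper: return the raw HTML for a local path or an http(s) URL.
function load_html($source)
{
    // Anything that is not an http(s) URL is treated as a path on disk.
    if (!preg_match('#^https?://#i', $source)) {
        $html = is_readable($source) ? file_get_contents($source) : false;
        if ($html === false) {
            throw new RuntimeException("Cannot read file: $source");
        }
        return $html;
    }

    $ch = curl_init($source);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // seconds; an arbitrary choice
    $html  = curl_exec($ch);
    $error = curl_error($ch);

    if ($html === false) {
        throw new RuntimeException("cURL failed for $source: $error");
    }
    return $html;
}

// Usage with a temporary file standing in for an HTML file on disk.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($tmp, '<p>Hello</p>');
echo strip_tags(load_html($tmp)), "\n";
unlink($tmp);
```

The returned string can then be passed to the same strip-and-count steps regardless of where it came from.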
 
LVL 38

Author Closing Comment

by:Rich Rumble
ID: 35894246
This was good enough, not perfect but I have enough data.