Solved

Grab webpage, strip tags, count and unique values

Posted on 2011-02-24
342 Views
Last Modified: 2012-05-11
I'm trying to grab a webpage, strip the HTML down to just its text content, lowercase everything, remove excess spaces, count how many times each string appears, and then display the unique strings with their counts...
The code below grabs the page and strips the tags, but doesn't do the counting or trimming yet.
1 Grab page
2 Strip tags
3 Trim??
4 Count how many times each string appears
5 Print the unique strings with the count next to them, separated by a comma
I don't know PHP all that well, so I'll need it written for me.
Thanks!
-rich
<?php
$homepage = file_get_contents('http://www.example.com/');
$strip = strtolower(strip_tags($homepage));
var_dump(array_unique(explode(' ',$strip)));
?>
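For what it's worth, steps 3-5 can apparently be handled in one pass with `array_count_values()`; a rough sketch, run here on an inline HTML string rather than a fetched page so the counting is easy to check (`countWords` is just a made-up name):

```php
<?php
// Tally how often each word appears in a chunk of HTML.
function countWords($html) {
    $text = strtolower(strip_tags($html));                // strip tags, lowercase
    // Split on any run of whitespace, dropping empty pieces (trim + split).
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);                    // word => occurrence count
}

$counts = countWords('<p>Hello <b>world</b> hello</p>');
arsort($counts);                                          // most frequent first
foreach ($counts as $word => $n) {
    echo $word . ', ' . $n . "\n";                        // "string, count"
}
// prints:
// hello, 2
// world, 1
```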

Question by:Rich Rumble
11 Comments
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34970685
This should do most of what you want

<?php

$homepage = file_get_contents('http://www.example.com/');

// Pick out only the body content
//
$homepage = preg_replace('#.+?<body[^>]*>(.*)</body>.+#s', '$1', $homepage );


$strip = strtolower(strip_tags($homepage));

// Replace white or multiple spaces with a single space
//
$strip = preg_replace('#(\s{2,})#s', ' ', $strip );

// Get rid of newlines or tabs
//
$strip = str_replace( array("\n", "\t"), ' ', $strip );


var_dump(array_unique(explode(' ',$strip)));
?>

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971069
It's very close. Someone sent me the code below, which is closer still; it works on simple pages, but not on a page like yahoo.com, for example. This will be an intranet tool for our organization, but we have complex pages like Yahoo's all over as well, and those also seem to break the script below:
I used var_dump because I didn't know how to print the array properly (still questionable)...
-rich
<?php
header("Content-Type: text/plain");
$file = file('http://example.com');
$words = array();
foreach ($file as $line) {
  $line = trim(strip_tags(trim($line)));
  $line = explode(" ", trim($line));
  foreach ($line as $word) {
    if (!in_array($word, array_keys($words))) {
      $words[$word] = 1;
    } else {
      $words[$word]++;
    }
  }
}
asort($words);
foreach ($words as $key => $value) {
  print "String: " . $key . "\tCount: " . $value . "\r\n";
}
?>

 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971122
"....it works on simple pages, but a page like yahoo.com for example it doesn't...."

Why? How does it break?
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971183
Notice: Undefined index: 1024) in C:\wamp\www\file.php on line 12
Notice: Undefined index: -1){ in C:\wamp\www\file.php on line 12
Notice: Undefined index: background-position: in C:\wamp\www\file.php on line 12
etc... Perhaps the page isn't fully loaded when it begins to parse? He used file() as opposed to file_get_contents()... not sure if we should use cURL instead or what.
-rich
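The notices likely come from the `in_array($word, array_keys($words))` check: PHP stores purely numeric string keys as integers, and `in_array()` compares loosely by default, so a word like `1024)` can loosely match an integer key `1024`. The code then takes the `++` branch for an index that was never set. A sketch of the usual fix, looking the key up directly with `isset()` (sample input made up to reproduce the `1024)` case):

```php
<?php
$text = 'foo 1024) foo 1024';   // sample tokens, including the one from the notice

$words = array();
foreach (preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if (isset($words[$word])) { // exact key lookup, no loose in_array() comparison
        $words[$word]++;
    } else {
        $words[$word] = 1;
    }
}

// $words is now: 'foo' => 2, '1024)' => 1, 1024 => 1 -- no notices
```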
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971333
Yours works "better" on such pages, so perhaps it's the strip_tags that isn't working as well as it should...
By better I mean it's not erroring out, but it's a var_dump as opposed to a string and count printed with \r\n.
:)
-rich
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971383
Ahh, it's the inline CSS that's doing it. Weird; a limitation of strip_tags(), I suppose.
-rich
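This is consistent with how strip_tags() behaves: it removes the `<style>` tags themselves but keeps the CSS rules between them, which is likely where the stray `background-position:` tokens come from (same story for `<script>` contents). One common workaround is to delete those elements, contents and all, before stripping; a small sketch on an inline string:

```php
<?php
$html = '<style>body { background-position: 0 0; }</style>'
      . '<p>visible text</p>'
      . '<script>var x = 1;</script>';

// Remove <script>/<style> elements together with their contents,
// then strip the remaining tags as before.
$html = preg_replace('#<(script|style)[^>]*>.*?</\1>#si', ' ', $html);
$text = trim(strip_tags($html));

echo $text;   // prints: visible text
```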
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34971433
I could probably come up with a regex that would strip inline CSS. As for the var_dump I just used what was in your sample...
 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971468
I know, I'm a noob :) Thanks for sticking with me through it.
-rich
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 500 total points
ID: 34971555
I've just tried my original sample on Yahoo and it seems fine on CSS. It is probably because the sample you posted processes the whole HTML document, whereas mine only processes the HTML between the <BODY></BODY> tags, as that is the visible content.

As per your comments on cURL - I would always use cURL for pulling web content, because file_get_contents can be tripped up by dozens of things.
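A minimal cURL fetch along those lines might look something like this (the timeout and user-agent values are illustrative, not tuned):

```php
<?php
// Fetch a page body with cURL instead of file_get_contents().
function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);              // give up after 10 seconds
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // some sites reject a blank agent
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? '' : $body;                // empty string on failure
}

// Usage: $html = fetchPage('http://www.example.com/');
```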

 
LVL 38

Author Comment

by:Rich Rumble
ID: 34971845
More often than not we will be parsing HTML files on disk, but some will come through HTTP as well.
So it'd be nice to have the option to use both, by whatever method(s) are best for them.
-rich
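Something like this little loader might cover both cases - cURL for http/https URLs, plain file_get_contents() for paths on disk (just a sketch; loadHtml is a made-up name):

```php
<?php
// Load HTML either from a URL (via cURL) or from a file on disk.
function loadHtml($source) {
    if (preg_match('#^https?://#i', $source)) {
        // Looks like a URL: fetch it with cURL.
        $ch = curl_init($source);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $body = curl_exec($ch);
        curl_close($ch);
        return $body === false ? '' : $body;
    }
    // Otherwise treat it as a local path.
    return is_readable($source) ? file_get_contents($source) : '';
}
```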
 
LVL 38

Author Closing Comment

by:Rich Rumble
ID: 35894246
This was good enough; not perfect, but I have enough data.