Grab webpage, strip tags, count and unique values

I'm trying to grab a webpage, strip the HTML down to just its contents, lowercase everything, remove excess spaces, and display each unique string along with a count of how many times it appeared...
The code below grabs the page and strips the tags, but doesn't do the counting or trimming yet.
1. Grab the page
2. Strip tags
3. Trim??
4. Count how many times each string appears
5. Print the unique strings with the count next to them, separated by a comma.
I don't know PHP all that well, so I'll need it written for me.
Thanks!
-rich
<?php
$homepage = file_get_contents('http://www.example.com/');
$strip = strtolower(strip_tags($homepage));
var_dump(array_unique(explode(' ',$strip)));
?>
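The counting step the question asks for (steps 4 and 5) can be sketched with PHP's array_count_values(); the helper name here is ours, not from the thread:

```php
<?php
// Hypothetical helper for steps 2-5: strip tags, lowercase, collapse
// whitespace, then count each unique string.
function word_counts($html) {
    $strip = strtolower(strip_tags($html));
    // Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    $strip = preg_replace('#\s+#', ' ', trim($strip));
    // array_count_values() returns value => number-of-occurrences.
    return array_count_values(explode(' ', $strip));
}

// Step 1 (grab the page) would stay as in the question, e.g.
// word_counts(file_get_contents('http://www.example.com/'));
foreach (word_counts('<p>Foo  BAR</p> <p>foo</p>') as $string => $count) {
    echo $string . ', ' . $count . "\n";   // foo, 2  then  bar, 1
}
```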

Asked by Rich Rumble
 
Beverley Portlock commented:
This should do most of what you want

<?php

$homepage = file_get_contents('http://www.example.com/');

// Pick out only the <body> content
//
$homepage = preg_replace('#^.*?<body[^>]*>(.*)</body>.*$#s', '$1', $homepage );


$strip = strtolower(strip_tags($homepage));

// Collapse runs of two or more whitespace characters into a single space
//
$strip = preg_replace('#(\s{2,})#s', ' ', $strip );

// Get rid of newlines or tabs
//
$strip = str_replace( array("\n", "\t"), ' ', $strip );


var_dump(array_unique(explode(' ',$strip)));
?>

 
Rich Rumble (Security Samurai, Author) commented:
It's very close. Someone sent me the code below, which is closer still; it works on simple pages, but not on a page like yahoo.com. This will be an intranet tool for our organization, but we have complex pages like Yahoo's all over as well, and those also seem to break the script below.
I used var_dump because I didn't know how to print the array properly (still questionable)...
-rich
<?php
header("Content-Type: text/plain");
$file = file('http://example.com');
$words = array();
foreach($file as $line) {
  $line = trim(strip_tags(trim($line)));
  $line = explode(" ", trim($line));
  foreach ($line as $word) {
    if (!in_array($word, array_keys($words))) {
      $words[$word] = 1;
    } else {
      $words[$word]++;
    }
  }
}
asort($words);
foreach ($words as $key => $value) {
  print "String: " . $key . "\tCount: " . $value . "\r\n";
}
?>

 
Beverley Portlock commented:
"....it works on simple pages, but a page like yahoo.com for example it doesn't...."

Why? How does it break?
 
Rich Rumble (Security Samurai, Author) commented:
Notice: Undefined index: 1024) in C:\wamp\www\file.php on line 12
Notice: Undefined index: -1){ in C:\wamp\www\file.php on line 12
Notice: Undefined index: background-position: in C:\wamp\www\file.php on line 12
etc... Perhaps the page isn't fully loaded when it begins to parse? He used file() as opposed to file_get_contents()... not sure if we should use cURL instead or what.
-rich
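One likely cause (an assumption on our part; the thread doesn't confirm it): file() returns the page one line at a time, so a tag that opens on one line and closes on the next looks like plain text to strip_tags(), and its insides leak through:

```php
<?php
// A tag split across two lines, the way file() would return it:
$lines = array("<p style='color:\n", "red'>hi</p>\n");
foreach ($lines as $line) {
    // strip_tags() on each fragment can't see the whole tag, so the
    // second half of the attribute leaks into the output as "text".
    echo strip_tags($line);
}
// The same markup as one string is handled cleanly:
echo strip_tags("<p style='color:\nred'>hi</p>\n");
```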
 
Rich Rumble (Security Samurai, Author) commented:
Yours works "better" on such pages, so perhaps it's strip_tags that isn't working as well as it should...
By "better" I mean it isn't erroring out, but it's a var_dump as opposed to a string-and-count printout with \r\n.
:)
-rich
 
Rich Rumble (Security Samurai, Author) commented:
Ahh, it's the inline CSS that's doing it. Weird; a limitation of strip_tags(), I suppose.
-rich
 
Beverley Portlock commented:
I could probably come up with a regex that would strip inline CSS. As for the var_dump, I just used what was in your sample...
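A sketch of such a regex (our own, not Beverley's): strip_tags() removes the <style> and <script> tags themselves but keeps their contents, so removing those whole blocks first keeps CSS and JavaScript out of the word list:

```php
<?php
// Hypothetical pre-pass to run before strip_tags(): drop whole
// <style> and <script> blocks, contents included.
function strip_css_js($html) {
    return preg_replace('#<(style|script)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
}

$html = "<style>body { color: red; }</style><p>hello</p>";
echo trim(strip_tags(strip_css_js($html)));   // hello
```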
 
Rich Rumble (Security Samurai, Author) commented:
I know, I'm a noob :) Thanks for sticking with me through it.
-rich
 
Beverley Portlock commented:
I've just tried my original sample on Yahoo and it seems fine on the CSS. It's probably that the sample you posted processes the whole HTML document, whereas mine only processes the HTML between the <BODY></BODY> tags, as that is the visible content.

As for your comments on cURL - I would always use cURL for pulling web content, because file_get_contents can be tripped up by dozens of things.
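A minimal cURL fetch might look like this (a sketch under our own assumptions; the option choices such as the timeout are illustrative, not from the thread):

```php
<?php
// Hypothetical wrapper around cURL. Returns the page body as a string,
// or false on failure (DNS error, timeout, HTTP-level problems, etc.).
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // give up after 10 seconds
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// $homepage = fetch_page('http://www.example.com/');
```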

 
Rich Rumble (Security Samurai, Author) commented:
More often than not we will be parsing HTML files on disk, but some will come over HTTP as well.
So it'd be nice to have the option to use both, by whichever method(s) work best.
-rich
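For what it's worth, file_get_contents() already reads both local paths and http:// URLs, so one entry point could cover both cases, handing URLs to cURL when the extension is available. A hypothetical sketch (the function name is ours):

```php
<?php
// Hypothetical loader for the intranet tool: URLs go through cURL when
// the extension is loaded; local HTML files go through file_get_contents().
function load_source($source) {
    if (preg_match('#^https?://#i', $source) && function_exists('curl_init')) {
        $ch = curl_init($source);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
    return file_get_contents($source);   // local path on disk
}
```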
 
Rich Rumble (Security Samurai, Author) commented:
This was good enough; not perfect, but I have enough data.