  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 333

Complete information Gatherer from Site

I want to create an information gatherer that retrieves parts of a page from a partner site. The idea is to retrieve a complete text from the site, apply formatting to the generated output (background, font, font size), and create a zip file from the result that is kept on the server for one hour only.

I'm having problems with the retrieval part of the script. For this example I'll use the www.fanfiction.net site, since my client has a confidentiality agreement. Imagine that I want to retrieve the contents of a particular story, which in this case is at a URL like http://www.fanfiction.net/s/451545/1/ where the first part is the main site, the second part (s) indicates that it is a story, the third part is the story number, and the last number is the chapter.

What I need is to create a single file or several files from this site, capturing the text of the story itself and ignoring the parts that are there only for the ads and for site management. I'm completely blocked on this; if someone could help me with at least the retrieval part, it would help immensely.
Asked by: doRodrigo
1 Solution
 
joshbenner commented:
This is a method called scraping. If possible, I'd recommend you try to get your partner to make the data available to you via a formatted web service such as XML-RPC, SOAP, REST/XML, etc.

If you must scrape, then regular expressions are a long-standing approach to this problem. Regular expressions aren't perfect, since they can't always perfectly parse really complex content. However, if the HTML you are scraping always has the content you want in a div with a specific ID (and there are no DIV tags within it), then regular expressions can do the trick.

Not having the actual data you'll be working with, here is an example of scraping the content out of a fanfiction.net story:

/<div[^>]*\bid="storytext"[^>]*>(.*?)<\/div>/s

The captured group should contain the story text. The s modifier lets the dot match line breaks, which story text will almost certainly contain. This can be repeated for any data segment you need to pull out.
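
As a rough sketch of how such a pattern could be applied with preg_match(): the sample markup below is made up for illustration, and the pattern uses [^>]* so attribute matching stays inside one tag, plus the s modifier so the dot also matches line breaks. It still assumes there are no nested div tags inside the story container.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = "<div id=\"header\">Ads and navigation</div>\n"
      . "<div class=\"story\" id=\"storytext\">\nOnce upon a time...\n</div>\n"
      . "<div id=\"footer\">Site links</div>";

// [^>]* keeps each attribute match inside one tag; s lets . span line breaks.
$pattern = '/<div[^>]*\bid="storytext"[^>]*>(.*?)<\/div>/s';

$storyText = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // $matches[1] holds the first captured group: the container's inner content.
    $storyText = trim($matches[1]);
}

echo $storyText === null ? "No story container found\n" : $storyText . "\n";
```

If the real page nests div tags inside the container, the non-greedy group will stop at the first closing tag, which is one reason a DOM parser can be the safer choice.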

An alternate approach is to load the HTML from the source URI and parse it using something like SimpleXML. This is a rather involved process that I couldn't cover in its entirety here. Here is a snippet of how things would start with SimpleXML:
$doc = new DOMDocument(); // create the DOM document
$doc->strictErrorChecking = false;
$doc->loadHTML($html); // $html contains HTML pulled from page
$xml = simplexml_import_dom($doc); // wrap the parsed document for SimpleXML access
 
Now you can use SimpleXML methods found on the $xml instance to retrieve your data.
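
To make that last step a little more concrete, here is a minimal sketch of pulling one container out through SimpleXML's xpath() method. The sample markup and the storytext id are assumptions for illustration; the real partner page will need its own expression.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<html><body><div id="menu">Site navigation</div>'
      . "<div id=\"storytext\"><p>Chapter one.</p>\n<p>It begins.</p></div>"
      . '</body></html>';

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xml = simplexml_import_dom($doc);

// xpath() returns an array of matching SimpleXMLElement objects.
$matches = $xml->xpath('//div[@id="storytext"]');

$storyText = '';
if (!empty($matches)) {
    // asXML() serialises the element with its markup; strip_tags() leaves the text.
    $storyText = trim(strip_tags($matches[0]->asXML()));
}

echo $storyText, "\n";
```

Unlike the regular expression, this keeps working if the container holds nested div tags.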

 
doRodrigo (Author) commented:
Thanks, joshbenner, but sorry, you will need to be a little more specific. How do I retrieve the content from the URL? I know less than nothing about XML; could you give me an example? Let's say that I want to retrieve the content of the URL http://www.fanfiction.net/s/3693693/1/. How would I go about it? Thanks, and sorry for the inconvenience.
 
joshbenner commented:
See: http://us2.php.net/file_get_contents
<?php
$html = file_get_contents('http://www.fanfiction.net/s/3693693/1/');
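
Note that file_get_contents() only accepts URLs when allow_url_fopen is enabled, and it returns false on failure. A slightly more defensive sketch follows; fetchPage() is a hypothetical helper name, and the timeout and user agent values are assumptions rather than anything the remote site requires.

```php
<?php
/**
 * Fetch a URL (or local path) and return its body, or null on failure.
 * fetchPage() is a hypothetical helper, not a built-in PHP function.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // These options apply to http:// URLs and are ignored for local files.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleFetcher/1.0', // assumed value
        ],
    ]);

    // @ suppresses the PHP warning so the failure can be handled explicitly.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage (left commented out so the sketch makes no network call by itself):
// $html = fetchPage('http://www.fanfiction.net/s/3693693/1/');
// if ($html === null) { echo "Could not retrieve the page\n"; }
```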

 
doRodrigo (Author) commented:
Dear joshbenner when I try this I get the following error:

Call to undefined method stdClass::loadHTML()...

What should I do after the "$xml = simplexml_import_dom($doc);" in my code? Could I echo the variable $xml and get the content of the text that I want?

Thanks.
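
For what it's worth, that error generally means loadHTML() was called on a variable that never held a DOMDocument, for instance when the object is created under one name ($dom) and the later calls use another ($doc). As for echoing $xml: casting a SimpleXMLElement to a string gives only that element's own direct text, so echoing the document root will not print the story. A minimal sketch, using made-up sample markup and an assumed storytext id, that keeps one variable name throughout and prints the container's markup with asXML():

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<html><body><div id="storytext"><p>First paragraph.</p></div></body></html>';

libxml_use_internal_errors(true);  // keep parser warnings out of the output
$doc = new DOMDocument();          // the same name is used on every line below
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
libxml_clear_errors();

$xml = simplexml_import_dom($doc);

// Select the element you want first, then serialise just that element.
$divs = $xml->xpath('//div[@id="storytext"]');

$markup = '';
if (!empty($divs)) {
    $markup = $divs[0]->asXML();
}

echo $markup, "\n";
```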

 
NerdsOfTech (Technology Scientist) commented:
Here is a basic raw scrape, including line-break removal:
<?php

$url = "http://www.theremotesitehere.com/page.html";

// Download the page as a single string
$raw = file_get_contents($url);

// Sequences to strip: tab, newline, carriage return, double space, NUL, vertical tab
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

// Decode HTML entities, then remove every sequence listed above
$content = str_replace($newlines, "", html_entity_decode($raw));
?>
 
doRodrigo (Author) commented:
Hi NerdsOfTech and joshbenner, I'm having the following problem after I retrieve and echo the contents successfully:
Original text in page:

Well isn't that a sight for sore eyes.

What is being echoed:

â¬SWell isn't that a sight for soreeyes.⬝

Do any of you have a solution for that? Thanks in advance.
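
The stray characters look like UTF-8 punctuation (probably curly quotes) being displayed in a single-byte character set, and the joined "soreeyes" is consistent with a line break being replaced by an empty string rather than a space. A hedged sketch of one way around both, assuming the remote page is served as UTF-8; cleanText() is a hypothetical helper name:

```php
<?php
// Assumption: the remote page is encoded as UTF-8.
function cleanText(string $raw): string
{
    // Decode entities explicitly as UTF-8 so multi-byte characters survive.
    $decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

    // Collapse each run of whitespace into one space instead of deleting it.
    $collapsed = preg_replace('/\s+/u', ' ', $decoded);

    // preg_replace() returns null on invalid UTF-8; fall back to the decoded text.
    return trim($collapsed ?? $decoded);
}

// Tell the browser the output is UTF-8 before anything is echoed.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$sample = "\u{201C}Well isn't that a sight for sore\neyes.\u{201D}";
echo cleanText($sample), "\n";
```

If the page turns out to use a different encoding, converting it first (for example with mb_convert_encoding()) would be the step to add.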
 
NerdsOfTech (Technology Scientist) commented:
Try outputting just $raw.

If that causes any issues, uncomment the commented-out lines:
<?php
 
$url = "http://www.theremotesitehere.com/page.html";
 
$raw = file_get_contents($url);
 
// $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
// $content = str_replace($newlines, "", html_entity_decode($raw));
// $content = strip_tags($content);
// echo $content;
 
echo $raw;
 
?>
