  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 327

Complete information Gatherer from Site

I want to create an information gatherer that retrieves parts of a page from a partner site. The idea is to retrieve a complete text from the site, format the generated background, font, and font size, and create a zip file from the result that is kept on the server for one hour only.

I'm having problems with the retrieval part of the script. For this example I'll use the www.fanfiction.net site, since my client has a confidentiality agreement. Imagine that I want to retrieve the contents of a particular story, which in this case is at a URL like http://www.fanfiction.net/s/451545/1/ where the first part is the main site, the second part indicates that it is a story, the third part is the story number, and the last number is the chapter of the story.

What I need is to create a single file or several files from this site, capturing the text of the story itself and ignoring the parts that are there only for ads and for site management. I'm completely blocked on this. If someone could help me with at least the retrieval part, it would help immensely.
Asked:
doRodrigo
1 Solution
 
joshbennerCommented:
This is a method called scraping. If possible, I'd recommend you try to get your partner to make the data available to you via a formatted web service such as XML-RPC, SOAP, REST/XML, etc.

If you must scrape, then regular expressions are a long-standing approach to this problem. Regular expressions aren't perfect, since they can't always perfectly parse really complex content. However, if the HTML you are scraping always has the content you want in a div with a specific ID (and there are no DIV tags within it), then regular expressions can do the trick.

Not having the actual data you'll be working with, here is an example of scraping the content out of a fanfic.net story:

/<div[^>]*id="storytext"[^>]*>(.*?)<\/div>/s

The captured group should contain the story text. This can be repeated for any data segment you need to pull out.
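
As a rough illustration of the regular expression approach, the sketch below runs a pattern of this kind with preg_match(). The sample markup is invented for the example and is not real site output. The `s` modifier is there because story text normally spans several lines, and `[^>]*` keeps the attribute match inside a single opening tag.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical, not real site output).
$html = <<<HTML
<html><body>
<div id="menu">Site navigation</div>
<div class="story" id="storytext">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
</body></html>
HTML;

// [^>]* keeps the match inside one opening tag; the "s" modifier lets "." span newlines.
$pattern = '/<div[^>]*\bid="storytext"[^>]*>(.*?)<\/div>/s';

$storyHtml = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // $matches[1] is the first captured group: everything between the tags.
    $storyHtml = trim($matches[1]);
}

echo $storyHtml === null ? "No story block found\n" : $storyHtml . "\n";
```

As noted above, this only holds up when the target div contains no nested div tags, because the lazy match stops at the first closing tag it meets.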

An alternate approach is to load the HTML from the source URI and parse it using something like SimpleXML. This is a rather involved process that I couldn't cover in its entirety here. Here is a snippet of how things would start with SimpleXML:
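libxml_use_internal_errors(true); // collect parser warnings from imperfect HTML instead of printing them
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html); // $html contains HTML pulled from page
$xml = simplexml_import_dom($doc); // SimpleXMLElement wrapping the document root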
 
Now you can use SimpleXML methods found on the $xml instance to retrieve your data.
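
One way the SimpleXML step might continue is sketched here, assuming the story sits in a div with the id "storytext" as in the regular expression example. The sample markup and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup; a real page would come from file_get_contents().
$html = '<html><body><div id="ads">Buy now</div>'
      . '<div id="storytext"><p>Once upon a time.</p><p>The end.</p></div></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xml = simplexml_import_dom($doc);

// xpath() returns an array of SimpleXMLElement objects (empty if nothing matched).
$nodes = $xml->xpath('//div[@id="storytext"]');

$storyText = '';
if (!empty($nodes)) {
    // Casting the element to string would skip text inside child tags,
    // so serialize it and strip the markup instead.
    $storyText = trim(strip_tags($nodes[0]->asXML()));
}

echo $storyText, "\n";
```

Unlike the regular expression, the XPath query is not confused by nested div tags inside the story container.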

 
doRodrigoAuthor Commented:
Thanks joshbenner, but sorry, you will need to be a little more specific. How do I retrieve the content from the URL? I know less than nothing about XML, so could you give me an example? Let's say that I want to retrieve the content of the URL http://www.fanfiction.net/s/3693693/1/. How would I go about it? Thanks, and sorry for the inconvenience.
 
joshbennerCommented:
See: http://us2.php.net/file_get_contents
<?php
$html = file_get_contents('http://www.fanfiction.net/s/3693693/1/');
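
A slightly more defensive version of the same call might look like the sketch below. The helper name fetchPage(), the timeout, and the user agent string are all assumptions made for this example. To keep the sketch runnable offline it reads a temporary local file; the same function accepts an http:// URL when allow_url_fopen is enabled.

```php
<?php
/**
 * Fetch a URL (or local path) and return its body, or null on failure.
 * fetchPage() is a hypothetical helper name, not part of PHP.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // The "http" context options are ignored for local files.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleGatherer/0.1', // placeholder value
        ],
    ]);

    // file_get_contents() returns false on failure and emits a warning;
    // the @ operator silences that warning so the caller can handle null.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Offline demonstration with a temporary file standing in for the remote page.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<div id="storytext">Hello</div>');

$html = fetchPage($tmp);
echo $html === null ? "Fetch failed\n" : strlen($html) . " bytes fetched\n";

unlink($tmp);
```

In real use the argument would be the story URL, for example fetchPage('http://www.fanfiction.net/s/3693693/1/').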

 
doRodrigoAuthor Commented:
Dear joshbenner when I try this I get the following error:

Call to undefined method stdClass::loadHTML()...

What should I do after the "$xml = simplexml_import_dom($doc);" in my code? Could I echo the variable $xml and get the content of the text that I want?

Thanks.
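
An error of this kind typically appears when loadHTML() is called on a variable that never received a DOMDocument, for instance when the object is stored in `$dom` but the method is called on `$doc`. As for echoing `$xml`: that casts the root element to a string, which generally does not produce the story text. A minimal sketch of one alternative, using DOMXPath directly and assuming the story is in a div with the id "storytext", could be:

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><div id="storytext"><p>Chapter one.</p></div></body></html>';

libxml_use_internal_errors(true);      // real-world HTML is rarely valid
$doc = new DOMDocument();              // the same variable is used on every line
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="storytext"]')->item(0);

// textContent concatenates the text of the element and all its descendants.
$storyText = $node !== null ? trim($node->textContent) : '';

echo $storyText, "\n";
```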


 
NerdsOfTechCommented:
Here is a basic scrape, including removal of line breaks and extra whitespace:
<?php
 
$url = "http://www.theremotesitehere.com/page.html";
 
$raw = file_get_contents($url);
 
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
 
$content = str_replace($newlines, "", html_entity_decode($raw));
?>
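
One caveat with deleting whitespace outright: if a line break is the only thing separating two words, the words get glued together. A hedged alternative is to collapse each run of whitespace into a single space, as sketched below with an invented fragment.

```php
<?php
// Hypothetical fragment with a line break between two words.
$raw = "<p>a sight for sore\neyes.</p>\r\n\t<p>Next   line.</p>";

// Deleting whitespace outright glues words together ("soreeyes").
$deleted = str_replace(["\t", "\n", "\r"], '', $raw);

// Replacing each run of whitespace with one space keeps word boundaries.
$collapsed = trim(preg_replace('/\s+/', ' ', $raw));

echo strip_tags($deleted), "\n";
echo strip_tags($collapsed), "\n";
```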

 
doRodrigoAuthor Commented:
Hi NerdsOfTech and joshbenner, I'm having the following problem after I retrieve and echo the contents successfully:
Original text in page:

Well isn't that a sight for sore eyes.

What is being echoed:

â¬SWell isn't that a sight for soreeyes.⬝

Do any of you have a solution for that? Thanks in advance.
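
Output like this is commonly a sign that multi-byte UTF-8 characters (here probably typographic quotation marks) are being displayed or decoded as a single-byte encoding. Assuming the source page is UTF-8, one thing to try is naming the charset explicitly, both when decoding entities and when sending the page to the browser. The sample string is invented for the example.

```php
<?php
// Assumption: the fetched page is UTF-8 and the story uses curly quotes.
$raw = "&ldquo;Well isn&#39;t that a sight for sore eyes.&rdquo;";

// Name the target charset explicitly; PHP versions before 5.4 defaulted to ISO-8859-1.
$text = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// Tell the browser which charset the bytes are in, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo $text, "\n";
```

The missing space in "soreeyes" is a separate issue and most likely comes from the whitespace removal step rather than from the encoding.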
 
NerdsOfTechCommented:
Try outputting just $raw.

If this causes any issues, uncomment the commented-out lines:
<?php
 
$url = "http://www.theremotesitehere.com/page.html";
 
$raw = file_get_contents($url);
 
// $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
// $content = str_replace($newlines, "", html_entity_decode($raw));
// $content = strip_tags($content);
// echo $content;
 
echo $raw;
 
?>
