Solved

Complete information Gatherer from Site

Posted on 2009-04-13
319 Views
Last Modified: 2013-12-12
I want to create an information gatherer that retrieves parts of a page from a partner site. The idea is to retrieve the full text from the site, format the generated background, font, and font size, and create a zip file from the result that is kept on the server for one hour only. I'm having problems with the retrieving part of the script.

For this example I'll use the www.fanfiction.net site, since my client has a confidentiality agreement. Imagine that I want to retrieve the contents of a particular story at a URL like http://www.fanfiction.net/s/451545/1/, where the first part is the main site, the second part indicates that it is a story, the third part is the number of the story, and the last number is the chapter of the story.

What I need is to create a single file (or several files) from this site, capturing the text of the story itself and ignoring the parts that are only there for the ads and for the management of the site. I'm having a complete block with it; if someone could help me with at least the retrieving part, it would help immensely.
Question by:doRodrigo
8 Comments
 
LVL 1

Expert Comment

by:joshbenner
ID: 24133249
This is a method called scraping. If possible, I'd recommend you try to get your partner to make the data available to you via a formatted web service such as XML-RPC, SOAP, REST/XML, etc.

If you must scrape, then regular expressions are a long-standing approach to this problem. Regular expressions aren't perfect, since they can't reliably parse deeply nested or irregular HTML. However, if the HTML you are scraping always has the content you want in a div with a specific ID (and there are no nested DIV tags within it), then regular expressions can do the trick.

Not having the actual data you'll be working with, here is an example of scraping the content out of a fanfic.net story:

/<div[^>]*id="storytext"[^>]*>(.*?)<\/div>/s

The captured group should contain the story text. This can be repeated for any data segment you need to pull out.
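A minimal sketch of that extraction in PHP. The sample HTML and the "storytext" id are assumptions based on fanfiction.net's markup; adjust the id to match the page you actually scrape:

```php
<?php
// Sample HTML standing in for a fetched page (an assumption for illustration).
$html = '<html><body><div id="adblock">ad</div>'
      . '<div class="story" id="storytext"><p>Once upon a time...</p></div>'
      . '</body></html>';

// The /s modifier lets "." match newlines, so multi-line story text works too.
if (preg_match('/<div[^>]*id="storytext"[^>]*>(.*?)<\/div>/s', $html, $m)) {
    $story = $m[1];   // the captured group: everything inside the div
} else {
    $story = null;    // pattern not found -- the markup may have changed
}

echo $story;
?>
```

Note that this only works while the story div contains no nested div; otherwise the lazy `(.*?)` stops at the first inner `</div>`.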

An alternate approach is to load the HTML from the source URI and parse it using something like SimpleXML. This is a rather involved process that I couldn't cover in its entirety here. Here is a snippet of how things would start with SimpleXML:

$doc = new DOMDocument();

$doc->strictErrorChecking = false;

$doc->loadHTML($html); // $html contains HTML pulled from page

$xml = simplexml_import_dom($doc);

Now you can use SimpleXML methods found on the $xml instance to retrieve your data.
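For instance, a sketch of the DOM + SimpleXML route on a small inline document (the HTML and the "storytext" id are assumptions; with a real page you would first fetch $html over HTTP):

```php
<?php
// Inline sample standing in for a fetched page.
$html = '<html><body><div id="storytext"><p>Once upon a time...</p></div></body></html>';

$doc = new DOMDocument();
$doc->strictErrorChecking = false;
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; suppress warnings
$doc->loadHTML($html);

$xml = simplexml_import_dom($doc);

// XPath finds the div by id regardless of where it sits in the tree.
$nodes = $xml->xpath('//div[@id="storytext"]');
$story = (string) $nodes[0]->p;     // text of the <p> child
?>
```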


Author Comment

by:doRodrigo
ID: 24133345
Thanks joshbenner, but sorry, you will need to be a little more specific. How do I retrieve the content from the URL? Sorry, but I know less than nothing about XML; could you give me an example? Let's say that I want to retrieve the content of the URL http://www.fanfiction.net/s/3693693/1/. How would I go about it? Thanks, and sorry for the inconvenience.
LVL 1

Expert Comment

by:joshbenner
ID: 24133395
See: http://us2.php.net/file_get_contents
<?php

$html = file_get_contents('http://www.fanfiction.net/s/3693693/1/');

Open in new window
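In practice it can help to wrap the fetch in a small helper with a timeout and an explicit User-Agent, since some sites reject PHP's default one. This is a sketch; the helper name and the User-Agent string are just examples:

```php
<?php
// Hypothetical helper around file_get_contents() with basic error handling.
function fetch_page($url) {
    $context = stream_context_create(array(
        'http' => array(
            'user_agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0)', // example UA
            'timeout'    => 10,                                        // seconds
        ),
    ));
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        return null;   // fetch failed: bad URL, timeout, or blocked request
    }
    return $html;
}

// Usage (network access required):
// $html = fetch_page('http://www.fanfiction.net/s/3693693/1/');
?>
```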


Author Comment

by:doRodrigo
ID: 24133725
Dear joshbenner when I try this I get the following error:

Call to undefined method stdClass::loadHTML()...

What should I do after the "$xml = simplexml_import_dom($doc);" in my code? Could I echo the variable $xml and get the content of the text that I want?

Thanks.


 
LVL 19

Expert Comment

by:NerdsOfTech
ID: 24135875
Here is a basic scrape, including page-break and extra-whitespace removal:

<?php
$url = "http://www.theremotesitehere.com/page.html";

$raw = file_get_contents($url);

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

$content = str_replace($newlines, "", html_entity_decode($raw));
?>


Author Comment

by:doRodrigo
ID: 24136652
Hi NerdsOfTech and joshbenner, I'm having the following problem after I retrieve and echo the contents successfully:
Original text in page:

Well isn't that a sight for sore eyes.

What is being echoed:

â¬SWell isn't that a sight for soreeyes.⬝

Do any of you have a solution for that? Thanks in advance.
LVL 19

Accepted Solution

by:
NerdsOfTech earned 500 total points
ID: 24142579
Try outputting just $raw.

If that causes any issues, uncomment these lines:
<?php
$url = "http://www.theremotesitehere.com/page.html";

$raw = file_get_contents($url);

// $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
// $content = str_replace($newlines, "", html_entity_decode($raw));
// $content = strip_tags($content);
// echo $content;

echo $raw;
?>
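The garbled quotes also have a likely explanation: fanfiction.net serves UTF-8, but html_entity_decode() defaulted to ISO-8859-1 in older PHP versions, which mangles characters like curly quotes. A sketch of the fix, assuming that is the cause (the sample entity-encoded string is an illustration):

```php
<?php
// Entity-encoded curly quotes, as a page might contain them.
$raw = "&#8220;Well isn't that a sight for sore eyes.&#8221;";

// Decode entities as UTF-8 instead of the old ISO-8859-1 default.
$text = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// Make sure the browser also interprets your output page as UTF-8.
header('Content-Type: text/html; charset=UTF-8');

echo $text;
?>
```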

