Solved

Complete information Gatherer from Site

Posted on 2009-04-13
Medium Priority
326 Views
Last Modified: 2013-12-12
I want to create an information gatherer that retrieves parts of a page from a partner site. The idea is to retrieve the complete text from the site, keep the generated background, font and font-size formatting, and build a zip file from the result that is kept on the server for one hour only. I'm having some problems with the retrieving part of the script.

For the purposes of this example I'll use the www.fanfiction.net site, since my client has a confidentiality agreement. Imagine that I want to retrieve the contents of a particular story, which in this case is at a URL like http://www.fanfiction.net/s/451545/1/ where the first part is the main site, the second part indicates that it is a story, the third part is the number of the story, and the last number is the chapter of the story.

What I need is to create a single file, or several files, from this site, capturing the text of the story itself and ignoring the parts that are only there for the ads and for the management of the site. I'm having a complete block with it; if someone could help me at least with the retrieving part, it would help immensely.
Question by:doRodrigo
8 Comments
 
LVL 1

Expert Comment

by:joshbenner
ID: 24133249
This is a method called scraping. If possible, I'd recommend you try to get your partner to make the data available to you via a formatted web service such as XML-RPC, SOAP, REST/XML, etc.

If you must scrape, then regular expressions are a long-standing approach to this problem. They aren't perfect, since they can't reliably parse really complex markup. However, if the HTML you are scraping always has the content you want in a div with a specific ID (and there are no nested DIV tags inside it), then a regular expression can do the trick.

Since I don't have the actual data you'll be working with, here is an example of scraping the content out of a fanfiction.net story:

/<div.*?id="storytext".*?>(.*?)<\/div>/

The captured group should contain the story text. This can be repeated for any data segment you need to pull out.
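
For concreteness, here is a minimal sketch of how that regex could be applied in PHP, assuming the page has already been fetched into $html and that the story really does sit in a single <div id="storytext"> with no nested DIVs; the /s modifier is added so the dot matches across line breaks:

<?php
// A sketch only: $html is assumed to hold the fetched page markup.
if (preg_match('/<div.*?id="storytext".*?>(.*?)<\/div>/s', $html, $matches)) {
    $storyText = $matches[1]; // the captured group: the story markup
} else {
    $storyText = '';          // pattern not found; the page layout may differ
}
?>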

An alternate approach is to load the HTML from the source URI and parse it using something like SimpleXML. This is a rather involved process that I couldn't cover in its entirety here. Here is a snippet of how things would start with SimpleXML:
$doc = new DOMDocument();
libxml_use_internal_errors(true);  // keep libxml from emitting warnings about imperfect real-world HTML
$doc->strictErrorChecking = false;
$doc->loadHTML($html);             // $html contains HTML pulled from the page
$xml = simplexml_import_dom($doc);
 
Now you can use SimpleXML methods found on the $xml instance to retrieve your data.
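
For example, one way to pull out the story div is an XPath query on that SimpleXML object (a sketch, assuming the content lives in <div id="storytext">; adjust the query to the real markup):

// Query the imported document for the story container.
$nodes = $xml->xpath('//div[@id="storytext"]');
if (!empty($nodes)) {
    $storyText = $nodes[0]->asXML(); // the div and its contents, as markup
}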

 

Author Comment

by:doRodrigo
ID: 24133345
Thanks, joshbenner, but sorry, you will need to be a little more specific. How do I retrieve the content from the URL? Sorry, but I know less than nothing about XML; could you give me an example? Let's say that I want to retrieve the content of the URL http://www.fanfiction.net/s/3693693/1/. How would I go about it? Thanks, and sorry for the inconvenience.
 
LVL 1

Expert Comment

by:joshbenner
ID: 24133395
See: http://us2.php.net/file_get_contents
<?php
$html = file_get_contents('http://www.fanfiction.net/s/3693693/1/');
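
If file_get_contents() returns false because allow_url_fopen is disabled on the server, cURL is the usual fallback. A minimal sketch, using the same example URL:

<?php
$ch = curl_init('http://www.fanfiction.net/s/3693693/1/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
$html = curl_exec($ch);
curl_close($ch);
?>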


Author Comment

by:doRodrigo
ID: 24133725
Dear joshbenner, when I try this I get the following error:

Call to undefined method stdClass::loadHTML()...

What should I do after the "$xml = simplexml_import_dom($doc);" in my code? Could I echo the variable $xml and get the content of the text that I want?

Thanks.


 
LVL 19

Expert Comment

by:NerdsOfTech
ID: 24135875
Here is a basic raw scrape, including removal of line breaks and extra whitespace:
<?php
 
$url = "http://www.theremotesitehere.com/page.html";
 
$raw = file_get_contents($url);
 
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B"); // tabs, newlines, double spaces and control characters to strip
 
$content = str_replace($newlines, "", html_entity_decode($raw)); // decode HTML entities, then strip the characters above
?>
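
Since the original goal was to keep the story's basic formatting while dropping everything else, one option (just a sketch; the allow-list below is only an example) is to pass an allow-list of tags to strip_tags():

// Keep paragraph breaks and simple emphasis, drop every other tag.
$story = strip_tags($content, '<p><br><b><i><em><strong>');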

 

Author Comment

by:doRodrigo
ID: 24136652
Hi NerdsOfTech and joshbenner, I'm having the following problem after I retrieve and echo the contents successfully:
Original text in page:

Well isn't that a sight for sore eyes.

What is being echoed:

â¬SWell isn't that a sight for soreeyes.⬝

Do any of you have a solution for that? Thanks in advance.
 
LVL 19

Accepted Solution

by:
NerdsOfTech earned 2000 total points
ID: 24142579
Try outputting just $raw.

If that causes any issues, uncomment the lines below:
<?php
 
$url = "http://www.theremotesitehere.com/page.html";
 
$raw = file_get_contents($url);
 
// $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
// $content = str_replace($newlines, "", html_entity_decode($raw));
// $content = strip_tags($content);
// echo $content;
 
echo $raw;
 
?>
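
A note on the garbled characters reported above: they typically appear when a UTF-8 page is run through html_entity_decode() without an explicit charset (older PHP versions default to ISO-8859-1), or when the output is served without declaring UTF-8. If echoing $raw alone still shows them, here is a sketch of a more encoding-aware version, assuming the remote page is UTF-8:

<?php
$url = "http://www.theremotesitehere.com/page.html";
$raw = file_get_contents($url);

header('Content-Type: text/html; charset=utf-8');         // tell the browser the output is UTF-8
$content = html_entity_decode($raw, ENT_QUOTES, 'UTF-8'); // decode entities using UTF-8, not the old default
echo $content;
?>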
