Solved

Scraping Test2

Posted on 2014-10-12
Last Modified: 2014-10-12
Hi folks, I need to retrieve the contents of a site that I intend to publish on a different domain, and I would like to test everything first to see how to go about it. What I need is to retrieve the contents of a single "div" by its "id" and, depending on the number of values in a "select", scrape the news published by the same author into a single output. So if there are several stories, as in the example below, it would retrieve all the content of the first story, then move on to the next article in the "select" and add the second article below the first one.

Any help would be much appreciated.

<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>

Question by:R Wolf

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
I can show you how to get the contents of the <div> and since there is only one <div> in the test data, this will be fairly easy.  The bigger problem goes to the nature of using the HTML <select> to find the other articles.  Where are they?  What was selected?  We cannot tell from the code snippet shown here.  And to be honest, we may never be able to tell because of the nature of HTTP protocols.  It's a complex subject, but this article may help you get started.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/A_11271-Understanding-Client-Server-Protocols-and-Web-Applications.html

Here's the dilemma as I see it: Your script can read this page from a server, but it will always get the same page.  It's easy to parse the HTML as you can see below.  Since your script is not a web browser it cannot readily process JavaScript, and since your script is not a human being, it may not be able to make selections and follow the requests from page to page.

My advice to anyone who wants to "scrape" web pages is to rethink the requirement.  If a web publisher is willing to make its data available to you, the publisher will provide an API that gives your programming stable, consistent, structured access to the data.  You may have to pay for the API (this is one way publishers make money today) but the costs are likely to be modest and the benefits are great.

<?php // demo/temp_r05.php
error_reporting(E_ALL);

// SEE: http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28535978.html

// TEST DATA FROM THE POST AT E-E, SIMULATING A WEB PAGE ACQUIRED VIA file_get_contents()
$htm = <<<EOD
<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>
EOD;

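// BREAK THE STRING APART BASED ON SIGNAL AND ENDING SUBSTRINGS
$sig = '<div id="div1">';
$end = '</div>';
$arr = explode($sig, $htm);

// IF THE SIGNAL STRING IS MISSING, explode() RETURNS ONE ELEMENT AND THERE IS NOTHING TO EXTRACT
if (!isset($arr[1])) die('SIGNAL STRING NOT FOUND');
$arr = explode($end, $arr[1]);

// SHOW THE WORK PRODUCT
echo htmlentities($arr[0]);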
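
One caveat with the substring approach above: splitting on the first '</div>' cuts the content short whenever the target div contains another div. A minimal sketch of a parser-based alternative follows, assuming PHP 7.1 or later with the DOM extension. The helper name div_text_by_id and the sample markup are made up for illustration and are not part of the original post.

```php
<?php
// Hypothetical helper (not from the thread): return the text inside the
// element with the given id, or null when no such element exists.
function div_text_by_id(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementById($id);
    return $node === null ? null : trim($node->textContent);
}

// Made-up sample with a nested div, the case that trips up explode().
$html = '<html><body><div id="div1">Intro <div class="inner">nested</div> tail</div></body></html>';
echo div_text_by_id($html, 'div1'), PHP_EOL;
```

Because textContent gathers the text of all descendants, the nested div's text is kept rather than truncated. Returning null for a missing id leaves it to the caller to decide how to handle a page that lacks the div.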


Accepted Solution

by:Beverley Portlock
Beverley Portlock earned 250 total points
I would use DOMDocument to tease the HTML apart. The code below picks out the option elements and (for demo purposes) puts their text in a list like so:

test 2
test 3
text 4
text 5

http://uk1.php.net/manual/en/class.domdocument.php

<?php


// Set up test data
//


$data = <<<DATA
<html>
<head>
     <title></title>
</head>
<body>
     <form action="story.php?storyNumber=storyNumber.value">
          <select id="storyNumber" name="storyNumber" title="Story Navigation">
              <option selected="" value="1"></option>
              <option value="2">test 2</option>
              <option value="3">test 3</option>
              <option value="4">text 4</option>
              <option value="5">text 5</option>
              <option value="6"></option>
              <option value="7"></option>
              <option value="8"></option>
              <option value="9"></option>
              <option value="10"></option>
              <option value="11"></option>
              <option value="12"></option>
              <option value="13"></option>
              <option value="14"></option>
              <option value="15"></option>
              <option value="16"></option>
              <option value="17"></option>
              <option value="18"></option>
              <option value="19"></option>
              <option value="20"></option>
          </select>
     </form>
     <div id="div1">
          ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
     </div>
</body>
</html>
DATA;


// ----- Process test data -----
//


// Create a DOMDocument object and load the test data into it
//
$doc = new DOMDocument();
$doc->loadHTML( $data );


// Locate and extract the SELECT using its ID attribute
//
$items = $doc->getElementById("storyNumber");
if ($items === null)
        die("SELECT not found");


// Pick the tags belonging to the SELECT apart to get the option text
//
$options = $items->getElementsByTagName("option");


// For demo purposes, print the text out, skipping options with no text
//
for ($i = 0; $i < $options->length; $i++) {
        $text = trim($options->item($i)->nodeValue);
        if ($text !== "")
                echo htmlentities($text) . "<br/>";
}
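
To pull the pieces of this thread together, here is a rough sketch of walking the options and stacking each story's div below the previous one. It rests on assumptions the thread does not confirm: that the form is submitted by GET (in which case the browser replaces the query string in the action with storyNumber=<value>), that every story page carries the same div id, and that http://example.com/story.php stands in for the real address. The function names and the $fetch callback are hypothetical.

```php
<?php
// Sketch only. Unconfirmed assumptions: the form submits by GET, every story
// page contains <div id="div1">, and $fetch returns the HTML for a URL.

// Parse HTML into a DOMDocument while keeping parser warnings quiet.
function load_dom(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Return the value attribute of every <option> inside the given <select>.
function option_values(string $html, string $selectId): array
{
    $select = load_dom($html)->getElementById($selectId);
    if ($select === null) {
        return [];
    }
    $values = [];
    foreach ($select->getElementsByTagName('option') as $option) {
        $values[] = $option->getAttribute('value');
    }
    return $values;
}

// Fetch each story in turn and join the text of its target div.
function collect_stories(string $baseUrl, string $firstPage, callable $fetch): string
{
    $stories = [];
    foreach (option_values($firstPage, 'storyNumber') as $number) {
        $url  = $baseUrl . '?' . http_build_query(['storyNumber' => $number]);
        $page = $fetch($url);
        if (!is_string($page) || $page === '') {
            continue; // request failed or returned nothing; skip this story
        }
        $div = load_dom($page)->getElementById('div1');
        if ($div !== null) {
            $stories[] = trim($div->textContent);
        }
    }
    return implode("\n\n", $stories);
}

// Stub fetcher standing in for the network; the URL is a placeholder.
$fake = function (string $url): string {
    return '<html><body><div id="div1">Story from ' . htmlspecialchars($url) . '</div></body></html>';
};
$first = '<html><body><select id="storyNumber"><option value="1"></option><option value="2"></option></select></body></html>';
echo collect_stories('http://example.com/story.php', $first, $fake), PHP_EOL;
```

Passing the fetcher in as a callback keeps the sketch runnable without a network connection. Against a live site, 'file_get_contents' could be passed as $fetch, provided the publisher permits that kind of access and the pages really are reachable by a plain GET request.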
