Solved

Scraping Test2

Posted on 2014-10-12
Last Modified: 2014-10-12
Hi folks, I need to retrieve the contents of a site that I intend to publish on a different domain, and I would like to test everything first to see how I would go about it. What I need is to retrieve the contents of a single "div" by its "id" and, depending on the number of values in a "select", scrape all the stories published by the same author into a single output. So if there are several stories, as in the example below, it would retrieve all the content of the first story, then move on to the next article in the "select" and add the second article below the first one.

Any help would be much appreciated.

<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>

Question by:Rodric MacOliver

Assisted Solution

by:Ray Paseur
ID: 40375542
I can show you how to get the contents of the <div>, and since there is only one <div> in the test data, this will be fairly easy.  The bigger problem is the use of the HTML <select> to find the other articles.  Where are they?  What was selected?  We cannot tell from the code snippet shown here.  And to be honest, we may never be able to tell because of the nature of the HTTP protocol.  It's a complex subject, but this article may help you get started.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/A_11271-Understanding-Client-Server-Protocols-and-Web-Applications.html

Here's the dilemma as I see it: Your script can read this page from a server, but it will always get the same page.  It's easy to parse the HTML as you can see below.  Since your script is not a web browser it cannot readily process JavaScript, and since your script is not a human being, it may not be able to make selections and follow the requests from page to page.

My advice to anyone who wants to "scrape" web pages is to rethink the requirement.  If a web publisher is willing to make its data available to you, the publisher will provide an API that gives your programming stable, consistent, structured access to the data.  You may have to pay for the API (this is one way publishers make money today) but the costs are likely to be modest and the benefits are great.

<?php // demo/temp_r05.php
error_reporting(E_ALL);

// SEE: http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28535978.html

// TEST DATA FROM THE POST AT E-E, SIMULATING A WEB PAGE ACQUIRED VIA file_get_contents()
$htm = <<<EOD
<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>
EOD;

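// BREAK THE STRING APART BASED ON SIGNAL AND ENDING SUBSTRINGS
$sig = '<div id="div1">';
$end = '</div>';
$arr = explode($sig, $htm);

// IF THE SIGNAL STRING IS MISSING THERE IS NOTHING TO EXTRACT
if (!isset($arr[1])) die('SIGNAL STRING NOT FOUND');
$arr = explode($end, $arr[1]);

// SHOW THE WORK PRODUCT, TRIMMING THE WHITESPACE AROUND THE TEXT
echo htmlentities(trim($arr[0]));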
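
The substring approach stops at the first closing </div> it finds, so it returns only part of the text if the target <div> ever contains a nested <div>. A DOM-based variant avoids that. The following is a minimal sketch, not a definitive implementation: extractById() is a hypothetical helper name, and the sample markup with the nested element is made up for illustration.

```php
<?php
// Sketch: extract the text of one element by its id using DOMDocument.
// extractById() is a hypothetical helper name, not part of any library.
function extractById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally; real-world HTML is rarely perfect
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementById() returns null when no element has that id
    $node = $doc->getElementById($id);
    return $node === null ? null : trim($node->textContent);
}

// Made-up sample page with a nested <div> inside the target element
$page = '<html><body><div id="div1"><div class="inner">Nested</div> story text</div></body></html>';

echo htmlentities((string) extractById($page, 'div1'));
```

Because textContent concatenates the text of all descendants, the nested element's text is included in the result rather than cutting the extraction short.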


Accepted Solution

by:Beverley Portlock
ID: 40375618
I would use DOMDocument to tease the HTML apart. The code below picks out the <option> elements of the <select> and (for demo purposes) prints their text as a list, like so (options with no text print as empty lines):

test 2
test 3
text 4
text 5

http://uk1.php.net/manual/en/class.domdocument.php

<?php


// Set up test data
//


$data = <<<DATA
<html>
<head>
     <title></title>
</head>
<body>
     <form action="story.php?storyNumber=storyNumber.value">
          <select id="storyNumber" name="storyNumber" title="Story Navigation">
              <option selected="" value="1"></option>
              <option value="2">test 2</option>
              <option value="3">test 3</option>
              <option value="4">text 4</option>
              <option value="5">text 5</option>
              <option value="6"></option>
              <option value="7"></option>
              <option value="8"></option>
              <option value="9"></option>
              <option value="10"></option>
              <option value="11"></option>
              <option value="12"></option>
              <option value="13"></option>
              <option value="14"></option>
              <option value="15"></option>
              <option value="16"></option>
              <option value="17"></option>
              <option value="18"></option>
              <option value="19"></option>
              <option value="20"></option>
          </select>
     </form>
     <div id="div1">
          ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
     </div>
</body>
</html>
DATA;


// ----- Process test data -----
//


// Create a DOMDocument object and load the test data into it
//
$doc = new DOMDocument();
$doc->loadHTML( $data );


// Locate and extract the SELECT using its ID attribute
//
$items = $doc->getElementById("storyNumber");


// Pick the tags belonging to the SELECT apart to get the option text
//
$options = $items->getElementsByTagName("option");


// For demo purposes, print the text out
//
for ($i = 0; $i < $options->length; $i++)
        echo $options->item($i)->nodeValue . "<br/>";
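
To get from the option list to the combined output the question asks for, the option values can drive one request per story. The following is only a sketch under stated assumptions: it assumes the form is submitted with GET, so that each story lives at a URL of the shape story.php?storyNumber=N, and that every story page carries its text in the same div with id "div1". The names collectStories(), $baseUrl and $fetch are hypothetical, and the demo uses a stand-in fetcher so it runs without a network.

```php
<?php
// Sketch only: collectStories(), $baseUrl and the $fetch callable are
// illustrative names, and the GET-style story URL is an assumption.
function collectStories(string $firstPage, string $baseUrl, callable $fetch): string
{
    libxml_use_internal_errors(true);   // keep imperfect markup from emitting warnings

    $doc = new DOMDocument();
    $doc->loadHTML($firstPage);
    $select = $doc->getElementById('storyNumber');
    if ($select === null) {
        return '';                      // no story navigation found, nothing to collect
    }

    $stories = [];
    foreach ($select->getElementsByTagName('option') as $option) {
        // Assumed URL shape: story.php?storyNumber=N
        $url  = $baseUrl . '?' . http_build_query(['storyNumber' => $option->getAttribute('value')]);
        $html = $fetch($url);
        if (!is_string($html) || $html === '') {
            continue;                   // skip pages that could not be fetched
        }

        $page = new DOMDocument();
        $page->loadHTML($html);
        $div = $page->getElementById('div1');
        if ($div !== null) {
            $stories[] = trim($div->textContent);
        }
    }

    // Stories appear one below the other, in the order of the options
    return implode("\n\n", $stories);
}

// Demo with a stand-in fetcher that fabricates a page per story number
$first = '<html><body><form><select id="storyNumber">'
       . '<option value="1"></option><option value="2"></option>'
       . '</select></form><div id="div1">Story 1</div></body></html>';

$fakeFetch = function (string $url): string {
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
    return '<html><body><div id="div1">Story ' . $query['storyNumber'] . '</div></body></html>';
};

echo collectStories($first, 'http://example.com/story.php', $fakeFetch);
```

Against a real site, 'file_get_contents' could be passed as $fetch (it accepts URLs when allow_url_fopen is enabled). Whether the target site actually answers plain GET requests like this, and whether its terms allow it, has to be checked first.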
