Scraping Test2

Hi folks, I need to retrieve the contents of a site that I intend to publish on a different domain, and I would like to test everything to see how I would go about doing it. What I need is to retrieve the contents of a single "div" by its "id" and, depending on the number of values in a "select", scrape the news published by the same author into a single output. So if there are several stories, as in the example below, it would retrieve all the content of the first story, then move on to the next article within the "select" and add the second article below the first one.

Any help would be much appreciated.

<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>


Rodric MacOliver (Researcher) asked:

Ray Paseur commented:
I can show you how to get the contents of the <div> and since there is only one <div> in the test data, this will be fairly easy.  The bigger problem goes to the nature of using the HTML <select> to find the other articles.  Where are they?  What was selected?  We cannot tell from the code snippet shown here.  And to be honest, we may never be able to tell because of the nature of HTTP protocols.  It's a complex subject, but this article may help you get started.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/A_11271-Understanding-Client-Server-Protocols-and-Web-Applications.html

Here's the dilemma as I see it: Your script can read this page from a server, but it will always get the same page.  It's easy to parse the HTML as you can see below.  Since your script is not a web browser it cannot readily process JavaScript, and since your script is not a human being, it may not be able to make selections and follow the requests from page to page.
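If the form turns out to be a plain GET form with no JavaScript involved, a browser submitting it would request story.php with the chosen value in the query string, and a script can build the same URL itself. This is only a minimal sketch under that assumption, which the snippet does not confirm; `storyUrl()` and the example.com address are hypothetical names used for illustration.

```php
<?php
// Hypothetical helper: build the URL a browser would request when the
// form is submitted via GET with a given <option> selected.
function storyUrl(string $baseUrl, int $storyNumber): string
{
    // http_build_query() URL-encodes the name/value pair for us
    return $baseUrl . '?' . http_build_query(['storyNumber' => $storyNumber]);
}

// ASSUMPTION: http://example.com/story.php stands in for the real page
foreach ([1, 2, 3] as $n) {
    echo storyUrl('http://example.com/story.php', $n), PHP_EOL;
}
```

Whether the server actually returns a different story for each URL can only be checked against the real site.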

My advice to anyone who wants to "scrape" web pages is to rethink the requirement.  If a web publisher is willing to make its data available to you, the publisher will usually provide an API that gives your programs stable, consistent, structured access to the data.  You may have to pay for the API (this is one way publishers make money today), but the costs are likely to be modest and the benefits are great.

<?php // demo/temp_r05.php
error_reporting(E_ALL);

// SEE: http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28535978.html

// TEST DATA FROM THE POST AT E-E, SIMULATING A WEB PAGE ACQUIRED VIA file_get_contents()
$htm = <<<EOD
<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>
EOD;

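// BREAK THE STRING APART BASED ON SIGNAL AND ENDING SUBSTRINGS
$sig = '<div id="div1">';
$end = '</div>';
$arr = explode($sig, $htm);

// IF THE SIGNAL STRING IS MISSING, explode() RETURNS ONLY ONE ELEMENT
if (!isset($arr[1])) die('SIGNAL STRING NOT FOUND');
$arr = explode($end, $arr[1]);

// SHOW THE WORK PRODUCT
echo htmlentities($arr[0]);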

Beverley Portlock commented:
I would use DOMDocument to tease the HTML apart. The code below picks out the <option> elements of the <select> and (for demo purposes) prints their text as a list, like so:

test 2
test 3
text 4
text 5

http://uk1.php.net/manual/en/class.domdocument.php

<?php


// Set up test data
//


$data = <<<DATA
<html>
<head>
     <title></title>
</head>
<body>
     <form action="story.php?storyNumber=storyNumber.value">
          <select id="storyNumber" name="storyNumber" title="Story Navigation">
              <option selected="" value="1"></option>
              <option value="2">test 2</option>
              <option value="3">test 3</option>
              <option value="4">text 4</option>
              <option value="5">text 5</option>
              <option value="6"></option>
              <option value="7"></option>
              <option value="8"></option>
              <option value="9"></option>
              <option value="10"></option>
              <option value="11"></option>
              <option value="12"></option>
              <option value="13"></option>
              <option value="14"></option>
              <option value="15"></option>
              <option value="16"></option>
              <option value="17"></option>
              <option value="18"></option>
              <option value="19"></option>
              <option value="20"></option>
          </select>
     </form>
     <div id="div1">
          ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
     </div>
</body>
</html>
DATA;


// ----- Process test data -----
//


// Create a DOMDocument object and load the test data into it
//
$doc = new DOMDocument();
$doc->loadHTML( $data );


// Locate and extract the SELECT using its ID attribute
// (getElementById returns null if no element has that ID)
//
$items = $doc->getElementById("storyNumber");
if ($items === null)
        die("SELECT not found");


// Pick the tags belonging to the SELECT apart to get the option text
//
$options = $items->getElementsByTagName("option");


// For demo purposes, print the text out
//
for ($i = 0; $i < $options->length; $i++)
        echo $options->item($i)->nodeValue . "<br/>";
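
Putting the pieces together, here is a rough sketch of the loop the question asks for: read the option values from the first page, request each story, and append the text of the div to a single output. It assumes each story is reachable at story.php?storyNumber=N, which the snippet does not confirm. The function names, the example.com URL, and the in-memory `$pages` array standing in for the network are all made up for illustration.

```php
<?php
// Parse an HTML string, hiding the parser warnings that imperfect markup triggers.
function parseHtml(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Return the trimmed text of the element with the given id, or null if absent.
function extractTextById(string $html, string $id): ?string
{
    $doc  = parseHtml($html);
    $node = $doc->getElementById($id);
    return $node === null ? null : trim($node->textContent);
}

// Return the value attribute of every <option> inside the given <select>.
function extractOptionValues(string $html, string $selectId): array
{
    $doc    = parseHtml($html);
    $select = $doc->getElementById($selectId);
    if ($select === null) {
        return [];
    }
    $values = [];
    foreach ($select->getElementsByTagName('option') as $option) {
        $values[] = $option->getAttribute('value');
    }
    return $values;
}

// Fetch every story listed in the <select> and join the div texts, first story first.
// $fetch is any callable that takes a URL and returns the page HTML as a string.
function collectStories(string $baseUrl, callable $fetch): string
{
    $stories = [];
    foreach (extractOptionValues($fetch($baseUrl), 'storyNumber') as $value) {
        // ASSUMPTION: the server selects the story from this query parameter
        $html = $fetch($baseUrl . '?' . http_build_query(['storyNumber' => $value]));
        $text = extractTextById($html, 'div1');
        if ($text !== null) {
            $stories[] = $text;
        }
    }
    return implode("\n\n", $stories);
}

// Demo with canned pages instead of real HTTP requests
$pages = [
    'http://example.com/story.php' =>
        '<html><body><select id="storyNumber" name="storyNumber">'
        . '<option value="1"></option><option value="2"></option></select>'
        . '<div id="div1">First story</div></body></html>',
    'http://example.com/story.php?storyNumber=1' =>
        '<html><body><div id="div1">First story</div></body></html>',
    'http://example.com/story.php?storyNumber=2' =>
        '<html><body><div id="div1">Second story</div></body></html>',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url];
};
echo collectStories('http://example.com/story.php', $fetch), PHP_EOL;
```

Against a live site, `$fetch` could be swapped for a closure around file_get_contents() or cURL, with error handling added for failed requests. Passing the fetcher in as a callable keeps the parsing logic testable without touching the network.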
