Solved

Scraping Test2

Posted on 2014-10-12
Last Modified: 2014-10-12
Hi folks, I need to retrieve the contents of a site that I intend to publish on a different domain, and I would like to test everything to see how I would go about doing it. What I need is to retrieve the contents of a single "div" by its "id" and, depending on the number of values in a "select", scrape the news published by the same author into a single output. So if there are several stories, as in the example below, it would retrieve all the content of the first story, then move on to the next article within the "select" and add the second article below the first one.

Any help would be much appreciated.

<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>

Question by:Rodric MacOliver
2 Comments
 

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 40375542
I can show you how to get the contents of the <div>, and since there is only one <div> in the test data, this will be fairly easy.  The bigger problem is the use of the HTML <select> to find the other articles.  Where are they?  What was selected?  We cannot tell from the code snippet shown here.  And to be honest, we may never be able to tell because of the nature of the HTTP protocol.  It's a complex subject, but this article may help you get started.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/A_11271-Understanding-Client-Server-Protocols-and-Web-Applications.html

Here's the dilemma as I see it: Your script can read this page from a server, but it will always get the same page.  It's easy to parse the HTML as you can see below.  Since your script is not a web browser it cannot readily process JavaScript, and since your script is not a human being, it may not be able to make selections and follow the requests from page to page.

My advice to anyone who wants to "scrape" web pages is to rethink the requirement.  If a web publisher is willing to make its data available to you, the publisher will provide an API that gives your programming stable, consistent, structured access to the data.  You may have to pay for the API (this is one way publishers make money today) but the costs are likely to be modest and the benefits are great.

<?php // demo/temp_r05.php
error_reporting(E_ALL);

// SEE: http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28535978.html

// TEST DATA FROM THE POST AT E-E, SIMULATING A WEB PAGE ACQUIRED VIA file_get_contents()
$htm = <<<EOD
<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>
EOD;

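// BREAK THE STRING APART BASED ON SIGNAL AND ENDING SUBSTRINGS
// NOTE: THIS STOPS AT THE FIRST </div>, SO IT ASSUMES NO NESTED DIVS
$sig = '<div id="div1">';
$end = '</div>';
$arr = explode($sig, $htm);
if (!isset($arr[1])) trigger_error('SIGNAL STRING NOT FOUND', E_USER_ERROR);
$arr = explode($end, $arr[1]);

// SHOW THE WORK PRODUCT, TRIMMED OF SURROUNDING WHITESPACE
echo htmlentities(trim($arr[0]));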
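
The explode() approach above stops at the first closing </div>, so it would cut the text short if the target <div> ever contained nested <div> elements. As a hedged alternative, here is a minimal sketch that looks the element up by its id with DOMDocument instead. The helper name extractById() and the sample markup are made up for illustration; they are not part of the original post.

```php
<?php
// Sketch only: extract the text of one element by its id using DOMDocument.
// extractById() is a hypothetical helper name used for illustration.
function extractById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them;
    // real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementById() returns null when no element carries that id
    $node = $doc->getElementById($id);
    return $node === null ? null : trim($node->textContent);
}

// Illustrative markup with a nested div, which a plain explode() would truncate
$htm = '<html><body><div id="div1"><div>Inner</div> outer text</div></body></html>';
echo htmlentities((string) extractById($htm, 'div1'));
```

Returning null for a missing id lets the caller decide how to handle a page whose layout has changed, which is a common failure mode for scrapers.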


Accepted Solution

by:Beverley Portlock
Beverley Portlock earned 250 total points
ID: 40375618
I would use DOMDocument to tease the HTML apart. The code below picks out the elements and (for demo purposes) puts them in a list like so

test 2
test 3
text 4
text 5

http://uk1.php.net/manual/en/class.domdocument.php

<?php


// Set up test data
//


$data = <<<DATA
<html>
<head>
     <title></title>
</head>
<body>
     <form action="story.php?storyNumber=storyNumber.value">
          <select id="storyNumber" name="storyNumber" title="Story Navigation">
              <option selected="" value="1"></option>
              <option value="2">test 2</option>
              <option value="3">test 3</option>
              <option value="4">text 4</option>
              <option value="5">text 5</option>
              <option value="6"></option>
              <option value="7"></option>
              <option value="8"></option>
              <option value="9"></option>
              <option value="10"></option>
              <option value="11"></option>
              <option value="12"></option>
              <option value="13"></option>
              <option value="14"></option>
              <option value="15"></option>
              <option value="16"></option>
              <option value="17"></option>
              <option value="18"></option>
              <option value="19"></option>
              <option value="20"></option>
          </select>
     </form>
     <div id="div1">
          ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
     </div>
</body>
</html>
DATA;


// ----- Process test data -----
//


// Create a DOMDocument object and load the test data into it
//
$doc = new DOMDocument();
$doc->loadHTML( $data );


// Locate and extract the SELECT using its ID attribute
//
$items = $doc->getElementById("storyNumber");


// Pick the tags belonging to the SELECT apart to get the option text
//
$options = $items->getElementsByTagName("option");


// For demo purposes, print the text out
//
foreach ($options as $option) {
        echo $option->nodeValue . "<br/>";
}
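
To get from the option list to the combined output asked for in the question, one possible extension is sketched below. It rests on an assumption that cannot be confirmed from the snippet: that each story is served by a plain GET request of the form story.php?storyNumber=N (a form with no method attribute submits via GET, but the real site may behave differently, for example by using JavaScript). The names collectStories() and $fetch, and the example.com URL, are hypothetical.

```php
<?php
// Sketch only. collectStories() and $fetch are hypothetical names.
// ASSUMPTION: each story is available at "<base>?storyNumber=N" via a plain GET.
function collectStories(string $firstPage, string $base, callable $fetch): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup

    $doc = new DOMDocument();
    $doc->loadHTML($firstPage);

    $select = $doc->getElementById('storyNumber');
    if ($select === null) {
        return '';                       // no story navigation found
    }

    $stories = [];
    foreach ($select->getElementsByTagName('option') as $option) {
        // Build the URL for this story from the option's value attribute
        $query = http_build_query(['storyNumber' => $option->getAttribute('value')]);
        $html  = $fetch($base . '?' . $query);
        if (!is_string($html) || $html === '') {
            continue;                    // skip pages that could not be fetched
        }

        $page = new DOMDocument();
        $page->loadHTML($html);
        $div = $page->getElementById('div1');
        if ($div !== null) {
            $stories[] = trim($div->textContent);
        }
    }

    // Each article is placed below the previous one
    return implode("\n\n", $stories);
}

// Stand-in fetcher so the sketch runs offline; against a real site,
// $fetch could simply be 'file_get_contents'.
$first = '<html><body><select id="storyNumber"><option value="1"></option>'
       . '<option value="2"></option></select><div id="div1">Story 1</div></body></html>';
$fake = function (string $url): string {
    parse_str((string) parse_url($url, PHP_URL_QUERY), $q);
    return '<html><body><div id="div1">Story ' . $q['storyNumber'] . '</div></body></html>';
};
echo collectStories($first, 'http://example.com/story.php', $fake);
```

Passing the fetcher in as a callable keeps the parsing logic testable without network access, and makes it easy to swap in cURL or add a delay between requests later.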
