Solved

Scraping Test2

Posted on 2014-10-12
Last Modified: 2014-10-12
Hi folks, I need to retrieve the contents of a site that I intend to publish on a different domain, and I would like to test everything to see how I would go about doing it. What I need is to retrieve the contents of a single "div" by its "id" and, depending on the number of values in a "select", scrape all the news published by the same author into a single output. So if there are several stories, as in the example below, it would retrieve all the content of the first story, then move on to the next article within the "select" and add the second article below the first one.

Any help would be much appreciated.

<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>

Question by:Rodric MacOliver
2 Comments
 
LVL 111

Assisted Solution

by:Ray Paseur
Ray Paseur earned 1000 total points
ID: 40375542
I can show you how to get the contents of the <div> and since there is only one <div> in the test data, this will be fairly easy.  The bigger problem goes to the nature of using the HTML <select> to find the other articles.  Where are they?  What was selected?  We cannot tell from the code snippet shown here.  And to be honest, we may never be able to tell because of the nature of HTTP protocols.  It's a complex subject, but this article may help you get started.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/A_11271-Understanding-Client-Server-Protocols-and-Web-Applications.html

Here's the dilemma as I see it: Your script can read this page from a server, but it will always get the same page.  It's easy to parse the HTML as you can see below.  Since your script is not a web browser it cannot readily process JavaScript, and since your script is not a human being, it may not be able to make selections and follow the requests from page to page.
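Part of this can sometimes be worked around. If the form submits by GET to story.php with a storyNumber parameter (an assumption; the snippet does not confirm it), the script does not need to "select" anything, because it can build the request URL for each option value itself. A minimal sketch, where the function name story_urls() and the base URL are hypothetical:

```php
<?php
// ASSUMPTION: THE FORM SUBMITS BY GET, SO EACH STORY HAS ITS OWN URL.
// THE BASE URL AND THE FUNCTION NAME ARE HYPOTHETICAL; SUBSTITUTE THE REAL SITE.
function story_urls(string $base, array $numbers): array
{
    $urls = [];
    foreach ($numbers as $n) {
        // http_build_query() URL-ENCODES THE PARAMETER FOR US
        $urls[] = $base . '?' . http_build_query(['storyNumber' => $n]);
    }
    return $urls;
}

// BUILD THE URLS FOR STORIES 1 THROUGH 3 AND SHOW THEM
$urls = story_urls('http://example.com/story.php', range(1, 3));
foreach ($urls as $url) {
    echo $url . PHP_EOL;
}
```

If the site changes the page through JavaScript or a POST request instead, this approach will not work as written.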

My advice to anyone who wants to "scrape" web pages is to rethink the requirement.  If a web publisher is willing to make its data available to you, the publisher will provide an API that gives your programming stable, consistent, structured access to the data.  You may have to pay for the API (this is one way publishers make money today) but the costs are likely to be modest and the benefits are great.

<?php // demo/temp_r05.php
error_reporting(E_ALL);

// SEE: http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28535978.html

// TEST DATA FROM THE POST AT E-E, SIMULATING A WEB PAGE ACQUIRED VIA file_get_contents()
$htm = <<<EOD
<html>
<head>
	<title></title>
</head>
<body>
	<form action="story.php?storyNumber=storyNumber.value">
		<select id="storyNumber" name="storyNumber" title="Story Navigation">
		    <option selected="" value="1"></option>
		    <option value="2"></option>
		    <option value="3"></option>
		    <option value="4"></option>
		    <option value="5"></option>
		    <option value="6"></option>
		    <option value="7"></option>
		    <option value="8"></option>
		    <option value="9"></option>
		    <option value="10"></option>
		    <option value="11"></option>
		    <option value="12"></option>
		    <option value="13"></option>
		    <option value="14"></option>
		    <option value="15"></option>
		    <option value="16"></option>
		    <option value="17"></option>
		    <option value="18"></option>
		    <option value="19"></option>
		    <option value="20"></option>
		</select>
	</form>
	<div id="div1">
		ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
	</div>
</body>
</html>
EOD;

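// BREAK THE STRING APART BASED ON SIGNAL AND ENDING SUBSTRINGS
$sig = '<div id="div1">';
$end = '</div>';
$arr = explode($sig, $htm);

// STOP IF THE SIGNAL STRING WAS NOT FOUND, SO $arr[1] IS NEVER UNDEFINED
if (!isset($arr[1])) exit('SIGNAL STRING NOT FOUND');
$arr = explode($end, $arr[1]);

// SHOW THE WORK PRODUCT
echo htmlentities($arr[0]);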
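
The same signal-and-ending idea can be wrapped in a small reusable function. This is only a sketch: the name text_between() is hypothetical, and like the explode() version it stops at the first ending substring, so a nested <div> inside the target would cut the result short.

```php
<?php
// RETURN THE TEXT BETWEEN $sig AND THE NEXT $end, OR NULL IF EITHER IS MISSING
function text_between(string $htm, string $sig, string $end): ?string
{
    // FIND THE SIGNAL STRING
    $start = strpos($htm, $sig);
    if ($start === false) {
        return null;
    }

    // MOVE PAST THE SIGNAL, THEN FIND THE FIRST ENDING STRING AFTER IT
    $start += strlen($sig);
    $stop = strpos($htm, $end, $start);
    if ($stop === false) {
        return null;
    }

    // TRIM THE SURROUNDING WHITESPACE FROM THE EXTRACTED TEXT
    return trim(substr($htm, $start, $stop - $start));
}

// SMALL INLINE SAMPLE STANDING IN FOR A REAL PAGE
$page = '<body><div id="div1"> Story text </div></body>';
$found = text_between($page, '<div id="div1">', '</div>');
echo htmlentities((string) $found);
```

Returning null instead of raising a notice lets the caller decide what to do when a page does not have the expected markup.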

 
LVL 34

Accepted Solution

by:Beverley Portlock
Beverley Portlock earned 1000 total points
ID: 40375618
I would use DOMDocument to tease the HTML apart. The code below picks out the elements and (for demo purposes) puts them in a list like so

test 2
test 3
text 4
text 5

http://uk1.php.net/manual/en/class.domdocument.php

<?php


// Set up test data
//


$data = <<<DATA
<html>
<head>
     <title></title>
</head>
<body>
     <form action="story.php?storyNumber=storyNumber.value">
          <select id="storyNumber" name="storyNumber" title="Story Navigation">
              <option selected="" value="1"></option>
              <option value="2">test 2</option>
              <option value="3">test 3</option>
              <option value="4">text 4</option>
              <option value="5">text 5</option>
              <option value="6"></option>
              <option value="7"></option>
              <option value="8"></option>
              <option value="9"></option>
              <option value="10"></option>
              <option value="11"></option>
              <option value="12"></option>
              <option value="13"></option>
              <option value="14"></option>
              <option value="15"></option>
              <option value="16"></option>
              <option value="17"></option>
              <option value="18"></option>
              <option value="19"></option>
              <option value="20"></option>
          </select>
     </form>
     <div id="div1">
          ESPN.com is an online platform for multiple sports news, statistics, team information, and player information. ESPN.com offers sports scores, standings, and other statistics for a variety of sports...
     </div>
</body>
</html>
DATA;


// ----- Process test data -----
//


// Create a DOMDocument object and load the test data into it
//
$doc = new DOMDocument();
$doc->loadHTML( $data );


// Locate and extract the SELECT using its ID attribute
//
$items = $doc->getElementById("storyNumber");


// Pick the tags belonging to the SELECT apart to get the option text
//
$options = $items->getElementsByTagName("option");


// For demo purposes, print the text out
//
foreach ($options as $option) {
        echo $option->nodeValue . "<br/>";
}
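
The same DOMDocument approach can be extended to what the question asks for: read the option values, fetch each story, and append the text of "div1" to one output. The sketch below rests on assumptions. The function names div_text() and collect_stories() are hypothetical, and the page fetch is passed in as a callable because the thread does not show how story.php is actually requested. The demo uses a stand-in fetcher instead of a real HTTP call.

```php
<?php
// Return the trimmed text of the element with the given id, or null if absent.
function div_text(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real pages are often malformed; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementById($id);
    return $node === null ? null : trim($node->textContent);
}

// Read the option values from the SELECT on the first page, fetch each story,
// and join the "div1" text of every story into one string.
// $fetch is any callable that maps a story number to that page's HTML.
function collect_stories(string $firstPage, callable $fetch): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($firstPage);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $select = $doc->getElementById('storyNumber');
    if ($select === null) {
        return '';
    }

    $stories = [];
    foreach ($select->getElementsByTagName('option') as $option) {
        $text = div_text($fetch($option->getAttribute('value')), 'div1');
        if ($text !== null) {
            $stories[] = $text;
        }
    }
    return implode("\n\n", $stories);
}

// Demo: a stand-in fetcher that fabricates a page per story number
$first = '<html><body><select id="storyNumber"><option value="1"></option>'
       . '<option value="2"></option></select><div id="div1">Story 1</div></body></html>';
$fake = function (string $n): string {
    return '<html><body><div id="div1">Story ' . $n . '</div></body></html>';
};
$combined = collect_stories($first, $fake);
echo $combined;
```

In real use, $fetch could wrap file_get_contents() with a URL built from the story number, provided the site really serves each story at its own URL.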
