


Posted on 2010-11-17
Medium Priority
Last Modified: 2012-05-10
when using a webcrawler to gather data from a site

it is best to use landmarks
in case the layout of the site changes

what is an example of a landmark
Question by:rgb192

Expert Comment

ID: 34161833
a good and useful link can be found by just googling,
check here : http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1015021

Expert Comment

by:Ray Paseur
ID: 34163719
"When using a webcrawler to gather data from a site" -- the first question to ask is, "May I please have permission to gather data from your site using a web crawler?" and the next question is, "Do you have an API that can give me the data in a structured format like XML or JSON?"  These are more reliable than any "scraper" method because technically competent data providers understand the RESTful API structure and the need for versioning.

A search for "webcrawler landmarks" turns up a particularly inept list of random stuff.  Do you have a definition or an alternate term for "Landmark?"

Author Comment

ID: 34179855
two pages from
Webbots, Spiders, and Screen Scrapers
by Michael Schrenk

which use the word 'landmark'

I still don't understand.

Initialization and Downloading the Target

The example script initializes by including the LIB_http and LIB_parse libraries you read about earlier. It also creates an array where the parsed data is stored, and it sets the product counter to zero, as shown in Listing 7-1.

# Initialization
include("LIB_http.php");
include("LIB_parse.php");
$product_array = array();
$product_count = 0;

# Download the target (practice store) web page
$target = "http://www.schrenk.com/webbots/example_store";
$web_page = http_get($target, "");

Listing 7-1: Initializing the price-monitoring webbot

After initialization, the script proceeds to download the target web page with the http_get() function described in DOWNLOADING WEB PAGES.

After downloading the web page, the script parses all the page's tables into an array, as shown in Listing 7-2.

# Parse all the tables on the web page into an array
$table_array = parse_array($web_page['FILE'], "<table", "</table>");

Listing 7-2: Parsing the tables into an array

The script does this because the product pricing data is in a table. Once we neatly separate all the tables, we can look for the table with the product data. Notice that the script uses <table, not <table>, as the leading indicator for a table. It does this because <table will always be appropriate, no matter how many table formatting attributes are used.

Next, the script looks for the first landmark, or text that identifies the table where the product data exists. Since the landmark represents text that identifies the desired data, that text must be exclusive to our task. For example, by examining the page's source code we can see that we cannot use the word origin as a landmark because it appears in both the description of this week's auction and the list of products for sale. The example script uses the words Products for Sale, because that phrase only exists in the heading of the product table and is not likely to exist elsewhere if the web page is updated. The script looks at each table until it finds the one that contains the landmark text, Products for Sale, as shown in Listing 7-3.

# Look for the table that contains the product information
for($xx=0; $xx<count($table_array); $xx++)
    {
    $table_landmark = "Products For Sale";
    if(stristr($table_array[$xx], $table_landmark))     // Process this table
        {
        echo "FOUND: Product table\n";
        # (Listings 7-4 through 7-7 run here, inside this block)
        }
    }


Listing 7-3: Examining each table for the existence of the landmark text
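
To make the idea of a landmark concrete without the book's libraries, here is a minimal sketch. The function name find_table_by_landmark() and the sample page are made up for illustration, and the regular expression assumes tables are not nested. It splits a page into tables and returns the position of the first one whose visible text contains the landmark.

```php
<?php
// Hypothetical helper (not part of LIB_parse): return the index of the first
// <table> block whose visible text contains the landmark, or -1 if none does.
function find_table_by_landmark(string $html, string $landmark): int
{
    // Split the page into <table ...>...</table> chunks (assumes no nested tables).
    preg_match_all('/<table\b.*?<\/table>/is', $html, $matches);
    foreach ($matches[0] as $index => $table) {
        // stripos() is case-insensitive, like stristr() in Listing 7-3.
        if (stripos(strip_tags($table), $landmark) !== false) {
            return $index;
        }
    }
    return -1;
}

// Made-up sample page: a navigation table followed by the product table.
$sample_page = '
<table class="nav"><tr><td>Home</td><td>Auctions</td></tr></table>
<table border="1"><tr><th colspan="2">Products For Sale</th></tr>
<tr><td>Widget</td><td>$4.00</td></tr></table>';

echo find_table_by_landmark($sample_page, "Products For Sale"), "\n";
```

If the site later adds another table above the product table, the index changes but the landmark still finds the right one, which is the point of using a landmark instead of a fixed position.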

Once the table containing the product pricing data is found, that table is parsed into an array of table rows, as shown in Listing 7-4.

# Parse table into an array of table rows
$product_row_array = parse_array($table_array[$xx], "<tr", "</tr>");

Listing 7-4: Parsing the table into an array of table rows

Then, once an array of table rows from the product data table is available, the script looks for the product table heading row. The heading row is useful for two reasons: It tells the webbot where the data begins within the table, and it provides the column positions for the desired data. This is important because in the future, the order of the data columns could change (as part of a web page update, for example). If the webbot uses column names to identify data, the webbot will still parse data correctly if the order changes, as long as the column names remain the same.

Here again, the script relies on a landmark to find the table heading row. This time, the landmark is the word Condition, as shown in Listing 7-5. Once the landmark identifies the table heading, the positions of the desired table columns are recorded for later use.

for($table_row=0; $table_row<count($product_row_array); $table_row++)
    {
    # Detect the beginning of the desired data (heading row)
    $heading_landmark = "Condition";
    if((stristr($product_row_array[$table_row], $heading_landmark)))
        {
        echo "FOUND: Table heading row\n";

        # Get the position of the desired headings
        $table_cell_array = parse_array($product_row_array[$table_row], "<td", "</td>");
        for($heading_cell=0; $heading_cell<count($table_cell_array); $heading_cell++)
            {
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "ID#"))
                $id_column = $heading_cell;
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "Product name"))
                $name_column = $heading_cell;
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "Price"))
                $price_column = $heading_cell;
            }
        echo "FOUND: id_column=$id_column\n";
        echo "FOUND: price_column=$price_column\n";
        echo "FOUND: name_column=$name_column\n";

        # Save the heading row for later use
        $heading_row = $table_row;
        }
    # (Listings 7-6 and 7-7 run here, inside the row loop)
    }


Listing 7-5: Detecting the table heading and recording the positions of desired columns

As the script loops through the table containing the desired data, it must also identify where the pricing data ends. A landmark is used again to identify the end of the desired data. The script looks for the landmark Calculate, from the form's submit button, to identify when it has reached the end of the data. Once found, it breaks the loop, as shown in Listing 7-6.

# Detect the end of the desired data table
$ending_landmark = "Calculate";
if((stristr($product_row_array[$table_row], $ending_landmark)))
    {
    echo "PARSING COMPLETE!\n";
    break;
    }

Listing 7-6: Detecting the end of the table

If the script finds the headers but doesn't find the end of the table, it assumes that the rest of the table rows contain data. It parses these table rows, using the column position data gleaned earlier, as shown in Listing 7-7.

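# Parse product and price data
if(isset($heading_row) && $heading_row<$table_row)
    {
    $table_cell_array = parse_array($product_row_array[$table_row], "<td", "</td>");
    $product_array[$product_count]['ID'] = strip_tags(trim($table_cell_array[$id_column]));
    $product_array[$product_count]['NAME'] = strip_tags(trim($table_cell_array[$name_column]));
    $product_array[$product_count]['PRICE'] = strip_tags(trim($table_cell_array[$price_column]));
    $product_count++;
    echo "PROCESSED: Item #$product_count\n";
    }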


Listing 7-7: Assigning parsed data to an array

Once the prices are parsed into an array, the webbot script can do anything it wants with the data. In this case, it simply displays what it collected, as shown in Listing 7-8.

# Display the collected data
for($xx=0; $xx<count($product_array); $xx++)
    {
    echo "$xx. ";
    echo "ID: ".$product_array[$xx]['ID'].", ";
    echo "NAME: ".$product_array[$xx]['NAME'].", ";
    echo "PRICE: ".$product_array[$xx]['PRICE']."\n";
    }

Listing 7-8: Displaying the parsed product pricing data

As the figure "The price-monitoring webbot, as run in a shell" shows, the webbot indicates when it finds landmarks and prices. This not only tells the operator how the webbot is running, but also provides important diagnostic information, making both debugging and maintenance easier.

Since prices are almost always in HTML tables, you will usually parse price information in a manner that is similar to that shown here. Occasionally, pricing information may be contained in other tags (like <div> tags, for example), but this is less likely. When you encounter <div> tags, you can easily parse the data they contain into arrays using similar methods.


The biggest complaint users have about webbots is their unreliability: Your webbots will suddenly and inexplicably fail if they are not fault tolerant, or able to adapt to the changing conditions of your target websites. This chapter is devoted to helping you write webbots that are tolerant to network outages and unexpected changes in the web pages you target.

Webbots that don't adapt to their changing environments are worse than nonfunctional ones because, when presented with the unexpected, they may perform in odd and unpredictable ways. For example, a non-fault-tolerant webbot may not notice that a form has changed and will continue to emulate the nonexistent form. When a webbot does something that is impossible to do with a browser (like submit an obsolete form), system administrators become aware of the webbot. Furthermore, it's usually easy for system administrators to identify the owner of a webbot by tracing an IP address or matching a user to a username and password. Depending on what your webbot does and which website it targets, the identification of a webbot can lead to possible banishment from the website and the loss of a competitive advantage for your business. It's better to avoid these issues by designing fault-tolerant webbots that anticipate changes in the websites they target.

Fault tolerance does not mean that everything will always work perfectly. Sometimes changes in a targeted website confuse even the most fault-tolerant webbot. In these cases, the proper thing for a webbot to do is to abort its task and report an error to its owner. Essentially, you want your webbot to fail in the same manner a person using a browser might fail. For example, if a webbot is buying an airline ticket, it should not proceed with a purchase if a seat is not available on a desired flight. This action sounds silly, but it is exactly what a poorly programmed webbot may do if it is expecting an available seat and has no provision to act otherwise.
Types of Webbot Fault Tolerance

For a webbot, fault tolerance involves adapting to changes in URLs, HTML content (which affects parsing), forms, and cookie use, as well as to network outages and congestion. We'll examine each of these aspects of fault tolerance in the following sections.
Adapting to Changes in URLs

Possibly the most important type of webbot fault tolerance is URL tolerance, or a webbot's ability to make valid requests for web pages under changing conditions. URL tolerance ensures that your webbot does the following:


      Download pages that are available on the target site

      Follow header redirections to updated pages

      Use referer values to indicate that you followed a link from a page that is still on the website

Avoid Making Requests for Pages That Don't Exist

Before you determine that your webbot downloaded a valid web page, you should verify that you made a valid request. Your webbot can verify successful page requests by examining the HTTP code, a status code returned in the header of every web page. If the request was successful, the resulting HTTP code will be in the 200 series—meaning that the HTTP code will be a three-digit number beginning with a two. Any other value for the HTTP code may indicate an error. The most common HTTP code is 200, which says that the request was valid and that the requested page was sent to the web agent. The script in Listing 25-1 shows how to use the LIB_http library's http_get() function to validate the returned page by looking at the returned HTTP code. If the webbot doesn't detect the expected HTTP code, an error handler is used to manage the error and the webbot stops.

# Get the web page
$page = http_get($target="www.schrenk.com", $ref="");
# Vector to error handler if error code detected
if($page['STATUS']['http_code'] != "200")
    error_handler("BAD RESULT", $page['STATUS']['http_code']);


Listing 25-1: Detecting a bad page request

Before using the method described in Listing 25-1, review a list of HTTP codes and decide which codes apply to your situation. (A full list of HTTP codes is available in STATUS CODES.)

If the page no longer exists, the fetch will return a 404 Not Found error. When this happens, it's imperative that the webbot stop and not download any more pages until you find the cause of the error. Not proceeding after detecting an error is a far better strategy than continuing as if nothing is wrong.
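
The "200 series" rule described above can be sketched as a small check. The helper name is_successful_status() is hypothetical, and the $page array only imitates the shape that http_get() returns in Listing 25-1; its values are invented for the example.

```php
<?php
// Hypothetical helper: true only for HTTP codes in the 200 series.
function is_successful_status(int $http_code): bool
{
    return $http_code >= 200 && $http_code < 300;
}

// Stand-in for the array http_get() returns; the values here are made up.
$page = array('STATUS' => array('http_code' => 404), 'FILE' => '');

if (!is_successful_status((int) $page['STATUS']['http_code'])) {
    // Stop here rather than continuing to download pages.
    echo "Stopping: unexpected HTTP code ", $page['STATUS']['http_code'], "\n";
}
```

Whether codes such as 204 or 206 should count as success depends on the target, so treat the range as a starting point and narrow it to the codes you actually expect.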

Web developers don't always remove obsolete web pages from their websites—sometimes they just link to an updated page without removing the old one. Therefore, webbots should start at the web page's home page and verify the existence of each page between the home page and the actual targeted web page. This process does two things. It helps your webbot maintain stealth, as it simulates the browsing habits of a person using a browser. Moreover, by validating that there are links to subsequent pages, you verify that the pages you are targeting are still in use. In contrast, if your webbot targets a page within a site without verifying that other pages still link to it, you risk targeting an obsolete web page.

The fact that your webbot made a valid page request does not indicate that the page you've downloaded is the one you intended to download or that it contains the information you expected to receive. For that reason, it is useful to find a validation point, or text that serves as an indication that the newly downloaded web page contains the expected information. Every situation is different, but there should always be some text on every page that validates that the page contains the content you're expecting. For example, suppose your webbot submits a form to authenticate itself to a website. If the next web page contains a message that welcomes the member to the website, you may wish to use the member's name as a validation point to verify that your webbot successfully authenticated, as shown in Listing 25-2.

$username = "GClasemann";
$page = http_get($target, $ref="");
if(!stristr($page['FILE'], "$username"))
    {
    echo "authentication error";
    error_handler("BAD AUTHENTICATION for ".$username, $target);
    }

Listing 25-2: Using a username as a validation point to confirm the result of submitting a form

The script in Listing 25-2 verifies that a validation point, in this case a username, exists as anticipated on the fetched page. This strategy works because the only way that the user's name would appear on the web page is if he or she had been successfully authenticated by the website. If the webbot doesn't find the validation point, it assumes there is a problem and it reports the situation with an error handler.
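
A validation point check can be tried on plain strings, without logging in anywhere. This is only a sketch: has_validation_point() and both sample pages are invented for illustration, and it compares against visible text only, which is a design choice of this example and not something the book prescribes.

```php
<?php
// Hypothetical helper: does the page's visible text contain the validation point?
function has_validation_point(string $html, string $validation_point): bool
{
    if ($validation_point === '') {
        return false;   // an empty validation point proves nothing
    }
    // strip_tags() removes markup, so text inside tags and attributes cannot match.
    return stripos(strip_tags($html), $validation_point) !== false;
}

// Made-up pages: one shown after a successful login, one shown after a failure.
$welcome_page = '<html><body><h1>Welcome back, GClasemann!</h1></body></html>';
$login_page   = '<html><body><form action="login.php">'
              . '<input name="user" value="GClasemann">Please log in</form></body></html>';

var_dump(has_validation_point($welcome_page, "GClasemann"));
var_dump(has_validation_point($login_page, "GClasemann"));
```

The second page shows why visible text matters: a failed login form often echoes the submitted username back in an input value, which would fool a plain substring search over the raw HTML.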
Follow Page Redirections

Page redirections are instructions sent by the server that tell a browser that it should download a page other than the one originally requested. Web developers use page redirection techniques to tell browsers that the page they're looking for has changed and that they should download another page in its place. This allows people to access correct pages even when obsolete addresses are bookmarked by browsers or listed by search engines. As you'll discover, there are several methods for redirecting browsers. The more web redirection techniques your webbots understand, the more fault tolerant your webbot becomes.

Header redirection is the oldest method of page redirection. It occurs when the server places a Location: URL line in the HTTP header, where URL represents the web page the browser should download (in place of the one requested). When a web agent sees a header redirection, it's supposed to download the page defined by the new location. Your webbot could look for redirections in the headers of downloaded pages, but it's easier to configure PHP/CURL to follow header redirections automatically (LIB_http does this for you). Listing 25-3 shows the PHP/CURL options you need to make automatic redirection happen.


curl_setopt($curl_session, CURLOPT_FOLLOWLOCATION, TRUE);     // Follow redirects
curl_setopt($curl_session, CURLOPT_MAXREDIRS, 4);             // Only follow 4


Listing 25-3: Configuring PHP/CURL to follow up to four header redirections

The first option in Listing 25-3 tells PHP/CURL to follow all page redirections as they are defined by the target server. The second option limits the number of redirections your webbot will follow. Limiting the number of redirections defeats webbot traps where servers redirect agents to the page they just downloaded, causing an endless number of requests for the same page and an endless loop.

In addition to header redirections, you should also be prepared to identify and accommodate page redirections made between the <head> and </head> tags, as shown in Listing 25-4.

<html><head>
    <meta http-equiv="refresh" content="0; URL=http://www.nostarch.com">
</head>
</html>

Listing 25-4: Page redirection between the <head> and </head> tags

In Listing 25-4, the web page tells the browser to download http://www.nostarch.com instead of the intended page. Detecting these kinds of redirections is accomplished with a script like the one in Listing 25-5. This script looks for redirections between the <head> and </head> tags in a test page on the book's website.

# Include http, parse, and address resolution libraries

# Identify the target web page and the page base
$target = "http://www.schrenk.com/nostarch/webbots/head_redirection_test.php";
$page_base = "http://www.schrenk.com/nostarch/webbots/";

# Download the web page
$page = http_get($target, $ref="");

# Parse the <head></head>
$head_section = return_between($string=$page['FILE'], $start="<head>", $end="</head>", $type=EXCL);

# Create an array of all the meta tags
$meta_tag_array = parse_array($head_section, $beg_tag="<meta", $close_tag=">");

# Examine each meta tag for a redirection command
for($xx=0; $xx<count($meta_tag_array); $xx++)
    {
    # Look for http-equiv attribute
    $meta_attribute = get_attribute($meta_tag_array[$xx], $attribute="http-equiv");
    if(strtolower($meta_attribute)=="refresh")
        {
        $new_page = return_between($meta_tag_array[$xx], $start="URL", $end=">", $type=EXCL);
        # Clean up URL
        $new_page = trim($new_page);
        $new_page = str_replace("=", "", $new_page);
        $new_page = str_replace("\"", "", $new_page);
        $new_page = str_replace("'", "", $new_page);
        # Create fully resolved URL
        $new_page = resolve_address($new_page, $page_base);
        # Stop at the first redirection, as a browser would
        break;
        }
    }

# Echo results of script
if(isset($new_page))
    {
    echo "HTML Head redirection detected<br>";
    echo "Redirect page = ".$new_page;
    }


Listing 25-5: Detecting redirection between the <head> and </head> tags

Listing 25-5 is also an example of the need for good coding practices as part of writing fault-tolerant webbots. For instance, in Listing 25-5 notice how these practices are followed:


      The script looks for the redirection between the <head> and </head> tags, and not just anywhere on the web page

      The script looks for the http-equiv attribute only within a meta tag

      The redirected URL is converted into a fully resolved address

      Like a browser, the script stops looking for redirections when it finds the first one
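
The same practices can be sketched with PHP's built-in regular expression functions instead of the book's libraries. The function find_meta_refresh() and the sample page are made up for illustration, and the sketch returns the URL as written in the tag without resolving relative addresses.

```php
<?php
// Hypothetical helper: return the URL of the first meta-refresh redirection
// found between <head> and </head>, or null when there is none.
function find_meta_refresh(string $html): ?string
{
    // Only look inside the head section, not anywhere on the page.
    if (!preg_match('/<head\b[^>]*>(.*?)<\/head>/is', $html, $head)) {
        return null;
    }
    preg_match_all('/<meta\b[^>]*>/i', $head[1], $meta_tags);
    foreach ($meta_tags[0] as $tag) {
        // The http-equiv attribute must appear within this meta tag.
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh["\']?/i', $tag)) {
            continue;
        }
        // Pull the address out of content="0; URL=...".
        if (preg_match('/url\s*=\s*["\']?([^"\'>\s]+)/i', $tag, $url)) {
            return $url[1];   // stop at the first redirection
        }
    }
    return null;
}

// Made-up page modeled on Listing 25-4.
$redirect_page = '<html><head><title>Moved</title>
<meta http-equiv="refresh" content="0; URL=http://www.nostarch.com"></head><body></body></html>';

echo find_meta_refresh($redirect_page), "\n";
```

A relative address such as URL=new_page.php would still need to be combined with the page base before it can be requested.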

The last—and most troublesome—type of redirection is that done with JavaScript. These instances are troublesome because webbots typically lack JavaScript parsers, making it difficult for them to interpret JavaScript. The simplest redirection of this type is a single line of JavaScript, as shown in Listing 25-6.

<script>document.location = 'http://www.schrenk.com'; </script>

Listing 25-6: A simple JavaScript page redirection

Detecting JavaScript redirections is also tricky because JavaScript is a very flexible language, and page redirections can take many forms. For example, consider what it would take to detect a page redirection like the one in Listing 25-7.

<script>
function goSomeWhereNew(URL) {
    location.href = URL;
}
</script>
<body onLoad="goSomeWhereNew('http://www.schrenk.com')">

Listing 25-7: A complicated JavaScript page redirection

Fortunately, JavaScript page redirection is not a particularly effective way for a web developer to send a visitor to a new page. Some people turn off JavaScript in their browser configuration, so it doesn't work for everyone; therefore, JavaScript redirection is rarely used. Since it is difficult to write fault-tolerant routines to handle JavaScript, you may have to tough it out and rely on the error-detection techniques addressed later in this chapter.
Maintain the Accuracy of Referer Values

The last aspect of verifying that you're using correct URLs is ensuring that your referer values correctly simulate followed links. You should set the referer to the last target page you requested. This is important for several reasons. For example, some image servers use the referer value to verify that a request for an image is preceded by a request for the entire web page. This defeats bandwidth hijacking, the practice of sourcing images from other people's domains. In addition, websites may defeat deep linking, or linking to a website's inner pages, by examining the referer to verify that people followed a prescribed succession of links to get to a specific point within a website.
Adapting to Changes in Page Content

Parse tolerance is your webbot's ability to parse web pages when your webbot downloads the correct page, but its contents have changed. The following paragraphs describe how to write parsing routines that are tolerant to minor changes in web pages. This may also be a good time to review PARSING TECHNIQUES, which covers general parsing techniques.
Avoid Position Parsing

To facilitate fault tolerance when parsing web pages, you should avoid all attempts at position parsing, or parsing information based on its position within a web page. For example, it's a bad idea to assume that the information you're looking for has these characteristics:


      Starts x characters from the beginning of the page and is y characters in length

      Is in the xth table in a web page

      Is at the very top or bottom of a web page

Any small change in a website can affect position parsing. There are much better ways of finding the information you need to parse.
Use Relative Parsing

Relative parsing is a technique that involves looking for desired information relative to other things on a web page. For example, since many web pages hold information in tables, you can place all the tables into an array, identifying which table contains a landmark term that identifies the correct table. Once a webbot finds the correct table, the data can be parsed from the correct cell by finding the cell relative to a specific column name within that table. For an example of how this works, look at the parsing techniques performed in PRICE-MONITORING WEBBOTS in which a webbot parses prices from an online store.

Table column headings may also be used as landmarks to identify data in tables. For example, assume you have a table like the one below, which presents statistics for three baseball players.

Table: Use Table Headers to Identify Data Within Columns
Player    Team          Hits    Home Runs    Average
Zoe       Marsupials    78      15           .327
Cullen    Wombats       56      16           .331
Kade      Wombats       58      17           .324

In this example you could parse all the tables from the web page and isolate the table containing the landmark Player Statistics. In that table, your webbot could then use the column names as secondary landmarks to identify players and their statistics.
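
As a rough sketch of using column names as secondary landmarks, the code below keys every row by its heading. The function rows_by_heading() is hypothetical, the sample table reuses the baseball data with the columns deliberately shuffled, and the regular expressions assume simple, non-nested table markup.

```php
<?php
// Hypothetical helper: convert the rows of one HTML table into associative
// arrays keyed by the column headings found in the first row.
function rows_by_heading(string $table_html): array
{
    preg_match_all('/<tr\b.*?<\/tr>/is', $table_html, $rows);
    $headings = array();
    $records  = array();
    foreach ($rows[0] as $row) {
        preg_match_all('/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/is', $row, $cells);
        $values = array_map(function ($cell) {
            return trim(strip_tags($cell));
        }, $cells[1]);
        if (empty($headings)) {
            $headings = $values;                 // first row supplies the column names
        } elseif (count($values) === count($headings)) {
            $records[] = array_combine($headings, $values);
        }
    }
    return $records;
}

// Same data as the table above, but with the columns in a different order.
$stats_table = '
<table>
<tr><th>Team</th><th>Player</th><th>Average</th><th>Hits</th><th>Home Runs</th></tr>
<tr><td>Marsupials</td><td>Zoe</td><td>.327</td><td>78</td><td>15</td></tr>
<tr><td>Wombats</td><td>Cullen</td><td>.331</td><td>56</td><td>16</td></tr>
<tr><td>Wombats</td><td>Kade</td><td>.324</td><td>58</td><td>17</td></tr>
</table>';

foreach (rows_by_heading($stats_table) as $player) {
    echo $player['Player'], ": ", $player['Hits'], " hits, average ", $player['Average'], "\n";
}
```

Because each value is looked up by heading, reordering the columns does not change the output; only renaming a heading would.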
Look for Landmarks That Are Least Likely to Change

You achieve additional fault tolerance when you choose landmarks that are least likely to change. From my experience, the things in web pages that change with the lowest frequency are those that are related to server applications or back-end code. In most cases, names of form elements and values for hidden form fields seldom change. For example, in Listing 25-8 it's very easy to find the names and breeds of dogs because the form handler needs to see them in a well-defined manner. Webbot developers generally don't look for data values in forms because they aren't visible in rendered HTML. However, if you're lucky enough to find the data values you're looking for within a form definition, that's where you should get them, even if they appear in other visible places on the website.

<form method="POST" action="dog_form.php">
  <input type="hidden" name="Jackson" value="Jack Russell Terrier">
  <input type="hidden" name="Xing" value="Shepherd Mix">
  <input type="hidden" name="Buster" value="Maltese">
  <input type="hidden" name="Bare-bear" value="Pomeranian">
</form>

Listing 25-8: Finding data values in form variables

Similarly, you should avoid landmarks that are subject to frequent changes, like dynamically generated content, HTML comments (which Macromedia Dreamweaver and other page-generation software programs automatically insert into HTML pages), and information that is time or calendar derived.
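
Reading values straight from hidden form fields might look like the sketch below. The helper hidden_fields() is hypothetical, and it assumes double-quoted attributes as in Listing 25-8; real pages with single-quoted or unquoted attributes would need a more forgiving parser.

```php
<?php
// Hypothetical helper: collect name => value pairs from hidden <input> tags.
function hidden_fields(string $html): array
{
    $fields = array();
    preg_match_all('/<input\b[^>]*>/i', $html, $inputs);
    foreach ($inputs[0] as $tag) {
        // Skip anything that is not type="hidden".
        if (!preg_match('/\btype\s*=\s*"hidden"/i', $tag)) {
            continue;
        }
        // Keep the field only when both name and value are present.
        if (preg_match('/\bname\s*=\s*"([^"]*)"/i', $tag, $name)
            && preg_match('/\bvalue\s*=\s*"([^"]*)"/i', $tag, $value)) {
            $fields[$name[1]] = $value[1];
        }
    }
    return $fields;
}

// The form from Listing 25-8.
$dog_form = '<form method="POST" action="dog_form.php">
  <input type="hidden" name="Jackson" value="Jack Russell Terrier">
  <input type="hidden" name="Xing" value="Shepherd Mix">
  <input type="hidden" name="Buster" value="Maltese">
  <input type="hidden" name="Bare-bear" value="Pomeranian">
</form>';

print_r(hidden_fields($dog_form));
```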
Adapting to Changes in Forms

Form tolerance defines your webbot's ability to verify that it is sending the correct form information to the correct form handler. When your webbot detects that a form has changed, it is usually best to terminate your webbot, rather than trying to adapt to the changes on the fly. Form emulation is complicated, and it's too easy to make embarrassing mistakes—like submitting nonexistent forms. You should also use the form diagnostic page on the book's website (described in AUTOMATING FORM SUBMISSION) to analyze forms before writing form emulation scripts.

Before emulating a form, a webbot should verify that the form variables it plans to submit are still in use in the submitted form. This check should verify the data pair names submitted to the form handler and the form's method and action. Listing 25-9 parses this information on a test page on the book's website. You can use similar scripts to isolate individual form elements, which can be compared to the variables in form emulation scripts.

# Import libraries

# Identify location of form and page base address
$page_base ="http://www.schrenk.com/nostarch/webbots/";
$target = "http://www.schrenk.com/nostarch/webbots/easy_form.php";
$web_page = http_get($target, "");

# Find the forms in the web page
$form_array = parse_array($web_page['FILE'], $open_tag="<form", $close_tag="</form>");

# Parse each form in $form_array
for($xx=0; $xx<count($form_array); $xx++)
    {
    $form_beginning_tag = return_between($form_array[$xx], "<form", ">", INCL);
    $form_action = get_attribute($form_beginning_tag, "action");

    // If no action, use this page as action
    if(strlen(trim($form_action))==0)
        $form_action = $target;
    $fully_resolved_form_action = resolve_address($form_action, $page_base);

    // Default to GET method if no method specified
    if(strtolower(get_attribute($form_beginning_tag, "method"))=="post")
        $form_method="POST";
    else
        $form_method="GET";

    $form_element_array = parse_array($form_array[$xx], "<input", ">");
    echo "Form Method=$form_method<br>";
    echo "Form Action=$fully_resolved_form_action<br>";
    # Parse each element in this form
    for($yy=0; $yy<count($form_element_array); $yy++)
        {
        $element_name = get_attribute($form_element_array[$yy], "name");
        $element_value = get_attribute($form_element_array[$yy], "value");
        echo "Element Name=$element_name, value=$element_value<br>";
        }
    }


Listing 25-9: Parsing form values

Listing 25-9 finds and parses the values of all forms in a web page. When run, it also finds the form's method and creates a fully resolved URL for the form action, as shown in the figure "Results of running the script in Listing 25-9."

Results of running the script in Listing 25-9
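
The comparison step, checking a downloaded form against what the emulation script plans to send, could be sketched as follows. The function form_changes() and the sample login form are invented for illustration; like Listing 25-9, the sketch only looks at input tags, and it assumes double-quoted attributes.

```php
<?php
// Hypothetical helper: list the ways a downloaded form differs from what the
// emulation script expects. An empty array means nothing relevant changed.
function form_changes(string $form_html, string $expected_method,
                      string $expected_action, array $expected_names): array
{
    $problems = array();
    preg_match('/<form\b[^>]*>/i', $form_html, $form_tag);
    $tag = isset($form_tag[0]) ? $form_tag[0] : '';

    // Default to GET when no method is specified.
    $method = preg_match('/\bmethod\s*=\s*"([^"]*)"/i', $tag, $m) ? strtoupper($m[1]) : 'GET';
    $action = preg_match('/\baction\s*=\s*"([^"]*)"/i', $tag, $a) ? $a[1] : '';

    if ($method !== strtoupper($expected_method)) {
        $problems[] = "method is now $method";
    }
    if ($action !== $expected_action) {
        $problems[] = "action is now '$action'";
    }

    // Every variable the webbot plans to submit must still exist in the form.
    preg_match_all('/<input\b[^>]*\bname\s*=\s*"([^"]*)"/i', $form_html, $inputs);
    foreach (array_diff($expected_names, $inputs[1]) as $missing) {
        $problems[] = "missing field '$missing'";
    }
    return $problems;
}

// Made-up form: the webbot still expects a "token" field that is no longer there.
$login_form = '<form method="post" action="login.php">
<input type="text" name="user"><input type="password" name="pass">
<input type="submit" value="Log in"></form>';

print_r(form_changes($login_form, "POST", "login.php", array("user", "pass", "token")));
```

Any non-empty result is a reason to stop the webbot and report the change instead of submitting the form.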

Adapting to Changes in Cookie Management

Cookie tolerance involves saving the cookies written by websites and making them available when fetching successive pages from the same website. Cookie management should happen automatically if you are using the LIB_http library and have the COOKIE_FILE pointing to a file your webbots can access.

One area of concern is that the LIB_http library (and PHP/CURL, for that matter) will not delete expired cookies or cookies without an expiration date, which are supposed to expire when the browser is closed. In these cases, it's important to manually delete cookies in order to simulate new browser sessions. If you don't delete expired cookies, it will eventually look like you're using a browser that has been open continuously for months or even years, which can look pretty suspicious.
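
One blunt way to simulate a new browser session is to discard the whole cookie file once it is older than a chosen age. This is only a sketch under that assumption: expire_cookie_file() is a hypothetical helper, the one-hour limit is arbitrary, and deleting the file also throws away cookies that were still valid.

```php
<?php
// Hypothetical helper: delete the cookie file when it has not been written to
// for at least $max_age_seconds. Returns true only when a file was deleted.
function expire_cookie_file(string $cookie_file, int $max_age_seconds): bool
{
    clearstatcache(true, $cookie_file);          // avoid stale file metadata
    if (!is_file($cookie_file)) {
        return false;                            // nothing to delete
    }
    if (time() - filemtime($cookie_file) < $max_age_seconds) {
        return false;                            // session is still "fresh"
    }
    return unlink($cookie_file);
}

// Demonstration with a temporary file standing in for the real cookie file.
$cookie_file = tempnam(sys_get_temp_dir(), "cookie");
touch($cookie_file, time() - 7200);              // pretend the last session was two hours ago
var_dump(expire_cookie_file($cookie_file, 3600));
```

Calling something like this at the start of each run keeps a long-lived webbot from presenting cookies that a real browser would have dropped long ago.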
Adapting to Network Outages and Network Congestion

Unless you plan accordingly, your webbots and spiders will hang, or become nonresponsive, when a targeted website suffers from a network outage or an unusually high volume of network traffic. Webbots become nonresponsive when they request and wait for a page that they never receive. While there's nothing you can do about getting data from nonresponsive target websites, there's also no reason your webbot needs to be hung up when it encounters one. You can avoid this problem by inserting the command shown in Listing 25-10 when configuring your PHP/CURL sessions.

curl_setopt($curl_session, CURLOPT_TIMEOUT, $timeout_value);

Listing 25-10: Setting time-out values in PHP/CURL

CURLOPT_TIMEOUT sets the maximum number of seconds PHP/CURL allows a request to a targeted website to take. This happens automatically if you use the LIB_http library featured in this book. By default, page requests made by LIB_http wait a maximum of 25 seconds for any target website to respond. If there's no response within the allotted time, the PHP/CURL session returns an empty result.

While on the subject of time-outs, it's important to recognize that PHP, by default, will time-out if a script executes longer than 30 seconds. In normal use, PHP's time-out ensures that if a script takes too long to execute, the webserver will return a server error to the browser. The browser, in turn, informs the user that a process has timed-out. The default time-out works great for serving web pages, but when you use PHP to build webbot or spider scripts, PHP must facilitate longer execution times. You can extend (or eliminate) the default PHP script-execution time with the commands shown in Listing 25-11.

You should exercise extreme caution when eliminating PHP's time-out, as shown in the second example in Listing 25-11. If you eliminate the time-out, your script may hang permanently if it encounters a problem.

set_time_limit(60);       // Set PHP time-out to 60 seconds
set_time_limit(0);        // Completely remove PHP script time-out

Listing 25-11: Adjusting the default PHP script time-out

Always try to avoid time-outs by designing webbots that execute quickly, even if that means your webbot needs to run more than once to accomplish a task. For example, if a webbot needs to download and parse 50 web pages, it's usually best to write the bot in such a way that it can process pages one at a time and know where it left off; then you can schedule the webbot to execute every minute or so for an hour. Webbot scripts that execute quickly are easier to test, resemble normal network traffic more closely, and use fewer system resources.
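
The "process a few pages per run and know where it left off" approach can be sketched with a small state file. Everything here is illustrative: process_next_batch() is a hypothetical helper, the URLs are placeholders, and the callback only prints where a real webbot would download and parse.

```php
<?php
// Hypothetical helper: handle at most $batch_size URLs per run and record the
// position in a state file so the next run resumes where this one stopped.
function process_next_batch(array $urls, string $state_file, int $batch_size, callable $process): int
{
    // Resume from the saved position, or start at 0 on the first run.
    $start = is_file($state_file) ? (int) file_get_contents($state_file) : 0;
    $end   = min($start + $batch_size, count($urls));
    for ($i = $start; $i < $end; $i++) {
        $process($urls[$i]);
        // Checkpoint after every page so a crash loses at most one page of work.
        file_put_contents($state_file, (string) ($i + 1));
    }
    return max(0, $end - $start);   // number of pages handled in this run
}

// Placeholder targets and a temporary state file for the demonstration.
$urls = array(
    "http://www.example.com/page1",
    "http://www.example.com/page2",
    "http://www.example.com/page3",
    "http://www.example.com/page4",
    "http://www.example.com/page5",
);
$state_file = tempnam(sys_get_temp_dir(), "bot");

$handled = process_next_batch($urls, $state_file, 2, function ($url) {
    echo "Processing $url\n";       // a real webbot would fetch and parse here
});
echo "Handled $handled page(s) this run\n";
unlink($state_file);
```

Scheduled every minute or so, each run stays well inside PHP's default time-out, and a failed run can simply be repeated.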

Expert Comment

by:Ray Paseur
ID: 34183508
OK, I think I understand Schrenk's point.  It seems that another term for "landmark" might be "delimiter" and it seems like he is using the term to describe a substring in the HTML.  The location of this substring would give you a point where you could decide what to discard and what to keep.

Let's say you are looking to scrape something out of the meta-tags in the "head" of an HTML document.  Your landmarks might be "<head>" and "</head>".  You can discard everything before the first landmark and everything after the last.  Then you will have only the <head> portion to consider.  So if (for example) the substring "meta" were present in the HTML body, your isolated data would not contain the body, and your script would not give you a false positive.

As you can see, the word "meta" is now a part of the body of this web page because we posted that word here.  So if you wanted to find meta keywords or description, you would only want to look in the head of the document, not the body.
<?php // RAY_temp_rgb192.php
echo "<pre>";

// LANDMARKS OF <head> and </head> TO BOUND THE STRING

$htm = file_get_contents('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_26623320.html?cid=1572#a34179855');
$arr = explode('<head>', $htm);
$arr = explode('</head>', $arr[1]);
echo htmlentities($arr[0]);


Author Comment

ID: 34190041
is this the expected output?
<script type="text/javascript">
  // <![CDATA[
  var eeTimerStart = new Date().getTime();
  var eeTimerCnt = 0;
  var eeAdsLoaded = 0;
  var ourMs = 0;
  var adMs = 0;
  function eeEncode(str)
     str = escape(str);
     str = str.replace('+', '%2B');
     str = str.replace('%20', '+');
     str = str.replace('*', '%2A');
     str = str.replace('/', '%2F');
     str = str.replace('@', '%40');
     return str;
  function endEETimer()
      if (++eeTimerCnt == 4) {
         ourMs = (new Date().getTime() - eeTimerStart);
         eeTimerStart = new Date().getTime();
      if (eeTimerCnt == 5 && eeAdsLoaded == 1) {
         adMs = (new Date().getTime() - eeTimerStart);
         eeTimerStart = new Date().getTime();
      if (eeTimerCnt == 6) {
         var omnitureMs = (new Date().getTime() - eeTimerStart);
         var img = document.createElement("img");
         img.src="/pageLoaded.jsp?url=" + eeEncode(document.location.href) + 
                 "&isNew=1" +
                 "&adMs=" + adMs + "&ourMs=" + ourMs + "&omnitureMs=" + omnitureMs + 
                 "&isSecure=0" + 
                 "&isExpertSkin=0" + 
                 "&isVS=1" + 
                 "&isUsingCDN=0" +
                 "&isUsingEELevel3CDN=1" +
                 "&isUsingEEDigitalWestCDN=0" +


  // ]]>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<link rel="canonical" href="http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_26623320.html" /><link rel="shortcut icon" href="/images/ee.ico" type="image/x-icon" />

<style type="text/css">@import "http://images.experts-exchange.com/getCSS?key=/00188/ee_NS|--base,xp--base,--component,xp--component,--formFactory,xp--formFactory,xp--button,xp-include-infoBox&t=1290023997000";</style>
<style type="text/css">@import "http://images.experts-exchange.com/getCSS?key=/00188/ee_NS|xp-jsp-viewQuestion,-include-customEEple,xp-include-customEEple,-include-suggestedResults,xp-include-zoneHeader,-include-question,xp-include-question,-include-questionList,xp-include-questionList,xp-include-ranks,-include-zoneAd2,-include-eeAd,xp-include-outsideBookmarks,-include-codeSnippet,xp-include-codeSnippet,-include-fileUpload,xp-include-fileUpload,xp-include-addSnippet,-include-addAttachments,xp-include-addAttachments,-include-richtext,xp-include-questionScore,xp-include-actionBox,-include-landingHeader,xp-include-landingHeader,-include-screencastRecordingOverlay,-include-preview,xp-include-preview,-include-comments,xp-include-comments,xp-include-allZones,xp-include-rootTAHeader,-include-collapsibleList2,xp-include-collapsibleList2,xp-include-postableBody,xp-include-viewQuestionPage20,xp-include-viewQuestionPage51,xp-include-viewQuestionPage103,xp-include-viewQuestionPage154,xp-include-viewQuestionPage163,xp-include-viewQuestionPage175,-include-createFilterOverlay,xp-include-createFilterOverlay,xp-include-searchConvMessageAds,-include-mobileAdOverlay&t=1290023996000";</style>
<script src="http://images.experts-exchange.com/00188/scripts/eeSubs_3141229a04f38a0e12487cd43f5756fc.js" type="text/javascript"></script>
<title>landmark : php</title>
<meta name="description" content="when using a webcrawler to gather data from a site it is best to use landmarks in case the layout of the site changes what is an example of a" />
<meta name="keywords" content="php, PHP Scripting Language" />
<script src="http://images.experts-exchange.com/00188/scripts/s_code_0a522bfb0687449fe5b609a65bced569.js" type="text/javascript"></script>

LVL 111

Expert Comment

by:Ray Paseur
ID: 34190767
I'm confused.  This is a PHP question, right?

Author Comment

ID: 34192724
yes php question

got output
landmark : php

when I change tags to '<body>' '</body>'

I get error

What I didnt understand was the author says,
if layout changes, use landmark so site scraper will still work

<?php // RAY_temp_rgb192.php
echo "<pre>";

// LANDMARKS OF <title> and </title> TO BOUND THE STRING

$htm = file_get_contents('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_26623320.html?cid=1572#a34179855');
$arr = explode('<title>', $htm);
$arr = explode('</title>', $arr[1]);
echo htmlentities($arr[0]);


Author Comment

ID: 34192730
<unrelated>I think that book is above my skill level, so I did not understand much</unrelated>
LVL 111

Accepted Solution

Ray Paseur earned 2000 total points
ID: 34192847
Regarding this...

"What I didnt understand was the author says,
if layout changes, use landmark so site scraper will still work"

The author is assuming, incorrectly I think, that after a change to the site you might still be able to use the same script, even though the site is presenting new HTML.  In practice we find that when someone refactors a site, she makes very broad changes to HTML, CSS, etc.  To sum up, site scraping is a brittle technology and it will break when the site changes.  If you need to design your apps to use site scrapers, you need to design them to fail "softly", for example by not polluting the database when an error occurs.
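As a rough illustration of failing "softly", here is a minimal sketch that validates a scraped value before anything is stored. The function name, the price format, and the sample strings are all made up for this example (a price is used only because the book's sample webbot monitors prices), and the type declarations need PHP 7.1 or later.

```php
<?php
// Hypothetical validator: accept a scraped price only if it looks like one.
function parse_scraped_price(?string $raw): ?float
{
    if ($raw === null) {
        return null;                        // landmark was missing
    }
    $clean = trim(strip_tags($raw));
    // Expect something like "$1,234.56" or "19.99"; anything else is rejected
    if (!preg_match('/^\$?(\d{1,3}(?:,\d{3})*|\d+)(\.\d{2})?$/', $clean)) {
        return null;
    }
    return (float) str_replace(['$', ','], '', $clean);
}

// Made-up scraped values: one good, one from a page whose layout changed
foreach (['<b>$1,234.56</b>', '<div id="nav">Home</div>'] as $raw) {
    $price = parse_scraped_price($raw);
    if ($price === null) {
        // Fail softly: log and skip instead of writing junk to the database
        error_log('Scrape failed validation, record skipped: ' . $raw);
        continue;
    }
    echo "Would store price: $price\n";     // stand-in for the real INSERT
}
```

The point is only that the script decides whether the scraped text is plausible before it touches the database; the actual rule depends on what you are scraping.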

You can find the body this way.  Note the omitted closing angle bracket (">") in the search for '<body' - in the EE page, the body tag has attributes, so the literal string '<body>' never appears.
<?php // RAY_temp_rgb192.php
echo "<pre>";

// LANDMARKS OF '<body' and '</body' TO BOUND THE STRING

$htm = file_get_contents('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_26623320.html?cid=1572#a34179855');
$arr = explode('<body', $htm);
$arr = explode('</body', $arr[1]);
echo htmlentities($arr[0]);
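
The error from the '<body>' attempt was probably caused by explode() returning a one-element array when the delimiter is not found, so that $arr[1] does not exist. A guarded variant of the same landmark idea might look like the sketch below. The helper name between_landmarks() and the sample HTML are made up for illustration, and the type declarations need PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the thread): return the text between two
// landmarks, or null when either landmark is missing.
function between_landmarks(string $html, string $start, string $end): ?string
{
    // stripos() is case-insensitive, so '<HEAD>' and '<head>' both match
    $from = stripos($html, $start);
    if ($from === false) {
        return null;              // opening landmark not found
    }
    $from += strlen($start);      // skip past the opening landmark itself

    $to = stripos($html, $end, $from);
    if ($to === false) {
        return null;              // closing landmark not found
    }
    return substr($html, $from, $to - $from);
}

// Made-up sample page so the sketch runs without a network request
$html = '<html><head><title>landmark : php</title></head>'
      . '<body class="x">meta appears here too</body></html>';

var_dump(between_landmarks($html, '<title>', '</title>')); // string(14) "landmark : php"
var_dump(between_landmarks($html, '<body>', '</body>'));   // NULL: the body tag has attributes
```

Returning null lets the calling script notice a missing landmark and stop, rather than carrying on with an empty or wrong string.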


Author Closing Comment

ID: 34197808
thanks... I think you should write a book.
LVL 111

Expert Comment

by:Ray Paseur
ID: 34198509
Thanks for the points -- I'm working on it!
