Solved

Parsing a page (HTML) using PHP, HOW?

Posted on 2010-08-21
Last Modified: 2013-11-18
http://www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626

Take a look at this page; it's a clothing shop for kids. This is one of their items, and I want to point out the size section. What we need to do is get all the sizes for this item and check whether each size is available. Right now the sizes for this item are:

3-4 years
4-5 years
5-6 years
7-8 years

How can you tell whether the sizes are available or not?

Now take a look at this second page and check the sizes again:

http://www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751

This item has the following sizes:

12 months
18 months - Not Available
24 months

As you can see, the 18 months size is not available; this is indicated by the "Not Available" text next to the size.

What we need to do is go to the page of an item, get the sizes, and check the availability of each size. How can I do this in PHP?
Question by:Bandai2
8 Comments
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 33491496
What you're describing is called "scraping" a web site.  Each page you need to scrape is typically a custom application, and it's a rather brittle technology, since a change in the foreign web site can cause your application to break without notice.

A better approach is to ask the owner of the web site for a REST API.  You send the item identifier to the API; the API responds with an XML string giving the item identifier and the availability.
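
A minimal sketch of what consuming such an API response could look like in PHP. No such API exists in this thread, so the XML layout, the element and attribute names, and the parseAvailabilityXml() helper are all assumptions made for illustration:

```php
<?php
// Hypothetical XML response; element and attribute names are assumptions, not a real ASOS format.
$sample_xml = <<<XML
<item id="1111751">
  <size name="12 months" available="true"/>
  <size name="18 months" available="false"/>
  <size name="24 months" available="true"/>
</item>
XML;

// Parse an availability response into an array of size name => boolean
function parseAvailabilityXml($xml_string) {
  $xml = simplexml_load_string($xml_string);
  if ($xml === false) {
    throw new Exception("Response is not valid XML");
  }

  $sizes = array();
  foreach ($xml->size as $size) {
    // Attributes come back as SimpleXMLElement objects, so cast them to strings first
    $sizes[(string)$size['name']] = ((string)$size['available'] === 'true');
  }
  return $sizes;
}

$availability = parseAvailabilityXml($sample_xml);
print_r($availability);
```

In a real integration the XML string would come from an HTTP request to the vendor's endpoint rather than from a hard-coded sample.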

If the owner of the web site wants to give you automated access to this information, the API is a very easy solution.  If you can show that it will increase their sales even a little bit, that makes a good business case for you.
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 33491539
Have a look near line 510 in the HTML for this page:
http://www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751

As you can see, this is the size selection DIV.  The information that populates this DIV apparently comes from an AJAX interaction with a backend script.  The words "Not Available" are not in the HTML, but they are clearly rendered on the screen when you open the SELECT form control, so they are put into the DOM some other way.  It's things like this that argue for an API instead of a screen scraper.  As more AJAX interactivity is put into web sites, scrapers become less useful.  One might even understand that a vendor who wanted to prevent scraping would deliberately design an AJAX application to make it harder to leech the content from the site.

Perhaps one of the other Experts can help you parse the DOM, but it might be faster, easier and more dependable to ask for the API.

Best of luck with your project, ~Ray
<div class="size-guide">
<select name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>
</div>

 

Author Comment

by:Bandai2
ID: 33491545
We already asked for an API and they don't have it. There's no definite date on when they could create one but I won't hold out for it.

So we have to do this scraping for now, until we can get a proper API.

 
LVL 109

Expert Comment

by:Ray Paseur
ID: 33491597
Well, good luck with it.  There is nothing in the HTML to scrape, so you might need to parse the DOM (no guarantees, but it might work).  I'm afraid I do not have time to research that (I could write the API faster!), but maybe some of these links could get you to a solution.
http://php.net/manual/en/book.dom.php
http://www.bitrepository.com/php-simple-html-dom-parser.html
http://simplehtmldom.sourceforge.net/
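
A minimal sketch of the DOM-parsing route using PHP's built-in DOM extension. It only helps when the options are present in the served HTML, which the comment above says is not the case for this particular page. The sample markup, the "sizes" id and the readSizeOptions() helper are assumptions for illustration:

```php
<?php
// Hypothetical markup modelled on the size list in the question; the real page may differ.
$sample_html = '<select id="sizes">'
  . '<option value="-1">Select Size</option>'
  . '<option value="1">12 months</option>'
  . '<option value="2">18 months - Not Available</option>'
  . '<option value="3">24 months</option>'
  . '</select>';

// Read the <option> labels of a <select> and flag the ones marked "Not Available"
function readSizeOptions($html, $select_id) {
  $dom = new DOMDocument();

  // Collect libxml warnings internally instead of printing them for imperfect HTML
  $previous = libxml_use_internal_errors(true);
  $dom->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($dom);
  $sizes = array();
  foreach ($xpath->query('//select[@id="' . $select_id . '"]/option') as $option) {
    // Skip the "Select Size" placeholder entry
    if ($option->getAttribute('value') === '-1') {
      continue;
    }
    $label = trim($option->textContent);
    $available = (stripos($label, 'Not Available') === false);
    // Strip the " - Not Available" suffix to leave just the size name
    $name = trim(preg_replace('/\s*-\s*Not Available$/i', '', $label));
    $sizes[$name] = $available;
  }
  return $sizes;
}

$sizes = readSizeOptions($sample_html, 'sizes');
print_r($sizes);
```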
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33491695
This code should do the job. The product variations are in the HTML, but they are declared in a JavaScript variable near the top of the page and then pulled in via JavaScript on page load. This script extracts the declarations and converts them to a PHP array containing an associative array for each variation, including whether it is available as a boolean. There are a few columns in the data whose purpose I have not been able to work out; these are labelled unknown_col[1-6].

That said, I fully agree with the previous comments: this is not an efficient way of working, and an API would be MUCH better.

<?php

function getProductVariations($url) {
  
  //Use CURL to get the raw HTML for the page
  $ch = curl_init();
  curl_setopt_array($ch,
    array(
      CURLOPT_RETURNTRANSFER=>true,
      CURLOPT_HEADER => false,
      CURLOPT_URL => $url
    )
  );
  $raw_html = curl_exec($ch);

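  //If we get an invalid response back from the server fail
  if ($raw_html===false) {
    //Capture the error and release the handle before throwing
    $curl_error = curl_error($ch);
    curl_close($ch);
    throw new Exception($curl_error);
  }

  curl_close($ch);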

  //Find the variation JS declarations and extract them
  $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);

  //We are done with the Raw HTML now
  unset($raw_html);

  //Check that we got some results back
  if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {

    //This is where the matches will go
    $matches = array();

    //Go through the results of the bracketed expression and convert them to a PHP assoc array
    foreach($raw_matches[1] as $match) {

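      //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
      $proc=json_decode("[$match]");

      //Skip any declaration that could not be decoded (json_decode returns null on failure)
      if (!is_array($proc)) {
        continue;
      }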

      //Label the fields as best we can
      $proc2=array(
        "variation_id"=>$proc[0],
        "size_desc"=>$proc[1],
        "colour_desc"=>$proc[2],
        "available"=>(trim(strtolower($proc[3]))=="true"),
        "unknown_col1"=>$proc[4],
        "price"=>$proc[5],
        "unknown_col2"=>$proc[6],       /*Always seems to be zero*/
        "currency"=>$proc[7],
        "unknown_col3"=>$proc[8],
        "unknown_col4"=>$proc[9],       /*Negative price*/
        "unknown_col5"=>$proc[10],      /*Always seems to be zero*/
        "unknown_col6"=>$proc[11]       /*Always seems to be zero*/
      );

      //Push the processed variation onto the results array
      $matches[$proc[0]]=$proc2;

      //We are done with our proc2 array now (proc is simply overwritten on the next iteration)
      unset($proc2);
    }

    //Return the matches we have found
    return $matches;

  } else {
    throw new Exception("Unable to find any product variations");

  }
}


//EXAMPLE USAGE
try {
  $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");

  //Do something more useful here
  print_r($variations);


} catch(Exception $e) {
  echo "Error: " . $e->getMessage();
}

?>

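
A minimal sketch of how the returned array might be used to answer the original question (which sizes are available). The sample data below is made up and only mimics the shape produced by getProductVariations(); summariseSizes() is a hypothetical helper:

```php
<?php
// Hypothetical sample in the shape returned by getProductVariations(); the values are made up.
$variations = array(
  101 => array("variation_id"=>101, "size_desc"=>"12 months", "available"=>true),
  102 => array("variation_id"=>102, "size_desc"=>"18 months", "available"=>false),
  103 => array("variation_id"=>103, "size_desc"=>"24 months", "available"=>true)
);

// Build a size => availability map from the variations array
function summariseSizes($variations) {
  $summary = array();
  foreach ($variations as $variation) {
    $summary[$variation["size_desc"]] = (bool)$variation["available"];
  }
  return $summary;
}

$summary = summariseSizes($variations);

// Print each size, marking unavailable ones the same way the shop page does
foreach ($summary as $size => $available) {
  echo $size . ($available ? "" : " - Not Available") . "\n";
}
```

With live data, the hard-coded array would be replaced by the result of a getProductVariations() call.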
 

Author Comment

by:Bandai2
ID: 33495251
Wow mcuk_storm, I didn't know it could be done so easily. This is my first time using the cURL functions.

The above code works great but there's a problem when the product needs you to select a colour first before the sizes are displayed.

Like this one:

http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006

Any idea how to go about this?
 
LVL 7

Accepted Solution

by:
mcuk_storm earned 500 total points
ID: 33495404
What is the problem with that one? It gives you all the variations for both colours (see the colour_desc value in the array). In that example you get some entries for Steel68Years and some for Steel35Years, referring to the Steel colour for 6-8 years and for 3-5 years respectively.

I have attached a revised version which gives you the colour code and description rather than just the code.
<?php

function getProductVariations($url) {
  
  //Use CURL to get the raw HTML for the page
  $ch = curl_init();
  curl_setopt_array($ch,
    array(
      CURLOPT_RETURNTRANSFER=>true,
      CURLOPT_HEADER => false,
      CURLOPT_URL => $url
    )
  );
  $raw_html = curl_exec($ch);

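  //If we get an invalid response back from the server fail
  if ($raw_html===false) {
    //Capture the error and release the handle before throwing
    $curl_error = curl_error($ch);
    curl_close($ch);
    throw new Exception($curl_error);
  }

  curl_close($ch);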

  $colour_lookup=array();
  $raw_colours_start_idx = strpos($raw_html,'name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnColour"');
  if ($raw_colours_start_idx !== false) {
    $raw_colours_end_idx = strpos($raw_html,'</select>',$raw_colours_start_idx);
    if ($raw_colours_end_idx !== false) {
      $raw_colours_html = substr($raw_html,$raw_colours_start_idx,$raw_colours_end_idx-$raw_colours_start_idx);

      $raw_colours = preg_match_all("/<option value=\"([^\"]+)\">([^<]+)<\/option>/",$raw_colours_html,$raw_matches);

      if (is_array($raw_matches) && isset($raw_matches[2]) && sizeof($raw_matches[2])==$raw_colours && $raw_colours>0) {
        foreach($raw_matches[1] as $idx=>$val) {
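          //Pad the result so an option label without a price does not raise an undefined offset notice
          list($col_name,$col_price) = array_pad(explode("&#163;",$raw_matches[2][$idx],2),2,null);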
          $col_key = preg_replace("/[^a-zA-Z0-9]/","",$val);
          $colour_lookup[$col_key] = array("name"=>$col_name/*,"price"=>$col_price*/);
        }
      }
    }
  }
  unset($raw_colours_start_idx,$raw_colours_end_idx,$raw_colours_html,$raw_colours,$raw_matches,$col_key,$col_name,$col_price);
  

  //Find the variation JS declarations and extract them
  $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);

  //We are done with the Raw HTML now
  unset($raw_html);

  //Check that we got some results back
  if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {

    //This is where the matches will go
    $matches = array();

    //Go through the results of the bracketed expression and convert them to a PHP assoc array
    foreach($raw_matches[1] as $match) {

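      //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
      $proc=json_decode("[$match]");

      //Skip any declaration that could not be decoded (json_decode returns null on failure)
      if (!is_array($proc)) {
        continue;
      }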

      //Label the fields as best we can
      $proc2=array(
        "variation_id"=>$proc[0],
        "size_desc"=>$proc[1],
        "colour_code"=>$proc[2],
        "colour_desc"=>(isset($colour_lookup[$proc[2]]) ? $colour_lookup[$proc[2]]['name']: $proc[2]),
        "available"=>(trim(strtolower($proc[3]))=="true"),
        "unknown_col1"=>$proc[4],
        "price"=>$proc[5],
        "unknown_col2"=>$proc[6],       /*Always seems to be zero*/
        "currency"=>$proc[7],
        "unknown_col3"=>$proc[8],
        "unknown_col4"=>$proc[9],       /*Negative price*/
        "unknown_col5"=>$proc[10],      /*Always seems to be zero*/
        "unknown_col6"=>$proc[11]       /*Always seems to be zero*/
      );

      //Push the processed variation onto the results array
      $matches[$proc[0]]=$proc2;

      //We are done with our proc2 array now (proc is simply overwritten on the next iteration)
      unset($proc2);
    }

    //Return the matches we have found
    return $matches;

  } else {
    throw new Exception("Unable to find any product variations");

  }
}

//EXAMPLE USAGE
try {
  $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=1171006");

  //Do something more useful here
  print_r($variations);


} catch(Exception $e) {
  echo "Error: " . $e->getMessage();
}

?>

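
A minimal sketch of grouping the revised output by colour, for products where a colour must be chosen before the sizes are shown. The sample values (including the "Navy" colour) are made up, and groupSizesByColour() is a hypothetical helper:

```php
<?php
// Hypothetical sample in the shape returned by the revised getProductVariations(); the values are made up.
$variations = array(
  201 => array("size_desc"=>"3-5 Years", "colour_desc"=>"Steel", "available"=>true),
  202 => array("size_desc"=>"6-8 Years", "colour_desc"=>"Steel", "available"=>false),
  203 => array("size_desc"=>"3-5 Years", "colour_desc"=>"Navy",  "available"=>true)
);

// Group the variations as colour => (size => availability)
function groupSizesByColour($variations) {
  $grouped = array();
  foreach ($variations as $variation) {
    $grouped[$variation["colour_desc"]][$variation["size_desc"]] = (bool)$variation["available"];
  }
  return $grouped;
}

$grouped = groupSizesByColour($variations);
print_r($grouped);
```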
 

Author Closing Comment

by:Bandai2
ID: 33552750
Great Work here!