Solved

PHP SimpleHTMLDom issue

Posted on 2013-11-08
Last Modified: 2014-04-22
Hi,

This seems to be a strange problem.

I have a class with three methods.

method1 - uses CURL to get the HTML contents of a page, passes the HTML to SimpleHTMLDom, and stores the relevant data in a table

method2 - is almost the same as method1, except that it accesses a different URL and so pulls the data from the page in a different way. Essentially, it uses CURL to get the HTML contents of a URL and passes it to SimpleHTMLDom

method3 - when executed, calls method1, then method2

The problem is, if I run method1 and method2 individually, they work fine. But if I try to call method3, then it executes method1 fine, it executes method2, but the SimpleHTMLDom object is empty so the method doesn't work.

There are no error messages.

If I modify method3 so that it calls method2 then method1 (reverse order), then it's always the second method to be called that doesn't work.

Within each method I'm clearing the SimpleHTMLDom object when I'm finished.

$html->clear();
unset($html);

Before passing the HTML to the SimpleHTMLDom object, I've tried outputting the HTML provided by CURL, and it's fine.

Does anyone have any ideas why this might be happening?
Question by:SheppardDigital
16 Comments
 
LVL 58

Expert Comment

by:Gary
ID: 39633748
Some code would be helpful.
 

Author Comment

by:SheppardDigital
ID: 39633761
I've figured it out (I think).

I've increased max_execution_time and it's working.
 

Author Comment

by:SheppardDigital
ID: 39633913
I've requested that this question be closed as follows:

Accepted answer: 0 points for SheppardDigital's comment #a39633761

for the following reason:

Fixed this myself
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39633765
It might be as simple as omitting $this-> in one of the methods, but that's just speculation. I'm with GaryC123: we are experts but not mind readers, so we would need to see the code in order to get beyond speculation.
 
LVL 58

Expert Comment

by:Gary
ID: 39633777
You may want to look at doing async cURL calls; at least then, if one is slow to finish, your other calls aren't waiting for it.
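
A minimal sketch of what concurrent requests could look like with PHP's curl_multi_* functions. The helper name fetchAll() and its return convention (false for a failed or empty transfer) are assumptions made for illustration and are not part of the class posted in this thread.

```php
<?php
/**
 * Fetch several URLs concurrently with the curl_multi_* functions.
 * fetchAll() is a hypothetical helper name, not part of the posted class.
 * Returns an array keyed like $urls; a failed or empty transfer maps to false.
 */
function fetchAll(array $urls, $timeout = 60)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            // Wait for socket activity instead of spinning in a tight loop.
            curl_multi_select($multi, 1.0);
        }
    } while ($running && $status === CURLM_OK);

    $results = array();
    foreach ($handles as $key => $ch) {
        $body = curl_multi_getcontent($ch);
        $results[$key] = ($body === null || $body === '') ? false : $body;
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Called as fetchAll(array('argos' => $argosUrl, 'debenhams' => $debenhamsUrl)), with both variables standing in for real URLs, this would return the two page sources under the same keys. It only changes how the pages are downloaded; the parsing step is untouched.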
 

Author Comment

by:SheppardDigital
ID: 39633805
Ok, looks like increasing the execution time didn't work.

Here's the code for the page

<?php
ini_set('max_execution_time', 3600);
ini_set('memory_limit', '1024M');

include __DCMS_PATH . '/app/third_party/simplehtmldom_1_5/simple_html_dom.php';

class scrapeController {

    public function getSource($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $html_data = curl_exec($ch);
        curl_close($ch);

        return $html_data;
    }

    public function argos($url) {

        $products = array();

        $max_pages = 50;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{offset}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // How many products were found?
        $found = $html->find('#categorylist span');
        $found_count = str_replace('(','',str_replace(' products)','',$found[0]->plaintext));

        // How many pages?
        $pages = ceil($found_count / 50);

        echo '<br>Scraping Argos ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            // Offset of the first product on this page (page 1 => 1, page 2 => 51, ...)
            if ($page > 1) {
                $offset = ($page - 1) * 50 + 1;
            } else {
                $offset = 1;
            }

            echo $page . '... ';

            $useURL = str_replace('{offset}',$offset,$url);

            $html = file_get_html($useURL);

            // Find products
            $items = $html->find('dl.product');

            if (count($items) > 0) {
                foreach($items as $item) {
                    // Product name
                    $name = $item->find('.title a');
                    $p['name'] = $name[0]->plaintext;
                    $p['link'] = $name[0]->href;

                    // Price
                    $price = $item->find('.price .main');
                    $p['price'] = str_replace('&pound;','',preg_replace('/\s+/', '', $price[0]->plaintext));

                    // Image
                    $image = $item->find('.image a img');
                    $p['image'] = $image[0]->src;

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 2;
                    $product->brand = '';
                    $product->name = $p['name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!($product->id > 0)) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();

                    $products[] = $product;
                }
            }

            $page++;

        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced argos');
        }

        echo 'Done';

        $html->clear();
        unset($html);
    }

    public function debenhams($url) {
        $products = array();

        $max_pages = 100;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{page}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // Products found
        $found = $html->find('#products_found h2 em');
        $found_count = str_replace(' products found','',$found[0]->plaintext);

        // How many pages?
        $pages = ceil($found_count / 20);

        echo 'Scraping Debenhams ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            echo $page . '... ';

            $useURL = str_replace('{page}',$page,$url);

            $html = file_get_html($useURL);

            // Get offers table
            $containers = $html->find('.item_container .description');

            if (count($containers) > 0) {
                foreach($containers as $container) {
                    $product = array();

                    $parent = $container->parent();

                    // Brand name
                    $brand_name = $container->find('.brand_name');

                    // Product name
                    $product_name = $container->find('.product_name');

                    // Image & link
                    $link = $parent->find('a');
                    $product_link = $link[0]->href;

                    $image = $link[0]->find('img');
                    $product_image = $image[0]->src;

                    // Price
                    $price = $parent->find('.price_now');
                    $product_price = str_replace('Now&nbsp;&pound;','',$price[0]->plaintext);

                    $price_actual = $parent->find('.price-actual');
                    $price_actual = $price_actual[0]->plaintext;
                    if ($price_actual != '') $product_price = str_replace(' &pound;','',$price_actual);

                    // Does the product price contain a space?
                    $price_parts = explode(' ',$product_price);
                    if (count($price_parts) > 0) $product_price = $price_parts[0];

                    $p = array(
                        'brand_name' => preg_replace('/\s+/', ' ', $brand_name[0]->plaintext),
                        'product_name' => preg_replace('/\s+/', ' ', $product_name[0]->plaintext),
                        'link' => 'http://www.debenhams.com' . $product_link,
                        'image' => $product_image,
                        'price' => $product_price
                    );

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 1;
                    $product->brand = $p['brand_name'];
                    $product->name = $p['product_name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!($product->id > 0)) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();
                }
            }

            unset($containers);

            $page++;
        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced at Debenhams');
        }

        $html->clear();
        unset($html);

        echo 'Done';

        return true;
    }

    public function goArgos() {
        self::argos('http://www.argos.co.uk/static/Browse/c_1/1%7Ccategory_root%7CToys%7C33006252/fs/0/p/{offset}/pp/50/r_001/6%7CAge+range%7CChild+%281-2+years%29%7C1/s/Relevance.htm');
    }

    public function goDebenhams() {
        // Boys clothes
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?sfv=Baby&ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18662%7D%2Ftrend_s%3E%7Bbaby20boys%7D%2Fgender_s%3E%7Bbaby%7D&langId=-1&sfn=AGE+AND+GENDER&storeId=10701&pn={page}');

        // Womans jumpers
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function women() {
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function index() {
        $this->goDebenhams();
        $this->goArgos();
    }

}
?>

 

Author Comment

by:SheppardDigital
ID: 39633914
Not fixed.
 
LVL 58

Expert Comment

by:Gary
ID: 39633988
Is this your whole code? I get lots of errors with it, e.g. an undefined index on this line:
$price_actual = $price_actual[0]->plaintext;
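
One way to avoid that notice is to check that find() actually matched something before reading index 0. The sketch below assumes find() returns a plain array of nodes, as it does in the posted code; firstPlaintext() is a hypothetical helper name, and the stdClass objects only stand in for SimpleHTMLDom nodes so the example runs on its own.

```php
<?php
/**
 * Return the trimmed plaintext of the first matched node, or $default when
 * find() matched nothing. firstPlaintext() is a hypothetical helper.
 */
function firstPlaintext(array $nodes, $default = '')
{
    return isset($nodes[0]) ? trim($nodes[0]->plaintext) : $default;
}

// Stand-in node object so the sketch runs without the library loaded.
$match = new stdClass();
$match->plaintext = ' 12.00 ';

echo firstPlaintext(array($match)), "\n";   // prints "12.00"
echo firstPlaintext(array(), 'n/a'), "\n";  // prints "n/a"
```

In the posted class this would replace lines such as $price_actual[0]->plaintext with firstPlaintext($parent->find('.price-actual')).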

Author Comment

by:SheppardDigital
ID: 39634289
Yep, it's the whole code, apart from the framework that's used to host it.

It's just a proof of concept, so you probably will get undefined index errors, as I've not properly defined every variable. Our server doesn't display notices, so I've not bothered, especially since I just wanted to knock together a quick sample.
 
LVL 58

Expert Comment

by:Gary
ID: 39634394
Well, when I comment out all the parts that error (like the calls to the product class, which doesn't exist), the cURL runs fine for both sites in one call.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39634517
When you're debugging something like this you always want error_reporting(E_ALL) at the top of the script.

Since this consists of only GET requests, what is the advantage of cURL over file_get_contents()?  The latter will run synchronously, ensuring completion in the order that the code shows.
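
As a rough illustration of the debugging advice above: turn on full error reporting at the top of the script and make the download step fail loudly instead of returning false. fetchOrFail() is a hypothetical name; it mirrors getSource() from the posted class but is not the author's code.

```php
<?php
// Report everything while debugging; switch display off again in production.
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Like getSource() in the posted class, but refuses to fail silently.
 * fetchOrFail() is a hypothetical name used for this sketch.
 */
function fetchOrFail($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    // curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set.
    if ($body === false) {
        throw new RuntimeException('cURL failed for ' . $url . ': ' . $error);
    }
    return $body;
}
```

With notices visible, a selector that matches nothing shows up as an undefined index message at the exact line, rather than as an object that looks empty later on.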
 

Author Comment

by:SheppardDigital
ID: 39634526
The product class exists as part of the whole framework, so when run on the server it does exist.

The CURL runs absolutely fine, but when the HTML returned by CURL is passed to the SimpleHTMLDom object, the object comes back empty.

I just can't figure it out. I've increased the execution time for the script and the memory allocation. There are no errors at all.
 
LVL 58

Expert Comment

by:Gary
ID: 39634591
After this
$html->load($this->getSource($useURL));

$html does contain the two sites.  Is this where you are saying it is empty on the second call?
 

Author Comment

by:SheppardDigital
ID: 39635369
Gary,

Yes, after that line on the second call when I perform a var_dump nothing is displayed.
 

Accepted Solution

by:SheppardDigital
ID: 39635374
OK, think I've sorted it.

In each method I was calling $html->load(), and once that was done the script looped through a number of file_get_html() calls. I've changed them all to use the object-oriented method of calling, and also inserted $html->clear(); unset($html); within the loops so the object is cleared before the next run.

That seems to have fixed it.
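
The change described above might look roughly like this sketch: one SimpleHTMLDom object per page, loaded through the object-oriented load() call and cleared inside the loop. scrapePages(), $fetch and $handlePage are hypothetical names introduced for illustration, and the small stand-in class exists only so the sketch runs without simple_html_dom.php; it is not the library's implementation.

```php
<?php
// Stand-in so the sketch runs on its own; the real class comes from simple_html_dom.php.
if (!class_exists('simple_html_dom')) {
    class simple_html_dom
    {
        public $source = '';
        public function load($str) { $this->source = $str; return $this; }
        public function find($selector) { return array(); }
        public function clear() { $this->source = ''; }
    }
}

/**
 * Parse each page with a fresh DOM object and release it before the next page.
 * $fetch is a callable such as array($this, 'getSource'); $handlePage receives
 * the loaded object. Returns the number of pages that were processed.
 */
function scrapePages(array $urls, $fetch, $handlePage)
{
    $processed = 0;

    foreach ($urls as $url) {
        $source = call_user_func($fetch, $url);
        if ($source === false || $source === '') {
            continue; // Nothing usable was downloaded for this page.
        }

        $html = new simple_html_dom();
        $html->load($source);

        call_user_func($handlePage, $html);

        // Clear inside the loop so the next page starts from a clean object.
        $html->clear();
        unset($html);

        $processed++;
    }

    return $processed;
}
```

The point of the design is that no parsed document survives past its own iteration, which matches the "cleared before the next run" change in the accepted solution.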
 

Author Closing Comment

by:SheppardDigital
ID: 40014343
Resolved this myself