SheppardDigital

PHP SimpleHTMLDom issue

Hi,

This seems to be a strange problem.

I have a class with three methods.

method1 - uses CURL to get the HTML contents of a page, passes the HTML to SimpleHTMLDom and stores the relevant data in a table

method2 - is almost the same as method1, except it accesses a different URL and so pulls the data from the page in a different way. Essentially, it uses CURL to get the HTML contents of a URL and passes it to SimpleHTMLDom

method3 - When executed calls method1, then method2

The problem is, if I run method1 and method2 individually, they work fine. But if I try to call method3, then it executes method1 fine, it executes method2, but the SimpleHTMLDom object is empty so the method doesn't work.

There are no error messages.

If I modify method3 so that it calls method2 then method1 (reverse order), then it's always the 2nd method to be called that doesn't work.

Within each method I'm clearing the SimpleHTMLDom object when I'm finished.

$html->clear();
unset($html);

Before passing the HTML to the SimpleHTMLDom object, I've tried outputting the HTML provided by CURL, and it's fine.

Does anyone have any ideas why this might be happening?
Gary

Some code would be helpful.
SheppardDigital

ASKER

I've figured it out (I think).

I've increased max_execution_time and it's working.
I've requested that this question be closed as follows:

Accepted answer: 0 points for SheppardDigital's comment #a39633761

for the following reason:

Fixed this myself
It might be as simple as omitting $this-> in one of the methods, but that's just speculation. I'm with GaryC123: we are experts but not mind readers, so we would need to see the code in order to get beyond speculation.
You may want to look at doing async cURL calls; at least then, if one is slow to finish, your other calls aren't left waiting for it.
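
For what it's worth, a minimal sketch of the async idea using PHP's curl_multi_* functions might look like the following. The function name fetchAll and its parameters are made up for illustration and are not part of the code in this thread; it simply starts all transfers together and collects each body once every transfer has finished.

```php
<?php
/**
 * Fetch several URLs concurrently (hypothetical helper, not from the thread).
 * Returns an array keyed like $urls; each value is the body, or false when
 * nothing usable came back for that URL.
 */
function fetchAll(array $urls, int $timeout = 60): array
{
    if ($urls === []) {
        return [];
    }

    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            // Wait for activity instead of spinning in a tight loop.
            curl_multi_select($multi, 1.0);
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $body = curl_multi_getcontent($ch);
        $results[$key] = ($body === null || $body === '') ? false : $body;
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Whether this helps depends on the target sites; it only removes the waiting between requests and does not change how the HTML is parsed afterwards.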
Ok, looks like increasing the execution time didn't work.

Here's the code for the page

<?php
ini_set('max_execution_time', 3600);
ini_set('memory_limit', '1024M');

include __DCMS_PATH . '/app/third_party/simplehtmldom_1_5/simple_html_dom.php';

class scrapeController {

    public function getSource($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $html_data = curl_exec($ch);
        curl_close($ch);

        return $html_data;
    }

    public function argos($url) {

        $products = array();

        $max_pages = 50;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{offset}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // How many products were found?
        $found = $html->find('#categorylist span');
        $found_count = str_replace('(','',str_replace(' products)','',$found[0]->plaintext));

        // How many pages?
        $pages = ceil($found_count / 50);

        echo '<br>Scraping Argos ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            // Offset
            if ($page > 1) {
                // With 50 products per page, page 2 starts at product 51, page 3 at 101, and so on
                $offset = ($page - 1) * 50 + 1;
            } else {
                $offset = 1;
            }

            echo $page . '... ';

            $useURL = str_replace('{offset}',$offset,$url);

            $html = file_get_html($useURL);

            // Find products
            $items = $html->find('dl.product');

            if (count($items) > 0) {
                foreach($items as $item) {
                    // Product name
                    $name = $item->find('.title a');
                    $p['name'] = $name[0]->plaintext;
                    $p['link'] = $name[0]->href;

                    // Price
                    $price = $item->find('.price .main');
                    $p['price'] = str_replace('&pound;','',preg_replace('/\s+/', '', $price[0]->plaintext));

                    // Image
                    $image = $item->find('.image a img');
                    $p['image'] = $image[0]->src;

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 2;
                    $product->brand = '';
                    $product->name = $p['name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!$product->id > 0) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();

                    $products[] = $product;
                }
            }

            $page++;

        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced argos');
        }

        echo 'Done';

        $html->clear();
        unset($html);
    }

    public function debenhams($url) {
        $products = array();

        $max_pages = 100;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{page}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // Products found
        $found = $html->find('#products_found h2 em');
        $found_count = str_replace(' products found','',$found[0]->plaintext);

        // How many pages?
        $pages = ceil($found_count / 20);

        echo 'Scraping Debenhams ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            echo $page . '... ';

            $useURL = str_replace('{page}',$page,$url);

            $html = file_get_html($useURL);

            // Get offers table
            $containers = $html->find('.item_container .description');

            if (count($containers) > 0) {
                foreach($containers as $container) {
                    $product = array();

                    $parent = $container->parent();

                    // Brand name
                    $brand_name = $container->find('.brand_name');

                    // Product name
                    $product_name = $container->find('.product_name');

                    // Image & link
                    $link = $parent->find('a');
                    $product_link = $link[0]->href;

                    $image = $link[0]->find('img');
                    $product_image = $image[0]->src;

                    // Price
                    $price = $parent->find('.price_now');
                    $product_price = str_replace('Now&nbsp;&pound;','',$price[0]->plaintext);

                    $price_actual = $parent->find('.price-actual');
                    $price_actual = $price_actual[0]->plaintext;
                    if ($price_actual != '') $product_price = str_replace(' &pound;','',$price_actual);

                    // Does the product price contain a space?
                    $price_parts = explode(' ',$product_price);
                    if (count($price_parts) > 0) $product_price = $price_parts[0];

                    $p = array(
                        'brand_name' => preg_replace('/\s+/', ' ', $brand_name[0]->plaintext),
                        'product_name' => preg_replace('/\s+/', ' ', $product_name[0]->plaintext),
                        'link' => 'http://www.debenhams.com' . $product_link,
                        'image' => $product_image,
                        'price' => $product_price
                    );

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 1;
                    $product->brand = $p['brand_name'];
                    $product->name = $p['product_name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!$product->id > 0) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();
                }
            }

            unset($containers);

            $page++;
        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced at Debenhams');
        }

        $html->clear();
        unset($html);

        echo 'Done';

        return true;
    }

    public function goArgos() {
        self::argos('http://www.argos.co.uk/static/Browse/c_1/1%7Ccategory_root%7CToys%7C33006252/fs/0/p/{offset}/pp/50/r_001/6%7CAge+range%7CChild+%281-2+years%29%7C1/s/Relevance.htm');
    }

    public function goDebenhams() {
        // Boys clothes
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?sfv=Baby&ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18662%7D%2Ftrend_s%3E%7Bbaby20boys%7D%2Fgender_s%3E%7Bbaby%7D&langId=-1&sfn=AGE+AND+GENDER&storeId=10701&pn={page}');

        // Womans jumpers
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function women() {
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function index() {
        $this->goDebenhams();
        $this->goArgos();
    }

}
?>

Not fixed.
Is this your whole code, because I get lots of errors with it.
e.g.
Undefined index
$price_actual = $price_actual[0]->plaintext;
Yep, it's the whole code, apart from the framework that's used to host it.

It's just a proof of concept, so you probably will get undefined index errors because I've not properly defined every variable. Our server doesn't display notices, so I've not bothered, especially since I just wanted to knock together a quick sample.
Well when I comment out all the error parts (like calling the product class which doesn't exist) then the curl runs fine for both sites in one call.
When you're debugging something like this you always want error_reporting(E_ALL) at the top of the script.
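
As a rough sketch of that advice, something like this at the very top of the script would surface notices and warnings. Turning on display_errors is an assumption that this runs on a development box rather than a live server.

```php
<?php
// Report every notice, warning and error while debugging.
error_reporting(E_ALL);

// Print them in the output (assumption: development environment only).
ini_set('display_errors', '1');

// Also write them to the error log, in case the framework swallows output.
ini_set('log_errors', '1');
```

With notices visible, an "Undefined index" or a call on a non-object shows up at the line where it happens instead of failing silently.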

Since this consists of only GET requests, what is the advantage of cURL over file_get_contents()?  The latter will run synchronously, ensuring completion in the order that the code shows.
The product class exists as part of the whole framework, so when run on the server it does exist.

The CURL runs absolutely fine, but when the HTML returned by CURL is passed to the SimpleHTMLDom object, the object comes back empty.

I just can't figure it out. I've increased the execution time for the script and the memory allocation. There are no errors at all.
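
One way to rule out the fetch step with certainty is to fail loudly when the response is unusable, rather than passing it straight to the parser. The helper below is only a sketch; the name requireHtml is hypothetical and it assumes the input is whatever curl_exec() returned (a string, or false on failure).

```php
<?php
/**
 * Return the fetched body unchanged, or throw if there is nothing to parse.
 * Hypothetical helper: $body is expected to be the return value of curl_exec().
 */
function requireHtml($body, string $url): string
{
    if ($body === false) {
        // curl_exec() returns false when the transfer itself failed.
        throw new RuntimeException("Request failed for $url");
    }
    if (!is_string($body) || trim($body) === '') {
        throw new RuntimeException("Empty response from $url");
    }
    return $body;
}
```

It could be used as `$html->load(requireHtml($this->getSource($useURL), $useURL));`, so that an empty or failed download raises an exception at the point of the fetch.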
After this
$html->load($this->getSource($useURL));

$html does contain the two sites.  Is this where you are saying it is empty on the second call?
Gary,

Yes, after that line on the second call when I perform a var_dump nothing is displayed.
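
As a cross-check, the same string could be run through PHP's built-in DOMDocument, which is a different parser from SimpleHTMLDom, to see whether the markup itself parses into any elements. This is only a diagnostic sketch; countElements is a made-up name and the result says nothing about why SimpleHTMLDom returns an empty object.

```php
<?php
/**
 * Count the elements DOMDocument finds in an HTML string.
 * Hypothetical diagnostic helper using PHP's DOM extension, not SimpleHTMLDom.
 */
function countElements(string $source): int
{
    if (trim($source) === '') {
        return 0;
    }

    // Real-world pages are rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($source);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->getElementsByTagName('*')->length;
}
```

If this reports plenty of elements for the second site while SimpleHTMLDom still comes back empty, the download is fine and the problem sits in how the SimpleHTMLDom object is created or reused.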
ASKER CERTIFIED SOLUTION
SheppardDigital

Resolved this myself