PHP SimpleHTMLDom issue

Hi,

This seems to be a strange problem.

I have a class with three methods.

method1 - uses CURL to get the HTML contents of a page, passed the HTML to SimpleHTMLDom and stored relevant data in a table

method2 - is almost the same as method1, except it accesses a different URL and so pulls the data from the page in a different way, way essentially it uses CURL to get the HTML contents of a URL and pass it to SimpleHTMLDom

method3 - When executed calls method1, then method2

The problem is, if I run method1 and method2 individually, they work fine. But if I try to call method3, then it executes method1 fine, it executes method2, but the SimpleHTMLDom object is empty so the method doesn't work.

There are no error messages.

If I modify method 3 so that it calls method2 then method1 (reserve order), then it's always the 2nd method to be called that doesn't work.

Within in method I'm clearing the SimpleHTMLDom when I'm finished.

$html->clear();
unset($html);

Before passing the HTML to the SimpleHTMLDom object, I've tried outputting the HTML provided by CURL, and it's fine.

Does anyone have any ideas why this might be happening?
SheppardDigitalAsked:
Who is Participating?
 
SheppardDigitalConnect With a Mentor Author Commented:
OK, thing I've sorted it.

In each class I was calling $html->load($url); then once that was done the script was looping through a number of $html_get_source() calls. I've changed them all to use the object oriented method of calling, and also inserted a $html->clear; unset($html) within the loops so it's cleared before the next run.

That seems to have fixed it.
0
 
GaryCommented:
Some code would be helpful.
0
 
SheppardDigitalAuthor Commented:
I've figured it out (I think).

I've increased the max_execution time and it's working.
0
Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
SheppardDigitalAuthor Commented:
I've requested that this question be closed as follows:

Accepted answer: 0 points for SheppardDigital's comment #a39633761

for the following reason:

Fixed this myself
0
 
Ray PaseurCommented:
It might be as simple as omitting $this-> in one of the methods, but that's just speculation.  I'm with GaryC123. -- we are experts but not mind readers, so we would need to see the code in order to get beyond speculation.
0
 
GaryCommented:
May want to look at doing async curl calls, at least if one is slow to finish you don't have your other calls waiting for it.
0
 
SheppardDigitalAuthor Commented:
Ok, looks like increasing the execution time didn't work.

Here's the code for the page

<?php
ini_set('max_execution_time', 3600);
ini_set('memory_limit', '1024M');

include __DCMS_PATH . '/app/third_party/simplehtmldom_1_5/simple_html_dom.php';

class scrapeController {

    public function getSource($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $html_data = curl_exec($ch);
        curl_close($ch);

        return $html_data;
    }

    public function argos($url) {

        $products = array();

        $max_pages = 50;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{offset}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // How many products were found?
        $found = $html->find('#categorylist span');
        $found_count = str_replace('(','',str_replace(' products)','',$found[0]->plaintext));

        // How many pages?
        $pages = ceil($found_count / 50);

        echo '<br>Scraping Argos ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            // Offset
            if ($page > 1) {
                $offset = ($page + 1) * 50 + 1;
            } else {
                $offset = 1;
            }

            echo $page . '... ';

            $useURL = str_replace('{offset}',$offset,$url);

            $html = file_get_html($useURL);

            // Find products
            $items = $html->find('dl.product');

            if (count($items) > 0) {
                foreach($items as $item) {
                    // Product name
                    $name = $item->find('.title a');
                    $p['name'] = $name[0]->plaintext;
                    $p['link'] = $name[0]->href;

                    // Price
                    $price = $item->find('.price .main');
                    $p['price'] = str_replace('&pound;','',preg_replace('/\s+/', '', $price[0]->plaintext));

                    // Image
                    $image = $item->find('.image a img');
                    $p['image'] = $image[0]->src;

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 2;
                    $product->brand = '';
                    $product->name = $p['name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!$product->id > 0) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();

                    $products[] = $product;
                }
            }

            $page++;

        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced argos');
        }

        echo 'Done';

        $html->clear();
        unset($html);
    }

    public function debenhams($url) {
        $products = array();

        $max_pages = 100;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{page}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // Products found
        $found = $html->find('#products_found h2 em');
        $found_count = str_replace(' products found','',$found[0]->plaintext);

        // How many pages?
        $pages = ceil($found_count / 20);

        echo 'Scraping Debenhams ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            echo $page . '... ';

            $useURL = str_replace('{page}',$page,$url);

            $html = file_get_html($useURL);

            // Get offers table
            $containers = $html->find('.item_container .description');

            if (count($containers) > 0) {
                foreach($containers as $container) {
                    $product = array();

                    $parent = $container->parent();

                    // Brand name
                    $brand_name = $container->find('.brand_name');

                    // Product name
                    $product_name = $container->find('.product_name');

                    // Image & link
                    $link = $parent->find('a');
                    $product_link = $link[0]->href;

                    $image = $link[0]->find('img');
                    $product_image = $image[0]->src;

                    // Price
                    $price = $parent->find('.price_now');
                    $product_price = str_replace('Now&nbsp;&pound;','',$price[0]->plaintext);

                    $price_actual = $parent->find('.price-actual');
                    $price_actual = $price_actual[0]->plaintext;
                    if ($price_actual != '') $product_price = str_replace(' &pound;','',$price_actual);

                    // Does the product price contain a space?
                    $price_parts = explode(' ',$product_price);
                    if (count($price_parts) > 0) $product_price = $price_parts[0];

                    $p = array(
                        'brand_name' => preg_replace('/\s+/', ' ', $brand_name[0]->plaintext),
                        'product_name' => preg_replace('/\s+/', ' ', $product_name[0]->plaintext),
                        'link' => 'http://www.debenhams.com' . $product_link,
                        'image' => $product_image,
                        'price' => $product_price
                    );

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 1;
                    $product->brand = $p['brand_name'];
                    $product->name = $p['product_name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!$product->id > 0) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();
                }
            }

            unset($containers);

            $page++;
        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced at Debenhams');
        }

        $html->clear();
        unset($html);

        echo 'Done';

        return true;
    }

    public function goArgos() {
        self::argos('http://www.argos.co.uk/static/Browse/c_1/1%7Ccategory_root%7CToys%7C33006252/fs/0/p/{offset}/pp/50/r_001/6%7CAge+range%7CChild+%281-2+years%29%7C1/s/Relevance.htm');
    }

    public function goDebenhams() {
        // Boys clothes
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?sfv=Baby&ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18662%7D%2Ftrend_s%3E%7Bbaby20boys%7D%2Fgender_s%3E%7Bbaby%7D&langId=-1&sfn=AGE+AND+GENDER&storeId=10701&pn={page}');

        // Womans jumpers
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function women() {
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function index() {
        $this->goDebenhams();
        $this->goArgos();
    }

}
?>

Open in new window

0
 
SheppardDigitalAuthor Commented:
Not fixed.
0
 
GaryCommented:
Is this your whole code, because I get lots of errors with it.
e.g.
Undefined index
$price_actual = $price_actual[0]->plaintext;
0
 
SheppardDigitalAuthor Commented:
yep, its the whole code, apart from the framework that's used to host it.

it's just a proof of concept so you probably will get undefined index errors as I've not properly defined every variable, our server doesn't display notices so I've not bothered, especially since I just wanted to knock together a quick sample.
0
 
GaryCommented:
Well when I comment out all the error parts (like calling the product class which doesn't exist) then the curl runs fine for both sites in one call.
0
 
Ray PaseurCommented:
When you're debugging something like this you always want error_reporting(E_ALL) at the top of the script.

Since this consists of only GET requests, what is the advantage of cURL over file_get_contents()?  The latter will run synchronously, ensuring completion in the order that the code shows.
0
 
SheppardDigitalAuthor Commented:
The product class exists as part of the whole framework, so when ran on the server it does exist.

The CURL runs absolutely fine, but when the HTML outputted by the curl is passed to the SimpleHTMLDom object the object is returning back as an empty object.

I just can't figure it out, I've increase execution time for the script and the memory allocation. There's no errors at all.
0
 
GaryCommented:
After this
$html->load($this->getSource($useURL));

$html does contain the two sites.  Is this where you are saying it is empty on the second call?
0
 
SheppardDigitalAuthor Commented:
Gary,

Yes, after that line on the second call when I perform a var_dump nothing is displayed.
0
 
SheppardDigitalAuthor Commented:
Resolved this myself
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.