Solved

PHP SimpleHTMLDom issue

Posted on 2013-11-08
Last Modified: 2014-04-22
Hi,

This seems to be a strange problem.

I have a class with three methods.

method1 - uses CURL to get the HTML contents of a page, passes the HTML to SimpleHTMLDom and stores the relevant data in a table

method2 - is almost the same as method1, except that it accesses a different URL and so pulls the data from the page in a different way. Essentially, it uses CURL to get the HTML contents of a URL and passes it to SimpleHTMLDom

method3 - When executed calls method1, then method2

The problem is, if I run method1 and method2 individually, they work fine. But if I try to call method3, then it executes method1 fine, it executes method2, but the SimpleHTMLDom object is empty so the method doesn't work.

There are no error messages.

If I modify method3 so that it calls method2 then method1 (reverse order), it's still always the second method called that doesn't work.

Within each method I'm clearing the SimpleHTMLDom object when I'm finished.

$html->clear();
unset($html);

Before passing the HTML to the SimpleHTMLDom object, I've tried outputting the HTML provided by CURL, and it's fine.

Does anyone have any ideas why this might be happening?
Question by:SheppardDigital
16 Comments
 
LVL 58

Expert Comment

by:Gary
ID: 39633748
Some code would be helpful.
 

Author Comment

by:SheppardDigital
ID: 39633761
I've figured it out (I think).

I've increased the max_execution time and it's working.
 

Author Comment

by:SheppardDigital
ID: 39633913
I've requested that this question be closed as follows:

Accepted answer: 0 points for SheppardDigital's comment #a39633761

for the following reason:

Fixed this myself
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39633765
It might be as simple as omitting $this-> in one of the methods, but that's just speculation.  I'm with GaryC123: we are experts but not mind readers, so we would need to see the code in order to get beyond speculation.
 
LVL 58

Expert Comment

by:Gary
ID: 39633777
May want to look at doing async curl calls, at least if one is slow to finish you don't have your other calls waiting for it.
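
One way the suggestion above could look is sketched below with PHP's curl_multi functions. The function name fetchAll() and its parameters are hypothetical and do not come from the thread's code; the sketch assumes PHP 7 or later with the cURL extension loaded, and it returns false for any transfer that cURL reports as failed.

```php
<?php
/**
 * Hypothetical helper: fetch several URLs concurrently.
 * Returns an array with the same keys as $urls; each value is the
 * response body, or false if cURL reported a failure for that URL.
 */
function fetchAll(array $urls, int $timeoutSeconds = 60): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for socket activity instead of busy-looping.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the handles whose transfer did not finish cleanly.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $body = curl_multi_getcontent($ch);
        $ok = !in_array($ch, $failed, true) && is_string($body);
        $results[$key] = $ok ? $body : false;
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Each returned body could then be handed to the parser one at a time, so a slow page does not hold up the download of the others.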
 

Author Comment

by:SheppardDigital
ID: 39633805
Ok, looks like increasing the execution time didn't work.

Here's the code for the page

<?php
ini_set('max_execution_time', 3600);
ini_set('memory_limit', '1024M');

include __DCMS_PATH . '/app/third_party/simplehtmldom_1_5/simple_html_dom.php';

class scrapeController {

    public function getSource($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $html_data = curl_exec($ch);
        curl_close($ch);

        return $html_data;
    }

    public function argos($url) {

        $products = array();

        $max_pages = 50;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{offset}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // How many products were found?
        $found = $html->find('#categorylist span');
        $found_count = str_replace('(','',str_replace(' products)','',$found[0]->plaintext));

        // How many pages?
        $pages = ceil($found_count / 50);

        echo '<br>Scraping Argos ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            // Offset
            if ($page > 1) {
                $offset = ($page - 1) * 50 + 1; // page 2 starts at item 51
            } else {
                $offset = 1;
            }

            echo $page . '... ';

            $useURL = str_replace('{offset}',$offset,$url);

            $html = file_get_html($useURL);

            // Find products
            $items = $html->find('dl.product');

            if (count($items) > 0) {
                foreach($items as $item) {
                    // Product name
                    $name = $item->find('.title a');
                    $p['name'] = $name[0]->plaintext;
                    $p['link'] = $name[0]->href;

                    // Price
                    $price = $item->find('.price .main');
                    $p['price'] = str_replace('&pound;','',preg_replace('/\s+/', '', $price[0]->plaintext));

                    // Image
                    $image = $item->find('.image a img');
                    $p['image'] = $image[0]->src;

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 2;
                    $product->brand = '';
                    $product->name = $p['name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!$product->id > 0) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();

                    $products[] = $product;
                }
            }

            $page++;

        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced argos');
        }

        echo 'Done';

        $html->clear();
        unset($html);
    }

    public function debenhams($url) {
        $products = array();

        $max_pages = 100;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{page}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // Products found
        $found = $html->find('#products_found h2 em');
        $found_count = str_replace(' products found','',$found[0]->plaintext);

        // How many pages?
        $pages = ceil($found_count / 20);

        echo 'Scraping Debenhams ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            echo $page . '... ';

            $useURL = str_replace('{page}',$page,$url);

            $html = file_get_html($useURL);

            // Get offers table
            $containers = $html->find('.item_container .description');

            if (count($containers) > 0) {
                foreach($containers as $container) {
                    $product = array();

                    $parent = $container->parent();

                    // Brand name
                    $brand_name = $container->find('.brand_name');

                    // Product name
                    $product_name = $container->find('.product_name');

                    // Image & link
                    $link = $parent->find('a');
                    $product_link = $link[0]->href;

                    $image = $link[0]->find('img');
                    $product_image = $image[0]->src;

                    // Price
                    $price = $parent->find('.price_now');
                    $product_price = str_replace('Now&nbsp;&pound;','',$price[0]->plaintext);

                    $price_actual = $parent->find('.price-actual');
                    $price_actual = $price_actual[0]->plaintext;
                    if ($price_actual != '') $product_price = str_replace(' &pound;','',$price_actual);

                    // Does the product price contain a space?
                    $price_parts = explode(' ',$product_price);
                    if (count($price_parts) > 0) $product_price = $price_parts[0];

                    $p = array(
                        'brand_name' => preg_replace('/\s+/', ' ', $brand_name[0]->plaintext),
                        'product_name' => preg_replace('/\s+/', ' ', $product_name[0]->plaintext),
                        'link' => 'http://www.debenhams.com' . $product_link,
                        'image' => $product_image,
                        'price' => $product_price
                    );

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 1;
                    $product->brand = $p['brand_name'];
                    $product->name = $p['product_name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!$product->id > 0) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();
                }
            }

            unset($containers);

            $page++;
        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced at Debenhams');
        }

        $html->clear();
        unset($html);

        echo 'Done';

        return true;
    }

    public function goArgos() {
        self::argos('http://www.argos.co.uk/static/Browse/c_1/1%7Ccategory_root%7CToys%7C33006252/fs/0/p/{offset}/pp/50/r_001/6%7CAge+range%7CChild+%281-2+years%29%7C1/s/Relevance.htm');
    }

    public function goDebenhams() {
        // Boys clothes
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?sfv=Baby&ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18662%7D%2Ftrend_s%3E%7Bbaby20boys%7D%2Fgender_s%3E%7Bbaby%7D&langId=-1&sfn=AGE+AND+GENDER&storeId=10701&pn={page}');

        // Womans jumpers
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function women() {
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function index() {
        $this->goDebenhams();
        $this->goArgos();
    }

}
?>

 

Author Comment

by:SheppardDigital
ID: 39633914
Not fixed.
 
LVL 58

Expert Comment

by:Gary
ID: 39633988
Is this your whole code, because I get lots of errors with it.
e.g.
Undefined index
$price_actual = $price_actual[0]->plaintext;
 

Author Comment

by:SheppardDigital
ID: 39634289
Yep, it's the whole code, apart from the framework that's used to host it.

It's just a proof of concept, so you probably will get undefined index errors as I've not properly defined every variable. Our server doesn't display notices so I've not bothered, especially since I just wanted to knock together a quick sample.
 
LVL 58

Expert Comment

by:Gary
ID: 39634394
Well, when I comment out all the parts that cause errors (like calling the product class, which doesn't exist), the curl runs fine for both sites in one call.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39634517
When you're debugging something like this you always want error_reporting(E_ALL) at the top of the script.
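
A minimal sketch of that setup follows. Turning on display_errors is an added assumption that only suits a development machine; on a live server the errors would normally go to a log instead.

```php
<?php
// Report every class of error, including notices such as "Undefined index".
error_reporting(E_ALL);

// Print errors in the response (assumption: development environment only).
ini_set('display_errors', '1');
```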

Since this consists of only GET requests, what is the advantage of cURL over file_get_contents()?  The latter will run synchronously, ensuring completion in the order that the code shows.
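
As a rough illustration of the file_get_contents() route, the sketch below wraps the call so that a failed or empty fetch raises an exception instead of silently passing nothing to the parser. The function name fetchSource() and the timeout default are hypothetical, and the sketch assumes PHP 7 or later. Unlike the getSource() method posted earlier, it leaves SSL verification at its defaults.

```php
<?php
/**
 * Hypothetical helper: fetch a URL with file_get_contents() and fail loudly.
 * Throws RuntimeException when the fetch fails or the body is empty.
 */
function fetchSource(string $url, int $timeoutSeconds = 60): string
{
    // Stream context options for the http/https wrappers.
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeoutSeconds,
            'follow_location' => 1,
        ],
    ]);

    // Suppress the warning here; the failure is reported via the exception.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        $err = error_get_last();
        $reason = $err['message'] ?? 'unknown error';
        throw new RuntimeException("Fetch failed for $url: $reason");
    }

    if (trim($body) === '') {
        throw new RuntimeException("Empty response from $url");
    }

    return $body;
}
```

With a check like this in place, an empty parser object can no longer be caused by an empty download going unnoticed.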
 

Author Comment

by:SheppardDigital
ID: 39634526
The product class exists as part of the whole framework, so when run on the server it does exist.

The CURL runs absolutely fine, but when the HTML output by CURL is passed to the SimpleHTMLDom object, the object comes back empty.

I just can't figure it out. I've increased the execution time for the script and the memory allocation. There are no errors at all.
 
LVL 58

Expert Comment

by:Gary
ID: 39634591
After this
$html->load($this->getSource($useURL));

$html does contain the two sites.  Is this where you are saying it is empty on the second call?
 

Author Comment

by:SheppardDigital
ID: 39635369
Gary,

Yes, after that line on the second call when I perform a var_dump nothing is displayed.
 

Accepted Solution

by:
SheppardDigital earned 0 total points
ID: 39635374
OK, think I've sorted it.

In each method I was calling $html->load() on the first page's source, then the script was looping through a number of file_get_html() calls. I've changed them all to use the object oriented method of calling, and also inserted $html->clear(); unset($html); within the loops so it's cleared before the next run.

That seems to have fixed it.
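
The fix described above might be sketched as follows. This is not the poster's final code: the function name scrapePages() and its callable parameters are hypothetical, chosen so the sketch runs without the framework. It assumes PHP 7 or later and a parser object exposing load() and clear(), as simple_html_dom does.

```php
<?php
/**
 * Hypothetical helper: parse each page with its own DOM object and free it
 * before the next page is loaded.
 *
 * $fetch     callable(string $url): string|false, e.g. the getSource() method
 * $newParser callable(): object with load() and clear() methods,
 *            e.g. function () { return new simple_html_dom(); }
 * $onPage    callable(object $dom, string $url): void, extracts the data
 *
 * Returns the number of pages that were parsed.
 */
function scrapePages(array $urls, callable $fetch, callable $newParser, callable $onPage): int
{
    $parsed = 0;
    foreach ($urls as $url) {
        $source = $fetch($url);
        if ($source === false || $source === '') {
            continue; // nothing to parse for this page
        }

        $dom = $newParser();
        $dom->load($source);
        try {
            $onPage($dom, $url);
            $parsed++;
        } finally {
            // Release the parser's internal references on every iteration,
            // not just once after the loop.
            $dom->clear();
            unset($dom);
        }
    }

    return $parsed;
}
```

The point of the design is that no parser object outlives the page it was built for, so a second scrape in the same request starts from a clean state.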
 

Author Closing Comment

by:SheppardDigital
ID: 40014343
Resolved this myself