Solved

PHP SimpleHTMLDom issue

Posted on 2013-11-08
Last Modified: 2014-04-22
Hi,

This seems to be a strange problem.

I have a class with three methods.

method1 - uses cURL to get the HTML contents of a page, passes the HTML to SimpleHTMLDom and stores the relevant data in a table

method2 - is almost the same as method1, except that it accesses a different URL and so pulls the data from the page in a different way. Essentially, it uses cURL to get the HTML contents of a URL and passes it to SimpleHTMLDom

method3 - When executed calls method1, then method2

The problem is, if I run method1 and method2 individually, they work fine. But if I try to call method3, then it executes method1 fine, it executes method2, but the SimpleHTMLDom object is empty so the method doesn't work.

There are no error messages.

If I modify method3 so that it calls method2 then method1 (reverse order), then it's always the second method to be called that doesn't work.

Within each method I'm clearing the SimpleHTMLDom object when I'm finished.

$html->clear();
unset($html);

Before passing the HTML to the SimpleHTMLDom object, I've tried outputting the HTML provided by CURL, and it's fine.

Does anyone have any ideas why this might be happening?
Question by:SheppardDigital
16 Comments
 
LVL 58

Expert Comment

by:Gary
ID: 39633748
Some code would be helpful.
 

Author Comment

by:SheppardDigital
ID: 39633761
I've figured it out (I think).

I've increased max_execution_time and it's working.
 

Author Comment

by:SheppardDigital
ID: 39633913
I've requested that this question be closed as follows:

Accepted answer: 0 points for SheppardDigital's comment #a39633761

for the following reason:

Fixed this myself
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39633765
It might be as simple as omitting $this-> in one of the methods, but that's just speculation.  I'm with GaryC123. -- we are experts but not mind readers, so we would need to see the code in order to get beyond speculation.
 
LVL 58

Expert Comment

by:Gary
ID: 39633777
May want to look at doing async curl calls, at least if one is slow to finish you don't have your other calls waiting for it.
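As a rough illustration of that suggestion, here is a minimal sketch of concurrent fetching with PHP's curl_multi functions. The helper name fetchAll() and its parameters are assumptions made for this example and are not part of the posted class; it also leaves SSL verification at cURL's defaults.

```php
<?php
/**
 * Fetch several URLs concurrently with curl_multi.
 * Returns an array keyed like $urls; a failed transfer maps to null.
 * fetchAll() is a hypothetical helper, not part of the original class.
 */
function fetchAll(array $urls, int $timeoutSeconds = 60): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for socket activity instead of spinning the CPU.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Note which transfers finished with an error code.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $body = curl_multi_getcontent($ch);
        $ok = !in_array($ch, $failed, true) && is_string($body);
        $results[$key] = $ok ? $body : null;
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Each returned body could then be handed to SimpleHTMLDom one at a time, so a slow page only delays its own result.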
 

Author Comment

by:SheppardDigital
ID: 39633805
Ok, looks like increasing the execution time didn't work.

Here's the code for the page

<?php
ini_set('max_execution_time', 3600);
ini_set('memory_limit', '1024M');

include __DCMS_PATH . '/app/third_party/simplehtmldom_1_5/simple_html_dom.php';

class scrapeController {

    public function getSource($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $html_data = curl_exec($ch);
        curl_close($ch);

        return $html_data;
    }

    public function argos($url) {

        $products = array();

        $max_pages = 50;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{offset}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // How many products were found?
        $found = $html->find('#categorylist span');
        $found_count = str_replace('(','',str_replace(' products)','',$found[0]->plaintext));

        // How many pages?
        $pages = ceil($found_count / 50);

        echo '<br>Scraping Argos ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            // Offset
            if ($page > 1) {
                $offset = ($page - 1) * 50 + 1; // page 2 starts at product 51
            } else {
                $offset = 1;
            }

            echo $page . '... ';

            $useURL = str_replace('{offset}',$offset,$url);

            $html = file_get_html($useURL);

            // Find products
            $items = $html->find('dl.product');

            if (count($items) > 0) {
                foreach($items as $item) {
                    // Product name
                    $name = $item->find('.title a');
                    $p['name'] = $name[0]->plaintext;
                    $p['link'] = $name[0]->href;

                    // Price
                    $price = $item->find('.price .main');
                    $p['price'] = str_replace('&pound;','',preg_replace('/\s+/', '', $price[0]->plaintext));

                    // Image
                    $image = $item->find('.image a img');
                    $p['image'] = $image[0]->src;

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 2;
                    $product->brand = '';
                    $product->name = $p['name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!($product->id > 0)) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();

                    $products[] = $product;
                }
            }

            $page++;

        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced argos');
        }

        echo 'Done';

        $html->clear();
        unset($html);
    }

    public function debenhams($url) {
        $products = array();

        $max_pages = 100;

        $reduced_products = 0;

        // Get page 1
        $useURL = str_replace('{page}',1,$url);

        $html = new simple_html_dom();
        $html->load($this->getSource($useURL));

        // Products found
        $found = $html->find('#products_found h2 em');
        $found_count = str_replace(' products found','',$found[0]->plaintext);

        // How many pages?
        $pages = ceil($found_count / 20);

        echo 'Scraping Debenhams ...';

        $page = 1;
        while ($page <= $pages && $page <= $max_pages) {

            echo $page . '... ';

            $useURL = str_replace('{page}',$page,$url);

            $html = file_get_html($useURL);

            // Get offers table
            $containers = $html->find('.item_container .description');

            if (count($containers) > 0) {
                foreach($containers as $container) {
                    $product = array();

                    $parent = $container->parent();

                    // Brand name
                    $brand_name = $container->find('.brand_name');

                    // Product name
                    $product_name = $container->find('.product_name');

                    // Image & link
                    $link = $parent->find('a');
                    $product_link = $link[0]->href;

                    $image = $link[0]->find('img');
                    $product_image = $image[0]->src;

                    // Price
                    $price = $parent->find('.price_now');
                    $product_price = str_replace('Now&nbsp;&pound;','',$price[0]->plaintext);

                    $price_actual = $parent->find('.price-actual');
                    $price_actual = $price_actual[0]->plaintext;
                    if ($price_actual != '') $product_price = str_replace(' &pound;','',$price_actual);

                    // Does the product price contain a space?
                    $price_parts = explode(' ',$product_price);
                    if (count($price_parts) > 0) $product_price = $price_parts[0];

                    $p = array(
                        'brand_name' => preg_replace('/\s+/', ' ', $brand_name[0]->plaintext),
                        'product_name' => preg_replace('/\s+/', ' ', $product_name[0]->plaintext),
                        'link' => 'http://www.debenhams.com' . $product_link,
                        'image' => $product_image,
                        'price' => $product_price
                    );

                    $product = new product();
                    $product->getByIdentifier(md5($p['link']));
                    $product->retailer_id = 1;
                    $product->brand = $p['brand_name'];
                    $product->name = $p['product_name'];

                    if ($product->id > 0) {

                        // Reduced?
                        if ($p['price'] < $product->price_now) {
                            $product->reduced = 1;
                            $product->date_reduced = date('Y-m-d H:i:s');

                            $product->price_was = $product->price_now;

                            $reduced_products ++;
                        } elseif ($p['price'] > $product->price_now) {
                            $product->reduced = 0;
                            $product->date_reduced = '';

                            $product->price_was = $product->price_now;
                        }

                    }

                    if ($product->price != $p['price']) $product->price_now = $p['price'];
                    if (!($product->id > 0)) $product->date_added = date('Y-m-d H:i:s');
                    $product->date_updated = date('Y-m-d H:i:s');
                    $product->image_url = $p['image'];
                    $product->url = $p['link'];
                    $product->identifier = md5($product->url);
                    $product->save();
                }
            }

            unset($containers);

            $page++;
        }

        if ($reduced_products > 0) {
            mail('ss@shepparddigital.co.uk','Reduced',$reduced_products . ' have been reduced at Debenhams');
        }

        $html->clear();
        unset($html);

        echo 'Done';

        return true;
    }

    public function goArgos() {
        self::argos('http://www.argos.co.uk/static/Browse/c_1/1%7Ccategory_root%7CToys%7C33006252/fs/0/p/{offset}/pp/50/r_001/6%7CAge+range%7CChild+%281-2+years%29%7C1/s/Relevance.htm');
    }

    public function goDebenhams() {
        // Boys clothes
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?sfv=Baby&ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18662%7D%2Ftrend_s%3E%7Bbaby20boys%7D%2Fgender_s%3E%7Bbaby%7D&langId=-1&sfn=AGE+AND+GENDER&storeId=10701&pn={page}');

        // Womans jumpers
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function women() {
        $this->debenhams('http://www.debenhams.com/webapp/wcs/stores/servlet/Navigate?ps=default&catalogId=10001&lid=%2F%2Fproductsuniverse%2Fen_GB%2Fproduct_online%3DY%2Finsearch%3D1%2Fcategories%3C%7Bproductsuniverse_18661%7D%2Fcategories%3C%7Bproductsuniverse_18661_65469%7D&langId=-1&storeId=10701&pn={page}');
    }

    public function index() {
        $this->goDebenhams();
        $this->goArgos();
    }

}
?>

 

Author Comment

by:SheppardDigital
ID: 39633914
Not fixed.
 
LVL 58

Expert Comment

by:Gary
ID: 39633988
Is this your whole code, because I get lots of errors with it.
e.g.
Undefined index
$price_actual = $price_actual[0]->plaintext;
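One way to avoid that notice is to check that find() actually matched something before reading index 0. The sketch below assumes find() returns an array of nodes (empty when nothing matches); firstText() is a hypothetical helper name, not something from the posted code or the library.

```php
<?php
/**
 * Return the plaintext of the first matched node, or $default when
 * the selector matched nothing. firstText() is a hypothetical helper.
 */
function firstText(array $nodes, string $default = ''): string
{
    if (!isset($nodes[0])) {
        return $default;
    }
    return (string) $nodes[0]->plaintext;
}

// Inside the scraper loop this would replace the direct [0] access, e.g.:
// $price_actual = firstText($parent->find('.price-actual'));
```

With a guard like this, a product without a .price-actual element yields an empty string instead of an "Undefined index" notice.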
 

Author Comment

by:SheppardDigital
ID: 39634289
Yep, it's the whole code, apart from the framework that's used to host it.

It's just a proof of concept, so you probably will get undefined index errors as I've not properly defined every variable. Our server doesn't display notices so I've not bothered, especially since I just wanted to knock together a quick sample.
 
LVL 58

Expert Comment

by:Gary
ID: 39634394
Well when I comment out all the error parts (like calling the product class which doesn't exist) then the curl runs fine for both sites in one call.
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39634517
When you're debugging something like this you always want error_reporting(E_ALL) at the top of the script.
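For reference, a minimal version of that setup might look like the following. The display_errors line is an addition for this sketch, on the assumption that the server hides notices by default.

```php
<?php
// Report every notice, warning and error while debugging.
error_reporting(E_ALL);

// Show the messages in the output; on a production server,
// logging them is usually the safer choice.
ini_set('display_errors', '1');
```

With these two lines at the top of the script, notices such as "Undefined index" are no longer hidden.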

Since this consists of only GET requests, what is the advantage of cURL over file_get_contents()?  The latter will run synchronously, ensuring completion in the order that the code shows.
 

Author Comment

by:SheppardDigital
ID: 39634526
The product class exists as part of the whole framework, so when run on the server it does exist.

The cURL call runs absolutely fine, but when the HTML returned by cURL is passed to the SimpleHTMLDom object, the object comes back empty.

I just can't figure it out. I've increased the execution time for the script and the memory allocation. There are no errors at all.
 
LVL 58

Expert Comment

by:Gary
ID: 39634591
After this
$html->load($this->getSource($useURL));

$html does contain the two sites.  Is this where you are saying it is empty on the second call?
 

Author Comment

by:SheppardDigital
ID: 39635369
Gary,

Yes, after that line on the second call when I perform a var_dump nothing is displayed.
 

Accepted Solution

by:
SheppardDigital earned 0 total points
ID: 39635374
OK, think I've sorted it.

In each method I was calling $html->load($url); and once that was done the script was looping through a number of file_get_html() calls. I've changed them all to use the object-oriented way of calling, and also inserted $html->clear(); unset($html); within the loops so the object is cleared before the next run.

That seems to have fixed it.
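For anyone hitting the same thing, the pattern described above can be sketched roughly as follows. This is not the exact code from the project: scrapePages() and its two callables are hypothetical names chosen so the loop can be shown (and tried out) without the framework. With SimpleHTMLDom, $parse would create a new simple_html_dom, call load() on it and return it, and $fetch would be the getSource() method.

```php
<?php
/**
 * Scrape each page with a fresh DOM object and release it before the
 * next page is loaded. scrapePages() is a hypothetical helper.
 *
 * $fetch: function (string $url) returning the HTML string, or false
 * $parse: function (string $html) returning an object with find() and clear()
 */
function scrapePages(string $urlTemplate, int $pages, callable $fetch, callable $parse, string $selector): array
{
    $found = [];

    for ($page = 1; $page <= $pages; $page++) {
        $url = str_replace('{page}', (string) $page, $urlTemplate);

        $source = $fetch($url);
        if (!is_string($source) || $source === '') {
            continue; // Nothing usable came back for this page.
        }

        $html = $parse($source);
        foreach ($html->find($selector) as $node) {
            // Copy plain values out; node objects are not kept after clear().
            $found[] = (string) $node->plaintext;
        }

        // Release this page's DOM before the next iteration.
        $html->clear();
        unset($html);
    }

    return $found;
}

// With SimpleHTMLDom the parser callable could look like this (assumption):
// $parse = function ($source) {
//     $dom = new simple_html_dom();
//     $dom->load($source);
//     return $dom;
// };
```

The point of the design is that every page gets its own object and clear() runs once per page, rather than once at the end of the method on whichever object happened to be last.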
 

Author Closing Comment

by:SheppardDigital
ID: 40014343
Resolved this myself