Solved

How to crawl the details from other sites

Posted on 2010-11-16
Medium Priority
1,121 Views
Last Modified: 2012-06-21
Hi experts,

I need to crawl data from another site into an Excel sheet. For example, I need to get an e-commerce site's details such as product price, product image, product description, author, and all other details for every product. How can I crawl the data using PHP and get it into an Excel sheet? I need to crawl all the data from the site.

Could you please help me with this?

Thanks in advance.
0
Comment
Question by:RajeshKanna
9 Comments
 
LVL 7

Expert Comment

by:armchang
ID: 34143605
Hi,

You can fetch a page's HTML with file_get_contents():

$s_Source = file_get_contents($s_URL);

Example:

<?php
$s_ThisPagesHTML = file_get_contents('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_23244643.html');
?>

Once you have the HTML (or whatever the content structure is), you can use string manipulation or regular expressions to extract the necessary data.
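For instance, here is a minimal sketch of pulling values out with regular expressions. The HTML fragment, class names, and patterns below are hypothetical; real product markup will differ, so adapt the patterns to the actual page source.

```php
<?php
// Hypothetical fragment of a product page -- real markup will differ
$s_Source = '<div class="product"><span class="title">Photoshop Elements 2</span><span class="price">824</span></div>';

// Pull out the title and price with regular expressions
preg_match('/<span class="title">(.*?)<\/span>/', $s_Source, $m_Title);
preg_match('/<span class="price">(.*?)<\/span>/', $s_Source, $m_Price);

echo $m_Title[1] . "\n";   // Photoshop Elements 2
echo $m_Price[1] . "\n";   // 824
```

Regular expressions work fine for small, stable page fragments, but they get brittle as markup changes; for anything larger, a real HTML parser is the safer choice.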
0
 

Author Comment

by:RajeshKanna
ID: 34143766
hi armchang,

Thanks for replying so soon. I tried your code; it displays the whole page of the website. The same thing happened when I tried cURL. Could you please suggest another method?
0
 
LVL 2

Accepted Solution

by:
johnwarde earned 2000 total points
ID: 34143828
Hi Rajesh,

What you are trying to do is a technique called "screen scraping". I've done it in the past using regular expressions and string manipulation, as armchang suggested, but there are now more resources on the web that make the process easier. Here are two:

Simple HTML DOM
This parses the HTML into a Document Object Model, and from there you can use jQuery-like filters to pick out parts of a page. Have a look at all four tabs at the next URL:
http://simplehtmldom.sourceforge.net/
Download API/php classes:
http://sourceforge.net/projects/simplehtmldom/files/
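Simple HTML DOM has to be downloaded separately; if you would rather avoid a third-party library, PHP's built-in DOMDocument and DOMXPath can do the same kind of selection. A sketch on made-up markup (the class name and structure are invented for illustration):

```php
<?php
// Made-up product markup; real pages will differ
$html = '<ul><li class="price">824</li><li class="price">936</li></ul>';

$doc = new DOMDocument();
@$doc->loadHTML($html);            // @ suppresses warnings on sloppy real-world HTML
$xpath = new DOMXPath($doc);

// Select every <li class="price"> node and collect its text
$prices = array();
foreach ($xpath->query('//li[@class="price"]') as $node) {
    $prices[] = $node->textContent;
}
// $prices now holds array('824', '936')
print_r($prices);
```

The XPath expression plays the role of Simple HTML DOM's `find('li.price')` style selector, so the approach carries over either way.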

ScraperWiki
This is a hosted solution where you write a screen scraper in PHP, Python, or Ruby on the site. You then call your scraper through the ScraperWiki API; it goes off, fetches the page, and returns the parsed data to your script. You can use the file_get_contents() function to do this.
http://scraperwiki.com/
http://scraperwiki.com/help/tutorials/
http://scraperwiki.com/api/1.0

The other technique you may want to use is "website crawling", where you follow links in the HTML to other pages to parse. This can be done as part of your screen scraping, but if you are parsing a large website you might want to separate the two processes: the crawler just looks for links in a page and hands the parsing of each page off to a dedicated function/object.
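The link-gathering half of a crawler can be sketched with DOMDocument as well: collect every href on a page into a queue, and hand the page body to your parser. The snippet below works on an inline string stand-in; in practice you would feed it the HTML you fetched, and deduplicate and resolve relative URLs before queueing.

```php
<?php
// Inline stand-in for a fetched page; a real crawler would fetch this HTML first
$page = '<a href="/books/1">Book 1</a><a href="/books/2">Book 2</a>';

$doc = new DOMDocument();
@$doc->loadHTML($page);

// Collect every link on the page into a crawl queue
$queue = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $queue[] = $a->getAttribute('href');
}
// $queue now holds array('/books/1', '/books/2')
print_r($queue);
```

A real crawler would loop: pop a URL off the queue, fetch it, parse it for data, and push any new links back on, keeping a visited list to avoid fetching the same page twice.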

As every website is different/unique, you may need to use different techniques. Use the following terms in your favourite search engine for further help:

php screen scraping
php web crawler

John
0

 
LVL 111

Expert Comment

by:Ray Paseur
ID: 34145631
Please post the ACTUAL URL of the site you want to extract data from.  And tell us what data you want to extract.  Then we may be able to give you something more than theoretical answers.  Thank you.

In practical terms, you have no other methods to get this information than file_get_contents() and its variants, or CURL. Here is a CURL example that works for me. CURL has the advantage that you can control the timeout; with file_get_contents(), if the foreign site is down, your script blows up from excessive execution time.

One final note... Please be sure you have the endorsement of the publisher before you use a web site in this manner.  Many web sites publish explicit terms of service that prohibit automated access -- they are meant for human readers only.  Many of those that want to permit automated access offer an API to facilitate the access.  Just make sure you're on firm legal ground so you don't get banned or sued.

Best regards, ~Ray
<?php // RAY_curl_example.php
error_reporting(E_ALL);

// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}




// USAGE EXAMPLE - PUT YOUR FAVORITE URL HERE
$url = "http://finance.yahoo.com/d/quotes.csv?s=lulu&f=snl1c1ohgvt1";
$htm = my_curl($url);
if (!$htm) die("NO $url");


// SHOW WHAT WE GOT
echo "<pre>";
echo htmlentities($htm);


0
 

Author Comment

by:RajeshKanna
ID: 34152354
hi ray,

Thanks for your reply. With your code I only get the HTML source of a particular web page. I need the content from the site. For example, if an e-commerce site has books, I need to get the values as:

Array
(
    [results] => Array
        (
            [0] => Array
                (
                    [mrp] => 936
                    [price] => 824
                    [saving] => 112
                    [discount] => 12
                    [status] => In Stock
                    [summary] => You already know Photoshop Elements 2 basics. Now you'd like to go beyond, with shortcuts, tricks, and tips that let you work smarter and faster. And because you learn more easily when someone shows you how, this is the book for you. Inside, you'll find clear, illustrated instructions for 100 tasks that reveal cool secrets, teach timesaving tricks, and explain great tips guaranteed to make you more productive with Photoshop Elements 2.
                    [eid] => 65W3FH3NZC
                    [title] => Photoshop Elements 2:
                    [link] => /photoshop-elements-denis-graham-mike-book-0764543539/search-book-mike-wooldridge/1
                    [author] => Array
                        (
                            [0] => Denis Graham
                            [1] =>  Mike Wooldridge And Others
                        )

                )

So that I can take the description of the book, the author, the title, and everything else available about the book.

Can you please help with this?

Hi johnwarde,

I worked with your code and found that the particular web page opens inside my page under my URL, so I think you can understand what I actually need. Could you please help with this?


Thanks
0
 

Author Comment

by:RajeshKanna
ID: 34152553
Could anybody please help with this?
0
 

Author Closing Comment

by:RajeshKanna
ID: 34153068
Million thanks
0
 
LVL 2

Expert Comment

by:johnwarde
ID: 34153916
Hi Rajesh,

Thanks for the points!

As I mentioned in my solution above, every website is different/unique, and it is impossible to provide code that can parse all kinds of e-commerce websites; you also haven't provided a sample URL. You need to drill down into the HTML to extract the data you need.

I would also echo Ray's point that you need permission from the website you are crawling/parsing to use their data.
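On your original Excel requirement: once you have the scraped data in an array, the simplest route is to write a CSV file, which Excel opens directly. A sketch with fputcsv(), using made-up field names and values:

```php
<?php
// Hypothetical scraped rows -- field names are illustrative only
$books = array(
    array('title' => 'Photoshop Elements 2', 'author' => 'Denis Graham', 'price' => 824),
);

// Write a CSV file that Excel can open directly
$fp = fopen('books.csv', 'w');
fputcsv($fp, array('title', 'author', 'price'));   // header row
foreach ($books as $book) {
    fputcsv($fp, $book);                           // one row per book
}
fclose($fp);
```

fputcsv() handles the quoting and escaping for you, so titles containing commas or quotes come out safely.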

John
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 34154795
Question posted: 11/16/10 04:18 AM.  I asked you for the "ACTUAL URL of the site you want to extract data from" and got no response.  Question closed less than 24 hours later.  

Going forward, you will find that you get better answers from EE when you (1) respond to the Experts' requests for specific facts - we ask those questions so we can give you practical answers (not just speculative and theoretical references) and (2) leave the questions open for a couple of days.  Your Experts are located all around the globe and it takes 24 hours to make a day.  If it's worth your time to ask and our time to answer, two or three days is a minimum amount of time to leave a question open.  Many of us have family, work and other commitments and do not check our email at night.  Just a thought.
0
