Solved

How to crawl the details from other sites

Posted on 2010-11-16
Last Modified: 2012-06-21
Hi experts,

I need to crawl data from another site into an Excel sheet. For example, I need to get an e-commerce site's details such as product price, product image, product description, author, and all other details for every product. How can I crawl the data using PHP and get it into an Excel sheet? I need to crawl all the data from the site.

Could you please help me with this?

Thanks in advance.
Question by:RajeshKanna
9 Comments
 
LVL 7

Expert Comment

by:armchang
ID: 34143605
Hi,

You can use this code for getting details.

$s_Source = file_get_contents($s_URL);

Example:

<?php
$s_ThisPagesHTML = file_get_contents('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_23244643.html');
?>

Once you have the html (or whatever the content structure is), you can use string manipulation or regular expressions to extract the necessary data.
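As a minimal sketch of the regular-expression approach, the snippet below pulls two fields out of an HTML string. The markup, the class names (title, price) and the values are assumptions made up for illustration; a real page will need its own patterns, and regular expressions tend to break when the markup changes.

```php
<?php
// Hypothetical HTML fragment standing in for a page fetched with file_get_contents().
$html = '<div class="product"><h2 class="title">Photoshop Elements 2</h2>'
      . '<span class="price">824</span></div>';

// Capture the number inside the element with class="price".
$price = null;
if (preg_match('/<span class="price">\s*([0-9.]+)\s*<\/span>/', $html, $m)) {
    $price = $m[1];
}

// Capture the title the same way, then strip any nested tags and decode entities.
$title = null;
if (preg_match('/<h2 class="title">(.*?)<\/h2>/s', $html, $m)) {
    $title = trim(html_entity_decode(strip_tags($m[1])));
}

echo "$title costs $price\n";
```

The point is that the fetch step only gives you the raw page; the extraction step is what turns it into individual values.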
 

Author Comment

by:RajeshKanna
ID: 34143766
hi armchang,

Thanks for the quick reply. I had already tried your code; it displays the whole page of the website. I also tried cURL, and the same thing happened. Could you please help with any other method?
 
LVL 2

Accepted Solution

by:
johnwarde earned 500 total points
ID: 34143828
Hi Rajesh,

What you are trying to do is a technique called "screen scraping". I've done it in the past using regular expressions and string manipulation, as suggested by armchang. However, there are now more resources on the web to make the process easier. Here are two:

Simple HTML DOM
This parses the HTML into a Document Object Model, and from there you can use jQuery-like filters to pull out parts of a page.  Have a look at all four tabs at the following URL:
http://simplehtmldom.sourceforge.net/
Download API/php classes:
http://sourceforge.net/projects/simplehtmldom/files/
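To illustrate the DOM-based idea without requiring the download, here is a rough sketch that swaps in PHP's built-in DOMDocument and DOMXPath classes instead of Simple HTML DOM. The HTML, class names and the first_text() helper are hypothetical; the output array loosely follows the kind of structure a product scraper might return.

```php
<?php
// Hypothetical product listing; real markup will differ from site to site.
$html = <<<HTML
<html><body>
  <div class="product">
    <h2 class="title">Photoshop Elements 2</h2>
    <span class="author">Denis Graham</span>
    <span class="author">Mike Wooldridge</span>
    <span class="price">824</span>
    <p class="summary">You already know Photoshop Elements 2 basics.</p>
  </div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first node matching $query under $context, or null.
function first_text(DOMXPath $xpath, $query, DOMNode $context)
{
    $nodes = $xpath->query($query, $context);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

$results = array();
foreach ($xpath->query('//div[@class="product"]') as $product) {
    // A product may list several authors, so collect them all.
    $authors = array();
    foreach ($xpath->query('.//span[@class="author"]', $product) as $node) {
        $authors[] = trim($node->textContent);
    }
    $results[] = array(
        'title'   => first_text($xpath, './/h2[@class="title"]', $product),
        'price'   => first_text($xpath, './/span[@class="price"]', $product),
        'summary' => first_text($xpath, './/p[@class="summary"]', $product),
        'author'  => $authors,
    );
}

print_r($results);
```

The XPath queries play the role of the jQuery-like filters; with Simple HTML DOM the selectors would be CSS-style instead.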

ScraperWiki
This is a hosted solution where you write a screen scraper in PHP, Python or Ruby on the site, then you make an API call to your scraper on ScraperWiki; it will go off, fetch the page, and return the parsed data to your script. You can use the file_get_contents() function to make that call.
http://scraperwiki.com/
http://scraperwiki.com/help/tutorials/
http://scraperwiki.com/api/1.0

The other technique you may want to use is "website crawling", where you follow links in the HTML to other pages to parse. This can be done as part of your screen scraping, but if you are parsing a large website you might want to separate the two processes by using a crawler: the crawler just looks for links in a page and hands off the parsing of the page to a dedicated function/object.
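The separation between crawling and parsing could look roughly like the sketch below. Everything here is an assumption for illustration: the three-page "site" is an in-memory array so the example runs without network access, and fetch_page(), extract_links() and parse_page() are hypothetical names. In real use, fetch_page() would call file_get_contents() or cURL, and links would need to be resolved to absolute URLs and restricted to the target domain.

```php
<?php
// Hypothetical site held in memory: URL => HTML.
$site = array(
    '/'        => '<a href="/books">Books</a> <a href="/about">About</a>',
    '/books'   => '<a href="/">Home</a> <a href="/books/1">Book 1</a>',
    '/about'   => '<p>About us</p>',
    '/books/1' => '<h2 class="title">Photoshop Elements 2</h2>',
);

// Stand-in for the real HTTP fetch; returns false for unknown pages.
function fetch_page(array $site, $url)
{
    return isset($site[$url]) ? $site[$url] : false;
}

// Crawler half: only finds the links in a page.
function extract_links($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Parser half: extracts data from a page; returns null when there is nothing to extract.
function parse_page($html)
{
    return preg_match('/<h2 class="title">(.*?)<\/h2>/s', $html, $m) ? trim($m[1]) : null;
}

$queue  = array('/');          // pages still to visit
$seen   = array('/' => true);  // pages already queued, so loops between pages terminate
$titles = array();

while ($queue) {
    $url  = array_shift($queue);
    $html = fetch_page($site, $url);
    if ($html === false) {
        continue;
    }
    $title = parse_page($html);
    if ($title !== null) {
        $titles[$url] = $title;
    }
    foreach (extract_links($html) as $link) {
        if (!isset($seen[$link])) {
            $seen[$link] = true;
            $queue[] = $link;
        }
    }
}

print_r($titles);
```

Keeping the link discovery and the page parsing in separate functions means the parser can be changed per site without touching the crawl loop.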

As every website is different, you may need to use different techniques. Use the following terms in your favourite search engine for more help:

php screen scraping
php web crawler

John
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34145631
Please post the ACTUAL URL of the site you want to extract data from.  And tell us what data you want to extract.  Then we may be able to give you something more than theoretical answers.  Thank you.

In practical terms, you have no other methods to get this information, other than file_get_contents and its variants, or CURL.  Here is a CURL example that works for me.  CURL has the advantage that you can easily control the timeout.  With file_get_contents() and its default settings, if the foreign site is down your script can blow up for excessive execution time.

One final note... Please be sure you have the endorsement of the publisher before you use a web site in this manner.  Many web sites publish explicit terms of service that prohibit automated access -- they are meant for human readers only.  Many of those that want to permit automated access offer an API to facilitate the access.  Just make sure you're on firm legal ground so you don't get banned or sued.

Best regards, ~Ray
<?php // RAY_curl_example.php
error_reporting(E_ALL);

// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}

// USAGE EXAMPLE - PUT YOUR FAVORITE URL HERE
$url = "http://finance.yahoo.com/d/quotes.csv?s=lulu&f=snl1c1ohgvt1";
$htm = my_curl($url);
if (!$htm) die("NO $url");

// SHOW WHAT WE GOT
echo "<pre>";
echo htmlentities($htm);


Author Comment

by:RajeshKanna
ID: 34152354
hi ray,

Thanks for your reply. With your code I can only get the HTML of a particular web page. I need the content from the web site. For example, if an e-commerce site has books, I need to get the values as

Array
(
    [results] => Array
        (
            [0] => Array
                (
                    [mrp] => 936
                    [price] => 824
                    [saving] => 112
                    [discount] => 12
                    [status] => In Stock
                    [summary] => You already know Photoshop Elements 2 basics. Now you ' d like to go beyond, with shortcuts, tricks, and tips that let you work smarter and faster. And because you learn more easily when someone shows you how, this is the book for you. Inside, you' ll find clear, illustrated... Moreinstructions for 100 tasks that reveal cool secrets, teach timesaving tricks, and explain great tips guaranteed to make you more productive with Photoshop Elements 2. less
                    [eid] => 65W3FH3NZC
                    [title] => Photoshop Elements 2:
                    [link] => /photoshop-elements-denis-graham-mike-book-0764543539/search-book-mike-wooldridge/1
                    [author] => Array
                        (
                            [0] => Denis Graham
                            [1] =>  Mike Wooldridge And Others
                        )

                )

That way I can take the description of the book, author, title and everything else that is available for the book.

Can you please help with this?

Hi johnwarde,

I worked with your code and found that the particular web page opens inside my page under my URL. I think you can understand what I need exactly. Could you please help with this?


Thanks
 

Author Comment

by:RajeshKanna
ID: 34152553
Can anybody please help with this?
 

Author Closing Comment

by:RajeshKanna
ID: 34153068
Million thanks
 
LVL 2

Expert Comment

by:johnwarde
ID: 34153916
Hi Rajesh,

Thanks for the points!

As I mentioned in my solution above, every website is different, and it is impossible to provide code that can parse all kinds of e-commerce websites. You also haven't provided a sample page URL. You need to drill down into the HTML to extract the data you need.
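Once the fields have been extracted, getting them into a spreadsheet (the original goal of the question) can be as simple as writing a CSV file, which Excel opens directly. The sketch below assumes the scraper has already produced an array of rows; the field names and values are made up for illustration, and a temporary file stands in for the real output path.

```php
<?php
// Hypothetical rows, as a scraper might produce them after parsing the HTML.
$results = array(
    array(
        'title'  => 'Photoshop Elements 2',
        'price'  => 824,
        'mrp'    => 936,
        'author' => array('Denis Graham', 'Mike Wooldridge'),
    ),
);

// Temporary file used for the example; use your own path in practice.
$path = tempnam(sys_get_temp_dir(), 'products');
$fh   = fopen($path, 'w');

// Header row first, then one line per product.
// Delimiter, enclosure and escape are passed explicitly to avoid relying on defaults.
fputcsv($fh, array('title', 'price', 'mrp', 'author'), ',', '"', '\\');
foreach ($results as $row) {
    fputcsv($fh, array(
        $row['title'],
        $row['price'],
        $row['mrp'],
        implode('; ', $row['author']),   // flatten the author list into one cell
    ), ',', '"', '\\');
}
fclose($fh);

echo file_get_contents($path);
```

Nested values such as the author list have to be flattened somehow; joining them with a separator is one option, a second sheet or file is another.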

I would also echo Ray's advice that you need to get permission from the website that you are crawling and parsing to use their data.

John
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34154795
Question posted: 11/16/10 04:18 AM.  I asked you for the "ACTUAL URL of the site you want to extract data from" and got no response.  Question closed less than 24 hours later.  

Going forward, you will find that you get better answers from EE when you (1) respond to the Experts' requests for specific facts - we ask those questions so we can give you practical answers (not just speculative and theoretical references) and (2) leave the questions open for a couple of days.  Your Experts are located all around the globe and it takes 24 hours to make a day.  If it's worth your time to ask and our time to answer, two or three days is a minimum amount of time to leave a question open.  Many of us have family, work and other commitments and do not check our email at night.  Just a thought.