Solved

How to crawl the details from other sites

Posted on 2010-11-16
1,108 Views
Last Modified: 2012-06-21
Hi experts,

I need to crawl data from another site into an Excel sheet. For example, I need to get an e-commerce site's details such as product price, product image, product description, author, and all other details of all products. How can I crawl the data using PHP and get it into an Excel sheet? I need to crawl all the data from the site.

Could you please help me with this?

Thanks in advance.
Question by:RajeshKanna
9 Comments
 
LVL 7

Expert Comment

by:armchang
ID: 34143605
Hi,

You can use this code to fetch a page's contents:

$s_Source = file_get_contents($s_URL);

Example:

<?php
$s_ThisPagesHTML = file_get_contents('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_23244643.html');
?>

Once you have the html (or whatever the content structure is), you can use string manipulation or regular expressions to extract the necessary data.
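For example, here is a minimal sketch of the regex approach. The URL and the pattern are hypothetical -- you would inspect the target site's HTML and adjust the pattern to match its actual markup.

<?php
// Fetch the page (hypothetical URL)
$s_Source = file_get_contents('http://www.example.com/products.html');

if ($s_Source !== FALSE)
{
    // Capture the text inside every <span class="price">...</span>
    // (a hypothetical markup pattern -- adapt it to the real site)
    preg_match_all('#<span class="price">(.*?)</span>#i', $s_Source, $matches);
    print_r($matches[1]);
}
?>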

Author Comment

by:RajeshKanna
ID: 34143766
Hi armchang,

Thanks for the quick reply. I tried your code, but it returns the whole page of the website, and the same thing happened when I tried cURL. Could you please suggest another method?
LVL 2

Accepted Solution

by:
johnwarde earned 500 total points
ID: 34143828
Hi Rajesh,

What you are trying to do is a technique called "screen scraping". I've done it in the past using regular expressions and string manipulation, as armchang suggested; however, there are now more resources on the web that make the process easier. Here are two:

Simple HTML DOM
This parses the HTML into a Document Object Model, and from there you can use jQuery-like selectors to pull out parts of a page (a brief sketch follows the links below). Have a look at all four tabs at the next URL:
http://simplehtmldom.sourceforge.net/
Download API/php classes:
http://sourceforge.net/projects/simplehtmldom/files/
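
For illustration, here is a minimal sketch of the Simple HTML DOM approach, assuming you have downloaded simple_html_dom.php from the link above. The selectors ('div.product', 'h2.title', 'span.price') are hypothetical -- inspect the target site's HTML to find the real ones.

<?php
// file_get_html() is provided by the Simple HTML DOM library
include_once('simple_html_dom.php');

$html = file_get_html('http://www.example.com/products.html');

// Use jQuery-like selectors to walk the parsed document
foreach ($html->find('div.product') as $product)
{
    $title = $product->find('h2.title', 0)->plaintext;
    $price = $product->find('span.price', 0)->plaintext;
    echo $title . ' => ' . $price . "\n";
}
?>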

ScraperWiki
This is a hosted solution where you write a screen scraper in PHP, Python or Ruby on their site. You then make an API call to your scraper on ScraperWiki, which goes off, fetches the page, and returns the parsed data to your script; you can use the file_get_contents() function to do this (a sketch follows the links below).
http://scraperwiki.com/
http://scraperwiki.com/help/tutorials/
http://scraperwiki.com/api/1.0
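
As a sketch only: the endpoint URL below is a placeholder, not the real ScraperWiki API format -- check the API documentation linked above for the actual endpoint and parameters for your scraper.

<?php
// Call the hosted scraper over HTTP (placeholder URL)
$json = file_get_contents('http://scraperwiki.com/api/1.0/YOUR-SCRAPER-ENDPOINT');
if ($json !== FALSE)
{
    // Assumes the scraper returns its data as JSON
    $data = json_decode($json, TRUE);
    print_r($data);
}
?>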

The other technique you may want to use is "website crawling", where you follow links in the HTML to other pages to parse. This can be done as part of your screen scraping, but if you are parsing a large website you might want to separate the two processes by using a crawler, where the crawler just looks for links in a page and hands off the parsing of each page to a dedicated function/object. A minimal sketch follows.
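
Here is a rough sketch of that crawler/parser split. parse_page() is a stub for your own extraction code, and the link-matching pattern is deliberately simple -- a real crawler would also handle relative URLs, politeness delays, and robots.txt.

<?php
// Stub: replace with your own extraction logic (regex, Simple HTML DOM, ...)
function parse_page($url, $html)
{
    echo "Parsed: $url\n";
}

function crawl($start, $max_pages = 10)
{
    $queue   = array($start);
    $visited = array();

    while ($queue && count($visited) < $max_pages)
    {
        $url = array_shift($queue);
        if (isset($visited[$url])) continue;
        $visited[$url] = TRUE;

        $html = @file_get_contents($url);
        if ($html === FALSE) continue;

        // Hand the page off to the dedicated parsing function
        parse_page($url, $html);

        // Queue absolute links that stay on the same host
        preg_match_all('#href="(http://[^"]+)"#i', $html, $m);
        foreach ($m[1] as $link)
        {
            if (parse_url($link, PHP_URL_HOST) == parse_url($start, PHP_URL_HOST))
            {
                $queue[] = $link;
            }
        }
    }
}

crawl('http://www.example.com/');
?>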

As every website is different, you may need to use different techniques. Try the following terms in your favourite search engine for further help:

php screen scraping
php web crawler

John
LVL 109

Expert Comment

by:Ray Paseur
ID: 34145631
Please post the ACTUAL URL of the site you want to extract data from.  And tell us what data you want to extract.  Then we may be able to give you something more than theoretical answers.  Thank you.

In practical terms, you have no methods to get this information other than file_get_contents() and its variants, or cURL. Here is a cURL example that works for me. cURL has the advantage that you can control the timeout; with file_get_contents(), if the foreign site is down, your script blows up with an excessive-execution-time error.

One final note... Please be sure you have the endorsement of the publisher before you use a web site in this manner.  Many web sites publish explicit terms of service that prohibit automated access -- they are meant for human readers only.  Many of those that want to permit automated access offer an API to facilitate the access.  Just make sure you're on firm legal ground so you don't get banned or sued.

Best regards, ~Ray
<?php // RAY_curl_example.php
error_reporting(E_ALL);

// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}




// USAGE EXAMPLE - PUT YOUR FAVORITE URL HERE
$url = "http://finance.yahoo.com/d/quotes.csv?s=lulu&f=snl1c1ohgvt1";
$htm = my_curl($url);
if (!$htm) die("NO $url");


// SHOW WHAT WE GOT
echo "<pre>";
echo htmlentities($htm);


Author Comment

by:RajeshKanna
ID: 34152354
Hi Ray,

Thanks for your reply. With your code I can only get the raw HTML of a particular web page. I need the structured content from the site; for example, if an e-commerce site has books, I need to get the values as:

Array
(
    [results] => Array
        (
            [0] => Array
                (
                    [mrp] => 936
                    [price] => 824
                    [saving] => 112
                    [discount] => 12
                    [status] => In Stock
                    [summary] => You already know Photoshop Elements 2 basics. Now you ' d like to go beyond, with shortcuts, tricks, and tips that let you work smarter and faster. And because you learn more easily when someone shows you how, this is the book for you. Inside, you' ll find clear, illustrated... Moreinstructions for 100 tasks that reveal cool secrets, teach timesaving tricks, and explain great tips guaranteed to make you more productive with Photoshop Elements 2. less
                    [eid] => 65W3FH3NZC
                    [title] => Photoshop Elements 2:
                    [link] => /photoshop-elements-denis-graham-mike-book-0764543539/search-book-mike-wooldridge/1
                    [author] => Array
                        (
                            [0] => Denis Graham
                            [1] =>  Mike Wooldridge And Others
                        )

                )

        )

)

That way I can take the description of the book, the author, the title, and everything else that is available about the book.

Can you please help with this?

Hi johnwarde,

I worked with your code and found that the web page just opens inside my own page under my URL; I think you can see what I actually need. Could you please help with this?


Thanks

Author Comment

by:RajeshKanna
ID: 34152553
Can anybody please help with this?

Author Closing Comment

by:RajeshKanna
ID: 34153068
Million thanks
LVL 2

Expert Comment

by:johnwarde
ID: 34153916
Hi Rajesh,

Thanks for the points!

As I mentioned in my solution above, every website is different, so it is impossible to provide code that can parse every kind of e-commerce website, and you haven't provided a sample URL. You need to drill down into the HTML of each page to extract the data you need; a rough starting point follows below.
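
As a rough starting point only, here is a sketch that extracts a title and price per product with a hypothetical regex (adapt it to the real site's markup) and writes them to a CSV file, which Excel opens directly:

<?php
$html = file_get_contents('http://www.example.com/books.html');

// Hypothetical pattern: one match per product block -- adjust to the real markup
preg_match_all('#<h2 class="title">(.*?)</h2>.*?<span class="price">(.*?)</span>#is',
               $html, $rows, PREG_SET_ORDER);

// Write the extracted fields to a CSV file that Excel can open
$fp = fopen('products.csv', 'w');
fputcsv($fp, array('title', 'price'));
foreach ($rows as $row)
{
    fputcsv($fp, array(strip_tags($row[1]), strip_tags($row[2])));
}
fclose($fp);

echo count($rows) . " products written to products.csv\n";
?>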

I would also echo Ray's point that you need to get permission from the website you are crawling and parsing to use their data.

John
LVL 109

Expert Comment

by:Ray Paseur
ID: 34154795
Question posted: 11/16/10 04:18 AM.  I asked you for the "ACTUAL URL of the site you want to extract data from" and got no response.  Question closed less than 24 hours later.  

Going forward, you will find that you get better answers from EE when you (1) respond to the Experts' requests for specific facts - we ask those questions so we can give you practical answers (not just speculative and theoretical references) and (2) leave the questions open for a couple of days.  Your Experts are located all around the globe and it takes 24 hours to make a day.  If it's worth your time to ask and our time to answer, two or three days is a minimum amount of time to leave a question open.  Many of us have family, work and other commitments and do not check our email at night.  Just a thought.