RajeshKanna (India)

asked on

How to crawl the details from other sites

Hi experts,

I need to crawl data from another site into an Excel sheet. For example, I need to get an e-commerce site's details such as product price, product image, product description, author, and all other details of all products. How can I crawl the data using PHP and get it into an Excel sheet? I need to crawl all the data from the site.

Could you please help me with this?

Thanks in advance.
Armand G (New Zealand)

Hi,

You can use this code to fetch the page contents:

$s_Source = file_get_contents($s_URL);

Example:

<?php
$s_ThisPagesHTML = file_get_contents('https://www.experts-exchange.com/questions/23244643/massive-URL-crawling-with-PHP.html');
?>

Once you have the html (or whatever the content structure is), you can use string manipulation or regular expressions to extract the necessary data.
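For example, here is a minimal sketch of that kind of extraction with regular expressions. The URL is just a placeholder and the patterns are only illustrative; real pages usually need more careful parsing.

<?php
// Minimal sketch: fetch a page and pull out the <title> tag and all image
// URLs with regular expressions. The URL below is a placeholder.
$s_URL    = 'https://www.example.com/';
$s_Source = file_get_contents($s_URL);

if ($s_Source === false) {
    die("Could not fetch $s_URL");
}

// Extract the page title
if (preg_match('#<title>(.*?)</title>#is', $s_Source, $m)) {
    echo "Title: " . trim($m[1]) . PHP_EOL;
}

// Extract all image sources
if (preg_match_all('#<img[^>]+src=["\']([^"\']+)["\']#i', $s_Source, $m)) {
    foreach ($m[1] as $src) {
        echo "Image: $src" . PHP_EOL;
    }
}
?>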
RajeshKanna (Asker)

Hi armchang,

Thanks for the quick reply. I tried your code, but it displays the whole page of the website; the same thing happened when I tried cURL. Could you please suggest another method?
ASKER CERTIFIED SOLUTION
johnwarde (Ireland)

[The accepted answer is visible to registered members only.]
Please post the ACTUAL URL of the site you want to extract data from.  And tell us what data you want to extract.  Then we may be able to give you something more than theoretical answers.  Thank you.

In practical terms, you have no other methods to get this information, other than file_get_contents and its variants, or CURL.  Here is a CURL example that works for me.  CURL has the advantage that you can control the timeout.  With file_get_contents(), if the foreign site is down your script blows up for excessive execution time.

One final note... Please be sure you have the endorsement of the publisher before you use a web site in this manner.  Many web sites publish explicit terms of service that prohibit automated access -- they are meant for human readers only.  Many of those that want to permit automated access offer an API to facilitate the access.  Just make sure you're on firm legal ground so you don't get banned or sued.

Best regards, ~Ray
<?php // RAY_curl_example.php
error_reporting(E_ALL);

// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}




// USAGE EXAMPLE - PUT YOUR FAVORITE URL HERE
$url = "http://finance.yahoo.com/d/quotes.csv?s=lulu&f=snl1c1ohgvt1";
$htm = my_curl($url);
if (!$htm) die("NO $url");


// SHOW WHAT WE GOT
echo "<pre>";
echo htmlentities($htm);


Hi Ray,

Thanks for your reply. With your code I can only get the HTML of a particular web page. I need the content from the web site; for example, if an e-commerce site has books, I need to get values like:

Array
(
    [results] => Array
        (
            [0] => Array
                (
                    [mrp] => 936
                    [price] => 824
                    [saving] => 112
                    [discount] => 12
                    [status] => In Stock
                    [summary] => You already know Photoshop Elements 2 basics. Now you'd like to go beyond, with shortcuts, tricks, and tips that let you work smarter and faster. And because you learn more easily when someone shows you how, this is the book for you. Inside, you'll find clear, illustrated instructions for 100 tasks that reveal cool secrets, teach timesaving tricks, and explain great tips guaranteed to make you more productive with Photoshop Elements 2.
                    [eid] => 65W3FH3NZC
                    [title] => Photoshop Elements 2:
                    [link] => /photoshop-elements-denis-graham-mike-book-0764543539/search-book-mike-wooldridge/1
                    [author] => Array
                        (
                            [0] => Denis Graham
                            [1] =>  Mike Wooldridge And Others
                        )

                )

        )

)
That way I can get the description of the book, the author, the title, and everything else available about the book.

Can you please help with this?

Hi johnwarde,

I worked with your code and found that the web page opens inside my page under my own URL. I think you can understand what I actually need. Could you please help with this?


Thanks
Can anybody please help with this?
A million thanks.
Hi Rajesh,

Thanks for the points!

As I mentioned in my solution above, every website is different, so it is impossible to provide code that can parse every kind of e-commerce website, and you haven't provided a sample URL. You need to drill down into the HTML of the specific site to extract the data you need.
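As an illustration of that kind of drilling down, here is a rough sketch using PHP's DOMDocument and DOMXPath, with fputcsv() to produce a CSV file that Excel can open. The URL, class names, and XPath queries are hypothetical; you would need to inspect the real site's markup and adjust them.

<?php
// Rough sketch only: the URL, the CSS classes and the XPath queries below
// are hypothetical -- inspect the real site's HTML and adjust them.
error_reporting(E_ALL);

$url  = 'https://www.example.com/books';   // placeholder URL
$html = file_get_contents($url);           // or use Ray's my_curl() above
if ($html === false) die("Could not fetch $url");

$dom = new DOMDocument();
libxml_use_internal_errors(true);          // real-world HTML is rarely valid
$dom->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($dom);
$results = array();

// Assumes each product sits in <div class="product"> -- purely hypothetical
foreach ($xpath->query('//div[@class="product"]') as $node) {
    $titleNode  = $xpath->query('.//h2[@class="title"]',    $node)->item(0);
    $priceNode  = $xpath->query('.//span[@class="price"]',  $node)->item(0);
    $authorNode = $xpath->query('.//span[@class="author"]', $node)->item(0);

    $results[] = array(
        'title'  => $titleNode  ? trim($titleNode->textContent)  : '',
        'price'  => $priceNode  ? trim($priceNode->textContent)  : '',
        'author' => $authorNode ? trim($authorNode->textContent) : '',
    );
}

// Write the rows to a CSV file, which Excel can open directly
$fp = fopen('products.csv', 'w');
fputcsv($fp, array('title', 'price', 'author'));   // header row
foreach ($results as $row) {
    fputcsv($fp, $row);
}
fclose($fp);

echo count($results) . " products written to products.csv\n";
?>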

I would also echo Ray's point that you need to get permission from the website you are crawling and parsing before using their data.

John
Question posted: 11/16/10 04:18 AM.  I asked you for the "ACTUAL URL of the site you want to extract data from" and got no response.  Question closed less than 24 hours later.  

Going forward, you will find that you get better answers from EE when you (1) respond to the Experts' requests for specific facts - we ask those questions so we can give you practical answers (not just speculative and theoretical references) and (2) leave the questions open for a couple of days.  Your Experts are located all around the globe and it takes 24 hours to make a day.  If it's worth your time to ask and our time to answer, two or three days is a minimum amount of time to leave a question open.  Many of us have family, work and other commitments and do not check our email at night.  Just a thought.