Pulling/Scraping from other sites

Hi,
I have been tasked with pulling properties from another site. We have their permission to do this, but I am not sure of the best way to approach it. I would like to copy the content into our own database with some sort of incremental update.
Has anyone done this kind of task before? What is the best approach to take?

Many thanks,
ScorchD Asked:
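
For the incremental update part of the question, one possible approach is to store a hash of each scraped record and compare it on the next run, so only new or changed records touch the database. This is only a sketch under assumptions: the record ids (`p1`, `p2`, ...), the field names, and the helper names `recordHash()` and `planSync()` are hypothetical and do not come from the thread, and the actual database writes are left out.

```php
<?php
// Hash a record so a later scrape can be compared against it cheaply.
function recordHash(array $record): string
{
    ksort($record); // field order must not affect the hash
    return sha1(json_encode($record));
}

// Compare stored hashes (id => hash) with freshly scraped records (id => fields)
// and decide what needs to be inserted, updated, left alone, or deleted.
function planSync(array $storedHashes, array $scraped): array
{
    $plan = ['insert' => [], 'update' => [], 'unchanged' => [], 'delete' => []];

    foreach ($scraped as $id => $record) {
        $hash = recordHash($record);
        if (!array_key_exists($id, $storedHashes)) {
            $plan['insert'][$id] = $hash;      // never seen before
        } elseif ($storedHashes[$id] !== $hash) {
            $plan['update'][$id] = $hash;      // content changed since last run
        } else {
            $plan['unchanged'][] = $id;        // nothing to write
        }
    }

    // Ids we have stored that no longer appear on the source site.
    $plan['delete'] = array_keys(array_diff_key($storedHashes, $scraped));

    return $plan;
}

// Hypothetical data: what the last run stored vs. what this run scraped.
$stored = [
    'p1' => recordHash(['title' => 'Flat A', 'price' => 100]),
    'p2' => 'stale-hash',
    'p3' => 'old-hash',
];
$scraped = [
    'p1' => ['title' => 'Flat A', 'price' => 100],
    'p2' => ['title' => 'Flat B', 'price' => 250],
    'p4' => ['title' => 'Flat D', 'price' => 300],
];

$plan = planSync($stored, $scraped);
print_r($plan);
```

The plan can then drive ordinary INSERT, UPDATE and DELETE statements, with the new hash saved alongside each row for the next run.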

unknown_routine Commented:
You can use PHP to read the contents of a page:

$html = file_get_contents('http://somesite.com/questions/ask');

Then you can apply the appropriate code to extract data from $html, which is a string.
COBOLdinosaur Commented:
Once you have the string in a variable you will need to parse it to extract what you want, but without knowing the nature of the source input and what needs to be extracted, it is not possible to give much detail about what the parser needs to look like.

Cd&
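
As a rough illustration of the parsing step, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The sample markup, the `post_title` class, and the helper name `extractLinks()` are assumptions for the example only, since the real page structure has not been shown.

```php
<?php
// Hypothetical sample markup standing in for the fetched $html string.
$html = '<div class="post_title"><a href="/p/1">First</a></div>'
      . '<div class="post_title"><a href="/p/2">Second</a></div>';

// Return href/text pairs for every link inside elements carrying $class.
// Note: $class is placed into the XPath as-is, so pass only trusted values.
function extractLinks(string $html, string $class): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]//a[@href]',
        $class
    );

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

$links = extractLinks($html, 'post_title');
print_r($links);
```

The XPath expression is the usual way to match a single class name among several, which a plain `@class="post_title"` comparison would miss.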
ScorchD (Author) Commented:
OK, I have the following, but my first implementation does not work. It seems to make sense, but in practice it does not recurse: it prints the same URL until it times out. I would expect it to recurse, because the first URL it finds is correct and the HTML is the same on subsequent pages.

function getArticles($page) {
    global $articles, $descriptions;
    
    $html = new simple_html_dom();
    $html->load_file($page);
    
    $items = $html->find('.post_title');  
    
    foreach($items as $post) {
        # remember comments count as nodes
        $articles[] = array($post->children(0)->outertext);
    }

    # lets see if there's a next page
    if($next = $html->find('#search-results-footer img[alt=Next]', 0)) {
        $URL = $next->parent()->href;
        echo "going on to $URL <<<\n";
        # memory leak clean up
        $html->clear();
        unset($html);
        
        getArticles($URL);
    }
}

As you can see, the site in question does not put an id or class on the a tag, so I am using its child img tag to find the href. This works for the first instance.

Any help is greatly appreciated, I understand where I am going in principle but a working example is always a big relief.
ScorchD (Author) Commented:
OK, rather obvious problem: there is nothing wrong with the above code. The site uses session cookies, and without them the page jumps back to the start of the list. So although my function is recursing, the site keeps serving the same page, hence the results.

Any thoughts?
COBOLdinosaur Commented:
I am not familiar with the SourceForge object you are using, and you have not shown the format of the input or what you are attempting to capture.

At this point all I see is a script that probably should work; but it is like a boat on land. It should float but I won't know if it leaks until it is in the water.

Show me what you are trying to process and what you want from it.

Cd&
ScorchD (Author) Commented:
Thanks, please see my comment above. I do have the script working, but the site requires a session cookie for it to work, so I am specifically looking for a cURL or similar mechanism that handles the session cookie. Pagination does not work without the cookie, so I am halfway there I suppose, but I have no experience of using cURL and cookies for this type of operation.

Many thanks
Ray Paseur Commented:
What is the URL of the page you want to scrape?  What is the data you want to acquire?
ScorchD (Author) Commented:
Hi Ray, a site I am having this issue with is www.somesite.com. If you browse it you will see a session cookie being set; removing the cookie and then navigating results in a loss of position, which is why my code above does not work.

Many thanks
Ray Paseur Commented:
Why not accept and return the cookies?  You can do this with cURL.  I believe that this setup will enable you to see the cookies, but of course it will have to be integrated with the rest of your code.  And that may not be very easy.  It appears that the site has been deliberately constructed to foil screen-scraping schemes.  I believe it uses JavaScript to produce much of what you see on the browser viewport, therefore the ability to accept and return the cookies is only one part of the process.  Your script will have to interpret and run JavaScript, too.  In other words, you need a browser to use this site.

<?php // RAY_temp_scorchd.php
error_reporting(E_ALL);


// DEMONSTRATE THE BASICS OF CURL


// A URL TO SCRAPE
$url = 'http://www.somesite.com';

// TRY THE REMOTE WEB SERVICE
$htm = my_curl($url);
$dat = file_get_contents('cookie.txt');

// SHOW THE WORK PRODUCT OR BARK OUT ERROR MESSAGES
echo "<pre>";
echo PHP_EOL . "<b>$url</b>";
echo PHP_EOL . "<i>$dat</i>";
echo PHP_EOL;
echo PHP_EOL . htmlentities($htm);

// ACTIVATE THIS TO SEE THE HTML STRING
// echo PHP_EOL . htmlentities($htm);


// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl
( $url
, $timeout=3
, $error_report=TRUE
)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20100101 Firefox/22.0'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_COOKIEFILE,     'cookie.txt' );
    curl_setopt( $curl, CURLOPT_COOKIEJAR,      'cookie.txt' );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}
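
One way the cookie handling might be integrated with the pagination is sketched below. It is not a drop-in solution: `makeCookieFetcher()` and `crawl()` are hypothetical names, the XPath queries simply mirror the `.post_title` and `#search-results-footer img[alt=Next]` selectors from the earlier script, the "Next" href is assumed to be an absolute URL, and nothing here runs JavaScript. The fetcher reuses a single cURL handle so the session cookie set by the first response is sent back on every later request. The demo at the bottom uses in-memory pages from a made-up host instead of the real site.

```php
<?php
// Build a fetcher that keeps cookies for the lifetime of one cURL handle.
function makeCookieFetcher(int $timeout = 10): callable
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string enables the in-memory cookie engine
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    return function (string $url) use ($curl) {
        curl_setopt($curl, CURLOPT_URL, $url);
        return curl_exec($curl); // string on success, false on failure
    };
}

// Follow "Next" links iteratively, collecting the text of each .post_title node.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $titles = [];
    $seen   = [];
    $url    = $startUrl;

    // Stop on a repeated URL so a site that keeps serving page one cannot loop forever.
    while ($url !== null && !isset($seen[$url]) && count($seen) < $maxPages) {
        $seen[$url] = true;
        $html = $fetch($url);
        if ($html === false || $html === '') {
            break;
        }

        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        $xpath = new DOMXPath($doc);

        $q = '//*[contains(concat(" ", normalize-space(@class), " "), " post_title ")]';
        foreach ($xpath->query($q) as $node) {
            $titles[] = trim($node->textContent);
        }

        // The link has no id or class, so locate it through its child img.
        $next = $xpath->query('//*[@id="search-results-footer"]//a[img[@alt="Next"]]/@href')->item(0);
        $url  = $next ? $next->nodeValue : null;
    }
    return $titles;
}

// Offline demo with hypothetical pages; swap in makeCookieFetcher() for real requests.
$pages = [
    'http://example.test/1' => '<div class="post_title">A</div><div id="search-results-footer">'
        . '<a href="http://example.test/2"><img alt="Next"></a></div>',
    'http://example.test/2' => '<div class="post_title">B</div>',
];
$fake = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};
$titles = crawl('http://example.test/1', $fake);
print_r($titles);
```

For the real site the call would look like `crawl($url, makeCookieFetcher())`. Passing the fetcher in as a callable keeps the paging logic testable without network access.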
