Solved

Web Site Scraping

Posted on 2014-02-20
491 Views
Last Modified: 2016-08-12
I have often seen requests for "web site scraping". I think what is meant is the ability, programmatically on a web server, to take an entered URL and retrieve the entire content of the site as data, to analyze.

In other words, like View Source, but capturing the source as data.

How is this done?

Is there some general tool available?

Thanks
Question by:Richard Korts
13 Comments
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39874340
You can easily do this using either of these two utilities:

https://www.gnu.org/software/wget/

http://curl.haxx.se/

If you tell us a little more about exactly the way you want to use this we can be more specific.

~bp
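As a rough sketch, PHP's cURL extension (built on the same libcurl that powers the curl tool linked above) can capture a page's source as a string; the URL below is a placeholder, not a site from this thread:

```php
<?php
// Rough sketch, not a hardened fetcher: use PHP's cURL extension (the same
// libcurl behind the curl command-line tool) to capture a page's source.
// example.com is a placeholder URL.

function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

$html = fetchPage('http://example.com/');
if ($html !== null) {
    echo strlen($html) . " bytes of source captured";
}
```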
 

Author Comment

by:Richard Korts
ID: 39874553
I looked at those; it is not clear how they would work.

More specifically, I want a php program running on a web server.

The user enters a URL into an HTML form field. The form is processed by a PHP program that can work with the source of that URL as an array, a block of text, or some other structure.

I want the PHP program to examine the source for specific things, i.e., to look programmatically for specific strings in the source, etc.

Thanks
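The flow Richard describes can be sketched in a few lines; the function name and the hard-coded HTML below are illustrative assumptions, not code from this thread:

```php
<?php
// Illustrative sketch: a form submits a URL, PHP fetches the page source as
// one string, and the program searches it for specific strings.
// sourceContains() and the sample HTML are assumptions for illustration.

function sourceContains($html, $needle)
{
    // strpos() returns FALSE when the substring is absent
    return strpos($html, $needle) !== false;
}

// In the real form handler you would do: $html = file_get_contents($_POST['url']);
$html = '<html><head><title>Example</title></head><body>Hello</body></html>';

var_dump(sourceContains($html, '<title>')); // bool(true)
var_dump(sourceContains($html, 'ga.js'));   // bool(false)
```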
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39874622
Two things.  You could "Request Attention" and get the PHP topic areas added to your question.  And you should take a look at the site you want to 'scrape'.  Quite a few sites now put up their valuable content using AJAX.  That means the data you probably want is not included in the original page code and would only be available if you can run the JavaScript that accesses it.
 

Author Comment

by:Richard Korts
ID: 39874682
Thanks Dave,

I did that; I thought of that initially.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39876590
"I want a php program running on a web server."
Richard, I've tried this before and it just doesn't work.  PHP is too slow to do an acceptable job of copying an entire site.  You might look at this:
http://www.httrack.com/

If you want to "scrape" certain pages of a certain site, then PHP is fast enough.  You can read the HTML document and parse it.  But as Dave says, most web publishers are wise to attempts to programmatically copy / steal their important data, and they are not going to publish it in clear text any more.  If they want to make some of the information available to automated access, they will publish an API and give you the data in JSON format or, for the more old-fashioned, in XML.

If you have a URL and you want an example of how to find some of the information in the HTML document, please post the relevant information and I'll try to show you how it can be done.
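For the JSON case Ray mentions, the consuming side is short; the payload below is made up for illustration, not from any real API:

```php
<?php
// Sketch of consuming an API that returns JSON instead of scraping HTML.
// The payload is hard-coded for illustration; in practice it would come
// from file_get_contents($apiUrl) or cURL.

$json = '{"account":"UA-12345-1","pages":["/","/about"]}';

// Passing TRUE makes json_decode() return an associative array, not an object
$data = json_decode($json, true);

echo $data['account'];        // UA-12345-1
echo count($data['pages']);   // 2
```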
 

Author Comment

by:Richard Korts
ID: 39877779
I was considering responding to a posting on a site I use to look for new projects. Here is the posting:
______________________________________________

Hello,

We are looking for a developer or a company to develop for us a PHP application.

Your proposal should cover the following:

- PHP development of the functional requirements listed below
- Ensure the non-functional requirements are respected, especially the platform PHP versions, etc.
- Provide support for QA and deployment

If you would like more information please contact us

Type of application development required:
New Application

Integration requirements:
Standalone Application

Purpose or functionality of application:
providing the following functional requirements:

- Users can enter in a form field the URL of a website
- Parse a website to look for google analytics or google tags manager Js
- Display the Google analytics ID in the results page
- Display the results into a table confirming or not if the site is using google analytics and a table of all the pages with status for each of them (i.e. is google analytics available on the page) in the results page

Non functional requirements:
- Application is in english only
- We will provide environment for QA and deployment - we will also do the deployment
- UI will be done separately/stay very simple
- PHP version on platform is PHP Version 5.4.4-14+deb7u7.4
- Apache version is Apache/2.4.6
- Mysql: 5.5.33-0+wheezy1-log - (Debian) - libmysql - 5.5.33 - UTF8

Platform(s) desired for application:
Linux

Graphical User Interface requirements:
No

Application to run over network:
Yes
________________________________________________

I have seen other generally similar requirements, but I have never been able to figure out how to do this.

Thanks

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39878070
The signature of Google Analytics looks something like this:

<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-30349117-1']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>


Most of the time you would be able to isolate the GA account with something like this:
http://iconoun.com/demo/temp_rkorts.php

<?php // /demo/temp_rkorts.php
error_reporting(E_ALL);

function googleAnalytics($url, $sig="_gaq.push(['_setAccount',")
{
    // READ THE HTML DOCUMENT
    $str = file_get_contents($url);

    // LOOK FOR THIS SIGNAL STRING
    $arr = explode($sig, $str);

    // IF IT IS MISSING
    if (count($arr) < 2) return FALSE;

    // IF IT IS PRESENT
    $arr = explode("'", trim($arr[1]));
    return $arr[1];
}

// TEST THE FUNCTION
$ret = googleAnalytics('http://www.nationalpres.org');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";

$ret = googleAnalytics('http://www.laprbass.com/');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";


 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 500 total points
ID: 39878078
The other part, potentially more difficult and time-consuming to test, would be finding all of the links and following them recursively throughout the site.  IIRC there was once a project called sphpider.  It seems to have gone stale.  It was one of the attempts I looked at when I was trying to write a PHP search engine.  You might find something useful there.
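The link-gathering step Ray describes can be sketched with PHP's bundled DOMDocument; the recursion itself is only outlined in comments, since a real crawl also needs network access, a visited set, and a same-host check:

```php
<?php
// Sketch of the link-gathering step, using PHP's bundled DOMDocument.
// The recursive crawl is outlined in comments only.

function extractLinks($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // @ silences warnings on real-world sloppy HTML
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// A recursive crawl would then, roughly:
//   1. fetch $url and mark it visited
//   2. resolve each extracted link against $url
//   3. recurse into same-host links not yet visited

$html = '<html><body><a href="/about">About</a> <a href="contact.html">Contact</a></body></html>';
print_r(extractLinks($html)); // Array ( [0] => /about [1] => contact.html )
```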
 

Author Comment

by:Richard Korts
ID: 39878171
Ray, excellent, thanks for all that code, it never occurred to me (it should have) to use file_get_contents.

Thanks!
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39878206
Thanks for the points and thanks for using EE, ~Ray
 

Author Comment

by:Richard Korts
ID: 39878210
Ray, it occurred to me that in most cases, in analyzing the main page, at least some of the other site pages would normally appear somewhere in the page content as links (<a href="...">), where of course we would be looking for relative references or references to the base URL appended with "/<page name>". Of course it can cascade down (not all subpages are necessarily referenced from the main page, etc.).

Richard
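The relative references Richard mentions can be resolved against the base URL roughly like this; it is a simplified sketch that deliberately ignores ../ normalization, query strings, and fragments:

```php
<?php
// Simplified sketch of resolving a link against the page it came from.
// Deliberately ignores ../ normalization, query strings, and fragments.

function resolveUrl($base, $href)
{
    if (preg_match('#^https?://#i', $href)) {
        return $href;                                   // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($href[0] === '/') {
        return $root . $href;                           // root-relative
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = rtrim(dirname($path), '/');
    return $root . $dir . '/' . $href;                  // page-relative
}

echo resolveUrl('http://example.com/a/b.html', '/about');  // http://example.com/about
echo resolveUrl('http://example.com/a/b.html', 'c.html');  // http://example.com/a/c.html
```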
 

Expert Comment

by:Pratul Sricastava
ID: 41753471
What is web scraping?

Web scraping (also termed web data extraction, screen scraping, or web harvesting) is a technique for extracting data from the web and turning unstructured data (including HTML) into structured data that you can store on your local computer or in a database. Usually, data available on the Internet is only viewable with a web browser and has little or no structure. Almost no websites provide users with the functionality to save a copy of the data they display; the only option is manual copy-and-paste, which is time-consuming and tedious when you need to capture and separate exactly the data you want. Fortunately, web scraping can execute the process automatically and organize the results in minutes, instead of manually copying the data from websites.

 

The use of web scraping

Nowadays, web scraping is widely used in various fields, such as news portals, blogs, forums, e-commerce websites, social media, real estate, and financial reports. The purposes of web scraping are also varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, research, etc.

 
Web scraping techniques

Web scraping is implemented by web-scraping software tools. These tools interact with websites in the same way you do when using a web browser like Chrome, but instead of displaying the data in a browser, web scrapers extract data from web pages and store it in a local folder or database. There are lots of web-scraping software tools on the Internet.

Web scraping tools like Octoparse, Contentgrabber, and Import.io enable you to configure web-scraping tasks to run on multiple websites at the same time, as well as schedule each extraction task to run automatically. You can configure your tasks to run as frequently as you like: hourly, daily, weekly, or monthly.