
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 511

Web Site Scraping

I have often seen requests for "web site scraping". I think what is meant is the ability, programmatically on a web server, to take an input URL and get the entire content of the site as data, to analyze.

In other words, like view source, but capture the source as data.

How is this done?

Is there some general tool available?

Thanks
Richard KortsAsked:
2 Solutions
 
Bill PrewCommented:
You can easily do this using either of these two utilities:

https://www.gnu.org/software/wget/

http://curl.haxx.se/

If you tell us a little more about exactly the way you want to use this we can be more specific.
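As a rough sketch of how those two utilities are typically invoked (the URL here is a placeholder; the flags are standard wget/curl options):

```shell
# Fetch a single page's HTML, like "view source", and save it to a file:
curl -s http://www.example.com/ -o page.html

# Mirror a whole site into a local directory, following links and
# rewriting them for offline browsing:
wget --mirror --convert-links --no-parent http://www.example.com/
```

Both write the raw page source to disk, where another program can then analyze it.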

~bp
 
Richard KortsAuthor Commented:
I looked at those; it is not clear how they would work.

More specifically, I want a PHP program running on a web server.

The user enters a URL into an HTML form field. The form is processed by a PHP program that can deal with the source of that URL as an array, a blob of text, or ??.

I want the PHP program to examine the source for specific things: to look, programmatically, for specific strings in the source, etc.

Thanks
 
Dave BaldwinFixer of ProblemsCommented:
Two things. You could "Request Attention" and get the PHP topic areas added to your question. And you should take a look at the site you want to 'scrape'. Quite a few sites now serve their valuable content using AJAX. That means the data you probably want is not included in the original page code and would only be available if you can run the JavaScript that accesses it.

 
Richard KortsAuthor Commented:
Thanks Dave,

I did that; I thought of that initially.
 
Ray PaseurCommented:
I want a php program running on a web server.
Richard, I've tried this before and it just doesn't work.  PHP is too slow to do an acceptable job.  You might look at this:
http://www.httrack.com/

If you want to "scrape" certain pages of a certain site, then PHP is fast enough.  You can read the HTML document and parse it.  But as Dave says, most web publishers are clueful about attempts to programmatically copy / steal their important data, and they are not going to publish in clear text any more.  If they want to make some of the information available to automated access, they will publish an API and give you the data in JSON format, or for the more old-fashioned, in XML.
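As an illustration of the JSON case: consuming an API response in PHP is a one-liner with json_decode. The payload below is made up for the example; a real endpoint would be fetched with file_get_contents() or cURL.

```php
<?php
// Made-up JSON payload standing in for an API response; in practice:
// $json = file_get_contents('https://api.example.com/pages');
$json = '{"pages":[{"url":"/","ga":true},{"url":"/contact","ga":false}]}';

// Decode into an associative array
$data = json_decode($json, true);
if ($data === null) die('Invalid JSON');

// Structured data: no HTML parsing needed at all
foreach ($data['pages'] as $page) {
    echo $page['url'], ' => ', ($page['ga'] ? 'GA present' : 'no GA'), PHP_EOL;
}
```

That is why an API, when one exists, always beats scraping the HTML.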

If you have a URL and you want an example of how to find some of the information in the HTML document, please post the relevant information and I'll try to show you how it can be done.
 
Richard KortsAuthor Commented:
I was considering responding to a posting on a site I use to look for new projects. Here is the posting:
______________________________________________

Hello,

We are looking for a developer or a company to develop for us a PHP application.

Your proposal should cover the following:

- PHP development of the functional requirements listed below
- Ensure the non functional requirements are respected especially on the platform PHP versions etc....
- Provide support for QA and deployment

If you would like more information please contact us

Type of application development required:
New Application

Integration requirements:
Standalone Application

Purpose or functionality of application:
providing the following functional requirements:

- Users can enter in a form field the URL of a website
- Parse a website to look for google analytics or google tags manager Js
- Display the Google analytics ID in the results page
- Display the results into a table confirming or not if the site is using google analytics and a table of all the pages with status for each of them (i.e. is google analytics available on the page) in the results page

Non functional requirements:
- Application is in english only
- We will provide environment for QA and deployment - we will also do the deployment
- UI will be done separately / stay very simple
- PHP version on platform is PHP Version 5.4.4-14+deb7u7.4
- Apache version is Apache/2.4.6
- Mysql: 5.5.33-0+wheezy1-log - (Debian) - libmysql - 5.5.33 - UTF8

Platform(s) desired for application:
Linux

Graphical User Interface requirements:
No

Application to run over network:
Yes
________________________________________________

I have seen other generally similar requirements, I have never been able to figure out how to do this.

Thanks
 
Ray PaseurCommented:
The signature of Google Analytics looks something like this:

<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-30349117-1']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>


Most of the time you would be able to isolate the GA account with something like this:
http://iconoun.com/demo/temp_rkorts.php

<?php // /demo/temp_rkorts.php
error_reporting(E_ALL);

function googleAnalytics($url, $sig="_gaq.push(['_setAccount',")
{
    // READ THE HTML DOCUMENT
    $str = file_get_contents($url);

    // LOOK FOR THIS SIGNAL STRING
    $arr = explode($sig, $str);

    // IF IT IS MISSING
    if (count($arr) < 2) return FALSE;

    // IF IT IS PRESENT
    $arr = explode("'", trim($arr[1]));
    return $arr[1];
}

// TEST THE FUNCTION
$ret = googleAnalytics('http://www.nationalpres.org');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";

$ret = googleAnalytics('http://www.laprbass.com/');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";


 
Ray PaseurCommented:
The other part, potentially more difficult and time-consuming to test, would be finding all of the links and following them recursively throughout the site.  IIRC there was once a project called sphpider.  It seems to have gone stale.  It was one of the attempts I looked at when I was trying to write a PHP search engine.  You might find something useful there.
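A minimal sketch of that recursive idea: a breadth-first crawl with a queue and a visited set. To keep it self-contained, the fetch step is stubbed with an in-memory array of pages; in real use that line would be a file_get_contents() call per URL.

```php
<?php
// Stubbed "site": URL => HTML.  In real use, fetch each page over HTTP.
$site = [
    '/'        => '<a href="/about">About</a><a href="/contact">Contact</a>',
    '/about'   => '<a href="/">Home</a>',
    '/contact' => '<a href="/about">About</a>',
];

// Breadth-first crawl: a queue of pages to visit and a set of pages seen,
// so each page is fetched exactly once even when links form cycles.
$queue = ['/'];
$seen  = ['/' => true];
while ($queue) {
    $url  = array_shift($queue);
    $html = $site[$url];                       // stub for file_get_contents($url)
    preg_match_all('/href="([^"]+)"/', $html, $m);
    foreach ($m[1] as $link) {
        if (!isset($seen[$link])) {
            $seen[$link] = true;
            $queue[]     = $link;
        }
    }
}
print_r(array_keys($seen));                    // every reachable page
```

The visited set is what keeps the crawl from looping forever on sites whose pages link back to each other.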
 
Richard KortsAuthor Commented:
Ray, excellent, thanks for all that code, it never occurred to me (it should have) to use file_get_contents.

Thanks!
 
Ray PaseurCommented:
Thanks for the points and thanks for using EE, ~Ray
 
Richard KortsAuthor Commented:
Ray, it occurred to me that in most cases, when analyzing the main page, it would normally be true that at least some of the other site pages occur somewhere in the page content as links (<a href="...">), where of course we would be looking for relative references or references to the base URL appended with "/<page name>". Of course it can cascade down (not all subpages are necessarily referenced from the main page, etc.).
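A sketch of that link-harvesting idea, assuming the DOM extension is available (the function name extractLinks and the sample HTML are illustrative, not from this thread): it resolves relative references against the base URL and keeps only same-site links.

```php
<?php
// Hypothetical sketch: collect same-site links from one page's HTML.
function extractLinks($html, $baseUrl)
{
    $links = [];
    $dom = new DOMDocument();
    @$dom->loadHTML($html);                    // suppress warnings from sloppy HTML
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#') continue;
        // No scheme means a relative reference: resolve against the base URL
        if (parse_url($href, PHP_URL_SCHEME) === null) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        // Keep only links that stay on the same site
        if (strpos($href, $baseUrl) === 0) {
            $links[$href] = true;              // de-duplicate via array keys
        }
    }
    return array_keys($links);
}

$html = '<a href="/about.html">About</a> <a href="http://other.example/x">Off-site</a>';
print_r(extractLinks($html, 'http://www.example.com'));
```

Feeding each harvested link back through the same function is exactly the cascade you describe.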

Richard
 
Pratul SricastavaCommented:
What is web scraping

Web scraping (also termed web data extraction, screen scraping, or web harvesting) is a technique for extracting data from the web and turning unstructured data (including HTML) into structured data that you can store on your local computer or in a database. Usually, data available on the Internet is only viewable with a web browser and has little or no structure. Almost no websites provide users with the functionality to save a copy of the data they display; the only option is manual copy-and-paste, which is time-consuming and tedious when you want to capture and separate exactly the data you need. Fortunately, web scraping can execute the process automatically and organize the results in minutes, instead of copying the data from websites by hand.

 

The use of web scraping

Nowadays, web scraping is widely used in various fields, such as news portals, blogs, forums, e-commerce websites, social media, real estate, and financial reports. The purposes of web scraping are also varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, research, etc.

 
Web scraping techniques

The web scraping technique is implemented by web-scraping software tools. These tools interact with websites in the same way you do when using a web browser like Chrome, but instead of displaying the data in a browser, web scrapers extract data from web pages and store it in a local folder or database. There are lots of web-scraping software tools on the Internet.

Web scraping tools like Octoparse, Contentgrabber, and Import.io enable you to configure web-scraping tasks to run on multiple websites at the same time, as well as schedule each extraction task to run automatically, as frequently as you like: hourly, daily, weekly, or monthly.