Solved

Web Site Scraping

Posted on 2014-02-20
13
493 Views
Last Modified: 2016-08-12
I have often seen requests for "web site scraping". I think what is meant is the ability, programmatically on a web server, to take a URL as input and retrieve the entire content of the site as data, to analyze.

In other words, like view source, but capture the source as data.

How is this done?

Is there some general tool available?

Thanks
0
Comment
Question by:Richard Korts
13 Comments
 
LVL 52

Expert Comment

by:Bill Prew
ID: 39874340
You can easily do this using either of these two utilities:

https://www.gnu.org/software/wget/

http://curl.haxx.se/

If you tell us a little more about exactly the way you want to use this we can be more specific.

~bp
0
 

Author Comment

by:Richard Korts
ID: 39874553
I looked at those; it is not clear how they would work.

More specifically, I want a php program running on a web server.

The user enters a url into an HTML form field. The form is processed by a php program that is able to deal with the source of that url as an array, a blob of text or ??.

I want the php program to examine the source for specific things. I want, programmatically, to look for specific strings in the source, etc.

Thanks
0
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39874622
Two things.  You could "Request Attention" and get the PHP topic areas added to your question.  And you should take a look at the site you want to 'scrape'.  Quite a few sites now put up their valuable content using AJAX.  That means the data you probably want is not included in the original page source and would only be available if you can run the JavaScript that retrieves it.
0
 

Author Comment

by:Richard Korts
ID: 39874682
Thanks Dave,

I did that; I thought of that initially.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39876590
I want a php program running on a web server.
Richard, I've tried this before and it just doesn't work.  PHP is too slow to do an acceptable job.  You might look at this:
http://www.httrack.com/

If you want to "scrape" certain pages of a certain site, then PHP is fast enough.  You can read the HTML document and parse it.  But as Dave says, most web publishers are clueful about attempts to programmatically copy / steal their important data, and they are not going to publish in clear text any more.  If they want to make some of the information available to automated access, they will publish an API and give you the data in JSON format, or for the more old-fashioned, in XML.
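For the API case, the PHP side is little more than json_decode().  A tiny sketch, where the JSON string is a made-up stand-in for what file_get_contents() would return from a real endpoint:

```php
<?php
// Decode a JSON API response into a PHP array.  The string below is a
// made-up stand-in for file_get_contents($apiUrl) against a real API.
$json = '{"site":"example.com","analytics_id":"UA-00000000-1"}';
$data = json_decode($json, true);     // true => associative array
echo $data['analytics_id'];           // UA-00000000-1
```

With structured data like this there is nothing to "scrape" at all; you just index into the array.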

If you have a URL and you want an example of how to find some of the information in the HTML document, please post the relevant information and I'll try to show you how it can be done.
0
 

Author Comment

by:Richard Korts
ID: 39877779
I was considering responding to a posting on a site I use to look for new projects. Here is the posting:
______________________________________________

Hello,

We are looking for a developer or a company to develop for us a PHP application.

Your proposal should cover the following:

- PHP development of the functional requirements listed below
- Ensure the non functional requirements are respected especially on the platform PHP versions etc....
- Provide support for QA and deployment

If you would like more information please contact us

Type of application development required:
New Application

Integration requirements:
Standalone Application

Purpose or functionality of application:
providing the following functional requirements:

- Users can enter in a form field the URL of a website
- Parse a website to look for google analytics or google tags manager Js
- Display the Google analytics ID in the results page
- Display the results into a table confirming or not if the site is using google analytics and a table of all the pages with status for each of them (i.e. is google analytics available on the page) in the results page

Non functional requirements:
- Application is in English only
- We will provide environment for QA and deployment - we will also do the deployment
- UI will be done separately/stay very simple
- PHP version on platform is PHP Version 5.4.4-14+deb7u7.4
- Apache version is Apache/2.4.6
- Mysql: 5.5.33-0+wheezy1-log - (Debian) - libmysql - 5.5.33 - UTF8

Platform(s) desired for application:
Linux

Graphical User Interface requirements:
No

Application to run over network:
Yes
________________________________________________

I have seen other generally similar requirements, I have never been able to figure out how to do this.

Thanks
0
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39878070
The signature of Google Analytics looks something like this:

<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-30349117-1']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>


Most of the time you would be able to isolate the GA account with something like this:
http://iconoun.com/demo/temp_rkorts.php

<?php // /demo/temp_rkorts.php
error_reporting(E_ALL);

function googleAnalytics($url, $sig="_gaq.push(['_setAccount',")
{
    // READ THE HTML DOCUMENT
    $str = file_get_contents($url);

    // LOOK FOR THIS SIGNAL STRING
    $arr = explode($sig, $str);

    // IF IT IS MISSING
    if (count($arr) < 2) return FALSE;

    // IF IT IS PRESENT
    $arr = explode("'", trim($arr[1]));
    return $arr[1];
}

// TEST THE FUNCTION
$ret = googleAnalytics('http://www.nationalpres.org');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";

$ret = googleAnalytics('http://www.laprbass.com/');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";

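One caveat: file_get_contents() over HTTP requires allow_url_fopen to be enabled in php.ini.  Where it is not, the same fetch can be sketched with the standard cURL extension (the timeout value is just a suggestion):

```php
<?php
// Fetch a page with the cURL extension instead of file_get_contents().
// Returns the HTML as a string, or FALSE on failure.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // give up after 15 seconds
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```

The googleAnalytics() function above would work unchanged if its file_get_contents() call were swapped for fetchPage().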

0
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 500 total points
ID: 39878078
The other part, potentially more difficult and time-consuming to test, would be finding all of the links and following them recursively throughout the site.  IIRC there was once a project called Sphider.  It seems to have gone stale, but it was one of the projects I looked at when I was trying to write a PHP search engine.  You might find something useful there.
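The recursive part can be sketched as a simple breadth-first crawl: a queue of URLs plus a visited list.  This is only a sketch (a crude href regex instead of a real parser, a same-site check by string prefix, and a hard page cap):

```php
<?php
// Bare-bones breadth-first crawl of one site.  Not production code:
// a real crawler needs robots.txt handling, URL normalization, and delays.
function crawl($startUrl, $max = 20)
{
    $queue   = array($startUrl);
    $visited = array();
    while ($queue && count($visited) < $max) {
        $url = array_shift($queue);
        if (isset($visited[$url])) continue;
        $visited[$url] = true;
        $html = @file_get_contents($url);
        if ($html === false) continue;
        // Crude link extraction; queue only same-site absolute URLs
        if (preg_match_all('#href=["\']([^"\']+)["\']#i', $html, $m)) {
            foreach ($m[1] as $href) {
                if (strpos($href, $startUrl) === 0 && !isset($visited[$href])) {
                    $queue[] = $href;
                }
            }
        }
    }
    return array_keys($visited);
}
```

Each page returned by crawl() could then be fed to googleAnalytics() to build the per-page status table the posting asks for.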
0
 

Author Comment

by:Richard Korts
ID: 39878171
Ray, excellent, thanks for all that code, it never occurred to me (it should have) to use file_get_contents.

Thanks!
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39878206
Thanks for the points and thanks for using EE, ~Ray
0
 

Author Comment

by:Richard Korts
ID: 39878210
Ray, it occurred to me that in most cases, when analyzing the main page, at least some of the other site pages would normally appear somewhere in the page content as links (<a href="...">), where of course we would be looking for relative references or references to the base URL with "/<page name>" appended. Of course it can cascade down (not all subpages are necessarily referenced from the main page, etc.).
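That link-gathering idea can be sketched with DOMDocument, which is more robust than a regex for pulling out <a href> values.  A sketch, with deliberately simplistic relative-to-absolute resolution:

```php
<?php
// Pull the href of every <a> tag out of an HTML string and resolve
// relative links against a base URL (very simplified resolution).
function extractLinks($html, $base)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // @ silences warnings on messy markup
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') continue;
        if (!preg_match('#^https?://#i', $href)) {
            // Relative reference: append to the base URL
            $href = rtrim($base, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}
```

A production version would also need to handle ../ paths, fragments, and query strings, but this covers the common cases described above.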

Richard
0
 

Expert Comment

by:Pratul Sricastava
ID: 41753471
What is web scraping

Web scraping (also termed web data extraction, screen scraping, or web harvesting) is a technique for extracting data from the web and turning unstructured data (including HTML) into structured data that you can store on your local computer or in a database. Usually, data available on the Internet is only viewable with a web browser and has little or no structure. Most websites do not provide users with any way to save a copy of the data they display; the only option is manual copy-and-paste, which is time-consuming and tedious when you have to capture and separate exactly the data you want. Web scraping can execute the process automatically and organize the results in minutes, instead of copying the data from websites by hand.

 

The use of web scraping

Nowadays, web scraping is widely used in various fields, such as news portals, blogs, forums, e-commerce websites, social media, real estate, and financial reports. The purposes of web scraping are also varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, research, etc.

 
Web scraping techniques

The web scraping technique is implemented by web-scraping software tools. These tools interact with websites in the same way you do when using a web browser like Chrome. But instead of displaying the data in a browser, web scrapers extract data from web pages and store it in a local folder or database. There are lots of web-scraping software tools on the Internet.

Web scraping tools like Octoparse, Contentgrabber, and Import.io enable you to configure web-scraping tasks to run on multiple websites at the same time, as well as schedule each extraction task to run automatically. You can configure your tasks to run as frequently as you like: hourly, daily, weekly, or monthly.
2
