Web Site Scraping

Posted on 2014-02-20
Medium Priority
Last Modified: 2016-08-12
I have often seen requests for "Web site scraping". I think what is meant is the ability, programmatically on a web server, to take a url entered by the user and get the entire content of the site as data, to analyze.

In other words, like view source, but capture the source as data.

How is this done?

Is there some general tool available?

Question by:Richard Korts
LVL 59

Expert Comment

by:Bill Prew
ID: 39874340
You can easily do this using either of these two utilities:



If you tell us a little more about exactly the way you want to use this we can be more specific.


Author Comment

by:Richard Korts
ID: 39874553
I looked at those; it is not clear how they would work.

More specifically, I want a php program running on a web server.

The user enters a url into an HTML form field. The form is processed by a php program that is able to deal with the source of that url as an array, a blob of text or ??.

I want the php program to examine the source for specific things. I want, programmatically, to look for specific strings in the source, etc.
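A minimal sketch of the form-handling side described above. The helper names (isFetchableUrl, sourceContains, sourceLines) are hypothetical and only illustrate the idea: validate the submitted url, then treat the page source either as one blob of text or as an array of lines.

```php
<?php
// Hypothetical helpers; the names are illustrative and not from this thread.

// Accept only syntactically valid http/https URLs from the form field.
function isFetchableUrl($url)
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) return false;
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

// Case-insensitive search for a literal string in the page source.
function sourceContains($source, $needle)
{
    return stripos($source, $needle) !== false;
}

// Split the source into an array of lines, for line-by-line inspection.
function sourceLines($source)
{
    return preg_split('/\r\n|\r|\n/', $source);
}
```

In the form handler, the source itself could come from file_get_contents() on the validated url (this needs allow_url_fopen enabled on the server); file() would return the same document already split into an array of lines.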

LVL 84

Expert Comment

by:Dave Baldwin
ID: 39874622
Two things.  You could "Request Attention" and get the PHP topic areas added to your question.  And you should take a look at the site you want to 'scrape'.  Quite a few sites now deliver their valuable content using AJAX.  That means the data you probably want is not included in the original page code and would only be available if you can run the JavaScript that fetches it.

Author Comment

by:Richard Korts
ID: 39874682
Thanks Dave,

I did that; I thought of that initially.
LVL 111

Expert Comment

by:Ray Paseur
ID: 39876590
"I want a php program running on a web server."
Richard, I've tried this before and it just doesn't work.  PHP is too slow to do an acceptable job.  You might look at this:

If you want to "scrape" certain pages of a certain site, then PHP is fast enough.  You can read the HTML document and parse it.  But as Dave says, most web publishers are clueful about attempts to programmatically copy / steal their important data, and they are not going to publish in clear text any more.  If they want to make some of the information available to automated access, they will publish an API and give you the data in JSON format, or for the more old-fashioned, in XML.
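If a publisher does offer a JSON API, consuming it from PHP is short. A minimal sketch, assuming a made-up response shape (the "items", "title" and "price" fields and the decodeItems name are invented for illustration, not taken from any real API):

```php
<?php
// Hypothetical API response; the field names are invented for illustration.
$json = '{"items":[{"title":"First","price":9.5},{"title":"Second","price":12}]}';

// Decode the response into a list of associative arrays, or an empty list
// if the text is not valid JSON or lacks the expected "items" key.
function decodeItems($json)
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['items'])) return array();
    return $data['items'];
}

$items = decodeItems($json);
```

Compared with parsing HTML, the structure is explicit, so there is no need to hunt for signature strings.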

If you have a URL and you want an example of how to find some of the information in the HTML document, please post the relevant information and I'll try to show you how it can be done.

Author Comment

by:Richard Korts
ID: 39877779
I was considering responding to a posting on a site I use to look for new projects. Here is the posting:


We are looking for a developer or a company to develop for us a PHP application.

Your proposal should cover the following:

- PHP development of the functional requirements listed below
- Ensure the non functional requirements are respected especially on the platform PHP versions etc....
- Provide support for QA and deployment

If you would like more information please contact us

Type of application development required:
New Application

Integration requirements:
Standalone Application

Purpose or functionality of application:
Providing the following functional requirements:

- Users can enter in a form field the URL of a website
- Parse a website to look for Google Analytics or Google Tag Manager JS
- Display the Google Analytics ID in the results page
- Display the results in a table confirming whether or not the site is using Google Analytics, plus a table of all the pages with a status for each of them (i.e. whether Google Analytics is present on the page), in the results page

Non functional requirements:
- Application is in English only
- We will provide environment for QA and deployment - we will also do the deployment
- UI will be done separately/stay very simple
- PHP version on platform is PHP Version 5.4.4-14+deb7u7.4
- Apache version is Apache/2.4.6
- Mysql: 5.5.33-0+wheezy1-log - (Debian) - libmysql - 5.5.33 - UTF8

Platform(s) desired for application:

Graphical User Interface requirements:

Application to run over network:

I have seen other, generally similar requirements, but I have never been able to figure out how to do this.

LVL 111

Accepted Solution

Ray Paseur earned 2000 total points
ID: 39878070
The signature of Google Analytics looks something like this:

<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-30349117-1']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>

Most of the time you would be able to isolate the GA account with something like this:

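<?php // /demo/temp_rkorts.php

// Return the Google Analytics account ID found at $url, or FALSE if none
function googleAnalytics($url, $sig="_gaq.push(['_setAccount',")
{
    // Read the HTML document; file_get_contents() returns FALSE on failure
    $str = @file_get_contents($url);
    if ($str === FALSE) return FALSE;

    // Split the document on the GA signature
    $arr = explode($sig, $str);

    if (count($arr) < 2) return FALSE;

    // The account ID is the first single-quoted string after the signature
    $arr = explode("'", trim($arr[1]));
    return isset($arr[1]) ? $arr[1] : FALSE;
}

$ret = googleAnalytics('http://www.nationalpres.org');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";

$ret = googleAnalytics('http://www.laprbass.com/');
if ($ret) echo $ret;
if (!$ret) echo "<br>NO GA FOUND";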
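The posting also mentions Google Tag Manager, and not every site uses the exact _gaq.push signature. A hedged alternative is to search the HTML for anything shaped like a tracking ID. This is only a sketch: findTrackingIds is a hypothetical name, and the two patterns are assumptions about what the IDs look like (UA-digits-digits for Analytics, GTM- followed by capitals and digits for Tag Manager). IDs that live only in externally loaded scripts will not be found this way.

```php
<?php
// Hypothetical helper: collect the distinct tracking IDs that appear in the HTML text.
function findTrackingIds($html)
{
    $ids = array('analytics' => array(), 'tag_manager' => array());

    // Assumed shape of an Analytics property ID, e.g. UA-30349117-1
    if (preg_match_all('/\bUA-\d{4,10}-\d{1,4}\b/', $html, $m)) {
        $ids['analytics'] = array_values(array_unique($m[0]));
    }

    // Assumed shape of a Tag Manager container ID, e.g. GTM-AB12CD
    if (preg_match_all('/\bGTM-[A-Z0-9]{4,}\b/', $html, $m)) {
        $ids['tag_manager'] = array_values(array_unique($m[0]));
    }

    return $ids;
}
```

The function takes the HTML as a string, so it can be fed the result of file_get_contents() and tested without any network access.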

LVL 111

Assisted Solution

by:Ray Paseur
Ray Paseur earned 2000 total points
ID: 39878078
The other part, potentially more difficult and time-consuming to test, would be the part about finding all of the links and following them recursively throughout the site.  IIRC there was once a project called sphpider.  It seems to have gone stale.  It was one of the attempts I looked to when I was trying to write a PHP search engine.  You might find something useful there.
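For the link-finding step, PHP's DOM extension can do the parsing. A minimal sketch, assuming the DOM extension is available; extractLinks is a hypothetical name, and this only collects the href values from one page without following them:

```php
<?php
// Hypothetical helper: return the distinct href values of the <a> elements in the HTML.
function extractLinks($html)
{
    $links = array();
    if (trim($html) === '') return $links;

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') $links[] = $href;
    }

    // Remove duplicates while keeping first-seen order.
    return array_values(array_unique($links));
}
```

A crawler would call this on each fetched page, keep a list of URLs already visited, and stop at some depth or page limit so that it cannot loop forever.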

Author Comment

by:Richard Korts
ID: 39878171
Ray, excellent, thanks for all that code, it never occurred to me (it should have) to use file_get_contents.

LVL 111

Expert Comment

by:Ray Paseur
ID: 39878206
Thanks for the points and thanks for using EE, ~Ray

Author Comment

by:Richard Korts
ID: 39878210
Ray, it occurred to me that when analyzing the main page, at least some of the other site pages would normally appear somewhere in the page content as links (<a href="...">). We would be looking for relative references, or references to the base url with "/<page name>" appended. Of course it can cascade down (not all subpages are necessarily referenced from the main page, etc.).
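A rough sketch of that filtering step: given an href found on a page, turn it into an absolute URL if it stays on the same host, otherwise discard it. resolveSameSite is a hypothetical name, and the sketch is deliberately simplified: it does not normalize "../" segments and it treats www and non-www hosts as different sites.

```php
<?php
// Hypothetical helper: resolve $href (found on $pageUrl) to an absolute URL on
// the same host, or return null if it points elsewhere or is not a page link.
function resolveSameSite($href, $pageUrl)
{
    $base = parse_url($pageUrl);
    if (!isset($base['scheme'], $base['host'])) return null;
    $root = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) $root .= ':' . $base['port'];

    // Drop any #fragment; a bare fragment is not a separate page.
    $parts = explode('#', $href, 2);
    $href = trim($parts[0]);
    if ($href === '') return null;

    // Absolute reference with its own scheme (http:, https:, mailto:, ...)
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        $target = parse_url($href);
        if (!isset($target['scheme'], $target['host'])) return null;
        if (!in_array(strtolower($target['scheme']), array('http', 'https'))) return null;
        return strcasecmp($target['host'], $base['host']) === 0 ? $href : null;
    }

    // Protocol-relative reference: reuse the scheme of the current page.
    if (substr($href, 0, 2) === '//') {
        return resolveSameSite($base['scheme'] . ':' . $href, $pageUrl);
    }

    // Root-relative reference such as "/contact".
    if ($href[0] === '/') return $root . $href;

    // Document-relative reference: append to the directory of the current page.
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Each non-null result is a candidate page to fetch and check for the Analytics signature, which gives the per-page status table the posting asks for.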


Expert Comment

by:Pratul Sricastava
ID: 41753471
What is web scraping

Web scraping (also termed web data extraction, screen scraping, or web harvesting) is a technique for extracting data from the web and turning unstructured data (including HTML) into structured data that you can store on your local computer or in a database. Usually, data available on the Internet is only viewable with a web browser and has little or no structure. Most websites do not give users a way to save a copy of the data they display. The only option is manual copy and paste, which is time-consuming and tedious when you need to capture and separate exactly the data you want. Web scraping automates the process and can organize the data in minutes, instead of manually copying it from websites.


The use of web scraping

Nowadays, web scraping is widely used in various fields, such as news portals, blogs, forums, e-commerce websites, social media, real estate, and financial reports. Its purposes are also varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, research, etc.

Web scraping techniques

The web scraping technique is implemented by web-scraping software tools. These tools interact with websites in the same way as you do when using a web browser like Chrome. Rather than only displaying the data in a browser, web scrapers extract data from web pages and store it in a local folder or database. There are many web-scraping software tools on the Internet.

Web scraping tools like Octoparse, Contentgrabber, and Import.io enable you to configure web-scraping tasks to run on multiple websites at the same time, as well as schedule each extraction task to run automatically. You can configure your tasks to run as frequently as you like, such as hourly, daily, weekly, or monthly.

Question has a verified solution.
