Richard Korts (United States) asked:

Web Site Scraping

I have often seen requests for "web site scraping". I think what is meant is the ability, programmatically on a web server, to take a URL as input and retrieve the entire content of the site as data, to analyze.

In other words, like View Source in a browser, but capturing the source as data.

How is this done?

Is there some general tool available?

Thanks
Bill Prew

You can easily do this using either of these two utilities:

https://www.gnu.org/software/wget/

http://curl.haxx.se/

If you tell us a little more about exactly the way you want to use this we can be more specific.

~bp
Richard Korts (Asker)
I looked at those; it is not clear how they would work.

More specifically, I want a PHP program running on a web server.

The user enters a URL into an HTML form field. The form is processed by a PHP program that is able to deal with the source of that URL as an array, a blob of text, or ??.

I want the PHP program to examine the source for specific things: to look, programmatically, for specific strings in the source, etc.

Thanks
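What's described above can be sketched in a few lines of PHP, assuming the page is plain HTML. The URL is a placeholder, and a real program would take it from the form field:

```php
<?php
// Minimal sketch: fetch a page's HTML source into a string and search it for
// specific text. file_get_contents() returns false on failure, so check that
// before using the result.

function sourceContains($html, $needle)
{
    // strpos() returns an offset or false, so compare strictly against false
    return strpos($html, $needle) !== false;
}

$html = @file_get_contents('https://example.com/');
if ($html !== false && sourceContains($html, '<title>')) {
    echo "The page source contains a <title> tag\n";
}
```

From there, any string-search or regex logic can be applied to `$html` like any other PHP string.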
Two things. You could "Request Attention" and get the PHP topic areas added to your question. And you should take a look at the site you want to 'scrape'. Quite a few sites now put up their valuable content using AJAX. That means the data you probably want is not included in the original page code and would only be available if you can run the JavaScript that accesses it.
Thanks Dave,

I did that; I thought of that initially.
I want a PHP program running on a web server.
Richard, I've tried this before and it just doesn't work.  PHP is too slow to do an acceptable job.  You might look at this:
http://www.httrack.com/

If you want to "scrape" certain pages of a certain site, then PHP is fast enough. You can read the HTML document and parse it. But as Dave says, most web publishers are wise to attempts to programmatically copy or steal their important data, and they are not going to publish it in clear text anymore. If they want to make some of the information available to automated access, they will publish an API and give you the data in JSON format or, for the more old-fashioned, in XML.

If you have a URL and you want an example of how to find some of the information in the HTML document, please post the relevant information and I'll try to show you how it can be done.
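For the API case mentioned above, consuming JSON from PHP is straightforward with json_decode(). The endpoint here is hypothetical; substitute whatever API the publisher documents:

```php
<?php
// Sketch of consuming a JSON API instead of scraping HTML.

function decodeApiResponse($json)
{
    $data = json_decode($json, true);   // true => associative arrays
    return is_array($data) ? $data : null;
}

$json = @file_get_contents('https://api.example.com/items');
if ($json !== false) {
    $data = decodeApiResponse($json);
    if ($data !== null) {
        echo "Received " . count($data) . " items\n";
    }
}
```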
I was considering responding to a posting on a site I use to look for new projects. Here is the posting:
______________________________________________

Hello,

We are looking for a developer or a company to develop for us a PHP application.

Your proposal should cover the following:

- PHP development of the functional requirements listed below
- Ensure the non functional requirements are respected especially on the platform PHP versions etc....
- Provide support for QA and deployment

If you would like more information please contact us

Type of application development required:
New Application

Integration requirements:
Standalone Application

Purpose or functionality of application:
Providing the following functional requirements:

- Users can enter in a form field the URL of a website
- Parse a website to look for google analytics or google tags manager Js
- Display the Google analytics ID in the results page
- Display the results into a table confirming or not if the site is using google analytics and a table of all the pages with status for each of them (i.e. is google analytics available on the page) in the results page

Non functional requirements:
- Application is in English only
- We will provide environment for QA and deployment - we will also do the deployment
- UI will be done separately / will stay very simple
- PHP version on platform is PHP Version 5.4.4-14+deb7u7.4
- Apache version is Apache/2.4.6
- Mysql: 5.5.33-0+wheezy1-log - (Debian) - libmysql - 5.5.33 - UTF8

Platform(s) desired for application:
Linux

Graphical User Interface requirements:
No

Application to run over network:
Yes
________________________________________________

I have seen other generally similar requirements, I have never been able to figure out how to do this.

Thanks
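The posting's main requirement (check a page's HTML for Google Analytics or Google Tag Manager and report the ID) can be sketched in PHP. The regular expressions below are assumptions based on the common public ID formats ("UA-1234567-1" and "GTM-XXXXXX"), not on the poster's spec:

```php
<?php
// Hypothetical sketch: scan a page's HTML for Google Analytics property IDs
// and Google Tag Manager container IDs.

function findAnalyticsIds($html)
{
    $ids = array();
    // Classic analytics.js / ga.js property IDs, e.g. UA-1234567-1
    if (preg_match_all('/\bUA-\d{4,10}-\d{1,4}\b/', $html, $m)) {
        $ids = array_merge($ids, $m[0]);
    }
    // Google Tag Manager container IDs, e.g. GTM-ABC123
    if (preg_match_all('/\bGTM-[A-Z0-9]{4,8}\b/', $html, $m)) {
        $ids = array_merge($ids, $m[0]);
    }
    return array_values(array_unique($ids));
}
```

Fetching each page of the site, running it through this function, and tabulating which pages returned an ID would produce the results table the posting asks for.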
ASKER CERTIFIED SOLUTION
Ray Paseur (United States)

SOLUTION
Ray, excellent. Thanks for all that code; it never occurred to me (it should have) to use file_get_contents.

Thanks!
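A side note on file_get_contents(): it accepts a stream context, which is useful for scraping because it lets you set a timeout and a User-Agent header. The values below are illustrative, and the URL is a placeholder:

```php
<?php
// Fetching a page with a stream context for a timeout and custom headers.

$context = stream_context_create(array(
    'http' => array(
        'timeout'       => 10,          // give up after 10 seconds
        'user_agent'    => 'Mozilla/5.0 (example scraper)',
        'ignore_errors' => true,        // return the body even on 4xx/5xx
    ),
));

$html = @file_get_contents('https://example.com/', false, $context);
if ($html !== false) {
    echo strlen($html) . " bytes fetched\n";
}
```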
Thanks for the points and thanks for using EE, ~Ray
Ray, it occurred to me that in most cases, when analyzing the main page, at least some of the other site pages would normally appear somewhere in the page content as links (<a href="...">), where of course we would be looking for relative references or references to the base URL with "/<page name>" appended. Of course it can cascade down (not all subpages are necessarily referenced from the main page, etc.).

Richard
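The link-following idea above can be sketched with PHP's DOMDocument. This is deliberately simplified: it resolves root-relative and plain relative hrefs against a base URL, but does not handle "../" segments or a <base> tag:

```php
<?php
// Sketch: pull <a href> values out of a page with DOMDocument and resolve
// them against a base URL.

function extractLinks($html, $baseUrl)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // suppress warnings on sloppy HTML
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#') {
            continue;                    // skip empty and fragment-only links
        }
        if (parse_url($href, PHP_URL_SCHEME) !== null) {
            $links[] = $href;                               // already absolute
        } elseif ($href[0] === '/') {
            $links[] = rtrim($baseUrl, '/') . $href;        // root-relative
        } else {
            $links[] = rtrim($baseUrl, '/') . '/' . $href;  // relative
        }
    }
    return array_values(array_unique($links));
}
```

Repeating this on each discovered page (and skipping URLs already visited) gives the cascading crawl described above.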
What is web scraping

Web scraping (also termed web data extraction, screen scraping, or web harvesting) is a technique for extracting data from the web and turning unstructured data (including HTML) into structured data that you can store on your local computer or in a database. Usually, data available on the Internet is only viewable with a web browser and has little or no structure. Almost no websites provide users with the ability to save a copy of the data they display; the only option is manual copy-and-paste, which is time-consuming and tedious when you need to capture and separate exactly the data you want. Web scraping automates this process and can organize the data in minutes, instead of requiring manual copying from websites.

 

The use of web scraping

Nowadays, web scraping is widely used in various fields, such as news portals, blogs, forums, e-commerce websites, social media, real estate, and financial reporting. The purposes of web scraping are also varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, research, etc.

 
Web scraping techniques

Web scraping is implemented by web-scraping software tools. These tools interact with websites in the same way you do when using a web browser like Chrome. But in addition to displaying the data in a browser, web scrapers extract data from web pages and store it in a local folder or database. There are many web-scraping software tools on the Internet.

Web scraping tools like Octoparse, Contentgrabber, and Import.io enable you to configure web-scraping tasks to run on multiple websites at the same time, as well as schedule each extraction task to run automatically. You can configure your tasks to run as frequently as you like: hourly, daily, weekly, or monthly.