Web scraping - the best tools for the job

I'm trying to research which web language would be best for developing a web scraping tool. The tool should scrape employment sites only, and let's assume that the developer has only a fair working knowledge of ColdFusion, ASP.NET and PHP.

If anyone has any good resources on the subject, those would also be welcome.

Thanks in advance
There are some commercial screen scrapers, although I have never used them.

As far as breakage goes, that is par for the course, unless you can key in on an ID or other tag that remains relatively static when the pages are redesigned.

That is why they came up with RSS...
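To illustrate the RSS point: when a site publishes a feed, the structure stays stable across page redesigns, so there is no HTML to pick apart. Here is a rough sketch in Python (not one of the languages mentioned in this thread, used only to keep the example short). The feed content, the URLs in it, and the function name read_jobs are all made up for illustration, and it assumes a plain RSS 2.0 feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, inlined so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Jobs</title>
    <item>
      <title>PHP Developer</title>
      <link>http://www.example.com/jobs/1</link>
    </item>
    <item>
      <title>ColdFusion Developer</title>
      <link>http://www.example.com/jobs/2</link>
    </item>
  </channel>
</rss>"""


def read_jobs(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    jobs = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        jobs.append((title, link))
    return jobs


if __name__ == "__main__":
    for title, link in read_jobs(SAMPLE_FEED):
        print(title, link)
```

Of course this only helps when the employment site actually offers a feed; otherwise you are back to parsing the HTML.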
ok...you got me...what's "web scraping"?
VHSB (Author) Commented:
Web scraping, or screen scraping, is when you automatically parse content from web pages (as long as they are your pages, or you have the permission of the target website). The pages are normally retrieved with HTTP calls. Are you with me?

I could be wrong, but I doubt you will find a ready-made product to suit your needs, primarily because the data you want to 'scrape' is going to be displayed and stored differently from site to site. Unless, of course, all you want to do is save an image of the screen itself, in which case there are any number of screen capture programs out there.

If you wish to save the data, however, you will need to extract the pertinent data out of the page's source -- not an easy task, given the caveat stated above.

That having been said, I would assume ASP or PHP would be equally usable; even JavaScript could be used to some extent.

I use the XMLHTTP object which is becoming quite ubiquitous.


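Dim objH, str

' Create an XMLHTTP object:
Set objH = Server.CreateObject("Microsoft.XMLHTTP")
' Open a synchronous GET request (False = do not run asynchronously):
objH.Open "GET", "http://www.domain.com", False
' Send the request and return the data:
objH.Send
' responseText is the body as a string; responseBody is a raw byte array
str = objH.responseText
Set objH = Nothing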

str now has the HTML from the site which you can easily parse.
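As a rough idea of what the parsing step can look like, here is a minimal sketch that pulls the text out of one element by its id, in line with the earlier advice to key in on an ID that stays put across redesigns. It is written in Python rather than ASP purely for brevity, using only the standard library. The sample page, the id "job-title", and the names TextById and text_by_id are invented for illustration. It also assumes reasonably well-nested markup: an unclosed <p> or <li> inside the target element would throw off the depth count.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical page, standing in for the HTML string fetched over HTTP.
SAMPLE_PAGE = """
<html><body>
  <div id="header">Example Jobs</div>
  <div id="job-title"><h1>PHP <b>Developer</b></h1><br></div>
  <div id="footer">Contact us</div>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def text_by_id(html, target_id):
    """Return the whitespace-normalized text of the element, or '' if absent."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(text_by_id(SAMPLE_PAGE, "job-title"))
```

If the site renames or drops the id, text_by_id simply returns an empty string, which at least gives you something easy to check for when a redesign breaks the scraper.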

VHSB (Author) Commented:
Thanks guys. I should have said that I have already developed a web scraping tool in ColdFusion, and a tedious task it was too. It breaks easily because the HTML changes constantly, and it isn't very flexible, i.e. it can't scrape many sites effectively.
However, I wanted to know if and how other languages might be more suitable for web scraping in the future. As .NET technology is becoming more popular, I thought that would be a good place to start.

Question has a verified solution.
