  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1370

Web scraping - the best tools for the job

Guys
I'm trying to research which web language would be best for developing a web scraping tool. The tool should scrape employment sites only, and let's assume that the developer has only a fair working knowledge of ColdFusion, ASP.NET and PHP.

Or if anyone has any good resources on the subject then they would also be welcome.

Thanks in advance
Asked:
VHSB
1 Solution
 
VincentPuglia Commented:
ok...you got me...what's "web scraping"?
 
VHSB (Author) Commented:
Web scraping, or screen scraping, is when you automatically parse content from web pages (as long as they are your pages, or you have the permission of the target website). The pages are normally retrieved with HTTP calls. Are you with me?
 
VincentPuglia Commented:
Ok...understood.

I could be wrong, but I doubt you will find a ready-made product to suit your needs, primarily because the data you want to 'scrape' is going to be displayed and stored differently from site to site. Unless, of course, all you want to do is save an image of the screen itself, in which case there are any number of screen capture programs out there.

If you wish to save the data, however, you will need to extract the pertinent data out of the page's source -- not an easy task, given the caveat stated above.

That having been said, I would assume ASP or PHP would be equally usable; even JavaScript could be used to some extent.

Vinny
 
rdivilbiss Commented:
I use the XMLHTTP object which is becoming quite ubiquitous.

http://www.w3schools.com/dom/dom_http.asp

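<%
Dim objH, str

' Create an XMLHTTP object:
Set objH = Server.CreateObject("Microsoft.XMLHTTP")
' Open a synchronous (blocking) GET request:
objH.Open "GET", "http://www.domain.com", False

' Send the request and return the data:
objH.Send

' responseText holds the body as a string; responseBody would be a raw byte array
str = objH.responseText
Set objH = Nothing
%>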

str now has the HTML from the site which you can easily parse.
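
As a rough illustration of that parsing step, here is a minimal sketch in Python rather than ASP (Python is not one of the languages discussed in this thread; it is used only because the standard library ships an HTML parser). The sample markup, the `/jobs/...` URLs and the names `LinkCollector` and `extract_links` are all made up for the example, and a real job site will need its own rules.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page source standing in for the fetched HTML string.
sample = """
<html><body>
  <a href="/jobs/101">PHP Developer</a>
  <p>No link here</p>
  <a href="/jobs/102">ColdFusion <b>Contractor</b></a>
</body></html>
"""

for href, text in extract_links(sample):
    print(href, text)
```

The same idea carries over to ASP, PHP or ColdFusion: fetch the source into a string first, then hand it to whatever HTML or regex tooling the language offers.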

Regards,
Rod
 
VHSB (Author) Commented:
Thanks guys. I should have said that I have already developed a web scraping tool in ColdFusion, and a tedious task it was too. It breaks easily because of constant changes in the HTML, and it isn't very flexible, i.e. it can't scrape many sites effectively.
However, I wanted to know if and how other languages may be more suitable for web scraping in the future. As .NET technology is becoming more popular, I thought that would be a good place to start.

Thanks
 
rdivilbiss Commented:
There are some commercial screen scrapers, although I never use them.

As far as breakage goes, that is par for the course, unless you can key in on an ID or other tag that remains relatively static when the pages are redesigned.
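
To make the "key in on an ID" idea concrete, here is a small Python sketch (again Python only for illustration, not one of the languages from the question). It pulls the text out of whichever element carries a given id, ignoring the tags around it. The id `job-title`, both sample layouts and the names `TextById` and `text_by_id` are hypothetical, and the depth counting assumes reasonably well-formed markup where non-void tags are closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # nesting depth inside the target; 0 means outside
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the whitespace-normalised text of the element with that id."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Two hypothetical layouts of the same page, before and after a redesign.
old_layout = '<table><tr><td id="job-title"><b>Senior PHP Developer</b></td></tr></table>'
new_layout = '<div class="card"><h2 id="job-title">Senior PHP Developer</h2><span>London</span></div>'

print(text_by_id(old_layout, "job-title"))
print(text_by_id(new_layout, "job-title"))
```

Both calls return the same title even though the surrounding tags changed completely. If the redesign also renames or drops the id, the scraper still breaks, so this only reduces the problem.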

That is why they came up with RSS...
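
For comparison, reading an RSS feed needs no HTML guessing at all, because the item structure is fixed. The sketch below uses Python's standard XML parser. The feed content, the example.com URLs and the function name `parse_items` are invented for the example, and it assumes a plain RSS 2.0 feed without namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, as an employment site might publish it.
sample_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Jobs</title>
    <link>http://www.example.com/jobs</link>
    <description>Latest vacancies</description>
    <item>
      <title>ASP.NET Developer</title>
      <link>http://www.example.com/jobs/1</link>
    </item>
    <item>
      <title>ColdFusion Developer</title>
      <link>http://www.example.com/jobs/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
        })
    return items


for job in parse_items(sample_feed):
    print(job["title"], job["link"])
```

Whether this helps depends on the target sites actually offering a feed; where they do, it avoids most of the breakage described above.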