VHSB

Web scraping - the best tools for the job

Guys,
I'm trying to research which web language would be best for developing a web scraping tool. The tool should scrape employment sites only, and let's assume that the developer has only a fair working knowledge of ColdFusion, ASP.NET and PHP.

Or if anyone has any good resources on the subject, they would also be welcome.

Thanks in advance
VincentPuglia

ok...you got me...what's "web scraping"?
VHSB

ASKER

Web scraping, or screen scraping, is when you automatically parse content from web pages (as long as they are your pages, or you have the permission of the target website). The pages are normally retrieved with HTTP calls. Are you with me?
Ok...understood.

I could be wrong, but I doubt you will find a ready-made product to suit your needs, primarily because the data you want to 'scrape' is going to be displayed and stored differently from site to site. Unless, of course, all you want to do is save an image of the screen itself, in which case there are any number of screen capture programs out there.

If you wish to save the data, however, you will need to extract the pertinent data out of the page's source -- not an easy task, given the caveat stated above.

That having been said, I would assume ASP or PHP would be equally usable; even JavaScript could be used to some extent.

Vinny
I use the XMLHTTP object which is becoming quite ubiquitous.

http://www.w3schools.com/dom/dom_http.asp

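<%
Dim objH, str

' Create an XMLHTTP object:
Set objH = Server.CreateObject("Microsoft.XMLHTTP")
' False = synchronous: Send blocks until the response has arrived
objH.Open "GET", "http://www.domain.com", False

' Send the request and return the data:
objH.Send

' responseText returns the body as a string (responseBody is a raw byte array)
str = objH.responseText
Set objH = Nothing
%>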

str now holds the HTML from the site, which you can then parse.
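To give a rough idea of what that parsing step can look like, here is a minimal sketch. It is written in Python purely for illustration, not in any of the languages discussed in this thread, and uses only the standard library. The markup (`<h2 class="job-title">`), the sample page and the names `JobTitleParser` and `extract_job_titles` are all made up for the example; a real employment site will use different tags and classes.

```python
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collect the text of every <h2 class="job-title"> element.

    The tag and class name are assumptions for this example only.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "h2" and dict(attrs).get("class") == "job-title":
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching element
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self._in_title = False
            self.titles[-1] = self.titles[-1].strip()


def extract_job_titles(html):
    """Return the list of job titles found in an HTML string."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Stand-in for the page source that the HTTP request returned (made-up sample)
sample_html = """
<html><body>
  <h2 class="job-title">ColdFusion Developer</h2>
  <p>London</p>
  <h2 class="job-title">PHP &amp; MySQL Engineer</h2>
</body></html>
"""
```

Calling `extract_job_titles(sample_html)` on the sample above gives `['ColdFusion Developer', 'PHP & MySQL Engineer']`. Using an HTML parser rather than string searching tends to cope better with whitespace and attribute-order changes, though it still depends on the site keeping the same tags and classes.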

Regards,
Rod
VHSB

ASKER

Thanks guys. I should have said that I have already developed a web scraping tool in ColdFusion, and a tedious task it was too. It breaks easily because the HTML of the target pages changes constantly, and it isn't very flexible, i.e. it can't scrape many sites effectively.
However, I wanted to know if, and how, other languages may be more suitable for the task of web scraping in the future. As .NET technology is becoming more popular, I thought that would be a good place to start.
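One way to soften the fragility described above, whatever language is chosen, is to keep the site-specific details in a small rules table instead of in the scraping code itself, so that an HTML change on one site means editing one entry. The sketch below is only an illustration of that idea, again in Python with the standard library. The site names, tags, classes and the names `SITE_RULES`, `RuleParser` and `scrape` are all hypothetical, and it assumes each field sits in an element that does not contain another element with the same tag name.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: field name -> (tag, class attribute value).
# When a site changes its markup, only its entry here needs updating.
SITE_RULES = {
    "site_a": {"title": ("h2", "job-title"), "location": ("span", "loc")},
    "site_b": {"title": ("a", "vacancy"), "location": ("td", "place")},
}


class RuleParser(HTMLParser):
    """Collect the text of every element matching one of the given rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.found = {field: [] for field in rules}
        self._open = None  # (field, tag) of the element being captured

    def handle_starttag(self, tag, attrs):
        if self._open is not None:
            return  # already capturing; ignore tags nested inside
        css_class = dict(attrs).get("class")
        for field, (want_tag, want_class) in self.rules.items():
            if tag == want_tag and css_class == want_class:
                self._open = (field, tag)
                self.found[field].append("")
                break

    def handle_data(self, data):
        if self._open is not None:
            self.found[self._open[0]][-1] += data

    def handle_endtag(self, tag):
        # Simplification: assumes no same-named tag is nested in the match
        if self._open is not None and tag == self._open[1]:
            field = self._open[0]
            self.found[field][-1] = self.found[field][-1].strip()
            self._open = None


def scrape(site, html):
    """Apply the rules registered for `site` to an HTML string."""
    parser = RuleParser(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return parser.found
```

For example, `scrape("site_a", '<h2 class="job-title">CF Developer</h2><span class="loc">Leeds</span>')` returns `{'title': ['CF Developer'], 'location': ['Leeds']}`. Adding a new site then means adding a rules entry rather than writing new parsing code, although pages that need logins, paging or JavaScript would still need extra work.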

Thanks
ASKER CERTIFIED SOLUTION
rdivilbiss