VHSB
asked on
Web scraping - the best tools for the job
Guys
I'm trying to research what the best web language would be for developing a web scraping tool. The tool should scrape employment sites only, and let's assume that the developer only has a fair working knowledge of ColdFusion, ASP.NET and PHP.
Or if anyone has any good resources on the subject then they would also be welcome.
Thanks in advance
ok...you got me...what's "web scraping"?
ASKER
Web scraping, or screen scraping, is when you automatically parse content from web pages (as long as they are your own pages, or you have the permission of the target website). The pages are normally retrieved with HTTP calls. Are you with me?
Ok...understood.
I could be wrong, but I doubt you will find a ready-made product to suit your needs, primarily because the data you want to 'scrape' is going to be displayed and stored differently from site to site. Unless, of course, all you want to do is save an image of the screen itself, in which case there are any number of screen capture programs out there.
If you wish to save the data, however, you will need to extract the pertinent data from the page's source, which is not an easy task given the caveat stated above.
That having been said, I would assume ASP or PHP would be equally usable; even JavaScript could be used to some extent.
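One way to cope with that site-to-site variation is to keep a separate extraction rule for each site. Here is a minimal sketch in Python (not one of the languages mentioned in the question, but the idea should carry over to PHP or ASP). The site names, the markup and the patterns are all invented for illustration:

```python
import re

# Hypothetical per-site rules: each site gets its own pattern for job titles.
# Real sites would need patterns written against their actual markup.
SITE_RULES = {
    "site-a.example": re.compile(r'<h2 class="job">(.*?)</h2>', re.S),
    "site-b.example": re.compile(r'<td class="title">(.*?)</td>', re.S),
}


def extract_titles(site, html):
    """Return the job titles found in html, using the rule registered for site."""
    pattern = SITE_RULES[site]
    return [match.strip() for match in pattern.findall(html)]


# Two made-up pages that present the same kind of data differently.
page_a = '<h2 class="job">PHP Developer</h2><h2 class="job"> ColdFusion Developer </h2>'
page_b = '<table><tr><td class="title">ASP.NET Developer</td></tr></table>'

titles_a = extract_titles("site-a.example", page_a)
titles_b = extract_titles("site-b.example", page_b)
```

When a site changes its HTML, only that site's entry in the rule table should need updating.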
Vinny
I use the XMLHTTP object which is becoming quite ubiquitous.
http://www.w3schools.com/dom/dom_http.asp
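<%
Dim objH, str
' Create an xmlhttp object:
Set objH = Server.CreateObject("Microsoft.XMLHTTP")
' Open a synchronous GET request (False = do not run asynchronously):
objH.Open "GET", "http://www.domain.com", False
' Send the request and return the data:
objH.Send
' responseText returns the body as a string (responseBody is a byte array):
str = objH.responseText
Set objH = Nothing
%>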
str now has the HTML from the site which you can easily parse.
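As a rough sketch of that parsing step, here is one way to pull job titles out of a fetched page. It is written in Python rather than ASP and uses only the standard library's `html.parser`. The sample markup, the `job-title` class name and the helper names are assumptions made up for illustration:

```python
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collects the text of every element whose class attribute is 'job-title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._capture_tag = None  # name of the tag currently being captured

    def handle_starttag(self, tag, attrs):
        # Start capturing when an element carries the (hypothetical) class.
        if self._capture_tag is None and dict(attrs).get("class") == "job-title":
            self._capture_tag = tag
            self.titles.append("")

    def handle_data(self, data):
        # Append text only while inside a captured element.
        if self._capture_tag is not None:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        # Stop capturing at the matching end tag.
        if tag == self._capture_tag:
            self._capture_tag = None


def parse_titles(html):
    """Return the stripped text of every 'job-title' element in html."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Stand-in for the HTML string returned by the HTTP request.
sample = (
    '<ul>'
    '<li><a class="job-title" href="/jobs/1">Web Developer</a> London</li>'
    '<li><a class="job-title" href="/jobs/2">Database Administrator</a> Leeds</li>'
    '</ul>'
)
titles = parse_titles(sample)
```

A real parser tends to tolerate small markup changes better than hand-written string searches, although it still depends on the class name staying the same.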
Regards,
Rod
ASKER
Thanks guys. I should have said that I have already developed a web scraping tool in ColdFusion, and a tedious task it was too. It breaks easily because of constant changes in the HTML, and it isn't very flexible, i.e. it can't scrape many sites effectively.
However, I wanted to know if, and how, other languages may be more suitable for web scraping in the future. As .NET technology is becoming more popular, I thought that would be a good place to start.
Thanks