asciiassasin

asked on

What web technologies are best for web scraping / comparison?

I have an idea for a service, but only a little HTML and CSS experience in coding for the web.

The basic idea is to automate searching a site that has a search field and returns search results in pages of 20 results each.  I need to search these results for certain keywords.  
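A rough sketch of that first step, in Python only for illustration. It assumes a hypothetical `fetch_page(page_number)` callable that returns the result texts for one page (20 per page on the site described) and an empty list once the results run out. How that callable actually gets the page depends entirely on the site, so it is left out here.

```python
from typing import Callable, Iterable, Iterator, List


def iter_results(fetch_page: Callable[[int], List[str]],
                 max_pages: int = 50) -> Iterator[str]:
    """Yield results page by page until a page comes back empty.

    fetch_page is a hypothetical callable supplied by the caller;
    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    for page_number in range(1, max_pages + 1):
        results = fetch_page(page_number)
        if not results:
            break
        yield from results


def matches_keywords(text: str, keywords: Iterable[str]) -> bool:
    """Case-insensitive check: does the text contain any keyword?"""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def find_matches(fetch_page: Callable[[int], List[str]],
                 keywords: Iterable[str],
                 max_pages: int = 50) -> List[str]:
    """Collect every result, across all pages, that mentions a keyword."""
    wanted = list(keywords)
    return [result for result in iter_results(fetch_page, max_pages)
            if matches_keywords(result, wanted)]
```

Passing the page fetcher in as a parameter keeps the keyword logic testable without touching the real site.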

Once I find a search result that contains the keyword(s) I am looking for, I need to open that page in another thread and search for more specific info (size, color, weight) and store that info in a database.
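Something like the following could handle the detail page, again only as a sketch. It assumes the page shows the attributes as `Label: value` text (for example `Size: Large`), which may not match the real markup, and the table name `items` and its columns are made up for the example. SQLite stands in for whatever database ends up being used.

```python
import re
import sqlite3
from typing import Dict

# Assumption: attributes appear as "Size: Large", "Color: Red", etc.
# A real page may need a different pattern or a proper HTML parser.
FIELD_PATTERN = re.compile(r"\b(size|color|weight)\b\s*:\s*([^<\n]+)",
                           re.IGNORECASE)


def extract_fields(html: str) -> Dict[str, str]:
    """Return the first size/color/weight values found in the page text."""
    fields: Dict[str, str] = {}
    for label, value in FIELD_PATTERN.findall(html):
        fields.setdefault(label.lower(), value.strip())
    return fields


def save_item(conn: sqlite3.Connection, name: str,
              fields: Dict[str, str]) -> None:
    """Insert or update one item row (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(name TEXT PRIMARY KEY, size TEXT, color TEXT, weight TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO items (name, size, color, weight) "
        "VALUES (?, ?, ?, ?)",
        (name, fields.get("size"), fields.get("color"), fields.get("weight")),
    )
    conn.commit()
```

Missing attributes are stored as NULL rather than raising, since not every item page is likely to list all three.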

Then, I need to open a second site and see if the second site already offers the same things as the first by searching based on the item names returned from the search results on the first site.

THEN... if the second site does not have items that are on the first site, I need to automate adding them to the second site.
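The comparison itself is the easy part once both lists of item names are in hand. A minimal sketch, assuming the two sites name the same item the same way apart from case and spacing (if they do not, the matching rule would need to be smarter than this):

```python
from typing import Iterable, List


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def missing_from_second(first_names: Iterable[str],
                        second_names: Iterable[str]) -> List[str]:
    """Return items from the first site that the second site does not list."""
    known = {normalize(name) for name in second_names}
    return [name for name in first_names if normalize(name) not in known]
```

The returned list keeps the first site's original spelling and order, so it can feed straight into whatever step adds the items to the second site.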

So, what web technologies would be best for this type of project?
ASKER CERTIFIED SOLUTION
Scott Fell
Web scraping is a debatable area. It must be done with the consent of the copyright holder.

And what you want to do depends greatly upon the structure of each website. Most sites these days are database driven, so scraping them could be next to impossible (or at least very difficult).
SOLUTION
Hasn't this idea been done? I think Yahoo, Google and Bing may have the technology working already.

If you want information from a web publisher, contact the publisher and ask for an API.  And if you decide you have a right to read their web pages and extract data, please learn about and obey robots.txt.
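For the robots.txt part, Python's standard library has a parser, so a check can look roughly like this. The rules text, the `my-bot` user agent, and the example.com URLs are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Placeholder rules; a real site's robots.txt will differ.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(EXAMPLE_RULES)
allowed = parser.can_fetch("my-bot", "https://example.com/search?q=widget")
blocked = parser.can_fetch("my-bot", "https://example.com/private/page")
```

Against a live site, `set_url()` followed by `read()` on the same class fetches the file directly. Either way, calling `can_fetch()` before every request is the simplest way to stay inside the published rules.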
asciiassasin

ASKER

An API probably isn't coming anytime soon from Company A, but I do have permission from Company A and B to automate this process (or at least the rules do not specifically prohibit this behavior). If they did, I wouldn't be doing it. There is no sense wasting time on something that somebody else can crush with a single email from an attorney.

I know I need a good understanding of regular expressions, but I am really thinking of doing the scraping and data entry using VB.Net until I can figure out how to do it with other web-based technologies.