Hi,
We are developing 5 scrapers that will collect product information from 5 different sources. Each source currently has about 4.5 million products. The initial scrape will run until we have scraped all 4.5 million records from each source. After that, each source site typically adds or removes 40K-50K products per day, so all 5 scrapers will run daily, each against its own site. Once the product information is scraped (dimensions, price, description, color, weight, warranty, etc.), we will display or use that data on our website. We want storage that works well during scraping, but the same data also has to back the search/catalog functionality on our website, which needs to retrieve and display results fast. We have beefed up hardware. We are open to hybrid solutions.
We are trying to decide between SQL and NoSQL. Specifically, we are comparing MongoDB and HBase vs. MySQL/SQL.
Wouldn't it be simpler to ask for catalogues from these sites?
Also, "We have beefed up hardware" suggests that your scraping could amount to a DoS of the source sites. In some countries you're working in a grey area, and in some this could even be illegal.
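
To make the DoS concern concrete, here is a minimal Python sketch of per-site throttling, plus the arithmetic for how long a full crawl takes at a given request rate. Everything in it is an assumption for illustration: the names Throttle and days_for_full_crawl are made up, and it assumes one request per product page. It is not a recommendation of a specific rate; that depends on what each source site allows.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def days_for_full_crawl(n_products, requests_per_second):
    """Days needed to fetch n_products pages, assuming one request per product."""
    return n_products / requests_per_second / 86400
```

Call wait() on the throttle before every request to a given site, with one Throttle instance per site. At 1 request per second, days_for_full_crawl(4_500_000, 1) comes out at about 52 days for a single source, so the initial scrape is slow if you stay polite, and fast only if you put real load on the other side.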