Hi,
We are developing 5 scrapers that will collect product information from 5 different sources. Each source currently has about 4.5 million products. The initial scrape will run until all 4.5 million records from each source have been collected. After that, each source site typically adds or removes 40K-50K products per day, so all 5 scrapers will run daily, one against each site.
Once the product information is scraped (dimensions, price, description, color, weight, warranty, etc.), we will display or use that data on our website. We want something that performs well for the scraping workload, but the same data also has to back the search/catalog functionality on our website, which needs to retrieve and display results fast. We have beefed-up hardware, and we are open to hybrid solutions.
We are trying to decide between SQL and NoSQL. Specifically, we are comparing MongoDB and HBase (NoSQL) against MySQL (SQL).
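To make the daily workload concrete: 40K-50K changes against 4.5 million products is roughly 1% churn per source per day, so the daily run is mostly a diff against what is already stored rather than a full reload. Here is a minimal sketch of that diff, assuming each product has a stable ID on the source site. The function name `daily_delta` and the idea of comparing ID sets are illustrative assumptions, not something we have built yet.

```python
def daily_delta(stored_ids, scraped_ids):
    """Return (added, removed) product IDs for one source.

    stored_ids:  IDs currently in our database for this source (assumed stable).
    scraped_ids: IDs seen on the source site in today's scrape.
    """
    stored = set(stored_ids)
    scraped = set(scraped_ids)
    added = scraped - stored      # on the site today, not yet in our store
    removed = stored - scraped    # in our store, no longer on the site
    return added, removed


if __name__ == "__main__":
    # Tiny illustrative data; real runs would compare about 4.5 million IDs.
    added, removed = daily_delta(["p1", "p2", "p3"], ["p2", "p3", "p4"])
    print(sorted(added), sorted(removed))
```

Whichever store we pick would need to handle this pattern (a bulk insert and delete of about 1% of the rows per day) alongside fast reads for the catalog.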