mbarazi asked:

Using SQL vs NoSQL with scrapers


We are developing 5 scrapers that will scrape product information from 5 different sources. Each source currently has about 4.5 million products. The initial scrape will run until all 4.5 million records from each source have been collected. After that, each source site typically adds or removes 40K-50K products per day, so all 5 scrapers will run daily, each against its own site. Once the product information is scraped (dimensions, price, description, color, weight, warranty, etc.), we will display or use that data on our website. We want something that works well for scraping, but the search/catalog functionality on our website also needs to retrieve and display this data quickly. We have beefed up hardware. We are open to hybrid solutions.

We are trying to decide between SQL and NoSQL. Specifically, we are comparing MongoDB and HBase vs. MySQL and SQL.

Tags: Windows OS, Web Development, Databases, Java, C#


8/22/2022 - Mon

Just a comment:

Wouldn't it be simpler to ask for catalogues from these sites?

Also, "We have beefed up hardware" suggests that this could amount to a DoS of the source sites. In some countries you're working in a grey area. And in some this could be even illegal.

"And in some this could be even illegal."

What... you mean helping yourself to someone else's information might not be right?

When you run massive scraping against a site and you don't have an agreement with its owner, this can count as a denial-of-service attack, for example in Germany, where we have a law that prohibits this kind of "massive querying".

The key point is that you "have beefed up hardware". If your scraping saturates the site's bandwidth, you have effectively carried out a denial-of-service attack.

Typical scraping scenario: retrieving 100 products per web request. For 4.5 million products, this means 45,000 requests. So if your hardware is capable of running 1,000 requests per second, this can already be a problem for that site.
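
To put numbers on this, here is a rough back-of-envelope sketch in Python. The page size of 100 products and the 1,000 requests per second are the figures assumed in the comment above, not measurements, and the function names are only illustrative.

```python
import math

def requests_needed(products: int, per_request: int) -> int:
    """Number of page requests needed to cover a catalog."""
    return math.ceil(products / per_request)

def seconds_to_finish(requests: int, requests_per_second: float) -> float:
    """How long the site is under load at a given request rate."""
    return requests / requests_per_second

# Figures from the thread: 4.5 million products, 100 products per request.
full_catalog = requests_needed(4_500_000, 100)
print(full_catalog)                           # 45000

# At 1,000 requests/second the whole catalog is pulled in one short burst.
print(seconds_to_finish(full_catalog, 1000))  # 45.0
```

In other words, unthrottled hardware would hit the site with its entire catalog's worth of requests in well under a minute, which is where the denial-of-service concern comes from.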

I was being ironic.

I'm actually surprised that Internet law (such as it is) has not yet caught up with this activity, and regulated it more widely.

We are sensitive to the source sites and are inserting a delay between requests; we only issue one request per second. That said, we are just being proactive: we have to be as fast as possible on our end when loading and processing the results of the scrape. Because our product drives these sites to be competitive, they are not necessarily willing to share data.
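
For readers wondering what a one-request-per-second delay might look like, here is a minimal sketch in Python. The `Throttle` class is hypothetical and not from the poster's scrapers; the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until at least min_interval has passed since the last request."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval

# Usage: call wait() before every request to the same source site.
throttle = Throttle(min_interval=1.0)
throttle.wait()  # first call returns immediately
# ... issue the request here; the next wait() will pause for up to 1 second
```

Each scraper would keep its own throttle, so five scrapers against five different sites still send each site only one request per second.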

One request per second makes one insert to the database per second. Floppy disk storage on a token ring network can handle that easily.

One request per second per site, but all 5 scrapers will run concurrently, thus 5 inserts per second.

A floppy drive manages about 10 IOPS.
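
A quick sketch of what 5 inserts per second means over time, using the volumes stated in the question. It assumes, as the comments above do, that each request produces a single insert; if a request returns many products the numbers change accordingly.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def seconds_to_write(rows: int, inserts_per_second: float) -> float:
    """Time needed to write a number of rows at a steady insert rate."""
    return rows / inserts_per_second

sites = 5
rate = 5                          # 5 scrapers, one insert per second each
initial_rows = sites * 4_500_000  # full catalogs of all 5 sources
daily_rows = sites * 50_000       # upper end of the stated daily churn

# Initial load: roughly 52 days at this rate.
print(seconds_to_write(initial_rows, rate) / SECONDS_PER_DAY)

# Daily churn: roughly 13.9 hours at this rate.
print(seconds_to_write(daily_rows, rate) / 3600)
```

Under that assumption the request delay, not the database, sets the pace: the write rate stays at a few rows per second throughout.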
Kyle Hamilton

I think the point we're trying to make is that you don't need anything beyond a simple RDBMS and some cheap hardware, given the stated volume and velocity. You will not gain anything by using anything else, but you will lose flexibility and speed. And if you're doing transactions, i.e. modifying rows, then an RDBMS is by far your best choice.

After that, any ethical considerations are completely on you.
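
As an illustration of the "modifying rows in a transaction" point, here is a minimal sketch of a daily refresh against a relational table. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the table layout, column names and the `apply_daily_scrape` function are assumptions for illustration, not the poster's schema. MySQL offers a comparable upsert via `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Open the database and create a hypothetical products table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               source      TEXT NOT NULL,
               sku         TEXT NOT NULL,
               price       REAL,
               description TEXT,
               last_seen   TEXT NOT NULL,   -- ISO date of the last scrape run
               PRIMARY KEY (source, sku)
           )"""
    )
    return conn

def apply_daily_scrape(conn, source, scraped, run_date):
    """Upsert today's products for one source and drop those no longer listed.

    scraped: iterable of (sku, price, description); run_date: 'YYYY-MM-DD'.
    """
    rows = [(source, sku, price, desc, run_date) for sku, price, desc in scraped]
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            """INSERT INTO products (source, sku, price, description, last_seen)
               VALUES (?, ?, ?, ?, ?)
               ON CONFLICT(source, sku) DO UPDATE SET
                   price = excluded.price,
                   description = excluded.description,
                   last_seen = excluded.last_seen""",
            rows,
        )
        # Anything not seen in this run has been removed from the source site.
        conn.execute(
            "DELETE FROM products WHERE source = ? AND last_seen < ?",
            (source, run_date),
        )

conn = open_catalog()
apply_daily_scrape(conn, "siteA", [("a1", 9.99, "widget"), ("a2", 5.0, "gadget")], "2022-08-21")
apply_daily_scrape(conn, "siteA", [("a1", 8.99, "widget"), ("a3", 1.0, "new item")], "2022-08-22")
print(conn.execute("SELECT sku, price FROM products ORDER BY sku").fetchall())
# [('a1', 8.99), ('a3', 1.0)]
```

Because the upsert and the delete share one transaction, the website never sees a half-applied daily run: either the whole refresh for a source lands or none of it does.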