Solved

Framework For Building Multi-threaded Web Scraper

Posted on 2014-12-23
342 Views
Last Modified: 2014-12-30
I'm trying to find a framework that will allow me to create distributed/parallel tasks. At first I planned on using Node.js to create a web scraper that will scrape multiple URLs in parallel, but I am told it does not allow parallel processing. Right now I only plan on using one computer to process parallel jobs for learning purposes, but my goal is to develop something that supports scaling across multiple nodes if I need more computers to carry out the tasks. Is there a better framework more suited for this? Any development language is fine.
Question by:OriNetworks
3 Comments
 
LVL 26

Assisted Solution

by:dpearson
dpearson earned 500 total points
ID: 40520091
I think the answer depends on what you plan to do with the results of what you scrape.

It's really easy these days to distribute the work of loading the different URLs across multiple threads, and hence across multiple CPU cores. For example, in Java you'd use an ExecutorService (here's an example: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html) and then make HTTP calls using some library like httpclient (http://hc.apache.org/httpclient-3.x/tutorial.html) to do the scraping.

No real need for a framework - the whole thing would be ~50 lines of code and you just spin up instances of this little Java app on different servers as needed.  You could scale to a huge size very easily.
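To make the thread-pool idea concrete, here is a minimal sketch of what such a little app might look like. It is only an illustration under stated assumptions: the class and method names (`ParallelFetcher`, `fetchAll`, `httpGet`) are made up for this example, the pool size of 8 is arbitrary, and it uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) in place of the Apache httpclient library linked above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical example class; not taken from any library.
class ParallelFetcher {

    // Runs the fetcher for every URL on a fixed-size thread pool and returns
    // url -> result in input order. A failed fetch is recorded as "ERROR: ...".
    static Map<String, String> fetchAll(List<String> urls, int threads,
                                        Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetcher.apply(url)));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), "ERROR: " + e.getCause());
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } finally {
            pool.shutdown();
        }
    }

    // Plain blocking GET that returns the response body as a string.
    static String httpGet(HttpClient client, String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // URLs come from the command line; 8 threads is an arbitrary choice.
        HttpClient client = HttpClient.newHttpClient();
        Map<String, String> pages = fetchAll(List.of(args), 8, url -> httpGet(client, url));
        pages.forEach((url, body) -> System.out.println(url + " -> " + body.length() + " chars"));
    }
}
```

Passing the fetcher in as a function keeps the pool logic separate from the HTTP call, so the same `fetchAll` could later run a parsing step instead of a download.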

That will work fine if all you want to do is scrape the URLs looking for some specific data.   Or if you just want to dump the results of what you scrape into a SQL database, then again no framework required.

But if you want to collate the data (e.g. count the number of times a specific phrase occurs on those pages without dumping the results into a single common database), or you want to take one page, break out all of the links it references, and repeat that process recursively, then you start thinking this is more of a Map-Reduce problem, and a framework might help with that.
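The phrase-counting case can be sketched on a single machine to show the shape of the problem. This is only an illustration using Java parallel streams, not a distributed framework: the "map" step counts the phrase in each page and the "reduce" step sums the per-page counts. The names `PhraseCount`, `countInPage` and `countAcrossPages` are hypothetical, and the pages are assumed to be already downloaded as strings.

```java
import java.util.List;

// Hypothetical example class; a single-machine stand-in for map-reduce.
class PhraseCount {

    // Map step: count non-overlapping occurrences of the phrase in one page.
    static int countInPage(String page, String phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = page.indexOf(phrase, from)) != -1) {
            count++;
            from += phrase.length();
        }
        return count;
    }

    // Map every page to its count in parallel, then reduce by summing.
    static long countAcrossPages(List<String> pages, String phrase) {
        return pages.parallelStream()
                .mapToLong(page -> countInPage(page, phrase))
                .sum();
    }

    public static void main(String[] args) {
        List<String> pages = List.of("foo bar foo", "bar", "foo");
        System.out.println(countAcrossPages(pages, "foo"));
    }
}
```

A Map-Reduce framework would run the same two steps, but with the pages spread over several machines and the partial sums combined across the network.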

Doug
 
LVL 17

Author Comment

by:OriNetworks
ID: 40522272
My mention of a framework came from the thought that someone would have already made a system for parallel processing of scripts to get as much work done as fast as possible on a particular machine. Since I only have one machine for testing, I want to get as much work done as possible using as many CPU threads as possible. Of course the web scraping would be one task I wanted to process in parallel. Later I might want to create a processing layer to extract information from those results by just passing a different script to this 'framework'. If I attempted to do this in Java, then I think I would have to learn how to handle different threads for concurrent processing and then recreate this management of threads for each additional task I create in the future.

This is acceptable but I was trying to find a generic framework where I can tell it which script to run and up to how many instances I want to run.
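For what it's worth, the "which script, how many instances" part can be written once and reused. The following is a rough sketch only, assuming the script can be started as an external command. `ScriptRunner` and `runInstances` are names invented for this example, and capping parallelism at the number of CPU cores is just one reasonable default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; not an existing framework.
class ScriptRunner {

    // Starts `instances` copies of the command, at most `maxParallel` at a
    // time, waits for all of them and returns their exit codes.
    static List<Integer> runInstances(List<String> command, int instances, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < instances; i++) {
                futures.add(pool.submit(() -> {
                    // inheritIO() sends the script's output to this console.
                    Process process = new ProcessBuilder(command).inheritIO().start();
                    return process.waitFor();
                }));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get());
            }
            return exitCodes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for scripts", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("could not run " + command, e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Usage: java ScriptRunner.java <instances> <command> [args...]
        int instances = Integer.parseInt(args[0]);
        List<String> command = List.of(args).subList(1, args.length);
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(runInstances(command, instances, cores));
    }
}
```

Because the runner only knows about a command line, swapping the scraping script for a later processing script would not require touching the thread handling again.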
 
LVL 26

Accepted Solution

by:dpearson
dpearson earned 500 total points
ID: 40523221
OK I see.  In that case you may want to try something like PPSS (https://github.com/louwrentius/PPSS).  I've not used it personally, but I believe it's designed for this sort of simple application where you just want to run a set of scripts in parallel.

Hope that helps,

Doug
