Solved

Framework For Building Multi-threaded Web Scraper

Posted on 2014-12-23
Last Modified: 2014-12-30
I'm trying to find a framework that will allow me to create distributed/parallel tasks. At first I planned on using node.js to create a web scraper that scrapes multiple URLs in parallel, but I am told it does not allow parallel processing. For now I only plan on using one computer to process parallel jobs for learning purposes, but my goal is to develop something that can scale across multiple nodes if I need more computers to carry out the tasks. Is there a framework better suited to this? Any development language is fine.
Question by:OriNetworks

Assisted Solution

by:dpearson
dpearson earned 500 total points
ID: 40520091
I think the answer depends on what you plan to do with the results of what you scrape.

It's really easy these days to distribute the work of loading the different URLs across multiple threads, and hence across multiple CPU cores. For example, in Java you'd use an ExecutorService (here's an example: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html)
and then make HTTP calls using a library like httpclient (http://hc.apache.org/httpclient-3.x/tutorial.html) to do the scraping.

No real need for a framework - the whole thing would be ~50 lines of code and you just spin up instances of this little Java app on different servers as needed.  You could scale to a huge size very easily.
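To make that concrete, here is a minimal sketch of the thread-pool approach. It is an illustration under assumptions, not a tested scraper: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) in place of the Apache httpclient library linked above, and the class name ParallelScraper, the method names, the pool size of 4 and the example.com URLs are all made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelScraper {

    // Assumption: the JDK HTTP client stands in for Apache httpclient here.
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Fetches every URL on a fixed-size thread pool and returns url -> body.
    // The fetcher is a parameter so the HTTP library can be swapped out.
    static Map<String, String> scrapeAll(List<String> urls, int threads,
                                         Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per URL; the pool runs up to `threads` at once.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Collect results in the same order the URLs were given.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // One failed page should not abort the whole batch.
                    results.put(urls.get(i), "ERROR: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while scraping", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Plain blocking GET that returns the response body as text.
    static String httpGet(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during request", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs; replace with the pages you actually want.
        List<String> urls = List.of("https://example.com/", "https://example.com/other");
        Map<String, String> pages = scrapeAll(urls, 4, ParallelScraper::httpGet);
        pages.forEach((url, body) -> System.out.println(url + " -> " + body.length() + " chars"));
    }
}
```

Because fetching pages is mostly waiting on the network, the pool size can usefully be larger than the number of CPU cores; 4 is just a starting point.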

That will work fine if all you want to do is scrape the URLs looking for some specific data.   Or if you just want to dump the results of what you scrape into a SQL database, then again no framework required.

But if you want to collate the data (e.g. counting the number of times a specific phrase occurs on those pages without dumping the results into a single common database), or you want to take one page, break out all of the links it references, and repeat that process recursively, then this starts to look more like a Map-Reduce problem, and a framework might help with that.
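As a rough illustration of the map-reduce shape for the phrase-counting case, the sketch below counts a phrase in each page independently (the "map" step) and then sums the per-page counts (the "reduce" step). It runs in a single JVM using parallel streams, so it only shows the structure and is not a distributed framework; the class name PhraseCount, its methods and the sample pages are assumptions made for the example.

```java
import java.util.List;

public class PhraseCount {

    // "Map" step: count non-overlapping, case-sensitive occurrences
    // of the phrase in a single page.
    static int countInPage(String page, String phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = page.indexOf(phrase, from)) != -1) {
            count++;
            from += phrase.length();
        }
        return count;
    }

    // "Reduce" step: sum the per-page counts. Each page is independent,
    // so the map step can run in parallel.
    static long countAcrossPages(List<String> pages, String phrase) {
        return pages.parallelStream()
                .mapToLong(page -> countInPage(page, phrase))
                .sum();
    }

    public static void main(String[] args) {
        // Stand-ins for scraped page bodies.
        List<String> pages = List.of(
                "web scraper and web crawler",
                "no match here",
                "web");
        System.out.println(countAcrossPages(pages, "web")); // prints 3
    }
}
```

The point of the split is that the per-page step needs no shared state, which is what lets a real map-reduce framework spread it across machines.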

Doug

Author Comment

by:OriNetworks
ID: 40522272
My mention of a framework came from the thought that someone would have already made a system for parallel processing of scripts, to get as much work done as fast as possible on a particular machine. Since I only have one machine for testing, I want to get as much work done as possible using as many CPU threads as possible. Of course, the web scraping would be one task I want to process in parallel. Later I might want to create a processing layer to extract information from those results by just passing a different script to this 'framework'. If I attempted to do this in Java, then I think I would have to learn how to handle different threads for concurrent processing and then recreate this thread management for each additional task I create in the future.

This is acceptable but I was trying to find a generic framework where I can tell it which script to run and up to how many instances I want to run.

Accepted Solution

by:dpearson
dpearson earned 500 total points
ID: 40523221
OK I see.  In that case you may want to try something like PPSS (https://github.com/louwrentius/PPSS).  I've not used it personally, but I believe it's designed for this sort of simple application where you just want to run a set of scripts in parallel.
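If you did want to stay in Java, the "tell it which script to run and up to how many instances" idea can be written once and reused, so the thread management does not need to be recreated per task. The following is only a sketch of that idea and is unrelated to how PPSS works; the class name ScriptRunner, its methods and the command-line argument order are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScriptRunner {

    // Wraps an external command as a job that returns its exit code.
    // The child process shares this program's stdin/stdout/stderr.
    static Callable<Integer> scriptJob(List<String> command) {
        return () -> new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    // Runs all jobs with at most maxParallel running at the same time and
    // returns their exit codes in submission order (-1 if a job threw).
    static List<Integer> runAll(List<Callable<Integer>> jobs, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Callable<Integer> job : jobs) {
                futures.add(pool.submit(job));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                try {
                    exitCodes.add(future.get());
                } catch (ExecutionException e) {
                    exitCodes.add(-1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for jobs", e);
                }
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical usage: java ScriptRunner <instances> <maxParallel> <command...>
    public static void main(String[] args) {
        int instances = Integer.parseInt(args[0]);
        int maxParallel = Integer.parseInt(args[1]);
        List<String> command = List.of(args).subList(2, args.length);

        List<Callable<Integer>> jobs = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            jobs.add(scriptJob(command));
        }
        System.out.println("exit codes: " + runAll(jobs, maxParallel));
    }
}
```

Swapping in a different script later only changes the command passed on the command line; the pool handling stays the same.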

Hope that helps,

Doug

Question has a verified solution.