[2 days left] What’s wrong with your cloud strategy? Learn why multicloud solutions matter with Nimble Storage.Register Now

x
?
Solved

Framework For Building Multi-threaded Web Scraper

Posted on 2014-12-23
3
Medium Priority
?
450 Views
Last Modified: 2014-12-30
I'm trying to find a framework that will allow me to create distributed/parallel tasks. At first I planned on using node.js to create a web scraper that will scrape multiple URLs in parallel but I am told it does not allow parallel processing. Right now I only plan on using 1 computer to process parallel jobs for learning purposes but my goal is to develop something that supports scaling across multiple nodes if I need more computers to carry out the tasks. Is there a better framework more suited for this? Any development language is fine.
0
Comment
Question by:OriNetworks
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 28

Assisted Solution

by:dpearson
dpearson earned 2000 total points
ID: 40520091
I think the answer depends on what you plan to do with the results of what you scrape.

It's really easy these days to distribute the work of loading the different URLs across multiple threads and hence multiple processes.  For example in Java you'd use an ExecutorService (here's an example: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html)
and then you make http calls using some library like httpclient (http://hc.apache.org/httpclient-3.x/tutorial.html) to do the scraping.

No real need for a framework - the whole thing would be ~50 lines of code and you just spin up instances of this little Java app on different servers as needed.  You could scale to a huge size very easily.

That will work fine if all you want to do is scrape the URLs looking for some specific data.   Or if you just want to dump the results of what you scrape into a SQL database, then again no framework required.

But if you wanted to collate the data - e.g. counting the number of times a specific phrase occurred on those pages without dumping the results into a single common database or you want to take one page and break out all of the links it references and repeat that process recursively, then you start thinking this is more of a Map-Reduce problem and a framework might help with that.

Doug
0
 
LVL 17

Author Comment

by:OriNetworks
ID: 40522272
My mention of a framework can from the thought that someone would have already made a system for parallel processing of scripts to get as much work done as fast as possible on a particular machine. Since I only have one machine for testing I want to get as much work done as possible using as many CPU threads as possible. Of course the web scraping would be one task I wanted to process in parallel. Later I might want to create a processing layer to extract information from those results by just passing a different script to this 'framework'. If I attempted to do this in java then I think I would have to learn how to handle different threads for concurrent processing and then recreate this management of threads for each additional task I create in the future.

This is acceptable but I was trying to find a generic framework where I can tell it which script to run and up to how many instances I want to run.
0
 
LVL 28

Accepted Solution

by:
dpearson earned 2000 total points
ID: 40523221
OK I see.  In that case you may want to try something like PPSS (https://github.com/louwrentius/PPSS).  I've not used it personally, but I believe it's designed for this sort of simple application where you just want to run a set of scripts in parallel.

Hope that helps,

Doug
0

Featured Post

What’s Wrong with Your Cloud Strategy ?

Even as many CIOs are embracing a cloud-first strategy, the reality is that moving to the cloud is a lengthy process and the end-state is likely to be a blend of multiple clouds—public and private. Learn why multicloud solutions matter in this webinar by Nimble Storage.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Make the most of your online learning experience.
Q&A with Course Creator, Mark Lassoff, on the importance of HTML5 in the career of a modern-day developer.
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…
Starting up a Project
Suggested Courses

649 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question