Solved

Framework For Building Multi-threaded Web Scraper

Posted on 2014-12-23
Last Modified: 2014-12-30
I'm trying to find a framework that will let me create distributed/parallel tasks. At first I planned to use node.js to build a web scraper that scrapes multiple URLs in parallel, but I am told it does not allow parallel processing. For now I only plan to use one computer to run parallel jobs for learning purposes, but my goal is to develop something that can scale across multiple nodes if I need more computers to carry out the tasks. Is there a framework better suited to this? Any development language is fine.
Question by:OriNetworks
3 Comments
 
LVL 27

Assisted Solution

by:dpearson
dpearson earned 500 total points
ID: 40520091
I think the answer depends on what you plan to do with the results of what you scrape.

It's really easy these days to distribute the work of loading the different URLs across multiple threads, and hence multiple CPU cores.  For example, in Java you'd use an ExecutorService (here's an example: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html)
and then make HTTP calls using a library such as httpclient (http://hc.apache.org/httpclient-3.x/tutorial.html) to do the scraping.
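A minimal sketch of that approach, under some assumptions: it uses the JDK's built-in java.net.http client (Java 11+) instead of the Apache httpclient linked above, and the class name, the pool size of 8 and the "ERROR: ..." convention for failed fetches are illustrative choices rather than anything prescribed here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical class name; a sketch, not a complete scraper.
final class ParallelScraper {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    /** Applies fetcher to every URL on a fixed-size thread pool; results keep input order. */
    static Map<String, String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per URL; the pool runs at most `threads` of them at a time.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // Assumption: record the failure and keep going with the other URLs.
                    results.put(urls.get(i), "ERROR: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for results", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** Fetches one page body as text with a plain GET request. */
    static String httpGet(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching " + url, e);
        }
    }

    public static void main(String[] args) {
        // URLs come from the command line; 8 worker threads is an arbitrary choice.
        Map<String, String> pages = fetchAll(List.of(args), ParallelScraper::httpGet, 8);
        pages.forEach((url, body) -> System.out.println(url + " -> " + body.length() + " chars"));
    }
}
```

Passing the fetcher in as a function keeps the thread-pool part separate from the HTTP part, so the same fetchAll could wrap a different HTTP library or a later processing step.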

No real need for a framework - the whole thing would be ~50 lines of code and you just spin up instances of this little Java app on different servers as needed.  You could scale to a huge size very easily.

That will work fine if all you want to do is scrape the URLs looking for some specific data.   Or if you just want to dump the results of what you scrape into a SQL database, then again no framework required.

But if you wanted to collate the data (e.g. counting the number of times a specific phrase occurred on those pages without dumping the results into a single common database), or you wanted to take one page, break out all of the links it references and repeat that process recursively, then you start thinking this is more of a Map-Reduce problem, and a framework might help with that.
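To make the Map-Reduce shape concrete, here is a small single-machine sketch of the phrase-counting case using Java parallel streams. It assumes the page bodies are already in memory as strings, and the class and method names are hypothetical. A real Map-Reduce framework would spread the same two steps across machines; this only shows the pattern.

```java
import java.util.List;

// Hypothetical class name; illustrates map (per-page count) and reduce (sum).
final class PhraseCount {

    /** Map step: counts non-overlapping occurrences of phrase in a single page. */
    static int countInPage(String page, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // avoid an endless loop on an empty search phrase
        }
        int count = 0;
        int from = 0;
        while ((from = page.indexOf(phrase, from)) != -1) {
            count++;
            from += phrase.length();
        }
        return count;
    }

    /** Reduce step: sums the per-page counts; parallelStream may spread the map work across cores. */
    static int countAcrossPages(List<String> pages, String phrase) {
        return pages.parallelStream()
                .mapToInt(page -> countInPage(page, phrase))
                .sum();
    }

    public static void main(String[] args) {
        // Stand-in page bodies; in practice these would be the scraped results.
        List<String> pages = List.of("foo bar foo", "bar", "foo");
        System.out.println(countAcrossPages(pages, "foo")); // prints 3
    }
}
```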

Doug
 
LVL 17

Author Comment

by:OriNetworks
ID: 40522272
My mention of a framework came from the thought that someone would already have made a system for parallel processing of scripts, to get as much work done as fast as possible on a particular machine. Since I only have one machine for testing, I want to get as much work done as possible using as many CPU threads as possible. Of course the web scraping would be one task I want to process in parallel. Later I might want to create a processing layer to extract information from those results by just passing a different script to this 'framework'. If I attempted to do this in Java, then I think I would have to learn how to handle different threads for concurrent processing and then recreate this management of threads for each additional task I create in the future.

This is acceptable but I was trying to find a generic framework where I can tell it which script to run and up to how many instances I want to run.
 
LVL 27

Accepted Solution

by:dpearson
dpearson earned 500 total points
ID: 40523221
OK I see.  In that case you may want to try something like PPSS (https://github.com/louwrentius/PPSS).  I've not used it personally, but I believe it's designed for this sort of simple application where you just want to run a set of scripts in parallel.
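If you'd rather stay in Java, the same "one script, up to N instances" idea can be sketched with a thread pool that launches external processes. This is only an illustration under assumptions: the class name, the convention of passing one work item as the script's last argument, and the -1 exit code for a process that fails to start are all made up here, and it says nothing about how PPSS itself works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class name; a generic "run this script on each item" helper.
final class ScriptRunner {

    /**
     * Runs the command once per item (item appended as the last argument),
     * with at most maxInstances processes running at a time.
     * Returns exit codes in item order; -1 means the process could not be started.
     */
    static List<Integer> runAll(List<String> command, List<String> items, int maxInstances) {
        ExecutorService pool = Executors.newFixedThreadPool(maxInstances);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String item : items) {
                Callable<Integer> task = () -> {
                    List<String> full = new ArrayList<>(command);
                    full.add(item);
                    // inheritIO sends the child's output to this console (it may interleave).
                    Process process = new ProcessBuilder(full).inheritIO().start();
                    return process.waitFor();
                };
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                try {
                    exitCodes.add(future.get());
                } catch (ExecutionException e) {
                    exitCodes.add(-1); // assumption: treat a failed launch as exit code -1
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for scripts", e);
                }
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Usage: <maxInstances> <script> <item> [<item> ...]
        int maxInstances = Integer.parseInt(args[0]);
        List<String> command = List.of(args[1]);
        List<String> items = List.of(args).subList(2, args.length);
        System.out.println(runAll(command, items, maxInstances));
    }
}
```

Because the script is just a command-line argument, a later processing step could reuse the same runner with a different script and a different instance count.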

Hope that helps,

Doug

Question has a verified solution.
