Solved

Framework For Building Multi-threaded Web Scraper

Posted on 2014-12-23
Last Modified: 2014-12-30
I'm trying to find a framework that will let me create distributed/parallel tasks. At first I planned to use node.js to build a web scraper that scrapes multiple URLs in parallel, but I am told it does not allow parallel processing. For now I only plan to use one computer to run parallel jobs for learning purposes, but my goal is to develop something that can scale across multiple nodes if I need more computers to carry out the tasks. Is there a framework better suited to this? Any development language is fine.
Question by:OriNetworks
LVL 26

Assisted Solution

by:dpearson
dpearson earned 500 total points
I think the answer depends on what you plan to do with the results of what you scrape.

It's really easy these days to distribute the work of loading the different URLs across multiple threads, and hence across multiple CPU cores. For example, in Java you'd use an ExecutorService (here's an example: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html) and then make HTTP calls using a library like HttpClient (http://hc.apache.org/httpclient-3.x/tutorial.html) to do the scraping.
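To make the thread-pool idea concrete, here is a minimal sketch rather than a finished scraper. It assumes Java 11 or later and swaps in the JDK's built-in java.net.http.HttpClient for the Apache HttpClient library linked above. The class name ParallelFetcher, the methods httpGet and fetchAll, and the example URLs are all made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class; not part of any library.
class ParallelFetcher {

    // Downloads one page body. Assumption: failures are reported as an "ERROR: ..." string
    // instead of being rethrown, so one bad URL does not abort the whole batch.
    static String httpGet(String url) {
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            return "ERROR: " + e;
        }
    }

    // Applies `fetcher` to every URL on a fixed-size thread pool and
    // returns the results keyed by URL, in the order the URLs were given.
    static Map<String, String> fetchAll(List<String> urls, int threads,
                                        Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                results.put(urls.get(i), futures.get(i).get()); // blocks until that task is done
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("Fetching failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs; replace with the pages you want to scrape.
        List<String> urls = List.of("https://example.com/", "https://example.org/");
        fetchAll(urls, 4, ParallelFetcher::httpGet)
            .forEach((url, body) -> System.out.println(url + " -> " + body.length() + " chars"));
    }
}
```

Passing the fetcher in as a function keeps the pooling logic separate from the HTTP call, so the same fetchAll could later run a parsing step or any other per-URL task.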

No real need for a framework - the whole thing would be ~50 lines of code and you just spin up instances of this little Java app on different servers as needed.  You could scale to a huge size very easily.

That will work fine if all you want to do is scrape the URLs looking for some specific data.   Or if you just want to dump the results of what you scrape into a SQL database, then again no framework required.

But if you wanted to collate the data, e.g. counting the number of times a specific phrase occurred on those pages without dumping the results into a single common database, or if you wanted to take one page, break out all of the links it references, and repeat that process recursively, then this starts to look more like a Map-Reduce problem, and a framework might help with that.
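To show the shape of the phrase-counting case, here is a small single-machine sketch of the map and reduce steps. It is only an illustration of the pattern, not a distributed framework: the class PhraseCount and its method names are made up, and Java's parallel streams stand in for the cluster that a real Map-Reduce system would spread the work across.

```java
import java.util.List;

// Hypothetical illustration of the map/reduce pattern on one machine.
class PhraseCount {

    // Map step: count non-overlapping occurrences of `phrase` in one page's text.
    static int countInPage(String page, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // avoid an infinite loop on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = page.indexOf(phrase, from)) != -1) {
            count++;
            from += phrase.length();
        }
        return count;
    }

    // Reduce step: sum the per-page counts. parallelStream() lets the
    // map step run on several cores at once.
    static int countInPages(List<String> pages, String phrase) {
        return pages.parallelStream()
                    .mapToInt(page -> countInPage(page, phrase))
                    .sum();
    }

    public static void main(String[] args) {
        List<String> pages = List.of("foo bar foo", "bar", "foo");
        System.out.println(countInPages(pages, "foo")); // prints 3
    }
}
```

The point where a framework starts to pay off is when the pages no longer fit on one machine, or when the reduce step has to combine partial results coming from many servers.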

Doug
LVL 17

Author Comment

by:OriNetworks
My mention of a framework came from the thought that someone would have already made a system for parallel processing of scripts, to get as much work done as fast as possible on a particular machine. Since I only have one machine for testing, I want to get as much work done as possible using as many CPU threads as possible. Of course, the web scraping would be one task I want to process in parallel. Later I might want to create a processing layer that extracts information from those results, just by passing a different script to this 'framework'. If I attempted this in Java, I think I would have to learn how to handle threads for concurrent processing and then recreate that thread management for each additional task I create in the future.

This is acceptable, but I was trying to find a generic framework where I can tell it which script to run and the maximum number of instances to run.
LVL 26

Accepted Solution

by:dpearson
dpearson earned 500 total points
OK I see.  In that case you may want to try something like PPSS (https://github.com/louwrentius/PPSS).  I've not used it personally, but I believe it's designed for this sort of simple application where you just want to run a set of scripts in parallel.
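For comparison, the general idea of "run this script over a list of inputs, at most N at a time" can also be sketched in a few lines of Java. This is not PPSS and does not reflect how PPSS works internally. The class ScriptRunner, the method runAll, and the command-line layout in main are hypothetical, chosen only to illustrate the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical generic runner; not related to the PPSS code base.
class ScriptRunner {

    // Runs each command as a separate OS process, at most `maxParallel` at a time,
    // and returns the exit codes in the same order as the input commands.
    static List<Integer> runAll(List<List<String>> commands, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> command : commands) {
                // inheritIO() sends the script's output to this program's console.
                Callable<Integer> task = () ->
                    new ProcessBuilder(command).inheritIO().start().waitFor();
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get()); // waits for that process to finish
            }
            return exitCodes;
        } catch (Exception e) {
            throw new IllegalStateException("A script could not be run", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumed argument layout: <maxParallel> <script> <input1> <input2> ...
        int maxParallel = Integer.parseInt(args[0]);
        String script = args[1];
        List<List<String>> commands = new ArrayList<>();
        for (int i = 2; i < args.length; i++) {
            commands.add(List.of(script, args[i])); // one process per input
        }
        System.out.println("Exit codes: " + runAll(commands, maxParallel));
    }
}
```

Because the runner only deals with commands and exit codes, switching from a scraping script to a later processing script would mean changing the arguments, not the thread handling.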

Hope that helps,

Doug