Multiple instances of a Node.js script

I am starting on my first Node.js project and want to try building a web scraper. I plan to use a MongoDB instance to store the URLs and filters I want to scrape, and to use Node.js to process the tasks I send to it. I imagine the process as follows:

1. I manually add URLs to a task-queue collection in MongoDB.
2. My script runs constantly in a loop and checks the database for new tasks.
3. If a new task is found, the Node.js script starts an instance of my downloader script to begin downloading data from the task URL.
4. While the first task is running, I want the main script to check whether the database has any additional records and, if so, start the downloader script as a new instance. Let's assume I want up to 5 instances running at a time.
5. After an instance of the downloader script finishes, it stores the downloaded data for later processing by a different script and marks the task-queue item as complete.

As a Node.js beginner I still have much to learn, but is there a particular design pattern I should follow to enable this multi-threaded/asynchronous operation?
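The steps above can be sketched in plain Node.js. This is a minimal illustration, not a full implementation: an in-memory array stands in for the MongoDB task-queue collection, a `setTimeout` stands in for the downloader script, and the poll interval and task URLs are made-up values.

```javascript
// In-memory stand-in for the MongoDB task-queue collection (step 1).
const taskQueue = [
  { url: 'http://example.com/a', status: 'pending' },
  { url: 'http://example.com/b', status: 'pending' },
  { url: 'http://example.com/c', status: 'pending' },
];

const MAX_WORKERS = 5; // cap from step 4
let running = 0;
const completed = [];

// Stand-in downloader: in the real project this would spawn your
// downloader script, e.g. with child_process.fork('downloader.js').
function download(task) {
  return new Promise((resolve) => {
    setTimeout(() => {
      completed.push(task.url); // step 5: store data for later processing
      task.status = 'complete'; // step 5: mark the queue item complete
      resolve();
    }, 10);
  });
}

// Steps 2-4: look for pending tasks and run up to MAX_WORKERS at once.
function poll() {
  while (running < MAX_WORKERS) {
    const task = taskQueue.find((t) => t.status === 'pending');
    if (!task) break;
    task.status = 'in-progress';
    running++;
    download(task).finally(() => { running--; });
  }
}

const timer = setInterval(() => {
  poll();
  if (taskQueue.every((t) => t.status === 'complete')) {
    clearInterval(timer); // stop polling once the queue is drained
  }
}, 20);
```

Because the downloader runs asynchronously, the main loop never blocks: one `poll()` pass can start several downloads, which is what makes the 5-at-a-time behavior possible in a single Node.js process.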
OriNetworks (LVL 17) asked:
Dave Baldwin (Fixer of Problems) commented:
Node doesn't run threads (http://nodejs.org/about/). And many sites these days use JavaScript to load the page content after the page itself is loaded in the browser. That makes them nearly impossible to scrape, because even though Node.js is written in JavaScript, it is not going to run the JavaScript that exists in a web page.
clockwatcher commented:
Pages that are built dynamically with JavaScript are sometimes easier to scrape than traditionally rendered pages. In many cases, JSON is now used to serialize and transfer data back and forth, and JSON is much easier to deal with than HTML. The difficult part nowadays is usually session management and endpoint determination. The process is just different; I wouldn't say it's any harder or easier. It depends on the page and what you're after capturing.

As for the question, I would suggest having a set of worker processes constantly running and polling your MongoDB task collection, rather than a master process that polls and then spawns new workers. If your master process goes down, your whole system goes down. If you're going to have a master process at all, it should simply maintain the workers: keep a count of them and, if one dies, spawn a new one. It shouldn't be polling; let the workers do that.
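The supervisor idea above (the master only maintains the worker count and restarts dead workers) can be sketched like this. To keep the sketch self-contained, each "worker" is simulated with a timer that dies after a short run, and a restart cap is added so it terminates; in a real setup each worker would be a `child_process.fork('worker.js')` and the restart would happen in the child's `'exit'` handler.

```javascript
const WORKER_COUNT = 5; // assumed pool size
let alive = 0;
let restarts = 0;

// Stand-in worker: in practice this would be child_process.fork('worker.js'),
// and the worker itself (not the master) would poll the MongoDB queue.
function startWorker(onExit) {
  alive++;
  // Simulate the worker crashing after a short run.
  setTimeout(() => { alive--; onExit(); }, 10);
}

// The master never polls the task queue; it only keeps the pool alive.
function supervise() {
  startWorker(function handleExit() {
    if (restarts < 3) { // cap so this demo eventually stops respawning
      restarts++;
      startWorker(handleExit);
    }
  });
}

for (let i = 0; i < WORKER_COUNT; i++) supervise();
```

The point of the pattern is that the fragile, frequently-running logic (polling, downloading) lives in disposable workers, while the master's only job is trivial enough that it rarely has a reason to crash.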

Google around for a publisher/subscriber (pub/sub with multiple subscribers) setup for MongoDB using Node.js. There are bound to be an example or two out there.
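Whichever setup you land on, multiple polling workers need to claim a task atomically so that two workers never grab the same one. With MongoDB the usual tool is `findOneAndUpdate` (find a pending task and flip its status to in-progress in a single operation). The same idea against an in-memory queue, as a hedged sketch:

```javascript
// In-memory stand-in for the task collection. With the MongoDB Node.js
// driver this whole function would roughly be:
//   collection.findOneAndUpdate(
//     { status: 'pending' },
//     { $set: { status: 'in-progress', claimedBy: workerId } })
// which the server executes atomically, so concurrent workers can't
// both claim the same document.
function claimTask(queue, workerId) {
  const task = queue.find((t) => t.status === 'pending');
  if (!task) return null; // nothing left to claim
  task.status = 'in-progress';
  task.claimedBy = workerId;
  return task;
}

const queue = [
  { url: 'http://example.com/1', status: 'pending' },
  { url: 'http://example.com/2', status: 'pending' },
];

const a = claimTask(queue, 'worker-a');
const b = claimTask(queue, 'worker-b'); // gets a different task than a
const c = claimTask(queue, 'worker-c'); // queue drained: null
```

In a single Node.js process this find-then-set is trivially safe, but across several worker processes sharing one database only the atomic database-side update gives that guarantee.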