
Multiple instances of Node.js script

I am starting my first Node.js project and want to try making a web scraper. I plan to use a MongoDB instance to store the URLs and filters I want to scrape, and to use Node.js to process the tasks I send to it. I imagine the process as follows:

1. I manually add URLs to a task queue collection in MongoDB.
2. My main script runs in a loop, constantly checking the database for new tasks.
3. When a new task is found, the main script starts an instance of my downloader script to download data from the task's URL.
4. While the first task is running, the main script keeps checking the database for additional records and starts the downloader script as a new instance for each one. Let's assume I want up to 5 instances running at a time.
5. When an instance of the downloader script finishes, it stores the downloaded data for later processing by different scripts and marks the task queue item as complete.
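The steps above can be sketched as a dispatcher loop with a concurrency cap. This is a minimal simulation, not a real implementation: an in-memory array stands in for the MongoDB task queue collection, downloads are modeled as jobs that take a fixed number of "ticks", and all names are illustrative.

```javascript
// Simulated task-queue dispatcher. An in-memory array stands in for the
// MongoDB task queue collection; each downloader run is modeled as a job
// that takes 1-3 "ticks" to finish.
const MAX_WORKERS = 5;

// Hypothetical pending tasks (step 1: manually added URLs).
const queue = Array.from({ length: 12 }, (_, i) => ({
  url: `http://example.com/page-${i}`,
  ticksLeft: (i % 3) + 1,        // pretend downloads take 1-3 ticks
  status: 'pending',
}));

let running = [];
let maxObservedConcurrency = 0;
const completed = [];

// One pass of the main loop: finish work in progress (step 5), then claim
// new tasks until the worker limit is reached (steps 2-4).
function tick() {
  running.forEach(task => { task.ticksLeft -= 1; });
  const done = running.filter(t => t.ticksLeft === 0);
  done.forEach(t => { t.status = 'complete'; completed.push(t); });
  running = running.filter(t => t.ticksLeft > 0);

  while (running.length < MAX_WORKERS) {
    const next = queue.find(t => t.status === 'pending');
    if (!next) break;
    next.status = 'processing';  // step 3: hand the task to a downloader
    running.push(next);
  }
  maxObservedConcurrency = Math.max(maxObservedConcurrency, running.length);
}

while (completed.length < queue.length) tick();

console.log(`completed=${completed.length} peak=${maxObservedConcurrency}`);
```

The key invariant is that the claim step marks the task `processing` before work begins, so the loop never hands the same task out twice, and `running.length` never exceeds the cap of 5.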

As a Node.js beginner I still have much to learn, but is there a particular design pattern I should follow to allow this multi-threaded/asynchronous operation?
2 Solutions
Dave Baldwin, Fixer of Problems, commented:
Node doesn't run threads (http://nodejs.org/about/). Also, many sites these days use JavaScript to load the page content after the page arrives in the browser. That makes them nearly impossible to scrape with a plain HTTP request because, even though Node.js is written in JavaScript, it is not going to execute the JavaScript that exists in a web page.
Pages that are built dynamically with JavaScript are sometimes easier to scrape than traditionally rendered pages. In many cases JSON is now used to serialize and transfer the data back and forth, and JSON is much easier to deal with than HTML. The difficult part nowadays is usually session management and finding the right endpoints. The process is just different; I wouldn't say it's any harder or easier. It depends on the page and what you're after capturing.
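To illustrate the point, here is a sketch with a hypothetical JSON payload of the kind a dynamic page fetches from its backend; scraping that endpoint means one parse call instead of picking through HTML.

```javascript
// A hypothetical JSON payload of the kind a dynamic page fetches via XHR.
// In a real scraper this string would be the body of an HTTP response
// from the site's data endpoint.
const body = JSON.stringify({
  products: [
    { name: 'Widget', price: 9.99 },
    { name: 'Gadget', price: 24.5 },
  ],
});

// With JSON there is no DOM traversal and no regex: one parse call and
// plain property access recover the structured data.
const data = JSON.parse(body);
const names = data.products.map(p => p.name);
console.log(names); // [ 'Widget', 'Gadget' ]
```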

As for the question, I would suggest having a set of worker processes that run constantly, each polling your MongoDB task collection, rather than a master process that polls and then spawns new ones. If your master process goes down, your whole system goes down. If you're going to have a master process, it should do nothing but maintain the workers: keep a count of them and, if one dies, spawn a replacement. It shouldn't be polling; let the workers do that.

Google around for a publisher/subscriber (pub/sub with multiple subscribers) setup for MongoDB using Node.js. There's bound to be an example or two out there.
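The core of any such multi-subscriber setup is that each subscriber atomically flips a task from pending to processing before working on it, so no two subscribers grab the same task. With the real MongoDB driver that is a `findOneAndUpdate()` with a status filter; in this sketch an in-memory array stands in for the collection, and all names are illustrative.

```javascript
// In-memory stand-in for the MongoDB task queue collection.
const tasks = [
  { _id: 1, url: 'http://example.com/a', status: 'pending' },
  { _id: 2, url: 'http://example.com/b', status: 'pending' },
  { _id: 3, url: 'http://example.com/c', status: 'pending' },
];

// Sketch of the atomic claim; with the real driver this would be roughly:
//   db.tasks.findOneAndUpdate({ status: 'pending' },
//                             { $set: { status: 'processing' } })
function claimNextTask() {
  const task = tasks.find(t => t.status === 'pending');
  if (task) task.status = 'processing';  // mark claimed before returning it
  return task || null;
}

// Two subscribers polling the same queue never receive the same task;
// a subscriber that finds the queue empty gets null and polls again later.
const claimedByA = [claimNextTask(), claimNextTask()];
const claimedByB = [claimNextTask(), claimNextTask()];
console.log(claimedByA.map(t => t._id), claimedByB.map(t => t && t._id));
```

Because the update and the read happen as one operation on the database side, this stays safe even when the subscribers are separate processes on separate machines.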