Solved

Multiple instances of Node.js script

Posted on 2014-12-23
Last Modified: 2016-02-10
I am starting my first Node.js project and want to try building a web scraper. I plan to use a MongoDB instance to store the URLs and filters I want to scrape, and to use Node.js to process the tasks I send to it. I imagine the process as follows:

1. I manually add URLs to a task queue collection in MongoDB.
2. My script runs constantly in a loop and checks the database for new tasks.
3. If a new task is found, the Node.js script starts an instance of my downloader script to begin downloading data from the task URL.
4. While the first task is working, I want the main script to check whether the database has any additional records and start the downloader script as a new instance. Let's assume I want up to 5 instances running at a time.
5. After an instance of the downloader script finishes, it stores the downloaded data for later processing by a different script and marks the task queue item as complete.

As a Node.js beginner I still have much to learn, but is there a particular design pattern I should follow to allow this multi-threaded/asynchronous operation?
Question by:OriNetworks

Assisted Solution

by:Dave Baldwin
Dave Baldwin earned 100 total points
ID: 40515330
Node doesn't run your script on multiple threads; your JavaScript runs on a single thread with asynchronous I/O.  http://nodejs.org/about/  Also, many sites these days use JavaScript to load the page content after the page is loaded in the browser.  That makes them nearly impossible to scrape because, even though Node.js runs JavaScript, it is not going to run the JavaScript that exists in a web page.
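On the "up to 5 at a time" part of the question: a single thread does not prevent several downloads from being in flight at once, because the waiting happens in asynchronous I/O. Below is a minimal sketch of capping that at five. The names `runWithLimit` and `fakeDownload` are hypothetical, and the download is simulated with a timer rather than a real HTTP request.

```javascript
// Run async tasks with at most `limit` in flight at once.
// `runWithLimit` and `fakeDownload` are hypothetical names for illustration.
async function runWithLimit(tasks, limit, handler) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each "lane" keeps taking the next unclaimed index until none remain.
  // No locking is needed: only one piece of JavaScript runs at a time.
  async function lane() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await handler(tasks[i]);
    }
  }

  const lanes = [];
  for (let n = 0; n < Math.min(limit, tasks.length); n++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}

// Stand-in for a real download: resolves after a short delay and
// records how many simulated downloads were running at the same time.
let inFlight = 0;
let peak = 0;
function fakeDownload(url) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  return new Promise((resolve) =>
    setTimeout(() => {
      inFlight--;
      resolve(`downloaded ${url}`);
    }, 10)
  );
}

const urls = Array.from({ length: 12 }, (_, i) => `http://example.com/page${i}`);
const done = runWithLimit(urls, 5, fakeDownload);
```

In a real scraper the handler would make the HTTP request and store the result; the limit of 5 is simply the figure from the question.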

Accepted Solution

by:clockwatcher
clockwatcher earned 400 total points
ID: 40518737
Pages that are built dynamically with JavaScript are sometimes easier to scrape than traditionally rendered pages.  In many cases, JSON is now used to serialize and transfer data back and forth, and JSON is much easier to deal with than HTML.  The difficult part nowadays is usually session management and working out which endpoints to call.  But the process is just different; I wouldn't say it's any harder or easier.  It just depends on the page and what you're trying to capture.
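To illustrate why a JSON endpoint can be easier than HTML, here is a small sketch. The response body and its field names (`items`, `title`, `price`, `nextPage`) are made up for the example; a real site's payload will look different.

```javascript
// Hypothetical response body from a JSON endpoint that a page calls
// to fill itself in after loading.
const body =
  '{"items":[{"title":"First","price":9.5},{"title":"Second","price":12}],"nextPage":2}';

// Pull out the fields of interest. No HTML parsing or selectors needed.
function extractItems(jsonText) {
  const data = JSON.parse(jsonText);
  return data.items.map((item) => ({ title: item.title, price: item.price }));
}

const items = extractItems(body);
console.log(items);
```

The work then shifts to finding the endpoint (for example by watching the browser's network traffic) and reproducing whatever cookies or headers it expects.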

As for the question, I would suggest simply having a set of worker processes that run constantly and poll your MongoDB task collection, rather than having a master process that polls and then spawns new ones.  If your master process goes down, your whole system goes down.  If you're going to have a master process, it should be one that simply maintains the workers: it keeps a count of them and, if one dies, spawns a new one.  It shouldn't be polling.  Let the workers do that.
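A rough sketch of the worker side of that, under some assumptions: the task array below is an in-memory stand-in for the MongoDB collection, and the field names (`status`, `owner`) and function names (`claimNext`, `markDone`, `worker`) are hypothetical. The important part is that a worker claims a task and marks it as taken in one step, so two workers never pick up the same task.

```javascript
// In-memory stand-in for the task collection; field names are hypothetical.
const tasks = [
  { _id: 1, url: "http://example.com/a", status: "pending" },
  { _id: 2, url: "http://example.com/b", status: "pending" },
  { _id: 3, url: "http://example.com/c", status: "pending" },
];

// Claim one pending task and mark it "working" in a single step.
// Assumption: against a real database this would be one atomic call such as
// the MongoDB driver's collection.findOneAndUpdate(
//   { status: "pending" }, { $set: { status: "working" } }).
async function claimNext(workerId) {
  const task = tasks.find((t) => t.status === "pending");
  if (!task) return null;
  task.status = "working";
  task.owner = workerId;
  return task;
}

async function markDone(task) {
  task.status = "done";
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One worker: poll, process, repeat. `maxIdlePolls` exists only so this
// demo terminates; a long-running worker would keep polling.
async function worker(workerId, handler, maxIdlePolls = 2) {
  const handled = [];
  let idle = 0;
  while (idle < maxIdlePolls) {
    const task = await claimNext(workerId);
    if (!task) {
      idle++;
      await sleep(5); // back off before polling again
      continue;
    }
    idle = 0;
    await handler(task);
    await markDone(task);
    handled.push(task._id);
  }
  return handled;
}

// Two workers sharing the same queue; the handler simulates a download.
const finished = Promise.all([
  worker("w1", () => sleep(5)),
  worker("w2", () => sleep(5)),
]);
```

Each worker would normally be its own process, so that losing one does not stop the others; a supervising process, if you have one, only needs to restart workers that exit.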

Google around for a publisher/subscriber (pub/sub with multiple subscribers) setup for MongoDB using Node.js.  There has to be an example or two out there.