Solved

Multiple instances of Node.js script

Posted on 2014-12-23
Last Modified: 2016-02-10
I am starting on my first Node.js project and want to try building a web scraper. I plan to use a MongoDB instance to store the URLs and filters I want to scrape, and Node.js to process the tasks I send to it. I imagine the process as follows:

1. I manually add URLs to a task queue table in MongoDB.
2. My script runs constantly in a loop and checks the database for new tasks.
3. If a new task is found, the Node.js script starts an instance of my downloader script, which begins downloading data from the task URL.
4. While the first task is working, I want the main script to check whether the database has any additional records and, if so, start the downloader script as a new instance. Let's assume I want up to 5 instances running at a time.
5. After an instance of the downloader script finishes, it stores the downloaded data for later processing by a different script and marks the task queue item as complete.

As a Node.js beginner I still have much to learn, but is there a particular design pattern I should follow to allow this multi-threaded/asynchronous operation?
Question by:OriNetworks
2 Comments
 
LVL 84

Assisted Solution

by:Dave Baldwin
Dave Baldwin earned 400 total points
ID: 40515330
Node doesn't run your JavaScript on multiple threads; it uses a single-threaded event loop with non-blocking I/O.  http://nodejs.org/about/  Also, many sites these days use JavaScript to load the page content after the page has loaded in the browser.  That makes them near impossible to scrape because, even though Node.js runs JavaScript, simply downloading a page with Node is not going to run the JavaScript that exists in that page.
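For what it's worth, the "up to 5 at a time" requirement from the question does not need threads. Below is a minimal sketch of capping concurrent asynchronous tasks on a single thread. The names `runWithLimit` and `fakeDownload` are made up for illustration, and `fakeDownload` only simulates a download with a timer; a real scraper would do an HTTP request there.

```javascript
// Run async tasks with at most `limit` in flight at once, all on one thread.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each "lane" keeps pulling the next unclaimed index until none remain.
  async function lane() {
    while (next < items.length) {
      // Safe without locks: nothing else runs between the check and the increment.
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  const lanes = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}

// Hypothetical stand-in for a download: resolves after a short delay.
function fakeDownload(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`body of ${url}`), 10);
  });
}
```

Calling `runWithLimit(urls, 5, fakeDownload)` would start five downloads immediately and begin another each time one finishes.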
 
LVL 25

Accepted Solution

by:clockwatcher
clockwatcher earned 1600 total points
ID: 40518737
Pages that are built dynamically with JavaScript are sometimes easier to scrape than traditionally rendered pages.  In many cases, JSON is now used to serialize and transfer data back and forth, and JSON is much easier to deal with than HTML.  The difficult part nowadays is usually session management and working out which endpoints to call.  The process is just different; I wouldn't say it's any harder or easier.  It just depends on the page and what you're trying to capture.
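To illustrate why JSON is the easier case, here is a small sketch. The payload in `body` and its field names (`items`, `title`, `price`) are invented for the example; a real site's endpoint will have its own shape, which you would find by watching the browser's network requests.

```javascript
// Hypothetical response body from a page's JSON endpoint (made-up data).
const body =
  '{"items":[{"title":"First","price":9.5},{"title":"Second","price":12}]}';

// Pull the titles out of the payload; no HTML parsing or selectors needed.
function extractTitles(jsonText) {
  const data = JSON.parse(jsonText);
  return (data.items || []).map((item) => item.title);
}
```

Getting the same fields out of rendered HTML would typically require an HTML parser and selectors that break whenever the page layout changes.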

As for the question, I would suggest simply having a set of worker processes that run constantly and poll your MongoDB task table, rather than a master process that polls and then spawns new ones.  With the latter design, if your master process goes down, your whole system goes down.  If you're going to have a master process, it should simply maintain the workers: keep a count of them and, if one dies, spawn a new one.  It shouldn't be polling.  Let the workers do that.
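A rough sketch of the worker side follows. It uses an in-memory `TaskQueue` class as a stand-in for the MongoDB task table so the example runs on its own; the class, the `status` values, and `runWorker` are all assumptions for illustration. The important part is that a worker claims a task and marks it as running in one step, so two workers never pick up the same task.

```javascript
// In-memory stand-in for the task table (hypothetical schema: url, status).
class TaskQueue {
  constructor(urls) {
    this.tasks = urls.map((url) => ({ url, status: 'pending', workerId: null }));
  }

  // Find a pending task and mark it running in a single step.
  claim(workerId) {
    const task = this.tasks.find((t) => t.status === 'pending');
    if (!task) return null;
    task.status = 'running';
    task.workerId = workerId;
    return task;
  }

  complete(task) {
    task.status = 'done';
  }
}

// One worker: claim, download, mark complete, repeat until nothing is pending.
// A long-running worker would wait and poll again instead of returning.
async function runWorker(workerId, queue, download) {
  let handled = 0;
  for (let task = queue.claim(workerId); task; task = queue.claim(workerId)) {
    await download(task.url);
    queue.complete(task);
    handled++;
  }
  return handled;
}
```

Against a real MongoDB collection, the claim step would most likely be a single `findOneAndUpdate` call that matches a pending task and sets it to running, since that is atomic on one document. If you do want a supervising master, Node's `cluster` module (`cluster.fork()` plus the `'exit'` event) is one way to respawn a worker that dies.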

Google around for a publisher/subscriber (pub/sub with multiple subscribers) setup for MongoDB using Node.js.  There has got to be an example or two out there.
