Complex Job scheduler software

Posted on 2014-07-22
Last Modified: 2014-10-23

I'm investigating how to design and build an application that will process large numbers of Monte Carlo simulation jobs. These jobs will be of varying types, and throughput will vary with the time of day, week, and year. The Monte Carlo applications will have an interface written in .NET (the code behind may be in C#, C++ or another language), and the actual interface implementation is open (pending what I find). These jobs typically take < .5 sec, and the idea is to amalgamate results from multiple nodes into one complete set of results that is transferred to the requesting user.

These are the items I am considering:
- Task scheduler fault tolerance (self-explanatory)
- Task scheduler features (e.g. task priority, alerting, elastic demand capability, bursting to the cloud, monitoring, reporting)
- How failures in worker nodes are defined and handled (ideally I'd like a failure to be identified if a task took too long to complete, and the node would then be removed from the cluster)
- Jobs to be scheduled according to available CPU, with limits that can be set on this
- Interoperability with .NET (I'd like to be able to automate any additional features required, e.g. auto-provisioning in the cloud when resource usage exceeds limit X)

I've had a brief look at:
 - (Looks very good - haven't been able to install locally to test, so any experiences appreciated)
 - (Again haven't looked at it closely)
 - (Installed this locally and there is a .net sdk but this is massively out of date)

Basically I'm just looking for advice on how to tackle this problem, and whether to attempt to develop an in-house solution, combine several open-source solutions, or go fully commercial.

Any examples/pointers appreciated.

Question by:basil365

Expert Comment

ID: 40221309
from what i gather, you should try to keep it as simple as possible.

first of all, i'd start by stripping as many requirements as possible:
- failures in worker nodes: i guess you can accept that some of the nodes do not answer fast enough, and you want nodes to come back up by themselves, so giving low priority to slow nodes should be enough
- job scheduling based on available cpu: i guess the above covers it as well, and you can safely ignore this requirement.

here is a very trivial example of something that looks workable

nodes: each node runs a daemon that listens for queries on a tcp socket. a single tcp connection handles a single request at a time, so the client can perform a simple select or equivalent call in order to grab the answers as they come. pipelining may or may not be supported. each node can start rejecting queries based on a maximum number of queries, ram shortage, cpu shortage, or swap state, but this is probably not even needed to start with.
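A minimal sketch of such a node daemon, written in Python for brevity (the same shape should carry over to .NET). Everything specific here is an assumption made for illustration: the one-line text protocol (`<samples> <seed>` in, `OK <value>` or `ERR ...` out), the names `run_job`, `JobHandler`, `start_node` and `ask`, and the pi estimate standing in for a real Monte Carlo job.

```python
import random
import socket
import socketserver
import threading

def run_job(samples: int, seed: int) -> float:
    """Stand-in Monte Carlo job: estimate pi from `samples` random points."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

class JobHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One connection serves one request at a time: read a line, answer, repeat.
        for line in self.rfile:
            try:
                samples, seed = (int(x) for x in line.split())
                reply = f"OK {run_job(samples, seed)}\n"
            except (ValueError, ZeroDivisionError):
                reply = "ERR bad request\n"
            self.wfile.write(reply.encode())

def start_node(host="127.0.0.1", port=0):
    """Start the daemon in a background thread; port 0 lets the OS choose."""
    server = socketserver.ThreadingTCPServer((host, port), JobHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def ask(address, samples, seed):
    """Minimal client: send one request and read one answer."""
    with socket.create_connection(address, timeout=5) as sock:
        sock.sendall(f"{samples} {seed}\n".encode())
        return sock.makefile().readline().strip()
```

Calling `server = start_node()` and then `ask(server.server_address, 1000, 1)` returns a line starting with `OK`. Rejecting queries under load, as mentioned above, is left out on purpose.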

clients: each client reads the list of existing nodes from a simple backend (dns, sql, flat file...) with no extra information regarding node capacity. each client sends a few more queries than needed (let's say 5% more can be deemed reasonable) and ignores the surplus answers. each client maintains a list of server speeds. that list is ordered by the time each server last took to answer (no average, no history, no nothing). the recorded time is decremented periodically. the client picks nodes at random from among the fastest servers. dead servers are just set to a high waiting time so they won't be retried too soon. new servers start with a response time of 0.
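The client-side bookkeeping described above could look roughly like the following Python sketch. The constants (`DEAD_PENALTY`, `DECAY_PER_SEC`, `TOP_FRACTION`) and the `NodePicker` class are hypothetical names with made-up values, and the decay is computed lazily when a score is read rather than by a periodic job, which is a simplification of "decremented periodically".

```python
import random
import time

DEAD_PENALTY = 60.0   # assumed: seconds charged to a node that failed to answer
DECAY_PER_SEC = 0.1   # assumed: how fast a recorded time is forgiven
TOP_FRACTION = 0.5    # assumed: share of the list treated as "fastest"

class NodePicker:
    def __init__(self, nodes, clock=time.monotonic):
        self.clock = clock
        now = clock()
        # New nodes start with a response time of 0, so they get tried early.
        self.scores = {node: (0.0, now) for node in nodes}

    def _current(self, node):
        """Last recorded response time minus the decay accrued since then."""
        value, stamp = self.scores[node]
        return max(0.0, value - DECAY_PER_SEC * (self.clock() - stamp))

    def record(self, node, seconds):
        """Keep only the most recent response time: no average, no history."""
        self.scores[node] = (seconds, self.clock())

    def mark_dead(self, node):
        """A dead node gets a high waiting time so it is not retried too soon."""
        self.record(node, DEAD_PENALTY)

    def pick(self, count, rng=random):
        """Choose `count` nodes at random among the fastest part of the list."""
        ranked = sorted(self.scores, key=self._current)
        pool = ranked[:max(count, int(len(ranked) * TOP_FRACTION))]
        return rng.sample(pool, min(count, len(pool)))
```

A client would call `pick()` for slightly more nodes than it needs answers from, then `record()` the elapsed time for each reply and `mark_dead()` for each timeout. Because every client keeps its own list, no shared state is required.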

if you have enough clients, the load will be spread quite evenly among the servers and prioritised according to the servers' actual response times (including external slowdown factors such as dns latency) rather than through a complex mechanism relying on centralised information regarding cpu or whatever.

i guess you can easily improve this by adding alerts (each client can send his own alerts, i see little to no reason for grouping the information), adding more complex features such as the possibility for a node to remove itself from the list for a while or permanently, ...

i assume the client part would be trivial to handle in .net but i may be wrong here. same applies to nodes, but it is probably not needed anyway.

if you need centralised management of the node list (which i think is a bad idea), you can use a nosql backend to store the list of nodes and their response times. some (no)sql backends will even handle the decay. if they do not, you can handle it by storing current timestamp + response_time / 10 as the priority (example for a 1/10th of a second decay per second), and having a task remove entries whose priority is lower than the current timestamp (it can run on any client/node, and if possible on several of them)

hope some of the above helps.
i think this leads the way to very little coding, and quite a lot of power.

Author Comment

ID: 40228707

thanks for your response - if I decide to develop fully in-house, some of the above suggestions will be very useful. At the moment I'm still evaluating the options though, so I don't need to go into the very low-level specifics.

To date I've discovered that the mechanism for submitting a job on traditional schedulers is to pass a binary to a machine and invoke it with specified parameters. I've also noticed that most of the ready-made solutions are geared to run on UNIX and are not directly compatible with .NET (of course you can create wrappers for APIs etc.), which isn't ideal.

My company's architecture team recognises the need to buy more ready-made products instead of doing full in-house development, so if at all possible (it will ultimately be my decision) I want to give that the best chance possible. Any similar experiences?

Accepted Solution

skullnobrains earned 500 total points
ID: 40230066
i may have misread the question: i understood you were expecting response times below 5 seconds overall, and not just for each job, for something that runs over the long term.

then i have personal experience with something similar: i used a database as a backend, and clients would just loop over the table looking for the next line with something to do (1 line = 1 job), use advisory locks on the sql server so 2 nodes don't collide, and store the results directly in the same table. it is very likely that something similar can be devised for your needs, and it would be doable in .net

i'm pretty positive that it is safer and more robust to put smartness on the nodes rather than on a central scheduler when dealing with such tasks

i ended up writing such a script because none of the schedulers i knew of at the time seemed to fit the bill (though i did not even consider many of the expensive ones), and testing even a couple of such products (ones i had not heard bad things about) seemed more time-consuming than writing one. but if you have reasons to use third-party products, this is quite a different situation. i'm sorry, but i cannot point to a good one, as all the not-too-expensive ones i worked with or heard of were either poor or would not fit the bill without much additional work

"the mechanism of submitting a job on traditional schedulers is to pass a binary to a machine and invoke it with specified parameters"

from what i gather, this does not really fit your needs, because you need to execute the same job every time with different parameters.

note that it is quite feasible to inject a script or executable and run it on windows as well (psexec for example does this quite neatly)... but again i could not point to a suitable product except by googling (i heard good stuff about opcon but never used it, and i'm not sure it fits your needs).


note that if you have to write a program that would run on a single machine and execute whatever processing on a series of datasets, running it as a daemon and pulling datasets from files over the network or from a database does not require much more coding than the original program (10-20 additional lines), which may give you a good argument in favour of home-made
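To give a feel for how small that wrapper can be, here is a hedged Python sketch of the file-based variant. `process` stands in for the original single-machine program, and the `inbox`/`outbox` directories, the `.txt` naming and the rename-to-claim convention are all assumptions. Renaming is atomic on a local POSIX filesystem, but behaviour on network shares varies, so treat the claim step as illustrative only.

```python
import os
import time
from pathlib import Path

def process(text: str) -> str:
    """Stand-in for the original single-machine program: sum the numbers."""
    return str(sum(float(x) for x in text.split()))

def claim(path: Path, worker: str):
    """Rename the file to mark it as ours; losing the race returns None."""
    mine = path.with_name(f"{path.name}.{worker}.work")
    try:
        os.rename(path, mine)
        return mine
    except FileNotFoundError:
        return None

def run_once(inbox: Path, outbox: Path, worker: str) -> int:
    """Process every dataset currently waiting in `inbox`; return the count."""
    handled = 0
    for path in sorted(inbox.glob("*.txt")):
        mine = claim(path, worker)
        if mine is None:
            continue  # another node got there first
        (outbox / path.name).write_text(process(mine.read_text()))
        mine.unlink()
        handled += 1
    return handled

def run_forever(inbox: Path, outbox: Path, worker: str, idle_sleep=1.0):
    """The daemon part: poll, and sleep briefly when there is nothing to do."""
    while True:
        if run_once(inbox, outbox, worker) == 0:
            time.sleep(idle_sleep)
```

Everything apart from `process` is the additional daemon code, which is roughly in line with the estimate above.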


Author Comment

ID: 40231449
thanks for your response - good to get an opinion from your experience! At the moment I'm thinking there will be custom work to do on the worker nodes (the ideas above can be applied), with (hopefully) a paid-for solution for resource management (possibly similar to YARN or Mesos)

Expert Comment

ID: 40237305
from what i gather, i really would not do resource management in your case, as i hardly see any benefit at all for such kind of tasks, but you're the one to make the decision.

maybe a more precise description of what you need to run would give either a better understanding of why you need it (and possibly which kind), or better arguments against.

feel free to ask about any problem you experience along the way, and post about your experience, as i'm pretty sure your work will produce quite a lot of useful information for people with similar goals.

Author Closing Comment

ID: 40398781
Ended up building our own solution in-house.
