Complex Job scheduler software


I'm doing some investigation into building/designing an application that will process large numbers of Monte Carlo simulation-type jobs. These jobs will be of varying types, and throughput will vary with the time of day/week/year. The Monte Carlo applications will have an interface written in .NET (the code behind may be in C#, C++ or another language), and the actual interface implementation is open (pending what I find). These jobs typically take < 0.5 sec, and the idea is to amalgamate results from multiple nodes into one complete set of results that is returned to the requesting user.

These are the items I am considering:
- Task scheduler fault tolerance (self-explanatory)
- Task scheduler features (e.g. task priority, alerting, elastic demand capability, bursting to the cloud, monitoring, reporting)
- How failures in worker nodes are defined and handled (ideally I'd like a failure to be flagged if a task takes too long to complete, and the node removed from the cluster)
- Jobs to be scheduled according to available CPU, with limits configurable on this
- Interoperability with .NET (I'd like to be able to automate any additional features required, e.g. auto-provisioning in the cloud when resources exceed a given limit)

I've had a brief look at:
- (Looks very good; I haven't been able to install it locally to test, so any experiences would be appreciated)
- (Again, I haven't looked at it closely)
- (Installed this locally; there is a .NET SDK, but it is massively out of date)

Basically I'm just looking for advice on how to tackle this problem, and whether to attempt to develop an in-house solution, combine several open source solutions, or go fully commercial.

Any examples/pointers appreciated.


From what I gather, you should try to keep it as simple as possible.

First of all, I'd start by stripping as many requirements as possible:
- Failures in worker nodes: I guess you can accept that some of the nodes do not answer fast enough, and you want nodes to come back up by themselves, so giving low priority to slow nodes should be enough.
- Job scheduling based on available CPU: I guess the above covers this as well, so you can safely ignore this requirement.

Here is a very simple example of something that looks workable.

Nodes: each hosts a daemon that listens for queries on a TCP socket. A single TCP connection handles a single request at a time, so the client can perform a simple select() or equivalent call to grab the answers as they arrive. Pipelining may or may not be supported. Each node can start rejecting queries based on a maximum query count, RAM shortage, CPU shortage, or swap state, but this is probably not even needed to start with.
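As a concrete illustration of the node side, here is a minimal sketch in Python (the line-based protocol, the port, and the `run_simulation` placeholder are all assumptions for illustration, not part of the original design): one request per TCP connection, one line in, one line out.

```python
import socket
import threading

def run_simulation(params: str) -> str:
    """Placeholder for the actual Monte Carlo job (assumption)."""
    return "result-for:" + params

def handle(conn: socket.socket) -> None:
    # One request per connection: read a single line, answer it, close.
    with conn:
        request = conn.makefile("r").readline().strip()
        conn.sendall((run_simulation(request) + "\n").encode())

def serve(host: str = "0.0.0.0", port: int = 9000) -> None:
    # Each node harbours one such daemon; one thread per open connection.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()
```

Rejecting queries under load would just be an extra check at the top of `handle`; as noted above, probably not needed to start with.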

Clients: each client reads the list of existing nodes from a simple backend (DNS, SQL, flat file...) with no extra information about node capacity. Each client sends a few more queries than needed (say 5% more) and ignores the superfluous answers. Each client maintains a list of server speeds, ordered by the time each server last took to answer (no average, no history, nothing else). That time is decremented periodically. Clients pick nodes randomly from among the fastest servers. Dead servers are simply set to a high waiting time so they won't be retried too soon. New servers start with a response time of 0.
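The per-client scoreboard described above could be sketched like this (Python for illustration; the `PENALTY` value and the "fastest half" candidate pool are arbitrary choices, not prescribed by the scheme):

```python
import random

class ServerBoard:
    """Per-client scoreboard: one 'last response time' per server,
    no average, no history."""

    PENALTY = 60.0  # dead servers get a high time so they aren't retried soon

    def __init__(self, servers):
        self.times = {s: 0.0 for s in servers}  # new servers start at 0

    def record(self, server, seconds):
        # Keep only the time the server last took to answer.
        self.times[server] = seconds

    def record_dead(self, server):
        self.times[server] = self.PENALTY

    def decay(self, amount=0.1):
        # Called periodically so slow/dead servers eventually get retried.
        for s in self.times:
            self.times[s] = max(0.0, self.times[s] - amount)

    def pick(self, k=1):
        # Pick randomly among the fastest half rather than always the
        # single best, so load spreads instead of piling on one node.
        ranked = sorted(self.times, key=self.times.get)
        pool = ranked[: max(k, len(ranked) // 2)]
        return random.sample(pool, min(k, len(pool)))
```

Sending 5% more queries than needed is then just calling `pick` with a slightly larger `k` and discarding the slowest answers.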

If you have enough clients, the load will spread quite evenly among the servers and will be prioritised according to actual response times (including external slowdown factors such as DNS latency), rather than by a complex mechanism built on centralised information about CPU or whatever.

I guess you can easily improve this by adding alerts (each client can send its own alerts; I see little to no reason to centralise that information), or more complex features such as the ability for a node to remove itself from the list for a while or permanently.

I assume the client part would be trivial to handle in .NET, but I may be wrong here. The same applies to the nodes, though .NET is probably not needed there anyway.

If you need centralised management of the node list (which I think is a bad idea), you can use a NoSQL backend to store the list of nodes and their response times. Some (no)SQL backends will even handle the decay for you. If they do not, you can handle it by storing the current timestamp plus the scaled response time (e.g. for a decay of 1/10th of a second per second), and running a task that removes entries whose priority is below the current timestamp (this task can run on any client/node, ideally on several of them).
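One reading of that decay scheme, sketched in Python (the scaling factor is an assumption; the point is that the stored score never needs updating, because the advancing clock does the decay for free):

```python
import time

# One wall-clock second erases 1/10th of a second of recorded response
# time, so a response time of r seconds fully decays after 10*r seconds.
DECAY = 10.0

def score(response_time_s, now=None):
    """Store this instead of the raw response time; it needs no updates."""
    now = time.time() if now is None else now
    return now + response_time_s * DECAY

def effective_time(stored_score, now=None):
    """Recover the decayed response time; <= 0 means fully decayed."""
    now = time.time() if now is None else now
    return (stored_score - now) / DECAY

def expired(stored_score, now=None):
    # The cleanup task deletes rows for which this is true.
    now = time.time() if now is None else now
    return stored_score < now
```

The cleanup task then reduces to `DELETE ... WHERE priority < now`, which any client or node can run.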

Hope some of the above helps. I think this approach leads to very little coding, and quite a lot of power.
basil365Author Commented:

Thanks for your response - if I decide to develop fully in-house, some of the above suggestions will be very useful. At the moment I'm more evaluating the options, though, so I don't need to go into the very low-level specifics.

To date I've discovered that the mechanism for submitting a job on traditional schedulers is to pass a binary to a machine and invoke it with specified parameters. I've also noticed that most of the ready-made solutions are geared to run on UNIX and are not directly compatible with .NET (of course you can create wrappers for the APIs, etc.), which isn't ideal.

My company's architecture team recognises the need to buy more ready-made products instead of doing full in-house development, so if at all possible (it will ultimately be my decision) I want to give that the best chance. Any similar experiences?
I may have misread the question: I understood you were expecting response times below 5 seconds overall, and not just for each job, for something that runs over the long term.

Then I have personal experience with something similar: I used a database as a backend, and clients would just loop over the table looking for the next line with something to do (1 line = 1 job), use advisory locks on the SQL server so two nodes don't collide, and store the results directly in the same table. It is very likely that something similar can be devised for your needs, and it would be doable in .NET.

I'm pretty positive that it is safer and more robust to put the smartness in the nodes rather than in a central scheduler when dealing with such tasks.

I ended up writing such a script because none of the schedulers I knew of at the time seemed to fit the bill (though I did not even consider many of the expensive ones), and testing even a couple of such products (ones I had not heard bad things about) seemed more time-consuming than writing one. But if you have reasons to use third-party products, that is quite a different situation. I'm sorry, but I cannot point you to a good product of this kind, as all the not-too-expensive ones I worked with or heard of were either lame or would not fit the bill without much additional work.

"the mechanism of submitting a job on traditional schedulers is to pass a binary to a machine and invoke it with specified parameters"

From what I gather, this does not really fit your needs, because you would rather always execute the same job with different parameters.

Note that it is quite feasible to inject a script or executable and run it on Windows as well (PsExec, for example, does this quite neatly)... but again I could not point you to one except by googling (I heard good things about OpCon but never used it, and I'm not sure it fits your needs).


Note that if you have to write a program that runs on a single machine and executes some treatment on a series of datasets, running it as a daemon that pulls datasets from files over the network or from a database does not require much more coding than the original program (10-20 additional lines), which may be a good argument for home-made.


basil365Author Commented:
Thanks for your response - good to get an opinion based on your experience! At the moment I'm thinking there will be custom work to do on the worker nodes (the ideas above can be applied), with (hopefully) a paid-for solution for resource management (possibly similar to YARN or Mesos).
From what I gather, I really would not do resource management in your case, as I hardly see any benefit at all for this kind of task, but you're the one to make the decision.

Maybe a more precise description of what you need to run would give a better understanding of why you need it, and possibly which kind, or better arguments against it.


Feel free to ask about any problem you experience along the way, and post about your experience, as I'm pretty sure your work will produce quite a lot of useful information for people with similar goals.
basil365Author Commented:
Ended up building our own solution in-house.