Solved

Complex Job scheduler software

Posted on 2014-07-22
Last Modified: 2014-10-23
Hi,

I'm doing some investigation into building/designing an application that will process large numbers of Monte Carlo simulation jobs. These jobs will be of varying types, and throughput will vary according to the time of day/week/year. The Monte Carlo applications will have an interface written in .NET (the code behind may be in C#, C++ or another language), and the actual interface implementation is open (pending what I find). These jobs typically take < 0.5 sec, and the idea is to amalgamate results from multiple nodes into one complete set of results that is transferred to the requesting user.

These are the items I am considering:
- Task scheduler fault tolerance (self-explanatory)
- Task scheduler features (e.g. task priority, alerting, elastic demand capability, bursting to the cloud, monitoring, reporting)
- How failures in worker nodes are defined and handled (ideally I'd like a failure to be identified if a task took too long to complete, and the node to be removed from the cluster)
- Jobs to be scheduled according to available CPU, with limits that can be set on this
- Interoperability with .NET (I'd like to be able to automate any additional features required, e.g. auto-provision in the cloud when resources exceed some limit X)

I've had a brief look at:
 - http://www.univa.com/products/grid-engine.php (looks very good; I haven't been able to install it locally to test, so any experiences are appreciated)
 - http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-basic-edition/ (again, haven't looked at it closely)
 - http://www.argent.com/products/js.php (installed this locally; there is a .NET SDK, but it is massively out of date)

Basically I'm just looking for advice on how to tackle this problem, and whether to attempt to develop an in-house solution, combine several open-source solutions, or go fully commercial.

Any examples/pointers appreciated.

thanks

Question by:basil365

 
Expert Comment by:skullnobrains

From what I gather, you should try to keep it as simple as possible.

First of all, I'd start by stripping out as many requirements as possible:
- Failures in worker nodes: I guess you can accept that some nodes occasionally do not answer fast enough, and you want nodes to come back up by themselves, so giving low priority to slow nodes should be enough.
- Job scheduling based on available CPU: I guess the above covers this as well, and you can safely ignore the requirement.

Here is a very simple example of something that looks workable.

Nodes: each node runs a daemon that listens for queries on a TCP socket. A single TCP connection handles a single request at a time, so the client can perform a simple select() or equivalent call in order to grab the answers as they come. Pipelining may or may not be supported. Each node can start rejecting queries based on a maximum query count, RAM shortage, CPU shortage, or swap state, but this is probably not even needed to start with.
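
As a very rough illustration, a node daemon along those lines could look like this in C# (the port, the line-based protocol and the RunSimulation entry point are invented placeholders, a sketch rather than a definitive implementation):

    using System;
    using System.IO;
    using System.Net;
    using System.Net.Sockets;
    using System.Threading.Tasks;

    class NodeDaemon
    {
        static void Main()
        {
            var listener = new TcpListener(IPAddress.Any, 9000); // arbitrary port
            listener.Start();
            while (true)
            {
                var client = listener.AcceptTcpClient();
                Task.Run(() => Serve(client)); // one lightweight task per connection
            }
        }

        static void Serve(TcpClient client)
        {
            using (client)
            using (var reader = new StreamReader(client.GetStream()))
            using (var writer = new StreamWriter(client.GetStream()) { AutoFlush = true })
            {
                // a single connection handles a single request at a time:
                // read one job's parameters, answer it, wait for the next
                string line;
                while ((line = reader.ReadLine()) != null)
                    writer.WriteLine(RunSimulation(line));
            }
        }

        // placeholder for the actual < 0.5 s Monte Carlo job
        static string RunSimulation(string parameters) { return "result:" + parameters; }
    }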

Clients: each client reads the list of existing nodes from a simple backend (DNS, SQL, flat file...) with no extra information regarding node capacity. Each client sends a few more queries than needed (say 5% more can be deemed reasonable) and ignores the superfluous answers. Each client maintains a list of server speeds, ordered by the time each server last took to answer (no average, no history, no nothing). The recorded time is decremented periodically. Clients pick nodes at random from among the fastest servers. Dead servers are simply given a high waiting time so they won't be retried too soon. New servers start with a response time of 0.
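
And a matching sketch of the client-side bookkeeping (the pool size, decay step and dead-server penalty are arbitrary values for illustration). Each client would time every request, call Record on success, MarkDead on timeout, and run Decay on a timer:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class NodePicker
    {
        readonly Dictionary<string, double> lastMs = new Dictionary<string, double>();
        readonly Random rng = new Random();

        public NodePicker(IEnumerable<string> nodes)
        {
            foreach (var n in nodes) lastMs[n] = 0; // new servers start with a response time of 0
        }

        // pick randomly among the few fastest nodes
        public string Pick(int pool = 3)
        {
            var fastest = lastMs.OrderBy(kv => kv.Value).Take(pool).Select(kv => kv.Key).ToList();
            return fastest[rng.Next(fastest.Count)];
        }

        // remember only the time the server last took: no average, no history
        public void Record(string node, double elapsedMs) { lastMs[node] = elapsedMs; }

        // dead servers just get a high time so they are not retried too soon
        public void MarkDead(string node) { lastMs[node] = 60000; }

        // call periodically so penalised nodes drift back toward the front
        public void Decay(double stepMs = 100)
        {
            foreach (var n in lastMs.Keys.ToList())
                lastMs[n] = Math.Max(0, lastMs[n] - stepMs);
        }
    }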

If you have enough clients, the load will be spread quite evenly among the servers and prioritised according to the servers' actual response times (including external slowdown factors such as DNS latency), rather than by a complex mechanism built on centralised information about CPU or whatever.

I guess you can easily improve this by adding alerts (each client can send its own alerts; I see little to no reason for grouping that information), or more complex features such as the ability for a node to remove itself from the list for a while, or permanently, ...

I assume the client part would be trivial to handle in .NET, but I may be wrong here. The same applies to the nodes, though that part is probably not needed anyway.

If you need centralised management of the node list (which I think is a bad idea), you can use a NoSQL backend to store the list of nodes and their response times. Some (no)SQL backends will even handle the decay for you. If they do not, you can handle it by storing the current timestamp + response_time / 10 (an example for a 1/10th-of-a-second decay per second) and running a task that removes entries whose priority is below the current timestamp (it can run on any client/node, and if possible on several of them).
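
A small sketch of that timestamp trick (the decay factor follows the example above; exact units depend on how you store the timestamp, so treat the constants as placeholders):

    using System;

    static class DecayingPenalty
    {
        // store the current timestamp plus a lifetime proportional to the
        // response time (response_time / 10 in the example above)
        public static long Encode(double responseSeconds)
        {
            return DateTimeOffset.UtcNow.ToUnixTimeSeconds() + (long)(responseSeconds / 10.0);
        }

        // read back the decayed response time: it shrinks as the clock advances
        public static double Decode(long stored)
        {
            return Math.Max(0, stored - DateTimeOffset.UtcNow.ToUnixTimeSeconds()) * 10.0;
        }

        // any client or node can periodically run a sweep such as
        //   DELETE FROM nodes WHERE priority < <current unix timestamp>
        // to drop entries whose penalty has fully decayed
    }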

Hope some of the above helps.
I think this leads the way to very little coding, and quite a lot of power.
 

Author Comment by:basil365

Hi,

Thanks for your response - if I decide to develop fully in-house, some of the above suggestions will be very useful. At the moment I'm more evaluating the options, though, so I don't need to go into the very low-level specifics.

To date I've discovered that the usual mechanism for submitting a job on traditional schedulers is to pass a binary to a machine and invoke it with specified parameters. I've also noticed that most of the ready-made solutions are geared to run on UNIX and are not directly compatible with .NET (of course you can create wrappers for the APIs, etc.), which isn't ideal.

My company's architecture team recognises the need to buy more ready-made products instead of doing full in-house development, so if at all possible (it will ultimately be my decision) I want to give that approach the best chance. Any similar experiences?

Accepted Solution by:skullnobrains (earned 500 total points)

I may have misread the question: I understood you were expecting response times below 5 seconds overall, not just for each job, for something that runs over the long term.

In that case, I have personal experience with something similar: I used a database as a backend, and clients would just loop around the table looking for the next line with something to do (1 line = 1 job), take advisory locks on the SQL server so two nodes don't collide, and store the results directly in the same table. It is very likely that something similar can be devised for your needs, and it would be doable in .NET.
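
For what it's worth, here is a minimal sketch of that pattern in C#, assuming a PostgreSQL backend reached through Npgsql (the jobs table, its schema and the connection string are made up; pg_try_advisory_lock / pg_advisory_unlock are PostgreSQL's advisory-lock primitives):

    using System;
    using System.Threading;
    using Npgsql;

    class QueueWorker
    {
        static void Main()
        {
            using (var conn = new NpgsqlConnection("Host=db;Database=jobs;Username=worker"))
            {
                conn.Open();
                // loop around the table looking for the next line with something to do
                while (true)
                {
                    long? id = NextJob(conn);
                    if (id == null) { Thread.Sleep(200); continue; }

                    // store the result directly in the same table, then release the lock
                    using (var cmd = new NpgsqlCommand(
                        "UPDATE jobs SET result = @r WHERE id = @id; " +
                        "SELECT pg_advisory_unlock(@id)", conn))
                    {
                        cmd.Parameters.AddWithValue("r", RunSimulation(id.Value));
                        cmd.Parameters.AddWithValue("id", id.Value);
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }

        // the advisory lock is granted to one session at a time, so two nodes
        // never grab the same line (id assumed to be BIGINT)
        static long? NextJob(NpgsqlConnection conn)
        {
            using (var cmd = new NpgsqlCommand(
                "SELECT id FROM jobs WHERE result IS NULL " +
                "AND pg_try_advisory_lock(id) LIMIT 1", conn))
            {
                object o = cmd.ExecuteScalar();
                return o == null || o is DBNull ? (long?)null : Convert.ToInt64(o);
            }
        }

        // placeholder for the actual Monte Carlo run
        static string RunSimulation(long id) { return "result-for-" + id; }
    }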

I'm pretty positive that it is safer and more robust to put the smartness in the nodes rather than in a central scheduler when dealing with such tasks.

I ended up writing such a script because none of the schedulers I knew of at the time seemed to fit the bill (though I did not even consider many expensive ones), and testing even a couple of such packages (ones I had not heard bad echoes about) seemed more time-consuming than writing one. But if you have reasons to use third-party products, that is quite a different situation. I'm sorry, but I cannot point you to a good one, as all the not-too-expensive ones I worked with or heard of were either lame or would not fit the bill without much additional work.

"the mechanism of submitting a job on traditional schedulers is to pass a binary to a machine and invoke it with specified parameters"

From what I gather, this does not really fit your needs, because you would rather need to always execute the same job with different parameters.

Note that it is quite feasible to inject a script or executable and run it on Windows as well (PsExec, for example, does this quite neatly)... but again, I could not point you to one except by googling (I heard good things about OpCon but never used it, and I'm not sure it fits your needs).
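
For instance, something along these lines copies a local executable to a remote Windows node and runs it there (host and file names are invented; -c is PsExec's "copy the program to the remote system" switch):

    psexec \\worker01 -c montecarlo.exe params.dat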

---

Note that if you have to write a program that runs on a single machine and executes whatever treatment on a series of datasets, running it as a daemon and pulling datasets from files over the network or from a database does not amount to much more coding than the original program (10-20 additional lines), which may give you a good argument in favour of home-made.
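
For example, a single-machine program can be turned into such a daemon with a small polling loop along these lines (the shared folders and the Process body are placeholders; the rename serves as a crude claim so two nodes don't grab the same dataset):

    using System;
    using System.IO;
    using System.Threading;

    class DatasetDaemon
    {
        const string Inbox  = @"\\fileserver\mc\inbox";   // where datasets are dropped
        const string Outbox = @"\\fileserver\mc\outbox";  // where results are written

        static void Main()
        {
            while (true)
            {
                var files = Directory.GetFiles(Inbox, "*.dat");
                if (files.Length == 0) { Thread.Sleep(500); continue; }

                foreach (var f in files)
                {
                    // the node that wins the rename owns the dataset
                    var claimed = f + "." + Environment.MachineName;
                    try { File.Move(f, claimed); }
                    catch (IOException) { continue; } // another node got there first

                    var result = Process(File.ReadAllText(claimed)); // the original program body
                    File.WriteAllText(Path.Combine(Outbox, Path.GetFileName(f) + ".out"), result);
                    File.Delete(claimed);
                }
            }
        }

        // placeholder for the existing single-machine treatment
        static string Process(string dataset) { return "processed:" + dataset.Length; }
    }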

Author Comment by:basil365

Thanks for your response - good to get an opinion from your experience! At the moment I'm thinking there will be custom work to do on the worker nodes (the ideas above can be applied), with (hopefully) a paid-for solution for resource management (possibly similar to YARN or Mesos).

Expert Comment by:skullnobrains

From what I gather, I really would not do resource management in your case, as I hardly see any benefit at all for this kind of task, but you're the one to make the decision.

Maybe a more precise description of what you need to run would give either a better understanding as to why you need it (and possibly which kind), or better arguments against it.

---

Feel free to ask about any problem you experience along the way, and post about your experience, as I'm pretty sure your work will produce quite a lot of useful information for people with similar goals.

Author Closing Comment by:basil365

Ended up building our own solution in-house.