I'm investigating how to design and build an application that will process large numbers of Monte Carlo simulation jobs. The jobs will be of varying types, and throughput will vary with the time of day, week and year. The Monte Carlo applications will have an interface written in .NET (the code behind may be in C#, C++ or another language), and the actual interface implementation is still open (pending what I find). Each job typically takes less than 0.5 seconds, and the idea is to amalgamate results from multiple nodes into one complete set of results that is returned to the requesting user.
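As a rough illustration of the amalgamation step, here is a minimal sketch (in Python for brevity rather than .NET). It assumes each node returns only a sample count, a sum and a sum of squares for a single scalar estimate; the names `PartialResult`, `merge` and `summarize` are made up for this example and do not come from any particular scheduler or library.

```python
import math
from dataclasses import dataclass


@dataclass
class PartialResult:
    """Summary statistics one worker node sends back (hypothetical format)."""
    n: int           # number of simulated samples
    total: float     # sum of the sample values
    total_sq: float  # sum of the squared sample values


def merge(parts):
    """Combine per-node summaries into one summary for the whole job."""
    return PartialResult(
        n=sum(p.n for p in parts),
        total=sum(p.total for p in parts),
        total_sq=sum(p.total_sq for p in parts),
    )


def summarize(result):
    """Return (mean, standard error of the mean) for a merged summary."""
    mean = result.total / result.n
    # Unbiased sample variance computed from the running sums.
    variance = (result.total_sq - result.n * mean ** 2) / (result.n - 1)
    return mean, math.sqrt(variance / result.n)


# Two nodes, each of which simulated two samples: (1, 2) and (3, 4).
merged = merge([PartialResult(2, 3.0, 5.0), PartialResult(2, 7.0, 25.0)])
mean, stderr = summarize(merged)
```

Because only sums are exchanged, the merge is order-independent and cheap, which matters when individual jobs are this short. For very large sample counts a numerically stabler update (e.g. Welford-style merging) may be preferable to raw sums of squares.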
These are the items I am considering:
- Task scheduler fault tolerance (self-explanatory)
- Task scheduler features (e.g. task priority, alerting, elastic demand capability, bursting to the cloud, monitoring, reporting)
- How failures in worker nodes are defined and handled (ideally I'd like a failure to be flagged when a task takes too long to complete, with the node then removed from the cluster)
- Jobs to be scheduled according to available CPU, with configurable limits on CPU usage
- Interoperability with .NET (I'd like to be able to automate any additional features required, e.g. auto-provisioning in the cloud when resource usage exceeds some limit X)
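To make the "task took too long, so remove the node" requirement concrete, here is a minimal in-process sketch (again in Python for brevity). It is only an assumption about how such a policy could look: each "node" is simulated by a single-threaded executor, and `Cluster`, `run` and `simulate` are hypothetical names, not the API of any real scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class Cluster:
    """Toy cluster: one single-threaded executor stands in for each node."""

    def __init__(self, node_names, task_timeout_s):
        self.task_timeout_s = task_timeout_s
        self.nodes = {name: ThreadPoolExecutor(max_workers=1)
                      for name in node_names}

    def run(self, task, *args):
        """Try nodes in order; evict any node that misses the deadline."""
        for name in list(self.nodes):
            future = self.nodes[name].submit(task, name, *args)
            try:
                return name, future.result(timeout=self.task_timeout_s)
            except FutureTimeout:
                # Deadline exceeded: treat the node as failed, drop it from
                # the cluster, and retry the task on the next node.
                self.nodes.pop(name).shutdown(wait=False)
        raise RuntimeError("no healthy nodes left")


def simulate(node, n):
    """Stand-in for a Monte Carlo job; 'node-a' is artificially slow."""
    if node == "node-a":
        time.sleep(0.3)
    return n * 2


cluster = Cluster(["node-a", "node-b"], task_timeout_s=0.05)
winner, value = cluster.run(simulate, 21)
remaining = sorted(cluster.nodes)
```

In this sketch the slow node is evicted after a single missed deadline and the task is retried elsewhere. A real scheduler would more likely require several consecutive misses, or use heartbeats, before removing a node, and would also need to handle tasks that raise exceptions rather than time out.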
I've had a brief look at:
(Looks very good - haven't been able to install locally to test, so any experiences appreciated)
(Again haven't looked at it closely)
(Installed this locally; there is a .NET SDK, but it is massively out of date)
Basically I'm just looking for advice on how to tackle this problem, and on whether to attempt to develop an in-house solution, combine several open-source solutions, or go fully commercial.
Any examples/pointers appreciated.