Packaging, recovery, and random fault injection
Gustavo Niemeyer
gustavo.niemeyer at canonical.com
Mon Apr 25 20:00:44 UTC 2011
> Longer term, it would be nice to get some additional help from the
> server team on getting these core ensemble dependencies packaged
> nicely.
Indeed. IIRC packaging an updated ZooKeeper was already pretty high
on the list of things, so let's talk to see if we can get that going
sooner rather than later.
> I'm proposing two separate tracks then.
>
> - Rebuild the ensemble ec2 images, to include working versions of
> zookeeper. In future getting current upstream versions of zookeeper
> into oneiric.
+1, let's talk about that.
> - Increase ensemble's fault tolerance of agents that die. Machine
> agents monitoring unit agents, and provisioning agents monitoring
> machine agents.
Can't we have a dumb watchdog restarting the process in case it crashes,
rather than making the agents more complex?
Either way, having the provisioning agent fiddling with the machine
agent process certainly sounds a bit awkward, since they may be on
separate machines.
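
To make the "dumb watchdog" idea concrete, here is a minimal sketch. It is
only an illustration, not Ensemble code: the function name `watchdog` and
its `max_restarts`, `delay`, and `sleep` parameters are made up for this
example, and it assumes a non-zero exit status means the agent crashed.

```python
import subprocess
import sys
import time


def watchdog(argv, max_restarts=5, delay=1.0, sleep=time.sleep):
    """Run argv and restart it whenever it exits with a non-zero status.

    Hypothetical helper: returns the number of restarts that were needed
    before the process exited cleanly, and gives up after max_restarts.
    """
    restarts = 0
    while True:
        code = subprocess.call(argv)  # blocks until the child exits
        if code == 0:
            return restarts  # clean exit, nothing left to supervise
        if restarts >= max_restarts:
            raise RuntimeError("giving up after %d restarts" % restarts)
        restarts += 1
        sleep(delay)  # brief pause so a crash loop does not spin the CPU


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python watchdog.py <agent command> [args...]
    watchdog(sys.argv[1:])
```

The point of keeping it this simple is that the supervisor runs on the same
machine as the process it restarts and knows nothing about what the agent
does, so the agents themselves stay unchanged.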
> Additionally unit agents maintaining on disk
> queues of pending hook executions that they can recover from on
> startup.
Sounds good.
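
For reference, one way such an on-disk queue could look is sketched below.
This is an assumption-laden illustration rather than a proposed design: the
`HookQueue` class, the JSON-lines journal format, and the file layout are
all hypothetical, and hooks are identified by plain strings.

```python
import json
import os


class HookQueue:
    """Journal of pending hook executions that survives agent restarts.

    Hypothetical format: one JSON record per line, ["add", hook] when a
    hook is queued and ["done", hook] once it has been executed.
    """

    def __init__(self, path):
        self.path = path
        self.pending = self._recover()
        self._compact()  # drop finished entries and any torn final line

    def _recover(self):
        pending = []
        if not os.path.exists(self.path):
            return pending
        with open(self.path) as f:
            for line in f:
                try:
                    op, hook = json.loads(line)
                except ValueError:
                    break  # incomplete record left by a crash mid-write
                if op == "add":
                    pending.append(hook)
                elif op == "done" and hook in pending:
                    pending.remove(hook)
        return pending

    def _compact(self):
        # Rewrite the journal with only the pending hooks, atomically.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for hook in self.pending:
                f.write(json.dumps(["add", hook]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def _log(self, op, hook):
        with open(self.path, "a") as f:
            f.write(json.dumps([op, hook]) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the record reaches the disk

    def add(self, hook):
        self._log("add", hook)
        self.pending.append(hook)

    def done(self, hook):
        self._log("done", hook)
        self.pending.remove(hook)
```

On startup the unit agent would construct the queue from the same path and
find in `pending` whatever hooks were queued but not yet marked done before
it died.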
--
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter