Automatic retries of hooks

Wed Jan 20 17:06:33 UTC 2016

On Wed, Jan 20, 2016 at 2:42 PM, Dean Henrichsmeyer <dean at canonical.com>
wrote:

> Hi,
>
> It seems the original point James was making is getting missed. No one is
> arguing over the value of being able to retry and/or idempotent hooks.
> Yes, you should be able to retry them and yes nothing should break if you
> run them over and over.
>
> The point made is that Juju shouldn't be automatically retrying them. The
> argument of "no one knows what went wrong so Juju automatically retrying
> them is a better experience" doesn't work. The intelligence of the stack in
> question, regardless of what it is, goes in the charms. If you start
> conflating and mixing up where the intelligence goes then creating,
> running, and debugging those distributed systems will be a nightmare.
>

Hook errors *will* happen, and often for transient reasons. In handling
this, we can choose between "users retry without understanding the details"
and "juju retries without understanding the details" [0]. I'd be happy to
make the behaviour configurable, for the rare cases when the user *does*
understand the details and wants full and detailed control, but I don't
think that's the common case.

> The magic should only be in Juju's ability to effectively drive the models
> and intelligence encoded in the charms. It shouldn't make assumptions about
> what that intelligence is or what those models require.
>

Stopping on hook error can only *prevent* those charms from applying their
intelligence. No more hooks to be run => no more opportunity to react. If a
charm wants to be smart about errors, it needs to detect the errors it
*knows* about, react to those by setting status, and then move on
*without* failing the hook, thereby giving subsequent hooks an opportunity
to be smart.
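
A minimal sketch of that pattern, in Python. Every name here
(MissingConfigError, run_hook, set_status) is hypothetical and made up for
illustration; in a real charm set_status would presumably wrap the
status-set hook tool, but here it is just a callable passed in so the
sketch stays self-contained.

```python
class MissingConfigError(Exception):
    """Hypothetical example of a condition the charm knows how to explain."""


def run_hook(body, set_status):
    """Run a hook body and return the exit code the hook should use.

    body       -- callable doing the hook's real work (assumed)
    set_status -- callable(state, message) that reports status (assumed)
    """
    try:
        body()
    except MissingConfigError as err:
        # Known condition: report it in status and exit cleanly,
        # so that subsequent hooks still get a chance to run.
        set_status("blocked", str(err))
        return 0
    # Unknown exceptions are deliberately not caught: they propagate,
    # the hook fails, and whatever sits outside can decide to retry.
    set_status("active", "ready")
    return 0
```

The point of the shape is that only errors the charm recognises are turned
into status; everything else is left to fail loudly.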

Ultimately, it comes down to the fact that there's *always* another error
case you haven't considered. If you depend on the charmer to implement
retries for specific errors, that's essentially a whitelist, and they're
stuck playing whack-a-mole forever [1]. But if the charmer can depend on
external retries, they only have to worry about maintaining a
definitely-fatal blacklist and reporting those conditions in status.
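
To make the distinction concrete, here is a rough sketch of what an
external retrier with a definitely-fatal list could look like. This is not
how Juju implements anything; FatalHookError, run_with_retries, and the
backoff numbers are all assumptions invented for the example.

```python
import time


class FatalHookError(Exception):
    """Hypothetical marker for conditions where retrying cannot help."""


def run_with_retries(hook, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call hook() until it succeeds, retrying on any non-fatal error.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    sleep is injectable so the sketch can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return hook()
        except FatalHookError:
            # On the definitely-fatal list: give up immediately.
            raise
        except Exception:
            # Unknown, possibly transient: retry dumbly unless out of attempts.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The retrier needs no knowledge of what went wrong; the only thing anyone
has to enumerate is the short list of errors that must never be retried.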

Am I making any sense here?

Cheers
William

[0] or "the system stays broken forever", I suppose :).
[1] I imagine the rational approach there is to give up, and start
whitelisting by operation rather than by error; i.e. to accept that most
errors are unknown/transient and should be dumbly retried. And given that,
why should every charmer have to roll their own retries?