Automatic retries of hooks
John Meinel
john at arbash-meinel.com
Wed Jan 20 06:17:03 UTC 2016
There are classes of failures that a charm hook itself cannot handle. The
specific one Bogdan was working on is the machine itself being restarted
while the charm is in the middle of processing a hook. There isn't any way
the hook itself can handle that, unless it could raise a very specific error
that indicates it should be retried (so as it notices it's about to die, it
raises the try-me-again error).

Hooks are supposed to be idempotent regardless, aren't they? So while
retrying does paper over transient bugs in them, doesn't it make the system
more resilient overall?
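
To make "idempotent" concrete, here is a minimal sketch of a hook step that is safe to re-run after an interrupted or failed attempt. The helper name ensure_file is hypothetical; it is not part of Juju or any charm library, just an illustration of "check the current state, then act only if needed":

```python
import os
import tempfile


def ensure_file(path, content):
    """Hypothetical helper: make `path` contain `content`.

    Returns True if the file was (re)written, False if it was already
    in the desired state. Running it twice has the same end result as
    running it once, which is what makes a retry harmless.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False  # nothing to do on a retry
    except FileNotFoundError:
        pass  # first run, or an earlier run died before finishing

    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        # Swap the finished file into place in one step, so a run that is
        # interrupted earlier leaves the previous file untouched.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

A hook built from steps like this can be retried after a reboot without special handling. A hook that appends to a file or blindly re-creates a resource cannot, and that is where automatic retries would hide a real problem.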
John
=:->
On Tue, Jan 19, 2016 at 6:14 PM, James Page <james.page at ubuntu.com> wrote:
> Hi Bogdan
>
> On Thu, Nov 26, 2015 at 1:29 PM, Bogdan Teleaga <
> bteleaga at cloudbasesolutions.com> wrote:
>
>> This has been a WIP for a while now so maybe some of you have heard
>> about it.
>>
>> It all started out with us needing to have hooks retried after a random
>> reboot, and it evolved into retrying hooks upon any kind of failure.
>>
>> So as of now, failing hooks will be retried automatically after a
>> certain time. The minimum wait time will be 20 seconds and the
>> maximum 20 minutes, and the wait will increase by a factor of
>> 2 for every failure. A small jitter is also introduced for a bit of
>> randomness. Using juju resolved will override this timer and cause it
>> to restart at the beginning.
>>
>> I've tested it for a while and it has proven to be relatively robust
>> in my tests. Probably having a CI test soonish would be recommended.
>>
>> The waiting amount has been chosen relatively arbitrarily, so if anyone
>> has comments or ideas for that, I'm open to suggestions. The
>> discussion for that should go
>> here (https://github.com/juju/juju/pull/3835), since apparently I
>> merged the branch with some values I used in testing and did not
>> change them back to the intended ones.
>>
>
> In the daily deluge of email I managed to miss your post to the list, and
> stumbled upon this feature whilst exercising 1.26 alpha3 with some
> development work this week, and assumed it was a bug:
>
> https://bugs.launchpad.net/juju-core/+bug/1535711
>
> I think this is a dangerous behaviour to introduce to Juju; a hook error
> should be a signal to an end user that something really bad happened, and
> that they need to dig in further (preferably with pointers from status
> messages); if the function that a hook is performing is re-tryable, that
> needs to be handled in the charm and not by Juju IMHO.
>
> Specifically I was testing some changes to the odl-controller charm; this
> feature covered up a race in the charm hook code accessing the API of ODL,
> which I failed to notice the first few times I deployed (not paying
> attention due to multi-tasking), and then had me scratching my head as to
> what was going on when I started to notice the hook failure.
>
> Cheers
>
> James
>
>
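
For reference, the retry schedule described in Bogdan's quoted message (20 second minimum, 20 minute maximum, doubling on every failure, small jitter, reset on juju resolved) can be sketched roughly as below. This is not the Juju implementation: the names retry_delay and RetryTimer are hypothetical, and the 10% jitter is an assumption, since the message only says "a small jitter". The message also notes that the merged values differed from the intended ones, so the real numbers may not match.

```python
import random

MIN_WAIT = 20.0         # seconds (from the quoted message)
MAX_WAIT = 20 * 60.0    # 20 minutes (from the quoted message)
FACTOR = 2.0            # growth per failure (from the quoted message)
JITTER_FRACTION = 0.1   # assumption: "a small jitter" is not quantified


def retry_delay(failures, rng=random):
    """Seconds to wait after `failures` consecutive hook failures (>= 1)."""
    if failures < 1:
        raise ValueError("failures must be >= 1")
    # Cap the exponent so very large failure counts cannot overflow a float.
    exponent = min(failures - 1, 32)
    base = min(MAX_WAIT, MIN_WAIT * FACTOR ** exponent)
    jitter = base * JITTER_FRACTION * rng.uniform(-1.0, 1.0)
    # Keep the jittered value inside the stated minimum and maximum.
    return min(MAX_WAIT, max(MIN_WAIT, base + jitter))


class RetryTimer:
    """Tracks consecutive failures for one failing hook (hypothetical)."""

    def __init__(self, rng=random):
        self.failures = 0
        self._rng = rng

    def next_delay(self):
        """Record one more failure and return the wait before the retry."""
        self.failures += 1
        return retry_delay(self.failures, self._rng)

    def reset(self):
        """Model of "juju resolved": the schedule starts over."""
        self.failures = 0
```

Ignoring jitter, the nominal waits under these assumptions are 20, 40, 80, 160, 320 and 640 seconds, after which every retry waits the 1200 second maximum. A hook that keeps failing is therefore retried indefinitely at 20 minute intervals, which is the behaviour James objects to above.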