Coordinating actions in a service

Tue May 12 08:55:39 UTC 2015

On 8 May 2015 at 20:35, Gustavo Niemeyer <gustavo at niemeyer.net> wrote:

> On Fri, May 8, 2015 at 10:24 AM, John Weldon <johnweldon4 at gmail.com> wrote:
>>
>> Hi Stuart;
>>
>> I think this is addressed in the proposed work for Actions 2.0
>>
>> In the current model you'd have to manage all of this yourself.  Actions
>> can only be targeted to specific units in the current implementation, so
>> you'd have to manage the distribution outside of actions (or else, as you
>> suggest, some sort of generalised semaphore service, and a way to find all
>> units of a service and queue up actions for all of them and have the actions
>> individually manage themselves whether to run or not)
>>
>> The plan is to allow actions to be targeted at 1) specific units in a
>> service, 2) leaders only, 3) all units in a service, or 4) a subset of units
>> in a service.  This would still be a little tricky for your use case, but
>> you could at least manage all the logic in an action targeted at only the
>> leader for example.
>
>
> None of these seem to fix the issue, though. The requirement is to execute
> on all units, but not at the same time. Also, the leader has no way to
> communicate with peer units without the hook terminating, right? There
> should be a way to postpone the result of an action to a moment past the end
> of the hook, but I actually think the proper way to fix this is to avoid the
> avalanche in the first place, by default: when dispatching an action to all
> units of a service, roll out in a sane way rather than doing all at once.

If I can run an action on the leader, and if the leader can run
actions on its peers, then I have a mechanism to do all sorts of weird
coordination (the leader kicks off actions on peers as it deems
appropriate, collates the results and returns them). I do agree that a
good implementation of service-level actions would handle most use
cases without this level of complexity. Two options (one unit at a
time, and all units simultaneously) meet most needs. N units at a
time or N% of units at a time is more esoteric.
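
To make the rollout options concrete, here is a rough sketch of how a
dispatcher (the leader, or Juju itself) might split units into batches
and collate the results. Everything in it is an assumption for
illustration: batches, batch_size_for_percent and rolling_run are
made-up names, and run_action stands in for whatever mechanism ends up
running an action on a single unit. It is not an existing Juju API.

```python
import math
from typing import Callable, Dict, Iterator, List, Sequence


def batches(units: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` units."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(units), size):
        yield list(units[start:start + size])


def batch_size_for_percent(unit_count: int, percent: float) -> int:
    """Translate 'N% of units at a time' into a batch size of at least 1."""
    return max(1, math.ceil(unit_count * percent / 100))


def rolling_run(
    units: Sequence[str],
    run_action: Callable[[str], str],  # hypothetical per-unit runner
    size: int,
) -> Dict[str, str]:
    """Run the action batch by batch and collate the per-unit results.

    A real dispatcher would run the units inside a batch concurrently and
    wait for the batch to finish; this sketch runs them one after another
    to keep the example small.
    """
    results: Dict[str, str] = {}
    for batch in batches(units, size):
        for unit in batch:
            results[unit] = run_action(unit)
    return results
```

With size=1 this is "one unit at a time", with size=len(units) it is
"all units simultaneously", and batch_size_for_percent covers the N%
case.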

The leader being able to run actions on other units is a good thing
(and I think planned?). I'd like my charm to schedule a weekly repair
job on the cluster. At the moment, each unit has a cron job (spread
out throughout the week) that runs the repair on the individual node.
It would be much nicer to have a cron job that does 'if is-leader, then
run-action repair-cluster'.
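
Roughly, the script that cron calls on every unit could look like the
sketch below. The two callables are placeholders I am assuming for
illustration: is_leader would wrap the leadership check and run_action
would wrap a tool for kicking off an action from a unit, which as far
as I know does not exist yet. weekly_repair is a made-up name too.

```python
from typing import Callable


def weekly_repair(
    is_leader: Callable[[], bool],       # hypothetical leadership check
    run_action: Callable[[str], None],   # hypothetical action launcher
    action: str = "repair-cluster",
) -> bool:
    """Start the cluster repair only if this unit is the leader.

    Returns True when the action was started, False when this unit is
    not the leader and so did nothing.
    """
    if not is_leader():
        return False
    run_action(action)
    return True
```

Every unit keeps the same crontab entry, but only the leader does
anything, so the repair is started once per week for the whole cluster
rather than once per node.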

-- 
Stuart Bishop <stuart.bishop at canonical.com>