Guarantee order of relation-joined & relation-changed?

Fri Jul 5 11:43:54 UTC 2013

On Thu, Jul 4, 2013 at 8:28 PM, Gustavo Niemeyer <gustavo at niemeyer.net> wrote:
> On Thu, Jul 4, 2013 at 8:59 AM, Stuart Bishop
> <stuart.bishop at canonical.com> wrote:
> (...)
>> Is this problem valid, or is there something already preventing this race?
>
> There's no race in the traditional sense of the term. Whatever the
> ordering, the system communicates any changes to remote units so they
> get notified of the new state.

Right. In the case I described, the remote units will eventually get
correct data. With luck, they haven't entered a failure state during
the window by trying to use the incorrect data.

The issue is that charm authors need to realize when their designs
have races, and then either redesign the charm to avoid them or
implement a protocol that works around them. The submitted fix to the
PostgreSQL charm implements such a protocol: a client unit should not
attempt to connect until it appears in the list of allowed units.
This is simple enough and will probably become a common pattern once
automated testing becomes commonplace and these ordering issues are
tripped over more often.
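
As a rough sketch of that protocol from the client side: this assumes
the server publishes a space-separated list of unit names under a
relation key, called allowed-units here. The key name, the function
names and the plain dict standing in for relation-get output are all
assumptions for illustration, not code taken from the charm.

```python
def is_allowed(unit_name, relation_data, key="allowed-units"):
    """Return True if unit_name appears in the server's allowed list.

    relation_data is a plain dict standing in for what a hook would
    read with relation-get; the key name is an assumption.
    """
    allowed = relation_data.get(key) or ""
    # Split on whitespace so "client/1" does not match "client/10".
    return unit_name in allowed.split()


def client_hook(unit_name, relation_data):
    """Decide what a client relation-changed hook should do."""
    if not is_allowed(unit_name, relation_data):
        # Not authorized yet. Exit cleanly; relation-changed fires
        # again when the server updates its relation settings.
        return "wait"
    return "connect"
```

The point of returning "wait" rather than failing is that the hook
stays idempotent: it can run in any order relative to the server's
hooks and still end up in the right place.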

> Andreas is also right that different situations will have a different
> orderings that are most convenient.

Yes, I wonder if there should be some way for a charm to define
priority. Or a way for hooks to block until particular conditions are
met. Or a way for a hook to fail such that it is automatically
retried later. Or a way for a hook to trigger an arbitrary hook on a
particular unit in a particular relation.

For example, I just noticed another race in the PostgreSQL charm. You
set up a PostgreSQL service with multiple units, where one is the
master and the others are hot standbys. Removing the master unit
causes a failover. All this works fine on my proposed branch. However,
if a new client relation is added while the failover is in progress,
there is a reasonable chance that it will never be set up correctly:
only the master unit can create the new user and database, and during
the election period there is no master. It is an obscure case, and I
suspect most charm developers will not bother to fix it and will
instead just say "don't do that" rather than increase the complexity
of the charm further.

> We talked about ordering before in a different context, for example:
>
> https://lists.ubuntu.com/archives/juju/2012-February/001259.html

You talk about relation-set and relation-get needing to be run outside
of hooks. However, I don't think that is a requirement for me. It
might help with long-running operations (e.g. adding a new unit to the
PostgreSQL service may require transferring terabytes of data and will
take some time, so being able to do that out of band may be helpful).

The existing abilities of relation-set can be used to solve the
problem. If I or someone else chooses to fix the problem rather than
just sweep it under the carpet, I think we end up with *all* the charm
logic in two hooks: install, and a mega hook that is invoked for
everything else. This mega hook iterates over all the relations and
all the units collecting all the state, calculates the state the unit
should be in, and resets it. If this actually changed anything,
relation-changed hooks get invoked on all the relevant units, and the
process continues until a steady state is reached.
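
A toy sketch of what such a mega hook might compute: the data shapes,
the function names and the allowed-units key are hypothetical, and
plain dicts stand in for what a real hook would gather with
relation-ids, relation-list and relation-get.

```python
def desired_state(relations):
    """Compute what this unit should publish on each relation.

    relations maps a relation id to {remote unit name: settings}.
    Here the only derived setting is the list of units on the relation.
    """
    return {
        rid: {"allowed-units": " ".join(sorted(units))}
        for rid, units in relations.items()
    }


def reset_universe(relations, published):
    """Return the settings that differ from what is already published.

    published maps a relation id to the settings this unit last set.
    An empty result means a steady state: nothing to relation-set, so
    no further relation-changed hooks are triggered.
    """
    changes = {}
    for rid, wanted in desired_state(relations).items():
        current = published.get(rid, {})
        delta = {k: v for k, v in wanted.items() if current.get(k) != v}
        if delta:
            changes[rid] = delta
    return changes
```

Running it twice on the same input shows the convergence: the first
call reports the settings to publish, and once those are published the
second call reports nothing. Every hook other than install would call
the same function, whatever event triggered it.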

It seems that as a charm grows in complexity, it needs to devolve into
this model. It may mean that juju could be simplified considerably:
by the time your charm is complex enough to worry about the difference
between relation-joined and relation-changed and relation-broken and
relation-departed and config-changed and upgrade-charm, you may well
have reached the point where it is easier and more reliable to treat
them all as just reset-universe.

-- 
Stuart Bishop <stuart.bishop at canonical.com>