relation-get - missing_unit behavior

Tue Jun 25 16:30:38 UTC 2013

On Mon, Jun 24, 2013 at 10:50 AM, Stuart Bishop <stuart.bishop at canonical.com> wrote:

> I do think it is an edge case. I don't know if gaining consistency
> here is worth making bugs harder to find when bogus values are passed
> to relation-get.
>

I think it's a bit worse than that. As far as I can tell, the use case in
the bug is going to be more than somewhat racy regardless, because it's
perfectly possible -- even likely -- that u/0 and u/1 would read each
other's initial relation data at the same time; and each conclude that the
other is empty, and each set up their own auth. Without handling for this
situation, problems will surely arise.

Here's an attempted generalisation of your position:

I have a charm named "mydb" with relations "server" and "peer". I want to
be able to deploy a "db" service using that charm, and relate multiple
client services to the "db:server" endpoint. Each unit of "db" should
present the same credentials for a given client service, but different
client services must not share credentials despite sharing the endpoint.

This can be handled, I think, with 3 hooks, and with no need to read peer
settings in non-peer relations [0]. In short:

peer-relation-changed (propagate most-authoritative data to local unit)
server-relation-joined (search for existing data, write speculatively if
not found)
server-relation-broken (clean up irrelevant data)

The algorithm depends somewhat on the ability to order units of db
*numerically*, not alphabetically. The steady state is the same [1] either
way, but it will converge an awful lot faster if we sort them properly.
Given that, assume the following definitions:

A unit is considered "newer" than another if the numeric part of its id is
greater than the other's.
A unit's "OKP" is its Oldest Known Peer, i.e. the first element of the
numerically sorted output of "relation-list `relation-ids peer`". The OKP
can be different for different units at different times.
A unit's "known relations" are those for which at least one remote unit has
entered scope, and no -broken hook has yet run.
A unit's "LRS" are its Local Relation Settings -- the things it's allowed
to change. A unit has one LRS for each relation it's in.
An "auth key" is a string, specific to a relation, of the form
auth:server:$RELATION_ID, that's used to store the credentials in the
various units' peer LRS.
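
To make the definitions above concrete, here is a rough Python sketch of the
ordering, OKP and auth key helpers. The function names are hypothetical, unit
ids are assumed to look like "db/10", and the relation id is treated as an
opaque string; none of this is taken from an existing charm.

```python
def unit_number(unit_id):
    """Return the numeric part of a unit id such as 'db/10' (assumed format)."""
    return int(unit_id.rsplit("/", 1)[1])


def sort_units(unit_ids):
    """Sort unit ids numerically, so 'db/2' comes before 'db/10'."""
    return sorted(unit_ids, key=unit_number)


def is_newer(unit_a, unit_b):
    """A unit is newer if the numeric part of its id is greater."""
    return unit_number(unit_a) > unit_number(unit_b)


def oldest_known_peer(peer_units):
    """Return the OKP: the first unit in numeric order, or None if no peers."""
    ordered = sort_units(peer_units)
    return ordered[0] if ordered else None


def auth_key(relation_id):
    """Build the peer-settings key that stores a relation's credentials."""
    return "auth:server:" + relation_id
```

With units db/0, db/2 and db/10, sort_units gives db/0, db/2, db/10, whereas a
plain alphabetical sort would put db/10 ahead of db/2.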

...and implement something like this:

peer-relation-changed:

  * if the remote peer unit is newer than the OKP, exit
  * copy every auth key from the OKP's peer relation settings into the peer
LRS
  * for each of the known relations, copy the value for that relation's
auth key into that relation's LRS
  * reconfigure/restart/whatever the software on the local unit to accept
the new credentials, and exit
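
The peer-relation-changed steps above might look roughly like this in Python.
This is an untested sketch that models relation settings as plain dicts rather
than calling the hook tools; the function and parameter names, and the
"credentials" key used in each relation's LRS, are assumptions made for
illustration.

```python
AUTH_PREFIX = "auth:server:"
CRED_KEY = "credentials"  # hypothetical key name in a server relation's LRS


def unit_number(unit_id):
    """Return the numeric part of a unit id such as 'db/10' (assumed format)."""
    return int(unit_id.rsplit("/", 1)[1])


def peer_relation_changed(remote_unit, peer_settings, peer_lrs, known_lrs,
                          reconfigure):
    """Propagate the most authoritative credentials to the local unit.

    remote_unit   -- id of the peer whose settings changed
    peer_settings -- dict of peer unit id -> that unit's peer settings
    peer_lrs      -- the local unit's own peer settings (mutated in place)
    known_lrs     -- dict of known relation id -> local settings for it
    reconfigure   -- callable that makes the software accept new credentials
    Returns True if anything was copied, False if the hook exited early.
    """
    if not peer_settings:
        return False
    okp = min(peer_settings, key=unit_number)
    # A change on a unit newer than the OKP is not authoritative; ignore it.
    if unit_number(remote_unit) > unit_number(okp):
        return False
    # Copy every auth key from the OKP's peer settings into the peer LRS.
    for key, value in peer_settings[okp].items():
        if key.startswith(AUTH_PREFIX):
            peer_lrs[key] = value
    # Push each known relation's credentials into that relation's LRS.
    for relation_id, lrs in known_lrs.items():
        key = AUTH_PREFIX + relation_id
        if key in peer_lrs:
            lrs[CRED_KEY] = peer_lrs[key]
    reconfigure()
    return True
```

In a real hook the dicts would be filled from relation-get and written back
with relation-set; keeping the logic on plain data makes it easy to exercise
outside a deployed environment.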

server-relation-joined:

  * if the relation's LRS already contain credentials, exit
  * if the peer LRS has an auth key for the relation already, copy into
relation LRS, reconfigure, exit
  * if the local unit is newer than its OKP, and its OKP has the right auth
key, copy into peer LRS and relation LRS, reconfigure, exit
  * generate credentials, store in relation LRS and peer LRS, reconfigure,
exit
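
The four server-relation-joined cases above could be sketched as follows.
Again this is untested and works on plain dicts; the names, the "credentials"
key, the returned labels and the use of Python's secrets module to generate
credentials are all assumptions for illustration.

```python
import secrets

AUTH_PREFIX = "auth:server:"
CRED_KEY = "credentials"  # hypothetical key name in a server relation's LRS


def unit_number(unit_id):
    """Return the numeric part of a unit id such as 'db/10' (assumed format)."""
    return int(unit_id.rsplit("/", 1)[1])


def server_relation_joined(local_unit, relation_id, relation_lrs, peer_lrs,
                           peer_settings, reconfigure, generate=None):
    """Find existing credentials for a relation, or write them speculatively.

    relation_lrs  -- local settings for this server relation (mutated)
    peer_lrs      -- the local unit's own peer settings (mutated)
    peer_settings -- dict of peer unit id -> that unit's peer settings
    Returns a label naming the branch taken.
    """
    key = AUTH_PREFIX + relation_id
    # 1. Credentials already published on this relation: nothing to do.
    if CRED_KEY in relation_lrs:
        return "unchanged"
    # 2. The peer LRS already holds them: publish on the relation.
    if key in peer_lrs:
        relation_lrs[CRED_KEY] = peer_lrs[key]
        reconfigure()
        return "from-peer-lrs"
    # 3. An older OKP holds them: adopt its value in both places.
    if peer_settings:
        okp = min(peer_settings, key=unit_number)
        if (unit_number(local_unit) > unit_number(okp)
                and key in peer_settings[okp]):
            peer_lrs[key] = peer_settings[okp][key]
            relation_lrs[CRED_KEY] = peer_settings[okp][key]
            reconfigure()
            return "from-okp"
    # 4. Nobody has them yet: generate, and let the peer relation converge.
    creds = generate() if generate else secrets.token_urlsafe(16)
    peer_lrs[key] = creds
    relation_lrs[CRED_KEY] = creds
    reconfigure()
    return "generated"
```

The last branch is the speculative write: if two units both reach it at once,
the peer-relation-changed hook is what later brings them into agreement.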

server-relation-broken:

  * if the peer LRS contains the relation's auth key, delete it
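
The cleanup step above is small enough to sketch in a few lines. As before,
the peer LRS is modelled as a plain dict and the function name is
hypothetical; with the real hook tools I'd expect the deletion to be a
relation-set of the key to an empty value, but treat that as an assumption.

```python
def server_relation_broken(relation_id, peer_lrs):
    """Drop the credentials for a relation that no longer exists.

    peer_lrs -- the local unit's own peer settings (mutated in place)
    Returns True if an auth key was removed, False if there was none.
    """
    key = "auth:server:" + relation_id
    if key in peer_lrs:
        del peer_lrs[key]
        return True
    return False
```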

...and that's it. I think that this:

  * minimizes the frequency of credential changes.
  * handles any order of units joining -- if there are no peer settings,
the server-relation-joined just hands the job of setting them over to the
peer relation, which is the thing that's fundamentally responsible for
managing agreement across the cluster anyway.
  * doesn't abuse the access to peers in non-peer relations.
  * deals gracefully with delayed and missing most-authoritative-units --
the peers that know about each other work perfectly happily together, and
if a more authoritative one shows up they all see it quickly; but if
there's a broken more-authoritative-unit hanging around somewhere and not
actually joining the relation it doesn't affect the active units at all.

Caveat: I haven't actually implemented this myself. It is, however, the
sort of thing I'd vaguely expect to see a generic implementation of
slipping into charm-tools, if such a thing hasn't already happened. I'd
imagine that you are not the first person to encounter this problem, but
then I can't actually name another example myself...

I'll be updating a bug with a precis of the above, and a more detailed
explanation of why I think the original approach misuses the tools.

Cheers
William

[0] that capability is, I think, probably a bug, on the basis that it
encourages unhelpful practices and doesn't enable any new behaviour I can
think of
[1] for a usefully fuzzy definition of "same", anyway