Machine agents uninstall themselves upon worker.ErrTerminateAgent.

Mon May 9 01:28:18 UTC 2016

On Sat, May 7, 2016 at 1:37 AM William Reade <william.reade at canonical.com>
wrote:

> On Fri, May 6, 2016 at 5:50 PM, Eric Snow <eric.snow at canonical.com> wrote:
>
>> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>
>
So I think this issue is fixed in 2.0, but looks like the changes never got
backported to 1.25. From your options, we do have (the opposite of) a
DO_NOT_UNINSTALL file (it's actually called
"/var/lib/juju/uninstall-agent"; only if it exists do we uninstall).

(And now that I think of it, we're only writing uninstall-agent for the
manual provider's bootstrap machine, and not other manual machines, so
we're currently leaving Juju bits behind on manual machines added to an
environment.)

Mainly, we just shouldn't do it. The only *actual reason* for us to do this
> is to support the manual provider; any other machine that happens to be
> dead will be cleaned up by the responsible provisioner in good time.
> There's a file we write to enable the suicidal behaviour when we enlist a
> manual machine, and we *shouldn't* have ever written it for any other
> reason.
>
> But then people started adding post-death cleanup logic to the agent, and
> the only way to trigger that was to write that file; so they started
> writing that file on normal shutdown paths, so that they could trigger the
> post-death logic in all cases, and I can only assume they decided that
> nuking the installation was acceptable precisely *because* the provisioner
> would be taking down the machine anyway. And note that because there *is* a
> responsible provisioner that will be taking the machine down, that shutdown
> logic might or might not happen, rendering it ultimately pretty unreliable
> [0] as well as spectacularly terrible in the lp:1514874 cases. /sigh
>
> So, narrowly, fixing this involves relocating the more-widely-useful
> (in-container loop device detach IIRC?) code inside MachineAgent.uninstall;
> and then just not ever writing the file that triggers actual uninstall,
> putting it back to where it was intended: a nasty but constrained hack to
> avoid polluting manual machines (which are only "borrowed") any more than
> necessary.
>
> (A bit more broadly: ErrTerminateAgent is kinda dumb, but not
> *particularly* harmful in practice [1] without the file of death backing it
> up. The various ways in which, e.g. api connection failure can trigger it
> are a bit enthusiastic, perhaps, but if *all it did was terminate the
> agent* it'd be fine: someone who, e.g. changed their agent.conf's password
> would invoke jujud and see it stop -- can't do anything with this config --
> but be perfectly able to work again if the conf were fixed. Don't know what
> the deal is with the correct-password-refused thing, though: that should be
> investigated further.)
>
> The tricky bit is relocating the cleanup code in uninstall -- I suspect
> the reason this whole sorry saga kicked off is because whoever added it was
> in a hurry and thought they didn't have a place to put this logic (inside
> the machiner, *before* setting Dead, would be closest to correct -- but
> that'd require us to synchronise with any other workers that might use the
> resources we want to clean up, and I suspect that would have been the
> roadblock).
>

The reason it's done at the last moment is to avoid having dangling
database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
remove systemd/upstart), then if the agent fails before we get to
EnsureDead, then the entity will never be removed from state.

As an alternative, we could (should) only ever write the
/var/lib/juju/uninstall-agent file from worker/machiner, first checking
there's no assigned units, and no storage attached.

Andrew, I think you had more detail last time we discussed this: is there
> anything else in uninstall (besides loop-device stuff) that needs to run
> *anywhere* except a manual machine? and, what will we actually need to sync
> with in the machiner? (or, do you have alternative ideas?)
>

No, I don't think there is anything else to be done in uninstall, apart
from loop detach and manual machine cleanup. I'm not sure about moving the
uninstall logic to the machiner, for reasons described above. We could
improve the current state of affairs, though, by only writing the
uninstall-agent file from the machiner

FWIW, the loop stuff can be dropped when the LXC container support is
removed. Nobody ever added support for loop in the LXD provider, and I
think we should implement support for it differently to how it was done for
LXC anyway (losetup on host, expose to container; as opposed to expose all
loop devices to all LXD containers and losetup in container).

Cheers,
Andrew

Cheers
> William
>
> [0] although with the watcher resolution of 5s, it's actually got a pretty
> good chance of running.
> [1] still utterly terrible in theory, it would be lovely to see explicit
> handover between contexts -- e.g. it is *not* the machiner's job to return
> ErrTerminateAgent, it is whatever's-responsible-for-the-machiner's job to
> handle a local ErrMachineDead and convert that as necessary for the context
> it's running in.
>
> PS Oh, uh, couple of comments on the below:
>
> 1. do nothing
>>     This would be easy :) but does not help with the pain point.
>> 2. be smarter about retrying (e.g. retry connecting to API) when
>> running into fatal errors
>>     This would probably be good to do but the effort might not pay for
>> itself.
>> 3. do not clean up (data dir, init service, or either)
>>     Leaving the init service installed really isn't an option because
>> we don't want
>>     the init service to try restarting the agent over and over.
>> Leaving the data dir
>>     in place isn't good because it will conflict with any new agent
>> dir the controller
>>     tries to put on the instance in the future.
>>
>
> we don't need to worry about the init service because we're set up not to
> be restarted if we return 0, which we should on ErrTerminateAgent.
>
>
>> 4. add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the
>> machine/model config or as a sentinel file on instances) and do not
>> clean up if it is set to true (default?)
>>     This would provide a reasonable quick fix for the above bug, even if
>>     temporary and even if it defaults to false.
>> 5. disable (instead of uninstall) the init services and move the data
>> dir to some obvious but out-of-the-way place (e.g.
>> /home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of
>> deleting it
>>     This is a reasonable longer term solution as the concerns
>> described for 3 are
>>     addressed and manual intervention becomes more feasible.
>> 6. same as 5 but also provide an "enable-juju" (or "revive-juju",
>> "repair-juju", etc.) command *on the machine* that would re-enable the
>> init services and restore (or rebuild) the agent's data dir
>>     This would make it even easier to manually recover.
>> 7. first try to automatically recover (from machine or controller)
>>     This would require a sizable effort for a problem that shouldn't
>> normally happen.
>> 8. do what the unit agent does
>>     I haven't looked closely enough to see if this is a good fit.
>>
>> I'd consider 4 with a default of false to be an acceptable quick fix.
>> Additionally, I'll advocate for 6 (or 5) as the most appropriate
>> solution in support of manual recovery from "fatal" errors.
>>
>
> I don't think we want any of this -- it's all a consequence of
> inappropriately triggering the uninstall stuff in the first place.
>
>
>>
>> -eric
>>
>>
>> * Could this lead to collisions if the instance is re-purposed as a
>> different machine?  I suppose it could also expose sensitive data when
>> likewise re-purposed, since it won't necessarily be in the same model
>> or controller.  However, given the need for admin access that probably
>> isn't a likely problem.
>>
>
> If substrates are leaking data between separate instances, the substrate
> is screwed up, and there's not much we can do about it. The manual provider
> has no such guarantees, and that's the only justification for trying to
> clean up so drastically in the first place.
>
>
>>
>> --
>> Juju-dev mailing list
>> Juju-dev at lists.ubuntu.com
>> Modify settings or unsubscribe at:
>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20160509/f7d79722/attachment.html>