Machine agents uninstall themselves upon worker.ErrTerminateAgent.

Mon May 9 08:38:24 UTC 2016

On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkins <andrew.wilkins at canonical.com
> wrote:

> The reason it's done at the last moment is to avoid having dangling
>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
>> remove systemd/upstart), then if the agent fails before we get to
>> EnsureDead, then the entity will never be removed from state.
>>
>
> The *only* thing that should happen after setting dead is the uninstall --
> anything else that's required to happen before cleanup *must* happen before
> setting dead, which *means* "all my responsibilities are 100% fulfilled".
>

> I don't think I suggested above that we should do anything else other
than uninstall?

Sorry I misunderstood -- I was focused on the loop-device stuff that had
grown in the uninstall logic.

> The *only* justification for the post-death logic in the manual case is
>> because there's no responsible provisioner component to hand over to -- and
>> frankly I wish we'd just written that to SSH in and clean up, instead of
>> taking on this ongoing hassle.
>>
>
>>
> As an alternative, we could (should) only ever write the
>>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
>>> there's no assigned units, and no storage attached.
>>>
>>
>> Why would we *ever* want to write it at runtime? We know if it's a manual
>> machine at provisioning time, so we can write the File Of Death OAOO. All
>> the other mucking about with it is the source of these (serious!) bugs.
>>
>
> The point is not to distinguish between manual vs. non-manual. Yes, we can
> write something that records that fact OAOO.
>
> The point of "write from the machiner" was to signal that the machine is
> actually dead, and removed from state, vs. "my credentials are invalid,
> better shut down now".
>

Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on
mere invalid-credentials. (No worker should return ErrTerminateAgent *at
all*, really -- not their job. Having a couple of different workers that
can return ErrAgentEntityDead, which can then be interpreted by the next
layer? Fine :).)

> So we can write a file to confine uninstall to manual machines -- that
> much is easy, I don't think anyone will disagree with doing that. But we
> should not ignore the bug that prompted this thread, even if it's confined
> to manual machines.
>

Right; and, yeah, it's almost certainly better to leak manual machines than
it is to uninstall them accidentally -- so long as we know that's the
tradeoff we're making ;).

> Andrew, I think you had more detail last time we discussed this: is there
>>>> anything else in uninstall (besides loop-device stuff) that needs to run
>>>> *anywhere* except a manual machine? and, what will we actually need to sync
>>>> with in the machiner? (or, do you have alternative ideas?)
>>>>
>>>
>>> No, I don't think there is anything else to be done in uninstall, apart
>>> from loop detach and manual machine cleanup. I'm not sure about moving the
>>> uninstall logic to the machiner, for reasons described above. We could
>>> improve the current state of affairs, though, by only writing the
>>> uninstall-agent file from the machiner
>>>
>>
>> Strong -1 on moving uninstall logic: if it has to happen (which it does,
>> in *rare* cases that are *always* detectable pre-provisioning), uninstall
>> is where it should happen, post-machine-death; and also strong -1 on
>> writing uninstall-agent in *any* circumstances except manual machine
>> provisioning, we have had *way* too many problems with this "clever"
>> feature being invoked when it shouldn't be.
>>
>
> I don't want to belabour the point, but just to be clear: the
> uninstall-agent file exists to record the fact that he machine is in fact
> Dead, and uninstall should go ahead. That logic was put in specifically to
> prevent the referenced bug. We can and should improve it to only do this
> for manual machines.
>

Meta: a name like "uninstall-agent" is really misleading if it actually
means "machine-X is definitely dead". I never had the slightest indication
that was what it was meant to mean. But in that case, why *don't* we write
the file on certain API connection failures? There is logic that
*explicitly* checks if the machine is Dead -- surely that's a trigger too?

FWIW, the loop stuff can be dropped when the LXC container support is
>>> removed. Nobody ever added support for loop in the LXD provider, and I
>>> think we should implement support for it differently to how it was done for
>>> LXC anyway (losetup on host, expose to container; as opposed to expose all
>>> loop devices to all LXD containers and losetup in container).
>>>
>>
>> +1000 to that. So... can't we just (1) fix the manual provisioning to
>> write the file; (2) drop all other use of uninstall-agent; (3) drop the
>> lxc-specific logic in uninstall -- and then we're done?
>>
>
> For first steps, I think so. But I do think we should fix
> https://bugs.launchpad.net/juju-core/+bug/1514874 for manual machines as
> well. So:
>
>  (1) record (in agent config?) that a machine is manual
>  (2) only ever do anything uninstall-related for manual machines
>  (3) only ever do uninstall-related things if the machine actually is Dead
>  (4) drop lxc-specific logic from uninstall *when LXC support is removed*
>

Generally SGTM, but confused re (4) -- doesn't that have to be
fixed/moved/removed first earlier, so that we can switch off the suicide
behaviour in other cases without regressing?

Cheers
William
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20160509/463a76f1/attachment-0001.html>