Updating state on agent upgrade

Andrew Wilkins andrew.wilkins at canonical.com
Wed Sep 25 07:42:58 UTC 2013


Moving this to the list, in case others have input.

Cheers,
Andrew

On Wed, Sep 25, 2013 at 1:22 PM, Andrew Wilkins
<andrew.wilkins at canonical.com> wrote:

> On Wed, Sep 25, 2013 at 10:54 AM, Tim Penhey <tim.penhey at canonical.com> wrote:
>
>> On 25/09/13 08:15, William Reade wrote:
>> > On Tue, Sep 24, 2013 at 11:12 AM, Andrew Wilkins
>> > <andrew.wilkins at canonical.com> wrote:
>> >
>> >     Hi William, Tim,
>> >
>> >     I'm looking at adding a couple of new MachineJobs as requested, to
>> >     handle local-storage and firewaller. Here's what I'm thinking:
>> >
>> >     - Add a "BootstrapMachineJobs" field to
>> >     environs/cloudinit.MachineConfig; if nil, set the current default in
>> >     cmd/jujud/bootstrap.go. This will be written to the bootstrap
>> >     agent.conf, and consumed by jujud bootstrap-state.
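>> >
>> >     A rough sketch of the field (illustrative only; MachineConfig's
>> >     other fields are elided):
>> >
>> >         // In environs/cloudinit:
>> >         type MachineConfig struct {
>> >             // ...existing fields elided...
>> >
>> >             // BootstrapMachineJobs, if non-nil, lists the jobs to
>> >             // give the bootstrap machine; if nil,
>> >             // cmd/jujud/bootstrap.go applies the current default.
>> >             BootstrapMachineJobs []state.MachineJob
>> >         }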
>> >
>> > Question: machine jobs, or something more like environment capability
>> > flags? Because, really, that's why we need the custom jobs.
>> >
>> >  * we should only firewall if the environment supports that. [0]
>> >  * we should only run http storage if the environment doesn't provide it
>> > itself.
>> >  * plausibly, in the case of the null provider, we should actually not
>> > even run an environ provisioner, and not bother to implement the
>> > InstanceBroker methods.
>>
>> I think I agree with William here; these are more associated with a
>> provider than with just agent config.
>>
>> The null provider and the local provider want to run http storage, but
>> not firewallers.
>>
>> I also think it is fair that we shouldn't run an environ provisioner for
>> the null provider.  And with the addition of the precheck methods, there
>> should never be a machine in state that would require the null provider
>> to try.  Let's not start what we don't need.
>>
>
> SGTM
>
>
>> > All the above are pieces of info about the environment we could record
>> > clearly in state, and which should apply to any manager node we start in
>> > an HA context. Furthermore, across even non-management nodes, we can
>> > know it's not even worth bothering to run any non-environ provisioner if
>> > the environment can't supply new addresses; if we've got environment
>> > capabilities recorded in state, we can know what needs to be done at the
>> > time of machine creation.
>>
>> I'm not even sure that this information needs to be in state.  At least
>> for the first cut of it.
>>
>> Also, we have a problem to consider: having multiple environ provisioners
>> as they are currently defined is going to cause race conditions on
>> starting/stopping containers unless we add extra metadata to state so that
>> one provisioner doesn't try to stop a machine another is starting.
>> Actually, given the HA story, it is better to have two working
>> collaboratively than a fail-over we have to manage.
>>
>> > This does *then* imply that the existing machine-creation methods are
>> > themselves talking the wrong language: rather than specifying jobs
>> > explicitly, we should be specifying... roles, maybe? ...and combining
>> > roles with environment capabilities internally to state.
>>
>> Well, the whole point of the jobs listing for the machines IS a
>> reflection of the roles that the machine has.  We just need more
>> fine-grained roles: rather than "manage everything", we add a few.
>>
>
> I think that conceptually, "capability" makes sense for some things more
> than job/role. In particular, "has the ability to manage firewalls" seems
> better expressed as a capability than as a job. However, I don't think it's
> really worthwhile changing code to match. A capability can be expressed as
> a job, even if it's *slightly* awkward. The fact that we're giving a
> machine-agent the job "ManageFirewall" implies that it has that capability.
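>
> To make that concrete, here's a rough sketch of how the new
> capability-flavoured jobs might sit alongside the existing ones. Only the
> two new job names come from this thread; the rest is illustrative, not
> real code:
>
>     package state
>
>     type MachineJob int
>
>     const (
>         _ MachineJob = iota
>         JobHostUnits
>         JobManageEnviron
>         // Capability-flavoured additions discussed here:
>         JobHostEnvironStorage    // serve environment storage over http
>         JobManageEnvironFirewall // run the firewaller worker
>     )
>
>     // Machine is pared down to what the sketch needs.
>     type Machine struct {
>         jobs []MachineJob
>     }
>
>     // HasJob reports whether the machine has been assigned job.
>     func (m *Machine) HasJob(job MachineJob) bool {
>         for _, j := range m.jobs {
>             if j == job {
>                 return true
>             }
>         }
>         return false
>     }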
>
>
>> >     - Update agent.Conf's format-1.16 to read this, and
>>
>> FYI - not that we necessarily need this, but if you change format-1.16,
>> you also need to change the migrate method, or put this in the attribute
>> map.
>>
>
> Yep. As discussed on IRC, this could just as well be done with the
> key/value map. I kind of don't like adding required things into a
> key/value map, but on the other hand this is bootstrap-specific, and not
> something the machine-agent proper cares about. Not changing the format is
> good, too.
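>
> For illustration, a sketch of what reading bootstrap jobs out of the
> key/value map might look like (the key name, helper, and default are all
> invented):
>
>     package agent
>
>     import "strings"
>
>     // BootstrapJobsKey is a hypothetical key under which jujud
>     // bootstrap-state would find the bootstrap machine's jobs in the
>     // agent config's key/value map.
>     const BootstrapJobsKey = "BOOTSTRAP_JOBS"
>
>     // bootstrapJobs returns the job names stored in the map, falling
>     // back to an assumed default when the key is absent.
>     func bootstrapJobs(values map[string]string) []string {
>         if raw, ok := values[BootstrapJobsKey]; ok {
>             return strings.Split(raw, ",")
>         }
>         return []string{"JobManageEnviron", "JobHostUnits"}
>     }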
>
>
>> >     - Update manual bootstrap to set machine jobs, including
>> >     JobHostEnvironStorage and excluding JobManageEnvironFirewall.
>> >     - Update local provider to add JobHostEnvironStorage job.
>> >
>> >
>> > So if we have roles+capabilities, the machine agent stays nice and
>> > simple -- we just inject a machine with the "manager" role, which then
>> > gets its jobs calculated according to the environ's capabilities. But
>> > ofc we do still have to inject the capabilities at cloud-init time.
>> > Bah ;).
>>
>> We don't need to add the capabilities to the config.  We could add them
>> to the information that the machine api gets back.  However, since the
>> machine agents don't know what the environment is, it takes us back to
>> storing the roles (jobs) in state.
>>
>> >     So there's a couple of things that need to happen on upgrade:
>> >      - For local provider, add the JobHostEnvironStorage job to machine
>> >     0 if it doesn't have it.
>> >      - For non-local, non-null provider, add JobManageEnvironFirewall to
>> >     machine 0 if it doesn't have it.
>> >
>> >     Is there existing code that does this? Where's appropriate? I know
>> >     there's agent.conf migration, but I don't think that's really
>> >     appropriate for this kind of upgrade. Environ.Validate could
>> >     potentially do this, by checking old/new tools versions, connecting
>> >     to state if it's machine 0 and making necessary changes.
>> >
>> >
>> > We don't have good practice wrt upgrades. Given that the state package
>> > is not completely insulated behind the API, and so we can never
>> > guarantee that some agent or client is not going to swoop in and start
>> > changing the database, we've just been making very tentative additions
>> > and sometimes getting even those wrong. FWIW, the decision to upgrade
>> > *is* now taken behind the API, so we have some degree of control we did
>> > not before, but it's still not foolproof.
>>
>> We need a state-side server upgrade process defined.  Enough of this
>> ad-hoc jiggery-pokery.
>>
>> We also need a defined process for upgrades.  I'm not sure how close we
>> are to this right now, but I think we need something like this:
>>
>> 1) Put the API server into a state where it continues to serve requests,
>> but doesn't accept new connections.
>> 2) The tool version is updated, causing all machine agents to kill
>> themselves.
>> 3) We need some form of state-side lock to allow only one state server
>> to modify the underlying structure, and a defined process of functions
>> to run to modify the state documents to the next version. [1]
>>
>> This process needs to be defined, and stable, such that we don't delete
>> it all when the next minor branch commit is done.
>>
>> 4) When the state servers have been upgraded, we then kick off the API
>> servers, which the machine agents can then connect to.
>>
>
> This sounds sane to me.
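>
> Roughly, I'd expect the sequence to hang together like this. Every name
> below is invented for illustration; it's a sketch of the process, not
> real code:
>
>     package upgrades
>
>     // APIServer keeps serving existing requests while refusing new
>     // connections during the upgrade (step 1).
>     type APIServer interface {
>         StopAccepting()
>         StartAccepting()
>     }
>
>     // Lock is the state-side lock held by the one state server that
>     // performs the schema upgrade (step 3).
>     type Lock interface {
>         Release()
>     }
>
>     // State is the minimal surface this sketch needs.
>     type State interface {
>         SetAgentVersion(v string) error
>         AcquireUpgradeLock() (Lock, error)
>         ApplySchemaUpgrades() error
>     }
>
>     // UpgradeStateServers walks steps 1-4 above.
>     func UpgradeStateServers(srv APIServer, st State, v string) error {
>         // 1) serve existing requests, refuse new connections.
>         srv.StopAccepting()
>         // 2) bump the tool version; machine agents kill themselves
>         // and restart with the new binaries.
>         if err := st.SetAgentVersion(v); err != nil {
>             return err
>         }
>         // 3) exactly one state server applies the upgrade steps.
>         lock, err := st.AcquireUpgradeLock()
>         if err != nil {
>             return err
>         }
>         defer lock.Release()
>         if err := st.ApplySchemaUpgrades(); err != nil {
>             return err
>         }
>         // 4) let the machine agents reconnect.
>         srv.StartAccepting()
>         return nil
>     }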
>
>
>> > I think it's probably simplest to do a one-shot post-upgrade job-update
>> > operation in the machine agents that have jobs which are changing
>> > meaning (for versions 1.15/1.16). The machine agents each have control
>> > over their machine's documents, and they're the only things that react
>> > interestingly to machine jobs regardless, so they're perfectly suited to
>> > updating the documents; and the machine agents *are* where we apply the
>> > hacks today, so it's quite convenient to have the same component make
>> > the appropriate fixes to state before being retired for 1.17 and onwards.
>>
>> See above, and I don't think we should be retiring the code too soon.
>>
>> > So maybe: add an UpdateJobs API call and invoke it somewhere before
>> > MachineAgent.APIWorker gets the Jobs we're expected to run, and
>> > schedule the code to be deleted after 1.16; old code will still read the
>> > jobs it expects, new code won't run until the additions have been made,
>> > and everyone will be happy. I think.
>> >
>> > BTW, the idea of Environ.Validate connecting to state breaks my brain a
>> > little, I'd very strongly prefer not to do that.
>> >
>> > Not sure if all that is helpful, or whether it just obscures things.
>> > Ping me in the morning and we can talk if necessary.
>>
>> I also have something that will need to be installed on all machines as
>> part of the upgrade procedure.
>>
>> New installs will have the cpu-checker package installed, and will have
>> done some rudimentary checks when the machine agent has come up; however,
>> we need a place to add new packages that are required to be installed, or
>> new apt sources defined (like the cloud-tools archive).
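>>
>> Something like a per-version list of machine provisioning additions
>> might do; a fragment only, with invented names:
>>
>>     // MachineUpgradeStep describes what has to be installed on every
>>     // machine when upgrading to TargetVersion.
>>     type MachineUpgradeStep struct {
>>         TargetVersion string
>>         AptSources    []string // e.g. the cloud-tools archive
>>         Packages      []string // e.g. "cpu-checker"
>>     }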
>>
>> Perhaps this whole piece of work fits under the "major version upgrades"
>> headline, as once we have this process and procedure in place, major
>> versions just become a number we may change periodically as any version
>> may update the state document structure.
>>
>
> Yep. After I sent the email yesterday, I began thinking that this upgrade
> functionality is going to be exactly what's needed for updating the state
> schema. I've got a few things to finish off (authenticated httpstorage is
> half done; I still need to document manual provisioning). Pending
> cloud-installer work, I can start looking into this in a bit more detail.
>
> Vague ideas at the moment:
> - Add a version to the state database (I suppose there'd need to be some
> kind of metadata document collection), to track required schema changes.
> - Add a state/upgrade package, which keeps a full history of the
> point-to-point schema updates required (see the sketch after this list).
> We iterate through version changes, applying upgrade steps one at a time.
> Everything must be done in a transaction, naturally.
> - One API server will (with a global lock):
>    * First upgrade the state database. All other code can be written to
> assume the current database schema.
>    * Invoke an EnvironUpgrader interface method, optionally implemented by
> an Environ. This interface defines a method for upgrading some
> provider-specific aspects of the environment (e.g. going through and adding
> jobs to all of the state-server machines). The EnvironUpgrader will
> similarly need to keep track of versions, and point-to-point upgrades.
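>
> To make the state/upgrade idea concrete, a sketch (nothing here is real
> juju code; all names are invented):
>
>     package upgrade
>
>     import "fmt"
>
>     // State is the minimal surface the upgrade steps need.
>     type State interface {
>         SchemaVersion() (int, error)
>         SetSchemaVersion(v int) error // same txn as the step's changes
>     }
>
>     // Step upgrades the schema from version From to From+1.
>     type Step struct {
>         From  int
>         Apply func(st State) error
>     }
>
>     // steps is the full point-to-point history of schema changes,
>     // registered in order with no gaps.
>     var steps []Step
>
>     // Run applies every step between the database's current version
>     // and target, one version at a time.
>     func Run(st State, target int) error {
>         current, err := st.SchemaVersion()
>         if err != nil {
>             return err
>         }
>         for v := current; v < target; v++ {
>             if v >= len(steps) || steps[v].From != v {
>                 return fmt.Errorf("no step from schema version %d", v)
>             }
>             if err := steps[v].Apply(st); err != nil {
>                 return err
>             }
>             if err := st.SetSchemaVersion(v + 1); err != nil {
>                 return err
>             }
>         }
>         return nil
>     }
>
> A shape like this would also let a 2.2 install walk straight to 2.10 (see
> the footnote below): steps are applied one version at a time internally,
> without the user having to upgrade through each public release.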
>
>> [1] I think the stance of only supporting upgrades to the next public
>> release is crackful.  Consider a user that has a working juju install
>> and has not needed to poke it for ages.  They are on 2.2.0.  They read
>> the latest juju docs that show new amazing juju-gui stuff that requires
>> 2.10.0.  We should not make them go 2.2 -> 2.4 -> 2.6 -> 2.8 -> 2.10 as
>> that is clearly a terrible end user experience.
>>
>
> From a user POV, that sounds pretty horrible.
>