Deployment Oversight

Merlijn Sebrechts merlijn.sebrechts at gmail.com
Mon Nov 28 17:21:24 UTC 2016


I wouldn't want to be in your shoes in a pre-snappy world... I'm amazed
that Ubuntu still works so well in the ocean.

We found a way to mitigate most of the issues: run everything exclusively
in LXC containers. This gave us the same standard cloud image that all
these Charms are tested on. This approach had two issues:

1. Network: LXC containers can't connect to containers on other hosts and
can't resolve each other's hostnames. DNS might be a bigger issue than you
think: not a single big data framework can handle unresolvable hostnames.
2. Reliability:

- We experienced many crashed state servers on 1.x manual environments.

- Random failure of the LXC template download[1]. (This problem reappeared
one week after the issue was closed. We didn't reopen it because we had
started moving to MAAS.)

- Random failure of installing the LXC packages on the host. At first I
thought this was due to outdated host images, but the problem was
intermittent, which doesn't make much sense.

Fixes for issue 1 were hard to create and didn't work reliably.

Getting to the point where LXC was successfully installed and working was
hard and unreliable; about half of our deploys failed. However, once an
environment got there, it was very pleasant to work with.
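To make the hostname problem concrete: one class of workaround is syncing a shared hosts file into every container so names resolve even without working DNS. A minimal sketch; all container names and addresses below are invented for the example, and the file would then have to be pushed into each container's /etc/hosts:

```shell
#!/bin/sh
# Sketch: build up a hosts file of "IP hostname" pairs so containers on
# different hosts can resolve each other. Names/IPs are made up.
HOSTS_FILE="${HOSTS_FILE:-./hosts.example}"

add_entry() {
    ip="$1"; name="$2"
    # Only append if the hostname is not already listed.
    grep -q "[[:space:]]$name\$" "$HOSTS_FILE" 2>/dev/null \
        || printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
}

add_entry 10.0.3.11 hadoop-master-0
add_entry 10.0.3.12 hadoop-slave-0
add_entry 10.0.3.11 hadoop-master-0   # duplicate, ignored
```

This is exactly the sort of glue that was hard to keep reliable across hosts.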

What I suggest is that you stop trying to make Juju work in 'the ocean' and
focus the manual-environment efforts on one thing: a multi-machine LXD
provider. *Fix the LXD networking and DNS issues and tell everyone to only
use LXD containers in a manual environment.* For many people, the manual
provider is their entry point into the Juju world, and running
everything in LXD containers is a very good way to start.
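For illustration, a dry-run sketch of what "manual provider on LXD containers" amounts to today. The container names and the .lxd DNS suffix are assumptions (the suffix only resolves if the host resolver is pointed at LXD's dnsmasq), so this just prints the commands rather than running them:

```shell
#!/bin/sh
# Dry-run: print the commands that would launch LXD containers and
# register them with Juju's manual provider over SSH. Nothing here is
# executed against lxc or juju.
provision_cmds() {
    for m in "$@"; do
        echo "lxc launch ubuntu:xenial $m"
        echo "juju add-machine ssh:ubuntu@$m.lxd"
    done
}

provision_cmds machine-0 machine-1
```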

[1] https://bugs.launchpad.net/juju-core/1.25/+bug/1610880


2016-11-28 16:55 GMT+01:00 Mark Shuttleworth <mark at ubuntu.com>:

>
> Super difficult to document 'the ocean'; there will always be fraying at
> the edges where what works on clouds fails in the manual case.
>
> Mark
>
>
> On 28/11/16 15:49, Rick Harding wrote:
>
> That's very true on the items that are different. I wonder if we could
> work with the CPC team and note the promises that are implicitly assumed
> when using cloud images, so that it'd be easy to build a "patch" for
> manually provisioned machines. If we know which packages and configuration
> are present on our images, it should be doable to provide some sort of
> "manual-init" script that tries to bring things in line.
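A rough sketch of what such a "manual-init" pre-flight check might look like. The package checks come from issues mentioned elsewhere in this thread; everything else is assumption, and it only reports findings rather than fixing anything:

```shell
#!/bin/sh
# Hypothetical "manual-init" pre-flight: report whether a hand-installed
# machine provides what charms tested on cloud images tend to assume.
check() { "$@" >/dev/null 2>&1 && echo "ok: $*" || echo "MISSING: $*"; }

check python -c 'import yaml'   # python-yaml, normally pulled in by cloud-init
check dpkg -s curl              # commonly assumed tooling
# A noexec /tmp breaks layers that build there.
if mount | grep -E '[[:space:]]/tmp[[:space:]]' | grep -q noexec; then
    echo "WARN: /tmp is mounted noexec"
else
    echo "ok: /tmp allows exec"
fi
```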
>
> Merlijn, do you have any notes on the changes that you were suffering
> through? Was there anything that didn't fit the "your own Ubuntu
> install vs. a CPC-certified image" difference?
>
> On Sun, Nov 27, 2016 at 1:26 AM John Meinel <john at arbash-meinel.com>
> wrote:
>
>> From what I can tell, there are a number of places where these manual
>> machines differ from our "standard" install. I think the charms can be
>> written defensively around this, but it's why you're running into more
>> issues than you normally would.
>>
>>    1. 'noexec' for /tmp. I've heard of this, but since layer-ruby wants to
>>    build something, where *should* it build? Maybe we could do
>>    something in /var, but it does seem like the intermediate files are all
>>    temporary (which is why someone picked /tmp). I don't have any details
>>    on layer-ruby.
>>    2. python-yaml not installed. Most of the places where we run Juju
>>    use 'cloud-init' to set up the machine for the first time, and
>>    I'm pretty sure cloud-init has a dependency on python-yaml (because
>>    that's how some of the cloud-init config is written). Again, charms can
>>    just include python-yaml as a dependency; I'm guessing they didn't
>>    notice because everywhere else they tested it was already there.
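Both points can be handled defensively in a charm; a hedged sketch, where the /var/tmp fallback is an assumption (not a Juju convention) and the install step is only printed:

```shell
#!/bin/sh
# Defensive handling for the two differences above.

# 1. Build somewhere exec-capable when /tmp is mounted noexec.
pick_build_dir() {
    if mount | grep -E '[[:space:]]/tmp[[:space:]]' | grep -q noexec; then
        echo /var/tmp     # assumption: exec is allowed here
    else
        echo /tmp
    fi
}

# 2. Don't rely on cloud-init having installed python-yaml.
have_yaml() { python -c 'import yaml' >/dev/null 2>&1; }

TMPDIR="$(pick_build_dir)"; export TMPDIR
have_yaml || echo "would run: apt-get install -y python-yaml"
```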
>>
>> John
>> =:->
>>
>>
>> On Sun, Nov 27, 2016 at 4:45 AM, Merlijn Sebrechts <
>> merlijn.sebrechts at gmail.com> wrote:
>>
>> I feel you, James
>>
>> We've been battling weird issues and compatibility problems with the
>> manual provider on private infra for the past year. Just finding out where
>> a problem lies requires diving deep into the internals of Juju and the
>> Charms. In the end, we patched our own servers heavily and had to patch
>> ~30% of the Charms we tried. This slowed us down so much that we gave
>> up and moved to MAAS. We're having far fewer problems now.
>>
>>
>>
>> 2016-11-27 0:03 GMT+01:00 James Beedy <jamesbeedy at gmail.com>:
>>
>> Was a bit flustered earlier when I sent off this email, I've looked a bit
>> closer at each of the individual problems, thought I would report back with
>> my findings.
>>
>> 1. Job for systemd-sysctl.service failed because the control process
>> exited
>>     - This is an error I'm seeing when installing juju (not sure if this
>> is adding to any other issues or not), didn't look into it much, but filed
>> a bug here -> https://bugs.launchpad.net/juju/+bug/1645025
>>
>> 2. ERROR juju.state database.go:231 using unknown collection
>> "remoteApplications"
>>     - This seems to exist only in 2.0.1, installed from the juju/stable
>> PPA; when I reverted to 2.0.0, it went away.
>>
>> Charm/Layer Issues
>>
>> 3. Problem with Ruby: ["env: './configure': Permission denied"]
>>     - Both of my charms were using layer-ruby. When deployed to LXD
>> and EC2 I don't get this error, but this private/dedicated infra doesn't
>> seem to like running `./configure` (it could also be permissions on /tmp,
>> but I tried moving the unpacking and configuring to another dir and still
>> got this error).
>>     - Filed a bug here -> https://github.com/battlemidget/juju-layer-ruby/issues/12
>>     - Removing layer-ruby was my fix here; this allowed my charms to
>> deploy without error.
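One quick way to tell whether a noexec mount (rather than file ownership or mode) is what's rejecting ./configure is to probe the directory directly; a sketch:

```shell
#!/bin/sh
# Probe whether scripts can be executed from a given directory.
# "Permission denied" on an executable file usually points at noexec.
probe_exec() {
    d="$1"
    printf '#!/bin/sh\necho ok\n' > "$d/probe.sh"
    chmod +x "$d/probe.sh"
    "$d/probe.sh" 2>/dev/null || echo "exec blocked in $d"
    rm -f "$d/probe.sh"
}

probe_exec /tmp
```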
>>
>> 4. Elasticsearch
>>     - It seems the es charm can't find the yaml module (possibly a
>> Python 3.5 thing?)
>>     - Filed a bug here -> https://bugs.launchpad.net/charms/+source/elasticsearch/+bug/1645043
>>     - My workaround, just to get the app deployed, was to deploy
>> elasticsearch to an LXD container on one of my hosts. Of course this isn't
>> an answer for anything more than a POC, but it allowed me to
>> deploy/troubleshoot the rest of my bundle.
>>
>>
>> Aside from the remaining elasticsearch issue, I was able to get my stack
>> deployed -> http://paste.ubuntu.com/23540146/
>>
>> My earlier baffled and confused cry for help now seems to revolve just
>> around getting ES to deploy.
>>
>> My apologies for reaching out in such a way earlier, before diving into
>> what was going on; hopefully we can work out what's going on with my infra
>> <-> ES.
>>
>> Thanks
>>
>> --
>> Juju mailing list
>> Juju at lists.ubuntu.com
>> Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju
>>
>>
>>
>>
>>
>> --
>> Juju-dev mailing list
>> Juju-dev at lists.ubuntu.com
>> Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>
>
>
>
>